A Mitochondrial Genome of Rhyparochromidae (Hemiptera: Heteroptera) and a Comparative Analysis of Related Mitochondrial Genomes

The Rhyparochromidae, the largest family of Lygaeoidea, encompasses more than 1,850 described species, but no mitochondrial genome has been sequenced to date. Here we describe the first mitochondrial genome for Rhyparochromidae: a complete mitochondrial genome of Panaorus albomaculatus (Scott, 1874). This mitochondrial genome is comprised of 16,345 bp, and contains the expected 37 genes and control region. The majority of the control region is made up of a large tandem-repeat region, which has a novel pattern not previously observed in other insects. The tandem-repeats region of P. albomaculatus consists of 53 tandem duplications (including one partial repeat), which is the largest number of tandem repeats among all the known insect mitochondrial genomes. Slipped-strand mispairing during replication is likely to have generated this novel pattern of tandem repeats. Comparative analysis of tRNA gene families in sequenced Pentatomomorpha and Lygaeoidea species shows that the pattern of nucleotide conservation is markedly higher on the J-strand. Phylogenetic reconstruction based on mitochondrial genomes suggests that Rhyparochromidae is not the sister group to all the remaining Lygaeoidea, and supports the monophyly of Lygaeoidea.


Results and Discussion
Genome organization and structure. The mitochondrial genome of P. albomaculatus contains 16,345 bp (GenBank Accession Number: KX216853; Fig. 1). The gene order of the 37 genes (including 13 PCGs, 22 tRNAs and two rRNAs) as well as the non-coding region (control region) is the same as observed in most other mt-genomes of true bugs 25 (Table 1). The genome is relatively compact with gene overlaps at 16 gene junctions, involving a total of 75 bp. The longest overlap occurs between tRNA-Ser and ND1 and involves 20 bp. Two of the overlaps (between ATP8/ATP6 and ND4L/ND4) consist of the same seven nucleotides (ATGATAA). There are also 82 nucleotides in 4 intergenic spacers (in addition to the control region), which range in size from 1 to 77 bp.
Protein coding genes. In the mt-genome of P. albomaculatus, eleven of the thirteen protein-coding genes initiate with ATN as the start codon (five start with ATT, four start with ATG, one starts with ATA, and one starts with ATC) ( Table 1). The COI gene initiates with a non-traditional start codon, TTG, as previously observed in other true bugs 25,26 . The ND4L gene has the unusual start codon GTG, as previously reported in assassin bugs 27 (Hemiptera: Reduviidae). Most protein coding genes end with a complete termination codon. Eight genes (ND2, COI, ATP8, ATP6, ND5, ND4L, ND6 and ND1) terminate with TAA, and two genes (ND3 and CytB) terminate with TAG. The remaining three PCGs (COII, COIII and ND4) are terminated with an incomplete stop codon T.
In Pentatomomorpha mt-genomes, most PCGs use the standard start codon (ATN), but we also observed other initiation codons of ATCA (ND2), CTG (ND1), GTG (COII, ND1, ND4L and ND6), and TTG (ATP8, COI, COII, ND1, ND4L, ND5 and ND6; see Supplementary Table S1). Many PCGs use multiple start codons, but three PCGs always initiate with only one start codon (COI with TTG, COIII and ND4 with ATG). Furthermore, several PCGs initiate with a start codon only found within Lygaeoidea mt-genomes (e.g., ATP6 start with ATT only in Yemmalysus parallelus). Most PCGs stop with termination codons TAA/TAG, but truncated termination codons T/TA are also used in Pentatomomorpha mt-genomes. The phenomenon of incomplete stop codons is common in insect mt-genomes and it is likely that these truncated stop codons are completed by posttranscriptional polyadenylation 28 .
We compared the rate of nonsynonymous substitutions (Ka), the rate of synonymous substitutions (Ks), and the ratio of Ka/Ks for each PCG among the 13 PCGs in Pentatomomorpha and Lygaeoidea mt-genomes (see Supplementary Figure S1). The results showed that the evolutionary patterns of 13 PCGs were highly consistent between Pentatomomorpha and Lygaeoidea, and shared many common features: (1) ATP8 has the highest evolutionary rate and can be used to analyze intraspecific relationships 29 , while COI appears to have the lowest evolutionary rate and has been used for DNA barcoding markers 30 . (2) The Ka/Ks ratios for all PCGs are less than 1, indicating that these genes are likely evolving primarily under the purifying selection 31 . (3) The uniformly low evolutionary rates and low Ka/Ks ratios (Ka/Ks < 0.2) for four genes (COI, COII, COIII and CytB) indicate strong purifying selection and evolutionary constraints in cytochrome c oxidase and cytochrome b 32 . (4) Negative correlations have been detected between the Ka/Ks ratios and the G + C content of each PCG in Pentatomomorpha (R 2 = 0.918) and Lygaeoidea (R 2 = 0. 891) mt-genomes (see Supplementary Figure S2), suggesting that the variation of G + C content probably causes the different evolutionary patterns among genes 22 . Additionally, most PCGs of Lygaeoidea mt-genomes had lower evolutionary rates than Pentatomomorpha.
Nucleotide composition and codon usage. The nucleotide composition of the mt-genome of P.
albomaculatus is significantly biased toward A/T. The overall A + T content of the J-strand is 76% (A = 44.4%, T = 31.6%, C = 14.5%, G = 9.5%; see Supplementary Table S2). This bias in A + T content falls within the known range for hemipteran mt-genomes (65.7% in Bemisia afer to 86.3% in Aleurodicus dugesii) 29 . The A + T content ranges from 75.1% for the PCGs, to 78.7% for the rRNAs. Among the PCGs, the lowest A + T content is 68.8% in COI, whereas the highest A + T content is 84% in ATP8. The nucleotide skew statistics 33 for the mt-genome of P. albomaculatus reveal that the J-strand PCGs are AT-skewed and CG-skewed, whereas the N-strand PCGs are GC-skewed and TA-skewed. The observed strand bias is likely related to asymmetric mutation processes during replication 34 .

Comparison of tRNAs in Pentatomomorpha and Lygaeoidea mt-genomes.
There are 22 tRNA genes in the mt-genome P. albomaculatus mt-genomes, as observed in other arthropod mt-geneomes. These genes range in length from 62 to 71 bp. Among these tRNAs, the tRNA-Ser (GCU) is the only one that does not fold into the typical cloverleaf secondary structure; the dihydrouridine (DHU) stem simply forms a loop (Fig. 2). Moreover, the predicted tRNA-Ser (GCU) anticodon stem is longer (9 bp vs. the normal 5 bp) and contains an unpaired nucleotide, as seen in many other insect mt-genomes 35 . A total of 20 unmatched G-U pairs exist in P. albomaculatus tRNA secondary structures. Of these, 16 mismatches are concentrated in 7 tRNAs that are encoded on the N-strand. These mismatches are common in arthropod mt-genomes; RNA editing processes are likely involved in correcting the mismatches 36 .
Most tRNA genes encoded on the J-strand are more conserved than those encoded on the N-strand, based on comparisons of aligned genes in Pentatomomorpha and Lygaeoidea (Fig. 2). Furthermore, some tRNAs (tRNA-Gln, tRNA-Gly, and tRNA-Met) within Lygaeoidea are much more conserved than they are within Pentatomomorpha, with changes mainly on the amino acid acceptor and TΨ C stems. The anticodon stem and loop, and the DHU stem, are extremely conserved in most tRNAs within Pentatomomorpha, whereas most other tRNA regions have a high level of variation (the exceptions are the highly conserved TΨ C stem of tRNA-Leu and acceptor stem of tRNA-Lys). The stems of each tRNA in Lygaeoidea mtDNA are more conserved than the remaining loops (TΨ C, DHU and variable loops), except for several amino acid acceptor arms (tRNA-Ala, tRNA-Asp, tRNA-Cys, and tRNA-Val) and TΨ C arms (tRNA-His, tRNA-Phe and tRNA-Pro). Most nucleotide substitutions are restricted to those loops (TΨ C, DHU, and variable loops), which usually accompany insertion-deletion polymorphisms. The presence of indels at different taxonomic levels supports potential phylogenetic value of tRNA sequences in the study of insect phylogenetic relationships, especially when secondary structures are taken into account 21,37 .
We determined the percentage of identical nucleotides (INP%) for each tRNA family in Pentatomomorpha and Lygaeoidea mt-genomes, respectively ( Ribosomal RNAs. We assumed that the rRNA genes in the mt-genome of P. albomaculatus start and end at the boundaries of the flanking genes, as is the case in most other insect mt-genomes 22,38 . The large subunit (16S) rRNA gene is located between the tRNA-Val and tRNA-Leu (TAG) genes. The small subunit (12S) rRNA is located between the tRNA-Val gene and the control region. The 16S rRNA gene is 1251 bp with an A + T content of 79.9%, and the 12S rRNA gene is 787 bp with an A + T content of 76.8%. The secondary structure of P. albomaculatus 16S rRNA contains six structural domains (domain III is absent as in other arthropods) and 45 helices (Fig. 4), whereas the secondary structure of 12S rRNA contains three domains and 27 helices (Fig. 5).
Based on the alignment of sequenced species between Heteroptera and Lygaeoidea, domains IV and V of the large subunit (16S) rRNA are more conserved than other three domains (I, II, and VI). Six helices (H563, H1775, H2064, H2507, H2547, and H2588) are highly conserved in both sequence (with 2-7 nucleotide substitutions) and secondary structure among sequenced heteropterans. In the variable domains II and VI, we found no conserved helices within heteropterans, but helices H579, H822, and the terminal loop of H2646 are highly conserved with 0-3 nucleotide substitutions in Lygaeoidea mtDNA. In domain IV, helices from H1830 to H1906 are highly conserved among Lygaeoidea. Our proposed secondary structure of helix H1835 is based on the model proposed by Buckley et al. 39 and Cannone et al. 40 . In domain V, most helices are conserved among insect mtDNAs, except for helices H2077 and H2347, which are highly divergent in both length and secondary structure 39 . Helix H2077 has no apparent conserved motifs, but it contains an 18-paired-base stem and two loops in P. albomaculatus. Helix Scientific RepoRts | 6:35175 | DOI: 10.1038/srep35175 H2347 is composed of a short stem of the terminal 3 paired bases, which is similar to the structure proposed for Chauliops fallax (Hemiptera) 10 and Zygaena sarpedon lusitanica (Lepidoptera) 41 .
Domain III of the 12S rRNA is among the most conserved regions among heteropterans in terms of sequence and secondary structure. Domains I and II are much less conserved than domain III, as most helices are highly variable among the studied taxa and are difficult to align. The primary exceptions are the distal sections of H511 and H769, which are highly conserved across insects 42,43 . In contrast to H511 and H769, there are considerable differences in the sequence and length of H47 across insects 43,44 , and alignments of this region across insects are ambiguous (partly because of its high AT content). We could not detect or predict a consistent secondary structure for H47 across the available insect 12S rRNA sequences. In P. albomaculatus, a possible secondary structure of H47 includes a long stem, two internal loops, and 10 bp terminal loop, which is similar to the model proposed by Niehuis et al. 42 . The nucleotide sequence between helices H577 and H673 suggests no obvious secondary folding, as in other heteropterans 10,27 . In domain III, there are several possible secondary structures for helices H1047, H1068, H1074, and H1113; this complex region likely contains several non-canonical base pairs 43,44 . Nonetheless, the sequence and predicted secondary structure of this region are highly conserved within Lygaeoidea. Helix H1399 is the most conserved helix of 12S rRNA among true bugs, except for its terminal loop.  Non-coding regions. There are only five non-coding regions in the mt-genome of P. albomaculatus, and most of them are shorter than 10 bp (Table 1). The largest non-coding region (1,853 bp) is located between the 12S rRNA and the tRNA-Ile (I)-tRNA-Gln (Q)-tRNA-Met (M) gene cluster (Fig. 1), as frequently found in other insects 15,43 . This region is considered to be the control region as it contains both an origin of transcription and replication 9,45 .
The control region is highly variable in length at all taxonomic scales within insects; most of the length variation is a result of variable numbers of tandem repeats 13,16 . Tandemly repeated sequences are commonly reported from the control region of most insect orders 46 , although both the size and copy number of tandem repeat units differ greatly across groups 29 . The repeat units of insect mt-genomes range from 8 bp to more than 700 bp, and the copy number of tandem repeats ranges from two to nearly 50 16,47 . However, the control region of P. albomaculatus has a novel feature not previously observed in other insect mt-genomes. The majority of the control region is made up of a 949 bp tandem-repeat region, which contains 53 tandem duplications including 52 repeat units (of 18 bp each) and a partial copy of the repeat (13 bp) (Fig. 6). This is the largest number of tandem repeats reported in insect mt-genomes to date. In addition, the nucleotide composition of this tandemly repeated region on the J-strand is strongly biased in composition toward A (A = 54.3%, T = 25.5%, C = 5.5%, G = 14.7%), which results in a higher AT-skew value for control region compared to the other genes of P. albomaculatus mt-genome (see Supplementary Table S2).
The remainder of the control region can be divided into five sections (Fig. 6A): (1) the 5′-end of the control region contains a 321-bp region with 32.1% G + C content (higher than the average for the entire mt-genome); (2) a 8-bp poly-C structure is located between two microsatellite-like repeat regions ((AATTT) 3 and (TA) 5 , respectively); (3) a 103-bp region with very high A + T composition (90.3%), as also found in the control region of other insects 48,49 ; (4) a 8-bp poly-A structure is located 34-bp upstream of the tandem repeat region; and (5) a potential stem-loop secondary structure, with a 'TATA' element at the 5′ end (see Supplementary Figure S3). Our comparative analyses of control regions among sequenced Pentatomomorpha mt-genomes revealed four distinct types of control region organization, which are summarized in Supplementary Figure S4. In some species of Pentatomomorpha, the control region contains four different motifs as summarized for arthropods 50 : tandem repeats, a poly-T stretch, a subregion of even higher A + T content, and stem-loop structures (e.g., Chauliops fallax in Supplementary Figure S4A). However, in most mt-genomes of Pentatomomorpha, tandem repeats occupy the majority of the control region, and two different types (see Supplementary Figure S4B) or identical tandem repeat units (Supplementary Figure S4C) are interrupted by a non-coding region 23,51 . A few control regions do not contain all four motifs or have only one; for example, Kleidocerys resedae resedae 13 has just a stem-loop structure (see Supplementary Figure S4D). These various elements of the control region can provide valuable phylogenetic information within specific groups 29,52,53 . Tandem repeats. The origin and persistence of tandem repeats is thought to involve slipped-strand mispairing during mtDNA replication 15,19 . Other mechanisms for repetitive sequences generation, such as recombination, are considered absent or rare in animal mt-genomes 54 . In P. albomaculatus, the repeat units consist of four different patterns with two variable sites (Fig. 6B): Type I (18 copies), Type II (15 copies), Type III (15 copies), and Type IV (4 copies). These four types are indicated by different colors in Fig. 6A.
These four types of repeat units occur throughout the tandem repeats region, with the following patterns ( Fig. 6): (1) A partial copy Type I (GAATTAGATTAAA) is located at the 3′-end of the tandem repeats region, so Type I repeats occur at both the start and end of the repeat unit. (2) Most copies of Type II are located immediately downstream of Type I repeats, and all copies of Type IV are followed by Type II repeats. (3) Type III repeats are located between Type II and Type I repeats. (4) There are only four copies of Type IV repeats, but the other three types are all duplicated at least 15 times.
Considering these patterns of repeats, we suggest that the ancestral sequence consisted of a Type I-Type II-Type III-Type I cluster. The current distribution can then be generated from this ancestral sequence via replication slippage. Type IV repeats may be a product of slipped-strand mispairing from deletion of the unpaired section between Type II and Type I (Fig. 7). Duplication rates have been shown to be higher than deletion rates in several previous studies 19,55 , which likely accounts for the lower number of Type IV repeats compared to the other three types.
There is a stem-loop structure, which is generally thought to be involved in the origin of replication 15 , located 34-bp downstream of the tandem repeats region (see Supplementary Figure S3). The significant number of tandem repeats adjacent to the origin of replication may indicate that they promote their own copy number amplification.
Phylogenetic analyses. We conducted phylogenetic analyses on the mt-genomes of 37 heteropteran species and 3 outgroup hemipteran insect species (Acyrthosiphon pisum, Sivaloka damnosa, and Lycorma delicatula). Maximum likelihood (ML) and Bayesian inference (BI) analyses yielded fully resolved trees with identical topologies (Fig. 8). The resulting phylogeny of Heteropteran infraordinal relationships is similar to that of Li et al. 56 . We found Enicocephalomorpha to be the sister group to all the remaining heteropterans, with high support values from both BI and ML analyses. In Fig. 8, all infraorders are monophyletic except Cimicomorpha. The non-monophyly of Cimicomorpha is incongruent with previous studies 57, 58 ; a more thorough taxon sampling of taxa within Cimicomorpha is needed to test these competing hypotheses. In Pentatomomorpha, our results support the groupings (Aradoidea + (Pentatomoidea + (Lygaeoidea + (Pyrrhocoroidea + Coreoidea)))), which have been widely supported in previous studies 1, 10,23 . The monophyly of each superfamily within Pentatomomorpha is strongly supported in our analyses. The taxonomic status of Rhyparochromidae within Lygaeoidea is a highly contentious issue 1,5,6,8 . In our analyses, both BI and ML analyses provide strong support that Rhyparochromidae is not the sister group to the remaining Lygaeoidea, which is consistent with previous studies based on molecular data [4][5][6] . We also recovered the groups (Colobathristidae + (Berytidae + ((Lygaeidae + Malcidae) + (Geocoridae + Rhyparochromidae)))), but without strong support from either BI or ML analyses. We suggest that more thorough sampling of taxa will be needed to adequately resolve the family relationships within Lygaeoidea.

Conclusions
We present the first description of a complete mitochondrial genome from a species of Rhyparochromidae, a large and diverse group within Heteroptera 3 . Comparative analysis of tRNA gene families showed that the level of conservation of many tRNAs was not consistent between sequenced Pentatomomorpha and Lygaeoidea mt-genomes, and most tRNA genes encoded on the J-strand are more conserved than those encoded on the N-strand. The control region of P. albomaculatus has novel patterns and large numbers of tandem repeats, including 53 tandem duplications, which can be explained by a replication slippage model. Phylogenomic analysis based  on mt-genomes supported findings from nuclear genes that Rhyparochromidae is not the sister group to all the remaining Lygaeoidea, in contrast to the conclusions from morphological studies. This study should facilitate additional studies on the evolution and phylogeny of Rhyparochromidae, and on the evolution of tandem repeats within control regions of mt-genomes.

Materials and Methods
Specimen collection. We collected adult specimens of Panaorus albomaculatus from Nankai University (39°6.016N, 117°9.977E), Tianjin, China, on July 7th, 2008. We preserved specimens in 95% ethanol and stored them at − 20 °C until tissues were used for DNA extraction. Voucher specimens are deposited in the Insect Molecular Systematic Lab, Institute of Entomology, College of Life Sciences, Nankai University, Tianjin, China.
PCR amplification and sequencing. We extracted genomic DNA from thoracic muscle tissue using the CTAB-based method 59 . We generated the entire mt-genome of Panaorus albomaculatus by amplification of four overlapping PCR fragments. We designed primers from the sequenced fragments, modifying primers from previous studies 13 (see Supplementary Table S4). We performed PCR reactions using TaKaRa LA Taq under the following conditions: 1 min initial denaturation at 94 °C, followed by 30 cycles of 20 s at 94 °C, 1 min at 50 °C, and 2-8 min at 68 °C, and a final elongation for 10 min at 72 °C. We separated PCR products using electrophoresis in 1% agarose gels, purified the fragments, and then sequenced both strands of each fragments using a ABI 3730XL capillary sequencer with the BigDye Terminator Sequencing Kit (Applied Bio Systems).
Sequence analysis and annotation. We proofread the raw sequence files and then aligned them into contigs using BioEdit version 7.0.5.2 60 . We identified PCGs using ORF Finder as implemented at the NCBI website (http://www.ncbi.nlm.nih.gov/gorf/gorf.html), using invertebrate mitochondrial genetic codes. We compared the boundaries of our predicted PCGs and rRNAs with published insect mt-genomes, and the alignments were produced with CLUSTAL X version 1.83 61 . We identified tRNA genes using tRNAscan-SE version 1.21 62 , with the invertebrate mitochondrial codon predictors, and a cove cut-off score of 5. A few of the tRNA genes that could not be detected using tRNAscan-SE were determined through comparing the sequences to other heteropterans. Our analyses of nucleotide composition and codon usage were conducted using MEGA 6.0 63 . We calculated nucleotide skew statistics using the following formulas: AT skew = [A− T]/[A + T] and GC skew = [G− C]/[G + C] 33 . Tandem repeats were identified using the Tandem Repeats Finder server (http://tandem.bu.edu/trf/trf.html) 64 , and the stem-loop structure was inferred using the mfold web server (http://mfold.rna.albany.edu/) 65 .

Phylogenetic analyses.
We conducted phylogenetic analyses on the 40 complete or nearly complete mt-genomes of true bugs that have sequences in GenBank, as well as three species from Sternorrhyncha and Auchenorrhyncha which we used as our outgroup (see Supplementary Table S5). We aligned PCGs based on amino acid sequence alignment using MUSCLE in MEGA version 6.0 63 . Our alignments of individual genes excluded the stop codon and the third codon. We used GPU MrBayes 66 for Bayesian inference and raxmlGUI 1.5 67 for ML phylogenetic analysis. All analyses were based on the GTR + I + Γ model of sequence evolution, as selected using on Modeltest version 3.7 68 . We conducted two simultaneous runs of 10,000,000 generations each for the Bayesian analyses, with a burn-in of 2,500,000 generations and sampling every 100 generations. In ML analyses, we conducted 1000 bootstrap replicates with thorough ML searches.