Application of RNA-seq for mitogenome reconstruction, and reconsideration of long-branch artifacts in Hemiptera phylogeny

Hemiptera make up the largest nonholometabolan insect assemblage. Despite previous efforts to elucidate phylogeny within this group, relationships among the major sub-lineages remain uncertain. In particular, mitochondrial genome (mitogenome) data are still sparse for many important hemipteran insect groups. Recent mitogenomic analyses of Hemiptera have usually included no more than 50 species, with conflicting hypotheses presented. Here, we determined the nearly complete nucleotide sequence of the mitogenome for the aphid species of Rhopalosiphum padi using RNA-seq plus gap filling. The 15,205 bp mitogenome included all mitochondrial genes except for trnF. The mitogenome organization and size for R. padi are similar to previously reported aphid species. In addition, the phylogenetic relationships for Hemiptera were examined using a mitogenomic dataset which included sequences from 103 ingroup species and 19 outgroup species. Our results showed that the seven species representing the Aleyrodidae exhibit extremely long branches, and always cluster with long-branched outgroups. This lead to the failure of recovering a monophyletic Hemiptera in most analyses. The data treatment of Degen-coding for protein-coding genes and the site-heterogeneous CAT model show improved suppression of the long-branch effect. Under these conditions, the Sternorrhyncha was often recovered as the most basal clade in Hemiptera.

Hemiptera is the largest nonholometaboan group of insects, with approximately 82,000 described species 1 . The hemipteran insects have distinctive piercing-and-sucking mouthparts, which make them better adapt to extensive evolutionary radiation 2 . Many species are the important insect pests due to their high reproductive rates and characteristic ability of transmitting human and plant diseases. Despite the significant importance of Hemiptera in biology, the phylogenetic relationships within this group remain unresolved. In particular on the deep-level relationships, almost every possible arrangement among super-families has been proposed. Different relationships of super-families lead to incongruent hypotheses on inter-suborder relationships. Traditionally, two insect groupings rank as suborders or orders: Homoptera and Heteroptera, constituting the Hemiptera [3][4][5] . The former includes the two suborders Sternorrhyncha and Auchenorrhyncha, while the latter comprises the Heteroptera and Coleorrhyncha (which together form clade Heteropterodea) 6 . However, numerous morphological and molecular studies have shown that Homoptera is not a monophyletic group 7-11 (e.g., Fig. 1A-C). More recently, mitogenomic 12,13 and whole genomic 14 data also recovered the Sternorrhyncha as the sister group to all other Hemiptera, rendering the Homoptera to be a paraphyletic group (e.g., Fig. 1D). In contrast, some other researches supported the monophyly of Homoptera [15][16][17] , that is the Cicadomorpha, Fulgoromorpha and Sternorrhyncha clustered in one clade (e.g., Fig. 1E,F). The key to determine whether Homoptera is monophyletic or not is in the placement of Sternorrhyncha. Besides the question on the Homptera, the monophyly of Auchenorrhyncha 18 and the position of Coleorrhyncha 14,17,19 are also the focus of the debate on the higher-level phylogeny of Hemiptera.
The debate may be due, in part, to the unbalanced distribution of mitogenomic studies among suborders of Hemiptera. Traditionally, the order Hemiptera are comprised of three suborders: Sternorrhyncha (aphids, psyllids, whiteflies, and coccids), Auchenorrhyncha (cicadas, spittlebugs, leafhoppers, treehoppers, and planthoppers), and Heteroptera (true bugs) 20,21 . To date, the determined complete or nearly complete mitogenomes Scientific RepoRts | 6:33465 | DOI: 10.1038/srep33465 are mainly concentrated in the suborder Heteroptea (61 sequenced heteroptean mitogenomes, data to January 2015), while limited mitogenomes from suborder Sternorrhyncha can be available in GenBank (only 19 sternorrhynchan species of total 103 hemipterans). Therefore, increasing the number of taxa sampled, particularly in Sternorrhyncha, is needed to robustly test the value of mitogenome data in resolving relationships within Hemiptera. Classically, most insect mitogenomes were sequenced by long PCR plus primer walking under a series of designing conserved PCR primers. However, this method is time consuming and inefficient due to the varied amplification conditions to different insect lineages. In comparison, retrieving mitogenome data from transcriptome sequencing could overcome the difficulty of long PCR and cumbersome primer designs, which typically includes all of the mitochondrial protein-coding and rRNA genes 22 . The thirteen mitochondrial protein-coding genes (PCGs) and two rRNA genes contain the vast majority of phylogenetic information of the whole mitogeome 23 , and can meet the requirement of inferring phylogeny using majority of mitogenome data.
Rhopalosiphum padi (Aphididae) is an economically important pest on wheat and the main vector of barley yellow dwarf virus on cereals in the world. Infection with barley yellow dwarf virus causes wheat to turn yellow, and further leads to grain number and weight to reduce sharply. Affected plants are generally severely stunted and non-productive. The resulting losses in grain yield are often more than 40% in China 24 . Although R. padi has been the subject of extensive biological and molecular studies, phylogenetic study of this insect species is limited. According to prior studies on Hemiptera phylogeny 12,15 , the clade of aphids have relatively lower evolutionary rate than whiteflies in the suborder Sternorrhyncha, and the former usually display shorter branch length. Additional mitogenome sequence from aphids, for example the R. padi, can be helpful to ameliorate the long-branch problems in Hemiptera.
In the present study, we applied the method of RNA-seq plus gap filling to R. padi mitogenome determination. The nearly complete mitogenome of R. padi was sequenced and annotated, with the exception of the trnF locus, which required direct sequencing. In addition, improved phylogenetic analyses with increased mitogenome data (including one newly sequenced mitogenome of R. padi for this study plus 121 published insect mitogenomes from GenBank) were utilized to investigate the Hemiptera relationships, with an emphasis on the placement of Sternorrhyncha. To reduce the effect of sequence saturation and compositional heterogeneity, which are potentially related to long-branch attraction artifact, the most comprehensive methods of Hemiptera phylogeny including a series of data coding schemes, locus refinement, phylogenetic inference method, and model settings are performed for tree reconstruction.

Methods
Sampling strategy. Lab-reared populations of R. padi used for high-throughput sequencing and mitogenome assembly were originally collected in China (Zhengzhou, Henan province, November 2014). Total RNA was extracted from 35 to 50 aphid individuals using Trizol reagent (Invitrogen, CA, USA) following the manufacturer's protocol.
Construction of R. padi mitogenome. Total RNA was quantified using a NanoPhotometer spectrophotometer (IMPLEN, CA, USA) and RNA quality was verified using a 2100 Bioanalyzer with the RNA Nano 6000 Assay Kit (Agilent Technologies, CA, USA). The cDNA library were constructed using IlluminaTruSeq RNA Sample Preparation Kit (Illumina, San Diego, CA, USA). The sequence sample contained > 50 μ g of total RNA, which was diluted with nuclease-free ultrapure water to a final volume of 50 μ l before mRNA was purified from total RNA using poly-T oligo-attached magnetic beads. The purified mRNA was fragmented and used for first strand cDNA  15 monophyletic Homoptera based on limited mitogenomic data. synthesis using reverse transcriptase Super Script II and random hexamers. The RNA templates were then removed using RNaseH and the second strand was synthesized using DNA polymerase I to generate double-stranded cDNA fragments. The purified cDNA fragments were repaired on the 3′ end and adenylated before being ligated to sequencing adapters. Size selection was performed using AMPure XP beads. Finally, the cDNA fragments were enriched by PCR amplification using random hexamers, and products were purified by AMPure XP beads to generate final library. The size and purity of the cDNA sequencing libraries were determined using a 2100 Bioanalyzer with the RNA Nano 6000 Assay Kit (Agilent Technologies, CA, USA) and the quantity was estimated using Q-PCR. RNA transcript was sequenced on an Illumina Hiseq 2000 in Novogene Bioinformatics Institute (Beijing, China). In total, 62,468,816 raw data of paired-end reads of 100 bp length were generated in a lane.
Raw data of fastq format were processed through in-house perl scripts. In this step, reads containing adapters, reads containing ploy-N, and low quality reads were removed from raw data. At the same time, Q20, Q30, GC-content and sequence duplication level of the cleaned data were calculated. All the downstream analyses were based on clean data with high quality. A global de novo assembly of the resultant reads was performed using the Trinity method with min_kmer_cov set to 2 by default and all other parameters set default. To annotate the obtained unigenes, the databases of NR (with a cutoff evalue ≤ 1e-5), NT (evalue ≤ 1e-5), KO (evalue ≤ 1e-5), Swiss-Prot (evalue ≤ 1e-5), PFAM (evalue ≤ 1e-5), GO (evalue ≤ 1e-6), and KOG/COG (evalue ≤ 1e-3) were searched. The resultant data were inputted into BioEdit version 7.0.5.3 25 to build a local BLAST to search mitochondrial genes, with the published aphid mitogenomes (mainly using Schizaphis graminum and Sitobion avenae) 26,27 as bait sequences.
Through local BLAST searching, twelve complete or partial sequences of mitochondrial PCGs were found in the RNA-seq results, while the atp8 was missing. In addition, partial regions of two mitochondrial rRNA genes (1200 bp 3′ end of rrnL and 322 bp 5′ end of rrnS) and the full length trnV gene were identified in the transcriptome assembly. The remaining mitochondrial gene fragments were sequenced from genomic DNA by designing primers and PCR amplification (primers are listed in Table S1). Total genomic DNA was isolated using Qiagen DNA extraction kits (Qiagen, Beijing, China). The species-specific primers were designed based on the sequences from the RNA-seq, and by referring to Simon et al. 28 . The PCR cycling parameters were as follows: initial denaturation of 5 min at 94 °C; 35 cycles of 30 s at 95 °C, 30 s at the annealing temperature 45-55 °C, 1-2 min elongation at 72 °C; and a final elongation of 10 min at 72 °C. The PCR products were electrophoretically inspected in 1.5% agarose gels, and directly sequenced after purification. DNA sequencing was performed with a BigDye Terminator Cycle Sequencing Kit and an ABI 3730XL Genetic Analyzer (PE Applied Biosystems, USA).
The mitochondrial DNA sequences were assembled by SeqMan as implemented in Lasergene software package (DNAStar, Inc.). The finally assembled mitogenome sequences were annotated by MITOS 29   was generated with mtviz http://pacosy.informatik.uni-leipzig.de/mtviz/ (Fig. 2). New mitogenome sequence obtained in this study was deposited in GenBank under accession number of KT447631.  Table 1. Nodal supports and branch lengths for major lineages in each tree. The branch lengths were calculated from the longest terminal taxon of each lineage to the common ancestor to the Paraneoptera. "-" denote the monophyletic lineage not to be recovered by the dataset. NS: nodal support; BL: branch length. the 37 mitochondrial genes were aligned for further analyses. For PCGs stop codons were excluded, each was aligned separately based on the invertebrate mitochondrial genetic code with Perl script transAlign 32 , and then each alignment was concatenated in a single matrix using FASconCAT_v1.0 33 . Both the mitochondrial tRNA and rRNA genes were aligned with reference to the conserved secondary structure. Every tRNA gene was aligned manually: first, the anticodon was located for each gene; second, the seven base-paired anticodon arm sequences were found; third, separately from both sides of the sequence, the acceptor arm sequences were partitioned; finally, the highly variable regions (DHUarm, Tψ C arm and variable loop of them) were refined by MUSCLE in MEGA version 6.0 34 . All 22 tRNA gene alignments were concatenated by FASconCAT to construct the tRNAs dataset. Each of the two rRNAs was aligned separately by the R-Coffee web server 35 , and the aligned sequences were refined by eye and compiled as the dataset of rRNAs. Nucleotide composition of these sequences including the R. padi determined in this study was calculated using MEGA. GC-skew values were calculated under the formula: (G-C)/(G + C). Sequence potential saturation was assessed using the index of substitution saturation (Iss) of Xia et al. 36 implemented in DAMBE5 37 . To detect nucleotide homogeneity across taxa, the chi-square test was performed for the concatenated datasets using PAUP 4.0b10 38 . Estimates of synonymous (dS) and nonsynonymous (dN) substitution rates of each PCG were obtained using the program yn00 of PAML package 39 . And the Orthoptera was used as references. Based on the results from substitution rate analysis, the relatively conservative gene locus were selected to construct the matrix to reduce the effect of rapid evolutionary rate of mtDNA on tree building. That is the dataset of 122taxa_ PCG_7genes_AA: the amino acid sequences of 122 taxa from seven PCGs (i.e., atp6, cox1, cox2, cox3, cytb, nad1 and nad4).

Phylogenetic reconstruction and tests.
To eliminate the effect of saturation and compositional heterogeneity on phylogenetic reconstruction, we applied six data coding strategies to the sequences of PCGs To test effect of combined analysis on the tree building, sequences of RNA genes were concatenated into the PCGs alignments. Four correspoding datasets were compiled: 1) PCGRNA: 13 PCGs, two rRNAs and 22 tRNAs with 14,502 nucleotides; 2) PCG12RNA: 13 PCGs removing the third codon positions, two rRNAs and 22 tRNAs with 10,819 nucleotides; 3) PCG1RY2ndRNA: 13 PCGs with 1RY+ 2nd coding strategy, two rRNAs and 22 tRNAs with 10,819 nucleotides; 4) PCGDegenRNA: 13 PCGs with Degen-coding strategy, two rRNAs and 22 tRNAs with 14,502 nucleotide. In addition, to investigate the effect of taxon sampling on tree topology, 106 taxa dataset with all genes (i.e., 106taxa_PCGRNA) were created.
Tree searches were conducted on each of three types of genes (PCGs, tRNAs and rRNAs) and on the combined dataset. In total, fourteen datasets were analyzed using both Maximum likelihood (ML) and Bayesian inference (BI) ( Table 1). Before undertaking ML analyses, PartitionFinder 42 was employed to infer the optimal partitioning strategy, meanwhile the best-fitting model was selected for each partition using the Bayesian Information Criterion (BIC). The data blocks were defined by gene types (each of 13 PCGs, 22 tRNAs and 2 rRNAs separately as independent block) and by codon positions. The partition schemes and best-fitting models were calculated for 122taxa_PCGRNA and 122taxa_PCG_AA, and 106taxa_PCGRNA, respectively (Table S3). ML searches were carried out using the partition schemes and the selected models described above with RAxML as implemented in the CIPRES Portal 43 . Support for nodes was assessed with the fast bootstrap method using 1000 non-parametric bootstrap inferences.
The BI analyses were initially conducted using MrBayes_v.3.2 43,44 with the following priors: independent substitution model for each partition separated by genes and codons, all model parameters were unlinked across partitions, four Markov chains, two independent runs each for 10 million generations, sampling every 1000th generation, and the first 25% discarded as burn-in. Convergence was considered to be reached when the average standard deviation of split frequencies fell below 0.01. BI analyses were also performed by PhyloBayes with a parallel version (pb_mpi1.5a) 45,46 as implemented on a HP server with twenty-four CPU and 64 G memory. The model GTR-CAT was used for nucleotide analyses, while the model CAT for amino acids. Two chains were run, and started from a random topology. The Maximum\maxdiff " value to accept was set as 0.1. All sequence alignment files and original tree files constructed in this article are available in the TreeBASE http://purl.org/phylo/ treebase/phylows/study/TB2:S18206.
To test statistically the conflict between alternative hypotheses of Hemiptera phylogeny, we compared relations proposed in previous studies (e.g., the basal position of Sternorrhyncha in Fig. 1A: Campbell et al. 9 and Fig. 1D 15 ) to the molecular phylogeny obtained in this study. The Shimodaira-Hasegawa test (SH) 47 and the approximately unbiased test (AU) 48 were conducted for a series of datasets with 122 taxa. The site-log-likelihood values were calculated under the GTR + I + G model for nucleotides and the MtREV + I + G model for amino acids using TREE-PUZZLE 5.3 49 .
The obtained values were used as input for the software CONSEL 50 . Constraint likelihood trees were constructed on the basis of dataset of 122taxa_PCGRNA using RAxML as implemented above. After RNA-seq and directly sequencing for gap filling, the final assembly of mitogenome of R. padi contained 13 PCG, 2 rRNA and 21 tRNA genes, and a putative control region (Fig. 2, and Table 2). The only gap region not being determined was located between trnE and nad5. Though we successfully amplified this region by PCR reactions, yet sequencing consistently failed due to higher A + T content or highly repetitive sequences within it. The mitogenome of R. padi was very compact, twelve overlaps (a total of 42 bp) between adjacent genes were observed. The intergenic spacers were relatively small, except for the gap region found between trnE and nad5. The largest one (10 bp) was found in the region between trnS(UCN) and nad1. All thirteen PCGs in R. padi started with the typical codon ATN. Specifically, two genes (cox3 amd cytb) started with ATG, three (atp8, atp6 and nad3) with ATT, and the remainder with ATA. Besides canonical stop codons (TAA or TAG), a single T was used for cox1, nad4 and nad5.
The standard 21 tRNA genes were identified in the mitogeome of R. padi, which ranged from 62 bp [trnD, trnS(AGN), trnT and trnV] to 73 bp (trnK) in size. The two rRNA genes (rrnL and rrnS) in the R. padi mitogenome were located between trnL(CUN) and trnV, and between trnV and the putative control region, respectively (Fig. 2, and Table 2). The lengths of the rrnL and rrnS genes were respectively 1,259 bp and 767 bp, with the A + T contents of 85.1% and 84.0%. A putative control region was determined between rrnS and trnI, with length of 745 bp and A + T content of 87.1%. In the insect mitogeome, the control region is also called AT-rich region, like that in R. padi with a control region having the highest A + T content through the whole majority strand. Three repeat motifs with about 100 bp elements were detected in the 5′ end of the control region of R. padi, which was followed by a [TA(A)]n-like region with 230 bp in size. The 3′ end was composed of a region with higher G + C content.

Saturation test and evolutionary rate estimate. None of the DAMBE tests yielded an observed
index of substitution saturation (Iss) greater than the critical value (Iss.c), with the exception of the third codon position when NumOTU was 32 (Table 3). For the index of Iss.cAsym, only values of Iss based on PCG12 and PCG2 were significantly lower than Iss.cAsym when NumOTU was 32. The PCG3 failed to pass this test whether NumOTU was 16 or 32. This result suggested that the positions of third codon experienced so much saturation that they had poor phylogenetic information for tree reconstruction. Chi-square tests indicated significant heterogeneity of base composition between taxa in each codon, and each types of gene dataset (p < 0.05). Estimates of substitution rates showed that data matrix compiled from cox1 (dN = 0.1623), cytb (dN = 0.2428), cox2 (dN = 0.2803), cox3 (dN = 0.3084), nad1 (dN = 0.3235), atp6 (dN = 0.3475) and nad4 (dN = 0.3820) contained fewer non-synonymous substitution sites.

Phylogenetic analyses.
For each tree reconstructed in the phylogenetic analyses, nodal support values for major lineages and corresponding branch lengths are provided in Table 1. In all analyses, a monophyletic Aphidoidea was recovered. In addition, the monophyly of Aphididae and a sister group relationship between R. padi and S. graminum were supported by the trees from various datasets. These resulting phylogenies confirmed the validity of mitogenome data of R. padi sequenced in this study. Separate analysis. The mitochondrial rRNA, tRNA and PCG genes represent three different kinds of makers, which resulted in different phylogenetic relationships within Hemiptera. In the ML analyses based on rRNAs dataset, Hemiptera was monophyletic, and the most basal position of Sternorrhyncha was supported, which led to a paraphyletic Homoptera. Although rRNAs MrBayes recovered a monophyletic Hemiptera, the Sternorrhyncha was recovered in a more derived position, and the Homoptera was retained. PhyloBayes analyses under site-heterogeneous model CAT produced a similar tree structure to MrBayes, the only difference was on the placement of Coleorrhyncha.
In contrast, the tRNAs dataset displayed weaker resolution of Hemiptera phylogeny. All tRNAs analyses failed to recover the monophyly of Hemiptera, due to the nested position of Thysanoptera and Phthiraptera.
For PCGs, different data coding strategies (i.e., PCG123, PCG12, PCG3RY, PCG1RY2nd, PCGDegen and PCG_AA) had dramatic effects on the estimated phylogenetic trees. With the site-homogeneous model, ML and MrBayes analyses yielded nearly identical tree topology, where outgroups Thysanoptera and Phthiraptera were always embedded into the ingroup and had a close affinity to the long-branched Sternorrhyncha. Under the heterogeneous model, the PhyloBayes analyses produced better-resolved trees. In the PhyloBayes trees based on the PCG1RY2nd and PCGDegen datasets, the long branches were separated, and outgroups Thysanoptera and Phthiraptera were pulled into a more basal position. By removing the Thysanoptera, the monophyly of Hemiptera were supported by both datasets. Nonetheless, only the PCGDegen recovered a basal placement of Sternorrhyncha thus causing the Homoptera to be a paraphyletic group. The remaining four PCG datasets (i.e., PCG123, PCG3RY, PCG12 and PCG_AA) gave poor results in respect of breaking long-branched assemblages. Though the PhyloBayes analysis based on amino acid sequence from seven relatively conserved PCGs (i.e., the dataset of 122taxa_PCG_7genes_AA) could not recover a monophyletic Hemiptera, the long-branched outgroups were set apart from the ingroup taxa with the longest branches and pulled toward the base of tree. This demonstrated that employing relatively conserved gene regions can reduce long-branch attraction to some extent.

Combined analysis.
Despite with potential saturation, the third codon positions can still contain phylogenetic signal [52][53][54] . Thus, the third codon positions were included in the combined analyses. ML and MrBayes analyses with combined datasets revealed very similar tree topologies to those from separate analysis. In all these trees, intermediate branching position of Sternorrhyncha was retrieved, and it consistently clustered with Thysanoptera and Phthiraptera.
PhyloBayes analyses of the combined data using the site-heterogeneous model showed improvement in suppressing long-branch attraction. Four combined PhyloBayes analyses only differed in the extent of breaking long branches. Both PCGRNA and PCG1RY2ndRNA recovered outgroups Thysanoptera and Phthiraptera as the basal clades in the trees. And in both trees, the Sternorrhyncha formed the most basal hemipteran lineage, and was a sister group of all other Hemiptera (Fig. 4). The PCG12RNA and PCGDegenRNA resulted in the topology that only Thysanoptera were recovered outside ingroup. Despite being separated away from the longest branching Sternorrhyncha, the Phthiraptera still fell inside ingroup and were in a sister clade to the Fulgoroidea.
The weak support for key nodes and frequent conflicts between different analyses based on the above-mentioned concatenate datasets led us to do further analyses. Long branches occurred in all datasets, in particular with regard to the outgroup clades Thysanoptera, Phthiraptera, Liposcelididae and the ingroup Aleyrodidae (Sternorrhyncha). Thus, in order to reduce the effect of long-branch attraction, we excluded the Thysanoptera, Phthiraptera, and Liposcelididae to compile a reduced dataset of 106taxa_PCGRNA. Based on this dataset, ML and MrBayes analyses recovered a monophyletic Hemiptera, and supported the Sternorrhyncha as a sister group to the remaining Hemiptera (BP = 100, and PP = 1) (Fig. 5). The position of clade (Coleorrhyncha + Fulgoromorpha) changed dramatically with comparison of Fig. 4, and they together grouped with the Heteroptera. However, this relationship received no statistical support. The Phylobayes analysis on the basis of 106taxa_PCGRNA recovered a different relationships within Hemiptera, namely that the cicadas diverged firstly and the clade (Coleorrhyncha + (Fulgoromorpha + Sternorrhyncha)) formed a sister group to Heteroptea. But the latter sister group relationship was weakly supported (PP = 0.7).
The topology test indicated that the ingroup relationships depicted in Fig. 5 represented the most likely Hemiptera phylogeny, and other hypotheses were confidently rejected (Table 4). On the basis of likelihood scores, the ingroup topology of Fig. 5 was most similar to the hypothesis of Fig. 4. For the remaining alternative hypotheses, the Hemiptera phylogenetic relationships inferred by the study of Misof et al. 14 (Fig. 1F) had the highest likelihood scores across all tests. This indicated that our phylogenetic inference of Hemiptera was more similar to that of Misof et al. 14    tomes are now easier to obtain using RNA-seq. Based on the transcriptome data, relatively few primers (13 primers in this study) were used to amplify short gaps mainly in the region of tRNA genes. Whereas, conventional method based on primer walking strategy required 40-50 primers to complete a typical insect mitogenome 55 . This study demonstrated the feasibility of using RNA-seq and gap-filling sequencing for de novo assembly of insect mitogenome. The mitogenome of R. padi examined in this study has the similar gene content, order, and structure to other published aphid mitogenomes 26,27,30,31 . For the mitochondrial PCGs, the presence of incomplete stop codons is a common phenomenon found in insect mitogenome including the published aphid mitogenomes, for example the Diuraphis noxia 30 . A common interpretation for this phenomenon is that the complete termination codon is created by polyadenylation of mRNA 56 . In this study, the poly(A) stretches were found in the 3′ end of cox1 and nad4 transcripts, which might be critical to generate the complete termination codon 57 .
In the case of   trnS(AGN) of R. padi, the dihydrouridine (DHU) arm cannot form, as in many other insect species 58 . The gene length and base composition of two R. padi rRNAs are similar other aphid species 26,27,30,31 . Strategies to ameliorate long-branch attraction artifact. Sequences of the mitogenome have been extensively used for inferring phylogenetic relationships at different taxonomic levels 59 . In particular, they have been widely used for deciphering intraordinal relationships within insects 12,60,61 . Due to mutational saturation, heterogeneity in nucleotide composition, and lineage-specific rate acceleration, the usefulness of mitogenome as a marker for higher level insect systematics remains controversial 62 . Some insects within Paraneoptera have been shown to be having accelerated substitution rates and significant saturation on the third codon positions 13,31,63 . In the phylogenetic reconstruction, these insects usually display extremely long-branch length, which have the potential to cause the long-branch attraction artifacts (LBA) 13,64 . Within Hemitptera, the Aleyrodidae (Sternorrhyncha) showed the longest branches ( Table 1, branch lengths of the Sternorrhyncha are represented by the longest whitefly), and consistently clustered together with long-branched outgroups Thysanoptera, Phthiraptera and Lepidopsocidae, particularly in analyses using homogeneous model. This result may be due to their shared nucleotide compositional biases as shown by the high GC-skew values ( Fig. 3 and Fig. S1). Here we applied various sequence coding strategies and model settings to reduce the impact of long-branch attraction. From the analysis of 13 different datasets with full taxa (i.e., rRNAs, tRNAs, PCG123, PCG_AA, 7gene_AA, PCG12, PCG3RY, PCG1RY2nd, PCGDegen, PCGRNA, PCG12RNA, PCG1RY2ndRNA, and PCGDegenRNA) under homogeneous model, only the rRNA sequences recovered the Hemiptera as monophyletic (Table 1). A previous study had suggested that different mitochondrial genes have distinct rates of molecular evolution 65 . According to Mueller (2006), two mitochondrial rRNAs (rrnS and rrnL) have the slowest rates of evolution 65 . Thus, our results indicated that slowly evolving genes were relatively immune to LBA artifacts, and might be preferred loci for resolving deep-level relationships in Hemiptera. In addition, increased taxon sampling is necessary to break up long branches. Previous phylogenetic studies of Hemiptera have shown that the aphid is one of the closet relatives of whiteflies 9,10,14,15,18 . Although they are both the stemorrhynchans with similar biological characters, such as faster generation times and more generation, yet the aphids' mitogenomes exhibit lower sequence evolutionary rate and shorter branch lengths than whiteflies 12,15,31 . Thus, addition of aphid mitogenome data has the potential to alleviate the long-branch effect caused by whiteflies, and to increase the accuracy of phylogenetic estimation of Hemiptera. Among various data treatments, Degen-coding strategy may be the most effective method of suppressing long-branch attraction (Fig. S2). Degen-coding was designed to reduce nucleotide compositional heterogeneity and improve resolution of deep-level arthropod relationships 40,41 . This approach eliminates all synonymous changes by extending other coding schemes (e.g., RY-coding) to degenerate all codons completely. At the same time, all nonsynonymous changes are retained at the third codon position. This is the advantage of this type of data treatment method compared with other data coding schemes. Because the often used data coding strategy of completely removing sites pays a very high cost of reduction in resolution of phylogenetic relationships 66 . In the trees inferred from Degen-coding datasets (PCGDegen and PCGDegenRNA), the long-branch attraction between ougroup and ingroup was significantly reduced. Moreover, for the hypothesis testing, Degen-coding datasets showed greater likelihood scores (Table 4). This also demonstrated the benefit of Degen-coding strategy to Hemitpera phylogenetic inference. In addition, our analyses indicated that the site-heterogeneous model can mediate long-branch effects and recover a rational Hemitpera phylogeny. The fit of the heterogeneous model to paraneopteran mitogenomic data over the homogeneous model was also suggested in prior study by Li et al. 13 . Finally, in order to obtain a well-resolved Hemiptera phylogeny from current mitogenomic data, the long-branch outgroup taxa were excluded to overcome LBA artifacts. This approach had a drastic effect in the tree reconstruction, where Hemiptera were recovered by all optimal criteria. And the Sternorrhyncha was strongly supported as the earliest splitting lineage in Hemiptera (Fig. 5).
Mitogenome-based phylogeny of Hemiptera. From the point of view of avoiding LBA artifacts, the Hemiptera relationships illustrated in Fig. 5 are preferred as the best estimation. This hypothesis was further corroborated by the topology test (Table 4). On the basis of this tree, the monophyly of Hemiptera was strongly supported. Sternorrhyncha was placed as the sister group to all other Hemiptera, rendering Homptera paraphyletic. This arrangement is in concordance with most recent phylogenetic studies on Hemiptera 7-10, [12][13][14] . Despite these, we should acknowledge the limitations of mitochondrial data in solving the deep-level phylogeny of Hemiptera, in particular some deepest nodes of the tree in this study do not receive good support. Mitogenome is a kind of rapidly evolving gene locus. In particular, contrasting rates of mitogenomes among paraneopteran insects resulted in significantly uneven branch length on phylogenetic trees (Table 1). This confounded the reconstruction of hemipteran relationships. If sequence compositional heterogeneity is not considered, simultaneously including long-branched outgroup and ingroup taxa in a phylogenetic analysis must introduce LBA artifacts. Under this situation, applying appropriate data treatments can improve phylogenetic results of Hemiptera. This study has shown that some recoding strategies can reduce the degree of compositional heterogeneity and substitution rates of mitogenome sequences. When the Degen-coding or RY-coding schemes were used to phylogenetic analyses, better resolved hemipteran trees were inferred from the full taxa datasets under the site-heterogeneous model.
The Cicadomorpha was placed next to Sternorrhyncha in the ML tree from 106taxa_PCGRNA (Fig. 5). For the remaining hemipterans, a sister-group of Coleorrhyncha and Fulgoromorpha formed a clade, which is sister to Heteroptea. This result appears anomalous, but it is in agreement with the fossil evidence that fulgoromorphs arose independently of a polyphyletic Cicadomorpha at the end of the early Permian 67 . The close relationship between Coleorrhyncha and Fulgoromorpha was recovered by all analyses in the present study. This arrangement largely concurred with the results from Cui et al. 12 and Misof et al. 14 . However, it rejected the hypothesis of Heteropterodea [7][8][9][10]18,19 . The paleontological record indicated no evidence of an immediate common ancestor of Coleorhyncha and Heteroptera, and they originated independently from separate lineages of auchenorrhynchan ingruids 68 . The Auchenorrhyncha is another controversial problem of Hemiptera phylogeny. Most molecular studies suggested Fulgoromorpha and Cicadomorpha were likely to be separate lineages occupying independent positions within Hemiptera 9,10,69 . This conclusion led to a non-monophyletic Auchenorrhyncha. Conversely, in some analyses 18,70 , Auchenorrhyncha was supported as a monophyletic group. Although the phylogenetic affiliations inferred by dataset of 106taxa_PCGRNA did not support the Auchenorrhyncha, three other analyses (122taxa_rRNAs RAxML, 122taxa_PCGDegen PhyloBayes and 122taxa_PCGRNA PhyloBayes) recovered a group of Auchenorrhyncha including Coleorhyncha in this study. Thus, the monophyly of Auchenorrhyncha deserved to be further tested by more mitogenome data from homopteran insects.
As a whole, the preferred hemipteran genealogy at suborder and infraorder levels reconstructed on the basis of current mitogenomic data are more similar to Misof et al. 14 . Both studies supported the most basal placement of Sternorrhyncha, and a closer relationship of Coleorhyncha to Fulgoromorpha than Heteroptera.
Sequence and original tree files. Sequence and original tree files constructed in this article are available in the TreeBASE http://purl.org/phylo/treebase/phylows/study/TB2:S18206.