Transcriptomes of three species of Tipuloidea (Diptera, Tipulomorpha) and implications for phylogeny of Tipulomorpha

Tipulomorpha has long been a problematic taxon in terms of familial composition, phylogenetic relationships among families and position relative to other ‘lower’ Diptera. Whole-transcriptome shotgun sequencing provides a powerful basis for phylogenetic studies. We performed de novo transcriptome sequencing to produce the first transcriptome datasets representing the families Pediciidae, Limoniidae and Cylindrotomidae using high-throughput sequencing technologies. We assembled cDNA libraries for Pedicia vetusta (Alexander) (Pediciidae), Rhipidia sejuga Zhang, Li and Yang (Limoniidae) and Liogma simplicicornis Alexander (Cylindrotomidae). Using the Illumina RNA-Seq method, we obtained 28,252, 44,152 and 44,281 unigenes, from the three respective species. Based on sequence similarity searches, 12,475 (44.16%), 20,334 (46.05%) and 17,478 (39.47%) genes were identified. Analysis of genes highly conserved at the amino acid sequence level revealed there were 1,709 single-copy orthologs genes across the analyzed species. Phylogenetic trees constructed using maximum likelihood (ML) based on the 1,709 single-copy orthologs genes indicated that the relationship between the four major infraorders of lower Diptera was: Culicomorpha + (Tipulomorpha + (Psychodomorpha + (Bibionomorpha + Brachycera))). Trichoceridae belongs within Tipulomorpha as the sister-group of Tipuloidea. Highly supported relationships within the Tipuloidea are Pediciidae + (Limoniidae + (Cylindrotomidae + Tipulidae)). Four-cluster likelihood mapping was used to study potential incongruent signals supporting other topologies, however, results were congruent with the ML tree.


Introduction
The infraorder Tipulomorpha is one of the most speciose groups of flies [1][2]. According to Hennig, this infraorder included four families, Trichoceridae, Tipulidae, Cylindrotomidae and Limoniidae, of which Trichoceridae was considered the sister-group of the remaining Tipulomorpha [3][4][5]. This arrangement of Tipulomorpha, containing both Trichoceridae and Tipuloidea (= Tipulidae sensu lato, or Cylindrotomidae, Limoniidae, Pediciidae, and Tipulidae a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 sensu stricto) was accepted by Dahl [6], Griffiths [7], Starý [8], Oosterbroek & Courtney [9] and Bertone et al. [10]. Hennig further hypothesized that Tipulomorpha was the sister-group of all remaining Diptera, a classification accepted by Krzeminski [11], Michelsen [12] and Blagoderov et al. [13], and partly accepted by Wood & Borkent [14]. Based on larval characters, Wood & Borkent considered that the concept of Tipulomorpha was restricted to just Tipuloidea, and the infraorder was sister-group to all other Diptera, while Trichoceridae was assigned to the Psychodomorpha [14]. This shift of Trichoceridae from the Tipulomorpha to nested within Psychodomorpha was also suggested by Friedrich & Tautz [15]. Although Oosterbroek & Courtney supported the sister-group relationship between Trichoceridae and Tipuloidea, they considered that Tipulomorpha was the sister-group of the clade Anisopodidae + Brachycera, and as such were a group of 'higher' Nematocera [9], a much more derived position than conceived by Hennig and followers [3][4][5]. More recent molecular and morphological studies have examined the higher classification of Diptera [10,[16][17][18], however, there is still no widely accepted hypothesis of the relationships among lower dipteran families, including the relationship among the four major nematoceran infraorders (Bibionomorpha, Culicomorpha, Psychodomorpha and Tipulomorpha). Tipulomorpha was found as the earliest branching of the four major infraorders of lower Diptera by Wiegmann et al. [16] and Beckenbach [17], while Lambkin et al. suggested Culicomorpha as the earliest branching infraorder [18]. Both the views were supported by Bertone et al. when using different analysis methods [10] (Fig 1).
Interfamilial relationships in the Tipuloidea are also unresolved. Tipuloidea was treated as a single family by Alexander [19][20], Savchenko [21][22][23] and Brodo [24]. The alternative classification as Tipuloidea with four families was used and supported by Hennig [4], Oosterbroek & Theowald [25] and Oosterbroek [26]. However, in both classification schemes, Pediciidae was recovered as nested within Limoniidae by both sets of studies. Alexander [19][20] and Savchenko [21][22][23] presented the earliest evolutionary hypotheses of Tipuloidea, which were qualitative and recovered relationships based on unstated criteria. Both of them considered Tipulidae as the sister-group of the remaining Tipuloidea. Starý raised the subfamily Pediciinae to full family rank and constructed a phylogenetic tree for Tipulomorpha, finding a monophyletic Limoniidae to be the sister-group to a clade containing Pediciidae + (Cylindrotomidae + Tipulidae) [8]. Recent studies revealed new insights into the higher-level classification of Tipuloidea and have suggested that, rather than Limoniidae, Pediciidae was the sister-group to the remaining Tipuloidea [27][28].
Whole-transcriptome shotgun sequencing provides a powerful basis for phylogenetic studies, and provides a means to overcome the limitations of multi-locus PCR based molecular phylogenetics [29][30]. In this study, we sequenced the cDNA from three species, P. vetusta, R. sejuga and L. simplicicornis, belonging respectively to the Pediciidae, Limoniidae and Cylindrotomidae. Additionally, we analyzed published transcriptome sequence data and annotated gene sets from publically available draft genome sequences. Based on data from representatives of the Trichoceridae, four families of Tipuloidea and the other three infraorders, we constructed a phylogenetic tree using maximum likelihood (ML). In addition, we applied Fourcluster Likelihood Mapping (FcLM) to study potential incongruent signal, which might not be revealed by traditional phylogenetic methods.

Ethics statement
No specific permits were required for the specimens collected for this study. The specimens were common in China and the field studies did not involve endangered or protected species. The species were not included in the "List of Protected Animals in China".

Specimen collection, preservation and RNA extraction
The specimens used for the RNA extraction and sequencing transcriptome were collected from several different regions of China. Specimens were immersed alive in RNAlater and crushed with sterile forceps immediately upon wild collection. Then they were stored at -80˚C  [16]. Combined molecular phylogenetic tree for Diptera (BV shown above branches shown above or below branches). (F) Beckenbach [17]. 1) Mitochondrial phylogenetic tree of major groups of Diptera derived from a Bayesian analysis of all major mitochondrial protein coding genes (PP shown above branches). 2) Bayesian mitochondrial tree using codon positions 1 and 2 for cox1-3, cytb, and atp6 genes, and all alignable sites for the ribosomal genes (PP shown above branches). (G) Lambkin et al. [18]. The Bayes combined majority rule consensus tree (PP shown above branches). until further processing. The number of specimens, stage, sex details, preserved information and collection data are listed in S1 Table. For each species, total RNA was extracted using the Trizol reagent according to the manufacturer's instructions (Invitrogen, CA, USA). RNA contamination and degradation was monitored on 1% agarose gels. Other quality parameters, such as purity, concentration and integrity, were examined using the NanoPhotometer1 spectrophotometer (IMPLEN, CA, USA), the Qubit1 RNA Assay Kit run on the Qubit12.0 Flurometer (Life Technologies, CA, USA), and the RNA Nano 6000 Assay Kit run on the Agilent Bioanalyzer 2100 system (Agilent Technologies, CA, USA).

Library preparation and transcriptome sequencing
Three cDNA libraries were prepared employing NEBNext1Ultra™ RNA Library Prep Kit for Illumina1 (NEB, USA), in each case using 3μg of total RNA. Messenger RNA was isolated by Poly-T oligo-attached magnetic beads and fragmented in fragmentation buffer under elevated temperature. After the first strand cDNA was synthesized, the second-strand cDNA synthesis was performed using DNA Polymerase I and RNase H. Libraries were then size-selected for cDNA target fragments of 150-200 bp with 3 μl of USER Enzyme (NEB, USA); this was followed by PCR amplification using Phusion High-Fidelity DNA polymerase, Universal PCR primers and Index (X) Primer. PCR products were purified using the AMPure XP system. The library preparations were sequenced using the Illumina HiSeqTM 2000 system.

Transcriptome assembly
Raw data were filtered to remove low quality reads, and reads containing adapter or Poly-N sequences. Quality parameters of clean data, such as Q20, Q30, GC-content and sequence duplication level, were calculated. Transcriptome assembly was accomplished using Trinity [31] with min_kmer_cov set to 2 and all other parameters set default based on the left.fq and right.fq pooled by the left files (read1 files) and right files (read2 files) for each library. The assembly process implemented in Trinity can be divided into three main steps. Firstly, all reads were broken into defined K-mers. Subsequently, these k-mers were merged to form edges, which were reported as contigs and then the contigs were clustered into components for construction of de Bruijn graphs. Lastly, real reads were used to resolve ties in the de Bruijn graphs and generate transcript sequences.

Published data and orthology assignment
Nucleotide sequence assemblies of published transcriptome data were obtained from the NCBI's Transcriptome Sequences Database (TSA) and other various web sources (S2 Table). We predicted the CDS and amino acid sequences using TransDecoder v1.0.0 (available on Github https://github.com/TransDecoder/TransDecoder).
The ortholog reference set for orthology assignment was selected on the basis of the database OrthoDB7 (http://cegg.unige.ch/orthodb7 and http://cegg.unige.ch/orthodb/browse). We compiled a set of genes that are single-copy orthologs among Diptera and occur in the genomes of each of the following eight reference species: Aedes aegypti, Anopheles gambiae, Culex quinquefasciatus, Lutzomyia longipalpis, Phlebotomus papatasi, Mayetiola destructor, Drosophila melanogaster and Drosophila persimilis. OrthoDB 7 specified 1,709 single-copy protein-coding genes (S3 Table). We downloaded the amino acid sequences that are associated with each of the 1,709 single-copy ortholog groups (OGs) along with their available gene description (S4 Table). We aligned the amino acid sequences for each OG with MAFFT v7.205 [32][33] using the L-INS-i alignment algorithm and then build pHMMs with the program hmmbuild from the HMMER 3.0 software package [34] on the basis of the resulting multiple amino acid sequence alignments.
Orthology assignment of the transcriptomes was assessed by HaMStR v13.2.4 [35]. We ran HaMStR with the following settings: (i) the E-value cut-off for the pHMM search was 1e-5, (ii) the reciprocity criterion was considered fulfilled if the candidate OG was found as best hit in at least one of the 8 reference species during the reciprocal best hit search (RBH) (relaxed option), (iii) in case of multiple transcripts being assigned to a given OG, the best set of non-overlapping transcripts was chosen while non-overlapping transcripts are automatically concatenated (representative option). We searched and removed all multipleassigned transcripts since it was possible that a given transcript was assigned to more than one OG.

Phylogenetic and FcLM analysis
Phylogenetic analysis was conducted on the alignments obtained using MAFFT v7.205 [32][33]. Spurious sequences or poorly aligned regions were removed from the multiple sequence alignment using trimAl v1.2 (automated1 option). ML analyses were constructed using PhyML v 3.0 with the default choice JTT [36]. The tree searching algorithm used the SPR options provided by PhyML. In order to analyze single phylogenetic splits, FcLM analyses were completed using the program TreePuzzle v5.3 [37][38]. We selected the following two phylogenetic hypotheses concerning the Tipulomorpha for FcLM testing: 1) which infraorder is the sister-group of the remaining Diptera, Culicomorpha or Tipulomorpha; 2) is Trichoceridae part of Tipulomorpha, or Psychodomorpha as suggested by Wood & Borkent [14]. For each hypothesis, all species in the decisive dataset were grouped into four clusters representing alternative resolutions of the phylogenetic question of interest.

Data deposition
The transcriptome sequencing data have been deposited in the NCBI Sequence Read Archive (SRA) database under the following accession numbers: SRR3452301 (P. vetusta), SRR3452300 (R. sejuga) and SRR3441821 (L. simplicicornis). The assembled transcriptomes have been deposited in the NCBI Transcriptome Shotgun Assembly (TSA) database under the accession numbers GEMI00000000, GEMJ00000000 and GEMK00000000 respectively.

Transcriptome sequencing and assembly
In this study, the transcriptomes were sequenced using the Illumina sequencing method. In total, 23

Functional annotation
The unigenes of the three species were annotated by searching against the NR [39], GO [40], COG [41], KOG, KEGG [42], Pfam [43], and Swiss-Prot databases ( After searching against the NR database, we obtained the sequence-homology distributions for each species (Fig 3). The majority of matches were with known genes from Aedes aegypti, followed by Culex quinquefasciatus, Anopheles gambiae, Anopheles sinensis and Anopheles darlingi.

Phylogenetic analysis
A phylogenetic tree, which included representatives of all tipulomorph families plus representatives of other lower dipteran families, was constructed based on the 1,709 single-copy orthologs genes using the ML method (Fig 7). In addition, according to the annotations of these single-copy orthologous genes, the following four groups of genes were also selected to construct ML trees (Fig 8) to make comparisons and provide insights into the phylogenetic importance of the groups of genes: 73 genes involved in ATP binding, 13 genes with receptor activity, 43 genes from the 'biological process associated with wing' group and 7 genes from the 'biological process associated with olfaction' group (S11 Table). In the tree based on all 1709 single-copy genes (Fig 7), the monophyly of each infraorder was strongly supported. This result was also strongly supported in three of the four analyses based on GO term group; the exception was the 'receptor activity' analyses in which Psychodomorpha is the only monophyletic infraorder (Fig 8B). 'Nematocera', or the 'lower' Diptera, was, as expected, paraphyletic. Relationships among the four major infraorders of lower Diptera in the 1709-gene analysis was Culicomorpha + (Tipulomorpha + (Psychodomorpha + (Bibionomorpha + Brachycera))). Of the GO group trees, the 'biological process associated with wing' analysis ( Fig 8C) had the same infraorder topology as the 1709-gene tree, 'ATP binding' analysis ( Fig 8A) supported a sister group relationship between Tipulomorpha and Psychodomorpha, and the 'biological process associated with olfaction' analysis ( Fig 8D) supported nematoceran monophyly, with Tipulomorpha as sister to Bibionomorpha. Infraordinal relationships could not be inferred from the 'receptor activity' analysis, as the monophyly of 3 of the 4 infraorders was not supported. The 1709-gene analysis (Fig 7) and one of the four GO-group (Fig 8D) supported the traditional concept of Tipulomorpha as containing Trichoceridae separate from Tipuloidea; in two of the other GO-group analyses Trichoceridae was nested within Tipuloidea (Fig 8A and  8C). As to the interfamilial relationships in Tipuloidea, Pediciidae was sister-group to a clade containing Limoniidae + (Cylindrotomidae + Tipulidae) in both the 1709-gene analysis and one of the GO-group analyses, whereas Pediciidae + Trichoceridae was sister to the remaining tipuloid families in the other two GO-group analyses (Fig 8A and 8C). Nodal support was uniformly strong across the analyses.

FcLM analysis
To evaluate signal for alternative resolutions of the Diptera tree we used Four Cluster Likelihood Mapping (FcLM) [37] to compare support for two major questions: (1) which infraorder is placed basally as sister to all remaining Diptera in our study? And (2) does our data support Trichoceridae placed in the Tipulomorpha (Table 3)? FcLM analysis favored Culicomorpha  Transcriptomes of Tipuloidea and transcriptome-based phylogeny of the Tipulomorpha (51.4%) over either Tipulomorpha (32.0%) or a clade comprising of Culicomorpha + Tipulomorpha (16.6%) (Fig 9A). Quartet mapping also showed strong support for Trichoceridae + Transcriptomes of Tipuloidea and transcriptome-based phylogeny of the Tipulomorpha Tipuloidea (Tipulomorpha monophyly) (87.5%) over either a sister grouping between Trichoceridae and Psychodidae (12.5%) or Trichoceridae outside of Tipuloidea + Psychodidae (0.0%) (Fig 9B). Both the results are compatible with the results of the ML tree based on all 1709 single-copy genes.

Discussion
Recent studies have demonstrated that whole transcriptomes can accurately resolve the phylogenetic relationships at various levels within insects [29][30]. A robust phylogeny of Tipulomorpha was obtained using three newly sequenced tipuloidean transcriptomes as well as some published transcriptome data. Strong support values were recovered for both traditionally recognized dipteran phylogenetic relationships as well as some recent novel findings (Fig 7). Several long standing issues in tipulomorph phylogeny are resolved, especially regarding its monophyly and placement with respect to other lower dipteran groups. As for the four major infraorders of lower Diptera, Culicomorpha was the basal branch in our analysis, which was consistent with former molecular [10] and morphological analysis [18]. The placement of Tipulomorpha towards the middle of the nematoceran grade was recovered, with Culicomorpha + (Tipulomorpha + (Psychodomorpha + (Bibionomorpha + Brachycera))). This is consistent with the result using Bayesian analysis from Bertone et al. [10]. The monophyly of Tipulomorpha as well as the sister-group relationship between Trichoceridae and Tipuloidea are confirmed. Several morphological characters of both larvae and adults support this sister group relationship, i.e. vein R 2 ending in R 1 , vein A 2 reaching the wing margin, reduction of male cerci, development of male terminalia from both imaginal discs and pupal ectoderm, and female cerci with a single article [4][5][6][7][8][9]. Despite differing in their resolution of the infraorders, both of Bertone et al. analyses supported a sister-group relationship between Trichoceridae and Tipuloidea based on multiple nuclear genes [10]. Analysis of whole mitochondrial genome data by Beckenbach did not support a monophyletic Tipulomorpha as Trichoceridae was sister to all other Diptera, however the exclusion of higher variability major genes and codon positions did support a sister-group relationship between Trichoceridae and Tipuloidea [17].
The superfamily Tipuloidea, which has a large number of species, are difficult to identify and really know. Therefore, the choice of taxa could influence the tree due to unpredictable rate dynamics (causing homoplasy or phylogenetic error). However, the family Tipulidae was the sole representative of Tipuloidea included in some previous studies (e.g. [17][18]) potentially biasing those phylogenetic findings. Limited taxon sampling could explain the conflicting results found in most previous studies for the placement and composition of Tipulomorpha. Our use of a very large gene sample may overcome the unpredictable rate effects that can come from insufficient taxon coverage, but additional studies with much broader taxon coverage would be required to fully explore this issue.
Regarding interfamilial relationships in Tipuloidea, our results corroborated Pediciidae as the sister-group to the remaining Tipuloidea, which was consistent with the results proposed by Ribeiro based on an analysis of 88 morphological characters [27] and Petersen et al. based on both morphological characters (adult, larvae and pupae) and nuclear sequence data (28S rDNA and CAD) [28]. Petersen et al. also showed a sister-grouping of Cylindrotomidae and Tipulidae, however, their placement within the Tipuloidea was less certain as there was no support for the monophyly of Limoniidae [28]. Although Limoniidae was found to be the sistergroup of Cylindrotomidae + Tipulidae in the present study, due to the limited number of exemplars sequenced (one per family) we can't rule out the possibility of a paraphyletic Limoniidae as found by Petersen et al. [28] with only some members of the family making up the sister-group of Cylindrotomidae + Tipulidae. More research into the higher-level classification of the Limoniidae based on a larger range of representative taxa should be used for phylogenetic analysis in the future.
Analysis of smaller numbers of genes corresponding to specific functional classes were less effective at resolving expected relationships among Tipulomorphan clades. GO clasess are broadly defined and specific aspects of molecular evolutionary rate and both identification or identity as an ortholog would affect both the use of the gene as phylogenetic marker and its ability to resolve a specific question. In general, it seems that larger gene samples that include multiple classes of genes and genes of differing phylogenetic utility are preferable for resolving relationships among the oldest and most diverse fly groups.