Genomic and transcriptomic resources for assassin flies including the complete genome sequence of Proctacanthus coquilletti (Insecta: Diptera: Asilidae) and 16 representative transcriptomes

Rebecca B. Dikow; Paul B. Frandsen; Mauren Turcatel; Torsten Dikow

doi:10.7717/peerj.2951

Rebecca B. Dikow¹, Paul B. Frandsen¹, Mauren Turcatel², Torsten Dikow²

¹Office of Research Information Services, Office of the Chief Information Officer, Smithsonian Institution, Washington, D.C., United States of America

²Department of Entomology, National Museum of Natural History, Smithsonian Institution, Washington, D.C., United States of America

Published: 2017-01-31
Accepted: 2016-12-31
Received: 2016-10-25

Academic Editor: Thiago Venancio

Subject Areas: Entomology, Genomics
Keywords: Transcriptomics, Asilidae, Draft genome, Genomics, Phylogenomics

Copyright: © 2017 Dikow et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Dikow RB, Frandsen PB, Turcatel M, Dikow T. 2017. Genomic and transcriptomic resources for assassin flies including the complete genome sequence of Proctacanthus coquilletti (Insecta: Diptera: Asilidae) and 16 representative transcriptomes. PeerJ 5:e2951 https://doi.org/10.7717/peerj.2951

Abstract

A high-quality draft genome for Proctacanthus coquilletti (Insecta: Diptera: Asilidae) is presented along with transcriptomes for 16 Diptera species from five families: Asilidae, Apioceridae, Bombyliidae, Mydidae, and Tabanidae. Genome sequencing reveals that P. coquilletti has a genome size of approximately 210 Mbp and remarkably low heterozygosity (0.47%) and few repeats (15%). These characteristics helped produce a highly contiguous (N50 = 862 kbp) assembly, particularly given that only a single 2 × 250 bp PCR-free Illumina library was sequenced. A phylogenomic hypothesis is presented based on thousands of putative orthologs across the 16 transcriptomes. Phylogenetic relationships support the sister group relationship of Apioceridae + Mydidae to Asilidae. A time-calibrated phylogeny is also presented, with seven fossil calibration points, which suggests an older age of the split among Apioceridae, Asilidae, and Mydidae (158 mya) and Apioceridae and Mydidae (135 mya) than proposed in the AToL FlyTree project. Future studies will be able to take advantage of the resources presented here in order to produce large scale phylogenomic and evolutionary studies of assassin fly phylogeny, life histories, or venom. The bioinformatics tools and workflow presented here will be useful to others wishing to generate de novo genomic resources in species-rich taxa without a closely-related reference genome.

Introduction

The evolution of genomes within Diptera (midges, mosquitoes, and flies) is better understood than for any other insect order, with some 100 whole genomes sequenced and publicly available. However, the available Diptera genomes are not evenly distributed across this 250-million-year-old radiation and are skewed towards medically important malaria-transmitting mosquitoes (24 genomes) and species of Drosophila used as model organisms in genetic research (29 genomes) (Fig. 1). Here, we provide the first high-quality draft genome and several transcriptomes for orthorrhaphous flies, specifically Asiloidea, in the center of the Diptera Tree of Life.

Figure 1: Phylogeny of Diptera (summary tree of hypothesis with higher taxa by Wiegmann et al., 2011) with number of completed genomes and position of Asiloidea.
* = includes low-coverage genomes published recently in Vicoso & Bachtrog (2015). Figshare doi: 10.6084/m9.figshare.4056057.

DOI: 10.7717/peerj.2951/fig-1

Assassin flies (or robber flies, Diptera: Asilidae) are a diverse group of orthorrhaphous flies with more than 7,500 species known to date (Pape, Blagoderov & Mostovski, 2011). Their common name originates from their predatory behavior in the adult life stage: they catch other insects or spiders in flight and inject venomous saliva that kills the prey and liquefies its internal organs, which are then sucked out (Dikow, 2009b; Fisher, 2009). Assassin flies have several unique adaptations in proboscis and sucking-pump morphology that enable them to inject venom into their prey and suck out the tissue (Dikow, 2009b). These adaptations, together with changes in life history from a nectar-feeding ancestor (a habit still found in the sister group to Asilidae, composed of Apioceridae and Mydidae; Dikow, 2009b; Dikow, 2009a; Trautwein, Wiegmann & Yeates, 2010; Wiegmann et al., 2011), have accelerated their diversification over the past 112 million years, as Apioceridae and Mydidae combined have only 619 described species. The oldest definitive fossils for both Asilidae and Mydidae are Cretaceous in age and come from the Santana Formation in Brazil (Grimaldi, 1990; Willkommen & Grimaldi, 2007), and Wiegmann et al. (2011) estimate the age of the clade (Asilidae + (Apioceridae + Mydidae)) to be 135 million years.

Genomes available for Diptera

To date, out of the 160,000 species of Diptera (Pape, Blagoderov & Mostovski, 2011), complete genomes have been sequenced for 100 species (NCBI as of 04 October 2016). These represent 47% of the insect genomes available at NCBI and are concentrated within the earliest radiation of Diptera including the medically important mosquitoes (Culicidae) and sand flies (Psychodidae) and the higher flies including the model organism Drosophila and medically important Glossina tsetse flies (Fig. 1).

A recent study by Vicoso & Bachtrog (2015) added some 37 low-coverage genomes for a study on the sex chromosomes of Diptera. While the genome sequencing in this publication was not intended to add draft genomes, the genomes are spread across the Diptera Tree of Life (Fig. 1) and Vicoso & Bachtrog (2015) added two low-coverage (approximately 12×) genomes for Orthorrhapha in the center of fly evolution, i.e., the soldier fly Hermetia illucens (Stratiomyomorpha, estimated genome size = 1.3 Gbp, N50 = 2,778 bp) and the assassin fly Holcocephala fusca (Asiloidea, estimated genome size = 673 Mbp, N50 = 4,591 bp).

Aedes, Anopheles, and Culex mosquitoes and Drosophila vinegar flies shared a common ancestor some 240 Million years ago (mya) (Wiegmann et al., 2011). The most recent common ancestor of mosquitoes and Orthorrhapha likewise lived 240 mya and that of Drosophila and Orthorrhapha some 200 mya. Filling a gap in the center of the Diptera Tree of Life (Fig. 1) by providing data on novel, high-quality draft genomes from within Orthorrhapha and Asiloidea will open the opportunity to more meaningfully compare genomes across Diptera. Furthermore, the genomic resources provided here will advance the study of evolutionary history, life history, and the search for the venom of assassin flies.

Methods

Specimen source

Adult flies were hand-netted either directly from their resting/perching sites (Apioceridae, Asilidae, Bombyliidae, and Mydidae) or from within a Malaise Trap (Tabanidae) and kept alive in individual vials. They were identified to species, assigned unique identifiers, and either preserved in RNAlater (specimen cut open and placed directly in RNAlater) or liquid N₂ (specimen alive in individual vial dropped in dry shipper containing liquid N₂). RNAlater vials were emptied of any liquid before being placed in liquid N₂-filled tanks in the NMNH Biorepository where all specimens are stored and accessible by their unique specimen identifier (Table 1).

Table 1: List of species included in study along with unique specimen identifier of sequenced specimen and preservation method.

Family: subfamily	Species	Specimen identifier	Preservation
Apioceridae	Apiocera parkeri Cazier, 1941	USNMENT01136047	liquid N₂
Asilidae: Asilinae	Machimus occidentalis (Hine, 1909)	USNMENT00951022	RNAlater
Asilidae: Asilinae	Philonicus albiceps (Meigen, 1820)	USNMENT01027314	RNAlater
Asilidae: Asilinae	Proctacanthus coquilletti Hine, 1911	USNMENT01136140	liquid N₂
Asilidae: Asilinae	Proctacanthus coquilletti^a	USNMENT01136139	liquid N₂
Asilidae: Asilinae	Tolmerus atricapillus (Fallén, 1814)	USNMENT01027313	RNAlater
Asilidae: Brachyrhopalinae	Nicocles dives (Loew, 1866)	USNMENT00951000	RNAlater
Asilidae: Dasypogoninae	Diogmites neoternatus (Bromley, 1931)	USNMENT00802587	liquid N₂
Asilidae: Laphriinae	Laphystia limatula Coquillett, 1904	USNMENT01136024	liquid N₂
Asilidae: Stenopogoninae	Scleropogon duncani Bromley, 1937	USNMENT01136006	liquid N₂
Asilidae: Stichopogoninae	Lasiopogon cinctus (Fabricius, 1781)	USNMENT00802771	RNAlater
Bombyliidae: Ecliminae	Thevenetimyia californica Bigot, 1875	USNMENT00951006	RNAlater
Mydidae: Ectyphinae	Ectyphyus pinguis Gerstaecker, 1868	USNMENT01136013	liquid N₂
Mydidae: Mydinae	Messiasia californica (Cole, 1969)	USNMENT01136023	liquid N₂
Mydidae: Mydinae	Mydas clavatus (Drury, 1773)	USNMENT00802763	liquid N₂
Tabanidae: Pangoniinae	Fidena pseudoaurimaculata Lutz, 1909	USNMENT01137217	liquid N₂
Tabanidae: Tabaninae	Tabanus discus Wiedemann, 1828	USNMENT01137218	liquid N₂

DOI: 10.7717/peerj.2951/table-1

Notes:

a denotes the specimen for which the genome was sequenced.

RNA-Seq

Total RNA was extracted from specimens preserved in RNAlater or in liquid N₂ (see Table 1). A single specimen was used for each extraction. Muscular tissue was extracted from the thorax and cryogenically ground using CryoMill (Retsch, Haan, Germany). Total RNA was isolated using the TRI Reagent Protocol (Sigma-Aldrich, St. Louis, MO, USA) with overnight precipitation, and then quantified using Epoch Microplate Spectrophotometer and Gen5 software (both BioTek, Winooski, VT, USA). For the specimens sequenced using Ion Torrent, the isolation of mRNA was carried out using DynaBeads mRNA DIRECT Kit, and Ion Total RNA-Seq Kit (v2) for Whole Transcriptome Libraries (Thermo Fisher Scientific) was used for library preparation. The BluePippin System (Sage Science, Beverly, MA, USA) was used for selecting fragments of 170–350 bp. For the specimens sequenced using Illumina MiSeq and HiSeq2000, the isolation of mRNA and construction of stranded mRNA-Seq libraries were carried out using KAPA Stranded mRNA-Seq Kit (Kapa Biosystems, Boston, MA, USA) and NEBNext Multiplex Oligos (New England BioLabs, Ipswich, MA, USA). Library fragment size distribution was assessed using High Sensitivity D1000 ScreenTape System (Agilent, Waldbronn, Germany), and the BluePippin System was used for selecting fragments of 180–440 bp. After size selection, a sample of each library was quantified using the KAPA Library Quantification Kit for Illumina platforms, and pooled to 5 nM total concentration for sequencing.

Figure 2: Bioinformatics workflow for transcriptome and genome analysis.
Figshare doi: 10.6084/m9.figshare.4056069.

DOI: 10.7717/peerj.2951/fig-2

Figure 3: Venn diagrams showing the numbers of GO (Gene Ontology) terms among selected sets of taxa.
Visualized at: http://bioinformatics.psb.ugent.be/webtools/Venn/, figshare doi: 10.6084/m9.figshare.4056054.

DOI: 10.7717/peerj.2951/fig-3

The RNA-Seq bioinformatics workflow is shown in Fig. 2. Raw data as well as assembled transcripts were screened for contamination with KRAKEN (Wood & Salzberg, 2014). RNA-Seq reads were trimmed with Trimmomatic (Bolger, Lohse & Usadel, 2014) and transcripts were assembled in Trinity (Grabherr et al., 2011). Transcriptome “completeness” was estimated using BUSCO (v2.0beta, Simão et al. (2015)) with the “Endopterygota” lineage-specific set of 2,442 loci and the -m tran setting. BUSCO assesses completeness with near-universal single-copy orthologs selected from OrthoDB (Kriventseva et al., 2015). The 16 sets of assembled transcripts were translated with Transdecoder (Haas & Papanicolao, 2015) under default parameters. Peptides were filtered for redundancy using CD-Hit (v4.6.1, Fu et al. (2012)), specifying a 95% similarity threshold. The Trinotate workflow was used for transcript annotation (Grabherr et al., 2011). Trinotate uses evidence from BLASTx, BLASTp (Altschul et al., 1990), PFAM (Punta et al., 2012), and HMMER (Finn, Clements & Eddy, 2011) to assign GO terms (Ashburner et al., 2000) to transcripts. Venn diagrams showing overlapping sets of GO terms were generated at: http://bioinformatics.psb.ugent.be/webtools/Venn/ (see Fig. 3).

Genome sequencing

Genomic DNA was extracted from thoracic muscular tissue and legs of a single specimen of Proctacanthus coquilletti preserved in liquid N₂, using the DNEasy DNA Extraction kit (Qiagen, Hilden, Germany). The sample was quantified using Epoch Microplate Spectrophotometer and Gen5 software, and subsequently pooled to 50 ng/µL concentration. Sequencing took place at Johns Hopkins University Genetic Resources Core Facility’s High Throughput Sequencing Center. A PCR-free library was generated and two lanes of Illumina HiSeq2500 were sequenced to satisfy DISCOVAR recommendations.

The genome sequencing bioinformatics workflow is shown in Fig. 2. Genome size, heterozygosity, and repeat content were estimated from raw reads using GenomeScope (Sedlazeck, Nattestad & Schatz, 2016), which uses a kmer histogram generated by JELLYFISH (Marçais & Kingsford, 2011). Raw data as well as assembled contigs were screened for contamination with KRAKEN. Blobtools/Blobology (Kumar et al., 2013) was also used to assess contamination. Sequences were assembled using DISCOVAR de novo (Jaffe, 2015) and w2rap-contigger (Clavijo, 2016) with kmer sizes of 200 and 260. w2rap-contigger provides performance improvements on DISCOVAR de novo, which is no longer being actively developed. Some scaffolding was performed, as with 2 × 250 bp reads there is some space between overlaps, and DISCOVAR de novo and w2rap-contigger perform scaffolding internally, as shown in Fig. 4. Genome completeness was estimated using BUSCO with the “Endopterygota” lineage-specific set of loci and the -m genome setting.
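To illustrate how a kmer histogram relates to genome size, the following is a minimal sketch of the naive peak-based estimator (total kmers divided by the coverage peak). It is not GenomeScope's method, which fits a mixture model that also yields heterozygosity and repeat content. The function names, the forward-strand-only counting, and the error cutoff of 2 are assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer in a list of read strings (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def kmer_histogram(counts):
    """Map multiplicity -> number of distinct k-mers seen that many times."""
    return Counter(counts.values())


def estimate_genome_size(hist, min_multiplicity=2):
    """Naive estimate: total k-mers / multiplicity of the main peak.

    K-mers below min_multiplicity are treated as sequencing errors
    (an assumed cutoff; real data need a cutoff chosen from the histogram).
    """
    usable = {m: n for m, n in hist.items() if m >= min_multiplicity}
    if not usable:
        raise ValueError("no k-mers at or above the multiplicity cutoff")
    # The peak is the multiplicity shared by the most distinct k-mers.
    peak = max(usable, key=usable.get)
    total_kmers = sum(m * n for m, n in usable.items())
    return total_kmers / peak, peak


# Toy histogram: many error k-mers at 1x, a main peak at 20x, repeats at 40x.
toy_hist = {1: 1000, 20: 100, 40: 5}
size, peak = estimate_genome_size(toy_hist)
print(f"peak coverage {peak}x, estimated size {size:.0f} bp")
```

On real data the histogram would come from a kmer counter such as JELLYFISH rather than from the in-memory counter shown here.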

Figure 4: Genome assembly statistics visualization of Proctacanthus coquilletti de novo genome (w2rap-contigger 200 kmer assembly, see also Table 5).
Visualized at: http://lepbase.org/assembly-statistics/, figshare doi: 10.6084/m9.figshare.4056042.

DOI: 10.7717/peerj.2951/fig-4
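Assembly contiguity is reported throughout as N50. As a reminder of what that statistic measures, here is a minimal sketch under the standard definition: the length of the contig at which the running sum of lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name n50 is hypothetical and the contig lengths are toy values, not data from this study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings the running sum to >= 50%.
        if running * 2 >= total:
            return length
    return 0  # empty input


toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

Because N50 depends only on the length distribution, a few very long contigs can dominate it, which is why it is usually read alongside completeness measures such as BUSCO.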

Gene prediction was performed using MAKER (Cantarel et al., 2008), which uses RepeatMasker (Smit, Hubley & Green, 2013), Augustus (Stanke & Waack, 2003), BLAST (Altschul et al., 1990), Exonerate (Slater & Birney, 2005), and SNAP (Korf, 2004). Contigs shorter than 2 kbp were not annotated, as they are too short to produce high-quality evidence. The maximum intron size was set as 10 kbp, which is recommended based on Drosophila intron sizes. The Augustus model species used was “fly” and Drosophila melanogaster Repbase libraries (Bao, Kojima & Kohany, 2015) were used for RepeatMasker.
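The 2 kbp length cutoff applied before annotation can be sketched as a simple FASTA filter. This is an illustrative stand-in written in plain Python, not the command the authors ran; the function names and the toy records are assumptions.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA-format lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=2000):
    """Keep only records whose sequence is at least min_len bases long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


# Toy input: one 1,500 bp contig and one 2,400 bp contig split over two lines.
toy = [">c1 short contig", "A" * 1500, ">c2", "ACGT" * 300, "ACGT" * 300]
kept = filter_by_length(read_fasta(toy))
print([(name, len(seq)) for name, seq in kept])
```

For a real assembly, the lines would be read from an open file handle, for example `filter_by_length(read_fasta(open(path)))` with a path of the user's choosing.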

Blobplots

A blobplot was created using blobtools (Kumar et al., 2013, v 0.9.19.5). Prior to generating the blobplot, two steps need to be taken: (1) the raw reads need to be mapped back to the genome to generate an estimate of sequencing coverage and (2) a taxon assignment for each contig needs to be generated by querying the NCBI nucleotide database. Raw reads were mapped using Bowtie 2 (Langmead & Salzberg, 2012, v 2.2.9) and taxon assignments generated using megablast (Altschul et al., 1990; Zhang et al., 2000).

Phylogenetic trees

Orthology detection was conducted in OMA standalone (v1.0.6, http://omabrowser.org/standalone/) using peptides processed in Transdecoder and CD-Hit. Amino acid alignments on individual resulting orthologs were conducted in MAFFT (Katoh, Asimenos & Toh, 2009). Phylogenetic model selection was performed with PartitionFinderProtein (Lanfear et al., 2012). Gene trees and trees of concatenated, partitioned, data were built in RAxML (raxmlHPC-PTHREADS-SSE3, Stamatakis (2014)). Best tree searches were run 100 times each and rapid bootstrapping was run under the AutoMRE option. ASTRAL (Mirarab et al., 2014) was used to generate a species tree based on individual gene tree topologies.
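Building a concatenated, partitioned dataset from per-ortholog alignments can be sketched as follows. The sketch joins aligned amino acid sequences per taxon, pads taxa missing from an ortholog with gaps, and emits partition lines in the RAxML partition-file style. The single model name "LG", the function name, and the toy alignments are assumptions for illustration; in the study, models were selected with PartitionFinderProtein.

```python
def concatenate(alignments, taxa, model="LG"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: dict of gene name -> dict of taxon -> aligned sequence
    taxa: list of all taxon names to include
    Returns (dict of taxon -> concatenated sequence, list of partition lines).
    """
    matrix = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for gene, aln in alignments.items():
        length = len(next(iter(aln.values())))
        if any(len(seq) != length for seq in aln.values()):
            raise ValueError(f"sequences in {gene} differ in length")
        for taxon in taxa:
            # Taxa lacking this ortholog are filled with gap characters.
            matrix[taxon].append(aln.get(taxon, "-" * length))
        partitions.append(f"{model}, {gene} = {start}-{start + length - 1}")
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in matrix.items()}
    return supermatrix, partitions


toy_alignments = {
    "og1": {"A": "MK-L", "B": "MKAL"},
    "og2": {"A": "GG", "C": "GA"},
}
supermatrix, partitions = concatenate(toy_alignments, ["A", "B", "C"])
for taxon, seq in supermatrix.items():
    print(taxon, seq)
print("\n".join(partitions))
```

Keeping the partition boundaries alongside the supermatrix lets each ortholog retain its own substitution model in the concatenated analysis, while the unconcatenated alignments remain available for gene-tree methods such as ASTRAL.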

Fossil calibrations

Seven fossils ranging in age from 112–45 Million years old (myo) were used to calibrate the time-tree analysis. The maximum age of the root was set to 195 million years (my), an age proposed for the most recent common ancestor of Tabanidae and Asilidae (Wiegmann et al., 2011). MCMCtree, part of the PAML package (Yang, 2007), was used to generate a time-calibrated phylogeny based on the best-scoring RAxML tree.

Sequence, genome, and analysis data

The raw and assembled sequence data can be accessed under NCBI Umbrella BioProject PRJNA345052. Individual BioProject and BioSample accession numbers are provided in Table 2. The Proctacanthus coquilletti de novo genome assembly (w2rap-contigger 200 kmer) can be accessed under NCBI WGS MNCL00000000 and the Genomescope results at: http://qb.cshl.edu/genomescope/analysis.php?code=TRpKdSHytjlB1vBGsPne. Digital copies of visualizations, alignments, and phylogenetic trees can be accessed under a Figshare Collection (doi: 10.6084/m9.figshare.c.3521787, Table S1).

Table 2: List of species included in study along with NCBI BioProject, BioSample, and Sequence Read Archive (SRA) numbers for access to raw sequencing reads.
Also accessible under NCBI Umbrella BioProject PRJNA345052.

Species	NCBI BioProject	NCBI BioSample	NCBI SRA
Apiocera parkeri	PRJNA343825	SAMN05803830	SRR4346321
Machimus occidentalis	PRJNA343807	SAMN05802935	SRR4345231
Philonicus albiceps	PRJNA343818	SAMN05803661	SRR4365562
Proctacanthus coquilletti genome	PRJNA343047	SAMN05772833	SRR4372731
Proctacanthus coquilletti transcriptome	PRJNA343047	SAMN05799370	SRR4346725
Tolmerus atricapillus	PRJNA343802	SAMN05800340	SRR4346294
Nicocles dives	PRJNA343892	SAMN05804943	SRR4345232
Diogmites neoternatus	PRJNA343891	SAMN05804928	SRR4345333
Laphystia limatula	PRJNA343827	SAMN05803875	SRR4346311
Scleropogon duncani	PRJNA343798	SAMN05800191	SRR4346727
Lasiopogon cinctus	PRJNA343889	SAMN05804927	SRR4345233
Thevenetimyia californica	PRJNA343898	SAMN05804952	SRR4345230
Ectyphyus pinguis	PRJNA343820	SAMN05803663	SRR4346320
Messiasia californica	PRJNA343822	SAMN05803780	SRR4346726
Mydas clavatus	PRJNA343821	SAMN05803778	SRR4345448
Fidena pseudoaurimaculata	PRJNA343896	SAMN05804949	SRR4346296
Tabanus discus	PRJNA343894	SAMN05804948	SRR4346303

DOI: 10.7717/peerj.2951/table-2

Table 3: Summary of RNA-Seq results.
BUSCO results give the number of complete loci recovered out of the 2,442 loci in the set.

Species	Sequencing platform	Total reads	Total RNA	Total transcripts	Transcripts with GO	BUSCO complete
Apiocera parkeri	HiSeq2000	143,373,948	1.80	298,313	14,571	1,588
Machimus occidentalis	Ion Torrent	2,754,607	29.31	9,330	9,600	52
Philonicus albiceps	HiSeq2000	107,425,636	0.32	46,977	15,775	2,212
Tolmerus atricapillus	HiSeq2000	108,444,670	0.78	43,915	14,417	2,190
Proctacanthus coquilletti	MiSeq	21,978,654	12.46	56,925	15,936	1,933
Nicocles dives	Ion Torrent	3,540,783	7.95	10,585	10,452	70
Diogmites neoternatus	HiSeq2000	120,499,106	39.14	43,199	14,527	1,951
Laphystia limatula	HiSeq2000	60,777,554	3.70	30,019	13,127	607
Scleropogon duncani	HiSeq2000	111,276,014	2.80	50,672	14,413	1,693
Lasiopogon cinctus	Ion Torrent	2,805,003		1,677	4,252	13
Thevenetimyia californica	Ion Torrent	3,050,487	34.60	4,318	743	38
Ectyphyus pinguis	MiSeq	29,249,574	9.17	60,424	14,870	1,661
Messiasia californica	HiSeq2000	109,427,750	3.20	42,895	14,438	1,329
Mydas clavatus	HiSeq2000	90,390,602	1.10	54,643	16,778	1,536
Fidena pseudoaurimaculata	MiSeq	32,434,530	16.13	132,246	15,052	1,343
Tabanus discus	MiSeq	42,458,502	7.72	96,506	14,816	1,915

DOI: 10.7717/peerj.2951/table-3
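The BUSCO counts in Table 3 are easier to compare as percentages of the 2,442-locus set. The sketch below converts a few of the tabulated counts; the counts and platforms are copied from Table 3, while the function and variable names are hypothetical.

```python
BUSCO_TOTAL = 2442  # loci in the "Endopterygota" set used in this study

# (sequencing platform, complete BUSCO loci) for a subset of Table 3
busco_complete = {
    "Philonicus albiceps": ("HiSeq2000", 2212),
    "Proctacanthus coquilletti": ("MiSeq", 1933),
    "Laphystia limatula": ("HiSeq2000", 607),
    "Machimus occidentalis": ("Ion Torrent", 52),
    "Lasiopogon cinctus": ("Ion Torrent", 13),
}


def completeness(complete, total=BUSCO_TOTAL):
    """Percentage of the BUSCO set recovered as complete loci."""
    return 100.0 * complete / total


# Print species from most to least complete.
ranked = sorted(busco_complete.items(), key=lambda item: item[1][1], reverse=True)
for species, (platform, complete) in ranked:
    print(f"{species:28s} {platform:12s} {completeness(complete):5.1f}%")
```

Expressed this way, the gap between the Illumina and Ion Torrent transcriptomes in Table 3 is immediately visible.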