Illumina short-read sequencing data, de novo assembly and annotations of the Drosophila nasuta nasuta genome

Drosophila nasuta nasuta (D. n. nasuta) is a member of the nasuta subgroup of the immigrans species group of Drosophila and is widely distributed across South-East Asia and central to southern Africa. It is morphologically similar to other members of the nasuta subgroup, from which it has diverged recently. The genomic DNA of the D. n. nasuta Coorg strain was paired-end sequenced using Illumina HiSeq 2500 technology to obtain a draft genome assembly of 145.64 Mb. The assembly retrieved 93.6% of the conserved dipteran BUSCO orthologs. Approximately 85% of the ab initio predicted proteins exhibit sequence similarity to the proteins of D. albomicans, the closest annotated species. This draft genome sequence is a valuable resource for Drosophila geneticists and evolutionary biologists seeking to understand the molecular organisation of the genome and its evolution during the early stages of speciation.


Specifications
Subject: Insect science
Specific subject area: Bioinformatics (Genomics)
Type of data: Tables, Chart, Graphs
How data were acquired: The data were acquired by next-generation sequencing using Illumina HiSeq 2500, and the draft assembly was generated by Platanus v1.

Value of the Data
• These data are a valuable resource for Drosophila geneticists and evolutionary biologists studying genetics and species divergence.
• The draft genome of D. n. nasuta will provide useful insights into the mechanisms of genome evolution that contribute to speciation.
• The D. n. nasuta genome sequence, when compared with that of its recently diverged sibling species, can aid in understanding the mechanisms associated with whole chromosome fusions and their maintenance.
• This genome will form a tractable genomic system for large-scale evolutionary experimentation on raciation and speciation, as D. n. nasuta constitutes an artificial hybrid zone in the environs of the laboratory with its sibling species D. n. albomicans and their fertile laboratory hybrid lineages called Cytoraces.

Data Description
Initial steps towards understanding the genetics of speciation include the genomic characterization of the study target. To this end, we sequenced the genome of D. n. nasuta by Illumina paired-end sequencing technology. D. n. nasuta is a member of the nasuta subgroup of the immigrans species group belonging to the genus Drosophila. The nasuta subgroup harbors several closely related species/sub-species pairs that exhibit striking morphological similarities with varying degrees of reproductive isolation [1]. D. n. nasuta has a recent divergence history (0.3-0.6 million years) with its sibling species D. n. albomicans, with which it can produce fertile hybrids under laboratory conditions [2]. The major difference between the two species is the fusion of the 3rd autosomes and sex chromosomes in D. n. albomicans [1, 3]. Hence, this draft genome can help in understanding the mechanisms of whole chromosome fusion and their maintenance during the early stages of divergence. Here, we present a high-quality Illumina draft genome assembly for D. n. nasuta (Coorg strain, Mysore) with better N50 values and contiguity than the existing assembly [4], which was generated from samples collected at a different geographical location.
Paired-end Illumina sequencing of D. n. nasuta gDNA generated 7.5 Gb of sequence data. Filtering of Illumina adapters and low-quality sequences removed 0.037% of the 137.4 million paired-end raw reads. Read statistics are provided in Table 1. A total of 137.3 million high-quality filtered 2 × 101 bp reads were initially assembled into 73,720 ungapped contigs. Scaffolding, gap filling and NCBI check of the contig assembly produced a final draft assembly of 145.64 Mbp consisting of 20,246 scaffolds. Assembly statistics are presented in Fig. 1 and Table 2. About 79.15% of the high-quality filtered reads mapped back to the assembled draft genome. Collinearity between the draft assembly of D. n. nasuta and the closely related D. albomicans genome and the distantly related D. melanogaster genome is illustrated in Fig. 2.
For further analysis, only the 11,950 scaffolds with a minimum length of 1000 bp were considered. The Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis on these scaffolds predicted the presence of 3066 complete single-copy conserved eukaryotic genes out of the 3285 known genes in the diptera_odb10 dataset. Overall, 93.6% of the predicted genes were complete (Fig. 3).
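The N50 statistic and the 1000 bp scaffold length cut-off mentioned above can be illustrated with a minimal Python sketch. The scaffold lengths below are toy values, not the lengths of the D. n. nasuta assembly, and the function names are hypothetical; the assembly statistics reported here were not necessarily computed this way.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def filter_min_length(lengths, min_len=1000):
    """Keep only sequences of at least min_len bp."""
    return [n for n in lengths if n >= min_len]


# Toy scaffold lengths in bp (illustrative only).
toy_lengths = [5000, 3000, 2000, 900, 400]
kept = filter_min_length(toy_lengths)

print("N50 of all scaffolds:", n50(toy_lengths))
print("Scaffolds >= 1000 bp:", len(kept))
```

In this toy case the two longest scaffolds already cover more than half of the 11,300 bp total, so the N50 is the length of the second scaffold.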
About 18,881,583 bp (13.27%) of the assembled sequences were softmasked as repetitive sequences by RepeatMasker. A total of 15,283 gene models were predicted by AUGUSTUS, of which 4483 had > 90% exon evidence. The genes predicted by the other ab initio gene predictors are summarised in Table 3. Supplying all 83,966 gene models to EvidenceModeler (EVM) yielded a total of 15,432 gene models. After filtering out poor gene models that contained gaps or transposable elements or that encoded short proteins (< 100 amino acids), 13,766 protein-coding genes were retained. The tRNAscan-SE tool identified 222 tRNA coding genes in the assembly.
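The three filtering criteria described above (gaps, transposable elements and protein length below 100 amino acids) can be expressed as a simple predicate. The following Python sketch is illustrative only: the GeneModel record, its field names and the toy entries are assumptions made for this example and do not reflect the data structures used by EVM or Funannotate.

```python
from dataclasses import dataclass

MIN_PROTEIN_LENGTH = 100  # amino acids, as stated in the text


@dataclass
class GeneModel:
    """Hypothetical record for one predicted gene model."""
    gene_id: str
    protein: str      # translated protein sequence
    spans_gap: bool   # True if the model overlaps an assembly gap
    is_te: bool       # True if the model matches a transposable element


def is_retained(model, min_len=MIN_PROTEIN_LENGTH):
    """Apply the three filters: no gap, no TE, protein >= min_len residues."""
    length = len(model.protein.rstrip("*"))  # ignore a trailing stop symbol
    return not model.spans_gap and not model.is_te and length >= min_len


# Toy gene models (illustrative only).
models = [
    GeneModel("g1", "M" * 150, False, False),        # passes all filters
    GeneModel("g2", "M" * 99, False, False),         # too short
    GeneModel("g3", "M" * 300, True, False),         # overlaps a gap
    GeneModel("g4", "M" * 300, False, True),         # transposable element
    GeneModel("g5", "M" * 100 + "*", False, False),  # exactly 100 residues
]
retained = [m.gene_id for m in models if is_retained(m)]
print(retained)
```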
Of the 13,766 predicted protein sequences, 11,673 were annotated against D. albomicans proteins. Of these annotated proteins, 9353 (80%) were represented by nearly full-length transcripts with a protein alignment coverage of > 80%. The distribution of sequence similarity of the annotated proteins at different query coverage intervals (query coverage being the percentage of the annotated protein length included in the BLASTp alignment) is shown in Fig. 4. The KOG class distribution is shown in Fig. 5. Fig. 6 shows the Gene Ontology (GO) distribution of the protein-coding genes.
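Query coverage as defined above can be computed from tabular BLASTp output. The sketch below assumes, purely for illustration, that BLASTp was run with the custom tabular format "6 qseqid sseqid pident length qstart qend qlen"; the column order, the example hits and the function names are assumptions and are not taken from this article. It also considers a single alignment per hit, which is a simplification when a hit consists of several alignment segments.

```python
def query_coverage(qstart, qend, qlen):
    """Percentage of the query protein length included in the alignment."""
    return 100.0 * (abs(qend - qstart) + 1) / qlen


def parse_hits(lines):
    """Yield (query id, percent identity, query coverage) per hit line.

    Assumed columns: qseqid sseqid pident length qstart qend qlen
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        qseqid, _sseqid, pident, _length, qstart, qend, qlen = (
            line.rstrip("\n").split("\t")
        )
        yield qseqid, float(pident), query_coverage(
            int(qstart), int(qend), int(qlen)
        )


# Toy hits (illustrative only).
example = [
    "prot1\tDalb_1\t95.0\t180\t11\t190\t200",
    "prot2\tDalb_2\t70.5\t50\t1\t50\t200",
]
hits = list(parse_hits(example))

# Proteins whose alignment covers more than 80% of the query length.
nearly_full_length = [qid for qid, _, cov in hits if cov > 80]
print(hits)
print(nearly_full_length)
```

Here prot1 aligns over residues 11 to 190 of a 200-residue query (90% coverage) and would count as nearly full length, whereas prot2 (25% coverage) would not.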

Fly stock and DNA extraction
Drosophila nasuta nasuta (Coorg strain, India; Stock number: 201.001, Drosophila Stock Centre, University of Mysore) was maintained on wheat cream agar media at 22 ± 1 °C, 60% humidity and a 12 h light/dark cycle at the Department of Studies in Genetics and Genomics fly facility. Male flies were isolated on eclosion and aged for 5 days. Genomic DNA (gDNA) was extracted from 40 whole males using the MP Biomedicals FastDNA™ SPIN Kit following the manufacturer's protocol. The quality and quantity of the gDNA were checked using a Thermo Scientific™ NanoDrop 2000 spectrophotometer and a Qubit™ Flex Fluorometer (dsDNA BR Assay).

Structural and functional annotations
To maximize gene predictions, the repeat elements in the assembly were masked by RepeatMasker (v4.1.0; http://www.repeatmasker.org) using a custom repeat library constructed with RepeatModeler (v2.0.1) [14]. For structural genome annotation, a training dataset was generated by the funannotate train function in Funannotate (v1.8.1) [15] using RNA-seq reads from gonadal tissues of the D. n. nasuta Coorg strain (SRA accession numbers: SRR10875323 and SRR8398946). Briefly, the RNA-seq reads were assembled by the genome-guided module in Trinity, and the predicted transcripts were then aligned to the softmasked genome to construct PASA gene models. Gene models were then generated by the funannotate predict function in Funannotate: using the PASA gene models as a training dataset, gene models were constructed by the ab initio gene predictors AUGUSTUS, GlimmerHMM and SNAP, which are integrated in Funannotate. Additionally, gene models were predicted by GeneMark-EP+ [16]. EvidenceModeler then combined all ab initio gene model predictions with protein evidence from the UniProtKB/Swiss-Prot database and D. albomicans proteins to generate a final set of protein-coding genes. Finally, the tRNA coding genes were validated with tRNAscan-SE [17].
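In a softmasked genome, repeats are conventionally written as lowercase bases while the rest of the sequence stays uppercase. As a minimal sketch of how the masked fraction of such a FASTA file could be checked, the Python example below counts lowercase bases; the function name and the toy sequences are hypothetical, and this is not the procedure used to obtain the repeat content reported in this article.

```python
def softmasked_fraction(fasta_lines):
    """Return the fraction of bases written in lowercase (softmasked)."""
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Toy FASTA records (illustrative only).
toy_fasta = [
    ">scaffold_1",
    "ACGTacgtAC",
    ">scaffold_2",
    "ggggTTTTTT",
]
fraction = softmasked_fraction(toy_fasta)
print(f"{fraction:.2%} of bases are softmasked")
```

For a real assembly, the same function could be applied to an open file handle, since it only iterates over lines.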

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.