A highly contiguous genome assembly of the bat hawkmoth Hyles vespertilio (Lepidoptera: Sphingidae)

Abstract

Background: Adapted to different ecological niches, moth species belonging to the Hyles genus exhibit a spectacular diversity of larval color patterns. These species diverged ∼7.5 million years ago, making this rather young genus an interesting system to study a wide range of questions including the process of speciation, ecological adaptation, and adaptive radiation.

Results: Here we present a high-quality genome assembly of the bat hawkmoth Hyles vespertilio, the first reference genome of a member of the Hyles genus. We generated 51× Pacific Biosciences long reads with an average read length of 8.9 kb. Pacific Biosciences reads longer than 4 kb were assembled into contigs, resulting in a 651.4-Mb assembly consisting of 530 contigs with an N50 value of 7.5 Mb. The circular mitochondrial contig has a length of 15,303 bp. The H. vespertilio genome is very repeat-rich and exhibits a higher repeat content (50.3%) than other Bombycoidea species such as Bombyx mori (45.7%) and Manduca sexta (27.5%). We developed a comprehensive gene annotation workflow to obtain consensus gene models from different evidence including gene projections, protein homology, transcriptome data, and ab initio predictions. The resulting gene annotation is highly complete with 94.5% of BUSCO genes being completely present, which is higher than the BUSCO completeness of the B. mori (92.2%) and M. sexta (90%) annotations.

Conclusions: Our gene annotation strategy has general applicability to other genomes, and the H. vespertilio genome provides a valuable molecular resource to study a range of questions in this genus, including phylogeny, incomplete lineage sorting, speciation, and hybridization. A genome browser displaying the genome, alignments, and annotations is available at https://genome-public.pks.mpg.de/cgi-bin/hgTracks?db=HLhylVes1.


Response to Reviewers: We have uploaded a word file that provides a point-by-point response "PointByPointResponse". The file is labeled as Supplementary Material.
A cover letter to the editor is also uploaded.

We generated 51× PacBio long reads with an average read length of 8.9 kb and an N50 read length of 16.5 kb. PacBio reads longer than 4 kb were assembled into contigs with a customized assembler called DAmar. DAmar is a hybrid of our MARVEL approach, which was used to assemble the Axolotl [...] respectively. As shown in Figure 2, our contigs are >100 times longer than the contigs of the M. sexta assembly (N50 value 52 kb) but shorter than the contigs of the B. mori assembly (N50 value 12.2 Mb), which was based on a
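The N50 values compared above can be illustrated with a minimal Python sketch. It assumes the usual definition of N50 (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length). The function name `n50` and the toy lengths are illustrative only and are not taken from the assembly pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or read) lengths.

    Assumed definition: sort lengths in descending order and report the
    length at which the running sum first reaches half of the total.
    """
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Fold difference between the contig N50 values quoted in the text
# (7.5 Mb for H. vespertilio, 52 kb for M. sexta).
fold_vs_m_sexta = 7_500_000 / 52_000
```

With the values quoted in the text, `fold_vs_m_sexta` is about 144, consistent with the statement that the contigs are more than 100 times longer.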

To assess the extent to which the H. vespertilio assembly consists of repetitive sequences, we modelled and masked repeats using RepeatModeler and RepeatMasker. To compare repeat content with B. mori and M. sexta, we applied the same procedure to these genomes as well. We found that the H. [...]

To annotate genes in the H. vespertilio assembly, we used EVidenceModeler [14] to produce consensus gene models from multiple sources of predictions. A flowchart visualizing the gene annotation strategy is shown in Figure 4.

[...] genome browser visualization of the alignments to B. mori and M. sexta, the gene annotation, and the underlying gene evidence is shown in Figure 5 for an exemplary genomic locus.
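As an illustration of how a repeat content percentage can be derived from a masked assembly, the following Python sketch counts the fraction of lowercase bases in FASTA-formatted lines. It assumes that the assembly was soft-masked (repeats written in lowercase, as RepeatMasker does with its `-xsmall` option); whether soft-masking was used for the reported percentages is not stated here, and the function name `masked_fraction` is hypothetical.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of soft-masked (lowercase) bases.

    Assumption: repeats are soft-masked, i.e. written in lowercase.
    Header lines (starting with '>') and empty lines are skipped.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence found")
    return masked / total


# Toy example: 6 of 12 bases are lowercase.
toy = [">contig1", "ACGTacgt", "acGT"]
toy_fraction = masked_fraction(toy)
```

Applied to a file, the function can be called as `masked_fraction(open(path))`, where `path` points to a soft-masked FASTA file.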

The H. vespertilio genome provides a valuable molecular resource to study speciation and hybridization processes in this genus. In particular, together with newly generated molecular data, the genome will help to infer the phylogeny of the 32 recognized species, which is not yet resolved due to a high degree of hybridization between species and incomplete lineage sorting [24].

The patch phase detects and corrects read artefacts including missed adapters, polymerase strand jumps, chimeric reads and long low-quality read segments that are the primary impediments to long contiguous assemblies. To this end, we first computed local alignments of all raw reads. Since local alignment computation is by far the most time- and storage-consuming part of the pipeline, we reduced runtime and storage by masking repeats in the reads as follows. First, low-complexity intervals, such as microsatellites or homopolymers, were masked with DBdust (https://github.com/thegenemyers/DAZZ_DB/). Second, tandem repeats were masked by using datander and TANmask [...]

[...] chimeric reads were passed on to the assembly phase because chimeric breaks within large repeat regions were missed. Therefore, we improved the detection of chimeric reads by re-analyzing repetitive regions up to a length of 8 kb for chimeras. Any subread that includes a repetitive region that could not be spanned by at least three proper alignment chains was excluded. This additional step led to a final overlap graph that was much cleaner, which made manual validation easier (Supplementary Figure 2).
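The exclusion rule for repetitive regions can be sketched in Python as follows. This is not the DAmar implementation. It is a simplified illustration under stated assumptions: repeats and alignment chains are given as half-open `(start, end)` intervals on the subread, a chain "spans" a repeat if it covers the repeat entirely, and repeats longer than 8 kb are simply not re-analyzed. All function and parameter names are hypothetical.

```python
def count_spanning_chains(repeat, chains):
    """Count alignment chains that fully cover the repeat interval."""
    rep_start, rep_end = repeat
    return sum(
        1 for chain_start, chain_end in chains
        if chain_start <= rep_start and chain_end >= rep_end
    )


def keep_subread(repeats, chains, min_chains=3, max_repeat_len=8000):
    """Return False if any repeat (up to max_repeat_len) lacks support.

    Assumption: repeats longer than max_repeat_len are skipped here,
    because the text only describes re-analysis of regions up to 8 kb.
    """
    for rep_start, rep_end in repeats:
        if rep_end - rep_start > max_repeat_len:
            continue
        if count_spanning_chains((rep_start, rep_end), chains) < min_chains:
            return False
    return True


# Toy subread with one 2 kb repeat and three chains that span it.
supported = keep_subread(
    repeats=[(1000, 3000)],
    chains=[(0, 5000), (500, 4000), (900, 3100)],
)
# Same repeat, but only two chains span it, so the subread is excluded.
unsupported = keep_subread(
    repeats=[(1000, 3000)],
    chains=[(0, 5000), (500, 4000), (1500, 6000)],
)
```

A real implementation would additionally need a definition of a "proper" alignment chain, which is not specified in this passage.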

In the assembly phase, we first calculated all overlaps between patched reads using the same alignment strategy as in the patch phase. The subsequent steps of (i) computing a quality track for all reads, (ii) computing a detailed repeat mask, (iii) filtering overlap piles, (iv) computing the overlap graph, and (v) touring the overlap graph to obtain primary contigs follow the steps of the original MARVEL assembly pipeline [11,12] [...]

[...] with the tool LAfilterMito. The filtered reads were then processed according to the general assembly pipeline (read patching, assembly, error polishing). After read patching, the reads were split into shorter reads with a 1500 bp overlap to ensure that the assembly creates a circular contig that consists of more than a
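Splitting reads into shorter, overlapping pieces can be sketched in Python as shown below. Only the 1500 bp overlap is taken from the text. The piece length is not given in this passage, so `piece_len` is a hypothetical parameter, and the function is an illustration rather than the procedure actually used.

```python
def split_with_overlap(seq, piece_len, overlap=1500):
    """Split seq into pieces of piece_len that overlap by `overlap` bases.

    The last piece may be shorter than piece_len. piece_len is an
    assumed parameter; only the 1500 bp overlap comes from the text.
    """
    if piece_len <= overlap:
        raise ValueError("piece_len must be larger than overlap")
    step = piece_len - overlap
    pieces = []
    start = 0
    while True:
        pieces.append(seq[start:start + piece_len])
        if start + piece_len >= len(seq):
            break
        start += step
    return pieces


# Toy read of 10 kb split into 4 kb pieces.
read = "ACGT" * 2500
pieces = split_with_overlap(read, piece_len=4000)
```

Each piece shares its last 1500 bases with the first 1500 bases of the next piece, which gives the assembler overlaps between the shorter reads.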

Consensus gene models were produced from gene projections, protein homology, transcriptome data, and ab initio predictions. The evidence sources were ranked and weighted following the guidelines of the EVidenceModeler manual (https://evidencemodeler.github.io/). The ab initio predictors were given the lowest rank, followed by the spliced alignments. As transcript assembly was done using data from another species, this was ranked second after the gene projections.

[...] Supplementary Table 3. Alignments were passed to EVidenceModeler as "Protein" alignments and assigned a weight of 4.

As a third source of evidence, we used short-read RNA sequencing data that was generated from larval tissue of the closely related H. euphorbiae, since RNA sequencing data of H. vespertilio was not available. Reads were mapped to the H. vespertilio assembly using hisat2 (v 2.0.0) with parameters "--dta --no-unal --mp 4,1 --score-min L,0,-0.125", which resulted in mapping 65.82% of the reads.
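For readers who wish to reproduce the mapping step, the following Python sketch assembles a hisat2 command line with the parameters quoted above. The alignment parameters are taken from the text. The assumption of paired-end input (`-1`/`-2`) and all file names are hypothetical placeholders. The function only builds the argument list and does not run hisat2.

```python
import shlex


def hisat2_command(index_prefix, reads_1, reads_2, out_sam):
    """Build a hisat2 argument list with the parameters from the text.

    Assumptions: paired-end reads and an existing hisat2 index with
    prefix index_prefix; all paths are placeholders.
    """
    return [
        "hisat2",
        "--dta",                      # report alignments for transcript assemblers
        "--no-unal",                  # suppress records for unaligned reads
        "--mp", "4,1",                # maximum and minimum mismatch penalties
        "--score-min", "L,0,-0.125",  # minimum alignment score function
        "-x", index_prefix,
        "-1", reads_1,
        "-2", reads_2,
        "-S", out_sam,
    ]


cmd = hisat2_command("HLhylVes1_index", "larvae_R1.fq.gz",
                     "larvae_R2.fq.gz", "larvae.sam")
cmd_string = shlex.join(cmd)  # printable shell command
```

The list can be passed to `subprocess.run(cmd, check=True)` on a system where hisat2 is installed and the placeholder files exist.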