First Annotated Genome of a Mandibulate Moth, Neomicropteryx cornuta, Generated Using PacBio HiFi Sequencing

Abstract We provide a new, annotated genome assembly of Neomicropteryx cornuta, a species of the so-called mandibulate archaic moths (Lepidoptera: Micropterigidae). These moths belong to a lineage that is thought to have split from all other Lepidoptera more than 300 Ma and are consequently vital to understanding the early evolution of superorder Amphiesmenoptera, which contains the order Lepidoptera (butterflies and moths) and its sister order Trichoptera (caddisflies). Using PacBio HiFi sequencing reads, we assembled a highly contiguous genome with a contig N50 of nearly 17 Mb. The assembled genome length of 541,115,538 bp is about half the length of the largest published Amphiesmenoptera genome (Limnephilus lunatus, Trichoptera) and double the length of the smallest (Papilio polytes, Lepidoptera). We find high recovery of universal single copy orthologs with 98.1% of BUSCO genes present and provide a genome annotation of 15,643 genes aided by resolved isoforms from PacBio IsoSeq data. This high-quality genome assembly provides an important resource for studying ecological and evolutionary transitions in the early evolution of Amphiesmenoptera.


Introduction
Lepidoptera are one of the most diverse herbivorous insect lineages, with more than 160,000 described species (Mitter et al. 2017). They are one of the two extant orders, along with caddisflies (Trichoptera), that comprise the superorder Amphiesmenoptera. Modern Lepidoptera and Trichoptera are morphologically similar in many ways, including dense covering of hairs or scales on their wings (Kristensen 1984), but in the $310 Myr since they were believed to have diverged from each other (Kawahara et al. 2019), they have developed very different behaviors and ecological roles. Trichoptera larvae are primarily aquatic, with a diversity of feeding behaviors ranging from pure herbivory to opportunistic scavenging to predation (Mackay and Wiggins 1979). In contrast, almost all Lepidoptera larvae are terrestrial and herbivorous, and the adults provide essential pollination services for many flowering plants (Scoble 1992).
The earliest diverging lineage of Lepidoptera includes the mandibulate archaic moths, Micropterigidae (Kawahara et al. 2019). The family includes roughly 20 extant genera (Van Nieukerken et al. 2011), and is sister to all other extant Lepidoptera. Its fossil record dates back to the Lower Cretaceous (Azar et al. 2010;Kristensen and Skalski 1999;Whalley 1978), though recent phylogenetic studies estimate that the family could be as old as 300 Myr (Kawahara et al. 2019). Micropterigids are known for their unusual feeding habits and mouthpart morphology relative to other moths. The larvae feed on liverworts (Imada et al. 2011), whereas the larvae of most other extant Lepidoptera feed on angiosperms. Many micropterigid larvae, including those in the genus Neomicropteryx, have a plastron and other morphological features conducive to survival in flooded habitats (Davis and Landry 2012); this aquatic association is in sharp contrast with the primarily terrestrial habitats of most other Lepidoptera. Micropterigid adults have mandibulate (chewing) mouthparts, with some species feeding on angiospermous pollen (Kristensen 1999) or spores of ferns and lycopods (Gibbs 2014), whereas most other adult Lepidoptera are either nonfeeding or consume nectar with an elongate, flexible proboscis (siphoning-sucking mouthparts). Fossil evidence suggests that the most recent common ancestor of Lepidoptera was mandibulate, like extant Micropterigidae, and had small structures called galea (also present in micropterigids) that evolved into the proboscis found in nearly all other butterflies and moths (Kristensen 1984;Krenn 2010). Since micropterigids remained mandibulate, their genetic makeup could shed light on the early evolution of ancient Lepidoptera and Trichoptera.
Despite their unique ecology and the fact that Micropterigidae represent a possible important transition between Trichoptera and Lepidoptera, there are no existing genome assemblies of Micropterigidae. By November 2020, there were 118 Lepidoptera and six Trichoptera genome assemblies available on GenBank (Hotaling et al. 2021). With more than 250 Myr of evolution between available genomes of Trichoptera and Lepidoptera (Triant et al. 2018), a Micropterigidae genome is an important evolutionary resource. Moreover, both orders are known for producing silk, but the structure and function of that silk can vary greatly between the two orders. Modern genomic analysis is an essential tool for extrapolating the evolutionary processes and transitions that resulted in the extant diversity stemming from the ancestral amphiesmenopteran. Here, we provide an annotated genome of Neomicropteryx cornuta, the first available genome of any mandibulate Lepidoptera.
We use PacBio HiFi sequencing data to assemble a highly contiguous N. cornuta genome. This is especially important since many genomes of Lepidoptera are of low quality (Ellis et al. 2021). We also provide a genome annotation by resolved isoforms from PacBio IsoSeq data. Our genome assembly provides an important resource to study ecological and evolutionary transitions in the early evolution of Amphiesmenoptera and sets the stage for future studies on the genomics of Amphiesmenoptera.

Results and Discussion
Assembly Sequencing the N. cornuta genome using two PacBio SMRT cells produced 8.8 and 8.4 Gb of HiFi data, respectively, corresponding to $31Â PacBio HiFi read coverage. Blobtools analysis assigned 99.8% of all base pairs to the phylum Arthropoda (supplementary fig. 1, Supplementary Material online) and the resulting assembly contained 101 contigs with a contig N50 of 16,921,359 bp. This is the secondlongest contig N50 for an amphiesmenopteran genome published thus far (Hotaling et al. 2021), shorter only than the genome of the silk moth Samia ricini (GCA_014132275.1), which was also generated by PacBio HiFi sequencing. Assembly GC content was 33.4% and the total assembly length was 541,115,538 bp, which is intermediate in length compared with other Amphiesmenoptera genomes (with BUSCO scores > 90%), which range from 227,005,758 bp (Papilio polytes) to 1,369,180,260 bp (Limnephilus lunatus) (Hotaling et al. 2021). BUSCO analysis identified 98.1% (97.9% complete; 0.3% fragmented) of the Insecta gene set in the assembly ( fig. 1, table 1).

Annotation
We also report the functional annotations of N. cornuta. Of the 15,643 predicted proteins, 86.62% (13,550) were verified by BLAST and/or transcript evidence, 63.04% (9,862) were mapped to GO terms, and 43.95% (6,875) were functionally annotated in Blast2Go. Top GO annotations include catalytic activity (4,512), cellular process (4,492), binding (4,414), and metabolic process (4,296) (supplementary figs. 4-6, Supplementary Material online). We annotated a total of 48.61% of the genome assembly as repeats using RepeatModeler and RepeatMasker. Unclassified repeats comprise >161 million bases, which is the highest among all types of repeats (genome proportion of 29.82%). Long interspersed nuclear elements (LINEs) are the second most abundant repeat category with >39 million bases (7.23%), followed by DNA transposons and rolling circles and long terminal repeats (LTRs), which have >15 million (2.82%) and >11 million bases (2.10%) respectively. Percent composition of repeats and predominance of LINEs were similar to both the S. ricini and Bombyx mori genome assemblies (Lee et al. 2021).

Conclusions
Our results provide a new genome for a relict evolutionary lineage, separated by more than 250 Myr of evolution from any currently existing genome. Results from our study show that high fidelity, long-read sequencing facilitates the production of more contiguous assemblies and generates highquality resources for further investigation of genome functions. Our new genome will be useful for future studies on amphiesmenopteran genetics, conservation and ecology.

Sequencing and Assembly
Larval specimens of N. cornuta were field collected at two sites in Kochi Prefecture, Japan, and flash frozen (supplementary note 1, Supplementary Material online). DNA was extracted from a single specimen using a Zymo Quick-prep HMW DNA extraction kit. Following DNA extraction, the sequencing library was prepared according to the "Using Express Template Prep Kit 2.0 With Low DNA Input" protocol from PacBio. The library was then sequenced on two PacBio Sequel II SMRT cells in CCS mode. Further details are provided in supplementary note 1, Supplementary Material online. Q20 HiFi CCS reads were generated from the raw data using the pbccs tool, which is included in the pbbioconda package (https://github.com/PacificBiosciences/pbbioconda, l a s t accessed August 9, 2021). The reads were then assembled into contigs using Hifiasm v0.13-r307 with the option for aggressive duplicate purging enabled (option -l 2) (Cheng et al. 2021). The primary contig assembly was used for all downstream analyses.
RNA was extracted from the head and silk gland, and library preparation was performed using the IsoSeq express  workflow. The library was then sequenced on a single Sequel II PacBio SMRT cell. Further details are provided in supplementary note 1, Supplementary Material online. The IsoSeq3 pipeline, part of the pbbioconda package, was used to generate IsoSeq clusters, following the published PacBio IsoSeq workflow (https://github.com/PacificBiosciences/IsoSeq/blob/master/isoseq-clustering.md, last accessed August 9, 2021). The steps in the pipeline are 1) circular consensus sequence calling (CCS read generation), 2) primer removal and demultiplexing, 3) refining (trimming of polyA tails and concatemer removal), 4) clustering, and 5) polishing.
We screened the genome assembly for potential contaminants with BlobTools v1.0 (Laetsch and Blaxter 2017)

Repeat and Gene Annotation
We identified and classified repetitive elements de novo and generated a library of consensus repeat sequences for the genome using RepeatModeler 2.0 (Flynn et al. 2020). We then annotated and masked repeats in the assembly with RepeatMasker 4.1.1 (Smit et al. 2013(Smit et al. -2015 using the custom repeat library generated in the previous step. Finally, we reran RepeatMasker on the masked genome using the Repbase arthropod repeat library (Bao et al. 2015). We annotated the N. cornuta genome assembly using MAKER v3.01.03 (Cantarel et al. 2008) and generated ab initio gene predictions using SNAP (Korf 2004), with more details provided in supplementary note 5, Supplementary Material online. To generate functional predictions on the predicted proteins, we used Blast2GO. First, we extracted the CDS sequences from the genome and then used blastx (nr, e-value 1e-4, max_target_seqs ¼ 5) to compare the predicted genes against the NCBI RefSeq nonredundant protein database. We used Blast2GO v1.4.4 (Gö tz et al. 2008) to map functional annotation and GO terms to the resulting sequences.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online.