Long-read HiFi sequencing correctly assembles repetitive heavy fibroin silk genes in new moth and caddisfly genomes

Insect silk is a versatile biomaterial. Lepidoptera and Trichoptera display some of the most diverse uses of silk, with varying strength, adhesive qualities, and elastic properties. Silk fibroin genes are long (>20 Kbp), with many repetitive motifs that make them challenging to sequence. Most research thus far has focused on conserved N- and C-terminal regions of fibroin genes because a full comparison of repetitive regions across taxa has not been possible. Using the PacBio Sequel II system and SMRT sequencing, we generated high fidelity (HiFi) long-read genomic and transcriptomic sequences for the Indianmeal moth (Plodia interpunctella) and genomic sequences for the caddisfly Eubasilissa regina. Both genomes were highly contiguous (N50 = 9.7 Mbp/32.4 Mbp, L50 = 13/11) and complete (BUSCO complete = 99.3%/95.2%), with complete and contiguous recovery of silk heavy fibroin gene sequences. We show that HiFi long-read sequencing is helpful for understanding genes with long, repetitive regions.


DATA DESCRIPTION

Background
Many phenotypic traits across the tree of life are controlled by repeat-rich genes [1]. There are many examples, such as antifreeze proteins in fish [2], keratin in mammals, and resilin in insects [1]. Silk is a fundamental biomaterial produced by many arthropods. Silk genes are often long (>20 kilobase pairs [Kbp]) and contain repetitive motifs [3]. Accurately sequencing through repeat-rich genomic regions is critical to understanding how functional genes dictate phenotypes. However, research thus far has been unable to quantify these regions. For silk genes, this is essential because the repetitive regions control the strength and elasticity of silk fibers [4][5][6].
Lepidoptera (moths and butterflies) and their sister lineage Trichoptera (caddisflies) display some of the most diverse uses of silk, from spinning cocoons to prey capture nets and protective armor [7]. A complete heavy-chain fibroin (H-fibroin) sequence for the model silkworm moth, Bombyx mori, was assembled over two decades ago using bacterial artificial chromosome libraries [8]. Recently, a combination of Oxford Nanopore Technologies (hereafter referred to as 'Nanopore') and Illumina sequencing technologies helped to generate a full H-fibroin sequence of B. mori, but large regions of the genome remain unassembled [3]. We have had similar problems with Nanopore and Illumina hybrid assemblies of caddisfly genomes (e.g., [9]), where we were unable to assemble complete H-fibroin genes despite intensive efforts for ∼20 species. In these assemblies, the biggest hindrances were sequencing single strands across large repeat regions and the limited efficacy of Illumina polishing of repetitive regions in the Nanopore-assembled data. Therefore, most research thus far has focused only on conserved N- and C-terminal regions (e.g., [10]). Complete, fully phased, high-fidelity (HiFi) H-fibroin sequences are critical for advancing biomaterials discovery for insect silks.

Context
We generated HiFi long-read genomic sequences for the Indianmeal moth (Plodia interpunctella, NCBI:txid58824), and the caddisfly species Eubasilissa regina (NCBI:txid1435191), with the Pacific Biosciences (PacBio) Sequel II system. Our goal was to recover regions of the genome that have been nearly impossible to sequence because of their repetitive content. We chose these two species because they have very different life histories. Plodia interpunctella is an important model organism in Lepidoptera whose larvae feed on various grains and stored food products and secrete large amounts of thin silken webbing at their feeding sites; they also use silk to create a cocoon during pupation [11,12]. Eubasilissa regina, on the other hand, is a member of the insect order Trichoptera, whose larvae secrete silk in aquatic environments to produce protective silk cases made of broad leaf pieces from deciduous trees, cut to size [13]. These new resources not only expand our knowledge of a primary silk gene in Lepidoptera and Trichoptera, but also contribute new, high-quality genomic resources for aquatic insects and arthropods, which have thus far been underrepresented in genome biology [14][15][16].

Sample information and sequencing
A single adult specimen of each species was sampled for inclusion in the present study. For P. interpunctella, we used a specimen from the PiW3 colony line at the US Department of Agriculture laboratory in Gainesville, FL, USA. Its entire body was used for extraction, given its small size. For E. regina, a wild-caught female adult specimen (USNMENT01414923) from Enzan, Yamanashi, Japan (N35°43′ 24′′ E138°50′ 33′′, elevation ∼4,840 ft), was used, which has been deposited in the Smithsonian National Museum of Natural History (USNM) biorepository (#AK0WP01). The head and thorax were macerated and DNA was extracted.
The remainder of the tissue will be stored at the USNM biorepository.
Both specimens were flash-frozen in liquid nitrogen, and DNA was extracted using the Quick-DNA HMW MagBead Kit (Zymo Research). Extractions with at least 1 μg of high-molecular-weight DNA (>40 Kbp) were sheared, and the BluePippin system (Sage Science, Beverly, MA, USA) was used to collect fractions containing 15-Kbp fragments for library preparation.
Sequencing libraries were prepared for each species using the SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) and following the ultra-low protocol.
All sequencing was performed using the PacBio Sequel II system. For P. interpunctella, the genomic library was sequenced on a single 8M SMRTcell and E. regina was sequenced on three 8M SMRTcells, all with 30-hour movie times. For the P. interpunctella Iso-seq transcriptome, RNA was extracted using TRIzol (Invitrogen) from freshly dissected silk glands of caterpillars, following the manufacturer's protocol. This species has a relatively small body size compared with other Lepidoptera, so we waited until caterpillars reached their maximum size (during the fifth instar) before dissection, to maximize yield. Genomic HiFi reads were generated from the subreads.bam files by circular consensus sequencing, retaining consensus sequences with three or more passes and quality values equal to or greater than 20, using the pbccs tool (v.6.0.0) in the pbbioconda package (RRID:SCR_018316) [17]. Using the same pbbioconda package and the Iso-seq v3 tools, high-quality (>Q30) transcripts were generated from HiFi read clustering without polishing.
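The HiFi criterion described above (three or more passes, quality value of at least 20) can be illustrated with a short Python sketch. This is not the pbccs implementation; it only restates the two thresholds as a filter. The `ConsensusRead` class and its field names are hypothetical, and the sketch assumes that each consensus read carries a pass count and a predicted accuracy between 0 and 1, which is converted to a Phred-scaled quality value.

```python
import math
from dataclasses import dataclass


@dataclass
class ConsensusRead:
    """Hypothetical container for one consensus read (illustration only)."""
    name: str
    num_passes: int            # number of subread passes supporting the consensus
    predicted_accuracy: float  # predicted read accuracy in the range [0, 1]


def phred_qv(accuracy: float, cap: float = 60.0) -> float:
    """Convert a predicted accuracy to a Phred-scaled quality value.

    QV = -10 * log10(1 - accuracy); a perfect read is capped at `cap`.
    """
    error = 1.0 - accuracy
    if error <= 0.0:
        return cap
    return min(cap, -10.0 * math.log10(error))


def is_hifi(read: ConsensusRead, min_passes: int = 3, min_qv: float = 20.0) -> bool:
    """Return True if the read meets both thresholds stated in the text."""
    return read.num_passes >= min_passes and phred_qv(read.predicted_accuracy) >= min_qv


def filter_hifi(reads):
    """Keep only the reads that satisfy the HiFi criterion."""
    return [read for read in reads if is_hifi(read)]


if __name__ == "__main__":
    example = [
        ConsensusRead("read_a", num_passes=5, predicted_accuracy=0.999),
        ConsensusRead("read_b", num_passes=2, predicted_accuracy=0.9999),
        ConsensusRead("read_c", num_passes=10, predicted_accuracy=0.98),
    ]
    for read in filter_hifi(example):
        print(read.name, round(phred_qv(read.predicted_accuracy), 1))
```

In this toy input, only the first read passes: the second has too few passes and the third falls below Q20.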

Genome size estimations and genome profiling
Estimation of genome characteristics, such as size, heterozygosity, and repetitiveness, was conducted using a k-mer distribution-based approach. After counting k-mers with K-Mer Counter (KMC) v.3.1.1 (RRID:SCR_001245) and a k-mer length of 21 (-m 21), we generated a histogram of k-mer frequencies with KMC transform histogram [18]. We then generated genome k-mer profiles from the k-mer count histogram using the GenomeScope 2.0 web tool (RRID:SCR_017014) [19], with the k-mer length set to 21 and the ploidy set to 2.
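To make the k-mer profiling step concrete, the following Python sketch counts canonical 21-mers in a set of reads and summarizes them as a frequency histogram (multiplicity, number of distinct k-mers), which is the kind of two-column table that k-mer profiling tools take as input. It is a toy, in-memory illustration and not a substitute for KMC; the function names are hypothetical, and the choice to count canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) and to skip k-mers containing ambiguous bases is an assumption of this sketch.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def count_kmers(reads, k: int = 21) -> Counter:
    """Count canonical k-mers across reads, skipping k-mers with non-ACGT bases."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            if set(kmer) <= _VALID_BASES:
                counts[canonical(kmer)] += 1
    return counts


def kmer_histogram(counts: Counter):
    """Return sorted (multiplicity, number of distinct k-mers) pairs."""
    histogram = Counter(counts.values())
    return sorted(histogram.items())


if __name__ == "__main__":
    # Tiny example with k=3 so the output is easy to check by hand.
    toy_reads = ["ACGTAC", "AAAA"]
    for multiplicity, n_distinct in kmer_histogram(count_kmers(toy_reads, k=3)):
        print(f"{multiplicity}\t{n_distinct}")
```

For real sequencing data, a dedicated k-mer counter is needed because the number of distinct 21-mers far exceeds what a Python dictionary handles comfortably.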

Genome statistics
All samples, raw sequence reads, and assemblies were deposited to GenBank [26]. [...] [9], the findings in this study may be an indication of tetraploidy. Future research should further examine these patterns.
The P. interpunctella assembly represents a substantial improvement to existing, publicly available genome assemblies (Tables 2 and 3).
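The contiguity statistics used to compare assemblies (N50 and L50) follow a simple definition: with contigs sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches half of the assembly size, and L50 is the number of contigs needed to reach that point. A minimal Python sketch is given below; the function name is hypothetical, and the contig lengths in the example are made up for illustration and are not taken from the assemblies in this study.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a collection of contig lengths.

    N50: length of the contig at which the cumulative length, taken from
         longest to shortest, first reaches half of the total assembly length.
    L50: number of contigs needed to reach that point.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running_total = 0
    for rank, length in enumerate(ordered, start=1):
        running_total += length
        if running_total >= half_total:
            return length, rank
    # Not reachable for non-negative lengths, kept for completeness.
    return ordered[-1], len(ordered)


if __name__ == "__main__":
    # Made-up contig lengths (in bp) for illustration only.
    toy_lengths = [10, 8, 6, 4, 2]
    n50, l50 = n50_l50(toy_lengths)
    print(f"N50 = {n50}, L50 = {l50}")
```

A higher N50 together with a lower L50 indicates that more of the assembly is contained in fewer, longer contigs.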

Heavy-chain fibroin gene annotation
We extracted H-fibroin silk genes from both the P. interpunctella and E. regina assemblies.
For P. interpunctella, we also searched existing short-read-based assemblies. We downloaded two short-read-based genome assemblies for P. interpunctella, [...] (Figure 3).

Structural and functional annotation
The major biological processes found in both genomes were cellular and metabolic processes.

REUSE POTENTIAL
We provide complete genomes for two species of silk-producing insects in the superorder Amphiesmenoptera: the moth P. interpunctella and the caddisfly E. regina. We also recover the difficult-to-sequence repetitive regions of both genomes with HiFi sequencing.
P. interpunctella is currently being developed in multiple laboratories as a model organism, and this genome assembly will facilitate molecular genetics research on this species. We show that PacBio HiFi sequencing allows accurate generation of repetitive protein-coding regions of the genome (silk fibroins), and this probably applies to other similarly repetitive regions of the genome. For Trichoptera, there are only four other HiFi genome assemblies available on GenBank, only one of which has been published [45].
Insects have largely been neglected (relative to their total species diversity) in terms of genome sequencing efforts [15,16], which is especially true for aquatic insects [14]. These data serve as the first step to study the evolution of adhesive silk in Amphiesmenoptera, which is an innovation beneficial for survival in aquatic and terrestrial environments.
Finally, the Iso-seq data that we provide serve as useful resources for the translational aspects of silk. These data provide information on how Amphiesmenoptera genetically modulate and regulate different silk properties, which allows them to use silk for different purposes, such as for nets, cases, and cocoons in both terrestrial and aquatic environments.

AVAILABILITY OF SOURCE CODE AND REQUIREMENTS
All custom-made scripts used in this study are available on GitHub.

DATA AVAILABILITY
Raw sequence data, genome assemblies, and sample information are all available from NCBI under BioProject number PRJNA741212. Individual accessions can be found in Table 1. Snapshots of the code and supporting data are available in GigaDB [32], including assemblies and annotations for P. interpunctella [46] and E. regina [47].

ETHICAL APPROVAL
Not applicable.

CONSENT FOR PUBLICATION
Not applicable.

COMPETING INTERESTS
The authors declare that they have no competing interests.