Discovering Novel Genes in Non-Model Fly Accessory Glands Using De Novo Nanopore Transcriptomics


 BackgroundOxford Nanopore Technologies (ONT) long-read transcriptomes offer many advantages including long reads (>10kbp), end-to-end transcripts, structural variants, isoform-level resolution of genes and expression. However, uptake of ONT transcriptomics is still low, largely due to high error rates (2 to 13%) and reliance on reference databases that are unavailable for many non-model species. Additionally, bioinformatics tools and pipelines for de novo ONT transcriptomics are still in early stages of development. ResultsHere, we use de novo ONT GridION transcriptomics to discover novel genes from the male accessory glands (AG) of a widespread, non-model dung fly, Sepsis punctum. Insect AGs are of particular interest for this as they are hotspots for rapid evolution of novel reproductive genes, and they synthesize seminal fluid proteins that lack homology to any other known proteins. We implement a completely de novo ONT GridION transcriptome pipeline, incorporating quality-filtering and rigorous error-correction procedures, to characterize this novel gene set and to quantify their expression. Specifically, we compare these ONT genes and their expression against de novo lllumina HiSeq transcriptome data. We find 40 high-quality and high-confidence ONT genes that cross-verify against Illumina genes; twenty-six of which are novel and specific to S. punctum. Read count based expression quantification in ONT samples is highly congruent with Illumina’s Transcript per Million (TPM), both in overall pattern and within functional categories. Novel genes account for an average of 81% of total gene expression underscoring their functional importance in S. punctum AGs. Eighty percentage of these genes are secretory in nature, responsible for 74% total gene expression. Notably, median sequence similarities of ONT nucleotide and protein sequences match within-Illumina sequence similarities indicating that our de novo ONT transcriptome pipeline successfully mitigated sequencing errors. ConclusionsThis is the first study to adapt ONT transcriptomics for completely de novo characterization of novel genes in animals. Our study demonstrates that ONT long-reads, constituting a quarter of the number of bases sequenced at less than a third the cost of Illumina reads, can be a resource-friendly and cost-effective solution for end-to-end sequencing of unknown genes even in the absence of a reference database.


Background
Advances in sequencing technologies have facilitated the incorporation of 'omics data into investigations of both microevolutionary processes and macroevolutionary patterns of diversi cation in biological systems. For instance, our ability to generate high-throughput data from both model and non-model organisms has dramatically enriched our knowledge of how populations diverge, using genome-wide sequencing (Carneiro et al., 2014;Jansson et al., 2020;Samuk et al., 2020), and how new genes are born and genomes evolve, using RNAseq based transcriptomics (Au et al., 2013;Klasberg et al., 2016;Mrinalini et al., 2021;Wu et al., 2011). Speci cally, several short-read technologies have been developed for RNAseq transcriptomics, and among these Illumina is a market leader due to high base calling accuracy (> 99.9%), high data yield, and the availability of well-established bioinformatics tools and best practices for data analysis (Conesa et al., 2016;Corchete et al., 2020;Hölzer & Marz, 2019). However, Illumina short-reads span ≤ 600 bp in length which presents signi cant challenges for resolving full length, protein coding transcripts, especially in non-model species that lack genic and genomic reference databases. Instead of complete DNA sequences (CDS), short-reads are stitched together in silico using a reference database or via de novo concatenated assembly to create contiguous sequences (contigs), and from this, full-length, protein-coding transcripts can be derived. Well-annotated and high-quality reference databases can be very useful for assembling CDS and quantifying gene expression, however they are often only available for well-studied organisms such as humans (Garg et al., 2020;Venter et al., 2001), fruit ies (Adams et al., 2000), mice (Bult et al., 2019) etc. Partially or spuriously assembled CDS and chimeric transcripts are common pitfalls of de novo short-read assembly (Conesa et al., 2016;Freedman et al., 2021). Moreover, in evolutionarily closely-related genes and in gene families consisting of multiple isoforms, it is di cult to resolve CDS and quantify gene expression to the isoform level using short-reads (Conesa et al., 2016;Steijger et al., 2013).

Recent inventions in Third Generation
Sequencing technologies that provide high throughput, long-read data have allowed end-to-end transcript sequencing, thereby eliminating the need for assembling contigs.
Two long-read sequencing technologies dominate the market at present: Oxford Nanopore Technologies (ONT) and Paci c Biosciences (PacBio) Single-Molecule Real Time (SMRT) Sequencing. Among the two, ONT is arguably the leader in long-read transcriptomics and offers short to ultra-long DNA/RNA molecules longer than 10 kbp in length. Sequencing is done in real-time, based on the principle of passing long, unfragmented complementary DNA (cDNA) or RNA molecules through a protein nanopore, recording minute changes to electric current, and translating these changes into sequence data (Jain et al., 2016;Rang et al., 2018). Single-use cartridges with pre-loaded reagents can be easily used with portable, benchtop instruments, which makes ONT a convenient platform for ecological studies even in the eld (Chang et al., 2020). Further, ONT offers cDNA sequencing, in both polymerase chain reaction (PCR) based and PCR-free formats, and direct RNA sequencing that bypasses the need for converting RNA into cDNA.
Despite all these advantages, a major obstacle to the adoption of ONT long-read transcriptomics is the high error rate in both cDNA and direct RNA sequencing (Workman et al., 2019). One way to mitigate sequencing errors is by adopting a reference-based approach, and as mentioned earlier, wellcharacterized, species-speci c reference databases can go a long way in resolving transcript sequences and their expression levels, even to the level of isoforms (Dong et (Sessegolo et al., n.d.), cattle (e.g., Halstead et al., 2021), fruit ies (Bayega et al., 2018), viruses (e.g., Boldogkői et al., 2018), and well-studied plants (e.g., Cui et al., 2020;Wang et al., 2021).
However, in non-model species without reference databases, the uptake of ONT long-read transcriptomics has been negligible and there are no studies comparing ONT transcriptomics with other sequencing platforms (e.g., Illumina, PacBio). In fact, a search for ONT transcriptomics of lesser-studied species led us to just one publication on the Muscovy duck where a database from an alternate duck species was used as a reference (Lin et  In this study, we explore the use of reference-free, de novo ONT long-read transcriptomics for novel gene discovery and gene expression quanti cation in the accessory glands of a dung y species, Sepsis punctum (Diptera; Sepsidae). This is an ecologically relevant insect, often found on decaying organic material such as cattle dung and is widespread across North America and Europe. Sepsid ies are emerging models in a range of disciplines such as eco-toxicology (Blanckenhorn et  However, with the exception of species from the genus Themira, which is distantly-related to S. punctum, there is generally a lack of genic, genomic, or transcriptomic data for sepsid species. Insect reproductive genes and proteins are known to diverge rapidly at the species level and even at the and are often closely associated with paired accessory glands (AG) that synthesize seminal uid proteins ( Figure 1). These play a crucial role in post-copulatory sexual selection and AGs, in particular, can be hotspots for the evolution of completely novel genes (Mrinalini et al., 2021). Rapid evolution of insect genomes, through the birth of novel genes and subsequent recruitment for high-expression in the AGs, underpins starkly divergent reproductive gene repertoires and protein compositions even in closely-related species (Mrinalini et al., 2021). Furthermore, proteins synthesized by newly-evolved AG genes often do not show any similarity to proteins from other organisms, and their functions are largely unknown (Bayram et al., 2017(Bayram et al., , 2019Mrinalini et al., 2021;Parthasarathy et al., 2009). Given the lack of genomic/transcriptomic reference databases, the likely presence of novel, species-speci c genes, and the lack of protein homology to any other species, it is challenging to resolve full-length accessory gland genes of insects, especially using a de novo transcriptomics approach with an emerging technology such as ONT.
Given this backdrop, we aim to: (i) perform ONT GridION long-read transcriptomics of S. punctum male accessory glands; (ii) implement a de novo ONT GridION transcriptome pipeline with error correction procedures; (iii) characterize novel accessory gland genes; (iv) quantify gene expression; and (v) evaluate the usefulness of de novo ONT transcriptomics by comparing results with Illumina HiSeq data.

De novo Transcriptome Statistics
After implementing two separate de novo transcriptome pipelines for ONT GridION and Illumina HiSeq read data (Figure 2), we generated summary statistics for the transcriptomes. Table 1 provides the statistics for experimental details, cDNA library preparation, RNAseq reads, and ltered de novo transcriptomes.
RNAseq Read Statistics: ONT long-read sequencing generated 4.48 M and 5.28 M reads for samples ONT-1 and ONT-2 respectively, whereas Illumina sequencing generated 29.64 M and 32.96 M reads for ILL-1 and ILL-2 respectively. Long-read lengths range from 50 bp to a maximum of 11,837 bp for ONT-1 and 50 bp to 9,370 bp for ONT-2. Average read quality, represented by Phred scores, are 11.7 and 11.6 for ONT-1 and ONT-2 respectively. For Illumina, read qualities were much higher at 35.5 on average from the two sample. Filtered RNAseq read GC content was 39.5% and 39.2% for ONT-1 and ONT-2, whereas GC content for both Illumina samples was 46%.
After quality ltering, a majority of long-reads were found to be distributed within the 4000 bp range, with an average base quality ≤ 19 (minimum base quality 7) for both ONT-1 and ONT-2 (Figure 3a Table 1). Finally, the cost of sequencing was USD 285/sample for ONT GridION and USD 882/sample for Illumina HiSeq.  (Table 1). For Illumina, de novo transcriptome assembly followed by ltering steps showed 44,523 and 45,681 assembled contigs for ILL-1 and ILL-2 respectively (Table 1). ONT long-read transcript N50 were 363 and 345 bp, with the largest transcripts at 4,422 and 2,964 bp, for ONT-1 and ONT-2 respectively (Table 1). For Illumina short-read transcripts, both values were much higher, with N50 values at 702 and 645 bp and longest transcript lengths at 13,932 and 12,372 bp, for ILL-1 and ILL-2 respectively. Although GC content of ltered RNAseq reads varied between ONT and Illumina, the GC content of the nal de novo transcriptomes from both technologies were comparable at c. 54% (Table 1).

Characterisation of Gene Expression and Gene Function
The functional aspects of S. punctum AG genes were characterized using gene expression analysis, BLASTP against nr for functional annotation, and signal peptide analysis of the translated protein sequences ( Figure 4). Ranking of accessory gland genes from high expression (1) to low expression (40) showed high level of congruence in overall gene expression pattern across all four samples despite using two different methods of quanti cation for the two technologies, i.e., read count for ONT samples and Functional annotation showed that S. punctum AG genes fell into four main categories: novel genes; protease inhibitors; C-type lectins; and cystatins. Of the 40 genes analysed, 65% (26) had no BLASTP hits in nr, indicating that these genes were novel and evolved speci cally in the genome of S. punctum ( Figure  4b). The functions of these novel S. punctum genes are unknown due to complete lack of BLASTP homology to any other species. However, these 26 novel genes accounted for a majority of gene expression in the accessory glands, i.e., 87% and 90% of total expression in ONT-1 and ONT-2 respectively, and 78% and 71% of total expression in ILL-1 and ILL2 respectively ( Figure 4c). Of the remaining 14 genes, 10 genes were protease inhibitors and they were the second highest expressed gene category, with 10% and 8% in ONT-1 and ONT-2, and 17% and 23% in ILL-1 and ILL-2 respectively (Figure 4b and c). Two genes each for cystatins and C-type lectin were found, and both gene families were the least expressed, with 1-3% of total expression in both ONT and Illumina (Figure 4b and c). Sequence analysis for the presence of secretory signals showed that 80% (32) of 40 AG genes synthesize proteins that were secretory in nature (Figure 4c). Moreover, secretory genes also accounted for the majority AG gene expression, with 90% and 83% of total expression in samples ONT-1 and ONT-2 and 87% and 75% in samples ILL-1 and ILL-2 respectively.
Gene expression quanti cation of individual genes in all the four samples are represented as a heatmap, with genes grouped by functional categories (Figure 5a). Novel genes of unknown function were among the highest expressed genes in S. punctum accessory glands in both ONT and Illumina samples ( Figure  5a). Expression levels of several individual genes were largely comparable across samples and across sequencing technologies, with some variation between and even within technologies (Figure 5a). Evaluation of ONT Transcriptomes Summary statistics of sequence similarity generated for three test samples, i.e., ONT-1, ONT-2, and ILL-2, by comparing against a representative Illumina control sample, ILL-1, are shown in Table 2. Median nucleotide sequence similarities of 40 genes in test samples ranged from 99.53% to 99.65 and were highly congruent between ONT samples and the single Illumina short-read sample. At the protein level, median similarities of ONT-1 and ONT-2 were 98.99% and 98.93% respectively, whereas for ILL-2, it was slightly higher at 99.31%. Further, 18-19% of protein sequences from test samples were 100% similar to ILL-1 control protein sequences. At the nucleotide level however, 53% and 50% of proteins from ONT-1 and ONT-2 respectively were 100% similar to ILL-2, whereas 65% of ILL-2 sequences were 100% similar to ILL-2. Sequences similarities of individual genes are represented as heatmaps in Figure 5b and c. No particular trend is observed within or among the four functional categories, i.e., novel genes, protease inhibitors, C-type lectins, and cystatins.

Discussion
To date, this is the rst study to adapt ONT long-read technology in a reference-free, completely de novo transcriptomics approach, and it is also the rst study to use ONT long-read RNAseq to characterize novel, species-speci c genes in any animal species. So far, ONT long-read transcriptomics has been used with one other species that did not have a species-speci c reference database available (the Muscovy duck). However, in that study, the authors used a reference database from a different genus, and characterized well-established and conserved genes (Lin et al., 2021). In contrast, our genes of interest are rapidly evolving, often species-speci c genes from AG of males insects, and we successfully discover more than two dozen novel genes that are found only in S. punctum (Figure 4b). Our de novo transcriptome pipeline incorporated read quality ltering and rigorous post-sequencing error correction procedures that successfully mitigated high error rates in ONT long-reads ( Figure 2a) (Sahlin et al., 2021;Sahlin & Medvedev, 2019). Gene sequences from our assembly-free and reference-free ONT long-read pipeline were of high quality, with sequence similarity levels comparable to genes derived from Illumina short-read assembly (Table 2; Figure 5b and c).
Although two different methods of gene expression quanti cation for ONT and Illumina pipelines were used (Figure 2a), there was a high degree of congruence in gene expression patterns between the two technologies: in terms of overall expression (Figure 4a) and in terms of expression in each functional categories (4d and e). At the level of individual S. punctum genes, expression patterns were less congruent, and could be attributed to variation in biological replicates (Figure 5a). This study achieved results comparable to Illumina with a quarter the number of sequenced bases and at less than a third the cost of Illumina sequencing (Table 1), demonstrating that ONT long-reads can be a resource-friendly, computationally less-demanding, and highly cost-effective solution for end-to-end transcript sequencing even in the absence of a reference database.
Novel Gene Discovery Using ONT GridION A total of 26 novel, species-speci c genes in S. punctum were discovered using ONT GridION long-reads (Figure 4b). These novel accessory gland genes constituted 65% of 40 of high-quality and highcon dence gene set that were also cross-veri ed against evidence from Illumina HiSeq transcriptomes (Figure 4b). Thus far, novel reproductive genes and gene products have been discovered in ies, beetles, and honey bees using microarrays and Illumina based RNA sequencing (Bayram et  . Our study suggests that error-prone Third Generation Sequencing, when combined with effective read quality ltering and rigorous error correction methods, can be a reliable new technology for end-to-end novel gene discovery and gene expression quanti cation without the need for transcriptome assemblies or reference databases. The functions of novel S. punctum AG genes are unclear as we nd no homology to any other proteins in nr. However, these genes are among the highest expressed in AG tissues, both as a gene category (81% of total expression) ( Figure 4c) and at the level of individual genes (Figure 5a). This suggests that novel genes are likely important for AG function and that they likely play a crucial role in S. punctum reproduction. Novel reproductive genes, particularly AG genes, are known to evolve through rapid genomic evolution and recruitment for high expression in the AG of other dung-dwelling insects like the dung beetle (Mrinalini et al., 2021). It remains to be seen whether novel S. punctum genes evolve through a similar or a different mechanism.

Role of Protease Inhibitors, C-type Lectins, and Cystatins
In addition to novel genes, S. punctum AG transcriptomes revealed ten protease inhibitors and two genes each of C-type lectins and cystatins (Figure 4c). Protease inhibitors were the largest group and the second highest expressed functional category of genes in S. punctum (Figure 4b and 4c). Consisting of a large and diverse group of genes that synthesize many classes of proteins, protease inhibitors have been found in seminal uids of D. melanogaster (Mueller et  Eighty percent (32) of AG genes analysed in S. punctum were secretory in nature, i.e., they synthesize secretory proteins (Figure 4d). All 10 protease inhibitors and both C-type lectins and cystatins contain secretory signals, whereas among 26 novel S. punctum genes, 69% (18) were secretory. Secretory genes also accounted for the majority of AG gene expression, with an average of 84% of total expression in the four samples (Figure 4e). These patterns observed in S. punctum were similar to those found in dung beetles, where 73% of AG genes analysed were found to be secretory in nature, and they accounted for over 80% of total gene expression (Mrinalini et al., 2021). This supports that the primary function of male AG in most male insects is the synthesis of secretory proteins.

Quality of ONT GridION Transcriptome
Among several sequencing platforms offered by ONT, GridION is a scaled-up version of the earlier MinION. ONT GridION is able to analyse data of up to ve MinION owcells or Flongles and therefore returns higher data yields, but with the added advantage of a compact benchtop device with onboard data processing capacity. Further, multiple Flongles and owcells can be run independent of each other, giving greater exibility for managing sequencing run and data analyses. Currently, GridION raw base calling error rate has been estimated to be ~ 4%, and it is generally prone to deletion errors. In our analysis of sequence similarity, median sequence similarities of ONT GridION genes and proteins (to their respective orthologs in the designated Illumina HiSeq control sample) are comparable to within-Illumina median sequence similarities (Table 2). At the nucleotide level, 64% of Illumina test genes were 100% similar to Illumina control orthologs, whereas ONT GridION genes show lower proportions of 53% and 50% in ONT-1 and ONT-2 respectively (Table 2). However, both ONT GridION and Illumina HiSeq transcriptomes show equal proportions of proteins (18-19%) that are 100% similar to Illumina control orthologs (Table 2). Moreover, due to the small size of S. punctum and the minute size of our tissue, our samples were collected by pooling tissues dissected from multiple ies collected at different time points. Therefore, higher proportion of similar protein sequences suggests that natural genetic variation together with differential representation and incorporation of transcripts during de novo transcriptome construction could be the likely source of sequence variation rather than sequencing errors.

Conclusions
Our study demonstrates that de novo ONT long-read transcriptomics is a reliable and cost-effective approach for novel gene discovery and gene expression analysis in the absence of reference databases.
We nd that by implementing rigorous post-sequencing error correction procedures, error-prone ONT longreads can produce gene sequence and gene expression data that are comparable to Illumina HiSeq. We discovered 26 novel reproductive genes that are recruited for high expression in the accessory glands of male S. punctum. Novel gene discovery has important implications for understanding fundamental evolutionary processes such as phenotypic trait evolution, adaptation, and speciation. In particular, male reproductive genes of insects are known to synthesize seminal uid proteins that interact with the female reproductive environment and thereby direct post-copulatory sexual selection. Hence, understanding rapid specialization and diversi cation of male reproductive genes in a species sheds light on mechanisms of divergence of populations and the processes of speciation.

Dissection of Accessory Glands From S. punctum
The sampling and maintenance of S. punctum cultures followed previously published work (Puniamoorthy et al., 2012a;Rohner et al., 2016). For this study, a North American population from Ottawa (45.43 °N, -75.67 °E) was used and adult ies were housed in plastic containers measuring 11 cm by 9 cm by 9 cm and reared at a temperature of 26 ºC with cattle dung, sugar, and water given ad libitum. A mixture of four to ten day old males were aspirated from culture containers into plastic vials, cooled at -20 ºC for 10 min, and placed on ice until dissection. Each y was transferred to a glass slide and the abdomen was dissected into 1X PBS. Paired accessory glands were separated from the testes and collected into 1.5 ul microcentrifuge tube snap frozen on dry ice. For ONT GridION, tissues from 80 ies were pooled to allow for protocol optimisation, and for Illumina HiSeq, tissues from 63 ies were pooled. For each sequencing technology two biological replicates of pooled tissues were collected and samples were stored at -80° C until RNA extraction.

RNA Extraction
Total RNA was extracted using Aurum Total RNA Mini Kit (BIO-RAD Cat # 732-6820). Samples stored at -80° C were centrifuged at 13,000 rpm for 20 minutes at 4° C and placed on ice. 700 ml lysis solution was added to each sample and homogenized using PTFE pestles. The lysate was centrifuged for 3 minutes at 4°C and the supernatant was transferred to new tube. 700 ul of 60% ethanol was added and thoroughly mixed by vortexing for 2-3 minutes to make sure there was no visible bilayer. 700 ul of homogenized lysate was transferred into an RNA binding column inserted into a wash tube and the set up was centrifuged for 1 minute. The ltrate was discarded and the same wash was repeated a second time. 700 ul of low stringency wash was added to column and centrifuged for 1 minute and ltrate discarded. 80 ul of DNase (5ul of DNase I solution + 75 ul of DNase solution) was added to each column and incubated at room temperature for 25 minutes. The samples were washed two more times, rst with 700 ul of high stringency wash solution and second with 700ul of low stringency wash with centrifuging for 1 minute and discarding of ltrate after each wash. The samples were spun for 3 minutes to remove residual wash solution and the RNA binding column was transferred to 1.5 ul microcentrifuge tubes. 40 ul of elution solution was added to the membrane of the binding column and after one minute of membrane saturation, the sample was centrifuged for 2 minutes to elute total RNA. cDNA library preparation and RNAseq Oxford Nanopore Technologies (ONT) GridION ONT offers Direct RNA or Direct cDNA library preparation and sequencing options, however these technologies require high amounts of starting RNA input. The RNA quantities of our samples were inherently low given the small size of our study species S. punctum (2-7 mm in length) and even smaller size of reproductive tissues (Figure 1). Therefore, the PCR-cDNA (PCB109) protocol was used, which allows for lower RNA input. Total RNA samples were submitted to Genome Institute of Singapore, Singapore, for ONT GridION long-read RNAseq. 100 ng total RNA was used for cDNA synthesis and strand switching of full-length poly AAA tail. cDNA was ampli ed with 5' barcoded primers and sequencing adapter annealing. Barcoded libraries were multiplexed by pooling at 100 fmol based on average Agilent DNA 12000 size, and sequencing was performed on one GridION owcell.

Illumina HiSeq
For Illumina short-read sequencing, total RNA was shipped to Genomics Research Center at University of Rochester, New York, for cDNA library preparation and sequencing. Total RNA concentration was determined with NanoDrop 1000 spectrophotometer (NanoDrop, Wilmington, DE) and RNA quality assessed with the Agilent Bioanalyzer (Agilent, Santa Clara, CA). TruSeq RNA Sample Preparation Kit V2 was used for library construction as per manufacturer's protocols. Brie y, mRNA was puri ed from 100 ng total RNA with oligo-dT magnetic beads and then fragmented. First-strand cDNA was synthesized with random hexamer priming followed by second-strand cDNA synthesis. End repair and 3` adenylation was performed on the double stranded cDNA. Illumina adaptors were ligated to both ends of the cDNA, puri ed by gel electrophoresis and ampli ed with Polymerase Chain Reaction (PCR) primers speci c to the adaptor sequences to generate amplicons of approximately 200 -500 bp in size. Libraries were ampli ed at a concentration of 8 pM per lane and Paired End reads of length 125 bp were sequenced on HiSeq 2500 v4 platform.

De Novo Transcriptome Pipelines and Gene Expression Analysis
Two different bioinformatics pipelines were employed, with some steps commonly implemented in both pipelines, for de novo transcriptome construction and analysis of ONT long-reads and Illumina shortreads (Figure 2a & b). For ONT, de novo gene clustering, consensus sequence calling, and gene polishing were used to derive error corrected gene sequences (Sahlin et al., 2021;Sahlin & Medvedev, 2019). For Illumina, a de novo transcriptome assembly approach was used to reconstruct contigs from which full length CDS could be derived. The two methods differed in gene expression quanti cation in that, read count was used as a proxy for gene expression in the case of ONT, whereas Transcript per Million (TPM) was used for Illumina based gene expression calculation ( Figure 2).

De Novo ONT GridION Pipeline
A suite of bioinformatics tools, including standalone tools and those developed by ONT, were used for de novo transcriptome analysis of ONT long-reads. Read quality ltering, orientation, and trimming was performed using Pychopper v2 (https://github.com/nanoporetech/pychopper) with default parameters, and read statistics were analysed using NanoPlot 1.32.1 (De Coster et al., 2018). A non-hybrid approach of using ONT long-reads for error correction was employed, as our ONT and Illumina read data were derived from separate biological samples. Error correction of each ONT transcriptome was performed using ONT long-reads within the same biological sample. IsONclust2, implemented in pipeline-nanoporedenovo-isoforms (https://github.com/nanoporetech/isONclust2) was used for de novo clustering of ONT long-reads and one sequence cluster was generated for each gene (Sahlin et al., 2021; Sahlin & Medvedev, 2019). Consensus sequences were called for each sequence cluster to generate one consensus sequence per gene. The consensus gene sequences were further polished using raw reads in medaka 1.2.5 (https://github.com/nanoporetech/medaka). Open Reading Frames (ORFs) were derived by translating polished sequences in all six frames using getorf provided with EMBOSS:6.6.0 (https://www.bioinformatics.nl/cgi-bin/emboss/getorf). ORF sequences ≥ 200 bp were translated into protein sequences using transeq provided with EMBOSS:6.6.0 (https://www.bioinformatics.nl/cgibin/emboss/transeq), and a nal dereplication was performed at 100% protein sequence identity using CD-HIT v4.7 (Fu et al., 2012;Li Ã & Godzik, 2006). For gene expression quanti cation, long-reads were mapped back to ltered de novo ONT transcriptomes using minimap2, excluding any secondary alignments (H, 2018). Samtools 1.7 (Li et al., 2009) was used to further lter aligned reads, with any supplementary and secondary alignments discarded. Only reads aligning on ≥ 80% of their length were counted towards gene expression quanti cation.

De Novo Illumina Hiseq Pipeline
Raw reads were processed in Trimmomatic-0.36 (Bolger et al., 2014) for adapter removal and quality trimming. A sliding window quality score cut-off of Q30 was applied and reads of minimum 100 bp in length were retained. For each sample, cleaned reads were de novo assembled into contigs using Trinity v2.8.6 (Grabherr et al., 2011), and the resulting contigs were de-replicated at 100% identity at nucleotide level using CDHIT 4.7 (Fu et al., 2012;Li Ã & Godzik, 2006). The remaining contigs were translated in all six frames to search for ORF prediction using getorf (EMBOSS:6.6.0) (https://www.bioinformatics.nl/cgibin/emboss/getorf), and all sequences containing ORFs of ≥200 bp were retained. The contigs with ORFs were translated into protein sequences using transeq (EMBOSS:6.6.0) (https://www.bioinformatics.nl/cgi-bin/emboss/transeq), and a nal dereplication was performed at 100% protein sequence identity. Reads were mapped back to the ltered transcriptome assembly using an alignment-free method in salmon v1.0.0 (R et al., 2017) to generate TPM values that represent gene expression. Sequence Curation and Gene Orthology Sepsis punctum lacks species-speci c reference databases to compare our de novo transcriptome constructions. While genomic/transcriptomic data are available Themira sp., it is not an appropriate reference point for curating AG transcripts from S. punctum because it is from a basal, distantly-related genus and insect AG genes and protein compositions vary at species and even population levels ( Therefore, building on our de novo approach transcriptomics, a reference-free approach was taken for transcript curation and an extensive manual curation of S. punctum AG genes was performed. A gene expression cut-off was applied, and the top 100 highest expressed transcripts were selected from each of the four de novo transcriptomes (ONT-1, ONT-2, ILL-1, ILL-2) since transcriptome-based quanti cation of gene expression generally shows a steep drop after the rst few transcripts. The subset of 400 sequences was further examined to lter out chimeric and contaminant sequences with BLASTX in nr using DIAMOND v 0.8 (Buch nk et al., 2014, 2021). Using the cleaned sequence set from each sample, putative gene orthologs were established in the de novo transcriptomes of the remaining three samples by a reciprocal BLASTP with an e-value cut-off of 1e-5. These putative orthologs were further curated by manually examining end-to-end alignments. Finally, a set of 40 high-con dence and highquality accessory gland genes were derived with orthologs established in all four samples, and used for the downstream analyses . Evaluation of ONT Transcriptome ONT long-reads are prone to high error rates, therefore the usefulness of our de novo ONT GridION transcriptomics pipeline and error correction procedures in mitigating the effects of sequencing errors was evaluated (Figure 2a). Given that our study is reference-free, and Illumina short-reads are of high quality, a sequence similarity analysis was performed by comparing ONT gene sequences to Illumina gene sequences. The Illumina sample, ILL-1, was designated as the control sample and the remaining samples ONT-1, ONT-2, as well as ILL-2 were the test samples. Comparing ILL-2 to ILL-1 control sample allowed for within-Illumina assessment that can uncover effects of tissue pooling and natural genetic variation in the population. Sequence similarity values were generated by performing BLASTN of nucleotide sequences and BLASTP of translated protein sequences from 40 genes of three test samples against 40 genes from ILL-1 control. Percent sequence similarities were summarized, including median sequence similarity and percentage of sequences with 100% match to ILL-1 control. Similarities of all 40 genes were plotted as a heatmap for visualization at individual gene level.

Availability of Data and Materials
The datasets generated and analysed during the current study are available in GenBank at https://www.ncbi.nlm.nih.gov/ and can be accessed with Project ID: PRJNA765219.

Competing Interests
The authors declare that they have no competing interests.

Funding
This work was supported by Ministry of Education (Singapore) Tier-1 research grants awarded to N.P.  Tables   Table 1. Statistics from ONT GridION long-read and Illumina HiSeq short-read cDNA library preparation, RNAseq reads ltering, and nal de novo transcriptomes. Table 2. Summary statistics of sequence similarity at nucleotide and protein level. Sequence similarities were generated by comparing test samples ONT-1 and ONT-2, as well as the second Illumina sample ILL-2 against a representative Illumina control sample, ILL-1. Figure 1 (a) Habitus of a male Sepsis punctum (1 mm scale bar) (b) Testes and accessory glands. Sepsid males can be easily distinguished from females based on fore leg modi cations, which are absent in the latter.

Figures
Males possess paired accessory glands that are translucent and are closely associated with the testes that are pigmented. Accessory glands of insects are hotspots for the rapid evolution of novel, speciesspeci c reproductive genes, and this underpins the divergence of male ejaculatory proteins that play a crucial role in post-copulatory sexual selection.