Genome of the Lord Howe Island Stick Insect Reveals a Highly Conserved Phasmid X Chromosome

Abstract We present a chromosome-scale genome assembly for Dryococelus australis, a critically endangered Australian phasmid. The assembly, constructed with Pacific Biosciences continuous long reads and chromatin conformation capture (Omni-C) data, is 3.42 Gb in length with a scaffold N50 of 262.27 Mb and L50 of 5. Over 99% of the assembly is contained in 17 major scaffolds, which corresponds to the species’ karyotype. The assembly contains 96.3% of insect Benchmarking Unique Single Copy Ortholog genes in single copy. A custom repeat library identified 63.29% of the genome covered by repetitive elements; most were not identifiable based on similarity to sequences in existing databases. A total of 33,793 putative protein-coding genes were annotated. Despite the high contiguity and single-copy Benchmarking Unique Single Copy Ortholog content of the assembly, over 1 Gb of the flow-cytometry-estimated genome size is not represented, likely due to the large and repetitive nature of the genome. We identified the X chromosome with a coverage-based analysis and searched for homologs of genes known to be X-linked across the genus Timema. We found 59% of these genes on the putative X chromosome, indicating strong conservation of X-chromosomal content across 120 million years of phasmid evolution.


Introduction
Dryococelus australis, the Lord Howe Island (LHI) stick insect, is a critically endangered (Rudolf and Brock 2017) Australian phasmid. Once found in abundance on LHI in the Tasman Sea, D. australis was driven locally extinct following the accidental introduction of black rats (Rattus rattus) in 1918. In 2001, a population was discovered on Ball's Pyramid (BP), a sea stack about 20 km southeast of LHI (Priddel et al. 2003), and shortly after, a mating pair was taken from BP and used to found a captive breeding program at the Melbourne Zoo in Melbourne, Australia. Inbreeding depression was detected in the captive population shortly after its founding (Honan 2008) and a genetic study has been recommended by previous conservation managers of the species (Priddel et al. 2003;Honan 2008;Carlile et al. 2009). An initial study recovered complete mitochondrial genomes from extinct LHI and captive specimens, confirming the conspecificity of the BP and LHI populations (Mikheyev et al. 2017). This study also used short read sequence data to assemble a draft genome (GenBank assembly accession GCA_002236955.1), which showed signs of polyploidy but also had very low contiguity (N50 = 17,265). In this study, we present a chromosome-scale, annotated genome assembly for D. australis and place the species in a phylogenetic context. We also investigate X chromosome conservation by leveraging existing phasmid genomic data. Presently, the only published chromosome-scale phasmid assembly is of Timema cristinae (GCA_002928295.1), a genus that forms the outgroup to all other phasmids and which is estimated to have diverged from D. australis 121.8 (95% confidence interval: 105.1-139.4) million years ago (Simon et al. 2019). The assembly presented herein expands the scope of available phasmid genomic resources and will facilitate future population and comparative genomic studies of D. australis, a flagship species for invertebrate conservation worldwide.

Results and Discussion
Karyotype Direct observation of chromosomes identified a diploid karyotype of 2N = 34 ( fig. 1a). C-banding demonstrated mostly uninformative heterochromatin distributions with only the centromeres differentiated except for two chromosomes per individual that differentially stained across their entire length ( fig. 1b). We infer these to be the sex chromosomes. The sex of the embryos dissected was unknown but the individual depicted in figure 1b has one large and one small putative sex chromosome, implying XY sex determination which has been observed in other stick insects (Maniko 1951).

Assembly Evaluation
Initial assembly of 31,640,841 reads (N50 = 35,056 after removing reads ≤20 kb) with WTDBG2 followed by removal of potential contaminants and secondary haplotypes generated an assembly of 2,788 contigs with N50 and L50 values of 13.48 Mb and 76 (table 1). Mapping of 486,619,719 OmniC read pairs to this intermediate assembly and scaffolding by HiRise joined 905 contigs and split 3, resulting in a final assembly of 1,879 scaffolds with scaffold N50 and L50 values of 262.28 Mb and 5. This is an almost 30,000-fold increase in N50 compared with the original short read assembly (table 1). At 3.42 Gb, the scaffold assembly is 30 Mb shorter than the short read assembly (table 1), suggesting the presence of duplicate contigs and contamination in the latter. Seventeen large (>60 Mb) scaffolds comprise 99.4% of the length of this assembly, corresponding to the observed karyotype of 2N = 34 ( fig. 1a and b). We mapped the OmniC reads back to these scaffolds and plotted the resultant contact matrix (supplementary fig. S1, Supplementary Material online). "Scaffold_4" shows fewer contacts with other scaffolds, linking it to the two highly heterochromatic chromosomes identified with C-banding ( fig. 1b). We assessed completeness of the scaffold assembly by searching for Benchmarking Unique Single Copy Ortholog (BUSCO) genes from the insecta_odb10 database. The genome contains complete single-copy sequences of 96.3% of insect BUSCOs with 0.7% present but duplicated (table 1). Only 46.7% were identifiable as single copies in the original short-read assembly (table 1).

Annotation
After mapping of our RNA-seq data to the completed genome assembly with STAR, we found only 88.41% of all read pairs mapped successfully. A combination of de novo and homology-based approaches identified 37,538 predicted protein-coding genes, 3,745 of which were flagged as errors by NCBI's automated submission portal due to internal stop codons, leaving 33,793 predicted protein-coding genes and 682 predicted tRNAs. Of these predicted protein-coding genes, 6,932 could be assigned putative homologs. Previous phasmid assemblies have reported similar or greater numbers of predicted protein-coding genes (Clitarchus hookeri, n = 66,470 [Wu et al. 2017]; Medauroidea extradentata, n = 35,742 [Brand et al. 2018]); however, Timema annotations tend to have 11,000 to 16,000 (Jaron et al. 2022; BioProject accession PRJEB31411). We also ran BUSCO on our predicted protein sequences after taking the longest isoform per gene and found only 51.4% and 53.8% of insect BUSCO genes in single copy in the protein sequences filtered and unfiltered by the NCBI submission portal respectively (table 1; only NCBI filtered proteins reported). This is lower than for the official gene set published with the C. hookeri assembly which contained 73.1% of arthropod BUSCO genes (Wu et al. 2017) although their single copy or duplicate status was not reported. Our protein annotation is therefore a poor representation of the total gene content of the D. australis genome, even though we used multiple adult tissues from both sexes. This is likely explained by our input data. Assembly of PacBio CLR reads produces contigs with many indel errors which can drastically impact the accuracy of protein prediction (Watson and Warr 2019). As a result, the set of protein sequences presented here only represents a fraction of the true CDS sequences present in the assembly. This does not affect the BUSCO results for the wholegenome sequence, as BUSCO detects intron-exon boundaries de novo in this mode and is likely more tolerant of sequencing error than our annotation approach. This shortfall in our predicted proteins might be alleviated by short read polishing to correct indel errors introduced by the long error-prone PacBio CLR reads. The assembly was highly repetitive. After filtering, 2.17 Gb (63.29%) of the assembly was masked by RepeatMasker (supplementary table S1, Supplementary Material online). The majority of these, 1.56 Gb, were not classifiable based on sequences in existing databases (supplementary table S1

Genome Size
Other assemblies of large polyneopteran genomes contain high numbers of recently diverged transposable and interspersed repetitive elements (Wang et al. 2014;Wu et al. 2017;Verlinden et al. 2021) and the positive relationship between eukaryotic genome size and the total number/cumulative sequence length of transposable elements is well FIG. 1.-Representative metaphase plates of Dryococelus australis embryonic tissue taken before (a) and after (b) digestion in saturated BaOH. The arrows indicate highly heterochromatic heteromorphic bodies, which we infer to be the sex chromosomes (X and Y). All slides were stained with a 5% Giemsa solution and observed under 1,000× with a compound light microscope and photographed with a mounted Nikon D90. (c) Subtree showing the topology of the maximum likelihood tree constructed from 774 amino acid alignments using transcriptome data from Simon et al. (2019) and this study. Only the Lanceocercata, the antipodean stick insects, and Timema cristinae are shown. Divergence times are taken from Simon et al. (2019). (d) Best BLAST hit location of X-linked genes in Timema from Parker et al. (2022). (e) Median GC normalized read depth of coverage of PacBio CLR aligned to the major scaffolds of the final assembly. The error bars show 95th percentile ranges of mean depth in 10 kb bins.
Genome Biol. Evol. 15(6) https://doi.org/10.1093/gbe/evad104 Advance Access publication 3 June 2023 established (Kidwell 2002;Lee and Kim 2014). High transposable and repetitive element load increases the chances of ectopic recombination and thus comes at a fitness cost. This creates an evolutionary arms race between transposable element expansion and the host's own evolved containment mechanisms (reviewed in Kelleher et al. 2020). The extant D. australis population (from which the assembly presented herein was derived) has experienced a recent colonization bottleneck and subsequent low population size (Carlile et al. 2009). In Arabidopsis lyrata, bottlenecked populations show increased transposable element (TE) transposition (Lockton et al. 2008), consistent with natural selection acting to suppress their proliferation. We suggest that genome size in D. australis has expanded partly due to the release of TEs from host containment as a result of the attenuated power of selection in this small, bottlenecked wild population. We previously estimated the size of the D. australis genome with flow cytometry at roughly 4.7 Gb so despite the high contiguity of the assembly, a large percentage (roughly 27.7%) of the true genome sequence remains unassembled. Recent bursts of transposable and repetitive elements, which we believe is likely and is borne out by the divergence landscapes of the major repeat classes (supplementary fig. S3, Supplementary Material online), would create the conditions required for this shortfall in the assembly: highly similar interspersed and tandemly duplicated sequences that cannot be spanned by our long reads. The N50 of the PacBio reads used in the contig assembly was 35,056 and the longest detected repeat region masked by RepeatMasker was 42,143.
We believe it is likely that many more long repeat sequences remain undetected due to collapse in the assembly. The annotated protein-coding genes had a mean intron length per transcript of 1,424 (supplementary table S1, Supplementary Material online). These introns are considerably smaller than expected based on previously estimated relationships between genome size and intron size (Wang et al. 2014, Supplementary Figure S9). In fact, the mean intron size in the D. australis genome assembly is not much larger than that of the Drosophila melanogaster assembly presented by (Adams et al. 2000;Verlinden et al. 2021). We believe this further supports our assertion that a large proportion of the D. australis genome sequence remains inaccessible with the sequencing technology we have employed. Indeed, even ultra-long reads generated by Oxford Nanopore technology could only assembly 85.8% of the full length of the human genome (Jain et al. 2018), which is considerably less burdened by repetitive and transposable elements than that of D. australis.

Phylogenetics
We generated a concatenated supermatrix of 164,692 amino acid residues across 774 orthologs for Phasmatodea. Mean missingness across taxa was 3.96% (1.10-10.55); D. australis had 2.21% characters missing (supplementary table S2, Supplementary Material online). With maximum likelihood inference, we obtained an almost identical topology to Simon et al. (2019) except for the placement of Bacillus rossius and Phyllium philippinicum, the branching of which had low bootstrap support. Here, we present only the subtree around our focal D. australis. We find that Extatosoma tiaratum, an east Australian endemic, is the closest sampled relative of D. australis ( fig. 1c). This places D. australis within the Australian Lanceocercata, corroborating previous findings based on mitochondrial genomes (Forni et al. 2021) and Sanger sequencing data sets (Buckley et al. 2009(Buckley et al. , 2010) that placed D. australis into a clade with Eurycnema spp. The latter studies inferred an Australian ancestral range and estimated the age of the D. australis lineage as much older than LHI, suggesting dispersal from Australia across now submerged islands.

X Chromosome Identification
We used tblastn to identify putative homologs of X-linked genes which are conserved across five Timema species (Parker et al. 2022). The majority (59%) of the best hits were found on Scaffold_4 ( fig. 1d), and the remainder mapped to the other major scaffolds mostly in proportion to their length, for example, the next scaffold with the most hits was Scaffold_1, the longest scaffold. Scaffold_11 is a notable outlier to this trend ( fig. 1d), harboring 10 best hits compared with Scaffold_3 which harbored 9, despite being almost three times as long. We also remapped the PacBio reads generated from the male sample used in the initial assembly back to the major scaffolds and calculated GC normalized depth in 10 kb bins. This same scaffold had roughly 50% mean depth of coverage compared with the other scaffolds ( fig. 1e). Finally, Scaffold_4 is also an outlier in repetitive element composition, overall and for specific element types (supplementary fig. S2, Supplementary Material online). This strongly implicates Scaffold_4 as the X chromosome. Parker et al. (2022) found high conservation of these Timema X genes on the B. rossius X chromosome; this study extends this finding more broadly across Phasmatodea. Repeated observation of heteromorphy in C-banded D. australis embryos ( fig. 1b) suggests the presence of a Y chromosome not apparent in the assembly. We attempted polymerase chain reaction (PCR) primer design from sequences of minor scaffolds showing inflated coverage in male versus female samples (with sequencing reads from another ongoing project) but no primers showed successful differential amplification from male and female gDNA extracts, and so we are unable to confirm the true sex determination system of D. australis.

Karyotype
We used Giemsa staining with BaOH C-banding (Ghosh and Singh 1975) to observe the karyotype of D. australis. Eggs from the captive colony were dissected at 2-3 months after laying and the embryos placed in distilled water for 10 min, washed twice in a 3:1 solution of methanol and glacial acetic acid, and finally incubated in this same solution for 2 h at room temperature. Embryos were then macerated on a glass microscope slide in a 3:2 solution of glacial acetic acid and distilled water using a blunt brass rod. Extraneous material was removed with forceps and the cellular suspension rolled across the surface of the slide. The slide was dried on a hotplate set to 35 °C for 15 min. Slides were then incubated in saturated BaOH solution for 5 min followed by washes in distilled water, 0.2 M HCl, distilled water again, and a final 30 min incubation in 2X saline sodium citrate at room temperature. Slides were washed with a solution of 1% w/v KH 2 PO 4 2% w/v Na 2 HPO 4 12H 2 O and stained in a 5% v/v solution of Giemsa in the same solution. Slides were visualized at 1,000× with a compound light microscope and imaged with a mounted Nikon D90.

Reference Genome Assembly
We sampled a single adult male D. australis from the Melbourne Zoo captive population for the assembly. The reference genome sequence was assembled and annotated by Dovetail Genomics; we briefly describe the method here. High-molecular weight DNA was extracted from snap frozen muscle tissue. About 588.2 Gb of PacBio CLR reads were generated with this extract. Contigs were assembled with WTDBG v2.5 (Ruan and Li 2020) with an estimated genome size of 4.7 Gb (based on flow cytometric estimates, data not shown), minimum read length of 20,000, and a minimum alignment length of 8,192. Contaminant contigs were identified with blobtools v1.1.1 (Laetsch and Blaxter 2017) and removed. Secondary haplotypes were removed with purge_dups v1.2.3 (Guan et al. 2020). Proximity ligation information was generated with 150 bp Omni-C libraries. Tissue from the same individual was fixed in formaldehyde and then DNA was extracted, followed by digestion with DNAse I, end repair, ligation to a biotinylated bridge adapter, and finally adapter ligation. After ligation, crosslinks were reversed and the DNA was purified. Purified DNA was treated to remove biotin not internal to ligated fragments. Sequencing libraries were generated using NEBNext Ultra enzymes (New England Biolabs). Biotin-containing fragments were isolated using streptavidin beads before PCR enrichment of each library. The library was sequenced on an Illumina HiSeqX platform to produce approximately 30× sequence coverage. Omni-C reads were mapped to the contig assembly with bwa mem (Li 2013), and HiRise (Putnam et al. 2016) was used to order and orient contigs into scaffolds.
We assessed completeness of the scaffold assembly using BUSCO analysis software v5.4.4 (Manni et al. 2021) with the BUSCO database insecta_odb10 containing 1,367 protein sequences.

Annotation
Repeat families were identified de novo and classified using the software package RepeatModeler v2.0.1 (Smit et al. 2021). We further filtered repeats by a BLAST search to the "nr" database and removed any unclassified family whose best hit was a known protein not originating from a transposable element, including mitochondrial proteins. We then used RepeatMasker v2.0.1 (Smit et al. 2021) to calculate the divergence among repeat families without CpG correction and plotted the repeat landscape in R v4.1.2 (R Core Team 2018) with ggplot2 (Wickham 2016). Coding sequences from C. hookeri (Wu et al. 2017), M. extradentata (Brand et al. 2018), and T. cristinae (Soria-Carrasco et al. 2014) were used to train the ab initio model for D. australis using both AUGUSTUS software v2.5.5 (Stanke et al. 2008) and SNAP version 2006-07-28 (Korf 2004. We extracted RNA from tissue samples of both sexes of the ventral ganglion, leg muscle, gut lumen, Malpighian tubules, and gonads with RNEasy spin column kits (Qiagen); libraries were prepared and sequenced at BGI Hong Kong on a BGISEQ-500 instrument in paired end mode with 150 bp reads. Reads were mapped onto the assembly using STAR v2.7 (Dobin et al. 2013) and intron hints generated with the bam2hints tools within the AUGUSTUS software. SNAP and AUGUSTUS (with intronexon boundary hints provided from RNA-Seq) were then used to predict genes in the repeat masked assembly. Only gene models that were predicted by both SNAP and AUGUSTUS were retained. Genes were further characterized for their putative function by performing a BLAST search of the peptide sequences against a set of protein sequences from UniProt. Finally, to check the completeness of our annotation, we ran BUSCO in protein mode with the nsecta_odb10 database on our final set of predicted protein sequences.

Phylogenetic Analysis
We placed D. australis into a recently published phasmid phylogeny (Simon et al. 2019), using a de novo assembled transcriptome. We assembled all RNA-Seq reads generated for annotation with Trinity (Henschel et al. 2012) and used all phasmid data from Simon et al. (2019). We clustered the sequences of each species with CD-HIT-EST v4.8.1 (Li and Godzik 2006) with a 95% sequence identity threshold. We constructed an ortholog database with Orthograph v0.7.1 (Petersen et al. 2017) using the OrthoDB official gene sets for Ephemera danica, Ladona fulva, Rhodnius prolixus, and Zootermopsis nevadensis ) and identified orthologs in the phasmid transcriptomes. We aligned resultant amino acid sequences with MAFFT v7.508 (Katoh and Standley 2013), using the default settings and removed poorly aligned bases with TrimAl v1.4.1 (Capella-Gutiérrez et al. 2009) in "automated1" mode. For each alignment, we removed taxa that had >50% missing characters compared with the maximum sequence length. Finally, we selected only alignments that had >80% representation of the full taxon list, always including D. australis. We constructed NJ trees of all alignments in R with apetools v5.5 (Paradis and Schliep 2019) and visually inspected alignments, manually removed any taxa with obviously outlying branch lengths using a bash script. We removed paralog alignments by identifying trees whose length was dominated by long internal branches that split taxa into >1 clades not corresponding to the published phylogeny (Simon et al. 2019). We filtered for missingness again as above. Finally, we constructed a single nexus file, concatenating all 774 alignments that passed filtering and retaining partitions corresponding to the original clusters of orthologous groups and estimated a maximum likelihood tree using IQ-TREE v1.6.12 (Nguyen et al. 2015;Minh et al. 2020). We used PartitionFinder (Lanfear et al. 2012) and ModelFinder (Chernomor et al. 2016;Kalyaanamoorthy et al. 2017) to find the best supported partitions and substitution models. We calculated branch support with 1,000 ultrafast bootstraps (Hoang et al. 2018).

X Chromosome Identification
We remapped the long reads used in the original assembly back to the final assembly with minimap v2.21 (Li 2018) and estimated the depth across the major scaffolds in 10 kb bins with mosdepth v0.3.1 (Pedersen and Quinlan 2018) and normalized this by GC content. We then calculated the median and 95% percentile range of depth across all bins for each scaffold. We obtained amino sequences of genes conserved across five Timema species (Parker et al. 2022) and used tBLASTn v2.11.0 (Madden 2013) to identify putative homologs in our assembly, retaining the best hit per input sequence based on e-values.

Supplementary Material
Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

Acknowledgments
This work was supported by an Australian Government Research Training Program (AGRTP) stipend scholarship award to OPS and an Australian Research Council grant LP210200654. The authors thank the staff at Dovetail Genomics LLC for assistance with sequencing, assembly, and annotation of the reference genome. The assembly project was partially supported by a Dovetail Tree of Life Grant. Additional funding was provided by the Holsworth Wildlife Research Endowment and Zoos Victoria. The authors thank David M. Rowell for assistance with microscopy and chromosome smears, Nicholas Doidge for assistance with tissue dissections for RNA-Seq, and Darren Parker and Tanja Schwander for graciously sharing data for the Timema X-linked genes prior to publication.

Data Availability
All raw data, including fastq files for Omni-C, RNA, and PacBio continuous long-read sequencing, as well the assembly and annotation in fasta and gff formats, respectively, have been submitted to the National Centre for Biotechnology Information databases and are available under BioProject PRJNA930028. The authors also provide additional files in an Open Science Framework repository (https://osf.io/9v487/), including the original annotation gff file prior to filtering using NCBI's automated submission portal, as well as the fasta library of identified repeat families and a gff file of repetitive features. Adams MD, et al. 2000. The genome sequence of Drosophila melanogaster. Science 287:2185-2195.