Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola)

Abstract

Background
Genome sequencing of all known eukaryotes on Earth promises unprecedented advances in biological sciences and in biodiversity-related applied fields such as environmental management and natural product research. Advances in long-read DNA sequencing make it feasible to generate high-quality genomes for many non-genetic model species. However, long-read sequencing today relies on sizable quantities of high-quality, high molecular weight DNA, which is mostly obtained from fresh tissues. This is a challenge for biodiversity genomics of most metazoan species, which are tiny and need to be preserved immediately after collection. Here we present de novo genomes of 2 species of submillimeter Collembola. For each, we prepared the sequencing library from high molecular weight DNA extracted from a single specimen and using a novel ultra-low input protocol from Pacific Biosciences. This protocol requires a DNA input of only 5 ng, permitted by a whole-genome amplification step.

Results
The 2 assembled genomes have N50 values >5.5 and 8.5 Mb, respectively, and both contain ∼96% of BUSCO genes. Thus, they are highly contiguous and complete. The genomes are supported by an integrative taxonomy approach including placement in a genome-based phylogeny of Collembola and designation of a neotype for 1 of the species. Higher heterozygosity values are recorded in the more mobile species. Both species are devoid of the biosynthetic pathway for β-lactam antibiotics known in several Collembola, confirming the tight correlation of antibiotic synthesis with the species' way of life.

Conclusions
It is now possible to generate high-quality genomes from single specimens of minute, field-preserved metazoans, exceeding the minimum contig N50 (1 Mb) required by the Earth BioGenome Project.

>Reviewer #1 (Dr. Arong Luo): First, I'd like to commend the authors on attempting to sequence whole genomes of tiny metazoans, which account for a large part of biodiversity in nature and yet are difficult to be sequenced. Second, I am impressed by their ethanol-preserved specimens, which thus make genome sequencing more applicable and attractive in practice. We must admit that sometimes we cannot use fresh specimens directly for genome sequencing. Thus, I think this manuscript is really of scientific significance for specific fields such as insects.
>I found that the focal part of their sequencing protocol is the "whole genome amplification-based Ultra-Low DNA Input Workflow for SMRT Sequencing (PacBio)" throughout the text, which of course is very complex. So, I suggest the authors provide a flowchart showing critical or main steps during their workflow, and the readers can then understand easily and refer to their workflow in future projects.
We agreed with Reviewer #1 that a flowchart would help the readers to implement the workflow in their projects and thus added it to the manuscript (new Figure 2, figure numbering shifted accordingly).
>Finer points:
>Line 35: I suggest providing specific/important information for the 'novel' protocol herein.
We added some information about the ultra-low input protocol in the abstract, as suggested.
>Line119-120: Are the specimens later for DNA extraction also morphologically identified?
The sequenced specimens were crushed and ground; therefore, morphological identification was limited to stereomicroscope observations. Accurate identifications were done on many specimens collected together with the sequenced specimens. We slightly modified the paragraph to clarify our approach.
>Line130-131: The DNA extract was selected randomly or based on certain measurements?
DNA extracts were selected randomly; we added this detail to the text.
>Reviewer #2 (Dr. Mahul Chakraborty): In "Two high-quality de novo genomes from single ethanol-preserved specimens of tiny metazoans (Collembola)." Schneider et al. described de novo genome assemblies of two tiny field collected Collembolan specimens. The authors collected high quality genomic DNA from the specimens following a Pacific Biosciences recommended protocol for ultra low input library, amplified them, and generated adequate sequence coverage to generate contiguous assemblies. This is a significant step forward in generating de novo genome assemblies from small amounts of tissues and cells and therefore will be a useful guide for not only people who are studying whole organisms but also people who are studying variation between cell or tissue types within an individual.
>I have some minor comments:
>"They were preserved in 96% ethanol, kept at ambient-temperature for one day until they would be stored at -20°C for 1.5 months, until DNA extraction."
>Was the preservation at -20 a deliberate step to see the effect of this treatment on sequencing or just a conscious choice for specimen preservation?
It was a conscious choice for specimen preservation. Cold storage in ethanol is a common practice to preserve specimens for future sequencing. When the possibility to sequence S. aquaticus came, we tried with this relatively "fresh" sample and were happy with the results. We slightly modified the paragraph to better reflect that the -20°C was a storage step, and not an experiment.
>The specific conditions used (e.g. the time and speed of centrifuge) for the g-Tube shearing needs to be added in the Methods.
We added them.
>"Circularity was validated manually, and nucleotide bases were called with a 75% threshold Consensus.?" -please clarify what the 75% threshold consensus is.
We mean 75% majority-rule consensus, i.e. a base is called when in agreement with at least 75% of the covering reads. In practice, all bases were resolved without ambiguity.
We corrected the text to clarify.
>"We then performed another estimation of the genome size by dividing the number of mapped nucleotides by mode of the coverage distribution" -Why was this done? Did the authors suspect the Genomescope estimate to be incorrect?
Both methods were proposed to estimate the genome size. We were curious to see whether the three values (GenomeScope estimate, assembly size, and genome size estimated from mapped reads) would remain close to each other.
>"We compared our new genomes sequenced to previous Collembola assemblies that were generated with long read and sometimes additional short read data." -This statement needs citations for the previous Collembola assemblies.
This is right; we added the references that belong with the sentence. Those references were already cited in other parts of the text.
>The authors used blastn and megablast to search the beta-lactams synthesis genes in the new assembly. Tblastx might be more appropriate.
We repeated the search with tblastn to query the protein sequences, and also repeated the search on the alternate haplotype sequences. The results did not change. We added this detail to the text.
>"For D. tigrina a total of 20,22 Gb HiFi data (Q>=20) was generated," -Do you mean 20.22 ?
Yes, this is a typo.
>"For S. aquaticus a total of Gb HiFi data (Q>=20) was generated" -missing the number before Gb
We apologize for the typo, and added the value to the text: 12.4 Gb.
>The authors report only one assembly from hifiasm, which I presume is the primary assembly. Given that the authors assembled diploid individuals, I am curious whether hifiasm assembled the alternate haplotype sequences.
Following Reviewer #2's suggestion, we now report the alternative haplotig assembly. We took the alternative haplotigs produced by hifiasm and concatenated them to the haplotigs purged from the primary assembly. We applied Purge_dups and our decontamination strategy to get a clean alternative haplotig assembly.
We performed a BUSCO search on the alternative assembly and found >87% complete BUSCOs for each species. We now report the alternative assemblies in the manuscript, and deposit them in the Giga-DB, as supporting material.
>"The insect genomes have higher BUSCO scores (96.5 and 99.6%), but lower contiguity (Table 2, Fig. 3)." -This statement is incorrect. A number of insect genomes are more contiguous than the assemblies presented here, including Drosophila melanogaster (PMID: 31653862) and several other Drosophila species, Anopheles stephensi (DOI:10.1101/2020.05.24.113019), Anopheles albimanus (PMID: 32883756)
Our statement was ambiguous, and thus may have been misunderstood. We compared our assemblies to what we judged to be relevant examples, namely insect genomes previously sequenced using a single specimen with a low-input DNA approach, and assembled only with PacBio long reads (no Hi-C scaffolding). Naturally, chromosome-level assemblies are available for several insect model species such as the fruit fly, but this is not a useful comparison. For example, in "DOI:10.1101/2020.05.24.11301", the species is sequenced from 70 inbred specimens. The purpose of the limited comparison we offer in the manuscript is to report similar or improved results relative to (1) previously sequenced genomes of Collembola (organisms more similar to our species, but sequenced from large pools of fresh specimens, whereas we worked with single ethanol-preserved specimens); and (2) insects previously sequenced with the low-input PacBio workflow (still bigger animals than our Collembola, and fresh). We edited the text to disambiguate the statement.
We are happy to acknowledge the help of the reviewers, Dr. Arong Luo and Dr. Mahul Chakraborty, in reviewing our work; and thank them for their suggestions and corrections that improved the quality of the manuscript. We hope to see our manuscript accepted for publication in Gigascience. As soon as we get a validation, we will request the public release of the data in ENA-EMBL and provide the definitive accession numbers for those data.
On behalf of the authors,

Genome sequencing of all known eukaryotes on Earth promises unprecedented advances in biological sciences and in biodiversity-related applied fields such as environmental management and natural product research. Advances in long read DNA sequencing make it feasible to generate high-quality genomes for many non-genetic model species. However, long read sequencing today relies on sizable quantities of high-quality, high molecular weight (hmw) DNA which is mostly obtained from fresh tissues. This is a challenge for biodiversity genomics of most metazoan species, which are tiny and need to be preserved immediately after collection. Here we present de novo genomes of two species of submillimeter Collembola. For each, we prepared the sequencing library from hmw DNA extracted from a single specimen and using a novel Ultra-Low input protocol from Pacific Biosciences. This protocol requires a DNA input of only 5 ng, permitted by a whole genome amplification step.

The two assembled genomes have N50 values over 5.5 and 8.5 Mb respectively, and both contain ~96% of BUSCO genes. Thus, they are highly contiguous and complete. The genomes

from all known ~1.5 M eukaryotic species [1]. High-quality (highly contiguous and complete, preferentially chromosome-level) genomes sequenced from accurately species-identified organisms are essential for these efforts. To achieve its goal, biodiversity genomics faces a major challenge: most of the eukaryotic biodiversity belongs to highly diverse families of tiny species [2] that are (1) difficult to sequence and (2) difficult to identify.

Advances in long-read sequencing technology changed the game for biodiversity genomics as this technology now makes it possible to obtain high-quality genomes for diverse taxa. However, minute metazoans pose a number of challenges to long read sequencing. Standard protocols for long-read sequencing require a large input of hmw DNA (in the order of a microgram), which in turn requires larger amounts of fresh or well-preserved input tissue. Pooling individuals from field-collected specimens is often not possible and not desirable: many species cannot be captured in sufficiently large numbers, and pooling individuals complicates assembly by increasing genetic heterogeneity and bears the risk of mixing cryptic species. Small animals often need to be preserved as soon as they are removed from their natural habitats. Furthermore, to be precisely identified, individuals have to be sorted, prepared and observed under a microscope. This results in delays between specimen collection and DNA extraction and cannot be done on living specimens. Therefore, most small metazoan species will have to be genome-sequenced from single, field-preserved specimens.

Recent progress has already decreased the amount of DNA needed for long read sequencing.

produce libraries from as little as 5 ng DNA input.
Using these libraries, we sequenced one SMRT cell for each species. To set the genomes as reliable references, we followed a thorough taxonomic workflow leading to the designation of a needed neotype for S. aquaticus. We investigated the resulting genomes for the presence of a beta-lactam antibiotic synthesis pathway, an exceptional trait in the metazoan kingdom known in some species of edaphic Collembola [6]. We placed the two species in a genome-based phylogeny of Collembola. The resulting genomes are highly contiguous and nearly complete. The S. aquaticus assembly even has the highest contiguity when compared with the Collembola genomes sequenced so far from hundreds of cultured specimens [7,8].

Thus, we show that high-quality, de novo genomes can be sequenced following a typical taxonomic workflow, even from submillimeter species that have been preserved for several days in 96% ethanol. This novel approach will add to the aim of biodiversity genomics to sequence all life on Earth, and bring closer the day when whole genome sequencing will be a routine component of integrative taxonomy.

surface. The animals can walk and jump on water surfaces thanks to elongated claws and a strong furca (jump appendage) with a tip that functions as a paddle on the surface tension. The species has a remarkably pronounced sexual dimorphism: the male is significantly smaller than the female and its antennae, modified into a prehensile organ, allow it to clasp the female antenna in a courtship dance preceding external fecundation (Fig. 1B-E).

Specimens were extracted from the compost with a Berlese funnel directly into 96% ethanol. DNA extraction was performed within ~72 h. Sminthurides aquaticus was collected from a pond in a public garden (2.3999° E, 48.8589° N, 27.x.2019). Specimens were caught manually by eye using a small net and mouth-aspirator.
They were preserved in 96% ethanol, kept at ambient temperature for one day until they could be stored at -20°C. They remained in cold storage for 1.5 months until we could proceed with DNA extraction. For each species, we gathered a pool of specimens collected simultaneously and pre-identified them all using a stereomicroscope (up to 60x magnification). For D. tigrina, four specimens were used for DNA extraction (involving their destruction) and 30 were used for precise morphological identification. For S. aquaticus, eight specimens were used for DNA extraction, and 17 for morphological identification. Specimens used for morphological identification were cleared in lactic acid and KOH, and they were mounted on permanent slides using Marc-André II mounting medium. Observations were made using a Leitz Wetzlar Diaplan with phase contrast, at 400-1000x magnification.

Our workflow for DNA extraction and Ultra-Low DNA input follows the flowchart shown in Fig. 2. Extraction was performed from a single specimen for both species. Specimens were rinsed in 1x PBS (Sigma) to remove residual EtOH. The solution was replaced four times with fresh PBS.

We used blastn to identify insertions of the mitochondrial genome in the nuclear genome (NUMTs). For this query, we used a 2x duplicated sequence of the mitochondrial genome to handle circularity. We recognized the presence of almost complete copies of the mitochondrial genome in the nuclear genomes of both species. We investigated the mapping of the CCS to the assembly in those locations using IGV [34] (v2.8.13), and recognized that in one instance, a mis-assembly occurred through the joining of two NUMTs with CCS of mitochondrial origin. All CCS aligning with those two NUMTs were gathered with blastn and reassembled using Geneious. We could not find unambiguous NUMT CCS (i.e. CCS carrying both nuclear and mitochondrial sequence) that would support the original assembly connection and therefore we split the contig.
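The reason for duplicating the circular mitochondrial sequence can be illustrated with a minimal Python sketch. It is not the blastn search used in the study: exact string matching stands in for alignment, and the function name and toy sequences are hypothetical. The point it shows is that a fragment spanning the arbitrary start of the linearized circle is missed on the single-copy sequence but found on the 2x duplicated one.

```python
from __future__ import annotations


def find_on_circle(circular_seq: str, fragment: str) -> list[int]:
    """Return 0-based start positions of `fragment` on a circular sequence.

    The circle is linearized at an arbitrary origin, so the sequence is
    duplicated (2x) to expose fragments that run across that origin.
    Exact matching is an assumption made for brevity; the study used blastn.
    """
    n = len(circular_seq)
    if n == 0 or not fragment or len(fragment) > n:
        return []
    doubled = circular_seq + circular_seq
    hits = []
    start = doubled.find(fragment)
    # Only starts within the first copy are distinct positions on the circle.
    while start != -1 and start < n:
        hits.append(start)
        start = doubled.find(fragment, start + 1)
    return hits


mito = "ACGTTGCAAT"        # toy circular genome, linearized arbitrarily
fragment = "ATAC"          # toy fragment that spans the origin
print(fragment in mito)                 # False: missed on the single copy
print(find_on_circle(mito, fragment))   # [8]: found on the duplicated copy
```

In a real search the same idea applies to alignments: without the duplication, an insertion covering the origin would be reported as two separate partial hits.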

Contamination control
We checked the assemblies for potential contamination from other organisms by querying the contigs against the National Center for Biotechnology Information (NCBI) database using protein-based (DIAMOND, [35]) and nucleotide-based (blastn) alignments. Results were merged with Blobtools2 [36] (v2.3.3) using the "bestsum" algorithm. Contigs explicitly assigned to a lineage other than Metazoa were excluded from the assembly. Contigs assigned to Chordata were checked for the presence of Arthropoda BUSCOs. If Arthropoda BUSCOs were confirmed on such contigs, we retained them for the assembly.
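The retention rule described above can be written as a small decision function. This is a sketch under stated assumptions, not the authors' script: the record layout, field names and example contigs are hypothetical, and it does not read Blobtools2 output. Two choices are assumptions that the text does not spell out: contigs without any taxonomic assignment are kept, and Chordata-assigned contigs without Arthropoda BUSCOs are dropped.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ContigCall:
    """Taxonomic call for one contig (hypothetical record layout)."""
    name: str
    lineage: tuple[str, ...]      # e.g. ("Metazoa", "Arthropoda"); () if unassigned
    arthropoda_buscos: int = 0    # number of Arthropoda BUSCOs found on the contig


def keep_contig(call: ContigCall) -> bool:
    """Apply the contamination rule to a single contig."""
    # Assumption: unassigned contigs are kept, since only contigs
    # explicitly assigned to a non-metazoan lineage are excluded.
    if not call.lineage:
        return True
    if call.lineage[0] != "Metazoa":
        return False
    # Chordata calls are unexpected for a springtail: keep the contig only
    # when Arthropoda BUSCOs confirm its origin (dropping it otherwise is
    # an assumption of this sketch).
    if "Chordata" in call.lineage:
        return call.arthropoda_buscos > 0
    return True


calls = [
    ContigCall("ctg1", ("Metazoa", "Arthropoda"), 12),
    ContigCall("ctg2", ("Fungi",)),
    ContigCall("ctg3", ("Metazoa", "Chordata"), 3),
    ContigCall("ctg4", ("Metazoa", "Chordata"), 0),
    ContigCall("ctg5", ()),
]
kept = [c.name for c in calls if keep_contig(c)]
print(kept)
```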

Assembly assessment
Curated assemblies were again evaluated with BUSCO (same parameters as before). We mapped the CCS on the assemblies using backmap [37] (v0.3), a perl wrapper of minimap2 and QualiMap2 [38]. Minimap2 was run with "-H -ax asm10" to map CCS on the assembly. We then performed another estimation of the genome size by dividing the number of mapped nucleotides by the mode of the coverage distribution [37].
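The arithmetic behind this second genome size estimate can be sketched in a few lines of Python. This is an illustration only and not backmap itself: the function name and the toy per-base depth values are hypothetical, and ignoring zero-depth positions when taking the mode is an assumption of the sketch.

```python
from collections import Counter


def estimate_genome_size(depths):
    """Estimate genome size as (mapped nucleotides) / (mode of coverage).

    `depths` holds the read depth at every position of the assembly.
    Their sum equals the total number of mapped nucleotides.
    """
    covered = [d for d in depths if d > 0]   # assumption: skip uncovered bases
    if not covered:
        raise ValueError("no covered positions")
    mapped_nucleotides = sum(covered)
    mode_depth, _ = Counter(covered).most_common(1)[0]
    return mapped_nucleotides / mode_depth


# Toy assembly of 12 positions: most sit at 10x, two at 20x (as expected
# for a collapsed duplication) and two at 5x.
toy_depths = [10] * 8 + [20] * 2 + [5] * 2
print(estimate_genome_size(toy_depths))   # 130 mapped nt / mode 10 = 13.0
```

Because regions with doubled depth count twice and regions with halved depth count half, the estimate can differ from the assembly length, which is why comparing it with the assembly size and the k-mer based estimate is informative.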

Comparison with previous long read assemblies
We compared our newly sequenced genomes to previous Collembola assemblies that were generated with long read and sometimes additional short read data [7,8,45]. We also compared our Collembola assemblies to the draft genomes of two larger insects [3,39] (4 and 20 mm), which were also sequenced from single specimens but with the PacBio Low input workflow [40] (amplification-free).

Alternative haplotig assembly
For both species, we obtained the alternative haplotig assembly by concatenating the alternate haplotigs produced by Hifiasm with the duplicated contigs identified in the primary assembly during the purging step. We then further curated the alternative haplotig assembly by sequentially applying Purge_dups and the decontamination strategy described above; and finally evaluated the BUSCOs

We used blastn (blastn and megablast) to query the DNA sequences and tblastn to query the protein sequences against the combined primary and alternative haplotig assemblies. We also used blastp to query the protein sequences against the predicted protein sequences from the primary assemblies. The NCBI accession numbers of the searched sequences are: IPNS-JX270832.1

observed again in June 2020 in large numbers and with courtship behavior underway (Fig. 1D, E).

For S. aquaticus a total of 12.4 Gb HiFi data (Q>=20) was generated with mean length of 12,308 bp, median read length of 11,893 bp and max read length of 29,073 bp. The distribution of read length is reported in Fig. 3. From the kmer content of the reads, the genome haploid length was estimated to be ca. 152 Mb with 0.96% of heterozygosity and 0.78% duplications.

Genome assembly
Overall, Hifiasm produced the best assemblies for both species (supplementary file S2).

Species biology and taxonomy
For D. tigrina, the most contiguous assembly (Table 2, Fig. 4)

For S. aquaticus the best assembly was obtained by using all the reads (Table 2, Fig. 4). Purging haplotigs with Purge_haplotigs resulted in fewer duplicated BUSCOs than with Purge_dups. Two contigs (totaling 243,436 bp) were found to be from a fungus and a cyanobacterium respectively, and were removed. Some contigs were assigned to Chordata taxa but all of those carried Arthropoda-specific BUSCOs and were kept. The curated primary assembly of S. aquaticus is composed of 79 contigs, has a size of 165,915,169 bp and an N50 value of 8.78 Mb (Table 2, Fig. 4). Mean coverage is 72.67 X, coverage distribution mode is 77 X. The genome size is 157 Mb, estimated from mapped reads and coverage. BUSCO search on the whole assembly yielded 96.1% complete BUSCOs (including 1.6% duplicated), 1.3% fragmented BUSCOs and 2.6% missing BUSCOs. The mitochondrial genome assembly was complete, for a size of 16,099 bp. A large NUMT was detected in one of the purged contigs (haplotigs), but none were found in the primary contigs, so we decided not to investigate further. Several small contigs were found to be assembled from mitochondrial reads and were removed. The alternative haplotig assembly of S. aquaticus is composed of 459 contigs, has a size of 150,171,336 bp and an N50 value of 1.00 Mb; BUSCO search on the alternative haplotig assembly yielded 87.5% complete (including 2.9% duplicated), 1.4% fragmented and 2.6% missing BUSCOs.

Comparison with previous long read assemblies
In terms of BUSCO completeness scores, our assemblies are comparable to previous high-quality Collembola genomes assembled from a large pool of specimens (95.8% and 96.1% vs. 94.5-97.1% complete; Table 2). In terms of assembly contiguity, our S. aquaticus assembly has the highest and the D. tigrina assembly has the third-highest contig N50 value (Table 2, Fig. 4).
The insect genomes obtained by sequencing single specimens using the PacBio Low Input workflow have higher BUSCO scores (96.5 and 99.6%), but lower contiguity (Table 2, Fig. 4). Together, this shows that assemblies generated with the Ultra-Low input workflow and long read sequencing can reach or surpass the level of quality of assemblies obtained with the standard or Low-Input workflow.

genomes are also on par with the previously best reference genomes for Collembola, F. candida and Sinella curviseta, which were DNA sequenced from hundreds of specimens maintained in culture [7,8]. Sminthurides aquaticus even achieves the highest N50 and N75 among the compared

Genome annotation
assemblies. The quality of the new assemblies makes us consider that there are even further benefits in the ultra-low input protocol than sequencing organisms too small for WGA-free approaches. For species that are not too small, it can be used to generate long read data from a fraction of the total DNA. This could be leveraged to implement approaches combining long reads and Hi-C for species even smaller than a fruit fly [4]. It can also make it possible to retain the sequenced specimen to serve as a voucher, by removing the need to crush the specimen to maximize hmw DNA recovery.

Ensuring taxonomic quality
It is essential that a reliable reference genome is supported by a solid and revisable taxonomy, to be useful for any meaningful downstream analysis. Taxonomy quality has always been an issue of sequence databases [55,56]. This is especially true for field-collected specimens from taxonomically poorly known groups that are often riddled with cryptic diversity and difficulty of species identification based on a few subtle characters. Therefore, we documented species

Sminthurides aquaticus was originally described from France and has been recognized to be widely spread throughout the Holarctic region. We confirmed that all our collected specimens are identical to the accepted descriptions of S. aquaticus. The species was originally described by Bourlet in 1841 probably from the north of France. However, Bourlet did not make any reference to a type series, and to our knowledge did not preserve any specimens. We consider that the population we sampled in Paris is suitable to provide a neotype for this species: the population is abundant, settled, and easily accessible for further studies. This also offers the uncommon opportunity to have a neotype closely related to the reference genome for the species.

is expected to be more mobile due to its more developed legs, furca and eye-patch (F. candida is eyeless). This suggests that antibiotic synthesis is specific to a true soil-dwelling (euedaphic) lifestyle, and it might be lost by more mobile species.

The LOEWE-TBG excellence cluster supports the idea of the EBP that aims to sequence all eukaryotic species. Although the first high-quality genomes were generated for species with easy access to abundant and fresh samples, similar high-quality genomes can now be generated for tiny taxa or taxa that are otherwise difficult to sequence. Most of known eukaryotic biodiversity belongs to very small metazoans, which in addition need to be preserved for some time before genome sequencing. Access to their genomes provides insights into the formation, maintenance and functioning of eukaryotic biodiversity, and presents new opportunities for natural resource management and bioprospecting. The ability to genome-sequence these species is essential for the success of biodiversity genomics initiatives. Our genomes sequenced from 5 ng DNA actually exceed the 1 Mb N50 contig continuity required by the EBP when more than 100 ng DNA are available. We are convinced that integrating high-quality genomics with the typical workflow of small, field-collected metazoans is an essential approach toward the creation of a solid reference genome database for millions of minute non-model species belonging to taxonomically challenging groups.
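For readers less familiar with the contig N50 metric referred to above, the following minimal Python sketch shows how it is computed and compared with the 1 Mb threshold. The function and constant names are hypothetical and the contig lengths are toy values, not the lengths of the assemblies reported here.

```python
EBP_MIN_CONTIG_N50 = 1_000_000  # 1 Mb contig N50, as cited in the text


def contig_n50(lengths):
    """Return the contig N50: the length of the shortest contig in the
    smallest set of longest contigs that covers at least half of the
    total assembly length."""
    if not lengths:
        raise ValueError("empty assembly")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy assembly: total length 20, the two longest contigs (8 + 5) reach half.
toy_contigs = [2, 8, 3, 5, 2]
print(contig_n50(toy_contigs))                  # 5
print(8_780_000 >= EBP_MIN_CONTIG_N50)          # reported S. aquaticus N50 vs. threshold
```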

The genomes will contribute to the European Reference Genome Atlas and the Earth BioGenome Project. The present study is a collaboration between the LOEWE-TBG and the Max Planck Genome-centre Cologne. It was supported through the programme "LOEWE - Landes-Offensive zur Entwicklung Wissenschaftlich-ökonomischer Exzellenz" of Hesse's Ministry of Higher Education, Research, and the Arts. We highly appreciate the generous support by Pacific Biosciences with respect to the ultra-low amplification kit, library preparation kit as well as SMRT cells and sequencing chemistry during the course of the beta test. The Max-Planck Genome Center Cologne acknowledges the support from the Max-Planck Society. We give our warm thanks to Tilman Schell for his advice on genome assembly. We thank Dr. Arong Luo and Dr. Mahul Chakraborty for reviewing our work; their suggestions and corrections improved the quality of the manuscript.