A chromosomal-level genome assembly for the giant African snail Achatina fulica

Abstract Background Achatina fulica, the giant African snail, is the largest terrestrial mollusk species. Owing to its voracious appetite, wide environmental adaptability, high growth rate, and reproductive capacity, it has become an invasive species across the world, mainly in Southeast Asia, Japan, the western Pacific islands, and China. This pest can damage agricultural crops and is an intermediate host of many parasites that can threaten human health. However, genomic information of A. fulica remains limited, hindering genetic and genomic studies for invasion control and management of the species. Findings Using a k-mer–based method, we estimated the A. fulica genome size to be 2.12 Gb, with a high repeat content up to 71%. Roughly 101.6 Gb genomic long-read data of A. fulica were generated from the Pacific Biosciences sequencing platform and assembled to produce a first A. fulica genome of 1.85 Gb with a contig N50 length of 726 kb. Using contact information from the Hi-C sequencing data, we successfully anchored 99.32% contig sequences into 31 chromosomes, leading to the final contig and scaffold N50 length of 721 kb and 59.6 Mb, respectively. The continuity, completeness, and accuracy were evaluated by genome comparison with other mollusk genomes, BUSCO assessment, and genomic read mapping. A total of 23,726 protein-coding genes were predicted from the assembled genome, among which 96.34% of the genes were functionally annotated. The phylogenetic analysis using whole-genome protein-coding genes revealed that A. fulica separated from a common ancestor with Biomphalaria glabrata ∼182 million years ago. Conclusion To our knowledge, the A. fulica genome is the first terrestrial mollusk genome published to date. The chromosome sequence of A. fulica will provide the research community with a valuable resource for population genetics and environmental adaptation studies for the species, as well as investigations of the chromosome-level of evolution within mollusks.


Fig. 3
I would suggest to show the genome assembly comparison data in a table, not in a scatter plot. In general, scatter plot is used to see the correlation between two variables. This figure is not adequate to compare genom assemblies because 1) correlation between contig and scaffold N50s is not meaningful 2) most of the dots are put at the lower left and indistinguishable. In addition, references should be cited when the authors used these genome data in the study. Reply: Thanks a lot for the suggestion. We have changed the Figure 3 into Table 3 and added the references in the revised manuscript.
Lines 232-235, Fig. 5 What kinds of fossil record were used for molecular clock calibration? Honestly speaking, I cannot believe the result (Fig.5 showing Spiralia diverged from Ecdysozoa 831 Mya (200 million years before the Ediacaran Period). Reply: Thank you very much for the reminding. However, we re-estimated the divergence time among these species using the records for Protostomia and Mollusca downloaded from www.timetree.org and obtained the similar results (the figure below was downloaded from the place). Thus we believe the results might be reliable. The new results and the calibration information were updated in the revised ms. (lines 258-261 and fig 5) Version information of all software used are needed. Reply: Thank you very much for the reminding. All the version information available has been added in the revised ms.
Overall, this appears to be a well put together genome encompassing large amounts of data from different sources, includ long reads from PacBio and additional scaffolding from Hi-C. It is quite well presented and I"m sure this work will be usefu the community as a genomics resource. Nonetheless there are a few issues that I"d like to see resolved before the manuscript can be accepted for publication or the assembly is released into the public repositories.

Major comments
Contamination. There is no mention in the text of filters for possible contamination from non-target organisms in the sequencing data. I consider such an analysis to be a vital and necessary component of any genome project, to eliminate ( much as possible) errors from contaminating sequencing reads in sequence databases. Tools such as Blobtools (https://drl.github.io/blobtools/) are easy to implement and are highly informative as to the quality of the raw data and th final genome. Reply: Thanks for the reviewer"s reminding. Actually we did the contamination analysis at the step Survey since the DNA samples in Survey and Assembly was identical. In the survey step, we randomly extracted 10,000 pairs of short reads, an compared them to the nt database, and find no obvious external contamination from other species. This method has been described elsewhere (https://doi.org/10.1016/j.molp.2014.12.011) and we did not mention it since it performed as expected. The result of contamination analysis has been added in the revised ms (lines 154-155 in the revised ms).
Kmer analysis. There is much discussion about estimation of genome size from kmer analysis, but there is no kmer spectr presented. I would find this figure much more informative and useful than some of the figures that are included (e.g. 2 an 3). Reply: Thank you very much for your suggestions. The kmer spectra has been added in the revised version ( Figure 2).
Heterozygosity. Related to the above point: how did the authors resolve any regions containing heterozygous sites in the assembly? E.g., divergent allelic regions that might be co-assembled and both present in the final scaffolds? Reply: Thank you very much for you reminding. By mapping the subreads back to the genome, we estimated the sequenc depth for each region of the assembly and the results were shown below (the GC content were also shown, 10k window). shows that the distribution of the depth is unimodal, which means that almost all sites were homozygous, actually the heterozygosity of the species is not very high (<0.5%). And if there are too much divergent allelic regions, two peaks will obvious. -Transcriptome / RNA-seq. Table 1 shows 22.5Gb of transcriptomic reads but very little information is given about these da How they were generated and filtered, and then how they were used during the annotation process needs more details. Reply: Thank you very much for your reminding. The information has been added in the revised version (lines 106-124, lin 229 in the revised ms).
Language. Overall the manuscript is well written, but there were many cases of grammatical errors and/or small typos, to many to catch them all in the minor comments below (I mostly stopped after the abstract). Thus, the manuscript would benefit from a proof-read to correct these small mistakes in English, it would not be a big task. Reply: Thank you very much for your reminding. We corrected errors and typos thorough the manuscript in the revised version. the revised ms. Figure 1: "Figure 1. A picture of A. fulicathat INDIVIDUAL used for genome sequencing and assembly" Reply: We have corrected it in the revised ms. Figure 2: I struggle to extract anything useful from this figure, but I am not familiar with Hi-C data so maybe it"s just me Reply: The assumption of Hic is that the crosslinking signals are more strong as the loci located in a chromosome are mor closer. Thus ideally the contact matrix should be around the diagonal line, just as is shown in the figure (figure3 in the rev ms). Figure 3: Again, I"m not convinced this figure is very informative, as it currently is. For example, the majority of (unlabelle points all overlap somewhere near the X-Y intercept, with only three outwith this cluster. Then the size of the points and their colour appear to convey the same information -why twice? I think the point of the figure is to demonstrate the high contiguity of A. fulica genome compared to other mollusc genomes, but does plotting scaffold N50 versus contig N50 really achieve this? Better would be to plot cumulative assembly span curves, i.e. number of scaffolds on X vs cumulative span o Y Reply: Thank you very much for your suggestions. We have deleted the figure and listed these parameters such as scaffol N50 and contig N50 in Table 3 for comparison in the revised ms. Figure 4: It is interesting that exon length is so conserved, but intron lengths are much more variable. Is there any eviden that intron lengths are bimodally distributed? Reply: Bimodal distribution of the intron lengths was rarely reported. It is not surprise that the intron lengths is more variable than exon since the latter one is much more conservative than the former.
Reviewer #3: I thank the authors for the work presented on the manuscript "A chromosomal-level genome assembly for t giant African snail Achatina fulica". It is a great contribution for future studies of mollusk genomics and for the study of th molecular basis of invasiveness. I just have a few recommendations and comments.
1-) I would like to see the kmer distribution plot presented on the manuscript. It helps future researchers to understand th composition of this mollusk genome, and to plan future projects. Reply: Thank you very much for your suggestions. In the revised ms, we have added the kmer spectra as Figure 2. 2-) On lines 133-137: Canu and Falcon are both good assemblers generating high quality data. After deciding to move forward with the Falcon assembly, I would like to know why the authors have decided not to run FALCON-Unzip on the assembly? The phasing of haplotypes has been shown to help avoid assembly errors in genomic areas of complex structur variation between haplotypes. Even though the further analysis (mapping quality, etc) show the assembled genome to be good shape, it would be a good standard practice to run Falcon-Unzip before HiC scaffolding. Reply: Thank you very much for your suggestions and we strongly agree with you. We believe that using Falcon-Unzip will generate a high-quality genome, especially the heterozygosity of the species is very high (>1% for example). However, w used FALCON here by considering that the heterozygosity of the species is not very high (0.47%).
3-) After Lanchesis, around 1000 contigs were not placed into chromosomes. Have you investigated the composition of su contigs? Can you present also the size distribution of them? Reply: Thank you very much for your suggestions. We found that the average gene length is much shorter for contigs unanchored to chromosomes than the anchored ones (67.6 bp/kb vs 341.5 bp/kb), whereas the average length of repeat length is just the reverse. Out of the 1467 unanchored contigs, a total of 210 are longer than 10kb, with the longest one i 6,839 kb. And the size distribution of the unanchored contigs short than 10 kb is as follows: 4-) The sequencing of the transcriptome with IsoSeq technology was only briefly mentioned. Could you describe the evaluation of such transcripts in a few lines? For example, was it possible to find full-length transcripts sequenced? Reply: Thank you very much for your suggestions. In this study, a number of 553,889 Full-length Non-chimeric sequences (FLNC) representing 23,726 gene loci were obtained. However, the 5" end of the mRNA might be degraded before sequencing and we could not detect it as we did for the 3" end since a polyA tail is a sign of completeness for the latter on To evaluate the completeness of the isoforms, we compared them to the predicted mRNAs from genome sequences and found that 70.37% of the multi-exon FLNCs were really full-length sequences. ((lines 106-124 in the revised ms)) 5-) Finally, just a last read to review the English would be advised. Two examples of misspelling: The tittle on line 409. An 'fro' on line 223. Reply: Thank you very much for your reminding. We hope that all mistakes have been corrected in the revised version.