A chromosome-level reference genome assembly and a full-length transcriptome assembly of the giant freshwater prawn (Macrobrachium rosenbergii)

Abstract The giant freshwater prawn (Macrobrachium rosenbergii) is a key species in the aquaculture industry in several Asian, African, and South American countries. Despite a considerable growth in its production worldwide, the genetic complexities of M. rosenbergii various morphotypes pose challenges in cultivation. This study reports the first chromosome-scale reference genome and a high-quality full-length transcriptome assembly for M. rosenbergii. We employed the PacBio High Fidelity (HiFi) sequencing to obtain an initial draft assembly and further scaffolded it with the chromatin contact mapping (Hi-C) technique to achieve a final assembly of 3.73-Gb with an N50 scaffold length of 33.6 Mb. Repetitive elements constituted nearly 60% of the genome assembly, with simple sequence repeats and retrotransposons being the most abundant. The availability of both the chromosome-scale assembly and the full-length transcriptome assembly enabled us to thoroughly probe alternative splicing events in M. rosenbergii. Among the 2,041 events investigated, exon skipping represented the most prevalent class, followed by intron retention. Interestingly, specific isoforms were observed across multiple tissues. Additionally, within a single tissue type, transcripts could undergo alternative splicing, yielding multiple isoforms. We believe that the availability of a chromosome-level reference genome for M. rosenbergii, along with its full-length transcriptome, will be instrumental in advancing our understanding of the giant freshwater prawn biology and enhancing its molecular breeding programs, paving the way for the development of M. rosenbergii with valuable traits in commercial aquaculture.


Introduction
The giant freshwater prawn (Macrobrachium rosenbergii), the largest prawn within the Macrobrachium genus, holds significant economic importance as an aquaculture species in several Asian, African, and South American countries (Yang et al. 2012;Cao et al. 2017).Macrobrachium rosenbergii serves as a healthy dietary source for protein, lipids, and carbohydrates (Marques et al. 2016;Wu et al. 2021).Despite a substantial global production increase over the past 2 decades, reaching 273,738 tons in 2019 (FAOSTAT website; Jiang et al. 2020), its cultivation faces challenges mainly arising from the existence of various morphotypes and their social interactions.In general, male individuals exhibit faster growth compared with their female counterparts (Liu et al. 2015).However, mature males of different morphotypes such as small male, orange claw, blue claw, and old blue claw display a wide range of size variations (Supplementary Fig. 1; Tang et al. 2020;Ibrahim et al. 2023).Previous attempts to unravel the genetic regulations of these different morphotypes involved employing short-read RNA sequencing (RNA-seq) techniques.These studies aimed to investigate the mechanisms underlying growth differences among male morphotypes (Tang et al. 2020), identify candidate genes involved in sex development (Ren et al. 2021), and study the effects of androgenic gland ablation on gonad development (Ying et al. 2022) and of transcriptome changes on immune response exposure and stress (Ding et al. 2018;Gao et al. 2020;Ou et al. 2021).Unfortunately, the full exploitation of gene expression data was hindered by the absence of a high-quality reference genome assembly.
In this study, we present the first chromosome-scale reference genome assembly along with the high-quality full-length transcriptome assembly in M. rosenbergii.We employed the HiFi long-read PacBio sequencing to generate the preliminary assembly, which we further scaffolded to the chromosome level using the chromatin contact mapping (Hi-C) approach.The transcriptome assembly, derived from 13 organs and hemocytes (HCs), utilized long-read Pacific Biosciences (PacBio) isoform sequencing (Iso-seq).The PacBio Iso-seq platform captures full-length transcripts from their 5′ ends to poly(A) tails in single long reads.This approach has proven successful in obtaining reference transcriptome assemblies for various organisms lacking reference genome sequences (Pootakham et al. 2020;Carneiro et al. 2021;Guo et al. 2022).We believe that this highquality reference genome and transcriptome assemblies serve as valuable resources for genomic studies and breeding programs for M. rosenbergii.

Sample collection
Macrobrachium rosenbergii shrimps were cultured at the Aquaculture Genetics Research and Development Center, Pathum Thani, Thailand.For whole-genome sequencing, muscles were collected from a 5-month-old male shrimp and immediately frozen in liquid nitrogen and stored at −80°C until use.For transcriptome sequencing, a total of 12 organs and HCs were collected from three 5-monthold male shrimps and ovaries from three 4-month-old female shrimps.The following organs were dissected and immediately frozen in liquid nitrogen: gill (GI), heart (HE), testis (TT), ovary (OV), androgenic gland (AG), hepatopancreas (HP), stomach (ST), intestine (IN), thoracic ganglion (TG), abdominal ganglion (ABG), eyestalk (ES), pleopods (Pl), and muscle (MU).Hemolymph was centrifuged at 3,000 rpm at 4°C for 5 min to pellet the HCs.All samples were stored at −80°C until RNA extraction.This study was performed in accordance with the recommendations of Animal Research Ethics Guidelines, and the protocol was approved by the National Center of Genetic Engineering and Biotechnology Animal Research Ethics Committee (approval number BT-IACUC-RF01-10-01).

DNA and RNA extraction
Frozen muscle tissue was ground in liquid nitrogen, and genomic DNA was extracted using a Genomic Tip 100/G kit (Qiagen, USA), as previously described (Angthong et al. 2020).DNA quantity was measured using a NanoDrop ND-8000 spectrophotometer and a Qubit dsDNA BR Assay kit (Invitrogen, USA) using a Qubit fluorometer.The DNA quality and integrity were visualized under pulsed field gel electrophoresis at 80 V for 9 h in 0.5× KBB buffer (51 mM Tris, 28 mM TASP, 0.08 mM EDTA, pH 8.7; Sage Science, USA) containing SYBR Safe DNA gel staining (Invitrogen).Total RNA was extracted using the TRI-reagent according to the manufacturer's instruction (Molecular Research Center, USA) and DNase treated with RNase-free DNase (0.5 U/µg total RNA at 37°C for 30 min; Promega, USA).The treated RNA was quantified using a Qubit fluorometer (Invitrogen), and RNA quality was assessed by agarose gel electrophoresis and using Agilent 2100 Bioanalyzer (Supplementary Table 1; Agilent, Santa Clara, CA, USA).

Genome and transcriptome sequencing
For the reference genome sequencing, the PacBio SMRTbell library (∼20 kb) for PacBio Sequel was constructed using SMRTbell Express Template Prep Kit 2.0 (PacBio, Menlo Park, CA, USA) and following the manufacturer-recommended protocol.The library was bound to polymerase using Sequel II Binding Kit 2.0 (PacBio) and loaded onto PacBio Sequel II.Sequencing was performed on PacBio Sequel II 8M SMRT cells.For transcriptome sequencing, 1 µg of DNase-treated RNA sample from each tissue was used for cDNA synthesis with a SMARTer PCR cDNA Synthesis kit (Clontech, USA).Iso-seq library construction was carried out using a SMRTbell Template Prep Kit 1.0-SPv3 protocol (Pacific Biosciences, Menlo Park, CA, USA).Samples from each tissue were barcoded, pooled into a single library, and sequenced on the PacBio Sequel II platform.The PacBio sequencing was outsourced to Omics Drive, Singapore.

Hi-C (Dovetail Omni-C) library preparation and sequencing
A chromosome conformation capturing technique (Hi-C) was conducted by Dovetail Genomics (Scott Valley, USA) to scaffold M. rosenbergii preliminary assembly into a chromosome-level assembly.A DNase Hi-C library was prepared, as previously described in Ma et al. (2018).Briefly, chromatin was fixed in place with formaldehyde in the nucleus and then extracted.The fixed chromatin was digested with DNAseI, and chromatin ends were repaired and ligated to a biotinylated bridge adapter, followed by proximity ligation of adaptercontaining ends.After proximity ligation, cross-links were reversed and the DNA purified from protein.The purified DNA was treated to remove biotin that was not internal to the ligated fragments.Sequencing libraries were generated using NEBNext Ultra enzymes and Illumina-compatible adapters.Biotin-containing fragments were isolated using streptavidin beads before the process of PCR enrichment of each library.The libraries were sequenced on an Illumina HiSeq X to produce 116,073,039 read pairs.

De novo assembly and Hi-C (Omni-C) scaffolding of the M. rosenbergii genome
A total of 1,228 Gb PacBio raw reads (301.1×coverage) were subjected to trimming and read correction, resulting in 58.39 Gb high-accuracy circular consensus sequencing (CCS) data (14.3×).The PacBio CCS reads were used as an input to Hifiasm v0.15.4-r347 with default parameters (Cheng et al. 2021).The blast results of the Hifiasm output assembly against the nt database were used as input for blobtools v1.1.1 (Laetsch and Blaxter 2017), and scaffolds identified as possible contamination were removed from the assembly.We applied an error correction tool racon (https://github.com/isovic/racon) to improve the quality of the reads in low-coverage regions.Finally, purge_dups v1.2.5 (Guan et al. 2020) was used to remove haplotigs and contig overlaps.A PacBio draft assembly was used as an input for the subsequent scaffolding with HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al. 2016).
The input de novo assembly and Dovetail Omni-C library reads (136.7 Gb, 33.5×) were used as input data for HiRise, a software pipeline designed specifically for using proximity ligation data to scaffold genome assemblies (Putnam et al. 2016).Dovetail Omni-C library sequences were aligned to the draft input assembly using bwa (Li and Durbin 2010).The separations of Dovetail Omni-C read pairs mapped within draft scaffolds were analyzed by HiRise to produce a likelihood model for genomic distance between read pairs, and the model was used to identify and break putative misjoins, to score prospective joins, and to make joins above a threshold.

Genome assembly evaluation
The quality of the final genome assembly was evaluated by aligning transcriptome (Iso-seq) data from 14 tissues (see the Sample collection section) using minimap version 2.17 (Li 2018).In addition, the presence and the completeness of the orthologs were determined using the Benchmarking Universal Single-Copy Orthologues (BUSCO) version 5.4.4 (Simão et al. 2015) and the Arthropoda OrthoDB release 10 ( Kriventseva et al. 2015).

Identification of repetitive sequences and gene annotation
The following pipeline was employed to identify repetitive sequences in M. rosenbergii, Macrobrachium nipponense, Penaeus monodon, Hyalella azteca, Litopenaeus vannamei, and Parhyale hawaiensis.Repbase library (Bao et al. 2015) was employed to identify repeats de novo from the genome assembly using RepeatModeler (Flynn et al. 2020).RepeatMasker (Tempel 2012) was subsequently used to identify repeats in the genome assembly and mask them by lowercase letters (soft masking).
To annotate the protein-coding sequences within our research, we utilized the EVidenceModeler (EVM) software, version 1.1.1,as described by Haas et al. (2008).This tool allows for the integration of different types of evidence into a cohesive annotation system.Our approach combined homology-based, RNA-based, and ab initio prediction methods for annotating the genome assembly.The IsoSeq3 pipeline (https://github.com/ylipacbio/IsoSeq3)was used to process SMRT-sequencing (Iso-seq) data from 14 different tissues.Briefly, the IsoSeq3 pipeline removed the primer sequences, trimmed the adapter/barcode sequences, and filtered out the concatemers as well as immature transcript sequences without the poly(A) tails.Polishing was carried out using the tool polish in the Isoseq3 pipeline, and only high-quality reads supported by at least 2 full-length nonchimeric reads with a predicted sequence accuracy of >99% were used in the downstream analyses.For the transcript-based predictions, Iso-seq reads were grouped into clusters at 97% similarity with CD-HIT (version 4.8.1;Fu et al. 2012).The longest open-reading frame (ORF) from each cluster was selected for further analysis.These selected ORFs were then aligned to the genome assembly using PASA version 2.5.3 (Haas et al. 2008) and GMAP version 2020-09-12 (Wu and Watanabe 2005).We also utilized protein sequences from related species such as H. azteca (GCF_000764305.2),L. vannamei (GCF_003789085.1),P. monodon (GCF_015228065.2),and M. nipponense (GCA_015104395.2).These sequences were aligned to our genome assembly using the Analysis and Annotation Tool (AAT), as outlined by Huang et al. (1997).For the ab initio predictions, we employed the Augustus software (version 3.2.1;Stanke et al. 2004) trained with the data obtained from H. azteca, L. vannamei, P. monodon, and M. nipponense and incorporated the PASA transcript alignments for further refinement.In the final step, EVM integrated these 3 types of evidence-assigning weights of 5 to PASA, 1 to GMAP, 0.5 to AAT, and 0.1 to Augustus predictions-to produce the consensus gene models.
Protein functions were predicted using InterProScan (Blum et al. 2021).Gene ontology (GO) terms and Kyoto Encyclopedia of Genes and Genomes pathway annotation were assigned to the sequences (Moriya et al. 2007;Kanehisa and Sato 2020).Infernal version 1.1 (Nawrocki and Eddy 2013) was used to perform a homology search and annotate noncoding RNA-seqs.

Identification of alternative splicing events
To explore alternative splicing events in M. rosenbergii, we employed the TAPIS pipeline (version 1.2.1;Abdel-Ghany et al. 2016) and SpliceGrapher program (Rogers et al. 2012) to identify the following events: alternative 5′ donor-site selection, alternative 3′ acceptor-site selection, exon skipping, and intron retention.

Phylogenetic analyses and comparative genomics
OrthoFinder version 2.4.0 (Emms and Kelly 2019) was employed to extract 210 single-copy orthologous proteins.Subsequently, gene alignments were performed using MUSCLE version 3.8 (Edgar 2004).The best-fit models of amino acid substitution were estimated, and a maximum-likelihood phylogenetic tree was reconstructed using Mega version 10.1.8(Tamura et al. 2011).The tree's robustness was assessed through 1,000 replicates of bootstrap analysis utilizing the WAG + G + F substitution model based on Mega's outcomes.

Genome assembly of M. rosenbergii
To acquire a high-quality genome assembly, we employed the PacBio HiFi sequencing technology to produce 1,228 Gb (∼301-fold coverage of the estimated genome size) of raw reads (read number: 4,259,973; average N50: 13,950 kb).Genome assembly was carried out using Hifiasm, and the preliminary genome of 3.73 Gb was generated with a 249-kb contig N50.Our assembly results demonstrated that the size of the assembled giant freshwater prawn was relatively close to the estimated genome size by the flow cytometry (4.08 Gb; Levy et al. 2019).To further improve the contiguity of our preliminary assembly, we employed the in vivo chromosome fixation technique (Hi-C) based on the information from 116,073,039 read pairs (136.7 Gb of Hi-C data) to generate a chromosome-level assembly.The final assembly encompassed 3,739,422,449 bases with an N50 scaffold length of 33.6 Mb (Table 1) and contained 59 pseudomolecules (hereafter referred to as chromosomes numbered according to sizes), corresponding to the haploid chromosome number in M. rosenbergii (2n = 108; Fig. 1; Supplementary Fig. 2).
We performed an evaluation of our assembly quality by aligning Iso-seq reads to the assembled genome, and 96.33% of M. rosenbergii Iso-seq transcripts were mapped to the genome.To further assess the completeness of the gene space in each assembly, we used the BUSCO software to check the gene content using an ar-thropoda_odb10 database (1,013 genes; Simão et al. 2015).Our gene prediction recovered 90.9% of the highly conserved orthologs in the Arthropod lineage with 94.8% identified as "complete and single-copy," 2.7% as "complete and duplicated" and 3.9% as "fragmented," while 5.2% of the conserved orthologs were missing from the assembly (Table 1).This evidence supported the high-quality assembly of the M. rosenbergii genome.

Gene annotation
We integrated 3 different approaches including an ab initio prediction, a homology-based search, and a transcript-based prediction in the annotation pipeline to predict 34,203 gene models and 29,181 protein-coding genes in M. rosenbergii.Gene sequences in M. rosenbergii were preferentially distributed near the telomeres for most of the chromosomes (Fig. 1; Supplementary Tables 2 and 3).The top 3 most Reference genome for M. rosenbergii | 3 prevalent GO terms associated with cellular components were nucleus, cytoplasm, and cytosol (Supplementary Fig. 3), while the largest categories of genes annotated to molecular function were protein binding, protein homodimerization activity and ATP hydrolysis activity.The most common terms for biological processes were multiorganism reproductive process, obsolete regulation of cellular macromolecule biosynthetic process, and obsolete cellular macromolecule metabolic process (Supplementary Fig. 3).In addition, noncoding RNAs were also annotated using homology searches, and there were 1,252 ribosomal RNA, 1,957 transfer RNA, 802,741 microRNA, and 12,981 small nuclear RNA (Supplementary Table 4).

Identification of repetitive elements
We employed a combination of a de novo repeat identification tool, RepeatModeler, and homology search tools to analyze repetitive sequences in M. rosenbergii, M. nipponense, P. monodon, H. azteca, L. vannamei, and Pa.hawaiensis.The total length of repetitive sequences accounted for 57.54% of the assembly (2.15 Gb; Table 2), which was higher than the proportion of repetitive sequences observed in M. nipponense (45.6%;Jin et al. 2021).Interestingly, simple sequence repeats represented the majority of classified repetitive elements in the M. rosenbergii genomes, comprising almost a quarter of the assembly (904 Mb) and 42% of the total repeat contents (Fig. 2).Likewise, a noticeably large proportion of simple  Retrotransposons also occupied a significant portion of the M. rosenbergii genome (443 Mb; 11.86%), with long terminal repeats (LTRs) being the most prevalent class (54% of all retrotransposons).The LTRs could be categorized into 2 main classes: the Copia-and Gypsy-like elements, which represented 0.24 and 10.27% of the total repeat contents, respectively (Table 2).

Comparative genomics and phylogenetic analyses
To determine the phylogenetic relationship between M. rosenbergii and related species, we analyzed sequence information from single-copy orthologous genes from 10 Arthropod species: Daphnia magna, Daphnia pulex, Drosophila melanogaster, Tigriopus californicus, Eurytemora affinis, H. azteca, M. nipponense, M. rosenbergii, L. vannamei, and P. monodon.A maximum-likelihood phylogenetic tree obtained from clustering 251,370 proteins into 27,192 orthologous groups (251,370 input proteins from 10 species) suggested that M. rosenbergii had a close relationship with M. nipponense as well as 2 penaeid shrimps, L. vannamei and P. monodon (Fig. 3).

Identification of alternative splicing variants
The availability of both the chromosome-scale assembly and the full-length transcriptome assembly allowed us to thoroughly  investigate the alternative splicing events in M. rosenbergii.We utilized the TAPIS pipeline (Abdel-Ghany et al. 2016) along with the SpliceGrapher program (Rogers et al. 2012) to thoroughly identify transcript variants exhibiting the following alternative splicing events: alternative 5′ donor-site selection, alternative 3′ acceptorsite selection, exon skipping, and intron retention.Our analysis revealed a total of 2,041 alternative splicing events in M. rosenbergii (Fig. 4a).Of these events, exon skipping was the most frequent, accounting for more than half (50.6%) of the total splicing events.The alternative 5′ donor-site selection and alternative 3′ acceptorsite selection appeared to be the least prevalent modes of alternative splicing, representing 12.7 and 12.0% of the total events identified, respectively.Finally, the occurrence of intron retention was observed in 24.6% of all alternative splicing events identified.Within individual genes, we observed different patterns of alternative splicing events in a combinatorial manner.Figure 4b depicts examples of these transcript isoforms containing multiple types of alternative splicing events.It is interesting to note that a specific isoform could be detected in multiple tissues.Furthermore, within a single tissue type, transcripts could undergo alternatively splicing, resulting in multiple isoforms.

Discussion
Macrobrachium rosenbergii is an important aquaculture species widely cultivated in Southeast Asia.In our study, we successfully sequenced and assembled the reference genome of M. rosenbergii, utilizing the PacBio HiFi sequencing together with the Hi-C technique.We achieved a chromosome-scale assembly of the giant freshwater prawn genome encompassing a total of 3.73 Gb.Our assembly contains 59 pseudochromosomes corresponding to M. rosenbergii's haploid chromosome number.The completeness of the gene space evaluated using BUSCO software demonstrated that our assembly contained 90.9% of the highly conserved orthologs in the Arthropod lineage.Compared with a related Macrobrachium species, the number of predicted genes in the M. rosenbergii (34,203) genome was lower than that in M. nipponense (44,086).The assembly revealed no evidence supporting a recent whole-genome duplication event in M. rosenbergii, contrary to a previously reported occurrence in M. nipponense (Jin et al. 2021).One remarkable feature of the M. rosenbergii genome was the unusually high occurrence of repetitive sequences, which occupied roughly 57% of the assembly (2.05 Gb).We observed a similar expansion of the repeatome in the oriental river prawn (M.nipponese; 2.11 Gb).The substantial accumulation of transposable elements in the Macrobrachium species likely contributes to the expansion of their genomes compared with the related penaeid shrimps (Fig. 3).In M. rosenbergii, the main category of repetitive elements that occupied almost half of its repeatome was the simple sequence repeats (42%; 904 Mb), comparable with the percentages observed in L. vannamei (51%; 453 Mb) and P. monodon (38%; 572 Mb).
The availability of our full-length transcript data from multiple tissues enabled us to investigate different forms of alternative splicing events, which have been known to regulate cell differentiation, development, and stress adaptation (Singh and Ahi 2022).Similar to the observations in Pacific white shrimp (Zhang et al. 2023), Pacific oyster (Huang et al. 2016), and fruit fly (Jakšic ́ and Schlötterer 2016), exon skipping was the predominant alternative splicing event in the giant freshwater prawn.However, while intron retention represented the least frequent events in Pacific white shrimp, Pacific oyster, and fruit fly, alternative 3′ acceptorsite and alternative 5′ donor-site selections were the least prevalent modes of alternative splicing in M. rosenbergii.We strongly believe that the availability of a chromosome-scale reference genome for M. rosenbergii will play a pivotal role in advancing our understanding of the giant freshwater prawn biology and greatly facilitating its molecular breeding programs that ultimately lead to the development of M. rosenbergii with commercially desirable characteristics.

Fig. 1 .
Fig. 1.A genomic landscape of M. rosenbergii.a) A physical map of 59 pseudomolecules (chromosomes) numbered according to size (in Mb).b) Repeat content represented by the proportion of genomic regions covered by repetitive sequences in 250-kb windows.c) Gene density represented by the number of genes in 250-kb windows.d) GC content represented by the percentage of G + C bases in 250-kb windows.

Fig. 3 .
Fig. 3.A maximum-likelihood tree of M. rosenbergii and related species constructed based on single-copy orthologous protein sequences.The numbers at each node represent the bootstrap values.The scale bar shows the number of substitutions per site.

Fig. 4 .
Fig. 4. Alternative splicing in M. rosenbergii.a) The distribution of alternative splicing events.Alt 3′, alternative 3′ acceptor-site selection; Alt 5′, alternative 5′ donor-site selection; IR, intron retention; SE, skipped exon.b) Examples of transcripts with alternative splicing.Gene models displaying possible splicing events are shown at the top, and aligned Iso-seq reads are shown at the bottom.Protein OPI10 homolog and RNA-binding protein squid-like isoform X10 serve as examples of genes that undergo multiple alternative splicing events, with the tissue source of each Iso-seq read indicated.Various isoforms are distinguished by different colors.

Table 1 .
Assembly statistics of the M. resenbergii genome a .
a Sequences generated by the initial assembly of PacBio HiFi reads are in contigs, whereas those generated by Hi-C scaffolding are in scaffolds.

Table 2 .
Repeat contents in the M. rosenbergii genome.
sequence repeats was observed in P. monodon and L. vannamei.