Genome survey, high-resolution genetic linkage map construction, growth-related quantitative trait locus (QTL) identification and gene location in Scylla paramamosain

Scylla paramamosain is one of the most economically important crabs in China. In this study, the first genome survey sequencing of this crab was performed, and the results revealed that the estimated genome size was 1.21 Gb with high heterozygosity (1.3%). Then, RAD technology was used to construct a high-resolution linkage map for this species. A total of 24,444 single nucleotide polymorphism (SNP) makers were grouped into 47 linkage groups. The total length of the linkage groups was 3087.53 cM with a markers interval of 0.92 cM. With the aid of transcriptome and genome scaffold data, 4,271 markers were linked to genes, including several important growth-related genes such as transforming growth factor-beta regulator I, immune related-gene C-type lectin and ecdysone pathway gene broad-complex-like protein. Further, 442 markers, representing 279 QTLs, associated with 24 traits were identified, and of these markers, 78 were linked to genes. Some interesting genes, such as dedicator of cytokinesis protein 3, tenascin-X and DNA helicase MCM8, were believed to have important relationship with specific traits and merit further exploration. The results of this study will accelerate the genetic improvement and genome sequencing analysis of the mud crab.

Mapping family and data collection. The F1 full-sib family for linkage map construction was created by two parents from the wild population of Hainan Province, China. The mapping population was reared at Wei-Er-Si Aquafarming Company of Rudong in 2017. In total, 99 progeny were randomly selected after being reared for 100 days in the same pond. Twenty-six traits, representing sex, carapace, cheliped, pereiopod and swimming stroke, were measured for each individual according to the measurement method described by Keenan et al. 22 . The measurement data are provided in the Supplementary Table 1.
Muscle tissues were sampled and preserved immediately in liquid nitrogen. Total DNA was extracted using a TIANamp Marine Animal DNA Extraction kit (TIANGEN, Beijing, China). DNA concentration was determined using GeneQuant (Amersham Biosciences Ltd., Piscataway, NJ, USA), and integrity was evaluated via electrophoresis in a 1% agarose gel.
RAD library construction and Illumina HiSeq3000 sequencing. Restriction site-associated DNA marker (RAD) library construction, sample indexing and pooling followed for the natural populations 23 . The restriction enzyme EcoRI was used to cut the DNA. A total of 26 multiplexed sequencing libraries were constructed, in which each DNA sample was assigned a unique nucleotide multiplex identifier (MID) for barcoding. Pair-end (125-bp) sequencing was performed using Illumina HiSeq3000 in a total throughput of six lanes. SNP discovery, filtering, genotyping and validation. Raw sequence reads were trimmed to 110 nucleotides from the 3′ end to ensure that more than 98% of the nucleotides had a quality value above Q30 (equal to 0.1% sequencing error). The trimmed reads were clustered into read tags (hereafter, RAD-tags) by sequence similarity using ustacks 24 to produce unique candidate alleles for each RAD locus. A maximum base-pair mismatch of one was allowed in this step for the genetic mapping population. RAD-tags were then collapsed into clusters using ustacks under the default parameters for SNP calling.
Linkage map construction and QtL mapping. Genotype calling refers to the process of determining the genotypes of the SNP loci of each individual after SNP calling has been performed in accordance with Xu et al. 25 . Then, customized Perl scripts were applied to generate a ".loc" format file, which was the input file for joinmap 4.0 26 . The genetic map was grouped by joinmap 4.0 with logarithm of the odds (LOD) = 6.0 and the marker order was determined by Lep-MAP 2 27 . Singular markers were added to the established LGs using the joinSingles module with an LOD score limit of 10 and a minimum difference of 3 between the best LG and the second best LG of each joined marker. Then, markers intervals which are larger than 30 cM were removed, except for the LG32, in which the large interval might be caused by the location of centriole.
QTL mapping of the 26 traits was performed with MapQTL 5.0 28 . QTL region detection, the percentage of the phenotypic variance explained, and the genotypic information coefficient (GIC) were calculated with the interval QTL mapping model (IM) 29 . In the QTL mapping step, the LOD threshold for testing the significance of the QTL peaks was calculated using 1,000 permutations for each of the trait data sets and a genome-wide significant level of 5%. For interval distances >1.0 cM, significant thresholds were estimated every 1.0 cM. In this study, a LOD value of ≥3.0 was set as the minimum threshold to indicate a QTL in the present study.
were BLASTed against genomic scaffolds and transcriptome unigenes, and a reference transcriptome was de novo assembled using the raw reads from our previous work 16,17 and some unpublished data using the BLASTn program via an identity value cut-off of 99% and alignment length of 50. For alignment lengths between 30 and 50, manual inspection was performed. The transcriptome data, the genome scaffold data, and tag sequences from linkage map data referred to in this article are provided in Supplementary File 1. RAD-Tag sequences were also annotated via the GenBank nr database, which is hosted by NCBI (http://www.ncbi.nlm.nih.gov/), by blastx with an E-value cut-off of 1.0 × 10 −5 .

Results
Genome survey of S. paramamosain. Two 400-to 500-bp and one 250-bp paired-end libraries were constructed for the genome survey analysis, and 147.18 Gb of sequencing data were generated. Raw sequencing data were submitted to the Sequence Read Archive (SRA) database of NCBI with the accession number SRP150472. After QC and filtering, a total of 78.41 Gb of high-quality reads were obtained with a 45.32% GC content. The frequency of 17-mers (nucleotide strings with a length of 17 bp) among the raw sequencing data was calculated, and a k-mer curve was constructed. k-mer analysis revealed that there was a peak at the k-mer length of 60. The genome size was estimated at 1.21 Gb with remarkably high heterozygosity (1.3%), and 55.20% of the genome was estimated to be repetitive 30 . A total of 16,925,345 contigs with an N50 size of 145 bp were obtained, and of these sequences, 7,027,514 contained at least one simple sequence repeat (SSR), of which 3,900,204 were one-base repeats and 4,446,896 were two-base repeats.
RAD sequencing and genotyping. There were 31 million high-quality reads for the female parent, and on average, 11 million for the offspring. The male parent could not be found because the female parent was captured from the open field sea. The female parent was sequenced at a higher depth (26.90×) than the mean depth of the offspring (8.37×). The statistics for the RAD-tag are summarized in Table 1. A total of 158,138 heterozygous polymorphic SNP markers were detected initially. A missed genotype number above nine on any SNP was filtered. After filtering, 29,069 high-quality SNP markers, that conformed to the expected Mendelian ratios (P ≥ 0.001) remained and were included in the linkage analysis.
Linkage map construction. At the LOD threshold of 6.0, 24,444 SNP loci were grouped into 47 linkage groups in the merged maps ( Fig. 1), and 4,625 SNP markers could not be grouped. The 24,444 SNP markers contained three segregation types, including 13,261 lm x ll (54.3%), 9,125 nn x np (37.3%) and 2,058 hk x hk (8.4%). The genotype information for each SNP locus is provided in Supplementary Table 2, and the paternal information was derived from the maternal and F1 data. Maternal and paternal maps were also constructed, with 46 and 47 linkage groups, respectively. The lengths of the maternal, paternal and merged maps were 3,230.70 cM, 3,334.42 cM and 3,087.53 cM, respectively, and the average marker intervals were 1.87 cM, 1.82 cM and 0.92 cM, respectively. The length of the linkage groups in the maternal map varied from 6.12 cM (LG47) to 279.34 cM (LG2), and the length of those in the paternal map ranged from 6.19 cM (LG47) to 312.65 cM (LG8) ( Table 2). The estimated genome size of S. paramamosain is 1.21 Gb; thus, the average recombination rate across all linkage groups was 2.55 cM/Mb. QtL mapping. The 26 traits of the 99 mapped filial individuals, which included 78 females and 21 males, are listed in Supplementary Table 3. In total, 442 SNP loci, with a LOD > 3, which representing 279 QTLs, corresponded to 24 traits. We divided these 24 traits into five categories: appearance size, head area, cheliped, pereiopod and swimming stroke, and reproductive ability (Fig. 2). The LOD values of these 442 loci ranged from 3.01 to 12.66. Five markers with a LOD value greater than five were associated with seven traits, including three cheliped traits (markers 11.hk_hk_10468 and 11.hk_hk_11304), three pereiopod and swimming stroke traits (markers 11.hk_hk_7558, 105.nn_np_34985 and 11.hk_hk_11304) and one size trait (11.nn_np_24679). Among the 442 markers, 36 markers were linked to more than one trait, suggesting that these traits might be under the same genetic control. Two to 58 markers distributed in one, two, three or four linkage groups were linked to 24 traits. SW had the most linked markers (58), followed by 2PML (55), 3PML (48), and FW (44), whereas CW and ML had only two markers. The markers associated with the 24 traits are summarized in Table 3.
Gene location and growth gene identification. By means of BLASTing against the nr database, transcriptome and scaffold of the genome survey, 4,271 of the linkage map markers were linked to genes, and the gene numbers distributed in each linkage group ranged from 15 to 210. LG2 has the largest number of located genes, followed by LG3 (204), LG1 (198), LG11 (180) and LG4 (171). LG47 has the smallest number of genes, followed by LG44 (16), LG45 (20), LG46 (32), LG41 (37) and LG43 (38) (Supplementary Table 4). Gene numbers were generally positively correlated with the length and markers of linkage groups, with correlation coefficient of 0.818 and 0.823, respectively. Some growth-related genes, including transforming growth factor-beta regulator I, insulin-like androgenic gland hormone and insulin-like growth factor 2 mRNA-binding protein, were located. C-type lectin, which participates in the immune defense process, HR4 nuclear receptor and broad-complex-like protein, which participates in the ecdysone regulation pathway, were also located.
In total, 78 markers have been linked to genes. Among them, 22 markers had annotation information from at least one public database, and others were linked to transcriptome data but lacked annotation information. Of these 22 markers, two were located in the 3′-UTR, seven were in the intron, 11 were in the open reading frame (ORF), and four led to a nonsynonymous mutation; the other two were located in a pseudogene because an abnormal termination codon existed in the translated region (Table 4). 11.hk_hk_4825, a synonymous SNP located in the ORF of a RING finger protein, was a QTL of FW and 5PW. Two RING finger proteins, nhl-1, nesprin-2-like, RNA-directed DNA polymerase from mobile element jockey-like, dedicator of cytokinesis protein 3 isoform X2, carbonic anhydrase 2-like and protein disulfide-isomerase A5-like each had one SNP marker associated with FW (four), DFMS (two) and 2DFLS (one), and these genes or their linked genes might be related to the development of head area. Furthermore, integrator complex subunit 8-like, tenascin-X-like, sialidase-like, BTB/POZ domain-containing protein 3, polypeptide N-acetylgalactosaminyl transferase 1 and cholinephosphotransferase 1 isoform or their linked genes might be related to pereiopod and swimming stroke development. Mitochondrial ribonuclease P protein 3, sedoheptulokinase-like and putative nuclease HARBI1 are three genes with SNP associated with cheliped related traits, and them or their linked genes might be related to cheliped development. In addition, ankyrin repeat and general transcription factor II-I repeat domain-containing protein 2B-like or their linked genes might be related to the appearance size of the crab. GPI mannosyltransferase 4, DNA helicase MCM8-like and kinesin II or their linked genes might be related to reproductive ability, and protein numb isoform X5 or its linked genes might be related to OCS.

Discussion
Mud crabs are euryhaline and widely distributed in tropical, subtropical and temperate waters. In China, S. paramamosain is the dominant species and is mainly cultured in southeastern coastal provinces 14 . Mud crabs are very popular in China because of their high content of protein, unsaturated fatty acids and trace elements such as vitamins 31 .
Genome sequencing is an important step for deciphering evolutionary status and molecular mechanisms and accelerating genetic improvements in traits of interest in economically important species. The genome size of S. paramamosain was estimated to be 1.21 Gb with remarkably high heterozygosity (1.3%). This finding is inconsistent with previous studies (1.64 pg), in which flow cytometry was used to assess genome size 32 . Different results for one species from two methods were also found in P. trituberculatus 9 and L. vannamei 7 . The flow cytometry method is fairly straightforward, but the accuracy is highly dependent on the internal standard and quality of the material used for DNA content measurement 33 . Additionally, a comparison between flow cytometry and k-mer analysis methods in the estimation of nine insect species suggested that k-mer analysis is more accurate 34 . Genome survey analysis will be highly useful for the formulation of sequencing strategies for the mud crab.
Currently, the only reported linkage map of this species was constructed by our team using microsatellite and AFLP markers. In Total, 212 markers were mapped, and the mean spacing of the markers was 18.68 cM 20 , which cannot fulfill the requirements of the genetic breeding industry. This study reported a high-resolution genetic  7 . A large number of markers in the map were at the same location, which means they are completly linked. Therefore, the marker interval should be 0.92 cM instead of 0.13 cM (3087.53/24,444≈0.13). However, the numerous markers will be very useful in QTL mapping, gene location and genome assemble. www.nature.com/scientificreports www.nature.com/scientificreports/ Forty-seven linkage groups were obtained in the merged map, and only 46 linkage groups were in the maternal map. The missing linkage group in the maternal map was LG38; this linkage group may not be the sex chromosome because alleles in this linkage were not separated significantly by sex in the F1 offspring. Forty genes were located on LG38, and no sex-determining genes were found in this linkage group. A high-density linkage map suggested a ZW sex determination system in Eriocheir sinensis; but only an ankyrin-2 gene was found on the putative sex chromosome, and the sex determination gene double-sex was located on a putative autosome 6 . This finding suggested that sex determination genes may not exist on sex-linked chromosomes or that there is a different sex determination system in crustaceans. Further, 12-sex linked QTL were identified, and verification in large numbers of wild populations is needed. www.nature.com/scientificreports www.nature.com/scientificreports/ Under the conditions of high marker density (24,444 SNPs), the linkage groups should be consistent with the haploid chromosome number. However, the number of linkage groups in the present map seems to be less than the 49 haploid chromosome numbers reported by Wang et al. 35 and Chen et al. 36 using karyotype method. In fact, fewer number of linkage groups in the linkage map were also found in M. japonicus 8 . In this study, the lack of one linkage group in the maternal map and the lower numbers in the merged map might be mostly due to the relatively lower number of mapping populations or the low enzyme-digestion efficiency in the missing chromosome.
Overall, 4,271 markers were linked to genes, and the gene numbers were generally positively correlated with the length and marker numbers of linkage groups, suggesting that genes were almost evenly distributed in the crab genome. Transforming growth factor-beta regulator I, which is also called nuclear interactor of ARF and Mdm2 (NIAM), activates p53/TP53, causes G1 arrest and collaborates with ARF to suppress proliferation, acted as a growth inhibitor 37 . Insulin-like androgenic gland hormone (IAG), which is produced by androgenic glands in male crustaceans, is regarded as a key regulator of sex differentiation. Interestingly, studies on IAG in Scylla paramamosain suggested that it is involved in regulating ovarian development and somatic growth 38 . Insulin-like growth factor 2 mRNA-binding protein binds to the 5′-UTR of the insulin-like growth factor 2 (IGF2) mRNA   www.nature.com/scientificreports www.nature.com/scientificreports/ to regulate IGF2 translation, and thus has an important function in animal development 39 . C-type lectins are a superfamily of proteins that recognize a broad repertoire of ligands and regulate a diverse range of physiological functions 40 . Indeed, C-type lectins of S. paramamosain were upregulated significantly by Vibrio parahaemolyticus and had an important role in the immune response 41,42 . Broad-complex-like protein and HR4 nuclear receptor are two ecdysteroidogenic transcription factors that deliver ecdysone signal, which is also important for development 43 . Knowing the location of these genes will assist the study of genetic breeding.
The head area of the crab contains many organs, such as the cerebral ganglion and antennary glands. Seven linked annotated genes were associated with this area. Among them, nesprin-2 belongs to a novel family of nuclear and cytoskeletal proteins with rapidly expanding roles as intracellular scaffolds and linkers 44 . Dedicator of cytokinesis protein 3 (DOCK3) is a large protein involved in intracellular signalling networks, which function as activators of small G proteins. DOCK3 is expressed exclusively in the central nervous system of mice and plays an important role in axonal outgrowth and cytoskeleton re-organization 45 .
Six genes associated with pereiopod and swimming stroke traits were identified. Among them, tenascin-X is a glycoprotein that contributes to matrix stability and is possibly involved in collagen fibril formation 46 . A SNP in this gene causes a nonsynonymous mutation, which replaces a hydrophilic amino acid Ser with a hydrophobic amino acid Leu. Sialidase-like, which is a glycoside hydrolase enzyme that cleaves the glycosidic linkages of neuraminic acids 47 , also possesses a nonsynonymous mutation of Ser to Asn. Future studies in large populations are needed to prove the correlation between phenotype and genotype. Polypeptide N-acetylgalactosaminyltransferase 1, which catalyses the initial reaction in O-linked oligosaccharide biosynthesis 48 , and cholinephosphotransferase 1, which catalyses phosphatidylcholine biosynthesis from CDP-choline and plays a central role in the formation and maintenance of vesicular membranes 49 , were also suggested to have a relationship with pereiopod and swimming stroke development.
Three linked annotated genes were suggested to associate with cheliped traits. Among them, mitochondrial ribonuclease P protein 3, which is a part of mitochondrial ribonuclease P (mt-RNase P) that cleaves tRNA molecules at their 5′ -ends 50 , and a nonsynonymous mutation of Trp to Arg was found in this gene. In addition, the putative nuclease HARBI1, which might promote anti-V. alginolyticus infection by participating in regulating phagocytosis, apoptosis, superoxide dismutase activity, PO activity, and THC in Marsupenaeus japonicus, was another gene associated with cheliped traits 51 .
The ankyrin repeat was linked to one QTL of 8CW and appeared to be associated with appearance size. Proteins that contain the ankyrin repeat motif can mediate the protein-protein interactions 52 .
Beneath the abdominal plate is the location in which the eggs are held in female crabs and where the ejaculatory duct is found in male crabs, both of which were possibly associated with reproductive ability. Three genes were linked to QTLs of AW, among which GPI mannosyltransferase 4 is involved in glycosylphosphatidylinositol-anchor biosynthesis 53 . DNA helicase MCM8, forms a complex with MCM9 and is involved in homologous recombination repair following DNA interstrand cross-links, thus playing a key role during gametogenesis 54 .  www.nature.com/scientificreports www.nature.com/scientificreports/ It is likely that some of the association genes seem to have some relationship with specific traits, which indicated by their function studies in other species. However, it should be noted that the association genes do not mean that they are the causative genes, and in-depth studies are still needed.
In summary, this study performed a genome survey analysis, which revealed that S. paramamosain has an estimated genome size of 1.21 Gb with high heterozygosity (1.3%). Then, a high-resolution linkage map was constructed using 24,444 SNP makers. Forty-seven linkage groups were obtained with a total length of 3087.53 cM and a marker interval of 0.92 cM. With the aid of transcriptome and genome scaffold data, 4,271 markers were linked to genes, including some important growth-related, immune-related and hormone pathway genes. Further, 442 QTLs were identified and found to correspond to 24 traits, and of these, 78 QTLs were linked to genes. Some interesting genes, such as dedicator of cytokinesis protein 3, tenascin-X and DNA helicase MCM8, were believed to have important relationships with specific traits and are worth further exploration.