Transcriptome based high-throughput SSRs and SNPs discovery in the medicinal plant Lagenaria siceraria

Lagenaria siceraria (Molina) Standley has unique biological characteristics with high nutritional and medicinal values. It is an important pharmaceutical plant with various biologically active ingredients. Genetic improvement and deeper genomic studies require a rich resource of molecular markers. The application of next-generation sequencing technology, especially for transcriptome profiling, has greatly facilitated high throughput single nucleotide polymorphism (SNP) and simple sequence repeat (SSR) discovery. In this study, we sequenced the transcriptome of three major cultivars of L. siceraria and obtained 64.88 GB of clean data. The assembled high-quality reads were clustered into 89,347 unigenes, which were annotated by non-redundant protein database, Swiss-Port, Eukaryotic Ortholog Groups, Kyoto Encyclopedia of Genes and Genomes, and gene ontology databases. A total of 8,891 SSR and 35,873 SNP markers were predicted from unigenes by MISA and SAM tools, respectively. Characterization of the predicted markers in L. siceraria showed that the SSR and SNP densities were 60 and 243 markers per Mb of genome, respectively, and the estimated ratio of transition to transversion of SNP was 2.016. These markers will be very useful for genetic studies in L. siceraria, especially for the high-density linkage map construction and genome-wide association studies. Further genomic studies based on these results will facilitate the identification of novel genes or alleles of pharmaceutical importance.


Introduction
Markers assisted study, especially based on genetic and molecular markers, is an indispensable tool for selection and manipulation of germplasm in breeding, population genetics, molecular ecology, conservation genetics, and other life sciences experiments. Among different kinds of molecular markers restriction fragment length polymorphism (RFLP), random amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP), inter-simple sequence repeat (ISSR), simple sequence repeat (SSR), and single nucleotide polymorphism (SNP) have been globally adapted for the last three decades (Grover and Sharma, 2016;Nadeem et al., 2018;Sunnucks, 2000). Among these markers types, SSR and SNP are considered the most versatile and widely used markers (Wang et al., 2019a).
The SSR markers, also named microsatellites, are known for their diverse and user-friendly characteristics such as co-dominance, genomic abundance, random distribution, smaller locus size, simplicity of use, high clarity, reproducibility, low operational cost, hyper-variability, amenability to automation, ease of multiplexing, and high polymorphism (Rashid et al., 2016;Wang et al., 2019b). The point mutations based single nucleotide polymorphisms (SNPs) with bi-allelic, abundant, and ubiquitous nature are the most extensive type of variation in the genome sequence of an organism (Nadeem et al., 2018). The SNP markers help reveal the key functional variation in a candidate gene (Rashid et al., 2016).
The next-generation sequencing technologies, highthroughput sequencing of genomic DNA, and cDNA obtained from extracted RNA of plant tissues have accelerated the exploitation and development of SSR and SNP markers on a larger scale (Grabherr et al., 2011;Kumar et al., 2012). The mRNA based transcriptome sequencing yields the SSRs and SNPs that are potentially linked with protein-coding sequences in genic regions, and these markers are a very powerful tool in multiplex research fields (Codina-Solà et al., 2015) such as population biology (Gramazio et al., 2018), association analysis (Xie et al., 2016), haplotype evaluation (Rashid et al., 2016), modern breeding, and molecular ecology and evolution (De Wit and Palumbi, 2013).
Lagenaria siceraria (Molina) Standley (family Cucurbitaceae), commonly known as bottle gourd, is a plant of high nutrition and medicinal values and is considered one of the earliest plants to be domesticated on earth (Lim, 2012;Shah and Seth, 2010). It is native to Africa and has been widely cultivated in tropical and subtropical regions (Lim, 2012;Prajapati et al., 2010). L. siceraria fruit is traditionally known for its cardiotonic, cardioprotective, general tonic and aphrodisiac properties and useful to treat various anti-cancer (Izawa and Kuroda, 2010) allergic and inflammatory disorders like bronchial asthma, rhinitis, bronchitis, and rheumatism (Prajapati et al., 2010;Shah and Seth, 2010). Extracts of fruits of L. siceraria have antiinflammatory, analgesic, hepatoprotective, antihyperlipidemic, diuretic and antibacterial activities (Prajapati et al., 2010;Shah and Seth, 2010). These properties of L. siceraria fruit are attributed to its saponins, cucurbitacin, carbohydrates, and flavonoids (Izawa and Kuroda, 2010;Shah and Seth, 2010). It contains a high choline level along with required metabolic precursors for brain function (Rahman, 2003). The extracts of L. siceraria fruits have been reported for immunomodulatory activity in experimental animals (Gangwal et al., 2009). Due to its significant nutritional, and pharmaceutical importance (Lim, 2012;Prajapati et al., 2010;Rahman, 2003;Shah and Seth, 2010), it is necessary to improve varieties of L. siceraria with wide adaptability and enrich it with prebiotic and potentially active biomolecules (Ahmad et al., 2011).
The currently available genetic and genomic tools for L. siceraria are very limited as compared to other crop plants. Previous studies have been reported for the genetic variability (Mladenović et al., 2012), diversity, divergence (Chetariya and Vaddoria, 2017), heterosis (Damor et al., 2016) through traditional genetic analysis. Only a few studies were based on modern genomic tools such as genetic mapping (Wu et al., 2019), DNA-markers based diversity (Kumar et al., 2018), transcriptome analysis for root tissues (Yang et al., 2014), and comparative transcriptome analysis (Zhang et al., 2020). These studies mainly used Random Amplified Polymorphic DNA (RAPD) marker and morphological markers. Recently, some studies report the whole genome sequence of bottle gourd and studied the Cucurbitaceae evolution and provided the base to map some useful genes (Wu et al., 2017). A genome-wide study for identification of GRAS transcriptional factor in bottle guard genome was reported (Sidhu et al., 2020). Furthermore, a multi-omic database has also been generated for bottle gourd (Wang et al., 2018). However, the lack of genetic maps, larger high-throughput marker collections, and suited mapping populations are limiting gene isolation and L. siceraria breeding. The poorly studied genome of bottle gourd is another limiting factor for SNP markers discovery. In the current study, we focused on the transcriptome sequencing based high-throughput identification of molecular markers. We evaluated the transcriptome sequences of L. siceraria, identified and characterized the SSR and SNP markers. The results of this study provide rich molecular tools to facilitate germplasm characterization, breeding, and functional genomics in this pharmacologically important plant.

Plant materials and growth conditions
Three cultivars of bottle gourd (Hang, Fu, and USA) were used in this study. These commercial cultivars are highly grown and consumed in Wuhan, China. The seeds were obtained from the Vegetable Institute of Wuhan Academy of Agricultural Sciences, Wuhan, China, and plants were grown in a greenhouse of the same Institute. No special ethical license is required for working with these plant materials. Plants were grown in a substrate composed of sandy soil: vermiculite: perlite: organic fertilizer in 4:4:1:1. The pH of the culture substrate was maintained at 7.0, and the water content was maintained at 70% relative humidity. The greenhouse conditions were maintained at 25-28°C and 18-20°C day and night temperature, respectively, photoperiod 14 h/d with a light intensity of 87.5 µmol/m 2 s was maintained. Pest control was performed according to standard practices. Leaf samples were harvested from three individual plants (biological replicates) of each cultivar, immediately frozen in liquid nitrogen, and stored at −80°C until further use.
cDNA preparation, transcriptome sequencing, assembly and quality control Three independent biological replicates for three cultivars of Lagenaria siceraria were used for cDNA library preparation and sequencing. The total RNA was extracted from each sample and sequencing libraries were generated using NEBNext Ò Ultra TM RNA Library Prep Kit for Illumina Ò (NEB, USA) following manufacturer's recommendations, and index codes were added to attribute sequences to each sample. The clustering of the index-coded samples was performed on a cBot Cluster Generation System using TruSeq PE Cluster Kit v3-cBot-HS (Illumia) according to the manufacturer's instructions. After cluster generation, the library preparations were sequenced on an Illumina Hiseq X Ten platform and paired-end reads were generated. The quality of the transcriptome sequencing library was evaluated; (1) by examining the distribution of inserted fragments on unigenes, the randomness of mRNA fragmentation and the degradation of mRNA were evaluated, (2) by drawing the length distribution map of the inserted fragment, the discrete degree of the length of the inserted fragment was evaluated, (3) by drawing the saturation map, we can evaluate the capacity of the library and the adequacy of the mapped reads compared to the unigenes library.
As per the sequencing strategy of the machine, the average read length was 100 bp. Raw data (raw reads) were firstly processed through in-house perl scripts. In this step, clean data (clean reads) were obtained by removing reads containing adapter, joint sequences, sequencing splices, primer sequences, with less than 5 mass value, Poly-N (more than or equal to 5% N-base), Poly-A and low quality reads from raw data. At the same time, Q20, Q30, Q40, GCcontent, and sequence duplication level of the clean data were calculated and most of the base quality scores can reach or exceed Q30. The QPhred values were estimated to evaluate the sequencing quality. All the downstream analyses were based on clean data with high quality. The clean reads were saved to fastq format (Cock et al., 2010) and assembled using Trinity (Grabherr et al., 2011) with default parameters to obtain the unigenes. Base type distribution test was used to detect the GC contents separation. The whole sequencing process was stable and horizontal because of the principle of random interruption and base complementary pairing.
Gene structure analysis clustering and functional annotation Transcoder software was used to predict the coding region sequence of unigenes and its corresponding amino acid sequence, while Trinity (Grabherr et al., 2011) was used for coding sequences (CDS) prediction. The blast (Altschul et al., 1997) software was used to compare the unigene sequence with NCBI non-redundant protein sequences (NR) (Deng et al., 2006). A manually annotated and reviewed protein sequence database (Swiss Prot) (Apweiler et al., 2004), Gene Ontology (GO) (Ashburner et al., 2000), Clusters of Orthologous Groups of proteins (COG) (Tatusov et al., 2000), KOG (Koonin et al., 2004), eggnog4.5 (Huerta-Cepas et al., 2016, and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2004) database. The kobas2.0 (Xie et al., 2011) program was used to get the KEGG ontology result of unigenes, and Hmmer (Eddy, 1998) software was used to compare with Protein family (Pfam) (Finn et al., 2014) database after predicting the amino acid sequence, to get the annotation information of unigenes.

SSR detection and validation
SSR analysis was carried out for unigenes with a length >1 kb by using MIcroSAtellite identification tool (MISA) v 1.0 to detect microsatellite or simple sequence repeat (SSR) loci (http://pgrc.ipk-gatersleben.de/misa/misa.html). The minimum repetition parameter was ten for mono-, six for di-, and five for tri-, tetra, penta-, and hexa-nucleotide SSR motifs, respectively. If the distance between two repeated motifs was shorter than 100 nucleotides then they were considered as compound microsatellites (Thakur and Randhawa, 2018). The primer 3 v2.3.5 (Untergasser et al., 2012) (http://primer3. sourceforge.net/releases.php) with default parameters was used to design three primer pairs for all SSR markers. After SSR markers identification, 59 SSR markers were randomly selected for validation. Polymerase Chain Reaction (PCR) with the primers of the above selected SSR markers was carried out on the DNA samples extracted from the fresh leaves of Hang cultivar using the DNeasy Plant Mini Kit (QIAGEN). The amplified PCR products were visualized on 1% agarose gel.

SNPs analysis
To identify the putative single nucleotide polymorphic sites (SNPs), the high-quality transcriptome sequences and unigenes in each sample were assembled and compared by STAR software (Dobin et al., 2013) and the BAM format comparison files were obtained. The single nucleotide polymorphism (SNP) sites were identified through the SNP calling process of GATK (Lim, 2012) for RNA-Seq. Picardtools v1.41 and samtools v0.1.18 were used to sort, remove duplicated reads, and merge the bam alignment results of each sample. The SNP quality criteria were set by sequencing depth > 2 and the < 3 mismatches in the 35 bp range of marker. The identified SNPs were further filtered for low quality by SNP calling and only SNPs with distance > 5 were retained. Then, the SNPs were further analyzed to evaluate their effect on the expression level of genes and/or the types of protein products. The SNPs obtained were further characterized manually.

Expression evaluation and differential expression analysis
The bowtie2 v 2.3.4 program (Langmead and Salzberg, 2012) was used to compare the sequenced reads with the unigene library. According to the comparison results, the expression level was estimated by combining with RSEM (v. 1.3.1) (Li and Dewey, 2011). The fragment per kilobase of transcript per million mapped reads (FPKM) value was used to express the expression abundance of corresponding unigenes. The false discovery rate (FDR) < 0.01 and the difference multiple fold change (FC) ≥ 2 was set as the filtering standard. The DESeq program was used to screen the differentially expressed genes. The Benjamin Hochberg method was used to correct the P-value of the original hypothesis test and FDR values were estimated to reduce the false-positives caused by independent statistical hypothesis tests for a large number of gene expression values. The corrected P-value was kept as a key index of differential expression gene (DEGs) screening. The DEGs were further filtered for FDR < 0.01 and the difference multiple FC ≥ 2. For the samples without biological duplication, Ebseq (Leng et al., 2013) was used to analyze the differential expression gene set between the two samples. Pearson's correlation coefficient was used as the evaluation index of correlation among samples (Anders and Huber, 2010). Further Gene Ontology (GO) enrichment analysis of the differentially expressed genes (DEGs) was implemented by the topGO software, and the statistical enrichment of differential expression genes in KEGG pathways was analyzed.

RNA-seq and transcriptome assembly
The high-throughput sequencing (Illumina Hiseq X Ten) data was generated and then transformed into the raw data by base calling analysis. A total of 24.65, 25.66, and 32.32 million raw reads were extracted for the cultivar USA, Fu, and Hang, respectively (Tab. 1). After sequence cleaning for quality control, a total of 64.88 GB clean data was obtained, and the percentage of Q30 base of all products was more than 92.52. A total of 87.29%, 84.94%, and 85.33% of cleaned reads were mapped to unigenes for three cultivars of L. siceraria USA, Fu, and Hang, respectively (Tab. 1).

Functional annotation of unigenes
The assembled unigenes of L. siceraria were evaluated for functional annotation using the seven known databases. A total of 48,274 (54%) unigenes were annotated to at least one of all seven databases. Out of the total 89,347 unigenes, 40,297 (45%), 31,762 (36%), and 25,679 (29%) unigenes could be annotated using the NR, Pfam, and SwissProt databases respectively (Tab. 3). In particular, 12,787 unigenes (14%) showed significant identity to the gene ontology (GO) database. All the identified GO terms distributed into 4,505 functional groups were classified into three major-, and 54 sub-classes. Among the three categories, maximum (33,858) unigenes in 2,339 subclasses of 'Biological processes' to minimum (16,697) unigenes in 504 subclasses of 'Cellular Components' were observed, while (19,783) unigenes were in 1,662 subclasses of 'Molecular function'. As per ontology terms of class 'Cellular Components', maximum (1,627) unigenes hit the subclass 'Nucleus' followed by 'plasma membrane' (1,339), and 'cytosol' (971). Among the GO-terms in 'Molecular functions', maximum (1,418) unigenes were involved in 'ATP binding' followed by 'Zinc ion binding' (604), while 'oxidation-reduction process' with 879 unigenes followed by 'translation' with 580 unigenes were in top trend in the 'Biological processes' (Suppl. Tab. 1).
Furthermore, 17% (14,952) unigenes could be annotated into the COG database. Among the COG function classes 'General function prediction only' was on top selection followed by 'Translation, ribosomal structure and biogenesis' and 'post-transitional modifications, proteins turnover, chaperons' with 2,068, 1,787 and 1,377 unigenes respectively. The same categories in the KOG database with 4,856, 2,079 and 2,837 unigenes and the eggnog database with 7,413, 2,314 and 3,663 unigenes were also observed to   (Fig. 2A). The genes with similar expression patterns were clustered together into various groups. The correlation between samples was analyzed on the basis of expressed genes and Pearson's coefficient of correlation. The clear groups of each cultivar were observable (Fig. 2B), which indicated the genetic FIGURE 1. The length (in nucleotides (nt)) based distribution of clean reads after sequence assembly. similarity amid the repeats (r 2 ≥ 0.9) while showing dissimilarities between cultivars. Among the mapped reads, a minimum of 1,250 unigenes was differentially expressed by more than two-folds. All the differentially expressed genes (DEGs) could further be classified as up-regulated and down-regulated (Fig. 2C, Tab. 4). The DEGs were further evaluated for their functional annotation. All the DEGs could be classified for a maximum of 139 gene ontology (GO) terms (Fig. 3A). In the top twenty GO terms, maximum genes were related to 'metabolic process', 'cell' and 'cell parts', and 'catalytic activity', while in COG function classification, the majority of genes hit the 'Translation, ribosomal structure and biogenesis', 'General functions predictions, and 'Posttranslational modifications, protein turnover, chaperones' (Fig. 3B). Furthermore, the KEGG pathway analysis was performed for DEGs, and 434 functional pathways were observed. Among the top twenty KEGG pathways, 'Ribosome' pathway with 21.98% unigenes has maximum hits (Fig. 3C).
Identification, characterization, and functional enrichment of SSR in transcriptome

SSR identification and characterization
In order to reduce the bias of transcript length and noise at a low level of expression, the unigenes having fragments per kilobase of transcript per million mapped reads (FPKM) ≥1 were selected to identify simple sequence repeats (SSRs). A total of 8,891 microsatellite motifs were identified on 6,705 mapped and/or 6,030 annotated unigenes, i.e., 7.5% and 6.74% of total unigenes. It accounted for 60 SSRs/Mb genome density. The total length of the SSR motifs was 153,577 bp, which provided the relative density of 57.89 bp/ kb. Of the total SSRs, mono-nucleotide microsatellites were the most abundant (5,126, 57.65%) with 61,542 bp length, followed by di-(1,447, 16.27%), tri-(1,615, 18.16%), tetra-(81, 0.91%), penta-(18, 0.20%), and hexa-nucleotide (6, 0.07%), types (Tab. 5). The A (50.72%) and T (46.72%) motifs were the most abundant in mono-nucleotide repeats. Among di-nucleotide motifs, AT (16.86%), TA (14.23%), TC (17.21%), and AG (17.96%) with a total count of 959 were the most abundant types, while there were only three GC and one CG motifs. Among tri-nucleotide, the most abundant motif was GAA (12.58%), followed by TTC (9.12%), AGA (7.01%) motif types, while only one GTA was observed (Suppl. Tab. 2). An average length of L. siceraria microsatellites was 17.27 bp. The length variation of microsatellites was significantly affected by the repeat motif size. The hexa-nucleotide with an average 34 bp length was the longest motif type followed by penta-(25.83 bp), tetra-(20.54 bp), tri-(16.39 bp), and di-nucleotide (14.31 bp) motif types, while the mononucleotide motifs had the shortest average length (12.01 bp). The longest microsatellite identified was 42 bp, which was composed of a 7-fold repetition of a hexa-nucleotide motif. Along with these motif types, the compound motif types as 'c' motifs containing two types of repeats separated by few nucleotides, and 'c*' motifs in which two types of repeats are not separated by nucleotide stretch were observed as well. Among the total microsatellites, 6.53% (581) were 'c' type, and 0.19% (17) were 'c*' type compound motifs with an average length of 72.20 bp and 34.11 bp, respectively. The longest compound motif was 320 bp containing 4 mono-and one tri-nucleotide motifs with multiple repeats. A total of 51 SSR markers were observed for 41 unigenes in the top ortholog group 'Ribosome' pathways (K03010) (Suppl. Tab. 1).

SSR functional enrichment analysis
The 6,030 SSRs possessing annotated unigenes also included the 351 DEGs. Among these annotated unigenes, 7,758 unigenes (87%) showed significant similarities in nonredundant (nr) database proteins. The Gene Ontology annotation showed 424 unigenes (4.77%) were assigned with one or more GO term IDs. Among the total 53 GO classification for SSR containing unigenes, 'Biological Processes' terms appeared most for a total of 986 unigenes, followed by 'Cellular Components' and 'Molecular Functions', with 662 and 414 unigenes, respectively. Among the majorclass 'Biological Process', the 'metabolic process' and 'cellular process' were on top identification with a maximum of 200 and 198 unigenes. Among the 'Cellular Components', most of the unigenes were expressed for 'Cell', 'Cell parts', and 'organelles' with 151, 151, and 111 unigenes, while among the 'Molecular Functions', the 'catalytic activity', and 'binding' were on top hit with 183 and 173 unigenes respectively. KEGG analysis could annotate the 2,892 (32.5%) of SSR containing unigenes (Fig. 4A, Suppl. Tab. 3). A total of 2,250 (25.3%) and 4,658 (52.39%) of SSR containing unigenes could be annotated to COG and KOG databases, respectively. Among the function classes of both databases, 'General function prediction only' was on top selection, followed by 'Translation, ribosomal structure and biogenesis' and 'post-transitional modifications, proteins turnover, chaperons'.
A total of 59 SSR were randomly selected to check whether PCR products could be observed on agarose gel (Suppl. Tab. 4). As shown in Fig. 5, all the SSR yielded PCR products in single band, some were brighter than others, and PCR product sizes corresponded to the expected sizes. This result demonstrates that the SSR primers generated in this study are functional and pending a polymorphism test in various genotypes; they have high potential applications in breeding and genomic studies.

Identification and characterization of SNP variants in transcriptome SNP identification
A total of 35,873 single nucleotide polymorphism (SNP) sites with an average of 243 SNPs per Mb of genome were identified. Among the total SNPs, 30,589 SNPs were found to be located in 12,983 annotated unigenes. The majority of SNPs (68.14%) were homozygous, while the heterozygous SNPs accounted for 31.86% (Tab. 6). The total number of transition (Ts) and transversion (Tv) mutations were 23,970 (66.85%) and 11,889 (33.15%), respectively, with a Ts/Tv ratio of 2.016 (Tab. 7, Suppl. Tab. 5). The amount for A/G transition substitutions was similar to C/T, while the frequency of four transversion substitutions (A/T, A/C, G/C, G/T) was equal as well (Tab. 7).

SNP read depth and distribution
As the prediction accuracy of SNPs is closely related to the read depth (RD) in SNPs position (Li et al., 2013), the RD for each SNP position was calculated and plotted (Fig. 6A). The overall average RD was 15.75. Maximum (76%) SNPs   showed the RD between 1 to 10 followed by 17% SNPs with 10 to 50 RD while the SNPs with 51 to 200 and >200 RD accounted for 5.47% and 1%, respectively.
The distribution of SNPs among unigenes is important to identify the marker density in the genome (Liu et al., 2011). The information on SNP distribution is also significant  while using the SNPs to construct the linkage map. In this study, we found that all the SNPs were distributed in 14.5% (12,983) of total unigenes. Among these annotated unigenes, unigenes with one SNP were more common (51%), and those with no more than 10 SNPs occupied 98% of total unigenes. A total of 249 unigenes containing more than 10 SNPs were observed. The detailed SNPs distribution among those unigenes and SNP density were shown in Figs. 6B and 6C. The mutation rate among unigenes was investigated, as well as the SNP frequency within unigenes was calculated (Fig. 6D). The majority of unigenes showed a very low (0.00-0.01%) mutation rate. The top twenty annotated unigenes with the highest SNP frequency were listed in Suppl. Tab. 6.

SNP functional enrichment analysis
Among the total 12,787 unigenes containing SNPs including 1,921 differentially expressed genes, 11,667 unigenes (90%) showed significant hit to the non-redundant (nr) database proteins. The unigenes were annotated by the corresponding top best BLASTx hit. After the Gene Ontology annotation, 1,921 unigenes (14.8%) were assigned with one or more GO term IDs. Among the total 54 GO terms for unigenes containing SNPs, 'Biological Processes' was on top hit with 5,765 GO terms followed by 'Cellular Components' and 'Molecular Functions' with 3,067 and 2,934 GO terms identification, respectively. Among the major-class 'Biological Process', the 'metabolic process' and 'cellular process' were on top identification with maximum unigenes.   Fig. 4B). The 32% (4,154) and 59% (7,656) of SNPs containing unigenes could be annotated to COG and KOG databases, respectively. Among the function classes of both databases 'General function prediction only' was on top selection followed by 'Translation, ribosomal structure and biogenesis' and 'post-transitional modifications, proteins turnover, chaperons'.

Discussion
Medicinal plants are significant in terms of financial income, healthcare, livelihood security, and cultural identity to the human populations (Hamilton, 2004). The majority (up to 80%) of the population in developing countries primarily depends on herbal medicine for the basic treatment of diseases (Hamilton, 2004;Pourmohammad, 2013). About 10% of 50,000 different herbal plant species are now commercially cultivated to fulfill the public market demands (Pourmohammad, 2013). To meet the increasing demand for medicinal plants in pharmaceuticals, it is required to take a variety of measures such as conservation of wild resources by adopting appropriate traditional and advanced techniques, developing new high-yielding varieties, or improving the existing cultivars using the molecular markerbased techniques . Hence, we used the highly grown commercial cultivars of L. siceraria and performed the transcriptomic study to facilitate the markers-based breeding and research programs. Due to its significant contribution in nutrition pharmaceuticals, L. siceraria has been studied widely in multiple fields including pharmacognostic (Shah and Seth, 2010), cultivation, genetic diversity (Chetariya and Vaddoria, 2017;Mladenović et al., 2012), domestication, and breeding (Damor et al., 2016) but less studied for the exploitation of molecular markers. The present study aimed at developing a large amount of SSR and SNP markers for L. siceraria through transcriptome sequencing.
Molecular markers, especially SSR and SNP markers, are indispensable tools to evaluate the genetic diversity, QTL/gene identification and manipulation, wild resource in situ and ex situ conservation of medicinal plants. The DNA fingerprinting technology-based molecular markers could allow the authentication of medicinal plants and species taxonomy (Li et al., 2003;Oleszek et al., 2002;Um et al., 2001). Markers assisted selection based molecular breeding technology facilitates the identification of desirable chemotypes (Fico et al., 2003). Moreover, molecular markers, especially SSR and SNP markers, have a significant impact in the research field of traditional herbal medicinal plants for improved yield and biotic and abiotic stress control.
High-throughput sequencing has been used frequently to exploit SSR and SNP markers for herbal medicinal plants on a large scale (Liu et al., 2015;Otto et al., 2017;Su et al., 2016;Wang et al., 2019b;Wei et al., 2016;Yun et al., 2012). As most of the functional genes at the corresponding developmental stage could be involved in the transcriptome, next-generation sequencing could enable the deep and efficient probing of the transcriptome (Grabherr et al., 2011) and ensures a sufficient resource for gene-associated markers discovery. Here, we have identified a total of 8,891 SSR and 35,873 SNP markers with high quality from the transcriptome of three L. siceraria cultivars using high-throughput sequencing, which will not only enhance the genetic resource but also help in the assessment of genetic diversity and conservation of this important herbal plant. Most of the identified SSR (75%) were within the mapped unigenes. An analysis of the localization of microsatellite repeated motifs showed that mononucleotides were preferentially localized in unigenes and were in the most proportion.
SSR markers have been extensively used in constructing genetic maps of cucurbit species (Fernandez-Silva et al., 2008;Levi et al., 2009). Until now, only a few microsatellites have been available for Cucurbita, and transferability from other cucurbits, such as cucumber, has been demonstrated to be very low (Fernandez-Silva et al., 2008). Thence, the development of SSRs for this genus is highly desirable. In a study of linkage mapping in 2008, Gong et al. (2008) reported the development of SSRs-enriched partial genomic libraries from an Austrian oil-pumpkin variety Cucumis pepo subsp. pepo and one accession of Cucumis moschata, generating a collection of 1,058 putative SSRs (Gong et al., 2008). Transcriptome-based SSRs have several advantages as they are related to genes. Being functional markers, they can be used as candidate genes to study their association with phenotypic variation, and their flanking sequences are more likely to be conserved among close or distant species, making their use as markers for comparative mapping easier.
The detection of false-positive SNPs is a common problem in the high-through transcriptome sequencing (Cánovas et al., 2010). Different SNP calling programs could generate inconsistent SNP datasets, along with errors in data (Clevenger et al., 2015). Three common programs, namely Bowtie 2 built in SAMtools, Freebayes, and GATK Unified Genotyper, are used to call SNPs. GATK is the most conservative, and SAMtools is the most aggressive program. Freebayes and SAMtools are much more consistent with each other (Clevenger et al., 2015). To remove the falsepositive SNPs, different methods were used by researchers, such as removing all polymorphisms associated with genes that have paralogs (Barbazuk et al., 2007;Tang et al., 2006), setting some parameters like read depth (Li et al., 2012), minimum allele frequency (Byers et al., 2012), genotype score (Peace et al., 2012), base quality (Allen et al., 2013), and mapping quality (Uitdewilligen et al., 2013). The read depth and minimum allele frequency are the most commonly used parameters to investigate the false positives (Clevenger et al., 2015). Another study used the window approach to eliminate false-positive SNPs, i.e., potential alleles differed by 10% or more in the window surrounding the SNPs (Han et al., 2011). In this study, we detected the SNPs with read depth > 2, mismatches rate < 3 in 35 bp range of marker, and SNPs with distance > 5 bp, which could filter most of the false-positive SNPs, and generate high-quality SNPs.
The estimated SNP frequency in L. siceraria was 243 per Mb. The transition/transversion ratio (2.016) was also higher than Pacific oyster (1.3) (Tseng, 2012) and Drosophila (1.5) (Vera et al., 2013). Read depth is a key parameter affecting the predicting accuracy of SNPs (Li et al., 2013). One advantage of the Illumina sequencing platform is the high read depth, which ensures the detection of true SNPs. In this study, the average read depth of SNP position was 15.75, which was enough to guarantee the accuracy of discovered SNPs. It also could ensure that most of the expected SNPs in the sequenced population could be detected (Quinn et al., 2013). SNPs with much higher read depth should be excluded since too high read depth might be caused by paralogous sequence variants (Helyar et al., 2012). Heterozygosity and homozygosity analysis of these SNPs indicate most of these SNPs were polymorphic in the transcriptome. Another advantage of SNP discovery using transcriptome data is to find the SNPs directly associated with interesting traits, such as disease resistance or growth advantages. These SNPs can be possibly used in the further genome-wide association studies and genome selection breeding program of L. siceraria.
The SNP frequency in each unigene was calculated in this study. The SNP frequencies ranged from one per unigene to 55 per unigene. The unigenes with high-density polymorphic markers detected in this study could be used in the population diversity or population differentiation analysis. We arbitrarily evaluated the SNPs in deferentially expressed genes as well, which could be highly associated with the gene functions. GO and KEGG pathways analysis showed the most significant hits to the biological and catalytic process in cells, cell parts, and organelles for catalysis activities and molecular bindings in 'Ribosome' pathway. All of these are supporting the synthesis and regulation of bioactive ingredients, which provides an insight into the functional genomic understanding of various pharmaceutical functions. Therefore, the SSRs and SNPs that are located in the corresponding unigenes should be given priority for molecular breeding of L. siceraria.

Conclusions
In this study, next-generation sequencing of transcriptomes from three L. siceraria cultivars was used for high throughput SSR and SNP markers identification. About 64.88 GB of cleaned data were generated and assembled into 89,347 unigenes. A total of 8,891 and 35,873 high-quality SSR and SNP markers were predicted from the assembled reads of the transcriptome data. The predicted SSR and SNP markers were observed to be on 6,030 and 12,983 annotated unigenes. This study provided a valuable resource of SSR and SNP markers for population genetics, molecular breeding, and association studies of L. siceraria. The results of the present study would increase the knowledge about functional genes and the healthful effects of L. siceraria.
Authors Contribution: Hongyuan Zhang and Min Zhang: designed the study. Hongyuan Zhang, Xia Chen, Jie Tan, Shuping Huang, collected fruit samples, performed transcriptome analysis and data interpretation and validation; Hongyuan Zhang and Xia Chen: Drafted the manuscript; Min Zhang and Guolin Zhou: supervised the study, provided funding and revised the manuscript. All authors have read and approved the final version of the manuscript.
Availability of Data and Materials: The raw RNA-seq data has been submitted to NCBI SRA under BioProject ID PRJNA605136.