Mining of Simple Sequence Repeats Loci, Genetic Relationship and Population Structure of Bottle Gourd (Lagenaria siceraria (Molina) Standl.) Accessions with Different Geographical Origins Using Single Nucleotide Polymorphism (SNP) Markers

Lagenaria siceraria (Molina) Standl. (2n = 2x = 22) is an important horticultural and medicinal crop grown worldwide for the food and pharmaceutical industries. The crop exhibits extensive phenotypic and genetic variation useful for developing cultivars with economically important traits; however, limited genomic resources are available for effective germplasm characterization to support breeding and conservation strategies. This study determined the genetic relationships and population structure in a collection of bottle gourd accessions originating from Chile, Asia, and South Africa using single nucleotide polymorphism (SNP) markers and mining of simple sequence repeat (SSR) loci derived from genotyping-by-sequencing (GBS) data. GBS yielded 12,766 SNP markers classified as moderately to highly informative, with a mean polymorphic information content of 0.29. The mean gene diversity of 0.16 indicated low genetic differentiation among the accessions. Analysis of molecular variance revealed lower differentiation between (36%) than within (48%) bottle gourd accessions, suggesting that random mating predominates over inbreeding. Population structure analysis revealed two genetically differentiated groups: one comprising South African accessions and an admixed group with genotypes of Asian and Chilean origin. The SSR loci mined from GBS data should be developed into markers and validated before being used in diverse bottle gourd accessions. The SNP markers developed in the present study are useful genomic resources for bottle gourd breeding programs, allowing the extent of genetic diversity to be assessed for effective parental selection and breeding.


Introduction
Bottle gourd or calabash [Lagenaria siceraria (Mol.) Standl., 2n = 2x = 22] is a diploid, monoecious, and self-pollinating vegetable crop belonging to the genus Lagenaria of the Cucurbitaceae family (Achigan-Dako et al. 2008). The crop has diverse beneficial uses, including food, feed and medicinal purposes. The fresh and tender fruits are cooked as food, and the dry fruits are used to make containers for food and grain storage, decorations and musical instruments (Jeffrey et al. 1976; Kalpana et al. 2020). Cultivated bottle gourd is also used as a rootstock for the production of sweet watermelon (Citrullus lanatus var. lanatus) to control soil-borne and leaf diseases, tolerate low soil temperature, and improve nitrogen-use efficiency (Yetisir and Sari, 2003; King et al. 2008; Ulas et al. 2019; Aslam et al. 2020) and fruit quality (Guler et al. 2013, 2014). Bottle gourd is thought to be one of the first plant species domesticated for human use, approximately 10,000 years ago (Decker-Walters and Wilkins-Ellert, 2004; Erickson et al. 2005). Archaeological evidence suggests that bottle gourd originated in Africa (Decker-Walters and Wilkins-Ellert, 2004) and comprises two subspecies, namely the African L. siceraria ssp. siceraria and the Asian L. siceraria ssp. asiatica (Kobiakova, 1930; Schlumbaum and Vandorpe, 2012). Although bottle gourd is native to Africa, the species is grown worldwide owing to its abundant genetic and morphological variation, which allows adaptation to diverse growing environments (Erickson et al. 2005).
To date, limited genomic resources have been developed for bottle gourd germplasm characterization. This has, to some extent, limited breeding efforts to determine heterotic groups for hybrid development, release, and commercialization of bottle gourd cultivars with attributes desired by farmers, consumers, and the food and pharmaceutical industries.
Also, quantitative trait loci controlling the expression of key qualitative and quantitative traits remain largely unexplored in bottle gourd, partly owing to the limited development of genomic resources. In the present study, we applied GBS, which yielded 12,766 SNP markers distributed across the 11 chromosomes of bottle gourd. The purpose of this study was therefore to determine the genetic relationships and population structure in a collection of bottle gourd accessions from Chile, Asia, and South Africa using the newly developed SNP markers and mining of SSR loci derived from GBS data. (Table S1).

GBS sequencing, reads clustering and SNP calling
Genomic DNA of the 25 accessions was extracted from young leaves collected from three-week-old seedlings using the QIAGEN DNeasy Plant Mini Kit (QIAGEN; https://www.qiagen.com) following the manufacturer's instructions. DNA quality was evaluated by agarose gel electrophoresis, and DNA was quantified fluorometrically using a Qubit 2.0 and the Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific; https://www.thermofisher.com/). The genotyping-by-sequencing data were generated following the method of Elshire et al. (2011) with the following changes: 100 ng of genomic DNA and 3.6 ng of total adapters were used, the genomic DNA was digested with the ApeKI enzyme, and the library was amplified with 18 PCR cycles. After PCR, the pooled products were purified and quantified for sequencing on an Illumina HiSeq 2000 flow cell.
Reads and tags (fastq) found in each sequencing lane from 96 barcodes amounted to a total of 485 million read pairs, with an average high-quality read pair count of 18.5 million. The reads from both ends of the paired-end data were combined into individual per-sample files and aligned to the reference genome of the bottle gourd inbred line USVL1VR-Ls (Wu et al. 2017) using bowtie2. The --sensitive preset with end-to-end mapping was used, and the sorted alignments were subsequently used for SNP calling with the Stacks 2.5 pipeline (http://catchenlab.life.illinois.edu/stacks/). Alignment and merging resulted in a total of 71,212 called SNPs.
After removing lines with failed data, the GBS data from the 25 accessions were stored in Variant Call Format version 4.1 (Danecek et al. 2011). Genotyping-by-sequencing datasets typically have high rates of missing data (Poland and Rife, 2012). The linkage disequilibrium k-nearest neighbor imputation method (Money et al. 2015) was used to impute missing values in this dataset. The SNPs were then filtered to retain only those with a minor allele frequency > 0.05 and < 25% missing data, resulting in 12,766 high-quality polymorphic SNPs. The SNP calling was performed using TASSEL version 5.2 in the GBS pipeline (Glaubitz et al. 2014).
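The filtering step can be illustrated with a minimal sketch. It assumes genotypes are held in a samples-by-SNPs NumPy array coded as alternate-allele counts (0, 1, 2) with NaN for missing calls; the function name `filter_snps`, the coding scheme, and the demo matrix are illustrative assumptions and are not taken from the TASSEL or Stacks pipelines used in the study. Imputation is not performed here.

```python
import numpy as np

def filter_snps(genotypes, maf_min=0.05, max_missing=0.25):
    """Return a boolean mask over SNPs (columns) that pass both thresholds.

    genotypes: samples x SNPs array of alternate-allele counts (0, 1, 2),
    with np.nan marking missing calls (an assumed coding, see lead-in).
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    n_called = called.sum(axis=0)
    missing_rate = 1.0 - n_called / g.shape[0]
    # Alternate-allele frequency among called diploid genotypes.
    alt_count = np.nansum(g, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        alt_freq = alt_count / (2.0 * n_called)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    # SNPs with no calls have NaN MAF, which compares as False and is dropped.
    return (maf > maf_min) & (missing_rate < max_missing)

# Hypothetical toy matrix: 4 samples x 4 SNPs.
demo = np.array([
    [0, 0, np.nan, 0],
    [1, 0, np.nan, 0],
    [2, 0, 1,      0],
    [1, 0, 2,      1],
])
mask = filter_snps(demo)
```

In this toy matrix the second SNP is monomorphic and the third has 50% missing calls, so only the first and fourth columns are retained.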

Analysis of genetic diversity parameters and molecular variance
Genetic diversity of the 25 bottle gourd accessions was analyzed with the 12,766 SNP markers using the poppr package in R (Kamvar et al. 2014). The filtered SNPs were used to calculate genetic diversity parameters: minor allele frequency (MAF), polymorphic information content (PIC), expected heterozygosity (He), and observed heterozygosity (Ho). These analyses were carried out in R. The PIC value of an l-allele locus can be calculated as:
$$\mathrm{PIC} = 1 - \sum_{i=1}^{l} P_i^{2} - \sum_{i=1}^{l-1} \sum_{j=i+1}^{l} 2 P_i^{2} P_j^{2}$$
where P_i and P_j are the population frequencies of the ith and jth alleles.
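As an illustration of these parameters, the sketch below computes MAF, He, Ho, and PIC for a single biallelic SNP. It assumes the widely used definition PIC = 1 - sum(P_i^2) - sum over i<j of 2 P_i^2 P_j^2, which for two alleles with frequencies p and q reduces to He - 2 p^2 q^2 with He = 1 - p^2 - q^2. The function name `snp_diversity` and the 0/1/2 genotype coding are assumptions made for this example; the study itself used the poppr package in R.

```python
import numpy as np

def snp_diversity(calls):
    """Diversity statistics for one biallelic SNP.

    calls: 1-D sequence of alternate-allele counts (0, 1, 2), np.nan = missing.
    """
    g = np.asarray(calls, dtype=float)
    g = g[~np.isnan(g)]                # keep called genotypes only
    p = g.sum() / (2.0 * g.size)       # alternate-allele frequency
    q = 1.0 - p
    he = 1.0 - p**2 - q**2             # expected heterozygosity (gene diversity)
    pic = he - 2.0 * p**2 * q**2       # biallelic form of the PIC formula
    ho = float(np.mean(g == 1))        # observed share of heterozygous calls
    return {"MAF": min(p, q), "He": he, "Ho": ho, "PIC": pic}

# Hypothetical genotype calls for four accessions at one SNP.
stats = snp_diversity([0, 1, 2, 1])
```

For a biallelic marker, PIC cannot exceed 0.375 (reached at p = q = 0.5), which is useful context when judging how informative a SNP panel is.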
Analysis of molecular variance (AMOVA) was carried out using the poppr package in R to detect population differentiation (Excoffier et al. 1992). Transitions/transversions and the percentage of heterozygous positions were determined using SNiPlay3 (Dereeper et al. 2011).
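For readers unfamiliar with the transition/transversion distinction, a minimal sketch follows. A substitution is a transition when both alleles are purines (A, G) or both are pyrimidines (C, T), and a transversion otherwise. The function names are hypothetical and this is not the SNiPlay3 implementation.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(ref, alt):
    """Label a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a valid single-base substitution: {ref}>{alt}")
    # Same chemical class on both sides means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_counts(pairs):
    """Count transitions and transversions over (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in pairs:
        counts[classify_substitution(ref, alt)] += 1
    return counts

# Hypothetical allele pairs for illustration.
example = ts_tv_counts([("A", "G"), ("C", "G"), ("A", "T")])
```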

Population structure and genetic relationship
The genetic relationships among the bottle gourd landraces were calculated from the identity-by-state (IBS) distance, which represents a kinship matrix, using TASSEL 5.2 (Bradbury et al. 2007). Population structure was inferred with the Markov chain Monte Carlo (MCMC) algorithm for the generalized Bayesian clustering method implemented in the Structure software (Pritchard et al. 2000). Ten independent runs of MCMC sampling were performed for each number of groups (K parameter), with K varying from 2 to 5. For each run, the initial burn-in period was set to 10,000 with 110,000 MCMC iterations, under the non-admixture model and with prior information on each individual's origin. The optimal value of K was estimated from the second-order rate of change of the probability function with respect to K (ΔK), as proposed by Evanno et al. (2005).
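The ΔK statistic can be sketched as follows, assuming the common formulation in which ΔK is the absolute second-order difference of the mean log-likelihood, |L(K+1) - 2L(K) + L(K-1)|, divided by the standard deviation of L(K) across replicate runs. The function name and the log-likelihood values are hypothetical and are not results from this study.

```python
import statistics

def evanno_delta_k(lnp_by_k):
    """Compute delta K from replicate log-likelihoods.

    lnp_by_k: dict mapping K -> list of ln P(D) values, one per replicate run
    (at least two runs per K so that a standard deviation exists).
    """
    ks = sorted(lnp_by_k)
    mean = {k: statistics.mean(lnp_by_k[k]) for k in ks}
    sd = {k: statistics.stdev(lnp_by_k[k]) for k in ks}
    delta = {}
    for k in ks:
        # Delta K needs both neighbours, so the smallest and largest K are skipped.
        if (k - 1) in mean and (k + 1) in mean and sd[k] > 0:
            second_order = abs(mean[k + 1] - 2.0 * mean[k] + mean[k - 1])
            delta[k] = second_order / sd[k]
    return delta

# Hypothetical values: two replicate runs for each K from 2 to 5.
runs = {
    2: [-1000.0, -1002.0],
    3: [-900.0, -902.0],
    4: [-890.0, -892.0],
    5: [-885.0, -887.0],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
```

Because the statistic relies on the neighbouring values K-1 and K+1, it is only defined for the interior values of the tested range.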

Mining of simple sequence repeats markers
The Illumina raw read data were preprocessed to generate clean reads and then analyzed using the core Stack pipeline of the Stacks v.2.5 software with default parameters. Each consensus sequence resulting from the Stack pipeline was then screened for simple sequence repeats (SSRs) using the GMATA package with default parameters (Wang and Wang, 2016). The acquired SSRs were considered to represent only those containing perfect repeats whose basic motifs ranged from 2 to 6 bp, with a defined minimum of five repeat units for di-, tri-, tetra-, penta-, hexa- and heptanucleotide repeats.
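The criteria for perfect SSRs can be illustrated with a small regular-expression sketch: motifs of 2 to 6 bp repeated at least five times in tandem. This is not the GMATA algorithm; the function name `find_perfect_ssrs` and the test sequence are assumptions for illustration. The lazy quantifier makes the pattern try the shortest motif first at each start position, and homopolymer runs are discarded.

```python
import re

# A 2-6 bp motif followed by at least four more exact copies (>= 5 units total).
SSR_PATTERN = re.compile(r"([ACGT]{2,6}?)\1{4,}")

def find_perfect_ssrs(sequence):
    """Return (start, motif, repeat_count) for each perfect SSR found."""
    seq = sequence.upper()
    hits = []
    for match in SSR_PATTERN.finditer(seq):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        repeats = len(match.group(0)) // len(motif)
        hits.append((match.start(), motif, repeats))
    return hits

# Hypothetical consensus sequence: (AT)6, (AAT)5 and a too-short (AG)3.
consensus = "GGC" + "AT" * 6 + "CCG" + "AAT" * 5 + "TTGCA" + "AG" * 3
ssrs = find_perfect_ssrs(consensus)
```

A production tool would also handle ambiguous bases, compound repeats, and reverse-complement motif grouping (for example AAT/ATT), none of which this sketch attempts.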

GBS Analysis
Genome sequencing of the 25 bottle gourd accessions using GBS generated a total of 485 million read pairs, with an average read pair count of 18.5 million. The reads of each of the 25 samples were mapped to 'Lagenaria siceraria var. USVL1VR-Ls'. In the GBS analysis, a total of 71,212 called and unfiltered SNPs were detected as raw SNP markers. Of these, 12,766 SNPs remained after filtering and were distributed across the eleven chromosomes of L. siceraria. The numbers of homozygote and heterozygote SNP loci ranged from 9,865 (CLS-013) to 10,594 (CLS-024), with an average of 10,194, and from 2,334 (CLS-024) to 3,063 (CLS-013), with an average of 2,734, respectively (Table 1). The average homozygote rate was approximately 78.9%, and the average heterozygote rate was 21.1% (Fig. 1). Transversion SNPs (62%, 37,790 SNPs) were more frequent than transition SNPs (38%, 23,141 SNPs). Of these, the C/G transversion (38.6%) accounted for the highest frequency, whereas C/T transitions (19.2%) occurred at the lowest frequency among all the 60,931 SNPs (Fig. 2).
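The reported percentages follow directly from the counts given above, as the short arithmetic check below shows. Only figures quoted in the text are used; the variable names are arbitrary.

```python
# Counts quoted in the text.
transversions = 37_790
transitions = 23_141
mean_homozygous = 10_194
mean_heterozygous = 2_734

total_classified = transversions + transitions            # SNPs with a Ts/Tv label
tv_pct = 100.0 * transversions / total_classified          # share of transversions
ts_pct = 100.0 * transitions / total_classified            # share of transitions
ts_tv_ratio = transitions / transversions

mean_loci = mean_homozygous + mean_heterozygous
hom_rate = 100.0 * mean_homozygous / mean_loci             # average homozygote rate
het_rate = 100.0 * mean_heterozygous / mean_loci           # average heterozygote rate

print(f"Ts/Tv classified SNPs: {total_classified}")
print(f"transversions {tv_pct:.1f}%, transitions {ts_pct:.1f}%, Ts/Tv {ts_tv_ratio:.2f}")
print(f"homozygote rate {hom_rate:.1f}%, heterozygote rate {het_rate:.1f}%")
```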
The average PIC value across all the markers and chromosomes was 0.26, whereas the observed heterozygosity ranged from 0.15 to 0.22, with an average of 0.18. The expected heterozygosity ranged between 0.15 and 0.16, with a mean of 0.16. Minor allele frequency (MAF) ranged between 0.21 and 0.242, with an average of 0.23. The highest PIC and MAF were on chromosome ten, whereas the lowest were on chromosome eight (Table 1). According to the phi-statistics, there was relatively high differentiation between the different levels of comparison. The lowest differentiation was reported among samples within the same population or geographical origin (25%). Substantial differentiation between populations was reported (36%). However, 52% of the differentiation occurred within samples (Table 2).
(Table 4). Dinucleotides and trinucleotides were identified as the most abundant SSR classes, representing 95.49% of the SSR motif classes. The repeat motif AT/AT (26,274) was the most frequent dinucleotide SSR, representing 37.71% of the total dinucleotides, and the repeat motif AAT/ATT (7,592) was the most frequent trinucleotide SSR, representing 31.08% of the trinucleotides (Fig. 5).
Expected heterozygosity is usually preferred for assessing genetic diversity because it is less sensitive to sample size than observed heterozygosity (Chesnokov and Artemyeva, 2015).

Conclusions
The present study genotyped bottle gourd accessions of diverse origins using newly developed single nucleotide polymorphism markers. A total of 12,766 SNP markers were generated using genotyping-by-sequencing and were classified as moderately to highly informative. Low genetic differentiation was observed among the assessed bottle gourd accessions using the SNP markers. Random mating was found to predominate over inbreeding in the assayed bottle gourd population. Two genetically differentiated groups were identified: one comprising South African accessions and an admixed group with genotypes of Asian and Chilean origin. The SSR loci mined from GBS data should be developed into markers and validated before being used in diverse bottle gourd accessions. The SNPs developed in the present study are a useful genomic resource for bottle gourd breeding aimed at developing genetically improved genotypes for diverse uses, including rootstocks, food, feed and medicine.
Declarations
Figure 1. Percentage of heterozygous positions of 25 L. siceraria accessions of diverse geographical origins, generated using single nucleotide polymorphism markers developed by genotyping-by-sequencing.