Genome-wide development of insertion-deletion (InDel) markers for Cannabis and its uses in genetic structure analysis of Chinese germplasm and sex-linked marker identification

Cannabis sativa L., a dioecious plant derived from China, demonstrates important medicinal properties and economic value worldwide. Cannabis properties have been usually harnessed depending on the sex of the plant. To analyse the genetic structure of Chinese Cannabis and identify sex-linked makers, genome-wide insertion-deletion (InDel) markers were designed and used. In this study, a genome-wide analysis of insertion-deletion (InDel) polymorphisms was performed based on the recent genome sequences. In total, 47,558 InDels were detected between the two varieties, and the length of InDels ranged from 4 bp to 87 bp. The most common InDels were tetranucleotides, followed by pentanucleotides. Chromosome 5 exhibited the highest number of InDels among the Cannabis chromosomes, while chromosome 10 exhibited the lowest number. Additionally, 31,802 non-redundant InDel markers were designed, and 84 primers evenly distributed in the Cannabis genome were chosen for polymorphism analysis. A total of 38 primers exhibited polymorphisms among three accessions, and of the polymorphism primers, 14 biallelic primers were further used to analyse the genetic structure. A total of 39 fragments were detected, and the PIC value ranged from 0.1209 to 0.6351. According to the InDel markers and the flowering time, the 115 Chinese germplasms were divided into two subgroups, mainly composed of cultivars obtained from the northernmost and southernmost regions, respectively. Additional two markers, “Cs-I1–10” and “Cs-I1–15”, were found to amplify two bands (398 bp and 251 bp; 293 bp and 141 bp) in the male plants, while 389-bp or 293-bp bands were amplified in female plants. Using the two markers, the feminized and dioecious varieties could also be distinguished. Based on the findings obtained herein, we believe that this study will facilitate the genetic improvement and germplasm conservation of Cannabis in China, and the sex-linked InDel markers will provide accurate sex identification strategies for Cannabis breeding and production.


Introduction
Cannabis sativa L., a member of the family Cannabinaceae, is a diploid (2n = 20) monocotyledon and one of the oldest cultivated plants. Although it originated in Central Asia, its cultivation was soon commenced worldwide for applications in folk medicine, textile fibre, oil, and recreational use [1]. Cannabis is a botanical genus of flowering plants divided into two distinct species, namely Hemp and marijuana, based on its tetrahydrocannabinol (THC) content [2]. Although Cannabis cultivation is being restricted in many countries due to its widespread usage as a recreational drug, there has been a resurgence of interest for its agronomic potential and especially its medical value; its outer and inner stem tissues can be used to prepare bioplastics and concrete-like material in construction sectors owing to the rich source of both cellulosic and woody fibres, and its metabolites exert potent bioactivities on human health especially for the treatment of pediatric seizure disorders.
Cannabis is a dioecious species, which includes both male and female flowers separated on different plants. The sex of the plants commonly affects economically relevant traits like fibre quality and cannabinoid (CBD) content. In general, male plants have a better fibre quality, while CBD content in female plants is higher than that in male plants. Therefore, an ideal ratio of male-tofemale individuals must be maintained with different production purposes to improve economic efficiency. However, it is difficult to identify the sex of plant via the mere examination of morphological traits before flowering, and DNA molecular marker technology has been considered as an accurate and reliable method for the sex identification of dioecious plants, as it is unaffected by plant growth stages [3].
Conventional breeding is considered the primary method for developing new varieties in Cannabis breeding programs. However, this process is extremely challenging and often spans several years [4]. Previous studies have indicated that advancements in molecular technologies offer several molecular breeding strategies, such as the use of molecular markers to overcome the limitations of conventional breeding [4,5]. A shift from isozyme and random amplified polymorphic DNA (RAPD) to amplified fragment length polymorphism (AFLP), simple sequence repeat (SSR), and single nucleotide polymorphism (SNP) has occurred, and these markers have been used for genetic analysis and sex identification in Cannabis [6][7][8][9][10][11][12][13][14][15]. Although different types of Cannabis molecular markers have been identified and utilized, research on Cannabis is lagging compared to other crops like rice, wheat, and maize. As a result, the density of molecular markers in Cannabis is relatively low, which is insufficient for genetic study in Cannabis, including genetic map construction, gene/ QTL mapping, and genetic analysis.
Insertion-deletions (InDels) are recognised as major sources of genetic structural variations found widely distributed across the plant genomes. InDels like SSRs are also a type of length polymorphisms originating from a single mutation event, which is generally bi-allelic and single-locus in nature. Meanwhile, InDels exhibit many desirable inherent genetic characteristics of both SNP and SSR markers, such as co-dominance, abundance, and random distribution across the genome [16]. Generally, unlike SNP, InDel markers have been considered breeder-friendly markers, with limited infrastructure requirements, and its products can be detected in regular genetics and breeding laboratories using polyacrylamide gel electrophoresis (PAGE) or simple gel-based size separation procedures. Furthermore, InDels markers are commonly amplified without stutter bands, which renders them more valuable. In a few previous studies, InDels were also found to be more polymorphic than microsatellite markers [17,18]. As a valuable complement for both SNPs and SSRs markers and owing to their significance in crop genomic studies, InDel markers have been widely identified in rice [19], barley [20], oil rapes [18,21], maize [22], and other plants [23][24][25][26], and to our knowledge, no research on genome-wide development of InDels in Cannabis has been reported so far. This knowledge gap limits the comprehensive molecular analysis of Cannabis.
China has been considered one of the putative centres of origin for Cannabis, and a region where Cannabis has been cultivated for more than 2000 years for obtaining fibre, oil, and for other purposes [27]. However, the fibre yield, fibre quality, and CBD content are vital factors limiting the development of the Cannabis industry in China, rendering significance to the genetic improvement of the Cannabis crop cultivated in China. Previous studies have shown that the genetic structure analysis of the germplasm can facilitate genetic improvement in other crops [28,29]. Until now, the genetic diversity and population structure of Cannabis were analysed using SSR and ISSR markers [9,10,30,31]. However, in Cannabis, most SSR and ISSR markers usually display multiple loci [9,10,30,31], thereby posing challenges in the application of molecular analysis such as the comparison of genes/QTLs detected using different genetic populations in Cannabis. Alternatively, the single-locus nature of InDels may help overcome this drawback of multi-locus SSRs and ISSRs. Though the draft genome sequences data were published in 2011 [32], data quality has not met the criteria for genome-wide development of InDel markers and the location of such valuable markers in the Cannabis chromosome has not been elucidated. Recently, a high-quality chromosome-scale reference genome of a drug-type strain "Purple Kush" and the hemp variety "Finola" were obtained, which enabled the genome-wide capture of InDels in the Cannabis genome [33]. With the objectives to increase the density of molecular markers of Cannabis genome and to establish a significance for SSR markers in Cannabis genomic studies, the present study focused on the genome-wide development of InDels and the application of these markers in genetic structure analysis of Chinese germplasm and identification of sex-linked marker in Cannabis. Our study results will help establish a valuable tool for the molecular analysis of Cannabis in the future, and the information on the genetic structure of the Cannabis germplasm and sex-linked marker will aid the genetic improvement and molecular breeding of Cannabis.

Distribution of InDel markers
Data on whole genomes for "Purple Kush" and "Finola" were downloaded from ftp://ftpmips.helmholtzmuenchen.de/plants/barley/public_data/. On a genome-wide basis, 47,558 InDels were identified between PK and FN in the genomic DNA sequence database (Table  S1). InDel sites varied from 4 bp to 87 bp, and the number of the InDel sites decreased markedly with an increase in the InDel length. Four InDel sites were found to be the most common InDel sites (11286), accounting for 23.7% of the total InDels (Fig. 1). Meanwhile, the distribution of the InDels on each chromosome of the FN genome was different. As shown in Fig. 2, the number of InDels on each chromosome ranged from 2177 to 5081. Chromosome 5 exhibited the highest number of InDels among the Cannabis chromosomes, while chromosome 10 exhibited the lowest. Additionally, the densities of InDels on each chromosome were different, and chromosome 9 exhibited the highest density of InDels (67.5 InDels/Mb) while chromosome 2 exhibited the lowest (44.5 InDels/Mb) (Fig. 2, Fig. 3).

Development of InDel markers for whole Cannabis genome and polymorphism analysis
In total, 47,558 InDel markers between FN and PK were successfully developed, with a density of 47.1/ Mb in the FN genome. Of these InDel markers, 31, 802 InDel markers were non-redundant based on the specificity, and its density in the FN genome was found to be 31.51/Mb (Table S1). The lengths of all primers ranged between 18 bp and 24 bp, and the product sizes ranged from 80 bp to 400 bp. Eightyfour primer pairs distributed along the chromosomes with intervals of about 10 Mb were selected to evaluate the quality of InDel markers across three Cannabis varieties (Fig. S1). The results showed that 80 primers were amplified successfully, and 38 primers exhibited polymorphisms among three varieties ("Yunma 6", "Neimengudali", "Qingdama 1"). Of all the polymorphism primers, 14 primers which exhibited two alleles among the above-mentioned three varieties were used for further study.

Genetic diversity analysis and population structure
The 14 InDel primers were used to analyse the genetic relationships of 115 accessions, and a total of 39 polymorphic bands were amplified. The PIC ranged from 0.1209 to 0.6351, with an average of 0.4109, and the gene diversity varied from 0.1243 to 0.6865, with an average of 0.4664. The average MAF was 0.6484 and ranged from 0.4478 to 0.9348 (Table 2). Thereafter, cluster analysis was conducted based on the unweighted pair-group method with arithmetic means (UPGMA) using the NTSYS-pc2.11 software. As showed in Fig. 4, at a genetic distance of 0.74, the 115 accessions were divided into two groups. Group I included 84 accessions, mainly consisting of the varieties cultivated in the northern regions of China (up to 90%). Group II included 31 accessions, and most of them were from the southern regions of China (90.3%).
In PCoA, the two main axes explained approximately 59% of the total variation, at 44 and 15%, respectively. The 115 Cannabis varieties could also be classified into two groups using the genetic similarity matrix, which was similar to cluster analysis results (Fig. 5).
Based on the 39 alleles amplified using 14 InDels, the population structure of the 115 individuals was further estimated under the Hardy-Weinberg Equilibrium using the STRUCTURE V2.3.3 software. Delta K values were plotted against K values, and the best number of clusters was obtained via the Structure Harvester platform (http://taylor0. biology.ucla.edu/structureHarvester/). As shown in Fig. 6, Delta K reached a maximum value at K = 2, which indicated that the 115 cultivars could be partitioned into two populations ( Fig. 6).
As showed in Table 1, the flowering time of 115 Cannabis genotype varied from 23 days to 125 days. Thereafter, cluster  analysis was conducted using IBM SPSS Statistic 19.0 with the longest distance method and the Euclidean distance square. As shown in Fig. 7, at an inter-class distance of 25, the 115 genotypes were divided into two groups; group 1 included 34 cultivars, which mainly originated from the southern regions of (30), and group 2 contained 81 cultivars, most of which were from the northern regions of (74), such as Northwest China (15) and Northeast China (37).

Screening of sex-linked InDel markers and PCR-based verification of known-sex plants
Based on the latest report which indicated that chromosome pair 1 was the sex chromosome pair in Cannabis [34], fifteen pairs of primers evenly distributed on chromosome 1 were designed and used to amplify twelve samples (six females and six males) from the F 2 population crossed by "Yunma 6" and "H4" (Table S2). As shown in Fig. 8a and Fig. 8d, two primers pairs (Cs-I1-10, Cs-I1-15) amplified two bands in male plants (251 bp and 398 bp; 293 bp and 141 bp), while one band (398 bp; 293 bp) was amplified in female plants.
To further verify the versatility and accuracy of the two primers pairs, samples from 24 known-sex plants from the dioecious variety, "H4", and 10 known-sex plants from the feminized variety, "ZY1" were used for amplification via PCR, respectively. The results showed that 12 female plants showed amplification of the 398bp fragment, while 12 male plants showed amplification of the two bands (398 bp and 251 bp in size) (Fig. 8b). Consistent with the amplification fragment in female plants of "H4", all plants from "ZY1" showed

Discussion
Although different types of molecular markers, such as RAPD, ISSR, SSR, SNP and ARFP, have been used in the molecular biology studies conducted on Cannabis, such as genetic diversity analysis, sex identification, and QTL mapping [9,10,12,15,30,31], these molecular makers remain fewer in number compared with those available for other crops, which poses challenges for genetic map construction and QTL mapping. In addition, a genomewide survey of InDels has not yet been carried out for Cannabis. In this study, 31,802 InDel markers were identified in the Cannabis genome, and the average density across the FN genome was 0.031 InDels/kb (Table S1), which was much less compared to that found in other species such as rice, oilseed rape, maize and cotton [18,22,35,36]. Molecular analyses like map-based gene cloning, GWAS, and MAS, rely on the availability of several genetic markers with detailed information of their position on the genome. The PCR-based InDel markers are extensively applied during initial mapping to identify unknown genes in rice, maize, wheat, and other crops [22,[37][38][39][40]. However, due to a lack of availability of chromosome-scale genome assembly, information about their physical position on the chromosome is not available [9][10][11][12][13][14][15]30], which hinders the comprehensive molecular analysis of Cannabis. In this study, 26,982 InDel markers were developed with a density of 26.7 InDels/ Mb. Notably, the exact physical positions of all identified InDels on the Cannabis genome were also determined, rendering it convenient to identify InDel markers in target genome regions, which, in turn, would help accelerate map-based cloning and marker-assisted trait selection research in Cannabis.
To analyse the population structure of the 115 Cannabis germplasms from the varieties cultivated in China, 84 InDels distributed along the Cannabis chromosomes with intervals of approximately 10 Mb were selected for the polymorphism analysis, and 38 InDels were found to exhibit polymorphism among three accessions. The polymorphism rate was 45.2%, similar to the extent in chickpea (46.6%) [41], lower than that found in jute (58%) [26], and higher than that in maize (18.68%) [22], which indicated that the polymorphism rate might relate to the plant species. Additionally, of the 36 InDels, 14 InDels amplifying only two fragments were selected for the genotyping of the 115 accessions. The PIC values ranged from 0.1209 to 0.6351, with an average of 0.4109, indicating that most of the InDels have a moderate range of genetic diversity, lower than that of SSR  markers in Cannabis [10]. The possible reason was that most InDels used in this study are single-locus (Fig. S2), while, in general, SSRs are multi-locus. The genetic structure of different genotypes can guide breeding programs for developing varieties with a broad genetic background. The genetic diversity of the Cannabis germplasm has been analysed using two types of markers: SSR and ISSR [9,30,31]. In the present study, 39 fragments were amplified using the 14 InDels, and when Delta K was at a maximum value of 2, the 115 accessions were partitioned into two subgroups. In group 1, the sharing proportion of the cultivars of group 2 ranged from 0.011 to 0.453, while in group 2, its sharing proportion of group 1 varied from 0.011 to 0.336 (Table S3). Most cultivars from the northern regions of China belonged to Group I, while most cultivars from the southern regions belonged to Group II (Fig. 6). Similar to the results of population structure analysis, the 115 accessions were clearly clustered into two major groups using UPGMA clustering ( Fig. 4). As Cannabis is an annual and photoperiodsensitive crop, and the day length may determine the floral transition and flowering times, we suggest that the climate, influenced by the latitude and day length, is an essential factor affecting the Cannabis germplasm diversity. In this study, the 115 accessions from China were distinctly classified into two groups (Figs. 4, 5, 6 and 7), and the two groups were consistent with the temperate climate and subtropical climate zones in China, respectively, which was in agreement with the analysis of Gao et al. (2014) and Zhang et al. (2018) [9,42]. Additionally, both group I and group II included the cultivars from central regions of China like the HeNan provinces, implying that the breeders in these areas might frequently exchange Cannabis germplasm resources with the breeders from the northern or southern regions.
Cannabis is a short-day crop, which is sensitive to photoperiod. Flowering time is an important agronomic trait that affects cannabidiol (CBD) and fibre yield content.   (Fig. 7). In general, when the northern Cannabis cultivars are introduced to the southern regions, the plants will encounter early flowering. In this study, though the cultivars '22' and '214' originated from northern China, the plants did not encounter early flowering when cultivated in the southern regions of China (HuNan province), which might support the notion of a superior germplasm for developing wide adaptable Cannabis varieties according to day length. Owing to the different economic values between female and male plants, a suitable ratio of females to males individuals is vital for enhancing economic efficiency. To overcome the difficulties of the accurate identification of sex through morphological methods before flowering, eight pairs of markers mainly consisted of RAPD markers were reported for sex identification in Cannabis [11][12][13][14][15]. However, these RAPD markers had a common shortcoming of poor repeatability and dominance. Additionally, the accuracy of 8 pair markers for sex identification was only validated by using natural populations, thus limiting its application in the Cannabis breeding program [11][12][13][14][15]. In this study, the two primer pairs, Cs-I1-10 and Cs-I1-15, were screened for sex identification, and except for the natural populations, an F1-segregated population and a feminized variety were used to verify its accuracy (Fig. 8). Thus, its applications are broader than those previously reported for sex identification in Cannabis breeding program. Interestingly, similar to the sex-linked SSR markers CS308 [14], the same fragments in size were presented in both female as well as male plants using Cs-I1-10 and Cs-I1-15, indicating these markers were not specific to the Y chromosome, which was different from the markers MADC1 to MADC3 on Y chromosome [11][12][13].

Conclusion
In this study, we first developed 31,802 non-redundant InDel markers with a density of 31.5/Mb in the FN genome. Of these markers, 14 InDel markers could be used to divide the 115 Chinese Cannabis cultivars into two groups by genetic diversity analysis, population structure, and PCoA analysis. Additionally, two InDel markers, Cs-I1-10 and Cs-I1-15, related to female and male plant in Cannabis have been screened out. These genome-wide InDels and data on the genetic relationships of the Chinese Cannabis germplasm would serve useful in the further molecular analysis in Cannabis, and two sex-linked markers may provide accurate sex identification strategies at the early stage of Cannabis in production and breeding program.

Plant materials and DNA extraction
A total of 115 Cannabis accessions were collected from different regions in China and preserved in our institute. Detailed information on these cultivars is summarised in Table 1. Flowering time is the time from sowing to flowering. When more than 50% of the plants of each cultivar bloom, the flowering time was scored and listed in Table 1. Additionally, six female and six male individuals, selected from an F 2 population derived from a cross between a female "H4" plant and a male "Yunma 6" (Y6) plant, were used for the screening of sex-linked marker. Furthermore, 24 samples (12 females and 12 males individuals) from the "H4" variety and ten samples from the feminized Cannabis variety, "ZY1", were used for further validation of the sex-linked marker.

DNA extraction
The young leaves of each sample at the flowering stage were collected for DNA extraction. A Plant Genomic DNA Kit (Tiangen Biotech, Beijing, China) was used for DNA extraction. DNA quality and quantity were checked using an Eppendorf BioSpectrometer (Eppendorf, Hamburg, Germany), and the DNA was further diluted to a 10 ng/L working solution.
The DNA sequences of PK represented the reference genome, which was compared with that of FN using MUMer (http://mummer.sourceforge.net/manual/) software to capture the InDel loci (≥ 4 bp). Then, based on the InDel loci data, the primers were designed using the Primer 3.0 software (http://pgrc.ipk-gatersleben.de/misa/primer3. html). One pair of primers with the highest scoring was selected in the design results for the experiments. Furthermore, all InDel markers were checked for specificity using the TBtool software by blasting with the reference genome to avoid nonspecific amplification [43]. Only unique InDels were retained and listed in Table S1.

InDel genotyping
The 84 primer pairs evenly distributed in the FN genome were selected for polymorphism analysis. Polymerase chain reactions (PCRs) were performed using 10 μL aliquots of the reaction mixture, including 7 μL of the PCR mix solution (Qingke, Nanjing, China), 1 μL of the forward primer (10 nmol/L), 1 μL of the reverse primer (10 nmol/L), and 1 μL of the DNA template. PCR was  Table 2 and Table S2.
Genetic diversity assay and population structure Similar band types of 115 Cannabis cultivars on the electropherograms amplified using the same InDel markers were considered the same allele. Each polymorphic band detected by the same given primer represented an allelic mutation. To generate molecular data matrices, clear bands for each fragment were scored in every accession for each primer pair and recorded as 1 (presence of a fragment), 0 (absence of a fragment), and 9 (complete absence of band). PowerMarker version 3.25 was used to calculate the polymorphism information content (PIC), the number of alleles (NA), major allele frequency (MAF), and gene diversity for each InDel. A clustering map was conducted based on genetic distances and the unweighted pair group method with arithmetic mean (UPGMA) using the SM functionality of the NTSYS-pc2.10e software. Principal Coordinate Analysis (PCoA) was also performed using the NTSYS-pc2.10e software to resolve clustering patterns among genotypes. STRUCTURE v2.3.4 was used to estimate the population structure of the 115 Cannabis genotypes, and the number of the sub-population (K) was set from 1 to 10 based on admixture models and correlated with band frequencies three times. IBM SPSS Statistic 19.0 was used for cluster analysis of 115 Cannabis cultivars with the longest distance method and the Euclidean distance square based on the flowering time of each cultivar.
Additional file 1.0 Table S1. All InDel markers developed in this study.
Additional file 2. Table S2. The primers used for screenning of sexlinked InDel markers. Table S3. The genetic admixture of 115 cultivars. Fig.  S1 The physical location of 84 InDel primers on Cannabis chromosome used in this study. Fig. S2. Amplification products from 96 Cannabis cultivars using the InDel markers CS-I1-2.