Detecting clonemate pairs in multicellular diploid clonal species based on a shared heterozygosity index

Clonal reproduction, the formation of nearly identical individuals via mitosis in the absence of genetic recombination, is a very common reproductive mode across plants, fungi and animals. To detect clonal genetic structure, genetic similarity indices based on shared alleles are widely used, such as the Jaccard index, or identity by state. Here we propose a new pairwise genetic similarity index, the SH index, based on segregating genetic marker loci (typically single nucleotide polymorphisms) that are identically heterozygous for pairs of samples (NSH). To test our method, we analyse two old seagrass clones (Posidonia australis, estimated to be around 8500 years old; Zostera marina, >750 years old) along with two young Z. marina clones of known age (17 years old). We show that focusing on shared heterozygosity amplifies the power to distinguish sample pairs belonging to different clones compared to methods focusing on all shared alleles. Our proposed workflow can successfully detect clonemates at a location dominated by a single clone. When the collected samples involve two or more clones, the SH index shows a clear gap between clonemate pairs and interclone sample pairs. Ideally NSH should be on the order of approximately ≥3000, a number easily achievable via restriction‐site associated DNA (RAD) sequencing or whole‐genome resequencing. Another potential application of the SH index is to detect possible parent–descendant pairs under selfing. Our proposed workflow takes advantage of the availability of the larger number of genetic markers in the genomic era, and improves the ability to distinguish clonemates from nonclonemates in multicellular diploid clonal species.

apomixis (plants), and gynogeneis or pseudogamy (De Meeûs et al., 2007;Orive & Krueger-Hadfield, 2021). In clonal species, a module or ramet (Harper & White, 1974) is the basic modular unit with an ability to live independently, such as an individual shoot in grasses or a single polyp in corals. The product of a single zygote from gamete fusion (Noble et al., 1979) collectively forms a genet (synonymous with clone), which comprises a population of ramets (here also clonemates) that emerge via mitosis during its lifetime.
The majority of clonal species can conduct both clonal and sexual reproduction (Schön et al., 2009;Stoeckel, Arnaud-Haond, & Krueger-Hadfield, 2021;Stoeckel, Porro, & Arnaud-Haond, 2021), and thus have two levels of biological organization. The asexual population of ramets belonging to a single clone/genet is nested into the population of clones/genets , making the proper detection of the clonemates vs. nonclonemates an important starting point to study the ecology or evolution of such species.
As clonemates are often phenotypically undistinguishable, they can no longer be physically tracked once they obtain autonomy and break off the parental module, which is the case in most clonal species including seagrasses, corals, algae or fungi. Hence, genetic markers such as microsatellites or AFLPs have been widely used for detecting clonemates (Arnaud-Haond et al., 2007). Genotyping with many codominant molecular markers such as microsatellites has been particularly useful in the past, as any given multilocus genotype (MLG) can be assigned a likelihood to emerge by chance (i.e., through sexual reproduction and recombination). However, identical or nearly identical MLGs are not definitive evidence for clonal reproduction (Halkett et al., 2005), if for example there is low (or no) allelic diversity at particular loci. To calculate the probability of finding identical MLGs resulting from distinct zygotes, the population frequencies of segregating alleles are required, and the population is assumed to be under Hardy-Weinberg equilibrium (Parks & Werth, 1993). Estimates of such population-level frequencies are inaccurate in populations dominated by a few clones (Arnaud-Haond et al., 2007), and meaningless if a site is dominated by one clone (Arnaud-Haond et al., 2012;Edgeloe et al., 2022;Japaud et al., 2015;Reusch et al., 1999;Smith et al., 1992).
An alternative suite of approaches is the analysis of a pairwise genetic distance or similarity matrix, which can identify clonemates with slightly different MLGs, and also works with dominant/recessive markers such as AFLPs (Douhovnikoff & Dodd, 2003). The genomes of clonemates are identical by descent (IBD), except at sites with novel somatic mutations, as the genomes are inherited from the same asexual ancestor via mitosis without recombination. Because somatic mutations are rare, clonemates are predicted to reveal a smaller genetic distance (or higher levels of genetic similarity) than ramets from different clones. This will lead to a bimodal pattern in the frequency distribution of pairwise genetic distances/similarities.
A threshold can then be defined to separate the clonemates vs. interclone ramet pairs (Bailleul et al., 2016;Douhovnikoff & Dodd, 2003;Meirmans, 2020). The currently available genetic similarity indices (e.g., Jaccard index, identity by state) are based on shared alleles. However, when genetic diversity is low in a population, shared similarity may be high and the distinction in similarity is blurred between clonemates and nonclonemates. Here, we propose a novel pairwise genetic similarity index, the shared heterozygosity (SH) index, which describes the number of shared heterozygous sites divided by the number of heterozygous sites observed in the individual with the higher count. The SH index includes only heterozygous loci under the logic that heterozygous sites are the most informative ones for detecting clonemates because they show the potential for segregation in sexual reproduction, while remaining identically heterozygous in clonal reproduction.
We applied our workflow to two of the oldest seagrass clones in marine systems detected so far, a Zostera marina clone in Finland (>750 years old, Yu et al., 2020) and a Posidonia australis clone in Australia (estimated to be around 8500 years old, Edgeloe et al., 2022), which probably represented the extreme end of accumulating somatically generated variation. We also applied the workflow to two 17-year-old Z. marina clones cultured in outdoor flow-through mesocosms, representing the recent accumulating somatically generated variation. We examined whether our method could reliably detect clonemates based on seagrass ramets belonging to one, two and multiple (more than two) clones, respectively.

| Introducing the SH index for pairs of samples
For two samples (X 1 and X 2 ) in multicellular diploid clonal species, N SH denotes the number of the genetic markers where two samples are identically heterozygous; N Het (X n ) denotes the number of the heterozygous loci in X n .
When technical replicates are available, the clonemate pairs can be identified based on the closeness of their SH relative to technical replicates. If technical replicates are unavailable, a threshold needs to be set using minimal representations of intraclone SH values ( Figure 1). Edgeloe et al. (2022) reported a P. australis clone spanning at least 180 km, which was estimated to be around 8500 years old based on vegetative expansion rate. Although the clone was polyploid, the single nucleotide polymorphisms (SNPs) were called assuming the clone was diploid. The SNP data set containing 18,021 SNPs for 133 individuals (including four technical replicates) from 10 populations

| Detecting clonemates based on a published data set on seagrass P. australis
was used here (named as "all.vcf" in the Dryad repository https://doi. org/10.5061/dryad.0cfxp nw4p). We calculated the SH index along with N SH for each pair of samples based on the nucleotide positions in the reference genome where both samples provided genotype calls using a Python script (https://github.com/leiyu 37/Detec tingclone mates.git).

| Jaccard similarity index based on shared alleles vs. shared heterozygosity using the data set on seagrass P. australis
We assessed the advantage of focusing on SH by comparing a Jaccard similarity index calculated in two different ways (https:// github.com/leiyu 37/Detec ting-clone mates/ tree/main/res/Jacca rd_simil arity). We first calculated the Jaccard similarity index based on the collection of all the alleles present in each sample across all the loci. To assess the advantage of focusing on shared heterozygosity, we also calculated the index based on the collection of all the heterozygous genomic positions in each sample across all the loci.

| Detecting clonemates based on two Z. marina clones of known age (17 years)
Two Z. marina clones (a subsample of eight in total) were kept for 17 years in large, 300-L outdoor flow-through mesocosms at Bodega Marine Laboratory (BML) under ambient light and temperature conditions (Hughes et al., 2009). Each genet was initiated with a single terminal shoot collected from Bodega Harbour, California, in July 2004. Six shoots (= ramets) were collected from each clone for F I G U R E 1 Workflow for detecting clonemate pairs in multicellular diploid clonal species based on the SH index. Sample pairs representing technical replicates are marked in blue. When technical replicates are available, the sample pairs with the SH index close to those of the technical replicates represent clonemate pairs. If unavailable, a threshold needs to be set to separate the intra-and interclone sample pairs, and the mode close to SH = 1 represents clonemate pairs. The figures presented here are based on the data in Figure 5. genomic analysis in 2021, referred to as the "GREEN" and "RED" clone, respectively.
Genomic DNA was extracted using a NucleoSpin plant II kit from Macherey-Nagel following the manufacturer's instructions.
The plant tissue (160-200 mg fresh weight) was ground together with liquid nitrogen. Library construction and Illumina sequencing (PE150, paired-end 150 bp) were conducted by BGI (Beijing Genomics Institute). The quality of the raw Illumina reads was assessed by fastqc version 0.11.5 (https://www.bioin forma tics. babra ham.ac.uk/proje cts/fastq c/). bbduk (last modified November 7, 2019, https://jgi.doe.gov/data-and-tools/ bbtoo ls/bb-tools -user-guide/ bbduk -guide/) was used to remove adapter contamination and conduct a basic quality filtering: (i) reads with more than one N were discarded (maxns = 1); (ii) reads shorter than 50 bp after trimming were discarded (minlength = 50); (iii) reads with average quality below 10 after trimming were discarded (maq = 10). fastqc version 0.11.5 was used for a second round of quality check for the filtered reads.
The quality-filtered Illumina reads were mapped against the chromosome-level Z. marina reference genome version 3.1 (Ma et al., 2021) using bwa mem version 0.7.17 ). The alignments were converted to BAM format and then sorted using samtools version 1.11  We assessed the influence of N SH on the SH index by recalculating SH using 1/10, 1/100 and 1/1000 of the 38,831 SNPs for the 24 ramets reported in Yu et al. (2020), and the corresponding N SH was around 3000, 300 and 30, respectively. Each selection was repeated 100 times, and calculation was conducted for each pair of samples for each of the 100 repeats. N SH and the SH index from the 100 repeats were pooled for the subsequent analysis.

| Jaccard similarity index based on shared alleles vs. shared heterozygosity using the data set on seagrass P. australis
When calculated based on shared alleles, the mean value of the Jaccard similarity index for the clonemate pairs and interclone sample pairs was 0.97 and 0.64, respectively ( Figure 2a). As for the calculation based on shared heterozygosity, the mean value of the Jaccard index remained similar for the clonemate pairs (i.e., 0.94), but decreased to 0.26 for interclone sample pairs (Figure 2b). This comparison showed that focusing on shared heterozygosity could amplify the difference between intra-and interclone modes.

| Detecting clonemates based on the data set of the Finnish Z. marina clone
After re-analysing the data (Hiseq data for the 24 ramets, and Novaseq data for three of the ramets), we obtained 41,885 SNPs, and the pairwise N SH ranged from 28,619 to 32,210. The Hiseq and Novaseq data for the three ramets provided three pairs of technical replicates, and the minimum SH between technical replicates was 0.9998. The SH index between clonemates were distributed within a narrow range from 0.9548 to 0.9993, close but clearly distinct from the values for technical replicates. Our analysis confirmed that all 24 ramets belonged to the same clone ( Figure 3).

| Influence of N SH on the SH index
By calculating the SH index and N SH based on the subsets of the SNP data set in Yu et al. (2020), we found that smaller N SH led to larger variation of the SH index ( Figure 4). The standard deviation was 0.0076, 0.0107 and 0.0259 for subsets "1/10," "1/100" and "1/1000", respectively. The consistently high SH-value when N SH was at least on the order of ≥3000 suggested this number of N SH as a reasonable minimum target to apply this method.

| Detecting clonemates based on the data set of seagrass P. australis
Based on the SNP data set encompassing 18,021 SNPs for 133 in-

| Detecting clonemates based on the two 17-year-old seagrass Z. marina clones
We then applied our method to a situation of two Z. marina clones of known age (17 years old) where technical replicates were unavailable. The whole-genome resequencing data of the 12 cultured Z.
marina ramets covered >93% of the reference genome, and reached >66× coverage (Table 1). Pairwise N SH ranged from 108,585 to 288,606. The SH index showed a wide gap between the intra-and interclone modes, and the mode with SH close to 1 represented clonemate pairs (Figure 6). Within the clone "GREEN," the minimum pairwise SH was 0.9975. We detected one "contaminant" ramet (i.e., R10) in the "RED" clone. The five ramet pairs composed of R10 and

| DISCUSS ION
In many multicellular clonal species, the degree of clonality is highly variable (for seagrasses, e.g., Reusch & Boström, 2011; for corals, e.g., Miller & Ayre, 2004), including the extreme case of a single site dominated by one clonal lineage. Examples of an entire location covered by one clone include fungi (Smith et al., 1992), the seagrass F I G U R E 3 Detecting clonemate pairs using the SH index and technical replicates based on a published data set on seagrass Zostera marina. Each data point represents one pair of samples. Sample pairs representing technical replicates are marked in blue. The SH index values of all the sample pairs are close to those of the technical replicates, meaning that all samples belong to the same clone.

F I G U R E 4 Distribution of SH index values within the Finnish
Zostera marina clone based on subsets (1/10, 1/100 and 1/1000) of SNPs. The variation of SH increases when N SH decreases. The standard deviation is 0.0076, 0.0107 and 0.0259 for subsets "1/10," "1/100" and "1/1000," respectively.  (Reusch et al., 1999;Yu et al., 2020), Thalassia testudinum (Bricker et al., 2018), Posidonia oceanica (Arnaud-Haond et al., 2012) and P. australis (Edgeloe et al., 2022), and the coral genus Acropora (Japaud et al., 2015). The situation of having a population in which all sampled entities belong to the same genet is thus more widespread than previously thought. MLG-based methods cannot work in locations dominated by a few or a single clone, due to the difficulty in calculating the likelihood of an MLG based on population allele frequencies which are principally unavailable under such circumstances. An alternative way for clonemate detection is to use an index to assess the similarity of the collected samples.
Ideally, such an index demonstrates clear differences between the intra-and interclone sample pairs. In our study, focusing on shared heterozygosity amplified the power to distinguish sample pairs belonging to different clones than methods focusing on all shared alleles (Figure 2). Our SH index is similar to the Jaccard index calculated based on shared heterozygosity, with the difference being that the denominator for the calculation of the SH index is the size of the larger set, instead of the size of the union of the two sets in the Jaccard Index. However, the key here is to quantify genetic similarity based on shared heterozygosity, while the exact index can be different. As a result, the SH index is more suitable for detecting clonemates in locations dominated by a few or a single clone, and can generally improve the performance of the methods based on a pairwise genetic similarity matrix.
The SH index can potentially be used to detect possible parentdescendant pairs under selfing (Reusch, 2001), because this reproductive mode leads to loss of heterozygosity in every round of sexual reproduction. A case in point is the analysis of the 17-year-old Z. marina clones where we found one "contaminant" ramet (i.e., R10) that clearly did not belong to the clone "RED." One possible reason is that R10 was produced by sexual event between members of the clone "RED" (i.e., selfing). In the first 15 years of growth, this would be the only way for contamination to occur, as the clones were cultured in isolated containers. Under selfing, each of the heterozygous loci in the parent can either remain identically heterozygous or change to a homozygous state in R10. As a consequence, the heterozygous loci in R10 should be a subset of those in the "RED" clone, and thus SH(R10) (see Introduction for the SH index method) for the five ramet pairs composed of R10 and one of the other five "RED" ramets should be close to 1. However, that is not the case, as SH(R10) for the five ramet pairs ranged from 0.4320 to 0.4331. Hence, we can exclude the possibility that R10 is derived from selfing of the "RED" clone.
Application of the SH index requires a sufficient number of genetic markers where the two target samples are identically heterozygous (i.e., N SH ). With the continuing development of sequencing technology and decreasing costs, it is now straightforward to obtain thousands of genetic markers using whole-genome resequencing, restriction-site associated DNA (RAD) sequencing or SNP arrays.
However, the level of N SH can still be low based on genome-wide maker sets, when the target species shows very low levels of genome-wide heterozygosity. Prior to further analyses, it is advisable to verify whether or not N SH is high enough (approximately N SH on the order of ≥3000). In conclusion, our workflow takes advan- Access funding enabled and organized by Projekt DEAL.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
DNA sequence data are available in the NCBI short read archive, BioProject no. PRJNA806459, SRA accession nos. SRR18000159-SRR18000170. Custom-made computer code is available at Github: https://github.com/leiyu 37/Detec ting-clone mates.

F I G U R E 6
Detecting clonemate pairs using the SH index without technical replicates based on the two 17-year-old Zostera marina clones. A threshold needs to be set to distinguish between intra-and interclone modes, and the mode close to SH = 1 represents clonemate pairs.