Mutational Dynamics of Aroid Chloroplast Genomes II

The co-occurrence among single nucleotide polymorphisms (SNPs), insertions-deletions (InDels), and oligonucleotide repeats has been reported in prokaryote, eukaryote, and chloroplast genomes. Correlations among SNPs, InDels, and repeats have been investigated in the plant family Araceae previously using pair-wise sequence alignments of the chloroplast genomes of two morphotypes of one species, Colocasia esculenta belonging to subfamily Aroideae (crown group), and four species from the subfamily Lemnoideae, a basal group. The family Araceae is a large family comprising 3,645 species in 144 genera, grouped into eight subfamilies. In the current study, we performed 34 comparisons using 27 species from 7 subfamilies of Araceae to determine correlation coefficients among the mutational events at the family, subfamily, and genus levels. We express strength of the correlations as: negligible or very weak (0.10–0.19), weak (0.20–0.29), moderate (0.30–0.39), strong (0.40–0.69), very strong (0.70–0.99), and perfect (1.00). We observed strong/very strong correlations in most comparisons, whereas a few comparisons showed moderate correlations. The average correlation coefficient was recorded as 0.66 between “SNPs and InDels,” 0.50 between “InDels and repeats,” and 0.42 between “SNPs and repeats.” In qualitative analyses, 95–100% of the repeats at family and sub-family level, while 36–86% of the repeats at genus level comparisons co-occurred with SNPs in the same bins. Our findings show that such correlations among mutational events exist throughout Araceae and support the hypothesis of distribution of oligonucleotide repeats as a proxy for mutational hotspots.

Previously, co-existence of mutations was observed among SNPs, InDels, and repeats in prokaryotic and eukaryotic genomes (Silva and Kondrashov, 2002;Hardison et al., 2003;Tian et al., 2008;Chen et al., 2009;Zhu et al., 2009;McDonald et al., 2011). Three alternate hypotheses were suggested to explain the coexistence of mutations. First, the "regional difference hypothesis" suggests that certain regions are more prone to mutations in comparison to other regions (Silva and Kondrashov, 2002;Hardison et al., 2003). The second, "InDel-induced mutation hypothesis" was suggested based on strong association between InDels and substitutions, which suggested the recruitment of error-prone DNA polymerase at point of InDels is the cause of generation of substitutions Yang et al., 2009). The third hypothesis suggests high frequency of oligonucleotide repeats in a region of the genome generates InDels and substitutions (McDonald et al., 2011). To repair DNA damage, the existence of a high number of repeats in a region leads to the recruitment of error-prone DNA polymerases, thus the adjacent sequences replicate with a higher error rate compared to other regions (McDonald et al., 2011). Hence, instead of InDel per se, this hypothesis places more importance on "regional difference hypothesis." Associations have been reported between SNPs, repeats, InDels, and inversions (Mes et al., 2000;Lockhart et al., 2001;. The role of repeats in the generation of inversions (Kim and Lee, 2005;Whitlock et al., 2010) and InDels (Kawata et al., 1997) has also been reported. However, these observations were made on the bases of few loci instead of complete chloroplast genomes. The first study of associations among SNPs, InDels, and repeats based on genome-wide analyses of complete chloroplast genomes included five species of Araceae (Ahmed et al., 2012). That study suggests the distribution of oligonucleotide repeats could be used as a proxy for mutational hotspots. Following Ahmed et al. (2012), correlations were studied in two species of genus Cephalotaxus Siebold & Zucc. ex Endl. (Yi et al., 2013). However, authors observed very weak correlations between "InDels and SNPs" and "repeats and InDels, " whereas moderate correlation was observed between "substitutions and repeats." Recently, strong correlations were reported among these mutational events in the species of genus Dendrobium Sw. , whereas others have described weak to strong correlations in species of the plant family Malvaceae (Abdullah et al., 2020c,d). Hence, the very thorough study by Abdullah and colleagues reported correlations at the family, subfamily, and genus levels among 19 species belonging to seven subfamilies of Malvaceae .
The previous study of family Araceae was limited to five species of Araceae, including Colocasia esculenta (L.) Schott from subfamily Aroideae, which is a younger clade evolutionarily; and four species from subfamily Lemnoideae, which is among the earliest diverging aroid subfamilies (Nauheimer et al., 2012). Colocasia esculenta is found in tropical habitat and produces unisexual flowers, whereas the four species of subfamily Lemnoideae produce bisexual flowers and inhabit aquatic habitat (Mayo et al., 1997;Cusimano et al., 2011). These species also demonstrated a different rate of mutations, which is consistent with the finding that aquatic and tropical plant have diverse mutation rates (Abbasi et al., 2016;Hu et al., 2017;Hart et al., 2019;Wang et al., 2020). Sampling is therefore sparse in the previous study for a large and ancient monocot family like Araceae, which dates back to the Early Cretaceous period, and is divided into eight diverse subfamilies distributed across the multitude of ecological habitats (Cusimano et al., 2011;Nauheimer et al., 2012;Henriquez et al., 2014). This family comprises 144 genera and 3,645 species (Boyce and Croat, 2018). Recently, with the advancement of next generation sequencing, chloroplast genome sequences of several species of Araceae were reported from subfamilies Aroideae, Lasioideae, Pothoideae, Monsteroideae, Orontioideae, and Zamioculcadoideae (Han et al., 2016;Choi et al., 2017;Kim et al., 2019;Abdullah et al., 2020a,b;Henriquez et al., 2020a,b). We included 27 species from 7 subfamilies of Araceae which are diverse in term of habit, habitat, native range, and evolutionary time of divergence (Table 1 and Figure 1A). The availability of these genomic resources from a wide array of aroid species (Table 1) provided enough data to elucidate correlations among substitutions, InDels, and repeats throughout the family.
In the current study, we are interested in determining correlations among these mutational events throughout the family Araceae using genus-, subfamily-, and family-level comparisons, aka time of divergences ranged from relatively recent splits to deep divergences. This study will be helpful to understand whether such correlations exist among these five species used in Ahmed et al. (2012) by chance or whether these correlations exist among species of Araceae at varying taxonomic levels and diverse ecological habitats.

MATERIALS AND METHODS
We downloaded chloroplast genome sequences of 27 species of Araceae from GenBank of the National Center for Biotechnology Information ( Table 1). The species are high diverse in terms of habitat, geographical distribution, ecology, and evolutionary history. The species included in the comparisons range in distribution from tropical and subtropical to temperate regions of the world, such as America, Asia, and Africa (Table 1). Similarly, these species also differed in terms of habit and habitat occupying aquatic and semi-aquatic to tropical and subtropical forests ( Table 1). The sub-families diverged during Cretaceous to Miocene periods (Nauheimer et al., 2012). We selected one species per genus from all subfamilies other than subfamily Aroideae for family level comparisons. From subfamily Aroideae, we selected 9 species from the comparisons among the major clades using a previous phylogenetic inference of Araceae (Cusimano et al., 2011;Henriquez et al., 2014). We performed comparisons at the family, subfamily, and genus levels. At the family level, all the species were pairwise compared with Orontium aquaticum L. (Orontioideae) which is among the basal groups of Araceae following a previous approach applied in  The data of distributions, habit and habitat are taken from powo.science.kew.org and Mayo et al. (1997). Orontium aquaticum was used as reference at family level, Montrichardia arborescence was used as reference at subfamily level, and at the generic level, Arisaema franchetianum, Pinellia pedatisecta, Spathiphyllum patulinervum, and Symplocarpus renifolius were used as references for Arisaema ringens, Pinellia ternata, Spathiphyllum kochii, and Symplocarpus nipponicus, respectively.
family Malvaceae . At the subfamily level in Aroideae, the genome of Montrichardia arborescens (L.) Schott is used as a reference for the other species of subfamily Aroideae.  (Kearse et al., 2012) was used for the pairwise alignment in all comparisons after removal of long inverted repeat regions following Ahmed et al. (2012). We also deleted ycf 1 and rps15 genes along with intergenic-spacer regions, as these genes jump between small single-copy and inverted-repeat regions, hence present the problem of rate heterotachy (Lockhart et al., 2006;Abdullah et al., 2020a). Each alignment was divided into non-overlapping bins of 250 bp and deletions in the reference genome were removed from the alignment after noting their positions. This approach has been used previously (Ahmed et al., 2012;Yi et al., 2013;Abdullah et al., 2020c) to fix the coordinates positions in the reference genomes for allocations of oligonucleotide repeats. The InDels were counted manually and assigned into bins of 250 bp. The forward and reverse repeats were determined as ≥ 14 bp using REPuter (Kurtz et al., 2001) by searching for 5,000 repeats in the reference genomes at family, subfamily, and generic levels. The names of the species whose cp genomes were used as reference are mentioned above (vide infra). All the repeats with exact match located at least 10 bp away from each other were included in the analyses after excluding redundant repeats. The repeats were allocated into bins using Microsoft Excel (Redmond, United States). The numbers of substitutions were determined by a custom Pearl script and allocated into bins in Microsoft Excel.
Quantitative and qualitative approaches were used to determine the correlations among the mutational events. The normality test was first performed on the data in Minitab v.19 following Abdullah et al. (2020c). This test confirmed the non-normal distribution of mutational events (Supplementary Figures S1-S4). Hence, Spearman rho (ρ) correlations were applied on the non-normal data in Minitab v.19. The methodology described in Akoglu (2018) was used to express strength of the correlations as follows: negligible or very weak (0.10-0.19), weak (0.20-0.29), moderate (0.30-0.39), strong (0.40-0.69), very strong (0.70-0.99), and perfect (1.00). The probability (p) of significance of correlations was determined at 0.05 α level.
In the qualitative approach, we evaluated the co-occurrence of InDels with substitutions, and of repeats with InDels and substitutions following Abdullah et al. (2020c).

Correlations Among SNPs, InDels, and Oligonucleotide Repeats at the Family Level
Among 22 comparisons at the family level, the correlations between SNPs and InDels were strong for Symplocarpus renifolius and Zamioculcas zamiifolia (Lodd.) Engl., whereas were categorized as very strong in the remaining 20 comparisons ( Figure 1A). Correlations between SNPs and repeats were regarded as strong for all other comparisons except Stylochaeton bogneri Mayo, which showed moderate correlations ( Figure 1A). We recorded strong correlations between repeats and InDels in all comparisons. The average values of coefficients of correlations were recorded highest between substitutions and InDels (0.72), followed by InDels and repeats (0.48), and then by substitutions and repeats (0.44). All correlations were observed with a high significance of p < 0.0001. All the comparisons showed high similarities in correlations from basal groups to the crown group. The distributions of substitutions, InDels, and repeats in 250 bp bins are shown in Supplementary Table S1.

Correlations Among SNPs, InDels, and Oligonucleotide Repeats at the Subfamily Level
For eight comparisons within the subfamily Aroideae, strong correlations were observed among SNPs and InDels for seven comparisons, whereas a very strong correlation was observed for Anubias heterophylla Engl. (Figure 1B). We recorded strong correlations between SNPs and repeats for six comparisons, whilst moderate correlation was recorded for Aglaonema costatum N.E.Br., and weak correlation was recorded in Amorphophallus konjac K. Koch ( Figure 1B). We observed strong correlations between InDels and repeats for all comparisons (Figure 1B). The average values of correlation coefficients showed a similar pattern as observed at the familylevel comparisons: it remained highest between substitutions and InDels (0.62), followed by InDels and repeats (0.55), and then by substitutions and repeats (0.40). All correlations at the subfamily level were also observed with high significance of p < 0.0001. The distributions of substitutions, InDels, and repeats in 250 bp bins are shown in Supplementary Table S2.

Correlations Among SNPs, InDels, and Oligonucleotide Repeats at the Genus Level
We investigated interspecific correlations in four genera as representative of recent splits between species belonging to the same genera. The correlation coefficients greatly varied in these comparisons; the correlations between SNPs and InDels remained very strong between the species of genus Pinellia Ten., strong in Spathiphyllum Schott, moderate in Arisaema Mart., and negligible in Symplocarpus Salisb. (Figure 1C). The same pattern was evident for correlations between substitutions and repeats, which remained strong in Pinellia, moderate in Spathiphyllum, weak in Arisaema, and negligible in Symplocarpus ( Figure 1C). Conversely, all comparisons showed strong correlations between repeats and InDels ( Figure 1C). In these comparisons, the average values of the coefficients of correlations were found highest between repeats and InDels (0.52), followed by SNPs and InDels (0.42), and SNPs and repeats (0.31). Except Symplocarpus, correlations in all other comparisons were observed with p < 0.0001. Low significance was observed for substitutions and InDels (p = 0.024), and for substitutions and repeats (p = 0.055) in Symplocarpus. The distributions of substitutions, InDels, and repeats in 250 bp bins are shown in Supplementary Table S3.

Qualitative Analyses of the Existence of InDels With Substitutions, and of Repeats With Substitutions and InDels
In the qualitative analyses, we determined the percentages of the InDel-containing bins that co-occurred with SNPs, and of the repeat-containing bins that co-occurred with InDels and SNPs. At the family level, we observed that up to 99.47-100% of InDelcontaining bins also contained SNPs, 97.88-100% of repeatcontaining bins also showed SNPs, and up to 66.45-80.51% of repeat-containing bins also contained InDels ( Table 2).
The results at the subfamily level show high similarities with the family level. We observed 97.98-100% of InDelcontaining bins that also contained SNPs, 94.95-100% of repeatcontaining bins also contained SNPs, whereas up to 60.73-80% of repeat-containing bins also exhibited InDels ( Table 2). In genus-level comparisons, for qualitative comparisons of three among the four genera, 71.08-90.55% of InDel-containing bins exhibited SNPs, 42.66-75.16% of repeat-containing bins also contained InDels, while 36-86.51% of the repeat-containing bins also displayed SNPs. The genus Symplocarpus remained an exception, for which only 23.73% of InDel-containing bins showed SNPs, and only 20.28% of repeat-containing bins exhibited InDels, while merely 15.66% of repeat-containing bins displayed SNPs ( Table 2).

Distributions of InDels and Substitutions at Family, Subfamily, and Genus Level
At the family level, the distantly related species showed existence of a high number of substitutions and InDels with 3,430-15,459 substitutions and 456-1,156 InDels. Most of the substitutions and

DISCUSSION
We determined the extent of correlations among SNPs, InDels, and repeats in cp genomes using 27 species from 23 genera, distributed among seven of the eight subfamilies of Araceae. We performed 34 pairwise comparisons and observed strong/verystrong correlations for most of the comparisons among these mutational events, which suggests high associations between these mutational events. We removed the ycf 1 and rps15 genes, along with intergenic spacer regions, as these elements are located at the single-copy and inverted-repeat junctions-appearing in single-copy regions in some species, and in inverted repeats regions in others. Singlecopy regions undergo a different rate of mutation compared to the inverted-repeat regions, hence the same genes that occur in single-copy regions in some species and in inverted-repeats in other species undergo a phenomenon known as rate heterotachy (Lockhart et al., 2006). We previously reported the effect of rate heterotachy in Araceae . Single nucleotide polymorphisms, InDels, and oligonucleotide repeats did not follow the normal distribution curves in normality tests using Minitab v.19. These observations are in agreement with previous reports of chloroplast genomes in which certain regions were found to be predisposed to mutations and reported as hotspots for mutations (Ahmed et al., 2013;Sablok et al., 2019;Abdullah et al., 2020e;Mehmood et al., 2020a,b). Ahmed et al. (2012) determined correlations among SNPs, InDels, and repeats using chloroplast genomes of two morphotypes of one species, C. esculenta, and four species of the subfamily Lemnoideae, including Lemna minor L., Wolffia australiana (Benth.) Hartlog & Plas, Wolffiella lingulata Hegelm., and Spirodela polyrhiza (L.) Schleid. Colocasia esculenta is tropical and belongs to the crown group, whereas the species of Lemnoideae are aquatic and belong to the basal group. Aquatic plants evolve faster as compared to non-aquatic, and tropical plants evolve faster as compared to temperate plants (Abbasi et al., 2016;Hu et al., 2017;Hart et al., 2019;Wang et al., 2020). We found higher rates of mutation in terms of substitutions and InDels in the species of Lemnoideae as compared to other species (Table 3). Hence, further exploration of these observations was required in diverse species to gain insight into correlations among mutational events as sparse sampling of taxa is evident in the previous study of Ahmed et al. (2012). In order to cover the taxa across the family tree, here we include species spanning seven of the eight subfamilies of Araceae and used 34 comparisons among 27 diverse species in terms of habit, habitat, and evolution.
At the family and subfamily levels, most of the comparisons exhibited strong/very strong correlations among "SNPs and InDels, " "SNPs and repeats, " and "InDels and repeats." Hence, our study confirms strong correlations among mutational events in close comparisons (subfamily level) and distant comparisons (family level). Here, the high similarity among mutational events in diverse species in terms of geography, ecology, and time of divergence (Table 1 and Figure 1A) demonstrates that the correlations are unaffected by the geographical distribution, habit, and habitat. Weak correlations in generic-level comparisons, however, may be due to fewer SNPs and InDels in recently diverged species within the same genera. Strong correlations have also been reported in the family Malvaceae . At the genus level, we observed very weak to strong correlations among mutational events. Similar results were reported in the family Malvaceae at the genus level . Here, very weak correlations were recorded between the species of Symplocarpus. The species of Symplocarpus showed closed resemblance and revealed the presence of few substitutions (89) and InDels (70). Hence, the weak correlations might be due to recent divergence of these species from each other. Similar results were observed in the closely related species of Theobroma L.  and Cephalotaxus (Yi et al., 2013). Previously, Tian et al. (2008) suggested InDels as mutagens, whereas McDonald et al. (2011) suggested the role of repeats in the generation of InDels and SNPs. However, they considered the recruitment of error-prone DNA polymerases during replication to be the cause of high mutations due to errors in replications. Therefore, in closely related species InDels and repeats might not have enough time to generate substitutions. Moreover, correlations between "InDels and repeats" were found to be higher than correlations between "SNPs and InDels" and "SNPs and repeats" in three out of four comparisons. Similar results were previously observed in family Malvaceae, where four of the five comparisons showed high correlation between "InDels and repeats" as compared to "SNPs and InDels" and "SNPs and repeats" . These observations at the genus level suggest that most of the InDels are generated by repeats first, and then both InDels and repeats contribute to the generation of SNPs over a period of time.
The quantitative analyses showed very strong correlations between SNPs and InDels in most cases, whereas the qualitative analyses revealed the occurrence of more than 90% of InDels containing bins with SNPs. Previously strong associations were also observed among SNPs and InDels in prokaryotic, eukaryotic, and chloroplast genomes Zhang et al., 2008;Chen et al., 2009;Yang et al., 2009;Abdullah et al., 2020c;Li et al., 2020). The InDels were suggested as a mutagen for the generation of SNPs based on the observation of high association between InDels and SNPs in prokaryotic and eukaryotic genomes Zhu et al., 2009). Our analyses lend support to these previous results. Chloroplast genomes originate from prokaryotes and decrease in size by loss of genomic portions along with several genes (Palmer, 1985) but still reveal high associations between SNPs and InDels. Abdullah et al. (2020c) reported weak to moderate correlations between "SNPs and repeats" and "InDels and repeats" in most of the comparisons in the plant family Malvaceae. However, based on qualitative analyses, they reported the existence of up to 60% of repeats with InDels and up to 90% of repeats with SNPs. In the current study, we report strong correlations between "InDels and repeats" and "SNPs and repeats" based on quantitative analyses in the family Araceae, whereas based on qualitative analyses we observed the existence of up to 100% of repeats with SNPs and up to 80% of repeats with InDels. The variation in the results might be due to the inclusion of a copy of inverted repeats in comparisons of family Malvaceae as the inverted-repeats region showed less polymorphism due to copy-dependent repair mechanisms (Zhu et al., 2016). Here we excluded one copy of the Inverted repeats from our comparisons, following previous studies (Ahmed et al., 2012;Yi et al., 2013). A high frequency of repeats has previously been considered the cause of generations of SNPs and InDels in the adjacent regions based on strong associations between "InDels and repeats" and "SNPs and repeats" in prokaryotic and eukaryotic genomes (McDonald et al., 2011). Here, our analyses in a wider sampling of species of Araceae and the previous report of Malvaceae  also support the role of repeats in the generation of InDels and substitutions, and supports the hypothesis that oligonucleotide repeats can be used as a proxy for identification of mutational hotspots (Ahmed et al., 2012;Abdullah et al., 2020c). This hypothesis has practical implications in selecting appropriate loci for comparative analyses. No one single locus is good enough for evolutionary comparisons at all time scales; slow evolving regions should be preferred for deep divergences, while mutational hotspots for the closely related taxa and recently diverged species (Ahmed et al., 2013;Ahmed, 2015;Li et al., 2020). A recent report of Ahmed et al. (2020) on family Araceae showed the practical implication of the use of repeats in identification of suitable polymorphic loci for the study of phylogeography and population genetics. Their developed markers from the identified loci providing new insight about the origin of Colocasia esculenta in southeast Asia instead of Papua New Guinea . Our current results support strong associations between repeats and substitutions and repeats and InDels in Araceae, which can be helpful for identifying species-specific suitable loci for the study of phylogeography, domestication, and population genetics of other species of Araceae.
In conclusion, the previous observations in five aroid species were not an artifact of low sampling but a representative sample of the correlations found at various taxonomic levels, and in ecologically, geographically and evolutionarily of Araceae. The strong associations of InDels with SNPs, and of repeats within InDels and SNPs, support the previous observation (Ahmed et al., 2012) that the multiple hypotheses outlined in the introduction (vide infra) might explain the mutational dynamics of chloroplast genome evolution. The strong associations among the three types of mutational events reported in prokaryotic, eukaryotic Zhang et al., 2008;Chen et al., 2009;McDonald et al., 2011), and chloroplast genomes (Ahmed et al., 2012;Abdullah et al., 2020c;Li et al., 2020), show that such co-occurrence of mutations might be a universal phenomenon in all types of genomes. Further studies in prokaryotes and eukaryotes are needed to test this hypothesis.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. This data can be found here: All the accession numbers are given in Table 1. Moreover, result of analyses are also provided in the article or as Supplementary Material.

AUTHOR CONTRIBUTIONS
A: data analyses, data interpretation, writing initial draft, and conceptualization. CH: data analyses, review and editing of initial draft. TC: data interpretation and conceptualization. PP and IA: conceptualization, review, editing of initial draft, and supervision. All authors contributed to the article and approved the submitted version.