Comparative analyses of 32 complete plastomes of Tef (Eragrostis tef ) accessions from Ethiopia: phylogenetic relationships and mutational hotspots

Eragrostis tef is an important cereal crop in Ethiopia with excellent storage properties, high–quality food, and the unique ability to thrive in extreme environmental conditions. However, the application of advanced molecular tools for breeding and conservation of these species is extremely limited. Therefore, developing chloroplast genome resources and high-resolution molecular markers are valuable to E. tef population and biogeographic studies. In the current study, we assembled and compared the complete plastomes of 32 E. tef accessions. The size of the plastomes ranged from 134,349 to 134,437 bp with similar GC content (∼38.3%). Genomes annotations revealed 112 individual genes, including 77 protein-coding, 31 tRNA, and 4 rRNA genes. Comparison of E. tef plastomes revealed a low degree of intraspecific sequence variations and no structural differentiations. Furthermore, we found 34 polymorphic sites (13 cpSSRs, 12 InDels, and 9 SNPs) that can be used as valuable DNA barcodes. Among them, the majority (88%) of the polymorphic sites were identified in the noncoding genomic regions. Nonsynonymous (ka) and synonymous (ks) substitution analysis showed that all PCGs were under purifying selection (ka/ks <1). The phylogenetic analyses of the whole plastomes and polymorphic region sequences were able to distinguish the accession from the southern population, indicating its potential to be used as a super-barcode. In conclusion, the newly generated plastomes and polymorphic markers developed here could be a useful genomic resource in molecular breeding, population genetics and the biogeographical study of E. tef.


INTRODUCTION
The genus Eragrostis comprises approximately 400 morphologically distinct species distributed throughout the subtropical and tropical regions of the world (Clayton et al., 2016). Eragrostis tef (Zucc.) Trotter is the sole species in the genus Eragrostis cultivated for human consumption and Ethiopia is the center of origin and genetic diversity for E. tef (Ketema, 1997). Compared to other cereal crops, E. tef is more tolerant of extreme environmental conditions and is therefore considered as lower risk crop (Assefa et al., 2015). These characteristics, together with its grain nutrition, market value, desirable storage properties, make this crop attractive to smallholder farmers (Minten, Taffesse & Brown, 2018). The grain of E. tef is also gaining global popularity as healthy and highperformance food due to its high fiber contents and gluten-free nature (Spaenij-Dekking, 2005;Chanyalew et al., 2019). The long history of cultivation and variety selection coupled with the broad agro-ecology adaptation of the crop resulted in high genetic diversity in Ethiopia (Assefa, Chanyalew & Tadele, 2017). Currently, more than 5,000 E. tef accessions collected from different geographic regions of Ethiopia are preserved in the seed gene bank of the Ethiopia Biodiversity Institute (EBI;Tesema, 2013). The conserved accessions are the main sources of genetic variations to enrich the genetic base of cultivated varieties. To establish proper conservation and efficient utilization of the plant genetic resource, understanding genetic variations between and within gene bank samples is essential (Wambugu, Ndjiondjop & Henry, 2018). However, the studies of genetic diversity among accessions of E. tef are still highly limited and one of the most important reasons is the lack of effective molecular markers (Tadele, 2018;Chanyalew et al., 2019).
Advances in biotechnology, especially in the area of molecular biology has provided some critical tools for proper conservation and use of plant genetic resources (Yuan et al., 2017). From the perspective of improving crops through modern breeding programs, molecular markers have played significant roles, especially in the determination of genetic diversity and the classification of germplasm (Majeed et al., 2015;Nadeem et al., 2017). During the last few decades, several universal molecular markers such as amplified fragment length polymorphism (AFLP) (Bai et al., 1999), simple sequence repeats (SSR) (Abraha et al., 2016), random amplified polymorphic DNA (RAPD) (Bai et al., 2000) and inter simple sequence repeat (ISSR) (Assefa, Merker & Tefera, 2004) have been used in E. tef. Moreover, first draft genome (Cannorazzi et al., 2014) and chromosome-scale genome assembly (VanBure et al., 2020) of E. tef have been made publicly available online. These molecular studies have provided some insight into E. tef population genomics and phylogenetic relationships. Furthermore, the utilization of some universal chloroplast markers for the phylogenetic studies have been reported in previous studies (Espelund et al., 2000;Ingram & Doyle, 2003). The available reports concerning the E. tef plastome sequence variability are insufficient for population genetics and biogeographic studies (Assefa et al., 2011;Assefa et al., 2015. Also, there is no valuable molecular barcoding system to discriminate and classify the conserved accessions according to their geographical regions of collection. These will have a direct effect on the conservation and the sustainable utilization of the crop. Therefore, sequencing and comparative analysis of the plastome have the potential to detect intraspecific polymorphism and provide useful molecular markers for various studies in E. tef. The plastome is commonly characterized by an extremely conserved structure and possessed a relatively slow evolutionary tempo (Greiner, Sobanski & Bock, 2015). It generally comprises a pair of inverted repeats (IRs) regions, one large single-copy (LSC) region and one small single-copy (SSC) region (Brears, Schardl & Lonsdale, 1986). Although overall plastome structure is always thought to be conserved, structural variations such as inversion (Lei et al., 2016;Kim & Cullis, 2017), gene duplication and IR boundary shifts (Zhu et al., 2016) have been detected among angiosperms. For the mutations of sequences, the single nucleotide polymorphism (SNP) and deletion or insertion (InDels) of nucleotide bases are the most common variations in the sequences of plastome (Decesare, Hodkinson & Barth, 2010;Kim et al., 2015). These variations have provided ideal information for developing polymorphic markers for numerous applications such as molecular barcoding (Okoth et al., 2016;Zong et al., 2019), phylogenetic reconstruction (Peterson, Romaschenko & Johnson, 2010), biogeographic studies (Kress et al., 2009;Xie et al., 2019), assessment of the maternal line of hybrid species (Schroeder, Höltken & Fladung, 2011;Tomar et al., 2014;Chung et al., 2019), and clarification of the evolutionary relationship between cultivated and crop wild relatives (Gao et al., 2019). Currently, with the advancement of next-generation sequencing technologies (NGS), sequencing of chloroplast genomes and the development of plastid genetic markers have become feasible in various plant genetic researches (Brozynska, Furtado & Henry, 2014;Bi et al., 2018).
In this study, the complete plastomes of 32 E. tef accessions were newly sequenced and assembled. All these plastomes were compared to examine the intraspecific chloroplast genomes sequence variability, to our knowledge, to gain the first compressive analysis of plastome structural variations and mutations across E. tef plastome. Specifically, the distribution of chloroplast simple sequence repeats (cpSSRs), single nucleotide polymorphism (SNPs) and InDels regions were investigated. Besides, phylogenetic analysis was performed to evaluate the genetic relationship of the studied accessions with their respective biogeographic distribution using variable sites detected in the present study. We also examined the impact of identified sequence variations on the evolution of proteincoding genes (PCGs). The markers could be a useful genomic resource for use in various studies such as molecular breeding, molecular barcoding, biogeography and population genetic diversity studies in E. tef.

Plant sampling and DNA extraction
A total of 32 E. tef accessions were obtained from the Ethiopian Biodiversity Institute (EBI) seed genebank. These accessions were sampled from Amhara, Benishanguz Gumuz, Tigray, Oromia, and Southern regions, representing the geographic distribution of the species in Ethiopia (Fig. 1, Table 1). Ethiopian Biodiversity Institute approved this study (EBI 712222942018). The collected leaves were dried immediately using silica gel and preserved in the refrigerator (−20 • C) until DNA extraction. Total genomic DNA was isolated from the dried leaf of each accession using the MagicMag Genomic DNA Micro Kit (Sangon Biotech Co., Shanghai, China) following the protocol given by the manufacturer. The purity and quality of the DNA were detected by electrophoresis on the 1% agarose gel. Chloroplast genome sequencing, assembly and annotation  (Frazer et al., 2004). In order to detect the IR expansions/contraction, all the annotated plastome sequences for the 32 E. tef accessions were compared to the LSC, SSC and IRs border using an online program IRscope (https://irscope.shinyapps.io/irapp/; Amiryousefi et al., 2018). All annotated plastome sequences were submitted to the National Center for Biotechnology Information (NCBI) database (accession numbers: MN780987 to MN781018).

Screening variable regions and intraspecific comparison
Considering the wide range of cpSSR markers applications in the breeding scheme, population and phylogenetic studies (Melotto-Passarin et al., 2011;Diekmann, Hodkinson & Barth, 2012;Ebrahimi et al., 2019), Firstly, we detected the location and types of cpSSRs in the plastome of E. tef accessions using MISA perl script (Beier et al., 2017). The minimum number of repeat unit was adjusted to eight, six, five, five, three, and three, for mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides, respectively. We then employed REPuter (Kurtz et al., 2001) to identify four types of large repeating sequences (reverse, forward, complement and palindromic) with a minimum repeat size of 30 bp, hamming distance equal to 3 and maximum computed repeats was set to 50 bp. To compare the cpSSR of E. tef with related species, three chloroplast genomes were chosen from sub-family Chlorodoideae including

Phylogenetic analysis
The phylogenetic trees were constructed using two data sets: (1) the complete plastome sequences of 32 E. tef accessions (2) concatenation of sequences extracted from twenty polymorphic regions (SNPs and InDels) identified in the current study. Sequence length was determine based on the designed PCR product and was tested for their performance in delineating accessions based on their phylogeographic origin. Before the phylogenetic tree construction, one copy of the IR was removed from the complete chloroplast genome. All sequences alignment was accomplished using MAFFT 7 (Katoh, Rozewicki & Yamada, 2017) plugin in Phylosuite 1.2.1 . The phylogenetic analyses were performed using maximum likelihood (ML) and Bayesian inference (BI). ModelFinder (Kalyaanamoorthy et al., 2017) was used to select the best-fit model with default setting and the maximum likelihood (ML) analysis was performed using IQ-TREE 1.6.12 (Trifinopoulos et al., 2016) with 1000 bootstrap replications. The BI analysis was performed by MrBayes 3.2.6 ( Ronquist et al., 2012), with a total of 2,000,000 generations set to perform the analysis. Four chains run with sampling after every 3000 generations and the first 25% trees were discarded as burn-in, and the remaining trees were constructed a majority rule consensus tree.

Analyses of signatures of selection
To detect the evidence of selective acting in mutational PCGs, the ratio of nonsynonymous (ka) to synonymous (ks) substitution (ka/ks) of mutational PCGs were calculated using DnaSP version. (Librado & Rozas, 2009

Feature of sequenced E. tef plastomes
The size of the complete plastome sequences of E. tef ranged from 134,349 to 134,437 bp (Table 1). They possess a pair of IRs regions (42,042-42,044 bp), one pair of IRs regions (42,042-42,044 bp), one LSC region (79,726-79,798 bp) and one SSC region (12,581-12,600 bp). The guanine-cytosine (GC) content of plastomes was approximately 38.3% and the IR region was slightly higher (44%) compared to LSC (36.3%) and SSC (32.1%) regions. For analysis of the IR junction (contraction/expansion), we compared the border between LSC/IRb/SSC/IRa of all 32 E. tef accessions, and also observed highly conserved IR junction sites (Fig. S1).

Simple sequence repeats
A total of 143 cpSSRs loci were identified in the plastome of E. tef accession (Table S2). The number of detected cpSSRs and their distributions are similar among compered accessions. These cpSSRs were mainly sited in the LSC region (78%), whereas 13% and 9% were localized in SSC and IR, respectively. The majority of cpSSRs were found in intergeneric space regions of the genome (73%) and the other 19% were located in the twelve PCGs (rpoB, rpoC1, atpF, rps14, ndhK, ycf4, petA, petL, psaJ, psbB, rpl16, ndhF ,  Table S2). The remaining 9% was located in the intron region. Among the cpSSR categories, mononucleotide cpSSRs are quite plentiful in the genome (94%), followed by dinucleotide cpSSRs (5%) and tetranucleotide cpSSRs (1%). No of tri-, penta-and hexa-repeat types were detected in the E. tef. The most common of a repeat mononucleotide was A/T (90%)  (Table S3). Three plastomes were chosen from the subfamily Chloridoideae and their cpSSRs repeat number was compared with the E. tef. A total of 142, 141 and 118 cpSSRs were found in the M. abyssinica, E. minor and N. reynaudiana, respectively (Table S4). In addition to cpSSRs, large repeat sequences were analyzed using REPuter, and 44 repeats (Table S5), which include 28 forward (F), 15 palindromic (P) and one reverse (R) repeats, were found. There were no complement repeats in the E. tef. The repeat sequence that ranged between 30 to 40 bp were the most common (27 repeat loci). The majority (55%) of these repeats were located in the noncoding region of the plastome.

Phylogenetic analysis
The phylogenetic relationship of 32 E. tef accessions was established using the complete plastome sequences and variable loci identified in the present study (Tables S3, S6). Both ML and BI gave identical tree topologies and clusters (Fig. 3). In the phylogenetic analysis, all E. tef accessions were divided into two clusters: one formed from accessions of south Ethiopia (Fig. 3) and others included the accessions from central and northern regions. Similarly, the phylogenetic tree inferred from twenty variable sites did show unambiguous biogeographic patterns in the accessions from the south (T3, T15, T24, T43, and T51) (Fig. 3). However, the phylogenetic relationships derived from both datasets did not provide clear biogeographic patterns.

Selection analyses
We examined the pattern of nonsynonymous to synonymous substitution ratio (ka/ks) among four mutational PCGs (atpE, psbB, ndhB and petB) of E. tef accessions. The highest average ks pairwise value was found in atpE (0.010) ( Table 2). The ka/ks ratio for PCGs showed zero values for all analyzed accessions.

DISCUSSION Plastome variations in E. tef
In this study, we conducted whole plastome comparison and determined the site of mutational changes in E. tef. The intra-specific comparison among 32 E. tef accessions revealed similar genome structure and no IR region expansion or contraction has occurred within the accessions. The result suggests that the E. tef plastome sequence is highly conserved (Figs. S1; S2). This finding was similar to other studies showing low intraspecific genetic variation (Jiang, Hinsinger & Strijk, 2016;Jeon & Kim, 2019). Although the plastomes composition and structures of 32 E. tef accessions are highly conserved, we also identified several mutational regions containing variable loci, which could provide potential information for the development of molecular marker and evolutionary studies. In our study, 143 cpSSRs identified in the E. tef, including thirteen polymorphic cpSSRs. The distributions of cpSSRs are non-random and a similar number of repeats among E. tef accessions. The number of cpSSRs detected in the E. tef was also relatively similar to other species in the subfamily Chloridoideae including E. minor, M. abyssinica and N. reynaudiana. A comparison of cpSSRs revealed a relative conservatism in repeat numbers and consistent with other reports (Wheeler et al., 2014;Jiang, Hinsinger & Strijk, 2016).
Most of the cpSSR in E. tef is distributed in the noncoding region of the genome, which is consistent with other studies Abdullah et al., 2019). Chloroplast derived microsatellite markers were developed and utilized in various studies such as assessment of the maternal line of hybrid wheat (Tomar et al., 2014), genetic diversity and relationships analysis among potato accessions (Lee et al., 2019) and species differentiation (Decesare, Hodkinson & Barth, 2010). Our study provides cpSSRs data that could provide valuable molecular tools for the evolutionary studies of E. tef. Although plastomes are highly conserved, there are hotspots region with SNPs and indels mutations, commonly used as DNA barcoding (Kress et al., 2005;Fan et al., 2018). These variations are uniparentally inherited and thus analytically attractive to trace the evolutionary history of maternal lines in the crop breeding program (Keeling, 2010;Tomar et al., 2014). In the present study, intraspecific chloroplast polymorphic sites were detected within the E. tef accessions. The 21 variable sites (12 InDels and 9 SNPs) identified in the present study include: rps16 intron, trnM- atpE,psbB,ndhB,petB,rpl16 intron,. The identified variable sites have provided valuable insight into the intraspecific genetic diversity in E. tef and could provide a valuable genomic resource for plastid marker development. The noncoding regions of plastomes have higher sequence variation than PCGs (Choi, Chung & Park, 2016;Skuza et al., 2019) and are widely used in population genetics and phylogenetic studies. This because in the genome, the PCGs is highly conserved than the noncoding regions (Cao et al., 2018;Wu et al., 2018). Similarly, in the current study, 81% of the identified SNPs and InDels markers were sited in the noncoding region of the plastid genomes. In general, nucleotide substitutions less frequently occur in PCGs than noncoding regions of plastomes (Kim et al., 2015;Daniell et al., 2016).
The nonsynonymous (ka) and synonymous (ks) substitution ratio (ka/ks) are widely used as an estimator for adaptive evolution on PCGs (Erixon & Oxelman, 2008;Gao et al., 2018). The fact that the positive selection in PCGs of plastomes viewed as an important driving force of adaptive evolution (Johnson & Melis, 2004;Zhong et al., 2009;Hu et al., 2015). We analyzed ka/ks ratio of four mutational PCGs of E. tef accessions, which indicated that all four mutational PCGs were under purifying selection (ka/ks < 1).

Phylogenetic analysis
In previous studies, plastid markers have been used to determine the E. tef phylogenetic relationship (Espelund et al., 2000;Ingram, 2010). However, complete plastome and multi loci markers provide more detailed insight (Krawczyk et al., 2018;Wu et al., 2018). In this study, two datasets (complete plastome and twenty variable loci) were applied to determine whether the phylogenetic relationships of E. tef accessions reflected the biogeographic pattern. The phylogenetic tree has divided the accessions into two clusters with identical tree topologies. We found that phylogeny inferred from both datasets and analysis methods (BI and ML) have been able to delineate accessions from south Ethiopia (T3, T15, T24, T43, T51) with robust support (Fig. 3). Furthermore, patterns of mutations among accessions are consistent with all tree topologies. For example, several unique mutational sites were identified in accession from Eritrea (T16), which might be a reason for the relatively long branch length (Fig. 3). Overall, both datasets were able to provide the phylogenetic relationship with a more informative biogeographical pattern among the accessions from the south (Fig. 3) and also identify accession (T16) from Eritrea (Fig. 3). This indicated that the identified variable sites could be useful molecular markers in phylogenetic and biogeography studies. Phylogenetic relationships among Eragrostis have been investigated based on a small number of plastid loci (rps16, trnL-UAA, trnL-trnF ) (Espelund et al., 2000;Ingram, 2010), but these have failed to provide intra-specific variations and sufficient phylogenetic signal of E. tef.
Despite the existence of clusters with a clear biogeographical pattern, the phylogenetic analysis did not reveal a robust biogeographical structure. For example, accessions from the western and central parts of the country are not clustered with their respective geographic origin. Similar analyses conducted in the previous study using the nuclear genome also did not show unambiguous geographic distribution patterns (Fikre, Tesfaye & Assefa, 2019). The lack of clear spatial structure may be attributed to gene flow between adjacent populations and seed exchange among farmers (Assefa, Merker & Tefera, 2004). We also infer that the limited geographical representation of our studied accessions might be the reason to contribute the insufficient geographical information.

CONCLUSIONS
In this study, a comparison of 32 complete plastomes of E. tef accessions was performed and revealed a low level of sequence variability. Only 34 polymorphic sites (13 cpSSRs, 12 InDels and 9 SNPs) were identified in the plastome of these accessions. The noncoding regions of the genome exhibited higher variable sites than PCGs. The newly sequenced E. tef plastomes also provide an additional genomic resource for undertaking various studies in an economical crop.
Additionally, the phylogenetic tree provides an informative insight into the genetic relationship of sampled accessions with their biogeographic distribution. In the future, we would suggest expanded sampling of E. tef and its wild relatives need to be used for assessing the biogeography of this economically important crop. Genome-wide association study is also imperative to identify the genetic basis of agriculturally important traits in E. tef. Overall, in our study, the complete plastomes and detected variable sites could be a useful genomic resource for molecular breeding, identification, population genetics, and biogeography studies of E. tef and related crop species in the Chloridoideae.