Complete chloroplast genome of Lens lamottei reveals intraspecies variation among with Lens culinaris

Lens lamottei is a member of the Fabaceae family and the second gene pool of the genus Lens. The environmental factors that drove the divergence among wild and cultivated species have been studied extensively. Recent research has focused on genomic signatures associated with various phenotypes with the acceleration of next-generation techniques in molecular profiling. Therefore, in this study, we provide the complete sequence of the chloroplast genome sequence in the wild Lens species L. lamottei with a deep coverage of 713 × next-generation sequencing (NGS) data for the first time. Compared to the cultivated species, Lens culinaris, we identified synonymous, and nonsynonymous changes in the protein-coding regions of the genes ndhB, ndhF, ndhH, petA, rpoA, rpoC2, rps3, and ycf2 in L. lamottei. Phylogenetic analysis of chloroplast genomes of various plants under Leguminosae revealed that L. lamottei and L. culinaris are closest to one another than to other species. The complete chloroplast genome of L. lamottei also allowed us to reanalyze previously published transcriptomic data, which showed high levels of gene expression for ATP-synthase, rubisco, and photosystem genes. Overall, this study provides a deeper insight into the diversity of Lens species and the agricultural importance of these plants through their chloroplast genomes.


Results
Genomic features and organization of L. lamottei cp DNA.The BGISEQ-500 platform was used to sequence the cp genome of L. lamottei.The de novo cp genome of L. lamottei was assembled using the GetOrganelle pipeline via two different assembly paths.To examine genome rearrangement, the Mauve aligner was used to align L. culinaris and L. lamottei's cp genome.The alignment calculated the locally collinear blocks (LCB), which represent highly similar, or conserved regions.When these assemblies were compared, certain regions of assembly path 1 were detected to be inverted without any missing parts in reference to L. culinaris (Fig. 1a).On the other hand, assembly path 2 was performed by carrying the swapped blocks on the same strand.After manually reorienting the major collinear blocks for both of these assembly options, we found that the L. lamottei genome has the same structures and gene content as the cp genome of L. culinaris (Fig. 1b).
As a result of our de novo genome construction analysis, the cp genome of L. lamottei was assembled with a size of 122,855 bp and includes the LSC, SSC, and one IR region.Therefore, L. lamottei can be classified under the IRLC.Cross-species plastome comparison was done to identify genic regions, including protein-coding, tRNA, and rRNA loci.Figure 2 shows the structural organization of both Lens cp genomes with the same orientation of the genic regions.The L. lamottei and L. culinaris cp genomes were similar in genic content and structural organization.However, due to ambiguities in the annotation approaches, a few genes detected in L. lamottei were not found in L. culinaris.These genes are pafII(ycf4), rps18, and rpl22.Another difference is that the pafI gene was annotated as a protein-coding gene in L. lamottei but was defined as a pseudogene in the annotation of L. culinaris.This is because the annotation of the L. culinaris genome was performed using a different tool (DOGMA).When the cp genome of L. culinaris was reannotated using the GeSeq tool, the two cp genomes contained the same set of genes.L. lamottei and L. culinaris have 108 and 105 genes, respectively, that make up the cp genome, including 77, and 73 protein-coding genes, respectively.They have the same number of tRNA (27) and rRNA ( 4) genes (Supplementary Table S1).These genes have various functions in the cp, including photosynthesis, transcription, and translation (Supplementary Table S2).

Comparative genome analysis.
To study the degree of sequence divergence between the cp genomes of L. lamottei and L. culinaris mVISTA was used to visualize the overall sequence identities of the two cp genomes (Fig. 3).Comparative analyses showed a high sequence similarity with no discernible differences between the wild and cultivated species.We observed similar gene orders and organizations between the two species and highly conserved protein-coding genes with limited substitutions.
The hotspot regions in the cp genomes of L. lamottei and L. culinaris were determined by comparing the nucleotide variation in the coding and noncoding regions.Compared to protein-coding regions, noncoding regions showed higher levels of divergence.Genomic hotspots were detected.High nucleotide variation was found in the intergenic region between the trnQ-accD genes with a dissimilarity of more than 1.0%.We identified a genomic hotspot in an intergenic region between 50,800 and 52,000 bp (Fig. 4).
Simple sequence repeats analysis.Repetitive region were summarized for both lens species in Fig. 5.
SSRs, commonly referred to as microsatellite repeats, are shorter tandem repeats ranging from 1 to 6 bp in length and are found throughout the cp genome 31 .A total of 67 and 68 SSRs, respectively, were found in L. lamottei and L. culinaris with MISA (MIcroSAtellite Identification Tool).Of these, there were 49 and 47 mononucleotides, 6 dinucleotides in both, 2 and 4 trinucleotides, 6 and 5 tetranucleotides, and 4, and 5 pentanucleotides, respectively (Fig. 5a).One hexanucleotide repeat motif was detected in L. culinaris but not in L. lamottei.The most common type of SSR was the A/T mononucleotide repeat motif (Fig. 5b).There were differences between the two species in terms of trinucleotide, pentanucleotide, and hexanucleotide motifs.ATC/ATG, AAAAT/ATTTT, and AGA TAT /ATA TCT 2,1,1 repeat motifs were detected in L. culinaris, respectively.However, these motifs were not found in L. lamottei (Fig. 5b).Repetitions can be classified under four types: forward, reverse, palindromic, and complementary.Using REPuter, we found 50 forward repeats in both L. lamottei and L. culinaris cp genomes.There were 15 and 16 palindromic repeats detected in L. lamottei and L. culinaris, respectively The darker gray on the inside refers to GC content, while the lighter gray on the outside corresponds to AT content.Genes are color-coded according to their function.
(Fig. 5c).One reverse repeat was found in L. lamottei but none in L. culinaris.No complementary repeats were found in either species (Fig. 5c).In addition, 31 tandem repeats were found in L. lamottei and 36 in L. culinaris.Using L. culinaris (NC_027152) as a reference sequence, synonymous and non synonymous variations were compared and evaluated to further investigate the divergence of nucleotides.Variations in protein-coding and non-coding regions in the cp genome were examined.Synonymous and nonsynonymous nucleotide changes in protein-coding regions were detected in the L. lamottei genome (Table 1).Out of 73 protein-coding genes, at least one variation was detected in 18 protein-coding genes.Of these, 8 protein-coding genes (ndhB, ndhF, ndhH, petA, rpoA, rpoC2, rps3, and ycf2) were found to have nonsynonymous changes, i.e., variation resulting in an amino acid replacement.In particular, more variations (L429F, M1290R, H1324D, F1436L, and I1751M) were identified in the ycf2 gene region compared to other genes.In addition, the only synonymous changes occurred in the other 10 genes in the cp genome.In addition, 32 noncoding regions were found (Supplementary Table S3).These genes consist of rRNA and tRNA genes.Only two single nucleotide variations were detected in the rrn23 gene.
RNA editing is a post-transcriptional modification that results in base conversion.The PREP-Cp (Predictive RNA Editor for plants) program was used to identify putative RNA editing sites in L. lamottei.We found 41 predicted RNA editing sites in 17 genes (Supplementary Table S4).The most frequent editing site predicted the change of serine to leucine (%29.3).From nucleotide base C to T, there were all replacement sites.The ndhB (7)  gene has the highest number of editing sites.At the second codon position, the editing event was most prevalent.In addition, RNA editing sites of L. culinaris which were cultivated species of L. lamottei were also predicted (Supplementary Table S5), with 41 RNA editing sites in 17 genes and the ndhB (7) gene with the highest abundance of sites.The most frequent editing site in L. culinaris was also the change of serine to leucine (29.3%).tiple codons that encode one amino acid, and synonymous codons typically exhibit varied usage preferences as a result of selection pressure during plant evolution.All 20 amino acid and stop codons in the protein-coding genes of the L. lamottei's cp genome were evaluated for codon content and relative synonymous codon usage (RSCU) values (Supplementary Fig. S1; Supplementary Table S6).A total of 19,509 codons was detected in the cp genome of L. lamottei.Looking at the distribution of these codons in amino acids, leucine was the most abundant, while cysteine was the least abundant.Among those amino acids prevalent ATT, which encodes isoleucine (Ile), was the most abundant codon in L. lamottei, with 910 TGC, which encodes cysteine (Cys), was the least frequently used codon with 48.Most amino acids show codon biases, while methionine (ATG) and tryptophan (TGG) were expressed by only one codon and have no codon bias.TTA (leucine) had the highest RSCU value, whereas CTC (leucine) had the lowest.Prediction of gene expression based on codon usage requires defining a set of reference genes, which includes all genes encoding ribosomal proteins.Higher MELP (MILC-based Expression Level Predictor) values indicate higher gene expression levels and codon usage bias.The log 10 (FPKM) (Fragments per Kilobase Million) value is expected to increase with the MELP value.Genes with MELP values higher than 1.0 (Fig. 6a) were psbN, psbI, psbE, psbK, psaI, and ndhC, which play roles in photosynthesis, and accD.
Using RNA-seq data from L. lamottei, we analyzed the expression of the 76 genes that code for proteins in the cp 32 .Reads were mapped to the L. lamottei cp genome we assembled in this study.Taking gene lengths into account, the numbers of reads corresponding to coding genes were computed and normalized.Gene expression values in FPKM were calculated for 76 protein-coding genes (Supplementary Table S7).We found that genes encoding ATP synthase showed the highest expression among all genes (Fig. 6b).This was followed by rubisco and photosystem genes.

Phylogenetic analysis.
In this study, 14 species were included in the phylogenetic analysis to determine the relationships of L. lamottei with the other members of the Papilionoideae subfamily (Fig. 7).Arabidopsis thaliana was selected as an outgroup.Five species were chosen from Papilionoideae, which belongs to IRLC (L.culinaris, L. ervoides, C. ariethinum, Cicer echinospermum, Cicer bijigum), and two species from Papilionoideae (Cajanus cajan, Glycine max), two species from the Detarionoideae subfamily (Intsia bijuga, Crudia hamsiana), and two species from Cercidoideae subfamily (Cercis glabra, Tylosema esculentum).Phylogenetic analysis of the 13 cp genomes was performed to provide a better resolution with 1000 bootstraps.In the phylogenetic tree, L. lamottei formed a sister clade with L. culinaris, indicating that these two species are monophyletic and closely related.

Discussion
Lentils are richer in protein, carbohydrate, and dietary fiber content compared to other legumes.Therefore, they are consumed at higher rates worldwide due to their importance for human health [1][2][3] .The cp is a highly conserved organelle, both structurally and genetically, and is involved in many biological functions, particularly in photosynthesis 16,17 .Therefore, the cp genomes of various have been widely sequenced with the development of technologies 22,33 .In comparison to the nuclear genome, the cp genome is inherited maternally and Table 1.Synonymous and nonsynonymous variations in protein-coding genes of L. lamottei cp genome.

Number of synonymous variations
Nucleotide changes can conveniently be used for phylogenetic investigations due to its short size and conserved structure 34 .We presented the first assembly and annotation of the cp genome of L. lamottei and compared it to the cultivated lentil species L. culinaris.By examining the genome rearrangements, we found highly similar regions between L. lamottei and L. culinaris.Two different haplotypes in L. lamottei have emerged.Two structural haplotypes of cp genomes that differ in the direction of single-copy regions have been identified in previously published studies 35 .Flip-flop recombination is a plausible theory to explain the existence of structural heteroplasmy and has been reported previously 35,36 .Gene loss, changes in the intergenic region, and expansion, or contraction of the IR region are all variables that affect genome size 37 .The cp genome of L. lamottei (122.8 kb) of the Fabaceae family has a similar size to that of the close wild relative L. ervoides 20 , the cultivated form L. culinaris, and other Fabaceae species such as O. arctobia 38 .The configuration of cp genomes in the species of this family is similar, with sizes ranging from 107 to 218 kb 39 .The chloroplast genome of L. lamottei has an LSC, SSC region and a single copy IR region.In other plant cp genomes, a pair of IRs are separated by one LSC and one SSC 22 .Some species of the Fabaceae family, including L. culinaris contain a single copy of the IR region and belong to the IRLC 26,38 .IRs stabilize the plastome structure and affect its size 40 .However, expansion or contraction, gene loss, and genome rearrangement occurred in the IR regions of some species 41 .Therefore, the cp genomes of L. lamottei and L. culinaris are highly conserved and similar in genome organization.
The cp gene content was also highly conserved between L. lamottei and L. culinaris, except for some genes.Some genes annotated in L. lamottei could not be identified in the annotation of L. culinaris.This could be due to the limitations of DOGMA, the annotation tool used for the reference genome L. culinaris.When the cp genome was reannotated using the GeSeq tool, the genes on the cp genome matched with each other.The L. www.nature.com/scientificreports/lamottei cp genome contains 107 genes, but some genes such as rps16 and infA were not identified.Many IRLC species plastomes contain regions with notable changes and rearrangements, such as the loss of introns from the rps12 and clpP genes and the lack of the rps16 gene 42 .The absence of these genes has been reported in other species 20,23 .Introns serve a critical function in gene expression regulation 43 .L. lamottei contains 17 genes that have an intron.The absence of one of the clpP1 introns in L. lamottei is consistent with the loss of the clpP1 intron in the genomes of highly papilionoid IRLC members 20,44 .
The sliding window and mVISTA analysis demonstrated high sequence similarity between L. lamottei and L. culinaris, implying a highly conserved evolutionary model.The intergenic region between the trnQ and accD genes has several nucleotide variations in the cp genome of L. lamottei.Furthermore, the highly divergent areas were mostly found in the noncoding rather than coding regions, as previously shown in the Fabaceae family 20,45 .These hypervariable intergenic regions can serve as candidate regions for creating genetic markers 46 .These hotspots as markers can be used in the discrimination of species in the Lens genus for phylogenetic and identification studies.
Repetitive regions in the genome are essential for genome rearrangements 47 .SSRs can also serve as useful molecular markers for studying genetic diversity and evolutionary relationships in L. lamottei and similar species.Nucleotide and tandem repeats in cp genomes are associated with gene duplication, rearrangement, and expansion and can be particularly helpful markers for classifying populations or species 46,48 .SSRs are another type of molecular marker that consists of 1-6 nucleotide repeat units and are commonly employed in population genetics 31 .For both species, mononucleotide SSR is the most prominent motif.Although the numbers of repeat motifs are similar, they differ from each other in some motifs.While L. culinaris has one hexanucleotide repeat motif, L. lamottei has none.Of these repeat motifs, mononucleotide repeats are the most common SSR motif discovered in the cp genome 49 .Among these mononucleotides, the most common SSR motif found in the L. lamottei cp genome is the A/T motif.The common consensus is that plastid SSRs are mostly composed of A/T repeats, with G/C repeats considered as rare 37 .Similar results were found in the wild chickpea species belonging to Fabaceae family and in the L. ervoides cp genome 20,50,51 .SSRs have been identified in various organelle genomes and have great application potential for studies on the molecular evolution of plant population genetics and crop breeding 52 .
Understanding the mechanisms underlying molecular evolution critically depends on the estimation of synonymous and nonsynonymous nucleotide mutations 53 .Eight protein-coding genes (ndhB, ndhF, ndhH, petA, rpoA, rpoC2, rps3, and ycf2) in the cp genome of L. lamottei had nonsynonymous changes.Nonsynonymous mutations are those in which base mutation results in an alteration of the amino acid in the encoded protein 21 .Ycf1 and matK genes have been determined as molecular markers due to the detection of nonsynonymous substitutions in these genes in other studies 53,54 .However, in this study, while no substitution occurred in matK and ycf1 genes, several mutations were detected for the ycf2 gene (in the position of L429F, M1290R, H1324D, F1436L, I1751M).Numerous nonsynonymous changes were detected in the ycf2 gene for L. ervoides 20 , a close relative of L. lamottei.Among noncoding genes, only two single nucleotide substitutions were found in the rrn23 gene.Therefore, markers that can be developed using the ycf2 and rrn23 genes may shed light on future studies.
RNA editing is a fundamental post-transcriptional process that commonly occurs in the cp protein-coding regions to restore the evolutionarily conserved amino acid sequence 24,55 .The ndhB gene, which had seven RNA editing sites, contained the highest number of editing sites.The NDH protein complex serves an important function in photosynthesis.RNA editing may enhance photosynthetic efficiency and exhibit positive selection during evolution 41 .In 29.3% of the editing sites, serine aminoacids were converted to leucine.Many RNA editing sites can convert the encoded amino acid from polar to nonpolar in both species.In previous studies, unlike hydrophilic mutations, which can affect the secondary structure and function of proteins and improve their genetic information, hydrophobic mutations produce a more stable structure in proteins 24,55,66 .Therefore, the presence of RNA editing sites in the cp genome may elucidate evolutionary mechanisms.
Codon usage significantly influences the evolution of the cp genome and can also be used to determine gene functions, gene expression, and levels of mRNA and proteins 21,56 .The L. lamottei cp genome has 19,509 codons, with cysteine having the lowest codon usage and leucine having the highest.In Coleanthus subtilis, which also belongs to the legume family, leucine amino acid has the highest codon usage, while cysteine amino acid has the lowest codon usage 56 .In L. lamottei, one codon each encodes the amino acids methionine and tryptophan.On the other hand, the remaining amino acids are encoded by two to six codons.Similar codon usage results are observed in wild lentil and chickpea species 20,50,51 .Codon usage could be helpful for phylogenetic association research as codon use can influence the manner of gene mutation 31 .
For cp growth and photosynthesis, appropriate expression of cp genes is essential 57 .Therefore, the expression levels of cp genes in L. lamottei were investigated in this study.MELP values of protein-coding genes were obtained using ribosomal genes as reference.In particular, the MELP values of genes involved in photosynthesis were high.In addition, FPKM values were calculated using RNA-seq data to examine the gene expression of cp genes.We found high gene expression levels, especially for the ATP-synthase, rubisco, and photosynthesis-related genes were classified according to their biological functions.Plants synthesize ATP during electron and proton transport.These processes include photosystem I, photosystem II, cytochrome b6/f complex, and ATP synthase 58 .Also, the rbcL gene is another gene that is related to the photosystem 59 .Therefore, the expression levels of these genes are expected to be higher.We found that the expression levels of these genes were not positively correlated with the MELP value, which could be due to the highly increased expression of ATP synthase.
To assess the phylogenetic relationships of L. lamottei, the complete cp genomes of 14 plant species were used to construct a phylogenetic tree using the Jukes-Cantor model.We used A. thaliana as an outgroup to confirm the phylogeny study.For the phylogenetic analysis, the representative species were selected from three main subfamilies of the Fabaceae family: the Papilionoideae, Detarionoideae, and Cercidoideae subfamilies.In this study, results show that L. lamottei was closely related to L. culinaris and L. ervoides.In a previously published www.nature.com/scientificreports/Gene expression analysis.Lens lamottei RNA sequencing data were retrieved from NCBI SRA using accession number PRJNA625627.The reads were mapped using STAR 82 to the assembled cp genome of L. lamottei.The featureCounts 83 program was used to quantify gene expression.FPKM 84 normalization of expression values for protein-coding genes were calculated.
Phylogenetic analysis.Following multiple sequence alignment using MAFFT 85 , MEGA 11 86 was used for the phylogenetic analysis of the aligned sequences.The Jukes-Cantor 87 model was used to create a maximum likelihood tree for phylogenetic analysis of whole cp genomes of different species.A. thaliana was used as an outgroup in the construction of a phylogenetic tree based on the complete cpDNA sequences of 13 species from the Leguminosae subfamilies Papilionoideae (3), IRLC (6), Detarioideae (2), and Cercidoideae (2).Construction and calculation of the phylogenetic tree were performed with 1000 bootstraps.GenBank data of the species used were obtained from the NCBI database.The species that were used were L. culinaris (NC_027152.

Figure 1 .
Figure 1.Gene arrangement analysis using Locally Colinear Blocks (LCBs).MAUVE software was used to compare the gene order of L. lamottei with the genome of L. culinaris cp as a reference.(a) Comparison of two possible haplotypes of L. lamottei without manual correction.(b) Comparison of two possible haplotypes of L. lamottei after manual correction.

Figure 2 .
photosystem II photosystem I

Figure 3 .Figure 4 .Figure 5 .
Figure3.Visualization of alignment of the L. lamottei cp genome using L. culinaris as a reference sequence using mVISTA.The vertical axis shows the percentage of identity, which goes from 50 to 100%.In the reference plastome, each arrow represents annotated genes and the direction of transcription.Exons are purple, the untranslated region (UTR) is blue, and conserved non-coding sequences (CNS) are pink in the plastome coding areas.A decrease in purple/pink shadowing indicates a decrease in sequence identity (white spaces).

Figure 6 .
Figure 6.(a) The correlation between each protein-coding gene's MELP score and the percentage of codons used that are preferred (RSCU > 1.0).Green indicates ribosomal genes, while red indicates other genes.Gene names are only assigned to non-ribosomal genes with MELP > 1.0.The linear regression is shown by the blue line.(b) The value of mean FPKM is used to represent cp gene expression.The X-axis shows the gene in which category they are involved.The Y-axis shows the mean FPKM values of biological function.