Chloroplast genome sequence of Chongming lima bean (Phaseolus lunatus L.) and comparative analyses with other legume chloroplast genomes

Lima bean (Phaseolus lunatus L.) is a member of subfamily Phaseolinae belonging to the family Leguminosae and an important source of plant proteins for the human diet. As we all know, lima beans have important economic value and great diversity. However, our knowledge of the chloroplast genome level of lima beans is limited. The chloroplast genome of lima bean was obtained by Illumina sequencing technology for the first time. The Cp genome with a length of 150,902 bp, including a pair of inverted repeats (IRA and IRB 26543 bp each), a large single-copy (LSC 80218 bp) and a small single-copy region (SSC 17598 bp). In total, 124 unique genes including 82 protein-coding genes, 34 tRNA genes, and 8 rRNA genes were identified in the P. lunatus Cp genome. A total of 61 long repeats and 290 SSRs were detected in the lima bean Cp genome. It has a typical 50 kb inversion of the Leguminosae family and an 70 kb inversion to subtribe Phaseolinae. rpl16, accD, petB, rsp16, clpP, ndhA, ndhF and ycf1 genes in coding regions was found significant variation, the intergenic regions of trnk-rbcL, rbcL-atpB, ndhJ-rps4, psbD-rpoB, atpI-atpA, atpA-accD, accD-psbJ, psbE-psbB, rsp11-rsp19, ndhF-ccsA was found in a high degree of divergence. A phylogenetic analysis showed that P. lunatus appears to be more closely related to P. vulgaris, V.unguiculata and V. radiata. The characteristics of the lima bean Cp genome was identified for the first time, these results will provide useful insights for species identification, evolutionary studies and molecular biology research.

distributed all over the world, Chongming lima bean, an important characteristic vegetable variety in the Chongming area, has been grown on Chongming Island for more than 100 years [6].
Chloroplasts, a place for plant photosynthesis, starch, fatty acids and amino acids biosynthesis, play an important role in the transfer and expression of genetic material [7]. Chloroplast has its own genome, chloroplast genome of most plants are mostly double-stranded circular, but a few species have linear forms with multiple copies. The genome size usually ranges from120 to 170 kb and includes 120-130 genes [8]. It has a typical quarter structure, which composed of a large single-copy region, a small single-copy region and a pair of large inverted repeats [9][10][11]. The Cp genome is highly conserved, the differences between different plant species are mainly caused by the IR region's contraction and expansion [12,13]. With the development of highthroughput sequencing technologies, there were more than 2400 plant Cp genomes have been published in the NCBI database [14]. Leguminosae, with nearly 770 genera and more than 19,500 species, is the third largest family of angiosperms [15]. Within the Leguminosae family, there were more than 44 species Cp genomes have been published including C. arietinum [8], G. gracilis [16], L. japonica [17], C. tetragonoloba [18], G. max [19], V. radiate [20], and P. vulgaris [21]. Leguminosae has experienced a great number of plastid genomic rearrangements [22], including loss of one copy of the IR [23,24], inversion of 50 kb and 70 kb [17,21,25], transfer of infA, rpl22 and accD genes to the nucleus [26][27][28] and loss of the rps12 and clpP introns [8,26].
Chloroplast DNA has been extensively used to taxonomy, phylogenetics and evolution of plants, due to its low substitution rates of nucleotide and relatively conserved structural variation of genomic [29][30][31]. Phylogenetic analyses of Leguminosae were mainly based on gene fragments in chloroplast DNA like trnL, rbcL and matK [32][33][34]. Based on the chloroplast matk gene and combining the characteristics of morphology, chemistry and chromosome number, a new classification system of six subfamilies was proposed, and the most complete leguminous phylogeny tree was constructed so far [15]. However, the classification and phylogenetic relationships of the main branches within the subfamilies are still unclear. Chloroplast phylogenetic genome has been successfully used to analyze the phylogenetic relationship of many difficult groups, and it also provided a better system framework for studying the structural characteristics, variation and evolution of plants [35,36]. Due to the limited chloroplast genomes of legumes that have been sequenced, phylogenetic chloroplast phylogeny has not been applied to classification of the Leguminosae.
Currently, there are no published studies of the Cp genome of lima bean. In this study, we applied a combination of de novo and reference-guides to assemble complete Cp genome sequence of P. lunatus. Here, we not only described the whole Cp genome sequence of P. lunatus and the characteristics of long repeats and SSRs, but also compared and analysed the Cp genome with other members of Leguminosae. It is expected that the results will help us to understand of the Cp genome of lima bean and provide markers for phylogenetic and genetic studies.

Results
Characteristics of the P. lunatus L. Cp genome The Cp genome of lima bean was 150,902 bp in size with a typical quadripartite structure, containing a pair of inverted repeats (IRs; 26,543 bp), a large single copy (LSC; 80,218 bp) and a small single copy (SSC; 17,598 bp) (Fig. 1). The GC content in lima bean was 35.44%, the GC content of LSC, SSC and IR regions was 32.92, 28.61 and 41.52% respectively (Table 1), IR regions was higher than the LSC and SSC regions. Species of Leguminous: G. max, P. vulgaris, V. unguiculata, G. sojasieb, V. faba and P. sativum were selected to Compare with lima bean (Table 2). Although the sizes of the overall genome had differences, the GC content was similar in each region (LSC, SSC and IR) of different species. There is a litter difference in total genes, CDS and tRNAs among the seven species. C. cajan has most genes, CDS and tRNAs and V. radiata has least.
There were 129 genes found in the P. lunatus Cp genome, containing 82 protein-coding genes, 37 tRNA genes, 8 rRNA genes and 2 pseudogenes (Tables 2 and 3). There are 79 genes (56 protein-coding and 23 tRNAs) located in LSC region and 13 genes (12 CDS and 1tRNA) in SSC region. Among them, 35 genes (13 CDS, 14 tRNAs and 8 rRNAs genes) were duplicated in the IR regions ( Fig. 1; Table S1). Codon usage frequency of the P. lunatus Cp genome was estimated and summarized (Table S2). Totally, all the genes are encoded by 25,873 codons, in these codons, the most frequent amino acids are leucine (2719, 10.51%) and the least are cysteine (300, 1.16%). The most preferred synonymous codons end with A and U.
Overall, 22 intron-containing genes (14 protein-coding genes and 8 tRNA genes) were found (Table 4). Among them, 20 genes have one intron, ycf3 and clpP have two introns. trnL-UAA and trnK-UUU have the the smallest intron (467 bp) and largest intron (2562 bp), respectively. In the P. lunatus Cp genome, rps16 and rpl133 gene was found to be present as a pseudogene.
A comparison of the boundaries of the lima bean Cp genome was performed among the other six Leguminosae species: P. vulgaris, V. radiata, V. unguiculata, C. cajan, G.max, and G. soja (Fig. 6). At the LSC/IR junction of lima bean, the rps19 and trnN genes are duplicated at the IR/SSC junction completely and included in the IR region. a partial ycf1 gene is included at the IRa/SSC junction. Compared to other species in the genus, the range of each region showed substantial differences. The rps19 gene in the P. lunatus, P. vulgaris, V. radiate Cp genomes was shifted by 564 bp from IR to LSC at the LSC/IR border and 701 bp from IR to LSC in the V. unguiculata. However, in C. cajan, G. max and G. soja, the rps19 gene crossed the IRb/LSC region, with 46, 68 and 68 bp of rps19 gene within IRb, respectively. On the other hand, the ycf1 gene is located at the IRa/SSC border in all the compared legumes, but the junctions of IRa/SSC located in ycf1 within the SSC and IRa regions vary in length (P. lunatus: 4706 and 616 bp; P. vulgaris: 4775 and 505 bp; V. radiate: 4683 and 492 bp; V. unguiculata: 4683 and 492 bp; C. cajan: 13 and 473 bp; G.max: 11 and 478 bp; G. soja: 11 and 478 bp), while the ycf1 gene was only at the IRb/SSC border of P. vulgaris, C. cajan, G. max, and G. soja and the size varies among them.

Adaptive evaluation analysis
The Ka/Ks ratio were calculated by KaKs_Calculator among the Cp genome of eleven species of Leguminosae protein-coding genes. The results indicated that the Ka/ Ks ratio is < 1 in mostly except for rpl23 of V. faba vs P. lunatusis, ndhD of C. cajan, rps18 of M. truncatula vs P. Fig. 5 The comparison of four Phaseolinae species Cp genomes by using mVISTA. The grey arrows above the contrast indicate the direction of the gene translation. The y-axis represents the percent identity between 50 and 100%. Protein codes (exon), rRNAs, tRNAs and conserved noncoding sequences (CNSs) are shown in different colours lunatusis, ndhD of G. max vs P. lunatusis, accD/ ycf2/ ndhD of P. vulgaris vs P. lunatusis, ndhB/ rps15/ ndhB of C. arietinum vs P. lunatusis, petL/ ycf2/ ndhD of V radiata vs P. lunatusis, petL/ ycf2 of V. unguiculata vs P. lunatusis (Fig. 7). For each gene, the majority had a Ka/ Ks ratio < 0.5 for the ten comparison groups. At the same time, 13 of them had a Ka/Ks ratio between 0 and 0.1. In contrast, the Ka/Ks ratio of the ndhD gene was greater than 1 in four of the ten comparison groups, four of them had no this gene and another two exhibited low Ka/Ks ratios. Moreover, ycf2 also exhibited a Ka/Ks ratios > 1 in three of them and the ratio > 0.5 in the other species.

Phylogenetic analysis
To identified the phylogenetic position of lima bean in Leguminosae, we used the 44 protein sequences of 48 Leguminosae species to phylogenetic analyse (Fig. 8).
Maximum likelihood (MI) and Bayesian inference (BI) were used to construct phylogenetic tree with Arabidopsis thaliana as outgroup. The phylogenetic results resolved most nodes with bootstrap support values of 100. These 48 species belong to Caesalpinoideae, Cercidoideae, Detarioideae and Papilionoideae. The phylogenetic tree showed that P. lunatus and P. vulgaris are sister spisecies with a 100% bootstrap value and P. lunatusis more closely related to P. vulgaris, V. unguiculate and V. radiata. The phylogenetic trees are very helpful for us to understand the phylogenetic relationship among more Leguminosae species.

Discussion
In this study, the Cp genome of lima bean were sequenced and assembled, and this information was applied for their comparative analysis with other Leguminosae species. The size of genome, content of GC, the length of IR, LSC, and SSC regions and gene content exposed high similarity among the genomes, suggesting that leguminosae species shared low diversity [20,21,37,38]. The GC content is closely related to species affinity [39]. High GC content is conducive to the stability of the genome and maintaining the complexity of the sequence. The four rRNAs genes have high GC content, which results in a high GC content in the IR region [40]. The codon usage bias is related to translational efficiency, which biased towards rich tRNA. At the same time, those codons that bind more tightly than other homologous tRNAs [41]. In this study, all genes were encoded by 25,873 codons, in these codons the most frequent amino acids are leucine (2719, 10.51%) and the least are cysteine (300, 1.16%). The most preferred synonymous codons end with A and U. High AT abundance is the main cause why synonym codons end with A/U, which may be the result of natural selection and mutation [42,43]. Repeat sequences are significant important for genome rearrangements and variations, and repeat occurrence is more prevalent in IGSs than in genic sequences [44][45][46]. Furthermore, these repeats can be used to develop genetic markers for phylogeny and population studies [47]. We found 61 repeat sequences in P. lunatus Cp genome, and most of the repeats were distributed within the intergenic spacer regions, intron sequences, and ycf2 genes, which is highly homologous to the sequence in V.radiata [20].
Currently, chloroplast genome markers have more advantages than nuclear DNA markers in terms of evolution and taxonomic research due to their maternal inheritance in most plants and much lower mutation rate [48,49]. cpSSRs are often used to identify species and analyze genetic because they are relatively richness and have demonstrated high reproducibility and polymorphism. Two hundred ninety SSRs were found in the lima bean Cp genime. The number of SSR is similar to those in pigeonpea [37], but more than clusterbean [18]. Among 290 SSRs, most of them distributed in LSC (63.45%) and located in intergenic spacers (45.86%). The findings were similar to clusterbean [18] and pigeonpea [37].
Although the Cp genomes of angiosperms are wellconserved, inversion, rearrangement, novel DNA insertion and IR expansion contraction occur frequently [18,21,25]. Leguminosae is an excellent choice for studying the evolution of the Cp genome because legume plastid genomes have undergone multiple genomic rearrangements and the loss of genes or introns [50].. In our study, P. lunatus has a common 50 kb inversion in the LSC region, spanning from rbcL to rps16, which has been found in other legumes (Fig. 4) [18,20,21]. Due to the expansion and contraction of IRs, the Cp genomes of P. vulgaris, V. radiata and V. unguicalata have 70 kb inversion to subtribe Phaseolinae but are absent from other Cp genomes [51,52]. P. lunatus as a member of the subtribe Phaseolinae shows the same inversion. All the results shown in the gene order suggest that considerable rearrangements and diversification were occurred in the legume Cp genomes and a valuable resource for phylogenetic analysis is provided.
Crop evolution and genetic improvement progress mainly depends on the genetic diversity available in germplasm resources [53]. rpl16, accD, petB, rsp16, clpP, ndhA, ndhF and ycf1 genes in CDS was found significant variation and high sequence variations were found in intergenic regions as follows: trnk-rbcL, rbcL-atpB, ndhJ-rps4, psbD-rpoB, atpI-atpA, atpA-accD, accD-psbJ, psbE-psbB, rsp11-rsp19, ndhF-ccsA (Fig. 5). These regions were considered useful markers for elucidating phylogenetic relationships among Leguminosae species. The ycf1, accD and ndhF genes were also served as genetic markers for Quercus bawanglingensis [54]. The Cp DNA regions: trnL-trnF and atpB-rbcL were used to evaluate 262 accessions of P. lunatus to identify whether the MA gene pool of P. lunatus has a single centre or multiple centres [55]. trnL-trnF in noncoding Cp DNA regions also has been used to study Phylogeny and domestication [56][57][58]. polymorphisms of the Cp DNA is very useful to study the evolutionary of Lima bean and to pinpoint domestication places in several studies. Hence, more and more genome resources need to be developed for plants [59].
Studies have shown that IR regions in plant chloroplast genomes are more conserved than single copy and non-coding regions, and can stable the rest genome [38]. The size change of the angiosperm plastid genome is caused by the contraction and expansion of the IR region at the boundary [51,60]. The change of the IR/SC junction is a common phenomenon and plays an important role in evolution [54,61,62]. In the seven Leguminosae species, P. lunatus, P. vulgars, V. radiate and V. unguiculata showed similar characteristics, only some genes including rps8, rps19, trnN, ndhF, ycf1 and rps3 showed a little difference; C. cajan, G.max, and G. soja showed more differences than other four species (Fig. 6).
The complete trnH-rps19 cluster of P. lunatus, P. vulgars, V. radiata and V. unguiculata is present in IR regions, which is consistent with TYPE III [63]. Non-synonymous (Ka) and synonymous (Ks) substitutions and their ratios (Ka/Ks) have been used to assess the rate of gene divergence. The ratio of Ka/Ks < 1 represents purifying selection, while the ratio > 1 represents positive selection [64]. In most protein-coding genes, nucleotide substitutions of synonymous occur more frequently than non-synonymous [65]. In this study, the ratio is < 1 in most of the genes, indicating that they are under purifying selection in lima bean. However, the Ka/ Ks ratio of ndhD gene is > 1 in four of the ten comparison groups, ycf2 exhibited a ratios > 1 in three of them and the ratio > 0.5 in the other species. The ndhD and ycf2 undergo positive selection in lima bean, which may help to adapt to their living environment.
Genetic analysis of lima bean was performed using cytogenetic [66] and molecular data [55]. With the development of sequencing technologies, an increasing number of Cp genomes have been used for phylogenetic analysis [35,36]. The Cp genomes have been used for phylogenetic analyses in the genus Quercus, which provide strong support for the deep phylogenetic relationship between subfamily tribes [67].. In our study, the sequences of chloroplast genomes were used for phylogenetic analysis by ML and BI based on 48 Leguminosae species. P. vulgaris and P. lunatus are sister species, P. lunatus is more closely related to P. vulgaris, V. unguiculata and V. radiata. Consistent with the gene order results, they are all of subtribe Phaseolinae. The result is consistent with other phylogenies constructed by Cp genome containing representatives Phaseolinae genus [68][69][70].

Conclusions
In this study, the complete Cp genome of P.lunatus was first sequenced on IlluminaNextera XT platforms. The size of genome, structure and organization of gene were shown to be conservative, which is similar to those reported Cp genomes of Leguminosae species. Sixty-one repeats and 290 SSRs were present in P. lunatus. These results are very useful for developing barcoding molecular markers. In comparison with other legume species, the Cp genome of lima bean shares a similar gene order and IR region borders with P. vulgaris, V. unguiculata and V. radiata. Phylogenetic analysis of 48 Leguminosae species shows that P. lunatus are more closely related to P. vulgaris, V. unguiculata and V. radiata. These results provide important information for the complete Cp genome of P. lunatus, which might be useful for further studies of evolution and phylogenetic.

Sequencing and assembly of lima bean Cp genome
Fresh leaves were collected from lima bean plants grown on Huiyuan Vegetable Gardening Farm at Chongming Island [6]. Genomic DNA was extracted by CTAB method [71]. Then the DNA quality was tested (> 50 ng·μL − 1). The DNA was sequenced by the HiSeq™ X10 platform (Illumina, USA) at Nanjing. Bowtie2 v2.2.4 [72] was used to exclude non-chloroplast genome reads with paired-end alignments and a maximum of 3 mismatches(−v = 3), as the raw sequence reads always include non-cpDNA. The Cp genome was assembled by SPA-desv3.10.1 [73] and with the options of "-trusted-contigs" via manual correction using comparison with the reference species P. vulgaris (NCBI ACCESSION NC_ 009259.1). The Cp genome of lima bean was submitted into GenBank (SRA: SRR13319750, BioProject: PRJNA688003 Accession number: MW423611).

Genome annotation of the cp DNA sequences
The annotation of the Lima bean Cp genome was performed by blast v2.2.2 (parameter: -nproc 20, −bestn 5 [74]., and the final annotation result was correct manually. rRNAs and tRNAs were identified by hmmer v3.1b2 [75] and aragorn v1.2.38 [76], respectively. The entire genome was mapped by OGDRAW [77]. The synonymous codon usage, relative synonymous codon usage values (RSCU) and codon usage of the complete plastid genomes were analyzed using MEGA 6.0 PREP suit [78] with cut off values of 8.0 was used to predict the RNA editing sites in the plastome.

Comparative analysis of Cp genomes
MUMmer was used to pair sequence alignment of the chloroplast genome [81]. The chloroplast genome of P. lunatus (SRA: SRR13319750, BioProject: PRJNA688003) was compared with P. vulgaris (NC_009259), V. radiate (NC_013843) and V. unguiculata (NC_018051) in the Leguminosae tribe by mVISTA with the shuffle-LAGAN mode [82,83]. The annotation of P. lunatus was set as a reference.

Adaptive evaluation analysis
In order to analyze non-synonymous (Ka) and synonymous (Ks) substitution rates and Ka/Ks ratio, P. lunatus was compared with the ten other species in Leguminosae tribe: G. max, C. cajan, C. arietinum, V. radiate, P. vulgaris, G. soja, V. unguiculata, P. sativum, V. faba and M. truncatula. The ten sequences was separately aligned by MAFFT v7.427 [85], then the Ka and Ks substitution rates and Ka/Ks value was counted using the KaKs_calculator 2.0 [86] with the default model averaging (MA) method.

Phylogenetic analysis
The phylogenetic analysis was conducted for lima bean, another 47 Leguminosae species, and one outgroup Arabidopsis thaliana, all of which were down loaded from the NCBI except those of P. lunatus. The complete Cp genomes were aligned using MAFFT v7.427 [85]. RAxML v.8.2.10 [87] and MrBayes version 3.2.6 [88] was used to reconstruct the phylogenetic relationship with the maximum likelihood (ML) and Bayesian Inference (BI) methods.