Complete Chloroplast Genome Sequence of Omani Lime (Citrus aurantiifolia) and Comparative Analysis within the Rosids

The genus Citrus contains many economically important fruits that are grown worldwide for their high nutritional and medicinal value. Due to frequent hybridizations among species and cultivars, the exact number of natural species and the taxonomic relationships within this genus are unclear. To compare the differences between the Citrus chloroplast genomes and to develop useful genetic markers, we used a reference-assisted approach to assemble the complete chloroplast genome of Omani lime (C. aurantiifolia). The complete C. aurantiifolia chloroplast genome is 159,893 bp in length; the organization and gene content are similar to most of the rosids lineages characterized to date. Through comparison with the sweet orange (C. sinensis) chloroplast genome, we identified three intergenic regions and 94 simple sequence repeats (SSRs) that are potentially informative markers with resolution for interspecific relationships. These markers can be utilized to better understand the origin of cultivated Citrus. A comparison among 72 species belonging to 10 families of representative rosids lineages also provides new insights into their chloroplast genome evolution.


Introduction
Citrus is in the family of Rutaceae, which is one of the largest families in order Sapindales. Flowers and leaves of Citrus are usually strong scented, the extracts from which contain many useful flavonoids and other compounds that are effective insecticides, fungicides and medicinal agents [1][2][3]. Citrus is of great economic importance and contains many fruit crops such as oranges, grapefruit, lemons, limes, and tangerines. However, due to a long cultivation history, wide dispersion, somatic bud mutation, and sexual compatibility among Citrus species and related genera, the taxonomy of Citrus remains controversial [4,5] and the origination of many Citrus species and hybrids is still unresolved [6,7].
The chloroplast (cp) genome sequence contains useful information in plant systematics because of its maternal inheritance in most angiosperms [8,9] and its highly conserved structures for developing promising genetic markers. The only complete cp genome available in Citrus is sweet orange (Citrus sinensis) [10], which has provided valuable information to the position of Sapindales in rosids. Although a genome sequencing project is in progress for C. clementine, its complete chloroplast genome sequence is not available yet. To identify the cp genome regions that are polymorphic and may be used as molecular markers for resolving the evolutionary relationships among Citrus species, a second cp genome within the genus is necessary for comparative analysis. For this purpose, the major aim of this study is to determine the complete cp genome sequence of C. aurantiifolia.
C. aurantiifolia, which is commonly known as Key lime, Mexican lime, Omani lime, Indian lime, or acid lime, is native to Southeast Asia and widely cultivated in tropics and subtropics. Oman is known to be a transit country for lime, from which lime spread to Africa and the New World [11]. In Oman, Omani lime is considered the fourth most important fruit crop in terms of cultivated area and production. The products of Omani lime can be used for beverage, food additives and cosmetic industries [12]. Omani lime is sensitive to several biotic agents, the most serious of which is 'Candidatus Phytoplasma aurantifolia', the cause of witches' broom disease of lime (WBDL). Recent studies on WBDL focused on effect of genetic diversity of Omani limes on the disease [13], transcriptome and proteomic analysis of lime response to infection by phytoplasma [14][15][16] and effect of phytoplasma on seed germination, growth and metabolite content in lime [17,18].
Here, we present the complete chloroplast genome sequence of Omani lime (C. aurantiifolia). To identify loci of potential utility for the molecular identification and phylogenetic analyses of Citrus cultivars and species, we compared the intergenic regions and SSRs in the cp genomes of C. aurantiifolia and C. sinensis. Furthermore, we performed phylogenetic analyses to infer the history of gene losses in the cp genome evolution among representative rosids lineages.

Sample Preparation and Sequencing
The Omani lime leaves were collected from a 5-year-old lime tree at a private farm located in the Omani territory of Madha (GPS coordinates: 25.276318, 56.318909). This farm is owned by one of the co-authors of this work, Dr. Abdullah M. Al-Sadi, whom should be contacted for future permissions. This study does not involve endangered or protected species and does not require specific permission from regulatory authority concerned with protection of wildlife. The sample was stored in a cool box and transported to the Plant Pathology Research Laboratory at Sultan Qaboos University (Al Khoud, Oman) for DNA extraction following a protocol of Maixner et al. [19]. The leaves were washed with clear water before the isolation procedure. 1 g of leaves were used and crushed in 3 ml CTAB extraction buffer (2% CTAB, 1.4 M NaCl, 500 mM EDTA pH8, 1 M Tris-HCl pH8 and 0.2% beta-mercaptol). 1.5 ml of the leave extract was transferred to a 2 ml tube and incubated in a water-bath at 65uC for 15 min. The tube was turned up and down twice during incubation, centrifuged at 960 g for 5 min, and the supernatant was subsequently transferred to a clean eppendorf tube. An equal volume of chloroform-isoamyl alcohol mix (24:1) was added and the tube was centrifuged at 21000 g for 20 min. The supernatant was transferred to a new tube and then 0.6 volume of isopropanol was added to the supernatant and incubated at 220uC for 30 min. The DNA pellet was collected by centrifugation at 21000 g for 20 min and then washed with 1 ml of 70% ethanol. The final DNA was resuspended in 100 ml TE (Tris 10 mM, EDTA 1 mM pH8) and was stored at 280uC until used.
The library construction and sequencing were done at the Genome Analysis Centre (Norwich, UK). The Illumina TruSeq DNA Sample Preparation v2 Kit was used to prepare an indexed library. The DNA sample was sheared to a fragment size of 500-600 bp using a sonicator, followed by end-repair and the addition of a single A base for binding of the indexed adapter. The appropriate sized library (500 bp) was selected by gel electrophoresis, followed by PCR enrichment. The 251 bp paired-end sequencing run was performed on an Illumina MiSeq instrument using the SBS chemistry and Illumina software MCS v2.3.0.3 and RTA v1.18.42. The raw reads were deposited at the NCBI Sequence Read Archive under the accession number SRR1611615.

Genome Assembly and Analyses
The procedures for genome assembly and annotation were based on our previous studies of cp genomes [20,21]. In addition to the standard de novo assembly approach by using Velvet v1.2.10 [22] with the k-mer size set to 243, a reference-based approach for assembly as described below was used in parallel. All of the raw reads were initially mapped onto the published cp genome of C. sinensis [10] using BWA v0.6.2 [23]. The sequence variations were identified with SAMtools v0.1.19 [24] and visually inspected using IGV v2.3.25 [25]. The variants were corrected with the raw reads and the regions without sufficient coverage were converted into gaps. This corrected sequence was then used as the new draft reference for the next iteration of verification. Gaps were filled using the reads overhang at margins and the process was repeated until the reference was fully supported by all mapped raw reads. The final assembly, which was supported by our de novo and reference-based approaches, resulted in an average of 1,441-fold coverage of paired-end reads with a mapping quality of 60 and the region with the lowest coverage is 506-fold.
The preliminary annotations of the C. aurantiifolia cp genome were performed online using the automatic annotator DOGMA [26] and verified using BLASTN [27,28] searches (e-value cutoff = 1e-10) against other land plant cp genomes. Each annotated gene was manually compared with C. sinensis cp genome for start and stop codons or intron junctions to ensure accurate annotation. The codon usage was analyzed by using the seqinr R-cran package [29]. A circular map of genome was produced using OGDRAW [30].
To identify the differences between C. aurantiifolia and C. sinensis, the two sequences were aligned using Mauve v2.3.1 [31] and the result was analyzed using custom Perl scripts. Intergenic gene regions were parsed out from the two Citrus cp genomes and aligned using MUSCLE v3.8.31 [32] with the default settings. The pairwise distances were calculated using the DNADIST program in the PHYLIP package v3.695 [33].
The positions and types of simple sequence repeats (SSRs) in the two Citrus cp genomes were detected using MISA (http://pgrc. ipk-gatersleben.de/misa/). The minimum number of repeats were set to 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexanucleotides, respectively. For long repeats, the program REPuter [34] was used to identify the number and location of direct and inverted (i.e., palindromic) repeats. A minimum repeat size of 30 bp and sequence identity greater than 90% setting were used according to the study of C. sinensis cp genome [10]. The redundant or overlapping repeats were identified and filtered manually.

Phylogenetic Inference
Phylogenetic analysis of the representative rosids lineages with complete cp genomes available was performed using PhyML v20120412 [35] with the GTR+I+G model. A total of 72 rosids species were chosen as the ingroups and Vitis venifera was included as the outgroup, the accession numbers were provided in Table S1. The protein-coding and rRNA genes were parsed from the selected cp genomes and clustered into ortholog groups using OrthoMCL [36]. The presence/absence of orthologous genes in each genome was examined and further verified using TBLASTN [27,28] searches (e-value cutoff = 1e-10). The nucleotide sequences of the conserved genes were aligned individually by using MUSCLE with the default settings. The concatenated alignment was used to infer a maximum likelihood phylogeny as described above. The bootstrap supports were estimated from 1,000 resampled alignments generated by the SEQBOOT program in the PHYLIP package.

Investigations of orf56 and ycf68
To investigate the presence/absence of orf56 and ycf68 in the selected cp genomes, the gene sequences from C. aurantiifolia was used as the queries to perform BLASTN [27,28] searches (e-value cutoff = 1e-10). The significant hits were examined to investigate the presence of intact open reading frames (ORFs). Phylogenetic analysis of the cp orf56 genes and the homologous mitochondrial sequences was performed as described above. The final alignment contains 190 aligned nucleotide sites and a total of 70 sequences, including two sequences of Amborella as the outgroup.

General Features of the Omani Lime Chloroplast Genome
The complete cp genome of C. aurantiifolia (Christm.) Swingle (GenBank accession number KJ865401.1) is 159,893 bp in length, including a large single copy (LSC) region of 87,148 bp, a small single copy (SSC) region of 18,763 bp, and a pair of inverted repeats (IRa and IRb) of 26,991 bp each ( Figure 1 and Table 1). A total of 137 different genes, including 93 protein-coding genes, 30 tRNA genes, and four rRNA genes, were annotated (Table S2). Among these, 12 protein-coding genes and 7 tRNA genes are duplicated in the IR regions. Most of the protein-coding genes are Table 2. Differences between the C. aurantiifolia and C. sinensis cp genomes.  composed of a single exon, while 14 contain one intron and three contain two introns. The gene rps12 was predicted to undergo trans-splicing, with the 59 exon located in the LSC region and the other two exons located in the IR regions. The protein-coding regions contain a total of 27,159 codons (Table S3). Isoleucine and cysteine are the most and least frequent amino acids and have 2,892 (10.7%) and 359 (1.2%) codons, respectively. The codon usage is biased towards a high ratio of A/ T at the third position, which is also observed in many land plant cp genomes [37].

Sequence Comparisons with Sweet Orange
The general characteristics of the two Citrus cp genomes are summarized in Table 1, overall the compositions are quite similar. The GC content of these Citrus cp genomes is approximately 38.5%, which is slightly higher than the average of the 72 representative rosids lineages (36.7%). In these two Citrus cp genomes, the genic regions, introns, and intergenic regions account for ca. 58%, 11%, and 31%, respectively ( Table 1).
The pairwise sequence alignment between the two Citrus cp genomes revealed approximately 1.3% sequence divergence (Table 2), including 1,780 indels (1.11%) and 330 substitutions (0.21%). The LSC region contains more sequence polymorphisms than expected by its size, including 1,360 (76.4%) indels and 235 (71.2%) substitutions. In contrast, the two IR regions account for ca. 34% of the cp genome yet contain only 16 (0.9%) indels and 12 (3.6%) substitutions. The size differences in the LSC and SSC regions between these two cp genomes are mostly explained by one large indel in each region. The LSC sizes differ by 596 bp and a 523-bp indel was found in the spacer between rps16 and trnQ-   UUG. The SSC sizes differ by 370 bp and a 354-bp indel was found in the spacer between rpl32 and trnL-UAA.
To identify the intergenic regions that may be useful for phylogenic analysis or molecular identification, we searched for the spacers that are .400 bp in length and exhibit above-average sequence divergence between the two Citrus species (i.e., .1.3%). A total of three regions satisfied these criteria, including the spacer between trnH-GUG and psbA (449 bp, 1.6% divergence), the spacer between rpl32 and trnL-UAG (1141 bp, 1.5% divergence), and the spacer between trnD-GUC and trnY-GUA (469 bp, 1.3% divergence).
The junctions between the IR, LSC, and SSC regions in C. aurantiifolia are similar to that of C. sinensis except for the LSC-IRb boundary. A total of 23 indels and five substitutions were found at this region, resulting in one copy of rpl22 spanning across the LSC-IRb junction in C. aurantiifolia. Comparing the IR junctions of Citrus with Theobroma and Gossypium in Malvaceae [38], it was found that the IRs in Citrus have expanded to include rps19 and 252 nt of rpl22, whereas in Malvaceae, rps19 is located in LSC and rpl22 was missing [38][39][40].

Analyses of Repetitive Sequences
A total of 109 SSR loci were found in the cp genome of C. aurantiifoliaa, accounting for 1,352 bp of the total sequence (ca. 0.8%). Among these, 94 were also found in C. sinensis and 42 exhibit length polymorphism (Table 3). Most SSRs are located in intergenic regions, but some were found in coding genes such as matK and ycf1. Concerning the controversial status of Citrus taxonomy, the SSRs identified in this study may provide new perspective to refine the phylogeny and elucidate the origin of the cultivars. Furthermore, these SSRs may be used as molecular markers for population studies.
In addition, 62 large repeats that are longer than 30 bp were found in the C. aurantiifolia cp genome ( Table 4). Most of these repeats are located in intergenic spacers, except for three that are located in the coding regions of rps4, psaA and psaB. Twelve of these long repeats were also found in C. sinensis, indicating that these repeats might be widespread in the genus.

Gene Content Analyses within the Rosids
A maximum likelihood phylogenetic analysis of 72 representative rosids lineages was conducted based on a concatenated alignment of four rRNA and 58 protein-coding genes with 54,689 sites ( Figure 2). Citrus represents Sapindales and is sister to the clade containing Malvales and Brassicales. These relationships are congruent with the previous reports [10,[41][42][43]. Based on this phylogeny and the gene content, we inferred the gene loss events during the cp genome evolution in rosids.
The translation initiation factor gene infA in cp has been lost independently at least 24 times in angiosperms and evidence provided from some cases suggested functional replacement by a nucleus copy [44]. Although the majority of infA in our selected cp genomes were found to be pseudogenized or completely lost, an intact infA was found in Quercus, Francoa, and two Cuscumis species.
The rpl22 were found to be lost in Fabaceae [45] and Castanea of Fagaceae [46] following independent transfers to nucleus. Furthermore, another putative loss of rpl22 was detected in Passiflora [46]. The rpl22 in Malvaceae, including Theobroma and three Gossypium species, were found to be pseudogenized in our analysis. In Citrus, the ORF of rpl22 was shortened to 252-264 nt compared to the typical length of 399-489 nt in other rosids [10,46]. However, compared with the pseudogenized rpl22 found in Malvalvace, the rpl22 homologs in Citrus still show high sequence conservation. Additionally, the rpl22 transcripts can be identified in the EST database for various Citrus species (data not shown). Taking account into the above consideration, we did not annotate rpl22 as a pseudogene in Citrus.
The parallel losses of rps16 were found in several rosids lineages (Figure 2), including one time in Salicaceae, two times in Fabaceae and another two times in Brassicaceae. The loss of rps16 in Medicago and Populus was found to be substituted by a nuclearencoded copy that transferred from the mitochondrion (mt) [47]. Because the nuclear-encoded RPS16 was found to target both mt and cp in Arabidopsis, Lycopersicon, and Oryza [47], it is possible that the cp genome-encoded rps16 would not be maintained by selection and will eventually become lost in these lineages.
There are only a few gene loss events of photosynthetic genes found in rosids. In addition to the loss of psaI in Lathyrus sativus [48], the accD seems to be lost independently in Trifolium subterraneum and several Gerantiaceae species except for Geranium palmatum. In Trifolium, a nuclear-encoded accD copy has been reported [48], which presented another example of horizontal gene transfer from cp to nucleus. Successful gene transfers from cp to the nucleus in angiosperms are rare and have been only documented for four genes in rosids. Other than the three genes described above (i.e., infA, rpl22, and accD), the rpl32 in Populus (Salicaceae) is the fourth example [49][50][51].
The IR has been reported to be independently lost at least five times among seed plants, two of which are within rosids [51]. In addition to the inverted repeat lacking clade (IRLC) of papilionoid Fabaceae [52] and Erodium of Gerantiaceae [53,54], the IR was found to be lost in two lineages of Fragaria (Rosaceae), which are F. vesca ssp. bracteatea and F. mandschurica (accession: NC_018767, not shown in Figure 2). Based on the Fragaria phylogeny shown in a previous study [55], it seems that IR loss was not a single event in Fragaria.

Molecular Evolution of orf56 and ycf68 within the Rosids
In the comparison of gene content between the two Citrus cp genomes, C. aurantiifolia was found to contain two additional protein-coding genes. The first gene, orf56, is located in the trnA-UGC intron that contains one sequence homologous to previously recognized mitochondrial ACRS (ACR-toxin sensitivity gene) in Citrus [56]. In addition to the 171-bp identical sequences between cp orf56 and the ORF sequences of ACRS in mt, the full length of 355-bp region of ACRS that conferred sensitivity to ACR-toxin in E. coil are also identical. Furthermore, the whole trnA-UGC among two Citrus cp regions and C. jambhiri mitochondrial ACRS shared more than 96% identity ( Figure S1), which highlight the conservation of this region between cp and mt.
The gene orf56 has also been included in the annotation of complete cp genomes of Calycanthus [57] and Pelargonium [58]. Our BLAST search against the rosids genome database revealed that in addition to Citrus and Pelargonium, all of the species examined in Cucurbitaceae and Myrtales also contain an intact orf56 ( Figure 3). Moreover, an intact ACRS ORF is also present in the mt genomes of Liriodendron [59] and Silene [60] and the ORF sequences between cp and mt are identical. Goremykin et al. [57] suggested that the ACRS gene was relative recently transferred from cp to mt. Based on the phylogeny containing the cp orf56 and the mt ACRS ( Figure S2), it appears that orf56 has been independently transferred from cp to mt in different lineages.
The second gene, ycf68, is located in the trnI-GAU intron. A nearly identical sequence was found in C. sinensis but an additional T insertion near the C-terminus abolished the stop codon at the corresponding position. The intact ycf68 can be detected in several monocots and Nymphaeaceae [61,62]. However, in the majority of other rosids ( Figure 3) and the rest of the eudicots [61], the ycf68 homologs all contain premature stop codons. Although Raubeson et al. [61] argued that ycf68 is not a protein-coding gene based on the lack of intron-folding pattern, the high levels of sequence conservation among the ORFs of identified homologs suggest that the true identity and functionality of this putative gene remains to be further investigated.

Conclusions
We reported the complete cp genome sequence of Citrus aurantiifolia (Rutaceae) in this study. The genome organization and gene content is typical of most angiosperms and highly similar to that of C. sinensis (i.e., 98.7% identical at the nucleotide level). The only difference in the gene content between the two Citrus cp genomes is the C. aurantiifolia-specific presence of a proteincoding gene (ycf68) in the trnI-GAU intron. Notably, three long intergenic spacers with high sequence divergence and 94 shared SSR regions were identified in the C. aurantiifolia-C. sinensis comparison. These regions may provide phylogenetic utility at low taxonomic levels and could be applied to the molecular identification of Citrus cultivars. Finally, our comparative analysis of gene content among 72 representative rosids lineages highlighted multiple events of gene losses within this group. Figure S1 Alignment of the orf56-containing sequences of two Citrus cp genomes and C. jambhiri mitochondrial ACRS sequences. (TIF) Figure S2 The maximum likelihood phylogeny of the cp orf56 and mt ACRS ORF sequences.

(TIF)
Table S1 List of the complete chloroplast genome sequences included in the phylogenetic analysis. (XLSX)