The Complete Chloroplast Genome of Arabidopsis thaliana Isolated in Korea (Brassicaceae): An Investigation of Intraspecific Variations of the Chloroplast Genome of Korean A. thaliana

Arabidopsis thaliana (L.) Heynh. is a model organism of plant molecular biology. More than 1,700 whole genomes have been sequenced, but no genome of a Korean isolate has been sequenced thus far, even though many A. thaliana isolates from Japan and China have been sequenced. To understand the genetic background of a Korean natural isolate of A. thaliana (named 180404IB4), we present its complete chloroplast genome, which is 154,464 bp long and has four subregions: 85,164 bp of large single copy (LSC) and 17,781 bp of small single copy (SSC) regions are separated by 26,257 bp of inverted repeat (IR) regions. It contains 130 genes (85 protein-coding genes, eight rRNAs, and 37 tRNAs). Fifty single nucleotide polymorphisms (SNPs) and 14 insertions and deletions (INDELs) are identified between 180404IB4 and Col0. In addition, 101 SSRs and 42 extendedSSRs were identified on the Korean A. thaliana chloroplast genome, a number similar to those of the other five chloroplast genomes, with sequence variations preferentially located in SSR regions. A nucleotide diversity analysis revealed two highly variable regions on A. thaliana chloroplast genomes. Phylogenetic trees including three more chloroplast genomes of East Asian natural isolates show that the Korean and Chinese natural isolates are clustered together, whereas the two Japanese isolates are not, suggesting the need for additional investigations of the chloroplast genomes of East Asian isolates.


Introduction
Arabidopsis thaliana (L.) Heynh. is a well-known species familiar to those who study plant molecular biology as well as genetic engineering. It was considered to be a weedy species before being used as a model organism [1], representing a good example of the usefulness of weeds. Owing to its importance as a model plant organism, the complete chloroplast genome of the Col0 strain was deciphered in 1999 [2]. Its length is 154,478 bp, with a large single copy (LSC) region of 84,170 bp and a small single copy (SSC) region of 17,780 bp separated by inverted repeat (IR; 26,264 bp) regions. It was also found to have 128 genes consisting of 87 protein-coding, 37 tRNA, and four rRNA genes.
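The reported lengths can be cross-checked against the quadripartite structure, in which the total length equals LSC + SSC + 2 × IR. The following is a minimal sketch of that check; the function name is ours for illustration and is not part of the original study.

```python
def implied_ir_length(total_bp: int, lsc_bp: int, ssc_bp: int) -> int:
    """Return the length of one IR copy implied by total = LSC + SSC + 2 * IR."""
    remainder = total_bp - lsc_bp - ssc_bp
    if remainder < 0 or remainder % 2:
        raise ValueError("lengths are inconsistent with a quadripartite structure")
    return remainder // 2


# Col0 values quoted in the text: total 154,478 bp, LSC 84,170 bp, SSC 17,780 bp.
col0_ir = implied_ir_length(154_478, 84_170, 17_780)
print(col0_ir)  # length of a single IR copy in bp
```

The same function can be applied to any of the assembled genomes to confirm that the subregion lengths in a table add up to the reported total.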
Subsequently, the whole genome sequence of A. thaliana was released in 2000, presenting a 115.4 Mb genome with 25,498 genes [3]. Owing to the rapid development of sequencing technologies, including next-generation sequencing (NGS) technologies, more than 1,750 A. thaliana whole genomes have been sequenced and analyzed. Whole genomes of the Bur-0 and Tsu-1 strains were sequenced with an early version of the Illumina sequencer [4]. The genomes of ebi-1 and Ws-2, which are clock mutants, were also sequenced [5]. Whole genomes of 80 strains isolated from eight regions were also sequenced [6]. In addition, whole genomes of 18 Arabidopsis ecotypes were sequenced, and assembled sequences were provided for each ecotype, which can be used in further comparative genomic analyses [7]. However, the organelle genomes of the 18 ecotype genomes were not correctly assembled, though these can be rectified using raw sequences. With additional improvements in NGS technologies that have lowered the costs of sequencing, the number of sequenced A. thaliana isolates has increased over time. Specifically, genomes of 180 Arabidopsis lines isolated from Sweden were sequenced [8]; 217 Arabidopsis genomes were sequenced to uncover genome-wide methylation patterns [9]; the genomes of 118 Chinese Arabidopsis strains were sequenced, showing that the Yangtze River population in China can be considered an independent lineage at the same level as Central Asian and European isolates [10]; and 1,135 natural isolates from all over the world were sequenced and analyzed [11]. Interestingly, except for one study sequencing Chinese A. thaliana strains [10], only two Arabidopsis strains originating from East Asia have been sequenced (Kyoto and Tsu0 from Japan) [11].
Therefore, once genome sequences of Arabidopsis isolated in Korea become available, they can serve as a bridge between China and Japan in the effort to reconstruct the evolutionary history of natural isolates of A. thaliana in East Asia.
To understand the characteristics of A. thaliana isolated in Korea (termed 180404IB4) based on chloroplast genome sequences, we completed its chloroplast genome, which is the third chloroplast genome of A. thaliana according to the NCBI database and related publications [2,12]. The chloroplast genome of A. thaliana 180404IB4 has the shortest total length and the shortest inverted repeat (IR) region, caused by a 6 bp deletion in ycf2. Because few chloroplast genomes of A. thaliana are available (only two in NCBI: Col0 and Ler-0), we assembled three additional chloroplast genomes of East Asian isolates of A. thaliana from raw sequences downloaded from the Sequence Read Archive (SRA) in NCBI; among them, Tsu0 (a Japanese isolate) is the longest owing to an insertion of approximately 500 bp. The numbers of sequence variations calculated against the 180404IB4 chloroplast genome fall in the middle of the range of intraspecific variations reported for many plant chloroplast genomes. Phylogenetic analyses of these chloroplast genomes indicate that 180404IB4 (Korea) and 11-15 (China) are clustered, whereas the two Japanese isolates (Kyoto and Tsu0) are scattered. These results, together with upcoming research, will provide a glimpse of the evolutionary history of A. thaliana in the East Asian region.

The chloroplast genome was assembled de novo with Velvet 1.2.10 [33] after filtering raw reads using Trimmomatic 0.33 [34]. After obtaining the first draft of the chloroplast genome sequence, gaps were filled with GapCloser 1.12 [35], and all bases of the assembled sequence were confirmed by checking each base in the alignment of reads against the assembled chloroplast genome, which was generated with BWA 0.7.17 [37] and inspected in the tview mode of SAMtools 1.9 [36]. All bioinformatic processes were conducted in the environment of the Genome Information System (GeIS; http://geis.infoboss.co.kr/; Park et al., in preparation).

De Novo Assembly of the Japanese and Chinese Natural Isolates A. thaliana Chloroplast Genomes. Raw sequences downloaded from NCBI SRA (SRR492307, Kyoto (Japan); ERR031555, Tsu0 (Japan); and SRR2204166, 15-11 (China)) [7,9,10] were used for de novo chloroplast genome assembly with Velvet 1.2.10 [33] after filtering raw reads using Trimmomatic 0.33 [34] in the environment of the Genome Information System (GeIS; http://geis.infoboss.co.kr/; Park et al., in preparation). The remaining steps for finalizing the chloroplast genomes were identical to those used in the assembly of the Korean A. thaliana chloroplast genome.

2.9. Construction of Phylogenetic Trees. Seventeen Arabidopsis chloroplast genomes and one Arabis chloroplast genome were aligned with MAFFT 7.450 [39], and alignment quality was checked manually. The neighbor-joining (NJ) and maximum likelihood (ML) trees were reconstructed in MEGA X [48]. In the ML analysis, a heuristic search was used with nearest-neighbor interchange (NNI) branch swapping, the Tamura-Nei model, and uniform rates among sites. All other options used the default settings. Bootstrap analyses with 1,000 pseudoreplicates were conducted with the same options. The posterior probability of each node was estimated by Bayesian inference (BI) using the MrBayes 3.2.6 [49] plug-in implemented in Geneious R11 11.0.5. The HKY85 model with gamma rates was used as the molecular model. A Markov chain Monte Carlo (MCMC) algorithm was run for 1,100,000 generations, sampling trees every 200 generations, with four chains running simultaneously. Trees from the first 100,000 generations were discarded as burn-in.

The chloroplast genome of A. thaliana 180404IB4 (Figure 1) contains 130 genes (85 protein-coding genes, eight rRNAs, and 37 tRNAs), 19 of which (8 protein-coding genes, 4 rRNAs, and 7 tRNAs) are duplicated in the IR regions (Figure 1). The overall GC content is 36.3%, and the GC contents of the LSC, SSC, and IR regions are 34.0%, 29.3%, and 42.3%, respectively.
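GC contents such as those reported above can be computed directly from the assembled sequence and its subregions. The sketch below is illustrative only: the function name and the toy sequence are ours, and ambiguous bases (e.g., N) are excluded from the denominator as an assumption, not as a statement of how the original values were obtained.

```python
def gc_content(seq: str) -> float:
    """Return the GC percentage of a nucleotide sequence, ignoring non-ACGT symbols."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


# Toy sequence standing in for a genome; a subregion is taken by slicing
# with 0-based, end-exclusive coordinates (hypothetical values).
toy = "ATGCGCATATNNATGC"
print(round(gc_content(toy), 1))
print(round(gc_content(toy[0:6]), 1))
```

Applying the function to the LSC, SSC, and IR slices of a genome gives the per-region values in the same way.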

Results and Discussions
To determine the characteristics of A. thaliana chloroplast genomes from East Asia, we also completed three chloroplast genomes of A. thaliana, two from Japan (Kyoto and Tsu0) and one from China (11-15; Table 1). In addition, two available chloroplast genomes (Col0 and Ler-0) were used for comparative analyses. Their lengths range from 154,464 bp (180404IB4) to 154,938 bp (Tsu0), and the IR length ranges from 26,257 bp to 26,264 bp (Table 1). The chloroplast genome of the Korean isolate, 180404IB4, is the shortest, and its IR is also the shortest among the four East Asian A. thaliana chloroplast genomes (Table 1). This is caused by a 6 bp deletion in ycf2, located in the IR region, relative to the other five A. thaliana chloroplast genomes. Interestingly, Tsu0 has the longest chloroplast genome, caused by an insertion of approximately 500 bp between trnL and trnF. This is similar to the cases of Coffea arabica with one continuous insertion region [50], Duchesnea chrysantha with three continuous insertion regions [21], Viburnum amplificatum with two continuous insertion regions [32], and the mitochondrial genomes of Populus tremula x Populus glandulosa and Liriodendron tulipifera with four and thirty-three continuous insertion regions, respectively [51,52].

Identification and Evaluation of Sequence Variations of the A. thaliana 180404IB4 Chloroplast Genome against the Col0 Chloroplast Genome. Based on the pairwise alignment with the A. thaliana Col0 chloroplast genome (GenBank accession number NC_000932), 50 single nucleotide polymorphisms (SNPs) and 14 insertions and deletions (INDELs) are identified. Two SNPs on rpoC2 and one SNP each on the ycf2 and ndhF genes are nonsynonymous, while for rpoC2, rpoB, rbcL, rpl20, and psbB, one SNP in each gene is synonymous (Table 2). Specifically, ycf2 has a 6 bp deletion in the 180404IB4 chloroplast genome; this does not cause a frameshift but is a critical variation that makes the 180404IB4 chloroplast genome the shortest among the six chloroplast genomes (Tables 1 and 2). Except for this deletion, all INDELs are located in intergenic regions. These INDELs make the 180404IB4 chloroplast genome 14 bp shorter than that of Col0.

[...] Table 3). Some of these studies compared chloroplast genomes of natural isolates (e.g., Duchesnea chrysantha [21]), and some compared cultivars to find useful molecular markers (e.g., Chenopodium quinoa [70]; Table 3). These studies cover 23 families, a relatively broad coverage, so we expected that some characteristics of sequence variations on chloroplast genomes could be identified. In addition, we compared the numbers of SNPs and INDELs directly for intuitive understanding, because the complete chloroplast genome lengths are around 150 kb except for the genera Marchantia, Selaginella, Gastrodia, Illicium, Pseudostellaria, and Daphne ( [...] [75], some cases of Cucumis melo [76] and Chenopodium quinoa [70], all of Dioscorea polystachya [77], Oryza sativa among cultivars [78], G. schlechtendaliana [53], and G. elata [56] ( [...] [15,80], and Nymphaea (586 SNPs and 1,150 INDELs between Nymphaea capensis and Nymphaea ampla) [21], no clear levels pertaining to the number of intraspecific or interspecific variations exist.
However, the numbers of SNPs and INDELs between 180404IB4 and Col0 are relatively small considering the intercontinental distance between two samples of the same plant species.
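Counting SNPs and INDELs from a pairwise alignment can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function name is hypothetical, gaps are written as "-", and INDELs are reported both as gapped alignment columns and as contiguous gap runs because the counting convention is not specified in the text.

```python
def count_variants(ref_aln: str, qry_aln: str) -> dict:
    """Count SNPs, gapped columns, and gap runs between two aligned sequences."""
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    snps = gap_columns = gap_runs = 0
    prev_side = None  # sequence carrying the gap in the previous informative column
    for r, q in zip(ref_aln.upper(), qry_aln.upper()):
        if r == "-" and q == "-":
            continue  # uninformative column; does not interrupt a gap run
        side = "ref" if r == "-" else "qry" if q == "-" else None
        if side is not None:
            gap_columns += 1
            if side != prev_side:
                gap_runs += 1  # a new insertion or deletion event starts here
        elif r != q and r in "ACGT" and q in "ACGT":
            snps += 1
        prev_side = side
    return {"snps": snps, "gap_columns": gap_columns, "gap_runs": gap_runs}


print(count_variants("ACGT--ACGT", "ACCTGGAC-T"))
```

Reporting both gap measures makes it easier to compare against published counts, since a single long insertion contributes one gap run but many gapped columns.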

Comparison and Evaluation of Sequence Variations of Chloroplast Genomes of the Six A. thaliana in East Asia. Based on 15 pairwise alignments of the six A. thaliana chloroplast genomes, the numbers of SNPs and INDELs between two A. thaliana chloroplast genomes range from 10 to 116 and from 22 to 570, respectively (Figure 2). The chloroplast genome of Tsu0, a Japanese natural isolate, contains large insertions compared to the remaining five chloroplast genomes of A. thaliana, consistent with Tsu0 having the longest chloroplast genome (Table 1). The number of INDELs in comparisons with the Tsu0 chloroplast genome (GenBank accession number MK380721) ranges from 470 to 570, much higher than those of the other combinations (Figure 2). This case is similar to those of C. arabica, showing one 84 bp insertion region [50], and D. chrysantha, presenting three insertion regions [21]. In terms of the number of INDELs, this also represents high intraspecific variation: only P. ussuriensis [68], G. schlechtendaliana [53], and G. elata [56] present higher numbers of INDELs (Table 3). In addition, two out of the three Orchidaceae species show high rates of divergence in terms of flower morphologies as well as the number of species [81][82][83]. This indicates that the Tsu0 insertion is an exceptional case of intraspecific variation. Accordingly, Kyoto (GenBank accession number MK380720), which was also isolated in Japan, and Tsu0 differ by 97 SNPs and 482 INDELs (Figure 2), suggesting that Tsu0 has a different genomic configuration compared to the remaining five strains.
International Journal of Genomics
The numbers of sequence variations on the six Arabidopsis chloroplast genomes were plotted together with the numbers of intraspecific variations identified from 90 comparisons of 31 species (Table 3), resulting in three groups: in the first, the number of SNPs is less than 80 and the number of INDELs is less than 100; in the second, the number of SNPs is less than 80 and the number of INDELs is between 100 and 200; and in the third, the number of SNPs exceeds 80 and the number of INDELs is approximately 500 (Figure 3). The third group is caused by the long insertion of the Tsu0 chloroplast genome. It lies in an area with a relatively high number of variations, while the remaining groups are similar to most intraspecific variations on chloroplast genomes (green thick dotted circles in Figure 3).
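The three groups described above can be expressed as a simple rule on the numbers of SNPs and INDELs of each pairwise comparison. The sketch below is only illustrative: the function name is ours, and the cutoff of 400 INDELs used to represent "approximately 500" is an assumption rather than a threshold stated in the text.

```python
from typing import Optional


def variation_group(n_snps: int, n_indels: int) -> Optional[int]:
    """Assign a pairwise comparison to group 1, 2, or 3, or None if it fits no group."""
    if n_snps < 80 and n_indels < 100:
        return 1
    if n_snps < 80 and 100 <= n_indels <= 200:
        return 2
    if n_snps > 80 and n_indels >= 400:  # assumed cutoff for "approximately 500"
        return 3
    return None


# Two comparisons quoted in the text:
print(variation_group(50, 14))   # 180404IB4 vs. Col0
print(variation_group(97, 482))  # Kyoto vs. Tsu0
```

Comparisons that fall outside all three ranges return None, which makes it explicit that the grouping is descriptive of the observed data and does not partition every possible combination.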

Comparative Analysis of Simple Sequence Repeats (SSRs) Polymorphisms on Chloroplast Genomes inside East Asian A. thaliana. One hundred and one simple sequence repeats (SSRs) and 42 extendedSSRs were identified on the chloroplast genome sequence of the Korean isolate of A. thaliana (Supplementary Table 1). One hundred and four (72.72%), 18 (12.59%), and 21 (14.69%) SSRs and extendedSSRs were found in the LSC, IR, and SSC regions, respectively. This distribution is similar to that of Dysphania ambrosioides, but not to those of Dysphania pumilio or Dysphania botrys [30]. Eighteen SSRs and four extendedSSRs (15.38%) are located in the exonic regions of ten protein-coding genes (matK, rpoC2, rpoB, atpB, accD, psbB, rps12, rpoA, ycf1, and ndhF) and two tRNA genes (trnK and trnR), which is a higher proportion than that of D. ambrosioides [30]. In addition, the number of such genes on the A. thaliana chloroplast genome exceeds that of D. ambrosioides by one, while five out of the ten protein-coding genes are shared between the two species. Twenty-five SSRs and 13 extendedSSRs (26.57%) are in the intronic regions of five protein-coding genes (ycf3, rps12, clpP, rps16, and ndhA) and three tRNA genes (trnK, trnR, and trnA). In a previous study that identified SSRs in 12 chloroplast genomes of Brassicaceae, the numbers of SSRs found on genes were similar to each other, ranging from 40 to 60 [45], which is similar to that of the A. thaliana Korean isolate. We also applied the same method to identify SSRs on the other five chloroplast genomes of A. thaliana used in this study (Table 4). The total numbers of SSRs and extendedSSRs range from 143 to 145, with the Korean isolate of A. thaliana having the fewest, at 143 (Table 4). Given the small number of sequence variations among the six chloroplast genomes (Figure 2), the numbers of SSRs and extendedSSRs for each motif length are expected to be nearly identical; however, only the triSSRs, nonaSSRs, and decaSSRs show identical numbers across the six chloroplast genomes (Table 4).
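To illustrate how perfect SSRs can be located on a sequence, the sketch below scans for tandem repeats with a regular expression. It is not the method used in this study: the function names, the minimum repeat thresholds, and the toy sequence are assumptions chosen for illustration, and extendedSSRs are not handled.

```python
import re


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit (e.g., 'TATA')."""
    return (motif + motif).find(motif, 1) == len(motif)


def find_ssrs(seq: str, min_repeats: dict[int, int]) -> list[tuple[int, str, int]]:
    """Return (0-based start, motif, repeat count) for perfect tandem repeats.

    min_repeats maps motif length to the minimum number of repeats (at least 2).
    """
    seq = seq.upper()
    hits = []
    for k, m in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, m - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):  # skip, e.g., 'AA' repeats already found as 'A'
                hits.append((match.start(), motif, len(match.group(0)) // k))
    return sorted(hits)


# Assumed thresholds: mono >= 10, di >= 5, tri >= 4 repeats.
thresholds = {1: 10, 2: 5, 3: 4}
print(find_ssrs("GGTATATATATACC", thresholds))  # → [(2, 'TA', 5)]
```

The primitive-motif filter avoids reporting the same repeat twice under different motif lengths, which would otherwise inflate per-class counts such as those in Table 4.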
Using the SSR comparison pipeline implemented in SSRDB, 117 groups of SSRs or extendedSSRs containing six SSRs from the six A. thaliana chloroplast genomes are identified, accounting for 702 out of 864 SSRs or extendedSSRs (81.25%; Figure 4). There is one interesting SSR group (named SSR Group 2) containing six SSRs from the six A. thaliana chloroplast genomes: two are octaSSRs (TATCTATA * 2) and four are diSSRs (TA * 5). Twenty-one SSR groups contain five SSRs or extendedSSRs from five chloroplast genomes, explaining 105 out of 864 SSRs or extendedSSRs (12.15%; Figure 4). In addition, five SSR groups containing four SSRs or extendedSSRs from four chloroplast genomes, three SSR groups containing three SSRs or extendedSSRs from three chloroplast genomes, four SSR groups containing two SSRs or extendedSSRs from two chloroplast genomes, and 20 singletons, indicating unique SSRs, are identified among the six chloroplast genomes (Figure 4). Considering the coverage of the SSRs and extendedSSRs on the chloroplast genome of the Korean isolate of A. thaliana (in total, 1,825 bp out of 154,464 bp; 1.18%), the expected number of sequence variations in the SSR and extendedSSR regions is 0.75; however, the number of common SSRs or extendedSSRs is 117 (81.82%), indicating that the number of sequence variations located in SSR or extendedSSR regions is higher than the expected number (107.25 sequence variations for SSR or extendedSSR regions). These variations can be used to develop molecular markers [41].
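The counts and percentages in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes that the expected number of variations in SSR regions was obtained by multiplying the 64 variations against Col0 (50 SNPs and 14 INDELs) by the fraction of the genome covered by SSRs; this derivation is our reading of the text, and the variable names are ours.

```python
# Number of SSR groups, keyed by how many of the six genomes contribute a member.
groups = {6: 117, 5: 21, 4: 5, 3: 3, 2: 4, 1: 20}

# Total SSRs and extendedSSRs across the six genomes.
total_ssrs = sum(size * count for size, count in groups.items())


def share(size: int) -> float:
    """Percentage of all SSRs that belong to groups of the given size."""
    return 100.0 * size * groups[size] / total_ssrs


# Expected number of variations falling in SSR regions under uniform placement.
n_variations = 50 + 14            # SNPs + INDELs between 180404IB4 and Col0
ssr_coverage = 1825 / 154464      # fraction of the genome covered by SSRs
expected_in_ssr = n_variations * ssr_coverage

print(total_ssrs, round(share(6), 2), round(share(5), 2))
print(round(expected_in_ssr, 3))
```

The expected value comes out slightly above the 0.75 quoted in the text, which is consistent with that figure having been computed from the rounded coverage of 1.18%.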
Figure 4: Distribution of the SSR groups identified from six chloroplast genomes of A. thaliana. The X-axis indicates the types of the SSR groups, and the Y-axis indicates the number of SSR groups or SSRs/extendedSSRs. The blue graph shows the number of SSR groups, and the orange bars show the number of SSRs in the SSR groups.

3.5. Comparison of Nucleotide Diversity among the Six A. thaliana Chloroplast Genomes. Nucleotide diversity among the six Arabidopsis thaliana chloroplast genomes was calculated; the average nucleotide diversity is 0.00017 (Figure 5(a)), which is at least ten times lower than those of Dysphania (0.0068; Chenopodiaceae) [27] and Viburnum (0.00176; Adoxaceae) [32]. This is an expected result because nucleotide diversity within a species is usually lower than interspecific nucleotide diversity.
Two significant peaks are identified in the sliding window analysis of nucleotide diversity: one is trnL/trnF (pi value of 0.0147) and the other is trnP/psaJ (pi value of 0.00441). There are fewer peaks than in other studies, including those focusing on Dysphania [27] and Viburnum [32], owing to the low level of nucleotide diversity throughout the chloroplast genome. The first peak, trnL/trnF, is due to the large insertion in the Tsu0 chloroplast genome (Figure 5(b)). The second peak, trnP/psaJ, reflects the sequence variations occurring in 180404IB4 (Korea), [...], Tsu0 (Japan), and Ler-0 (Germany; Figure 5(c)). Specifically, SNPs located between positions 67,670 and 67,680 in both the 180404IB4 and 15-10 isolates mainly contribute to this peak (Figure 5(c)).
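A sliding window analysis of nucleotide diversity can be sketched as below, where pi is the average proportion of differing sites over all pairs of aligned sequences. This is an illustration under assumptions and not the software used in this study: the function names, the window and step sizes, and the choice to skip columns with gaps or ambiguous bases within each pair are ours.

```python
from itertools import combinations


def pairwise_pi(seqs: list[str]) -> float:
    """Average proportion of differing sites over all sequence pairs (at least two sequences)."""
    pairs = list(combinations(seqs, 2))
    total = 0.0
    for a, b in pairs:
        sites = diffs = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in "ACGT" and y in "ACGT":  # skip gaps and ambiguous bases
                sites += 1
                diffs += x != y
        if sites:
            total += diffs / sites  # pairs without comparable sites contribute 0
    return total / len(pairs)


def sliding_pi(aln: list[str], window: int, step: int) -> list[tuple[int, float]]:
    """Return (0-based window start, pi) along an alignment of equal-length sequences."""
    length = len(aln[0])
    return [
        (start, pairwise_pi([s[start:start + window] for s in aln]))
        for start in range(0, length - window + 1, step)
    ]


toy_alignment = ["ACGTACGT", "ACGTACGA", "ACGTACGT"]
print(sliding_pi(toy_alignment, window=4, step=4))
```

Peaks such as those described above correspond to windows whose pi value stands out from the genome-wide average.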
3.6. Comparison of the IR Junction among Arabidopsis thaliana Chloroplast Genomes. The IR region of the plant chloroplast genome is the major site at which chloroplast genome sequences expand or shrink [84][85][86][87][88]. An investigation of the IR junctions of A. thaliana chloroplast genomes shows that there are no differences among the six A. thaliana chloroplast genomes, in agreement with the finding of no structural variations (Figure 6) and identical to the case of D. ambrosioides [30]. In addition, all Arabidopsis chloroplast genomes used in this study present the same structure at the IR junctions.

Phylogenetic Analysis of Korean A. thaliana Chloroplast Genome Sequence. Bootstrapped neighbor-joining (NJ), maximum likelihood (ML), and Bayesian inference (BI) phylogenetic trees of seventeen Arabidopsis chloroplast genomes, including six A. thaliana chloroplast genomes, and one Arabis chloroplast genome as an outgroup indicate that 180404IB4 is clustered with 15-11 (Chinese natural isolate) with high bootstrap support, whereas the two Japanese isolates are not clustered together, contrary to expectation (Figure 7), as Tsu0 has an insertion of approximately 500 bp compared to all other A. thaliana chloroplast genomes. This indicates that more chloroplast genomes of East Asian A. thaliana natural isolates should be investigated to find exceptional sequence variations, such as the Tsu0 insertion.

Practically, it is possible to utilize currently available NGS raw read datasets of A. thaliana natural isolates by making the effort to assemble them. In addition, we must consider the possibility that Col0 strains have escaped from the many molecular laboratories in Korea, which would affect the genetic diversity of natural isolates in some ways. Based on the phylogenetic trees, there appears to be no such contamination in Korea. Several intraspecific phylogenetic relationships of plant species have been studied using whole chloroplast genomes, including Aconitum coreanum, showing small branches of three individuals with high bootstrap values from both the ML and NJ methods owing to a mid-level of sequence variations [89]; G. schlechtendaliana, displaying branches for each sample caused by a sufficient number of sequence variations, with high bootstrap values in both methods [53,54]; Abeliophyllum distichum, indicating partial support of intraspecific individuals from both methods [90][91][92]; and Coffea arabica, showing high bootstrap values from both methods with no branches for individual sequences due to the low level of sequence variations [50,[59][60][61][62][63]. All these results differ from that of A. thaliana, which presents a different clade structure in the phylogenetic trees constructed by the three methods (Figure 7). Instead, this phenomenon was found in genome studies focusing on intraspecific variations of insect, fungal, and marine invertebrate mitochondria. These include Laodelphax striatellus [93,94] and Nilaparvata lugens [95][96][97], belonging to the Delphacidae family; Fusarium oxysporum, which is a fungal plant pathogen [98,99]; and Apostichopus japonicus [100]. Because A. thaliana has a sufficient amount of sequencing data to construct chloroplast genomes, additional studies with more complete chloroplast genomes will provide a clear answer as to whether or not this phenomenon remains.

Conclusions
We sequenced and assembled the chloroplast genome of the Korean isolate of A. thaliana and compared it with the other East Asian A. thaliana chloroplast genomes assembled from publicly available NGS raw reads. Based on the numbers of sequence variations among the six A. thaliana chloroplast genomes, three groups with low, medium, and high levels of sequence variations were found, the last being due to the large insertion identified on the Tsu0 chloroplast genome. Here, 101 SSRs and 42 extendedSSRs were identified on the Korean A. thaliana chloroplast genome, with similar numbers of SSRs on the remaining five chloroplast genomes and with sequence variations preferentially located in SSR regions. Nucleotide diversity on the six A. thaliana chloroplast genomes indicates only two regions that are highly variable, an outcome that is less dynamic than those of interspecific comparisons of chloroplast genomes. As expected, the IR borders of the six chloroplast genomes are conserved. Phylogenetic analyses of the six A. thaliana chloroplast genomes with those of other Arabidopsis species revealed that the geographical distribution is not congruent with the phylogenetic relationships; however, more complete chloroplast genomes are required for further analysis. Additional whole chloroplast genomes of A. thaliana strains, assembled from the large amount of genomic resources available for A. thaliana, can describe the detailed evolutionary history of the natural isolates of A. thaliana in East Asia, especially in Korea, China, and Japan.

Data Availability
The chloroplast genome sequence of the Korean A. thaliana isolate can be accessed via accession number MK353213 in NCBI GenBank. In addition, three more chloroplast genomes of A. thaliana (Kyoto, Tsu0, and 11-15), assembled from SRA datasets, are accessible through MK380720, MK380721, and MK380719, respectively.

Conflicts of Interest
The authors declare that they have no competing interests.