Siberian larch (Larix sibirica Ledeb.) chloroplast genome and development of polymorphic chloroplast markers

Background The main objectives of this study were sequencing, assembling, and annotation of chloroplast genome of one of the main Siberian boreal forest tree conifer species Siberian larch (Larix sibirica Ledeb.) and detection of polymorphic genetic markers – microsatellite loci or simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs). Results We used the data of the whole genome sequencing of three Siberian larch trees from different regions - the Urals, Krasnoyarsk, and Khakassia, respectively. Sequence reads were obtained using the Illumina HiSeq2000 in the Laboratory of Forest Genomics at the Genome Research and Education Center of the Siberian Federal University. The assembling was done using the Bowtie2 mapping program and the SPAdes genomic assembler. The genome annotation was performed using the RAST service. We used the GMATo program for the SSRs search, and the Bowtie2 and UGENE programs for the SNPs detection. Length of the assembled chloroplast genome was 122,561 bp, which is similar to 122,474 bp in the closely related European larch (Larix decidua Mill.). As a result of annotation and comparison of the data with the existing data available only for three larch species - L. decidua, L. potaninii var. chinensis (complete genome 122,492 bp), and L. occidentalis (partial genome of 119,680 bp), we identified 110 genes, 34 of which represented tRNA, 4 rRNA, and 72 protein-coding genes. In total, 13 SNPs were detected; two of them were in the tRNA-Arg and Cell division protein FtsH genes, respectively. In addition, 23 SSR loci were identified. Conclusions The complete chloroplast genome sequence was obtained for Siberian larch for the first time. The reference complete chloroplast genomes, such as one described here, would greatly help in the chloroplast resequencing and search for additional genetic markers using population samples. The results of this research will be useful for further phylogenetic and gene flow studies in conifers. Electronic supplementary material The online version of this article (10.1186/s12859-018-2571-x) contains supplementary material, which is available to authorized users.


Background
The chloroplast genome in conifers, including larch species [1], has a unique, strictly paternal inheritance via pollen, unlike angiosperms, where it has a maternal inheritance via seeds [2]. It allows tracing paternal gene flow and lineages separately from maternal (mitochondrial genes) and bi-parental (nuclear genes) ones. Therefore, chloroplast DNA sequences are the most important source of genetic markers to study distribution of paternal genes and paternally based molecular phylogenetic relationships in conifers.
Larch species, as well as many other conifer species are the main boreal forest tree species, which comprisẽ 30% of the world's forested lands [3]. Boreal forests play a very important ecological role, but are also affected by the global climate change. On one hand, they suffer now from more frequent and drastic droughts, but on the other hand, their area is expanding in the northern regions, and their tree line is moving towards the north creating an ecotone, a highly dynamic transition area [4]. It is important to know how much of paternal associated gene flow by pollen contributes into establishing this zone, compared to the maternal and bi-parental contributions by seeds. Such studies require chloroplast markers. The next generation sequencing (NGS) technique allows whole chloroplast genome sequencing in multiple individuals and makes a search for molecular genetic markers more efficient. For instance, Parks et al. [5] sequenced chloroplast genomes in 37 pine species using NGS nearly completely. They found a significant amount of variation (especially in two loci ycf1 and ycf2) that provided them with additional data for inferring intrageneric phylogeny of genus Pinus.
Whole chloroplast genome comparison across different species and genera allows also studying organelle evolution and how it is associated with speciation and dispersal. Complete chloroplast genome sequences are available in NCBI Genbank for multiple plant species, including conifers. However, most of them represent the Pinus genus, and only three chloroplast genomes are available for the Larix genus: complete -for European (Larix decidua Mill.; AB501189.1) and Chinese (L. potaninii var. chinensis Beissn.; KX808508) larch and partial -for Western larch (L. occidentalis Nutt.; FJ899578.1).
Variation in the chloroplast genome is effectively used in phylogenetics at different levels. It allowed discriminating different subgenera and genera. For instance, Cronn et al. [6] compared chloroplast genome sequences of seven pine and one spruce species and found three regions that have deletions corresponding to the subgenera specific deletions in three genes: ycf12 (78 bp at the nucleotide starting position 51,051), psaM (93 bp at position 51,442), and ndhI (371 bp at position 101,988), respectively. These are common deletions in the chloroplast genome in pine species of the subgenus Strobus (i.e., P. gerardiana, P. krempfii, P. lambertiana, P. longaeva, P. monophylla, P. nelsonii, P. koraiensis); the corresponding genes were present in the subgenus Pinus (P. contorta, P. ponderosa, P. thunbergii) and in spruce Picea sitchensis [6].
Variation in the chloroplast genome can be also effectively used in discriminating different populations of the same species. For instance, Whittall et al. [7] demonstrated a strong differentiation between the mainland and island populations of Torrey pine (Pinus torreyana) based on 5 SNPs found in the entire chloroplast genome of 120 Kbp.

Methods
We used the data of the whole genome sequencing of three Siberian larch trees generated by Illumina HiSeq2000 [8]. DNA samples were isolated from needles and haploid callus of three Siberian larch trees, representing different regions in Russiathe Ural Mountains, Krasnoyarsk Region and Khakassia Republic, respectively. The Larix decidua Mill. [9] and L. occidentalis Nutt. [5] chloroplast genomes were used as a reference (NCBI Genbank accession numbers AB501189.1 and FJ899578.1, respectively). We did not use the chloroplast genome of L. potaninii [10] as a reference, because it was assembled by using the chloroplast genome of L. decidua (NC_016058; [9]) as a reference, but we used it in the comparative analysis. The paired-end (PE) and mate-pair (MP) libraries with fragment sizes of 400-500 bp (Ural and Krasnoyarsk trees) and 300-400 bp (Khakassia tree), respectively, were used for sequencing via 2 × 100 cycles by Illumina HiSeq2000.
The sequence reads were mapped to the reference chloroplast genomes using the Bowtie2 software [11], which is good for mapping short sequence reads to medium-sized and large genomes. This software implements an algorithm to derive the FM-index based on the Burrows-Wheeler Transform. The SPAdes genome assembler has been used to assemble the larch genome, which implements the De Bruijn graph approach [12]. The Rapid Annotation service with Subsystem Technology (RAST) has been used for annotation [13].
The first step in our assembly procedure consisted of mapping short reads to the available chloroplast genome references of L. decidua and L. occidentalis using the Bowtie2 software. Then, the aligned reads were assembled by SPAdes. The obtained contigs were aligned again on the reference of L. decidua using BLAST. In the third step, the selected contigs were verified to get the "trusted" status. Then, the assembly was carried out using SPAdes. The final step of the assembly was scaffolding, which was done using the generated contigs and MP reads using the SSPACE program [14].
Considering the well-known fact that chloroplast organelle originated from cyanobacteria, and that, therefore, chloroplast genes are still very similar to the bacterial ones, the RAST service, which was designed for annotation of bacterial and archaeal genomes, was used for the larch genome annotation. The annotation obtained by the RAST contained both the confirmed known genes and the predicted genes, potentially coding hypothetical proteins. In order to clarify the roles of these hypothetical coding regions, our annotation was compared with the annotations of two closely related species, L. decidua and L. occidentalis, respectively.
In addition, some fragments of the genome have been also selectively aligned with BLAST. The sites of hypothetical proteins confirmed by BLAST were identified and recorded.
The assembled chloroplast genome of L. sibirica has been deposited in the NCBI GenBank with the accession number NC_036811.1 and used as a reference to search for polymorphisms (SNPs and SSRs) among the three above mentioned trees. SNPs were searched using the Bowtie2 and UGENE [15] software (option Call Variants with SAMtools). First, the reads of the Urals and Khakassian trees were mapped to the finally assembled genome Fig. 1 Gene map of the Larix sibirica chloroplast genome. Genes belonging to different functional groups are color-coded. The dark and light grey in the inner circle represents the GC and AT content, respectively of the Krasnoyarsk tree. Then, the resulting sam-file together with the assembled genome was used by the UGENE program to search for SNPs. The SSR loci were searched using GMATo [16] with a threshold of minimum 6 repeats for di-, tri-, tetra-, penta-, and hexanucleotide motifs, and 10 minimum repeats for mononucleotide motifs.

Results
The total length of the final Siberian larch chloroplast genome assembly was 122,560 bp, which is very close to 122,474 bp in the closely related European larch (Larix decidua). The annotation through comparison with the available data for L. decidua and L. occidentalis identified 110 genes, from which 34 represented tRNA genes, 4 rRNA, and 72 protein-coding genes. In three trees 13 SNPs were detected. Two of them were found in the coding regions of the tRNA-Arg and Cell division protein ycf2 genes.
We used the available software, such as Bowtie2, BLAST, and SPAdes to assemble the chloroplast genome using the reads generated in the whole genome sequencing of the Siberian larch project. We used SSPACE for scaffolding and the RAST service for annotation of the obtained chloroplast genome. We developed a procedure that allowed us to successfully extract chloroplast genome specific reads and then assemble and annotate the resulting sequences. We identified and verified 110 coding regions representing 38 RNA and 72 protein genes, which is equal to the number of genes in chloroplast sequences of L. decidua and L. potaninii and close to 105 genes in the partial chloroplast genome sequence of L. occidentalis. A gene map of the genome was generated using OGDRAW [17] and is presented in Fig. 1. The search for SNPs using UGENE revealed a relatively small number of SNPs ( Fig. 2; Additional file 1), but it is only preliminary data based on a limited sample size. In addition, 23 SSR loci (16 with mono-and 7 with dinucleotide repeat motifs) were identified in the chloroplast genome (Additional file 2). No SSR loci with tri-, tetra-, penta-, and hexanucleotide repeats were found with the search parameters used.

Discussion
The chloroplast genome variation in most plants is often limited due to a relatively low frequency of mutations in this organelle. For example, the mutation rate of the chloroplast genome in pines is approximately 0.2-0.4 × 10 − 9 synonymous substitutions per nucleotide per year [18,19]. However, with an average length of 120-160 Kbp and 130 genes, the chloroplast genomes are sufficiently large and complex, and include structural and point mutations that reflect population differentiation and evolutionary divergence [6].
Unlike angiosperms, the conifer chloroplast DNA (cpDNA) lacks large inverted repeats (IR), but contains dispersed repetitive DNA that is associated with structural rearrangements. In addition to large dispersed repeated sequences, the conifer cpDNA also possesses a number of small repeats. It contains variable numbers of tandem repeats of 124 to 150 bp in size, which are associated with the polymorphic rearranged region near trnK-psbA, where the psbA gene has been duplicated [20].
Most variation in the chloroplast genome is associated with microsatellite loci [21,22]. However, these markers have a too high mutation rate that can lead to incorrect phylogenetic inferences [23][24][25]. SNPs could be better markers for phylogenetic inferences, and comparative complete chloroplast genome studies are needed to discover these markers. The reference complete chloroplast genomes, such as the one described here, would greatly help in chloroplast resequencing and search for SNPs using population samples.

Conclusions
The complete chloroplast genome sequence was obtained for Siberian larch for the first time. Annotation and comparison of the obtained data with data available only for two other larch species helped us identify and verify 110 coding regions representing 38 RNA and 72 protein genes. The total of 13 SNPs were detected; two of them were in the coding regions of the genome. The results of this research will be useful for further phylogenetic and gene flow studies in conifers.