Background & Summary

Hibiscus syriacus, commonly known as rose of Sharon, is a fast-growing deciduous shrub belonging to the Malvaceae family and is renowned for its diverse applications, including culinary, ornamental, and medicinal uses1. Its wide range of flower colors makes it an attractive choice for decorative landscaping2,3. In North American countries, it has gained immense popularity as a garden tree due to its versatile properties4. However, breeding H. syriacus presents significant challenges due to its self-incompatibility, resulting in most landraces being natural hybrids5. Consequently, there have been limited reports of breeding trials aimed at developing polyploid plants4,6. In Korea, breeding advancements have been achieved through methods such as the propagation of naturally occurring mutants, inter-generic crossings, and the induction of mutations using gamma-ray irradiation6,7,8,9. The complexities of breeding H. syriacus highlight the importance of elucidating the phylogenetic relationships among its cultivars to establish a breeding system capable of generating F1 hybrids.

Given the challenges in H. syriacus breeding, the utilization of chloroplast genomes represents a strategic approach due to their unique features. Chloroplasts are typically maternally inherited, although in some gymnosperms inheritance is paternal. Because chloroplast genomes are non-recombinant and usually inherited in a uniparental manner, they allow lineage tracing through the maternal line and minimize the uncertainties associated with biparentally inherited nuclear genomes10,11,12,13. Furthermore, the high conservation of the chloroplast genome, including gene repertoires and structures, enables comparative analyses that offer clear insights into the evolutionary trajectories and phylogenetic relationships among cultivars14,15,16. Previous studies on Atractylodes species and Panax ginseng demonstrated that even with low divergence, unique polymorphic chloroplast-derived markers could be developed to distinguish inter- and intra-species differences, respectively11,17,18,19,20,21,22. This highlights the potential of chloroplast genomes for developing highly specific molecular markers, even at the intra-species level, thereby overcoming challenges posed by minimal genetic divergence. Nevertheless, most studies on Hibiscus chloroplast genomes have focused on the genus level, leaving in-depth intra-species studies relatively unexplored10,23,24,25. Given the breeding challenges of H. syriacus outlined earlier, comparative studies at the intra-species level are essential. Developing more molecular markers at the intra-species level is needed to gain insights into the evolutionary trajectory and to contribute to the precise taxonomic classification of H. syriacus26,27,28.

In this study, we generated 94 H. syriacus chloroplast genomes using a short-read sequencing platform (Illumina) and 1 genome using a long-read sequencing platform (Oxford Nanopore Technologies). Subsequent pangenome analysis of these 95 H. syriacus chloroplast genomes revealed a high degree of conservation in the majority of genome sequences, while also identifying unique cultivar-specific variant patterns. A total of 193 single-nucleotide polymorphisms (SNPs) and 61 insertions or deletions (Indels) were identified, highlighting their potential applications as intra-species molecular markers29. The development of molecular markers utilizing these regions will support precise classification among H. syriacus cultivars and the establishment of refined breeding strategies. Moreover, these results will offer insights for species conservation, biodiversity enhancement, and the exploration of the agricultural and ornamental potential of H. syriacus.

Methods

Plant materials and sequencing

H. syriacus cv. Gangneung was used for long-read-based chloroplast genome assembly30. A core collection of H. syriacus from the National Institute of Forest Science was utilized for short-read-based chloroplast genome assembly. Genomic DNA was extracted from fresh leaf tissues of H. syriacus plants using the standard cetyltrimethylammonium bromide method31.

The quantity and quality of genomic DNA were assessed using a Nanodrop spectrophotometer with a quality cut-off at an OD260/280 ratio of 1.8–2.0 and a Qubit dsDNA HS Assay Kit (Thermo Fisher Scientific, Massachusetts, USA). Following quality assessment, the DNA was used to generate libraries with an average insert size of 550 bp. Paired-end sequencing was performed to obtain 150-bp sequences at both ends using an Illumina NovaSeq 6000 platform (Illumina Inc., San Diego, CA, USA).

Genome assembly and annotation

For long-read assembly, the generated reads30 were aligned to a reference chloroplast sequence obtained from prior research32, using minimap2 (v2.22) with default parameters33. Reads with a mapping coverage exceeding 80 were extracted using Seqtk (v1.3; https://github.com/lh3/seqtk). These extracted reads were then assembled into a pseudo-molecule using Flye (v2.9)34 and subsequently polished using NextPolish (v1.4)35 to correct base errors arising from noisy long reads.
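The read-selection step can be illustrated with a minimal sketch. It assumes that the minimap2 alignments are available in PAF format and that "mapping coverage" means the percentage of each read's length covered by its best alignment; both are assumptions, as is the function name reads_above_coverage, which is hypothetical and not part of any tool named above.

```python
from collections import defaultdict


def reads_above_coverage(paf_lines, min_cov_pct=80.0):
    """Return sorted names of reads whose best alignment covers more than
    min_cov_pct percent of the read length (assumed meaning of coverage)."""
    best = defaultdict(float)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip lines that lack the 12 mandatory PAF columns
        name = fields[0]
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        if qlen == 0:
            continue
        cov = 100.0 * (qend - qstart) / qlen
        best[name] = max(best[name], cov)
    return sorted(name for name, cov in best.items() if cov > min_cov_pct)


# Toy PAF records with invented values, for illustration only.
toy_paf = [
    "read1\t1000\t50\t950\t+\tcp\t161000\t100\t1000\t880\t900\t60",
    "read2\t2000\t0\t500\t-\tcp\t161000\t5000\t5500\t480\t500\t60",
]
selected = reads_above_coverage(toy_paf)
```

The resulting list of names could then be written to a file and passed to a read-extraction tool such as seqtk subseq.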

For short-read assembly, Trimmomatic (v0.39)36 was used to trim adapters and eliminate low-quality sequences from the raw reads. The trimmed reads were then aligned to the reference chloroplast genomes obtained from prior studies37,38,39,40,41,42,43,44, using the Burrows–Wheeler alignment (v0.7.17) tool45 (Table 1). The mapped reads were assembled using NOVOPlasty (v4.3.1)46, with a k-mer size of 39 and default RUBP sequences as seeds for chloroplast assembly47,48. The contigs generated by NOVOPlasty were ordered and merged into a single pseudo-molecule according to the reference chloroplast genome sequence.

Table 1 Reference chloroplast genomes for mapping.

Genome annotation was performed using the GeSeq platform, which provides rapid and accurate annotation of organellar genomes49. We employed BLAT50, Chloë (v0.1.0), and HMMER51 to annotate coding sequences and rRNA, and ARAGORN (v1.2.38)52 and tRNAscan-SE (v2.0.7)53 to annotate tRNA. Annotation accuracy was validated against H. syriacus var. Baekdansim30, and any discrepancies were manually curated. The circular map representation of the chloroplast genome was generated using OGDRAW (v1.3.1)54 (Fig. 1).

Fig. 1

Circular map of the chloroplast genome in H. syriacus var. Gangneung. The center of the plot displays the cultivar name and genome length. The inner grey circle represents the GC content proportion in each region, with the line representing 50%. Genes located outside the outer circle are transcribed counterclockwise, and those inside the circle are transcribed clockwise. Genes with different functional annotations are differentiated by color.

Chloroplast genome alignment and pan-chloroplast genome-graph construction

To validate the genome assembly, we employed the chloroplast genome of H. syriacus var. Gangneung, constructed using long-read sequencing, as a reference for multiple sequence alignment. Sequence alignment was performed using MAFFT55 with default parameters. Subsequently, pairwise alignments of the chloroplast genomes were generated using MUMmer456.

To construct a pan-chloroplast genome-graph encompassing 95 H. syriacus genomes, we utilized the Minigraph-Cactus Pangenome Pipeline (v2.6.8)57. The integration process involved the iterative addition of the remaining 94 genomes to the reference chloroplast genome. Precise base-level alignments were achieved with the Cactus-pangenome tool using the parameters “--giraffe --fa --bz --viz”. From this graph, mapping was performed with the Cactus-graphmap (v2.6.8) tool using the default parameters. We identified a total of 193 SNPs and 61 Indels across the whole genomes, which offer significant potential for the future development of intra-species molecular markers29. Overall, H. syriacus cultivars exhibit high similarity across all genomic regions. However, for H. syriacus var. Russian Violet, a notable drop in similarity was observed in the region spanning 59,000 bp to 62,000 bp (Fig. 2).
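As an illustration of how variants can be tallied into SNPs and Indels, the following minimal sketch classifies REF/ALT allele pairs by length. It assumes the variants have been exported as allele pairs (for example from a VCF file); the export step is not described in the text, and the function names are hypothetical. Multi-nucleotide substitutions of equal length are reported separately as "Other".

```python
def classify_variant(ref, alt):
    """Classify one REF/ALT pair by allele length."""
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    if len(ref) != len(alt):
        return "Indel"
    return "Other"  # equal-length multi-base substitutions or identical alleles


def count_variants(records):
    """Count variant classes; ALT may hold several comma-separated alleles."""
    counts = {"SNP": 0, "Indel": 0, "Other": 0}
    for ref, alts in records:
        for alt in alts.split(","):
            counts[classify_variant(ref.upper(), alt.upper())] += 1
    return counts


# Toy records with invented alleles, for illustration only.
toy_records = [("A", "G"), ("AT", "A"), ("C", "CTT"), ("G", "T,GA")]
toy_counts = count_variants(toy_records)
```

Counting each alternate allele separately is one possible convention; counting per site would give different totals at multi-allelic positions.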

Fig. 2

Pan-chloroplast genome-graph for 95 H. syriacus cultivars. (a) The pan-chloroplast genome-graph represents all 95 H. syriacus cultivars at the scale of the total chloroplast genome. (b) An enlarged view of the pan-chloroplast genome-graph highlighting the region of largest variation, identified in H. syriacus var. Russian Violet and indicated by red bars. (c) Multiple sequence alignment of the largest variation site among the 95 H. syriacus varieties.

Comparative genomic analysis in 95 H. syriacus chloroplast genomes

Structural similarity and gene distribution among the 95 chloroplast genomes were analyzed using mVISTA software in LAGAN mode with the default settings, with H. syriacus var. Baekdansim used as the reference58,59,60,61 (Fig. 3). The mVISTA results were consistent with those from the pan-chloroplast genome analysis, in which H. syriacus var. Russian Violet exhibits a significant deletion in specific regions.

Fig. 3

mVISTA map of the 95 H. syriacus accessions, with the Gangneung chloroplast genome as the reference. The vertical scale represents the percentage of identity, ranging from 50% to 100%. The horizontal axis corresponds to the base sequence region. Red indicates conserved non-coding sequences (CNS), blue indicates the exons of protein-coding genes, and light green indicates untranslated regions (UTR), including tRNA or rRNA.

Hypervariable regions within the chloroplast genome of H. syriacus were identified using DnaSP version 6 software62. A total of 95 H. syriacus chloroplast genomes were aligned using MAFFT55 with default parameters. Nucleotide diversity was calculated through sliding window analysis, with a window size of 600 bp and a step size of 100 bp22 (Fig. 4). The inverted repeat regions tend to be more conserved than the single copy regions. The highest nucleotide diversity was identified in the trnS-psbZ region. This region has the potential for use as a DNA barcode to facilitate distinction among the H. syriacus cultivars.
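The sliding window calculation can be sketched as follows. This is a minimal illustration, not the DnaSP implementation: it assumes pre-aligned sequences of equal length, computes Pi as the mean number of pairwise differences per analysed site, and skips any alignment column containing a gap or ambiguous base. DnaSP's handling of such columns may differ, and the function names are hypothetical.

```python
from collections import Counter


def column_pairwise_diffs(column):
    """Number of sequence pairs that differ at one alignment column."""
    n = len(column)
    counts = Counter(column)
    return (n * n - sum(c * c for c in counts.values())) // 2


def sliding_pi(seqs, window=600, step=100):
    """Return (window midpoint, Pi) tuples for an alignment."""
    n = len(seqs)
    if n < 2 or any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("need at least two aligned sequences of equal length")
    n_pairs = n * (n - 1) // 2
    results = []
    for start in range(0, len(seqs[0]) - window + 1, step):
        diffs = 0
        sites = 0
        for pos in range(start, start + window):
            column = [s[pos].upper() for s in seqs]
            if any(base not in "ACGT" for base in column):
                continue  # assumed policy: drop columns with gaps or ambiguity
            sites += 1
            diffs += column_pairwise_diffs(column)
        pi = diffs / (n_pairs * sites) if sites else 0.0
        results.append((start + window // 2, pi))
    return results


# Toy alignment with invented sequences; a small window is used for illustration.
toy_seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGT"]
toy_pi = sliding_pi(toy_seqs, window=4, step=2)
```

For the 95-genome alignment, the defaults of window=600 and step=100 match the settings stated above; windows with the highest Pi would correspond to candidate hypervariable regions.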

Fig. 4

Nucleotide diversity in 95 H. syriacus chloroplast genomes. Sliding window analysis was performed with a window length of 600 bp and a step size of 100 bp. The x-axis represents nucleotide position, while the y-axis represents nucleotide diversity (Pi). Genes within the most hypervariable regions are highlighted in red.

Data Records

Raw read sets for 94 cultivars obtained through Illumina sequencing have been deposited in the NCBI Sequence Read Archive under the accession number SRP46454163. The assembled chloroplast genome sequences for the 94 cultivars, accompanied by their corresponding gene annotations, have been submitted to NCBI GenBank64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157 and are detailed in Table S1. Additionally, the chloroplast genome of H. syriacus var. Gangneung has been deposited in NCBI GenBank with the accession number OR619829158.

Technical Validation

Evaluation of chloroplast genome assembly

To evaluate the completeness of the chloroplast genome assembly, chloroplast reads were aligned to the chloroplast genome as described in the “Genome assembly and annotation” section. The lengths of the 95 assembled pseudo-molecules ranged from 160,231 bp to 161,041 bp, which is consistent with the chloroplast genome lengths observed in other members of the Malvaceae family23,24,25,26,27. Synteny analyses were conducted using MUMmer159 with the previously reported chloroplast genome of H. syriacus var. Baekdansim as the reference30. The dot plot revealed that the assembled genomes align cohesively, with no major rearrangements observed (Fig. 5). The only inversions displayed in the plot, represented by blue lines, correspond to the chloroplast-specific inverted region.
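The distinction between collinear and inverted matches in such a dot plot can be sketched as follows. The sketch assumes match coordinates in the style of MUMmer's show-coords output, where a reverse-strand match is reported with its query start greater than its query end; that convention, the function name, and the toy coordinates are assumptions for illustration and are not taken from the study.

```python
def classify_matches(matches):
    """Split (ref_start, ref_end, qry_start, qry_end) tuples into
    forward (collinear) and inverted matches by relative orientation."""
    forward, inverted = [], []
    for ref_start, ref_end, qry_start, qry_end in matches:
        same_direction = (ref_end >= ref_start) == (qry_end >= qry_start)
        target = forward if same_direction else inverted
        target.append((ref_start, ref_end, qry_start, qry_end))
    return forward, inverted


# Invented coordinates: one collinear block and one inverted block.
toy_matches = [
    (1, 89000, 1, 89000),
    (89001, 115000, 161000, 135001),
]
toy_forward, toy_inverted = classify_matches(toy_matches)
```

In an assembly without rearrangements, the inverted list would be expected to contain only matches attributable to the inverted repeat, in line with the blue lines described for Fig. 5.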

Fig. 5

Pairwise comparative analysis of chloroplast genomes in various H. syriacus cultivars with H. syriacus var. Baekdansim using MUMmer plots. (a) Comparison of chloroplast genomes constructed using ONT and PacBio long-read sequencing platforms. (b) Comparison of chloroplast genomes constructed using Illumina short-read and PacBio long-read sequencing platforms. The red lines represent collinear sequences and the blue lines represent inverted sequences.

Evaluation of gene annotation

The accuracy of the gene annotations was evaluated by comparing them to the H. syriacus var. Baekdansim61 chloroplast genome. Any discrepancies identified were resolved through manual curation. In total, 113 distinct genes were identified, including 79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes (Table 2).

Table 2 Genes annotated in the chloroplast genome.

The gene repertoires were consistent across all 95 cultivars; the only observed differences concerned details of specific gene loci. Our results indicate that the gene repertoire was congruent with annotations commonly observed within the Malvaceae family23,160,161, with minor variations detected in the pafI (ycf3), pafII (ycf4), and pbf1 (psbN) genes162,163.