Data characterizing the chloroplast genomes of extinct and endangered Hawaiian endemic mints (Lamiaceae) and their close relatives

These data are presented in support of a plastid phylogenomic analysis of the recent radiation of the Hawaiian endemic mints (Lamiaceae), and their close relatives in the genus Stachys, “The quest to resolve recent radiations: Plastid phylogenomics of extinct and endangered Hawaiian endemic mints (Lamiaceae)” [1]. Here we describe the chloroplast genome sequences for 12 mint taxa. Data presented include summaries of gene content and length for these taxa, structural comparison of the mint chloroplast genomes with published sequences from other species in the order Lamiales, and comparisons of variability among three Hawaiian taxa vs. three outgroup taxa. Finally, we provide a list of 108 primer pairs targeting the most variable regions within this group and designed specifically for amplification of DNA extracted from degraded herbarium material.


Specifications
Subject area: Biology, genetics, genomics
More specific subject area: Molecular phylogenetics and evolution
Type of data: Tables and figures
How data was acquired: High-throughput sequencing of contemporary and herbarium samples was conducted on the Illumina HiSeq 2500 and MiSeq platforms, followed by both de novo and reference-guided assemblies, mapping, and functional annotation
Data format: Raw and analyzed
Experimental factors: De novo assemblies were created using SOAPdenovo and reference-guided assemblies were created using YASRA. Sequences were functionally annotated using DOGMA. SNPs were called and filtered using SAMtools and BCFtools
Experimental features: Data include chloroplast genome gene content, structure, and comparisons of variable loci in a suite of recently diverged species and outgroups
Data source location: Hawaii, North America, South America, Europe, Africa, and Asia
Data accessibility: Data are published with this article

Value of the data
These data provide a summary of the characteristics and structure of the chloroplast genomes of several taxa within Lamiaceae, which can be used to increase our understanding of molecular evolution of the chloroplast genome, as well as the evolution of its structure and function.
A comparison of variable regions in mints can be used to identify rapidly evolving regions in other taxa.
Primer sequences described here can be used to target highly variable regions in closely related taxa.

Data
Raw, demultiplexed sequence reads have been deposited in the NCBI Sequence Read Archive (SRP070171), and full chloroplast genomes for 12 mint taxa have been deposited in GenBank (KU724130-KU724141). Data presented in the text include tables and figures giving information on gene content and variability in these 12 species, as well as a comparison of genome structure with other members of the order Lamiales.

Samples, library construction, and shotgun sequencing
We selected 12 Hawaiian mint taxa for shotgun sequencing (five contemporary and seven from herbarium collections ranging up to ~100 years old), of which two extinct species were represented by two accessions each (see Tables 1 and 2 in [1]). We also sequenced four Stachys species, representing both close and more distantly related relatives.
DNA extraction, library construction, and shotgun sequencing followed the methods described in [1]. Briefly, approximately 100 mg dried leaf tissue was homogenized using the TissueLyser system (Qiagen), and DNA was extracted using the DNeasy plant mini kit (Qiagen). DNA isolated from herbarium samples was processed separately from contemporary samples using stringent protocols and controls to prevent and detect any potential contamination. For contemporary samples, DNA extracts were sheared to 200-600 bp via sonication in a Covaris S220; DNA from herbarium samples is naturally degraded and therefore was not sheared further. Genomic shotgun sequencing libraries were constructed following the standard Illumina TruSeq protocol for contemporary samples, or the NEBNext Library Prep Mastermix kit (New England Biolabs) for herbarium samples. Libraries were quantified using the PicoGreen High Sensitivity assay and then pooled and sequenced on the Illumina HiSeq and MiSeq platforms. Adapter sequences were trimmed from the reads using the AdapterRemoval software [2]. Assessment of DNA damage in old herbarium specimens was conducted using mapDamage 2.0 [3]. The presence of misincorporations characteristic of damaged DNA molecules typically found in old and degraded samples suggests that the data from herbarium samples are authentic; however, the overall levels of damage were low and within the range expected based on the age of the specimens (see Supplementary Figs. 2 and 3 in [1]).

Table 1
Gene content of the chloroplast genome of Stenogyne haliakalae and 11 additional mint species.

Table 2
Genes containing introns in the chloroplast genomes of Stenogyne haliakalae and 11 additional mint species. Numbers represent the lengths (bp) of exons and introns in S. haliakalae.

Gene      Location  # Introns  Exon I  Intron I  Exon II  Intron II  Exon III
atpF      LSC       1          143     656       410
clpP      LSC       2          70      658       291      616        227
ndhA      SSC       1          552     1020      538
ndhB      IR        1          755     680       776
petB      LSC       1          5       718       650
petD      LSC       1          7       728       474
rpl16     LSC       1          8       907       392
rpl2      IR        1          390     658       433
rpoC1     LSC       1          434     736       1634
rps12     LSC/IR a  2          113     a         231      537        25
rps16     LSC       1          39      875       226
trnA-UGC  IR        1          37      807       34
trnG-UCC  LSC       1          22      690       47
trnI-GAU  IR        1          34      938       36
trnK-UUU  LSC       1          36      2509      34
trnL-UAA  LSC       1          36      488       49
trnV-UAC  LSC       1          37      579       36
ycf3      LSC       2          123     713       229      725        152

a Trans-spliced

Table 3
Lengths (bp) of the long single copy region (LSC), short single copy region (SSC), and inverted repeats regions (IR) for the chloroplast genomes of Stenogyne haliakalae and 11 additional mint species.

…, both members of the Orobanchaceae, are not considered here because they are parasitic and lack chlorophyll, thus demonstrating largely reduced chloroplast genomes. Blocks with the same color represent homologous regions free of internal structural changes for that subset of taxa, and those above the centerline for each taxon are in the same orientation as in Stenogyne haliakalae, whereas those below the line are in the reverse direction. Within each block a similarity profile for the region is plotted. Areas outside of blocks are presumed to represent lineage-specific regions of the chloroplast genome. One copy of the inverted repeat has been trimmed so that homology of the remaining repeat (area shaded in light gray) can be shown.

Assembly of the Hawaiian mint reference chloroplast genome
Because no chloroplast genome sequence from a closely related taxon was available at the time this study was conducted, we implemented a combined reference-guided and de novo assembly approach [4] to determine the first complete chloroplast genome sequence for a Hawaiian mint. We assembled the sequence for Stenogyne haliakalae, an extinct species, as it had the largest number of reads. Briefly, the approach involved conducting both reference-guided assembly in YASRA 2.32 [5] with olive (Olea europaea, NC_013707; [6]) as the reference, as well as de novo assembly in SOAPdenovo v1.05 [7]. Assembly methods are described in more detail in [1]. The resulting contigs from both approaches were split into overlapping sequences and then used as input for a further reference-guided assembly step in YASRA. Gaps between the final contigs were closed using PCR (see [1] for PCR reaction conditions and Supplementary Table 1 of this paper for primer information) and Sanger sequencing in both directions from high-quality DNA extracted from a contemporary sample of Stenogyne bifida. This ensured that amplification could be carried out over potentially large gaps, which would not be possible with degraded DNA from the extinct Stenogyne haliakalae. Contigs and Sanger sequences were aligned in Sequencher 4.7 (Gene Codes) to create a pseudo-reference sequence [4]. Reads from Stenogyne haliakalae were then mapped to the pseudo-reference using BWA v. 0.6.2 [8].
The reference sequence was further refined through Sanger sequencing of areas with low coverage or poor mapping quality (e.g., the border between the inverted repeat and single copy region). Reads were mapped to the final sequence, PCR duplicates were flagged and removed with the MarkDuplicates tool of the Picard command line toolset (http://picard.sourceforge.net/index.shtml), and a consensus sequence was called using SAMtools [9].

Fig. 2. Conservation among 11 complete mint chloroplast genomes. The sequence for Haplostachys linearifolia was excluded due to missing data. A physical map is given at the top to show gene content and organization (see Fig. 2 in [1] for gene names and products). In the lower panels, regions of the genome are represented by bars, and those that are conserved among all 11 species are colored mauve, whereas those that are conserved among subsets of the taxa have different colors. The height of the bar shows the degree of similarity.
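The step in which contigs are split into overlapping sequences before the second YASRA run can be illustrated with a minimal Python sketch. The fragment length and overlap used in the study are not given here, so the `size` and `overlap` defaults, as well as the function name `split_into_overlapping`, are illustrative assumptions rather than the authors' settings.

```python
from typing import List


def split_into_overlapping(contig: str, size: int = 500, overlap: int = 250) -> List[str]:
    """Split one contig into fragments of up to `size` bases.

    Consecutive fragments share `overlap` bases. Both values are
    hypothetical defaults chosen only for illustration.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    fragments = []
    for start in range(0, len(contig), step):
        fragments.append(contig[start:start + size])
        if start + size >= len(contig):
            break  # this fragment already reaches the end of the contig
    return fragments


if __name__ == "__main__":
    # Toy contig: 10 bases, 4-base fragments overlapping by 2 bases.
    print(split_into_overlapping("ACGTACGTAC", size=4, overlap=2))
```

Each fragment after the first begins `size - overlap` bases downstream of the previous one, so neighboring fragments share sequence that a later assembly step can use to rejoin them.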

Assembly of additional mint chloroplast genomes
Complete or nearly complete chloroplast genomes were assembled using similar methods for 11 additional taxa: seven from the endemic Hawaiian mints (two of which were from herbarium samples) and four Stachys outgroups (see Tables 1 and 2 in [1]). Since the Hawaiian mints have diverged recently, we used the new chloroplast genome sequence from Stenogyne haliakalae as the reference during reference-guided assembly for the remaining Hawaiian taxa. The resulting contigs were aligned to create an interim sequence, and then the reads were mapped to this using BWA and a final consensus sequence called using SAMtools.
Chloroplast genome sequences for the Stachys outgroup taxa were assembled in a similar manner. We first assembled the chloroplast genome sequence for Stachys chamissonis, as this species is most closely related to the Hawaiian lineage. We conducted independent YASRA runs using Olea europaea as the reference, in addition to newly available sequences from Stenogyne haliakalae, Sesamum indicum (NC_016433) [10], Origanum vulgare (JX880022) [11], and Salvia miltiorrhiza (NC_020431) [12]. The contigs from all five runs were aligned to create an interim sequence. The reads were then mapped to the interim sequence using BWA, and a consensus was called using SAMtools. Once the Stachys chamissonis sequence was assembled, we used it as the reference in YASRA for reference-guided assembly of both Stachys coccinea and Stachys sylvatica. For Stachys byzantina, the most distantly related outgroup, we performed the initial reference-guided assembly using the sequence from Stachys chamissonis, as well as Olea europaea and Sesamum indicum. The rest of the assembly proceeded as described for the other Stachys species.

Gene content and structure of mint chloroplast genomes
The Stenogyne haliakalae reference sequence and sequences from additional species were annotated using a combination of DOGMA [13], tRNAscan-SE [14], and additional manual BLAST searches. The borders of the inverted repeats were identified with the program Inverted Repeats Finder [15].
Overall the chloroplast genome sequences assembled here are very similar to other Lamiales. Table 1 shows the gene content of the Stenogyne haliakalae chloroplast genome, which mirrors the gene content of the chloroplast genomes for the 11 additional mint taxa, including the Stachys outgroups. Table 2 lists the genes that contain introns (and the number of introns present) in the mint taxa investigated here. Table 3 gives the lengths of the full chloroplast genome for each sequence assembled, as well as the lengths of the inverted repeats and single copy regions.
To compare the genome structure of all of the Hawaiian and Stachys mints to each other, we conducted a separate analysis in Mauve (Fig. 2). The sequences were assumed to be collinear. The seed weight was set to 7, the gap opening penalty was set to −200, and the gap extension penalty to −30.

Variability in the chloroplast genome sequences of Hawaiian mints and Stachys outgroups
We investigated variability among the three highest quality chloroplast genome sequences of the Hawaiian mints (Stenogyne haliakalae, Stenogyne bifida, Haplostachys haplostachya), and the three highest quality sequences from the Stachys outgroups (Stachys chamissonis, Stachys coccinea, and Stachys sylvatica). These species represent all of the main lineages within our samples and were sequenced at >10× depth. We used BWA to map the reads for each of these species onto the Stenogyne haliakalae reference genome and used SAMtools and BCFtools to call SNPs with a SNP quality score >30. Annotations of the S. haliakalae reference genome were transferred to the locations of the SNPs. We compared the levels of chloroplast genome diversity among the six genomes by identifying unique, variable positions, which we refer to as potentially informative characters (PICs) [17,18]. We did not include indels and inversions in this definition, although these have been included in other analyses of chloroplast genome variability.
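As a rough illustration of how unique variable positions could be collected from per-taxon SNP calls, the following Python sketch reads VCF-formatted text, keeps single-base substitutions with QUAL above 30, and returns the union of their positions. It is a simplified stand-in, not the SAMtools/BCFtools procedure itself: the function name `pic_positions`, the input layout (one VCF string per taxon), and the reliance on an `INDEL` flag in the INFO column are assumptions made for this example.

```python
from typing import Dict, Set


def pic_positions(vcf_text_by_taxon: Dict[str, str], min_qual: float = 30.0) -> Set[int]:
    """Return the set of reference positions carrying a passing SNP in any taxon."""
    positions: Set[int] = set()
    for vcf_text in vcf_text_by_taxon.values():
        for line in vcf_text.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and VCF header lines
            fields = line.split("\t")
            # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
            pos, ref, alt, qual, info = int(fields[1]), fields[3], fields[4], fields[5], fields[7]
            if alt == "." or qual == ".":
                continue  # no alternative allele or no quality reported
            if float(qual) <= min_qual:
                continue  # keep only calls with quality strictly above the threshold
            is_indel = "INDEL" in info.split(";") or len(ref) != 1
            if is_indel or any(len(a) != 1 for a in alt.split(",")):
                continue  # indels are excluded from the PIC definition
            positions.add(pos)
    return positions


if __name__ == "__main__":
    # Toy records with invented positions and qualities.
    calls = {
        "taxon_A": "cp\t100\t.\tA\tG\t50\t.\tDP=20\ncp\t200\t.\tC\tT\t12\t.\tDP=15\n",
        "taxon_B": "cp\t100\t.\tA\tG\t80\t.\tDP=25\ncp\t300\t.\tAT\tA\t99\t.\tINDEL;DP=30\n",
    }
    print(sorted(pic_positions(calls)))
```

A position called in several taxa is counted once, which matches the idea of counting unique variable positions across the compared genomes.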
To analyze diversity among the chloroplast genomes, we compared the number of PICs present in 1000 bp non-overlapping sliding windows across the entire chloroplast genome sequences (Table 4, see also Fig. 5 in [1]). We also compared the number of PICs per locus for coding (Table 5, Fig. 3a), intron (Table 6, Fig. 3b), and intergenic spacer and pseudogene regions (Table 7, Fig. 3c). Because this approach does not take into account the length of the locus, very long loci appear to have more PICs than shorter loci. Therefore, we also divided the number of PICs by the total length of the region to give the percent PICs per locus (Tables 5-7, Fig. 4). However, very short regions may still appear to have an inflated percentage of PICs. To minimize this, we have excluded regions less than 100 bp in length, and for clarity we have also excluded regions that were conserved among all six taxa.
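The two summaries described above, PIC counts in 1000 bp non-overlapping windows and percent PICs per locus, amount to simple arithmetic that can be sketched in Python. The function names and the toy numbers are illustrative assumptions; positions are treated as 1-based coordinates on the reference genome.

```python
from typing import Iterable, List


def pics_per_window(positions: Iterable[int], genome_length: int, window: int = 1000) -> List[int]:
    """Count unique PIC positions in consecutive non-overlapping windows."""
    n_windows = (genome_length + window - 1) // window  # the last window may be shorter
    counts = [0] * n_windows
    for pos in set(positions):  # each position is counted once
        if 1 <= pos <= genome_length:
            counts[(pos - 1) // window] += 1
    return counts


def percent_pics(n_pics: int, locus_length: int) -> float:
    """Number of PICs divided by locus length, expressed as a percentage."""
    if locus_length <= 0:
        raise ValueError("locus_length must be positive")
    return 100.0 * n_pics / locus_length


if __name__ == "__main__":
    # Toy data: five calls (one duplicated) on a 2500 bp sequence.
    print(pics_per_window([1, 1000, 1001, 2500, 2500], genome_length=2500))
    # A single PIC in a 50 bp region gives a higher percentage than
    # five PICs in a 1000 bp region.
    print(percent_pics(1, 50), percent_pics(5, 1000))
```

The second print shows why a length cutoff is useful: one substitution in a very short region yields a percentage several times that of a longer region with more PICs.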
To identify the most variable regions of the mint chloroplast genome for targeted re-sequencing and high resolution phylogenetic analyses, reads from all 15 taxa subjected to shotgun sequencing (including the partial genomes from all of the historical samples, except Phyllostegia variabilis) were mapped to the Stenogyne haliakalae reference sequence using BWA. SNPs were called with SAMtools and BCFtools, filtering out those with SNP quality <30. We selected a total of 108 variable loci (see Fig. 2 in [1] for a diagram of locations) identified from single copy regions, including (1) all the regions that had a variant position among the Hawaiian mints (except where every individual had an alternative allele as compared to the reference sequence) and (2) additional regions that had variant positions among at least two of the Stachys species. We retrieved 100 bp of flanking sequence on either side of the SNPs from the reference genome and designed PCR primers using BatchPrimer3 [19], with further manual examination for quality control (e.g., to ensure that primer sequences did not fall into a gap for one of the other taxa). Sequences complementary to the Illumina sequencing adapters were appended to the end of each primer (Table 8) so that sequencing libraries could be prepared directly from the cleaned multiplex PCR products. Overall, these regions represent roughly 20,000 bp of sequence from the chloroplast genome and contain additional variable sites beyond the initial targeted SNP.

Table 8
Tailed primers used for multiplex amplification and targeted re-sequencing in mints. Sequences complementary to the Illumina sequencing adapters were appended to the end of each primer (Forward: 5' TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG-[locus specific sequence] and Reverse: 5' GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG-[locus specific sequence]). The loci within each region are also indicated. When the names of two genes are given with an underscore between them, the intergenic spacer between these genes is included in the amplified region.

             TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG AAA TTC GAC TCC GCA TTG TT
Mint23294R   GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GGC CCA ATG GAG AGA TAG TCG   69.7
Mint23536F   TCG TCG GCA GCG TCA GAT GTG TAT
             TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG AGG ATT TGA ACC CGT GAC CT   70.2
             TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG CCT TTT GGT GCA TAC GGT TC   70.1   clpP Intron2
Mint68800R   GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GCC ATC GTG ATT TGG
             TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TCT GCG GAT TAG TCG ACA TTT   69.3   rpl36
Mint76974R   GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GCC GAA ACA AGG ATT CGA AAG   68.9
Mint79374F   TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG ATT GCT TTC CGG TTC ATT TC   68.6   rpl16 Intron
Mint79374R   GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GCA AGA GCT TCG AGC CAA
             TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GGG ATT CGG CAA GTT GGT ATG   69.4
Mint107435F  TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TCG TAT TGG CGG ATT CAT AA   68.8   ndhF
Mint107435R  GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GGG GGT AAA GGG TAT TCC
             TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG TTT TGA
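The construction of a tailed primer from a locus-specific sequence can be sketched in Python as follows. The two adapter tails are copied from the Table 8 caption; the function names `tail_primer` and `grouped` and the input format are assumptions made for this example, and the sketch does not reproduce any of the design checks performed in BatchPrimer3.

```python
# Adapter tails as given in the Table 8 caption (5' to 3').
FORWARD_TAIL = "TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG"
REVERSE_TAIL = "GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG"


def tail_primer(locus_specific: str, direction: str) -> str:
    """Prefix a locus-specific primer with the forward ('F') or reverse ('R') tail."""
    seq = "".join(locus_specific.split()).upper()  # drop spaces used for grouping
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("locus-specific sequence must contain only A, C, G, T")
    if direction == "F":
        return FORWARD_TAIL + seq
    if direction == "R":
        return REVERSE_TAIL + seq
    raise ValueError("direction must be 'F' or 'R'")


def grouped(seq: str, size: int = 3) -> str:
    """Format a sequence in space-separated groups, as the primers are printed in Table 8."""
    return " ".join(seq[i:i + size] for i in range(0, len(seq), size))


if __name__ == "__main__":
    # Locus-specific part read off the Mint23294R row of Table 8.
    print(grouped(tail_primer("GCC CAA TGG AGA GAT AGT CG", "R")))
```

Because the tail sits at the 5' end of each primer, every amplicon from the multiplex PCR carries the adapter-complementary sequence at both ends, which is what allows libraries to be prepared directly from the cleaned products.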