Rooibos (Aspalathus linearis) Genome Size Estimation Using Flow Cytometry and K-Mer Analyses

Plant genomes provide information on biosynthetic pathways involved in the production of industrially relevant compounds. Genome size estimates are essential for the initiation of genome projects. The genome size of rooibos (Aspalathus linearis species complex) was estimated using DAPI flow cytometry and k-mer analyses. For flow cytometry, a suitable nuclei isolation buffer, plant tissue and a transport medium for rooibos ecotype samples collected from distant locations were identified. When using radicles from commercial rooibos seedlings, Woody Plant Buffer and Vicia faba as an internal standard, the flow cytometry-estimated genome size of rooibos was 1.24 ± 0.01 Gbp. The estimates for eight wild rooibos growth types did not deviate significantly from this value. K-mer analysis was performed using Illumina paired-end sequencing data from one commercial rooibos genotype. For biocomputational estimation of the genome size, four k-mer analysis methods were investigated: A standard formula and three popular programs (BBNorm, GenomeScope, and FindGSE). GenomeScope estimates were strongly affected by parameter settings, specifically CovMax. When using the complete k-mer frequency histogram (up to 9 × 105), the programs did not deviate significantly, estimating an average rooibos genome size of 1.03 ± 0.04 Gbp. Differences between the flow cytometry and biocomputational estimates are discussed.

Subsequently, WPB was used to isolate and stain nuclei from radicles and cotyledons, as well as fresh and dried leaves from two-months-old commercial rooibos seedlings ( Figure 2). In this test, higher debris backgrounds were observed. Nonetheless, all freshly harvested plant tissues generated high resolution histograms with distinct peaks (Figure 2A-C), which allowed genome size estimations using Vicia faba (2C = 26.66 pg) as an internal reference standard. Plant tissue type significantly affected the estimation of the 2C DNA contents. With 2.54 pg, radicles had significantly lower values (p < 0.05) than cotyledons (2.64 pg) and leaves (2.69 pg). We also noted that the 4C peak was more prominent in the radicle histograms than in the cotyledon and leaf histograms. The calculated 1C genome sizes for these tissues equaled 1.24 ± 0.01 Gbp, 1.29 ± 0.02 Gbp, and 1.31 ± 0.02 Gbp, respectively.  Subsequently, WPB was used to isolate and stain nuclei from radicles and cotyledons, as well as fresh and dried leaves from two-months-old commercial rooibos seedlings ( Figure 2). In this test, higher debris backgrounds were observed. Nonetheless, all freshly harvested plant tissues generated high resolution histograms with distinct peaks (Figure 2A-C), which allowed genome size estimations using Vicia faba (2C = 26.66 pg) as an internal reference standard. Plant tissue type significantly affected the estimation of the 2C DNA contents. With 2.54 pg, radicles had significantly lower values (p < 0.05) than cotyledons (2.64 pg) and leaves (2.69 pg). We also noted that the 4C peak was more prominent in the radicle histograms than in the cotyledon and leaf histograms. The calculated 1C genome sizes for these tissues equaled 1.24 ± 0.01 Gbp, 1.29 ± 0.02 Gbp, and 1.31 ± 0.02 Gbp, respectively.
For best results, flow cytometry analyses are typically conducted on fresh plant material. When samples are collected in remote regions, a transport medium has to be identified that would preserve the plant material (in our case for 4-5 days, the typical duration of the sample trips). Four preservation media were tested: sterilized water, 5% glycerol solution, 10% glycerol solution and silica-gel crystals. All samples maintained in liquid media turned brown within hours of collection. Preliminary tests had shown that brown rooibos plant tissues did not yield peaks in the flow cytometry histograms, and flow cytometry analyses failed for most samples from these media. Silica-gel crystals kept plant tissues green for the entire specified period. The 2C peaks in the histograms for silica-dried leaves were smaller and broader than those obtained for fresh plant material, but were still suitable for genome size estimation ( Figure 2D). With 1.26 ± 0.05 Gbp, the average genome size estimate for dried rooibos leaves was comparable to the estimates obtained with fresh plant material. Silica gel was therefore used to preserve plant samples collected from commercial and wild rooibos ecotypes at distant sampling locations. estimations using Vicia faba (2C = 26.66 pg) as an internal reference standard. Plant tissue type significantly affected the estimation of the 2C DNA contents. With 2.54 pg, radicles had significantly lower values (p < 0.05) than cotyledons (2.64 pg) and leaves (2.69 pg). We also noted that the 4C peak was more prominent in the radicle histograms than in the cotyledon and leaf histograms. The calculated 1C genome sizes for these tissues equaled 1.24 ± 0.01 Gbp, 1.29 ± 0.02 Gbp, and 1.31 ± 0.02 Gbp, respectively. For best results, flow cytometry analyses are typically conducted on fresh plant material. When samples are collected in remote regions, a transport medium has to be identified that would preserve the plant material (in our case for 4-5 days, the typical duration of the sample trips). Four

Flow Cytometry-Analysis of Field Plants
A total of 82 commercial and wild A. linearis plants representing nine rooibos growth types were randomly sampled from 17 populations located in the broader Clanwilliam, Wupperthal, and Nieuwoudtville regions ( Figure S1). To verify morphology-based taxonomic classification of the rooibos field plants, we sequenced the ITS region and aligned these sequences to the 15 ITS sequences from 14 Aspalathus species currently available at NCBI (including 2 sequences for A. linearis). The alignment, showing the results for two plants per growth type, comprised 711 nucleotides ( Figure  S2). Our results showed that (1) all ITS sequences from our dataset were identical, with highest similarity to previously published A. linearis ITS sequences (99.9% sequence identity to EU347722); (2) the single nucleotide polymorphism in position 567 of the A. linearis sequence EU347722 and the polymorphic site in position 109-117 of the A. linearis sequence AJ744951 are not conserved in the A. linearis species complex; and (3) three conserved single nucleotide polymorphisms in the sequenced ITS region (positions 537, 695, and 697) distinguish the A. linearis and A. pendula species from the other Aspalathus species. These results are in line with previous phylogenetic analyses that place A. linearis and A. pendula into a distinct clade within the Aspalathus genus [21]. Therefore, our ITS analyses supported the morphology-based taxonomic classification of the sampled plants into the A. linearis species complex.
Flow cytometry analyses of silica-dried leaf samples were successful for most plants, but failed for all five Black type plants, and the ten plants from one of the two sampled Grey sprouter populations. Figure 3 shows the flow cytometry genome size estimates for silica-dried leaf samples from rooibos plants collected in different locations. The dried leaf samples from commercial field plants (Red type, RC) showed highest variability in genome size estimates, ranging from 1.16 Gbp to 1.42 Gbp. Interestingly, the mean genome size estimate and the standard deviation for these samples (1.26 ± 0.08 Gbp) mirrored the results obtained for the dried leaf samples in the seedling test (1.26 ± 0.05 Gbp). The samples from two populations of wild-growing Red type plants, including one escaped commercial (RE) and one wild Red type (RW) population, had an average genome size estimate of 1.20 Gbp, showing low deviations from the sample means. The average genome size estimates for all other rooibos growth types ranged from 1.21 Gbp to 1.26 Gbp, i.e., did not differ substantially from the genome size estimates of the Red type plants.

Illumina Sequencing
The paired-end 300 bp insert Nextera library was sequenced using the MiSeq and the HiSeq 2500 Illumina sequencing platforms, resulting in 0.6 billion and 1.9 billion read pairs, respectively (Table 1). Average read length of the raw data after adapter removal was 119 bp. Quality filtering (trimming of bases below Phred score of 20, and removal of reads shorter than 50 bp) did not substantially reduce the dataset: 2.4 billion read pairs were retained and average read length was 120 bp. Based on the flow cytometry estimates of the rooibos genome size for radicle samples (1.24 ± 0.01 Gbp), the total genome coverage from the raw and quality processed data amounted to 245x and 234x, respectively. The MiSeq dataset yielded 58x genome coverage (both raw, and quality processed data).

K-Mer Analysis
For computational genome size estimation using k-mer analysis, we investigated four methods, including the most recent dedicated programs (BBNorm, GenomeScope, and FindGSE) and the oftenused formula for genome size calculations derived from equations introduced by the M.S. Waterman group [22,23]. For each program, we investigated the effects of (1) MiSeq sequence subset vs complete dataset (MiSeq and HiSeq data), (2) k-mer size, and (3) raw vs quality processed data. The results are provided in Table 2. Our analyses showed that the performance of GenomeSope (both v1 and v2) was strongly affected by parameter settings: the rooibos genome size estimates varied from 0.51 Gbp to 1.01 Gbp. The most influential parameter was the cutoff threshold for maximum k-mer coverage (CovMax). Increasing the threshold from 1k (default setting of GenomeScope v1) to 10k or 900k resulted in average increases of genome size estimates by 0.14 Gbp and 0.33 Gbp, respectively.

Illumina Sequencing
The paired-end 300 bp insert Nextera library was sequenced using the MiSeq and the HiSeq 2500 Illumina sequencing platforms, resulting in 0.6 billion and 1.9 billion read pairs, respectively (Table 1). Average read length of the raw data after adapter removal was 119 bp. Quality filtering (trimming of bases below Phred score of 20, and removal of reads shorter than 50 bp) did not substantially reduce the dataset: 2.4 billion read pairs were retained and average read length was 120 bp. Based on the flow cytometry estimates of the rooibos genome size for radicle samples (1.24 ± 0.01 Gbp), the total genome coverage from the raw and quality processed data amounted to 245x and 234x, respectively. The MiSeq dataset yielded 58x genome coverage (both raw, and quality processed data).

K-Mer Analysis
For computational genome size estimation using k-mer analysis, we investigated four methods, including the most recent dedicated programs (BBNorm, GenomeScope, and FindGSE) and the often-used formula for genome size calculations derived from equations introduced by the M.S. Waterman group [22,23]. For each program, we investigated the effects of (1) MiSeq sequence subset vs complete dataset (MiSeq and HiSeq data), (2) k-mer size, and (3) raw vs quality processed data. The results are provided in Table 2. Our analyses showed that the performance of GenomeSope (both v1 and v2) was strongly affected by parameter settings: the rooibos genome size estimates varied from 0.51 Gbp to 1.01 Gbp. The most influential parameter was the cutoff threshold for maximum k-mer coverage (CovMax). Increasing the threshold from 1k (default setting of GenomeScope v1) to 10k or 900k resulted in average increases of genome size estimates by 0.14 Gbp and 0.33 Gbp, respectively. Genome size estimates also increased with increasing k-mer size. The differences were higher at the lower CovMax settings, ranging from 0.17 Gbp at 1k, 0.11 Gbp at 10k, to 0.01 Gbp at 900k. For GenomeScope, the effects of using the MiSeq subsets vs complete datasets and raw vs quality processed data were small (<0.10 Gbp). FindGSE predicted a rooibos genome size of 1.06 ± 0.03 Gbp (averaged over all tested parameters). With this program, the differences between the MiSeq subset and corresponding values in the complete dataset were small (ranging from 0.01 Gb to 0.09 Gb). Increasing k-mer size only marginally increased genome size estimates (max by 0.04 Gbp), and differences between raw and quality processed datasets were also small (max 0.04 Gbp). BBNorm estimated a rooibos genome size of 1.08 ± 0.03 Gbp. The differences between the MiSeq subset and the complete dataset were minimal (varying from 0.00 to 0.06 Gbp, with higher estimates obtained for the complete dataset). An increase in k-mer size increased genome size estimates by only 0.05 Gbp. Differences between quality processed and raw datasets amounted to maximum 0.04 Gbp. When using the formula, the rooibos genome size estimate amounted to 1.03 ± 0.04 Gbp. The effects of dataset size, k-mer size and data quality were also small (at most 0.08 Gbp, 0.05 Gbp, and 0.04 Gbp, respectively). When using the complete histogram profile (max k-mer coverage 900,000×) and averaging the values across all four methods and parameter settings, the computational estimate of the rooibos genome size was 1.03 ± 0.05 Gbp.

Discussion
Rooibos produces a range of compounds that are known to interfere with flow cytometry analyses. It is rich in phenolic compounds (e.g., PPAG and aspalathin, [24]) that bind to the DNA, reducing fluorochrome accessibility and stoichiometric staining of the DNA [25,26]. Aspalathin alone can contribute up to 13.5% to the leaf dry weight (2%-6% on average), and total polyphenols can make up to 30% of the plant dry weight [27]. Buffer selection therefore focused on those previously found suitable for flow cytometry analyses of recalcitrant and/or woody plant species. The Partec buffer, though successfully used on woody and non-woody plant species rich in secondary metabolites, including coconut [28], honeybush [29], Lupinus sp. [30], and Brassica sp. [31], did not produce peaks with rooibos leaf samples. Here, high polyphenol concentrations in the rooibos leaf samples may have inhibited nuclei isolation. Addition of compounds such as PVP, which bind polyphenols, were found to be essential for successful application of the Partec buffer in flow cytometry analyses of aromatic herbs of the genus Zingiberacea [32] and chamomile [33]. The same may have been true for LB01, which failed or generated low resolution peaks with most rooibos samples. Woody Plant Buffer (WPB) had been developed specifically for flow cytometry analyses of woody plant species [34]. It contains both PVP and DTT (a reducing agent), which counteract the effects of staining inhibitors, and was found suitable for a wide variety of recalcitrant plant species [25,26,34]. When using rooibos leaves, it was the only buffer that consistently generated peaks of sufficient quality to estimate the rooibos genome size.
To verify, if plant tissue type affects peak quality and genome size estimates, WPB was also tested on rooibos radicles and cotyledons. Vicia faba was found to be a suitable reference standard for rooibos genome size estimation, because its large genome size (2C = 26.66 pg) ensured clear separation of the fluorescent signals originating from rooibos and reference standard nuclei, respectively. Best flow cytometry results were obtained for radicles, where CV% values were consistently below 5%. The prominent 4C peak observed in the radicle samples likely reflects higher cell division in the rapidly growing root tissues as compared to the cotyledon and leaf tissues [35,36]. Interestingly, the average genome size estimates for radicles (1.24 ± 0.01 Gbp) were marginally but consistently smaller than those for cotyledons and leaves (by 0.05 Gbp and 0.08 Gbp, respectively; both p < 5%). This may reflect higher amounts of polyphenols in the cotyledon and leaf tissues. Analyses with different plant species [37] showed that plant tissues richer in polyphenols yielded higher flow cytometry genome size estimates than plant tissues with low polyphenol contents. In rooibos, aspalathin concentrations have been shown to be higher in leaves than in stems [38], but comparative analyses for seedling tissues are outstanding.
Typically, fresh leaf material is the material of choice for flow cytometry analyses. For rooibos, once harvested, leaves turn red-brown in a very short period of time, which is mostly due to the degradation of aspalathin [39]. Preliminary flow cytometry tests showed that these tissues do not yield peaks in the histograms. For commercial rooibos plants, seeds can be purchased and seedling populations providing fresh plant material can be established at the research site. For wild rooibos plants seeds are not available commercially, and seed collection is exceptionally difficult: The black rooibos seeds are as small as ant eggs and difficult to distinguish from soil particles. Moreover, seed treatment (scarification) and germination methods would have to be adapted, since germination is likely to be affected by seed ripeness. To investigate genome sizes of the different rooibos growth types that grow in remote locations, a transport media had to be identified that would preserve plant tissues for at least four days (the typical duration of the sampling trips). Our analyses showed that for most rooibos plants silica gel crystals were suitable for this purpose: the plant tissue was still green and yielded distinct peaks in flow cytometry histograms. However, intensive chopping was necessary to obtain sufficient numbers of intact nuclei. Despite all efforts, flow cytometry analyses failed for the Black type plant samples and all 10 plant samples from one Grey sprouter population. These plants had very tough leaves and were difficult to chop, yielding insufficient numbers of intact nuclei. For these plants, further method adaptation is required. For silica-dried plant samples from commercial rooibos seedlings and field plants, we noted that the DNA content estimates were more variable (stdev 0.08 and 0.05 Gbp, respectively) than with any of the fresh rooibos samples (max stdev 0.02). This is likely due to differences in the chemical profiles of the plants (which grew in three distant locations and were therefore exposed to different environmental conditions) and the degree of polyphenol degradation in the silica-dried leaf material. The standard deviations for wild rooibos growth types where only one population was sampled, were smaller (0.02-0.04 Gbp). The mean genome size estimates for the wild rooibos growth types were within the range obtained for commercial rooibos samples. Therefore, the morphological differences between the rooibos growth types are not likely to be associated with variations in ploidy levels or significant genome expansion events.
For genome sequencing, the genome size estimate obtained for the radicle samples (1.24 ± 0.01 Gbp) was used, since these samples generated the best quality peaks, and the higher estimates for cotyledon and leaf samples are likely associated with higher polyphenol contents in the plant material. Subsequently, k-mer histograms generated from the Illumina sequencing data were used to estimate the rooibos genome size biocomputationally. GenomeScope was the only program that stood out, as genome size estimates were strongly affected by parameter settings, specifically k-mer coverage thresholds (CovMax). At lower CovMax thresholds, highly covered k-mers, mostly representing repeats, are excluded from the analyses, which may substantially reduce genome size estimates. In GenomeScope v1, the default for CovMax is set to 1k. This low default threshold may explain why, in studies with vanilla [40], cane toad [41], and pacific oyster [18,42], k-mer based GenomeScope estimates of genome sizes were only half of those obtained by flow cytometry analyses and considerably smaller than those obtained after genome assembly. For rooibos, GenomeScope estimates differed between 0.51 and 1.01 Gbp, indicating that with this program parameter settings must be adjusted, based on reasonable assumptions on the genome under investigation (repeat content, data preprocessing). When using the complete k-mer histogram (up to a k-mer coverage of 900,000×), the investigated programs and parameter settings did not differ substantially in genome size estimates, yielding an average genome size of 1.03 ± 0.05 Gbp. This value is approximately 0.2 Gbp below the flow cytometry value obtained for rooibos radicles. Similar discrepancies have been reported for European eel [43], and even for the model plant Arabidopsis thaliana [20] and were argued by the authors to be associated with the interference of chemical compounds in the stoichiometric DNA content measurements in flow cytometry analyses. In our study, the equipment related restrictions in the choice of the fluorescent stain (DAPI) may have also led to higher flow cytometry genome size estimates: DAPI predominantly binds to AT-rich regions in the DNA, and plant genomes are known to be riddled with AT-rich repeats. We therefore conclude that the k-mer based rooibos genome size estimate is closer to the true value.

Plant Material
Pre-treated commercial rooibos seeds (scarified and coated with pesticide) were provided by Rooibos Ltd. (Clanwilliam, Western Cape, South Africa). These seeds were collected in 2017 from commercial rooibos plants from diverse farms in the Cederberg Mountains, and therefore represent a mixture of commercial rooibos genotypes.
A study population of commercial rooibos plants was established at the University of the Western Cape as follows: First, 200 rooibos seeds were germinated on moist filter paper in glass Petri dishes at room temperature and ambient light for two days. Thereafter, 90 seedlings with a distinguishable radicle were transferred to pots (9 cm diameter) containing non-pasteurized Clanwilliam soil (3 seedlings per pot), and maintained in the greenhouse at 26 • C. The seedlings were watered daily for 7 days, and then at 2-to-4 day intervals. Leaves and twigs from two months old plants were used in experiments to identify an appropriate buffer for flow cytometry analyses and to find a suitable transport/storage media for rooibos leaf samples from distant locations.
For flow cytometry analyses of cotyledons and radicles, 40 rooibos seeds were surface sterilized by soaking for 20 min in 6% sodium hypochlorite (Kimix, Cape Town, South Africa) and rinsing five times with sterile water. The seeds were then germinated on PDA (containing 30 mg/mL penicillin and 25 mg/mL streptomycin) at room temperature in the dark (5 seeds per plate). To exclude contaminations, all healthy-looking 5 day old seedlings were transferred to new PDA plates (1 seedling/plate) and maintained at RT in light for 3 days. For flow cytometry analyses of cotyledons and radicles, 10 healthy seedlings were analyzed, respectively using woody plant buffer (see below).
To identify an appropriate media for transport of rooibos samples from distant locations that would maintain plant material suitable for flow cytometry analyses, four media were tested: (1) Distilled water, (2) 5% glycerol solution, (3) 10% glycerol solution, and (4) 3.5 mm silica gel beads (Lasec). For each treatment, four twigs with green leaves from ten different rooibos plants of the study population were placed into 50 mL falcon tubes (1 tube per plant) and completely submerged in the respective media. The tubes with silica gel beads were maintained at ambient temperatures, and the tubes with liquid media at 4 • C. After four days, the samples were visually examined and analyzed by flow cytometry using woody plant buffer (see below).
For field plant analyses, commercial plants from three rooibos farms in distant locations were sampled. Wild A. linearis populations, representing nine different rooibos growth types, were identified in the larger Clanwilliam, Wupperthal, and Nieuwoudtville regions with the help of local rooibos farmers ( Figure S1). Taxonomic classification of the wild rooibos plants was based on morphological characterization of the plants, supportive information from the farmers and growth type verification by the South African botanist B.E. van Wyk [10]. In addition, we sequenced the ITS region from all sampled field plants (see below). Depending on the size of the population, five to ten plants were sampled per population (Table S1). For DNA analyses, leaf material was flash-frozen in the field using liquid nitrogen, transported on dry ice and maintained at −80 • C. For flow cytometry analyses, two twigs with healthy, green leaves were cut from four sides of the plants (eight twigs in total), transferred to 50 mL Falcon tubes, completely covered with 3.5 mm silica gel beads and transported at ambient temperatures in a dry container to the study site. Flow cytometry analyses were conducted within four days of collection on three different leaf samples per plant using woody plant buffer (see below).

DNA Extraction, PCR Amplification and Sequencing
Rooibos genomic DNA was extracted using a modified protocol adapted from [45] and [46]. Young rooibos leaves were ground in liquid nitrogen using mortar and pestle to fine powder. For each gram of leaf tissue, 4 mL of heated (65 • C) CTAB extraction solution (100 mM Tris-Cl pH 8, 20 mM EDTA pH 8, 1.4 M NaCl, 2% w/v CTAB, 1% w/v PVP, 2% v/v 2-ME) and 0.5 mL CTAB/NaCl solution (10% w/v CTAB, 7 M NaCl) was added. After vigorous mixing, the homogenate was transferred into a 15 mL falcon tube. An equal volume of CTAB extraction solution was added, the solution was mixed thoroughly by inversion and then incubated at 65 • C for 1 h. To maximize yields, the solution was vortexed every 15 min. Thereafter, an equal volume of Chloroform: Isoamyl alcohol (24:1) was added and mixed by inversion. The top aqueous phase was recovered after centrifugation for 5 min at 7500× g. A 1/10th volume of 65 • C CTAB/NaCl solution was added to the supernatant and mixed well by inversion. The top aqueous phase was recovered after centrifugation for 5 min at 7500× g. An equal volume of CTAB precipitation solution (50 mM Tris-Cl pH 8, 10 mM EDTA pH 8, 1% w/v CTAB) was then added and the solution was incubated at 37 • C overnight. After centrifugation for 5 min at 500× g and 4 • C, the supernatant was recovered. DNA was precipitated by adding 0.6 volumes of isopropanol to the supernatant, mixed well and centrifuged for 15 min at 7500× g at 4 • C. The pellet was then washed with 80% ethanol, dried, re-suspended in 200 µL high salt TE buffer (10 mM Tris-Cl pH 8, 0.1 mM EDTA pH 8, 1 M NaCl) per gram starting material and stored at −20 • C.
The ITS region was PCR amplified using the 17SE and 26SE primers published by Sun et al. [47]. Each reaction tube contained 30-50 ng of template DNA, 10 µL of 5X High-Fidelity Phusion ® reaction buffer (Biolabs ® Inc, New England), 5 µL of dNTP stock solution (containing 2 mM of each dNTP), 2.5 µL of each primer (10 µM; Inqaba Biotec, Pretoria, South Africa), 0.5 µL of 2 u/mL Phusion ® High-Fidelity DNA polymerase (Biolabs ® Inc, France SASU, New England) and nuclease-free water (final volume 50 µL). The amplification was conducted in a T100 thermal cycler (Bio-Rad, USA) under the following conditions: initial denaturation at 98 • C for 3 min, 34 cycles of denaturation at 98 • C for 10 s, annealing at 72 • C for 30 s, and extension at 72 • C for 30 s; and a final extension at 72 • C for 10 min. To verify the size and concentration of the PCR products, 8 µL of the PCR reactions were analyzed on a 1.2% (w/v) agarose gel using PstI as a molecular weight marker. For samples that showed no bands or weak bands on the agarose gel, the PCR reactions were repeated. The PCR products were purified using the QIAquick PCR Purification Kit (Qiagen, Germantown, Maryland, USA) and stored at −20 • C. Amplicons were sequenced bidirectionally (with 17SE and 26SE primers) by Macrogen Inc. (Netherlands, Europe) using a 96-capillary ABI 3730xl DNA analyzer (Applied Biosystems, Foster City, Carlifonia, USA) following standard procedures. Sequencing reads were quality trimmed using CLC Genomics Workbench version 7.0.1 (QIAGEN-Bioinformatics, Germantown, Maryland, USA) using a quality limit of 0.01, removing all bases below a quality score of 20. Quality processed reads (Appendix A) were aligned with published ITS sequences from the Aspalathus genus obtained from NCBI (AJ74495, EU347722, EU347738, EU347723, EU347736, EU347729, EU347726, EU347732, EU347734, EU347731, EU347728, EU347733, EU347727, EU347725, EU347730) using MAFFT [48] with default settings, and the alignments were manually edited using MEGA X [49].
DNA library preparation for genome sequencing was performed at the UKHC Genomics Core Laboratory (UK Chandler Hospital Lexington, KY 40536, USA). One paired-end 300 bp insert size library was prepared using the Nextera DNA Library Preparation kit (Illumina, San Diego, CA, USA), following the manufacturer's instructions. Sequencing was performed using two flow cells (1 lane each) on the Illumina MiSeq platform (Illumina, San Diego, CA, USA) and one flow cell (six lanes) on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA), generating in total 2.5 billion read pairs of 120 bp length according to the manufacturers protocols. Adapter trimming and removal of the PhiX spike was performed by the service provider using Illumina's bcl2fastq2 Conversion Software v2.20 (Illumina, San Diego, CA, USA). Subsequently, quality trimming was performed using Trimmomatic (v0.38) [50] to remove remaining adapter sequences, leading and trailing nucleotides with a Phred score below 20, and reads shorter than 50 bp (ILLUMINACLIP:NexteraPE-PE.fa:2:30:10 LEADING:20 TRAILING:20 MINLEN:50). Error correction was performed on filtered reads using Lighter (v1.1.1) [51] with k set to 31, α set to 0.1, and specifying a 2x genome coverage of 2.5 Gbp (the use of a 2x genome coverage was suggested by the author of the program due to high heterozygosity predicted for the rooibos genome, personal communications). Quality filtering was performed for each lane, and subsequently assessed using FastQC (v0.11.7) [52] and MultiQC (v1.7) [53].

K-Mer Analyses
The khist.sh script from BBNorm was used to produce k-mer frequency histograms for k-values equal to 19, 23, 27, and 47. Histlen was set to 900,000 (which calculates k-mer frequencies up to a coverage of 900,000×). Analyses were performed on the raw and quality processed datasets using first just the MiSeq data and subsequently the complete dataset (MiSeq plus HiSeq data). The 16 k-mer histograms were used to estimate the rooibos genome size using the programs GenomeScope (v1 and v2), FindGSE and BBNorm. Both GenomeScope versions were run using three maximum k-mer coverage settings: 1000×, 10,000×, and 900,000×. Read lengths were specified based on the FastQC outputs for the datasets (119 bp or 120 bp), and ploidy level was set to 2. FindGSE was run specifying the respective k-mer size and the expected k-mer coverage for the homozygous region (exp_hom). The value for exp_hom was selected by finding the k-mer coverage at the maximum height of the homozygous peak, and then using the k-mer coverage value two counts above that one, satisfying the requirement that fp < VALUE < 2*fp, where fp is the maximum frequency of the homozygous peak (as discussed on the FindGSE GitHub page: [54]). To estimate the genome size using BBNorm, the khist.sh script was run using the above parameter settings. The rooibos genome size was also estimated using the often-applied formula introduced by Waterman: genome size = total number of k-mers/k-mer coverage at the highest frequency of the homozygous peak (for rooibos-the second peak in the histograms). The formula was applied after excluding low frequency k-mers, which likely represent sequencing errors.

Conclusions
This study highlights that estimation of genome sizes in plants remains a challenging undertaking. Flow cytometry analysis may overestimate the values due to the effects of different plant compounds that affect binding of the stains, while values based on k-mer analysis may differ depending on parameter settings of the programs. For the latter, quality of the sequencing data may also play a role. With a genome size of 1.03-1.24 Gbp, Aspalathus linearis is one of the legume species with a relatively large genome (similar to Lupinus angustifolius [55] and Pisum sativum [56]). Differences in ploidy levels between the rooibos growth types can be all but ruled out.
Supplementary Materials: The following are available online at http://www.mdpi.com/2223-7747/9/2/270/s1. Figure S1: Rooibos sample collection sites from different locations. Figure S2: ITS alignment of 20 sequences from Aspalathus linearis plants representing 10 different growth types (Appendix A) and 15 Aspalathus sp. sequences from NCBI. Table S1: Mean ± SD Genome sizes (in Gbp) of different rooibos growth types sampled from remote areas. Funding: This research was funded by the National Research Foundation (NRF), grant numbers RTF150421117446 and CSUR150714125961. Opinions expressed, and conclusions arrived at, are those of the authors and are not necessarily to be attributed to the NRF.