To remove sequences sampled from other taxa during RNA extraction, BlobTools was used to classify assembled transcript sequences, and only reads contributing to sequences classified as Streptophyta were used for all downstream analysis.
BlobTools infers taxonomic annotation from a similarity search of the input sequences against a public sequence collection (e.g. NCBI nt), and determines the taxonomy of each sequence using a taxrule algorithm. Coverage and GC content of the annotated sequences are plotted in order to visualise the partitioned sequences and perform downstream contaminant screening.
Screening for contaminants revealed the majority of taxonomically assigned transcripts belonged to Streptophyta, with 62,082 and 68,696 transcripts from B. conchifolia and B. plebeja respectively assigned to the taxon (supplementary Figs. 1 and 2). The next most frequent taxon represented by annotated transcripts in B. conchifolia is Arthropoda with 3,629 transcripts, and Ascomycota in B. plebeja with 7,309 transcripts, representing plausible sources of contamination from a greenhouse setting.
The sequence length weighted span of coverage in both species’ BlobPlots shows Streptophyta having the second highest peak of coverage after no-hit sequences, and the widest span, reflecting the range of expression levels of the transcripts screened.
Using the BlobPlot information, reads from both annotated contaminant and no-hit transcripts were removed from further analysis totalling around 80 million reads in B. conchifolia and 64 million in B. plebeja (supplementary table 1).
Comparison of assemblies prior to and after decontamination showed that removal of contaminant and no-hit reads predominantly affected the number of short transcripts (Fig. 2), median transcript length increasing to 1,641bp and 1,304bp in the decontaminated assembly from 595bp and 505bp in the contaminated assembly for B. conchifolia and B. plebeja respectively. The maximum contig length also increased in assemblies after decontamination, increasing from 15,079bp to 15,923bp in B. conchifolia and 15,948bp to 16,037bp in B. plebeja, suggesting an improvement in the Trinity assembly process when faced with fewer reads and reduced K-mer graph complexity.
Decontaminated assemblies had fewer shorter and incomplete sequences, only 39% and 35% of sequences in the contaminated B. conchifolia and B. plebeja assemblies had a detected ORF, compared to 77% and 73% for each species after removal of contaminants.
B. conchifolia showed less evidence of fragmentation, with fewer total transcripts, and a greater N50 and N90 than B. plebeja (Table 1). Sequence length distribution corroborates this pattern, showing B. plebeja to have a greater bulk of shorter transcripts than B. conchifolia. BUSCO assessment of transcriptome completeness (Table 2) showed that B. plebeja had marginally poorer scores for transcript completeness and fragmentation, however both transcriptomes showed over 80% completeness, indicating that using multi-tissue RNA-seq to sample as much of the plant as possible has successfully captured a large proportion of genes.
Table 1
Assembly statistics for B. conchifolia and B. plebeja before and after contaminant removal.
| Before BlobTools | After BlobTools |
B. conchifolia | B. plebeja | B. conchifolia | B. plebeja |
Number of seqs | 200,849 | 251,473 | 42,614 | 59,106 |
Number of unigenes | 95,553 | 124,116 | 17,012 | 19,969 |
Smallest seq | 201 | 201 | 201 | 201 |
Largest seq | 15,079 | 15,948 | 15,923 | 16,037 |
N over 1kb | 70,738 | 71,283 | 31,876 | 37,865 |
N over 10kb | 44 | 13 | 28 | 16 |
N with ORF | 77,736 | 86,926 | 32,848 | 43,119 |
N90 | 400 | 328 | 1,090 | 842 |
N50 | 1,808 | 1,348 | 2,381 | 1905 |
Table 2
BUSCO assessment of transcriptome completeness for B. conchifolia and B. plebeja
BUSCO category | B. conchifolia | B. plebeja |
Complete | 362 (85.2%) | 351 (82.5%) |
Complete single copy | 352 (82.8%) | 344 (80.9%) |
Complete duplicated | 10 (2.4%) | 7 (1.6%) |
Fragmented | 23 (5.4%) | 36 (8.5%) |
Missing | 40 (9.4%) | 38 (9%) |
Total | 425 | 425 |
The Trinotate pipeline was used to annotate 17,012 B. conchifolia and 19,969 B. plebeja unigenes (supplementary files 3 and 4). More unigenes were annotated in B. plebeja than in B. conchifolia within each annotation source (Fig. 3), likely due to the larger number of input transcripts. Concordantly, B. plebeja also has more transcripts without any annotation, 16.65% and 17.48% of unigenes are unannotated in B. conchifolia and B. plebeja respectively. Roughly equal proportions of unigenes from both species were annotated to Streptophyta (70.87% in B. conchifolia and 70.51% in B. plebeja).
Unigenes which shared sources of annotation were identified using UpSet plots, showing that, due to very sparse EggNOG annotation (Fig. 3), the largest group of unigenes sharing annotation sources were annotated with all sources except for EggNOG (supplementary Figs. 5 and 6), B. conchifolia having 7,217 unigenes in this group and B. plebeja 7,787.
Unigene presence or absence across tissues was compared for B. conchifolia and B. plebeja, using a cutoff of 1 FPKM for whether a transcript is present or absent in a tissue. UpSet plots were used to visualise the shared and tissue specific distribution of unigene expression (supplementary Figs. 3 and 4). The majority of unigenes are expressed in all tissues (11,495 unigenes in B. conchifolia and 13,609 unigenes in B. plebeja).
Both species had a high frequency of unigenes expressed uniquely in root (528 and 502 unigenes in B. conchifolia and B. plebeja respectively) and in and male flower (223 and 359 unigenes in B. conchifolia and B. plebeja respectively).
To identify unigenes which are uniquely expressed in each tissue, we used a 1 FPKM cutoff for unigene expression, identifying unigenes which are expressed in a single tissue and not expressed in any other tissues. For example, a unigene which is uniquely expressed in root has an expression of more than 1 FPKM in root, and less than 1 FPKM in all other tissues. To identify the functional categories of genes uniquely expressed per tissue, we mapped GO terms to tissue specific unigenes (plotted in supplementary Figs. 7 and 8). The total number of GO terms mapped to these uniquely expressed unigenes, here referred to as unique GO terms (UGT), is shown in Table 3. B. conchifolia and B. plebeja had a comparable number of UGTs across tissues. Both species had fewer UGTs in female flower compared to male flower (199 and 131 in female flower and 772 and 924 in male flower in B. conchifolia and B. plebeja respectively). Of all tissues, leaf had the least UGTs in both species (69 and 0 in leaf in B. conchifolia and B. plebeja respectively), and root tissue had the greatest number (1183 and 1820 in leaf in B. conchifolia and B. plebeja respectively). Tissues which had the biggest difference in number of UGTs between species were petiole (a difference of 225 UGTs) and vegetative bud (a difference of 341 UGTs).
Table 3
number of GO terms mapped to unigenes expressed uniquely in each tissue in B. conchifolia and B. plebeja
Tissue | B. conchifolia | B. plebeja |
Female flower | 199 | 131 |
Leaf | 69 | 0 |
Male flower | 772 | 924 |
Petiole | 338 | 113 |
Root | 1183 | 1820 |
Vegetative bud | 102 | 443 |
After alignment of the CHS copies identified, any CHS sequences which had an overlap of less than 200bp within the conserved coding sequence with any other sequence were discarded (supplementary file 1). A maximum likelihood tree was inferred from the remaining subset of CHS copies identified in B. conchifolia and B. plebeja, using Z. mays as an outgroup, and including the CHS sequence of C. sativus, the closest relative to Begonia with a publicly available genome sequence 37, and H. sandiwcensis, the monotypic sister genus to Begonia (supplementary file 2). Four copies of CHS were identified in B. conchifolia and five copies in B. plebeja.
At least one more copy exists in both species (supplementary table 2), but due to incomplete sequence reconstruction, it is not possible to assign a copy number with any certainty. Of the nine CHS sequences included for B. conchifolia and B. plebeja, four pairs of closely related orthologs were identified in B. conchifolia and B. plebeja, as well as one additional B. plebeja duplicate.
The phylogenetic tree reconstructed for CHS copies from Begonia, Hillebrandia, C. sativus, A. thaliana and Z. mays revealed duplicates from B. conchifolia and B. plebeja arose after the divergence of Begonia and its closest sequenced neighbour in the Curcurbitales, C. sativus. Begonia sequences were obtained from RNA-seq, therefore it is not possible to confirm that no older duplicates exist in B. conchifolia and B. plebeja, however the absence of older duplicates in the genome of H. sandiwcensis, the monotypic species of Begonia’s sister genus Hillebrandia supports a pattern of high duplicate turnover in the Begoniaceae.
The nine copies identified in B. conchifolia and B. plebeja are four putative orthologs and one single B. plebeja duplicate, and are colour coded for ease of comparison (Fig. 4 and Fig. 5).
The oldest duplication identified in Begonia gives rise to group 4 orthologs (in green, Fig. 4), and is placed after the divergence of Begonia and Cucumis.
The intra-ortholog expression similarity (e.g. group 1 B. conchifolia ortholog vs group 1 B. plebeja ortholog) is reasonably high (Fig. 5), and reflects the recent speciation of the two study species. The single B. plebeja ortholog in group 5 (red) also shows high similarity in expression profile to group 2 orthologs (pink), mirrored by the phylogenetic proximity of the two groups (Fig. 4). Any changes between each ortholog pair are therefore the result of expression changes since divergence of B. conchifolia and B. plebeja.
Inter-ortholog expression is more variable; ortholog group 1 (blue) has the highest expression (B. conchifolia FPKM min = 530, max = 2350, B. plebeja FPKM min = 408, max = 2903), and group 3 (yellow) has the lowest expression (B. conchifolia FPKM min = 0.11, max = 2.42, B. plebeja FPKM min = 2.06, max = 14.91). Ortholog groups 2 (B. conchifolia FPKM min = 27.57, max = 576.92, B. plebeja FPKM min = 11.91, max = 635.85 and 4 (B. conchifolia FPKM min = 13.65, max = 365.58, B. plebeja FPKM min = 12.26, max = 606.76) have much more comparable expression, indicating that phylogenetic proximity is not correlated with expression similarity in this case. The only exception is the group 5 single B. plebeja duplicate (FPKM min = 27.28, max = 871.03), which is the product of a duplication shortly before the group 2 orthologs, and shares similarity in expression due to phylogenetic proximity.
The group 3 CHS orthologs appear to have considerably decreased expression levels in both species, the B. conchifolia ortholog showing less than 1 FPKM expression across all tissues except for root. While the B. plebeja ortholog has slightly higher expression, this ortholog pair may represent an example of a gene duplicate in the process of pseudogenisation. This possibility is supported by the considerably longer branch length leading to the group 3 orthologs, suggesting an increased rate of mutation accumulation allowed by the reduced selective constraints during pseudogenisation 38.