Comparative analysis of whole flower transcriptomes in the Zingiberales

The advancement of next generation sequencing technologies (NGS) has revolutionized our ability to generate large quantities of data at a genomic scale. Despite great challenges, these new sequencing technologies have empowered scientists to explore various relevant biological questions on non-model organisms, even in the absence of a complete sequenced reference genome. Here, we analyzed whole flower transcriptome libraries from exemplar species across the monocot order Zingiberales, using a comparative approach in order to gain insight into the evolution of the molecular mechanisms underlying flower development in the group. We identified 4,153 coding genes shared by all floral transcriptomes analyzed, and 1,748 genes that are only retrieved in the Zingiberales. We also identified 666 genes that are unique to the ginger lineage, and 2,001 that are only found in the banana group, while in the outgroup species Dichorisandra thyrsiflora J.C. Mikan (Commelinaceae) we retrieved 2,686 unique genes. It is possible that some of these genes underlie lineage-specific molecular mechanisms of floral diversification. We further discuss the nature of these lineage-specific datasets, emphasizing conserved and unique molecular processes with special emphasis in the Zingiberales. We also briefly discuss the strengths and shortcomings of de novo assembly for the study of developmental processes across divergent taxa from a particular order. Although this comparison is based exclusively on coding genes, with particular emphasis in transcription factors, we believe that the careful study of other regulatory mechanisms, such as non-coding RNAs, might reveal new levels of complexity, which were not explored in this work.


INTRODUCTION
Next-generation sequencing technologies have been instrumental in allowing for the rapid generation of large quantities of transcriptomic data, previously unavailable for the majority of non-model organisms. In parallel to refinements of the sequencing technologies, several bioinformatics pipelines have been put forward allowing for the de novo assembly of transcriptomes from organisms for which there is not a fully sequenced and annotated genome ('reference genome' e.g., Wit et al., 2012;Chiara, Horner & Spada, 2013;Singhal, 2013;Unamba, Nag & Sharma, 2015). Although long predicted as a revolutionary tool (Wang, Gerstein & Snyder, 2009), RNA-Seq approaches enabling the comparative quantification of gene expression during organismal development have recently gained wide use across a diversity of organisms representing unique developmental and physiological processes. These advances have enabled the identification of candidate genes involved in a variety of processes ranging from flower color (e.g., pigment biosynthesis in Camellia reticulata (Yao et al., 2016); color polymorphism in Silene littorea Brot (Casimiro-Soriguer et al., 2016)) to characterization of biosynthetic pathways (e.g., glucosinolate and phytochelatin pathways in Sinapsis alba L. (Zhang et al., 2016); flavonoid and stilbenoids pathways in Gnetum parvifolium (Warb.) W.C.Cheng (Deng et al., 2016)). NGS approaches have also been used to study plant architecture (González-Plaza et al., 2016) as well as specific aspects of reproductive development (Hollender et al., 2014). Chanderbali and colleagues (2009;2010) pioneered the use of next-generation sequencing technologies to study the comparative evolution of floral development across angiosperms. Their choice of plant species included representatives of main angiosperm lineages (i.e., water lily, avocado, California poppy, and Arabidopsis), as well as a non-angiosperm seed plant (cycad), which allowed the authors to obtain insights into the molecular mechanisms underlying the evolution and diversification of the flower (Chanderbali et al., 2010). While there was deep conservation in the genetic programs specifying floral organ identities, further confirmed by the careful study of 18 angiosperm genomes (Davila-Velderrain, Servin-Marquez & Alvarez-Buylla, 2013), it was also possible to identify distinct transcriptional programs characterizing more recently derived plant lineages (Chanderbali et al., 2010). Thus, one can hypothesize that these distinct transcriptional programs are likely involved with mechanisms of diversification in floral shape, especially in closely related species.
In order to gain insight into the genetic basis of floral morphological variation, we present a comparative transcriptomic analysis of several species within the angiosperm order Zingiberales. The Zingiberales is a lineage of tropical and subtropical monocots comprising eight families. The order includes economically important species such as culinary ginger (Z. officinale Roscoe), turmeric (Curcuma longa L.), and banana (Musa acuminata Colla), as well as popular ornamentals, such as Canna indica L., bird-of-paradise (Strelitzia reginae Banks), spiral gingers (Costus spp.), and heliconias (Heliconia spp.). A recent phylogenetic analysis (Sass et al., 2016) supports the placement of Musaceae as sister to all other lineages followed by a monophyletic clade comprising Heliconiaceae, Strelitziaceae and Lowiaceae. Together, these four families are referred to as the ''banana  2016)). Zingiberales families are divided into the banana group, a paraphyletic assembly of early branching lineages, and the ginger clade. The asterix (*) marks the evolution of increased petaloidy and reduced number of fertile stamens as shared characteristics of the ginger clade. (B) M. basjoo flower and floral organs. Calix and corolla members are mostly fused into what is called the floral tube, with the exception of a single corolla member, the free petal. As a representative of the androecial constitution of the banana group, M. basjoo has five filamentous fertile stamens. M. basjoo gynoecium is also representative of most species in the banana group. (C) Canna sp. flower and floral organs. Species in the ginger clade usually exhibit inconspicuous and sepal-like calix and corolla, while infertile androecial members (staminodes) become laminar and petaloid. Species in the Zingiberaceae and Costaceae families bear a single fertile stamen, while species in the Cannaceae and Marantaceae families only develop 1/2 a fertile stamen. Furthermore, in Canna sp. the gynoecium is also laminarized to some extent. ft, floral tube; fp, free petal; se, sepals; pe, petals; st, stamen; th, theca; std, staminodes; gy, gynoecium (Photos by Ana Almeida).
Full-size DOI: 10.7717/peerj.5490/ fig-1 lineages'' and form a basal paraphyly with respect to the derived monophyletic ginger clade (Cannaceae, Marantaceae, Costaceae, and Zingiberaceae = ''ginger clade'') ( Fig. 1A). Flower morphology in the Zingiberales varies dramatically, and one of the main floral transitions in the order is related to the androecial whorl. Throughout the evolution of the Zingiberales, the number of fertile stamens is drastically reduced from 5-6 fertile stamens in the banana lineages to one or 1 2 fertile stamen in the ginger clade. This reduction in fertile stamen number is inversely correlated to an increase in petaloidy, in which the infertile androecial members laminarize (flatten) and develop into petal-like organs (Almeida, Yockteng & Specht, 2015) (Fig. 1B).
Several gene and gene networks have been hypothesized as underlying the molecular mechanisms of Zingiberales floral developmental evolution (Bartlett & Specht, 2010;Yockteng et al., 2013a;Almeida et al., 2014;Almeida et al., 2015). However intriguing, these studies are limited to candidate-gene or candidate-process approaches. In this study, we present an analysis of whole flower transcriptomes of several species spanning the Zingiberales order, as well as of a closely related Commelinaceae species. We focus our comparative analysis on coding regions, with particular attention to transcription factors. This broad approach aims at avoiding the pitfalls of targeted candidate-based methodologies, and can potentially illuminate lineage-specific mechanisms of floral development linked to evolution and diversification in form and function. We also highlight the advancements and challenges of comparative transcriptome-based approaches for the study of developmental evolution.  (Table 1). Whole young floral buds were collected and immediately flash frozen in liquid nitrogen. Flower and/or inflorescence size and morphology vary widely within the Zingiberales, and uniform developmental stages have not yet been established for the different lineages. In all cases, young inflorescences were dissected as much as possible and the youngest discernable floral buds were collected. Frozen floral buds were stored in −80 • C for up to two days before RNA extraction. Total RNA was extracted from floral material using Plant RNA Extraction Reagent (Invitrogen, Carlsbad, CA, USA), according to Yockteng et al. (2013b). RNA was stored at −80 • C until further use.

Library Preparation and sequencing
cDNA libraries for sequencing on the Illumina platform were prepared using the TruSeq RNA sample prep kit v2. cDNA libraries were prepared with 2,0 µg of RNA extracted from flash frozen floral buds. Co. spicatus whole flower library was sequenced using the Illumina HiSeq2000 at IIGB HT Sequencing Facility at the University of California, Riverside. All other samples were multiplexed 1:1 using barcode set A. Multiplexed libraries were sequenced using Illumina HiSeq2000 High Output at Vincent J. Coates Genomics Sequencing Lab at University of California at Berkeley. All libraries were sequenced as 100 bp pair-end reads.

Data cleanup and transcriptome assembly
Data clean-up was performed using a custom Perl script involving the following steps: (i) removal of identical forward and reverse reads; (ii) removal of duplicated  Langmead et al., 2009). Clean-up quality was assed comparing FastQC v0.11.2 (http: //www.bioinformatics.babraham.ac.uk/projects/fastqc/) reports of cleaned and raw reads. Transcriptomes were assembled de novo using Trinity v2.1.0 (Grabherr et al., 2011) with a variety of parameters. The best assembly results (based on the quality assessments presented below) used default parameters for all other species despite discrepancies in the overall estimated transcriptome coverage and number of reads. Contigs larger than 300bp were retained and further annotated.
Quality assessment of de novo assemblies was performed using DETONATE v1. 10 (Li et al., 2014). In particular, RSEM-EVAL was used as a reference-free evaluation method. True transcript length was estimated through comparison to several predicted transcriptomes from the sequenced genomes of Musa acuminata (D'Hont et al., 2012), the palms Phoenix dactylifera (Al-Mssallem et al., 2013) and Elaeis guineensis (Singh et al., 2013), and the core eudicot Arabidopsis thaliana (The Arabidopsis Genome Initiative, 2000). The number of coding sequences (CDS) in these species ranged from 28,889 in Phoenix dactylifera to 35,386 in Arabidopsis thaliana and 36,549 in Musa acuminata, to 44,360 in Elaeis guineensis.
Further quality assessment was performed on the basis of number and length of contigs as well as N50 (Table 2).

Transcriptome annotation and comparison
Statistically supported contigs were annotated with the help of TransDecoder v4.1.0 (https://transdecoder.github.io/). First, coding regions were identified using TransDecoder long ORFs prediction. Predicted long ORFs were subjected to a Blastp search (Gish & States, 1993) using Blast+ v2.7.1 against the Uniprot database (The UniProt Consortium, 2015), as well as a HMMER3 v3.1b2 (Eddy, 1998) search against the Pfam database (Finn et al., 2016). The results from the Blastp and HMMER3 searches were used by TransDecoder to filter likely coding regions from the predicted long ORFs list. For each species, TransDecoder-predicted coding regions were further filtered, using a Blastp search to the Uniprot database and the following parameters: ≥ 70% identity; E-value ≤ 1.0e −5 ; alignment length ≥ 100 bp; and coverage of at least 40%. These stringent lists were used as inputs for whole flower transcriptome comparisons, in order to avoid the inclusion in the analyses of chimeras and/or truncated transcripts.

Transcription factor sorting and analysis
Transcripts were further classified into overall functional categories as either metabolic enzymes, mitochondrial, chloroplast, structural or regulatory, based on BLAST results. Unknown transcripts as well as predicted uncharacterized transcripts were grouped as ''uncharacterized''. Regulatory transcripts were further analyzed regarding their role as transcription factors, and were subjected to further BLAST searches against the NCBI database, based on their conserved DNA-binding amino-acid domains. Further analysis also entailed a comparison of these transcripts to transcription factor sequences available at the curated plant specific database PlantTFDB v4.0 (http://planttfdb.cbi.pku.edu.cn/index.php; Jin et al., 2017). A list of all transcription factors retrieved in this analysis is presented on Supplemental Information S1.
All data processing was performed within the QB3 Computational Genomics Resource Laboratory (CGRL) at University of California at Berkeley, except when specified otherwise.

Transcriptome assembly
The number of cleaned reads for each whole flower transcriptome ranged from ∼1 million reads for Co. spicatus to 142,860,349 reads in Ca. zebrina ( Table 2). The significant difference in the number of reads is likely due to differences in the sequencing platform, in the case of Co. spicatus, and unequal multiplexing of libraries, in the case of Ca. zebrina. All other libraries resulted in a comparable number of reads, ranging form ∼4.3 million in O. fimbriata to ∼9.3 million reads in Canna sp. The number of non-filtered contigs ranged from ∼52,000 to ∼74,000, except in Co. spicatus (∼19,000) and Ca. zebrina (∼132,000), likely due to the discrepancy observed in the number of cleaned reads. With the exception of Co. spicatus, contig average length and N50 were comparable in all other libraries (Table 2).  -Fontanesi et al., 2016). In order to further assess assembly quality, we calculated RSEM-scores based on estimates of true transcriptome length of Musa acuminata, Phoenix dactylifera, Elaeis guineensis, and Arabidopsis thaliana (Table 2). Although we found no significant difference between results, RSEM-EVAL scores tended to favor the largest transcript length (Elaeis guineensis), regardless of phylogenetic proximity. Even for M. basjoo, phylogenetically close to Musa acuminata, the best RSEM-score was that based on Elaeis guineensis transcriptome.

Transcriptome annotation and comparison
Transcriptomes were filtered based not only on long predicted open reading frames (long ORFs) but also on Blastp and HMMER3 results (filtered ORFs) using TransDecoder (Table 3). The average number of filtered ORFs was ∼30,000, ranging from 13,122 in Co. spicatus to 55,360 in Ca. zebrina. The number of filtered coding sequences observed in this study is similar to already described numbers of floral unigenes of other non-model plants, which ranges between ∼25,000 (in buckwheat, Logacheca et al., 2011) to ∼80,000 (in Dendrocalamus latiflorus floral buds, Zhang et al., 2012). Whole flower transcriptome filtered ORFs represented on average 47% of reconstructed contigs, and ranged from 40 to 68%, similarly to what has been recently reported in Arabidopsis developing flowers (23,961 expressed genes; 67% of predicted CDS; Zhang et al., 2015). After filtering, the high number of contigs observed in Ca. zebrina was reduced to 55,360 ORFs, which is within the upper limits of already described non-model plant floral transcriptomes (see above).
In order to further annotate the contigs, OrthoFinder was used to establish orthology between the transcriptomes and the sequenced genome CDS. A total of 41,557 orthogroups were found (Supplemental Information S2), of which 17,418 had counterparts in at least one of the sequenced genomes included in the analysis. Over 24,000 groups had no CDS components in any of the analyzed genomes, which might suggest the persistance of chimeras and/or truncated ORFs within the filtered transcriptomes, Zingiberales specific genes, or a combination of the two. Arabidopsis thaliana CDS were present in 11, 511 orthogroups (Supplemental Information S3), while 5,907 orthogroups had no Arabidopsis counterparts but comprised other CDS from at least one of the other sequenced genomes.
Orthogroup species overlap is presented on Table 4. Furthermore, OrthoFinder identified 6,916 orthogroups with all 10 species present. Of those, only 28 comprised single-copy orthogroups, in which one single ortholog was found for each especies (Supplemental Information S4). Within Zingiberales transcriptomes, the largest orthogroup overlap was to the Musa acuminata genome, likely a reflection of their phylogenetic proximity. In all cases, Zingiberales transcriptomes largest orthogroup overlap to a non-Zingiberales genome was to Elaeis guineensis CDS.
One-hundred and forty-two (142) orthogroups were Arabidopsis thaliana-specific (Supplemental Information S5) with no counterparts in any of the other analyzed genomes. Given that all other genomes were from monocot species, this finding might reflect either Arabidopsis-specific or eudicot-specific genes. Further analyses are necessary to determine whether these genes are involved in eudicot-or Arabidopsis-specific flower development.
Blastn searches were conducted on the basis of Arabidopsis, Elaeis, Phoenix and Musa predicted CDS (Table 5). These searches produced variable results, potentially due to phylogenetic proximity and degree of genome sequence completeness. In general, all floral transcriptome Blastn searches resulted in a very small number of hits to Arabidopsis thaliana CDS, as expected due to its phylogenetic distance, indicating that although Arabidopsis is likely the best annotated plant genome to date, its phylogenetic distance to the study group makes fine-tuned statements of homology between Arabidopsis coding sequences and the predicted ORFs in the Zingiberales species studied here a challenging task. For instance, while 80.5% of Musa acuminata CDS were present amongst M. basjoo contigs, only ∼5% of Arabidopsis thaliana CDS were represented within the same assembly (Table 5), which  (Table 5). In order to avoid phylogenetic bias, as well as to maximize transcriptome annotation, further Blastn analyses of filtered ORFs were based on Elaeis guineensis predicted CDS. Based on Blastn searches against Elaeis guineensis predicted CDS, floral transcriptomes shared 4,153 genes (Fig. 2). We also identified 1,748 hits specific to Zingiberales, 666 to the ginger clade, 1,560 hits unique to the Cannaceae-Marantaceae lineage, 2,001 specific to the banana families, and 1,887 specific to Z. officinale, from which 221 hits are shared with Co. spicatus. The small number of contigs recovered for Co. spicatus likely limited the analysis of the Costaceae-Zingiberaceae lineage-specific Blastn hits (Supplemental Information S6).

Conserved genes
Orthogroup analysis containing Arabidopsis thaliana counterparts (Supplemental Information S3) revealed the presence of several well-known gene families in our flower transcriptomes. Within these orthogroups, the most noticeable groups were members of the AGAMOUS-like (AGL) family of transcription factors, including AGL6, AGL12, AGL20, AGL26, AGL29, AGL44, AGL58, AGL61, AGL65 and AGL104. Other MADS-box genes, Blastn hits to E. guineensis were used to further place genes in functional categories, as described in methods. Figure 3 depicts the main category of genes shared by all floral transcriptomes. Almost half of these genes (47%) are enzymes related to metabolic processes of the cell, while 26% of the genes are structural proteins such as membrane proteins, cytoskeleton-related proteins, ribosomal, histones, heat-shock and ribonucleoproteins. Approximately 10% of these genes are regulatory proteins, of which approximately 508 could be assigned to known transcription factor (TF) families, based on the PlantTFDB v4.0 (Supplemental Information S1). From the 58 well-characterized plant transcription factor families, our dataset was able to retrieve 36 families, based on the closest homolog in Arabidopsis thaliana (Table 6).  Additionally, six putative new categories of TFs that are not described in the database were also recovered, although more experimental evidence is required to further categorize their potential role as transcription factors. Here, we preliminarily named these sequences based on their match to the NCBI Conserved Domain Dataset Interestingly, the remaining regulatory proteins that were not included in the transcription factor category were nonetheless implicated in regulating plant organ development and/or growth, acting as protein co-factors that physically interact with transcription factors, or as related to the chromatin remodeling machinery.
Among the transcription factors shared by all flower transcriptomes, it is worth noticing a single homolog of APETALA-2 (a member of the A-class ABC model genes Jofuku et al., 1994), three homologs of MADS-6 or AGL6, as well as several homologs of HUA2-like proteins 2 and 3. In Arabidopsis thaliana, HUA1 and HUA2 are important components of the AGAMOUS gene regulation pathway (Chen & Meyerowitz, 1999). It has been suggested that HUA2 facilitates AGAMOUS action during flower development (Chen & Meyerowitz, 1999), and it is also required for the expression of FLC in Arabidopsis thaliana (Doyle et al., 2005). Moreover, HUA2 has been implicated in natural variation in Arabidopsis thaliana shoot morphology (Wang et al., 2007). Five LEUNIG-like homologs were also recovered in all floral transcriptomes. LEUNIG proteins are also involved in the regulation of AGAMOUS expression in Arabidopsis thaliana (Liu & Meyerowitz, 1995;Sridhar et al., 2004). The number of shared genes involved in the regulation of AGAMOUS indicates the shared importance of precise AGAMOUS regulation during flower development (Supplemental Information S1). In particular, genes involved in physiological responses to stress and pathogen response, such as the WRKY family of transcription factors (Wang et al., 2011) and the NAC domain proteins (Nuruzzaman, Sharoni & Kikuchi, 2013), were recovered in all transcriptomes. More recently, WRKY71 has been implicated in the control of shoot branching in Arabidopsis thaliana, through the regulation of RAX genes (Guo et al., 2015). All floral transcriptomes also presented several members of the zincfinger transcription factor family, seven KNOTTED 1-like homologs, as well as GATA transcription factors 2, 4, 12, and 24. Several members of the bHLH family; homologs of MYB44, MYB82; TCP-4, -15, and -7 homologs; four CONSTANS-like homologs; several members of the TCP family, as well as WUSHEL-like transcripts were also widely retrieved (Supplemental Information S1).
Other regulatory proteins include, for example, a homolog of COBRA-like 1; two homologs of FY-like proteins; one FRIGIDA-like homolog; and five homologs of EMBRYONIC FLOWER2-like. We also retrieved six TOPLESS-like homologs, almost 20 members of the TBC1 family, five IWS1 homologs, a GIGANTEA-like homolog, as well as four SQUAMOSA PROMOTER BINDING-like homologs.
Interestingly, the most prominent feature of the Blastn searchers was the match to different paralogues and/or variants of the same genes or gene families in different floral transcriptomes (Supplemental Information S1). For example, LATERAL ORGAN BOUNDARIES (LOB)-domain homologs were retrieved in all floral transcriptomes analyzed. However, while LOB40, 41 and 6 -like homologs were retrieved in all Zingiberales floral transcriptomes, LOB36 and a paralog of LOB6 -like transcripts were retrieved only in the banana transcriptomes. Similarly, LOB18-like was only recovered in the Cannaceae-Marantaceae lineage, while LOB4-like transcript was only recovered in Z. officinale. On the other hand, LOB15-like homologs were only recovered in the floral transcriptome of D. thyrsiflora. LOB genes have been implicated in defining organ boundaries in Arabidopsis floral organs through negative regulation of brassinosteriod accumulation (Shuai, Reynaga-Peña & Springer, 2002;Bell et al., 2012).
Whether this phenomenon is a result of gene duplication followed by divergence or whether it is due to lineage-specific divergence within a single copy begs further investigations. Whether these homologs have retained the same function is an exciting matter for further studies.

Lineage-specific genes
The great majority of lineage-specific genes, including Zingiberales specific genes, were related to metabolic processes of the cells (Fig. 3). The most prevalent unique genes were enzymes such as oxidoreductases, methyltransferases, aminoacyl-tRNA synthetases, kinases, hydrolases, and phosphatases. Carrier proteins, transporters, chaperones and ribonucleoproteins were also abundant in all lineage-specific datasets. Several transcription factors, many of which are known players during plant development, were recovered in a lineage-specific fashion. Fifty percent of Zingiberales specific genes are metabolic enzymes (28%) or structural proteins (22%), while 12%, approximately 210 coding sequences, are regulatory proteins (Fig. 3).
Among these regulatory proteins, several families of transcription factors were recovered exclusively in the Zingiberales, such as ENHANCER OF AG-4, various AP2-like ethyleneresponse transcription factors, BRZ1 homologs 1 and 3, SHOOT GRAVITROPISM 5-like homolog, the zinc-finger transcription factor JACKDAW -like homolog, a YABBY2-like homolog, as well as GT -2 and GT -3 (GT-element binding transcription factors) homologs.
Several DIVARICATA lineage-specific homologs, were retrieved in the banana and ginger groups transcriptomes. Similarly, other homologs appeared in a lineage-specific manner. For example, while two homologs of B-ZIP transcription factor family TGA4-like were recovered in the banana group, homologs for TGA2-like were recovered only in the ginger clade. Likewise, homologs of the trihelix DNA binding family gene ASIL1-like (ARABIDOPSIS 6B-INTERACTING PROTEIN 1-LIKE) were recovered in the banana group, while ASIL2-like homologs were recovered in the ginger clade.
As far as other regulators go, in all Zingiberales floral transcriptomes, but not in the outgroup D. thyrsiflora, we were able to recover a homologue of the plant homeodomain (PHD) protein ING2 (Inhibitor of growth). ING tumor suppressors are found in animals, plants and yeast, and have long been implicated in oncogenesis, control of DNA damage repair, cellular senescence and apoptosis (Champagne & Kutateladze, 2009). In A. thaliana, ING2 is involved in chromatin regulation by binding to the active histone marker H3K4me3/2 (Lee et al., 2009). Histone modifications, such as those promoted by ING2 and other PHD proteins, modulate the expression of crucial genes involved in flower development (López-González et al., 2014). Similarly, the histone chaperone ANTI-SILENCING FACTOR-1 (ASF1) homologue was recovered in all analyzed Zingiberales transcriptomes, while missing in D. thyrsiflora. ASF1 is a family of histone chaperones conserved in all eukaryotes (Triphathi et al., 2015), and in A. thaliana ASF1 is required for cell proliferation during development and is involved in transcriptional regulation of histones and histone modifications (Zhu et al., 2011). However interesting, further analyses are necessary to establish the potential role of histone modifications, and in particular the functions of ING2 and ASF1, in Zingiberales flower development.
In turn, various transcription factors were only recovered in the D. thyrsiflora floral transcriptome to the exclusion of the Zingiberales. Among these are a FLORICAULA/LEAFY homolog, a homolog of ODORANT1-like, a homolog of JUNGBRUNNEN 1-like, homologs of RAX-1, -2, and -3, as well as homologs of the transcription factors DPB, TT2-like, and GAMYB-like. In particular, a SOMBRERO-like homolog was retrieved only in D. thyrsiflora. SOMBRERO proteins, members of the NAC domain transcription factors, have been implicated in the control of cell division plane orientation in Arabidopsis thaliana Willemsen et al., 2008. Other regulators retrieved specifically in the Dichorisandra lineage include two STICHEL-like homologs, a homolog of UPSTREAM OF FLC-like, a TONSOKU -like homolog, two SAGA-like homologs, a TASSELSEED homolog, and a TITAN -like homolog.
Regulatory sequences retrieved exclusively within the banana lineage, represented by M. basjoo and O. fimbriata floral transcriptomes, include four CCA1-like homologs, six FLX2-like homologs, a KTI12-like homolog, a YABBY4-like homolog, a CPC homolog, and a SPATULA homolog represent transcription factors that were recovered exclusively in this group. Curiously, few coding sequences were uniquely reconstructed within the ginger clade, potentially due to the low coverage of the Co. spicatus transcriptome. Particularly interesting is the unique recovery of four AS1-like (ASYMMETRIC LEAVES-1) homologs and two DROOPING LEAF -like genes. Regulatory coding sequences uniquely reconstructed in the Canna-Calathea (Cannaceae-Marantaceae) lineage include a CUC2 homolog, a homolog of Arabidopsis EXORDIUM -like protein, two FAF -like homologs, and five SPX -like homologs.
A complete list of lineage specific transcription factors, sorted by plant transcription factor families characterized in the PlantTFDB, can be found in Supplemental Information S1.

DISCUSSION
Recently, there has been an explosion in the use of RNA-Seq approaches as part of a comparative analysis pipeline to study the evolution of developmental processes, using plant transcriptomes as an indication of differential gene expression among organisms with different phenotypic displays. This approach has become particularly important in non-model organisms that lack a reference genome or other genetic and bioinformatic tools that exist in plant model organisms like A. thaliana, rice, poplar or corn. Despite challenges assembling transcriptomic sequence data without a reference genome, researchers can determine the quality of their data based on the number, size and scores of the contigs assembled. The transcriptome data presented here are in agreement in terms of number of contigs, contig size distributions, and quality scores with those presented in the literature.
The study of mechanisms underlying floral diversification in plant lineages will likely point, in most cases, to at least three potentially concurrent scenarios: (i) tinkering of conserved mechanisms specific to flower development; (b) evolution of lineage-specific mechanisms resulting in novelty or change, or (c) co-option of non-flower mechanisms to elaborate specific aspects of flower development. The identification of these mechanisms, however, requires careful examination of exemplar species within a clearly delimited phylogenetic context. Also, careful choice of outgroup species might help the distinction between gain versus loss of molecular processes when analyzing lineage specific phenomena. Our data show that the inclusion of D. thyrsiflora significantly reduced the overall number of Zingiberales unique genes, as well as the number of lineage specific genes within the Zingiberales, potentially due to shared molecular mechanism during flower development. It is possible that the addition of other outgroups would further limit the lineage-specific datasets. The results presented here support previous assertions that annotation based on Blastn searches is highly influenced by phylogenetic proximity as well as genome sequence completeness and annotation quality, particularly when blasting against predicted CDS (Hornett & Wheat, 2012). Meanwhile, orthogroup analysis provides a wider view of less stringent relationships between trasncriptomes. Furthermore, the orthogroup analysis presented here reinforces the notion that gene duplications are a widespread phenomenon during plant evolution (Panchy, Lehti-Shiu & Shiu, 2016). Only 28 of the over 40,000 orthogroups identified comprised single copy genes in the transcriptomes and genomes analyzed.
The stringent filtering of the data performed with Blastn likely excluded several genes that could potentially participate in flower development across the Zingiberales and in the outgroup (D. thyrsiflora), and may even participate in floral evolution. However, due to this stringent cutoff, it is likely that the genes recovered are strong candidates for further studies. Functional analysis of the genes that emerge from these comparative datasets, coupled with careful phylogenetic assessments of specific gene families, will potentially refine the picture.
Perhaps the most significant results presented here are related to the set of shared floral transcription factors recovered for all taxa analyzed. Due to the nature of the methodology used, we believe there is sufficient evidence to support the presence of these genes in all floral transcriptome studies, making them likely floral development regulators and involved in not only floral development but, given their presence among and between lineages, suggesting that they are conserved regulators of floral evolution. Most of these genes and gene families have already been implicated in floral development in A. thaliana, but knowledge of their roles outside core eudicots is still poor. Their specific involvement in processes of morphological diversification has yet to be established.
Our results point to interesting differences between Zingiberales lineages. In particular, the presence of a YABBY4-like homolog in the banana lineages but not in the ginger cladewhere only a YABBY2-like homolog was reconstructed-might underlie developmental differences between these Zingiberales flowers. Information regarding the role of YABBY4 in comparative floral development remains sparse. Even though expression of YABBY4 (INNER NO OUTER) is restricted to the ovule integument (Villanueva et al., 1999) and seems to be conserved across angiosperms (Skinner et al., 2016), little is known about the presence of this gene in monocots other than rice, or pertaining the role it may play in ovule development within the monocot clade (Toriba et al., 2007;Morioka et al., 2015). Although it requires further evidence, the lineage specific gene set presented here might provide an interesting candidate gene list for further studies into the molecular mechanisms of floral development and diversification in the Zingiberales.
It is widely accepted that the ability to recover low expressed genes is related to transcriptome coverage (Grabherr et al., 2011;Martin & Wang, 2011;Tarazona et al., 2011). The high coverage of Calathea might explain the large number of genes recovered that appear unique to Cannaceae-Marantaceae, especially given the overrepresentation of transcription factors in this lineage. However, the total number of unique transcription factors between Canna and Calathea is similar to that observed in other lineages within the Zingiberales. Particularly interesting was the reconstruction of CUP-SHAPED COTYLEDON2 (CUC2) exclusively in the Cannaceae-Marantaceae lineage. The evolution and functional divergence of CUC genes (1-3) have been well studied in Arabidopsis (Hasson et al., 2011), although much less is known in monocots especially outside of the grasses. During flower development, CUC genes have been implicated in the formation of carpel margin meristems, although their role in plant development does not appear to be restricted to the flower (Kamiuchi et al., 2014). It is conceivable that the CUC gene copies play important roles, together with SPATULA homologs (SPT ) (Nahar et al., 2012), in carpel diversification in Zingiberales.
It is interesting to notice that AGAMOUS regulatory proteins were widely recovered in all transcriptomes, suggesting consistent levels of expression throughout the Zingiberales and outgroup developing flowers. This might support the evolution of several regulatory mechanisms of AGAMOUS expression during flower development, bringing redundancy and indicating the critical nature of AGAMOUS regulation. In turn, it may suggest that variations of AGAMOUS expression might lead to floral morphological diversification, a mechanism that has already been proposed to participate in Zingiberales flower evolution (Almeida et al., 2015b).
Because expression levels can interfere with the ability to reconstruct specific genes, it is possible that some of the differences observed in lineage-specific transcriptome reconstructions, particularly the absence of transcripts, are due to low or restricted expression within the developing flower. It is imperative that further studies are carried out, especially comparative spatial-temporal expression studies, to further unravel the role of these transcription factors in floral morphological variation. Comparisons based on expression levels of shared genes, as well as protein-protein or protein-DNA interactions, can certainly reveal other levels of developmental divergence. Expression levels were not calculated here, due to the lack of replicates for each floral transcriptome. Also, despite the interesting findings discussed here regarding coding sequences and, in particular transcription factors, further analysis is needed to fully uncover the mechanisms underlying floral developmental evolution. A careful analysis of non-coding sequences might revel other layers of gene regulation and function that were not explored in this work. The complexity of the molecular mechanisms underlying floral development cannot be underestimated. Thus, we believe that further investigations are needed to achieve a full understanding of the molecular processes underlying flower developmental evolution in the Zingiberales.
Despite limitations, we believe the transcriptome analysis presented here sheds light on interesting phenomena that might underlie molecular mechanisms of flower developmental evolution. In particular, the consistent recovery of distinct homologs for various genes families in closely related evolutionary lineages is a pattern that suggests the need for further studies. The complex patterns of gene duplications in plants, although daunting, provides an exciting opportunity for the study of the relationship between genes, functions and morphological diversification.