Comprehensive transcriptome analysis reveals genes potentially involved in isoflavone biosynthesis in Pueraria thomsonii Benth

Pueraria thomsonii Benth is an important medicinal plant. Transcriptome sequencing, unigene assembly, the annotation of transcripts and the study of gene expression profiles play vital roles in gene function research. However, the full-length transcriptome of P. thomsonii remains unknown. Here, we obtained 44,339 nonredundant transcripts of P. thomsonii by using the PacBio RS II Isoform and Illumina sequencing platforms, of which 43,195 were annotated genes. Compared with the expression levels in the plant roots, those of transcripts with a |fold change| ≥ 4 and FDR < 0.01 in the leaves or stems were assigned as differentially expressed transcripts (DETs). In total, we found 9,225 DETs, 32 of which came from structural genes that were potentially involved in isoflavone biosynthesis. The expression profiles of 8 structural genes from the RNA-Seq data were validated by qRT-PCR. We identified 437 transcription factors (TFs) that were positively or negatively correlated with at least 1 of the structural genes involved in isoflavone biosynthesis using Pearson correlation coefficients (r) (r > 0.8 or r < -0.8). We also identified a total of 32 microRNAs (miRNAs), which targeted 805 transcripts. These miRNAs caused enriched function in ‘ATP binding’, ‘defense response’, ‘ADP binding’, and ‘signal transduction’. Interestingly, MIR156a potentially promoted isoflavone biosynthesis by repressing SBP, and MIR319 promoted isoflavone biosynthesis by repressing TCP and HB-HD-ZIP. Finally, we identified 2,690 alternative splicing events, including that of the structural genes of trans-cinnamate 4-monooxygenase and pullulanase, which are potentially involved in the biosynthesis of isoflavone and starch, respectively, and of three TFs potentially involved in isoflavone biosynthesis. Together, these results provide us with comprehensive insight into the gene expression and regulation of P. thomsonii.

Introduction Pueraria thomsonii Benth belongs to the perennial leguminous plants, and it is an important Chinese medicinal herb that is mainly distributed in East Asian countries, such as China and Japan. The dried roots of P. thomsonii were first described in the Shengnong Bencao Jing as a pungent diaphoretic drug, which could be used to treat influenza, body stiffness, and other illnesses [1]. The main active isoflavones in P. thomsonii are puerarin and daidzein, which accumulates in roots for Pueraria plants [1][2][3][4][5]. Modern pharmacological research has shown that puerarin can be used to treat diabetes and vascular hypertension [6], as well as to promote osteogenic differentiation and bone formation [7]. Daidzein can be used to decrease blood alcohol levels [8]. The biosynthesis of isoflavone is via phenylpropanoid and isoflavonoid hybrid-pathway [3][4][5][6]9,10]. For the phenylpropanoid pathway, starting from phenylalanine, liquiritigenin and naringenin are formed through the successive catalysis of phenylalanine ammonia-lyase (PAL), trans-cinnamate4-monooxygenase (CA4H) and 4-Coumarone Coenzyme A Ligase (4CL), 6-deoxychalcone synthase (CHS) and chalcone isomerase (CHI). The isoflavonoids biosynthetic pathway has been studied intensively in Pueraria lobata (Willd.) Ohwi, another important Pueraria plant. Firstly, liquiritigenin or naringenin are hydroxylated by 2-hydroxyisoflavanone synthase (IFS) and 2-hydroxyisoflavone dehydrate (HID) to form daidzein, genistein and formononetin. Then these isoflavone scaffolds are further modified by glycosyltransferases and methyltransferases, yielding to diverse isoflavonoids [3][4][5]10]. In addition, the roots of P. thomsonii are rich in starch and have a high pasting temperature, gelatinization temperature and enthalpy [11].
However, current research on the molecular biology of P. thomsonii is limited. EST information is not available, and only 46 nucleotide sequences have been deposited in the NCBI database. Additionally, there have not been any studies on protein-coding candidate genes involved in isoflavone biosynthesis in P. thomsonii, and microRNAs (miRNAs) in P. thomsonii have not been reported. miRNAs are a class of 20-24 nt noncoding RNAs that play vital roles in plant morphogenesis, biotic and abiotic stresses, and the synthesis of secondary metabolites through either cleavage of their targets or by repressing target translation at the posttranscriptional level [12,13].
For species without genome information, high-throughput RNA sequencing and data analysis are powerful tools to mine for genes that have specific functions. Compared to the shortread Illumina sequencing platform, PacBio Iso-Seq can obtain long reads of up to 10 kb and does not require reconstruction of the transcripts, making it convenient and accurate to find splice variants [14]. The full-length transcriptome has been previously described for some other medicinal plants, such as Salvia miltiorrhiza [14] Dendrobium officinale [15], Astragalus membranaceus [16] and Cassia obtusifolia [17].
In this study, we used PacBio RS II and illumina sequencing to obtain the full-length transcripts from P. thomsonii and carry out comprehensive analysis. In addition to the structural genes, we found regulatory genes, such as transcription factors (TFs) and miRNAs were potentially involved in isoflavone biosynthesis. We also identified alternative splicing (AS) events, some of which were related to the biosynthesis of isoflavone and starch. These results lay the foundation for breeding P. thomsonii cultivars with high isoflavone compounds through engineering.

Plant materials and RNA extraction
Two-year-old "Enge-8" plants were grown in Huazhong Medicinal Botanical Garden of Institute of Chinese Medicinal Materials, Hubei Academy of Agricultural Sciences (Enshi, China).

Illumina transcriptome library construction and sequencing
The Poly (A) mRNA from 9 samples (three organs from three plants) was enriched from the total RNA using oligo (dT) magnetic beads. Following the enrichment, the mRNA was fragmented into small pieces. Using these short fragments as templates, the first-strand cDNA was synthesized using Superscript III reverse transcriptase and random hexamer (N6) primers. Subsequently, the RNA templates were removed, and the second-strand cDNA was synthesized using dNTPs, DNA polymerase I, and RNase H. These short double cDNA fragments were purified with AMPure XP beads. After end reparation and A-tailing, the short cDNA fragments were ligated with the Illumina paired-end adaptors and purified with AMPure XP beads. Then, PCR was used to selectively enrich DNA fragments with adapter molecules on both ends and to create the final cDNA library. The concentration of the cDNA library was assessed using the Qubit2.0 fluorometer (Life Technologies, Carlsbad, CA, USA), and the quality of the cDNA library was measured using the Agilent 2100 Bioanalyzer (Santa Clara, CA). Finally, the libraries were sequenced from both the 5' and 3' ends using the Illumina sequencing system (Illumina, San Diego, CA, USA).

PacBio data analysis
SMRT Analysis software package (v2.3.0) was used for data processing. First, raw reads were error corrected and processed into reads of insert (ROIs) using the ToFu pipeline [18]. Next, full-length, nonchimeric (FLNC) transcripts were identified by searching for the polyA tail signal and the 5' and 3' cDNA primers in ROIs. ICE (Iterative Clustering for Error Correction) was used to obtain consensus isoforms, and FL consensus sequences from ICE were polished using Quiver [19]. FL transcripts with postcorrection accuracy above 99% were generated for further analysis. All the isoforms were corrected with the help of the Illumina short reads using the Proovread tool [20] with default settings. Finally, the longest isoform from each cluster was regarded as the transcript to be followed up on with correction by CD-HIT software (v4.6) [21] with the following parameters: -c 0.99 -G 0 -aL 0.00 -aS 0.99 -AS 30 -M 0 -d 0 -p 1 to remove all redundant transcripts.

Detection of AS events
The AS events were obtained as previously described [16]. First, the nonredundant transcripts were processed with the Coding GENome reconstruction Tool (Cogent v1.4, https://github. com/Magdoll/Cogent). In general, Cogent first creates the k-mer profile of nonredundant transcripts, calculates pairwise distances, and then clusters transcripts into families based on their k-mer similarity. Each transcript family is further reconstructed into one or several unique transcript model(s) (UniTransModels) using a De Bruijn graph method. Then, nonredundant transcripts were mapped to UniTransModels using GMAP (v2014-08-04) [22]. Splicing junctions for transcripts that mapped to the same UniTransModels were examined, and transcripts with the same splicing junctions were collapsed. Collapsed transcripts with different splicing junctions were identified as transcription isoforms. AS events were detected with SUPPA using the default settings [23].

Illumina data processing and Quantification of gene expression levels
Raw reads in FASTQ format were first processed using in-house Perl scripts. Clean reads were obtained by removing reads containing adapters, reads containing poly-N and low-quality reads. Gene expression levels were determined by calculating the sum of the fragments mapping to each transcript and were then normalized by converting the fragment counts to fragments per kilobase of transcript per million mapped reads, or FPKM, and were then normalized to the nonredundant transcripts. Differential expression analysis was performed using DESeq (v 1.10.1) [31]. Genes that had a |fold change| � 4 and FDR < 0.01 found by DESeq were assigned as differentially expressed.

MiRNA identification and target prediction
Known microRNA sequences were downloaded from miRBase (Released 22, http://www. mirbase.org) [32]. Due to the conservation of miRNAs, the known sequences of the miRNAs in the miRBase database were compared with the nonredundant transcript data with the use of predicted precursor module "psRobot-mir" of the psRobot software [33]. The criteria described by Meyers et al were applied to check P. thomsonii microRNAs manually [34,35]. When two or more miRNAs were located at the same precursor, the sequences of the Glycine max or Medicago truncatula miRNA were chosen as a reference. The minimal folding free energy index (MFEI) was calculated as described previously [36]. Targets of the microRNAs were predicted by the psRNATarget Server using a penalty score � 3 [37].

Validation the expression profile of transcripts by qRT-PCR
For each tissue type, the RNA samples extracted from three individuals were used as biological replicates for qRT-PCR analyses. Each biological replicate had three technical replicates. Then, RNA was digested with RNase-free DNase I (Promega) to remove genomic DNA contamination. Reverse transcription was performed using 1 μg total RNA for each example with 200 U M-MLV Transcriptase (Takara) in a 20 μl volume. The reaction was carried out at 70˚C for 10 min, 37˚C for 60 min and 70˚C for 15 min. The resulting cDNA was diluted to 800 μl with sterile water. qPCR was carried out in triplicate reactions using the BIO-RAD CFX system (BIO-RAD). Genespecific primers were designed by Primer3 (http://bioinfo.ut.ee/primer3/). The primers used in this study are listed in S1

Combined sequencing of P. thomsonii transcripts
To deeply mine the gene information of P. thomsonii, sequencing with the PacBio RS II and Illumina platforms was performed simultaneously. First, full-length cDNA libraries from the pooled poly (A) RNA samples were constructed and subjected to SMRT sequencing by the Pac-Bio RS II platform. To avoid the preference of different transcripts based on sequence length, 1-3 kb and 3-6 kb libraries were constructed and sequenced by 2 SMRT cells. In total, 300,584 polymerase reads were generated for these two libraries. After discarding the adapter sequence and sequences shorter than 50 bp, 207,253 and 237,356 subreads representing 5.13 Gb and 5.88 Gb were obtained, respectively (S2 Table). Next, 160,327 and 147,869 ROIs with high quality were extracted with a mean length of 1,646 bp and 3,270 bp, respectively (S3 Table). Next, we further distinguished the full-length nonchimeric sequences from the non-full-length sequences (S4 Table). Based on the clustering algorithm of ICE, we integrated the 1-3 kb and 3-6 kb libraries and found 76,347 consensus isoforms with a mean length of 2,251 bp. These data contained 62,400 high-quality transcripts and 13,947 low-quality transcripts ( Table 1). All the 76,347 consensus isoforms were further corrected using the Illumina read data (S5 Table) to improve quality. Finally, after removing the redundant sequences and the cluster of the low-quality transcripts using CD-HIT (c = 0.99), a total of 44,339 nonredundant transcripts were obtained, which were further annotated for downstream analysis.
For Illumina sequencing, 9 mRNA samples from the leaves, stems and roots (each in triplicate) of P. thomsonii were subjected to sequencing by Illumina HiSeq. As a result, we obtained 23 Table). The Illumina sequencing reads were not assembled alone because approximately 80% of them mapped to the 44,339 nonredundant transcripts (S6 Table).

Gene function annotation and categorization
To categorize the nonredundant transcripts, sequences were compared against Nr, Swissprot, KEGG, GO, COG, eggNOG, KOG and Pfam using BLASTX; as a result, 43,083, 32,079, 17,391, 24,720, 40,870, 25,626 and 35,525 transcripts were found to be annotated in the abovementioned databases, respectively. These results are summarized in S7 Table, and a total of 43,195 transcripts were annotated. The databases COG, eggNOG, KOG and Pfam distinguish proteins by different domains/families. GO terms describe the functions of these cross-species homologous genes and their gene products, which are further divided into biological processes, cellular locations and molecular functions. In this study, 45,138, 17,048 and 37,654 GO terms were obtained for the three categories.

Identification of structural genes involved in isoflavone and starch biosynthesis
Isoflavones are the main active compounds of P. thomsonii. However, isoflavone-related genes in P. thomsonii have not been reported to date. To fill this gap, we first annotated the 44,339 nonredundant transcripts of P. thomsonii with KEGG database. In total, 17,391 transcripts were annotated, which could be divided into 128 categories (S8 Table). Among them, carbohydrate metabolism (ko01200), biosynthesis of amino acids (ko01230), ribosome (ko03010), protein processing in endoplasmic reticulum (ko04141) and spliceosome (ko03040) ranked in the top five. In total, 4,134 and 2,072 transcripts were involved in metabolic pathways and the biosynthesis of secondary metabolites, respectively. Based on KEGG and Nr annotation, we annotated a total of 69 transcripts that could be associated with the isoflavone biosynthesis (S1 Fig,  S9 Table). Compared with P.lobata, some genes were absent in P. thomsonii, such as 3'-Omethyltransferase, isoflavone-specific 4'-O-methyltransferase. These data greatly enrich our knowledge of isoflavone biosynthesis and provide candidate genes for genetic engineering to enhance the production of isoflavone compounds in P. thomsonii.
The roots of P. thomsonii are called Fen-Ge in Chinese because they are rich in starch [1]. Therefore, it is important to mine starch-related genes in P. thomsonii. Starch consists of amylose and amylopectin. Starch biosynthesis is a complex process that requires the coordinated activity of multiple enzymes. It is reported that amylose is mainly produced via granule-bound . In this study, we found that 10 families of genes were involved in starch synthesis (S10 Table), including GBSS Ⅰa, uncharacterized GBSS, GBSS Ⅱ, SSⅢ, SSⅣ, SSSⅠ, SSSⅢ, pullulanase-DBE (PULⅠ), PHO1 and PHO1-like.

Analysis of TFs
It has been reported that TFs play important roles in plant growth and development and in the biosynthesis of secondary metabolites [9]. Until recently, TFs have not previously been reported in Pueraria plants. To analyze TFs in P. thomsonii, sequences of the nonredundant transcripts were compared to the sequences in the iTAK database using the default parameters [41]. As a result, 2,079 TF transcripts were identified, which could be classified into 66 families (S11 Table). Among them, the bHLH, C3H, AP2/ERFERF, bZIP and C2H2 families of TFs ranked in the top five, followed by the MYB-related, B3-ARF, WRKY, FAR1, GRAS, and MYB TF families.

Identification of DETs related to isoflavone biosynthesis
Then, the expression levels of the nonredundant transcripts were determined by FPKM. Compared with the expression levels in the plant roots, those of transcripts with a |fold change| � 4 and FDR < 0.01 in the leaves or stems found by DESeq were assigned as DETs, because the roots accumulate more isoflavone compounds [1][2][3][4][5] and some structural genes related to isoflavone express higher in roots [3][4][5][6]10]. In total, 9,225 DETs were found (S12 Table). GO enrichment showed that the top 5 categories of DETs were genes associated with oxidationreduction process (GO: 0055114), ATP binding (GO:0005524), integral component of membrane (GO:0016021), metal ion binding (GO:0046872) and membrane (GO:0016020) (S13 Table). Among the 9,225 DETs, the comparison of the roots and leaves resulted in 8,138 DETs. Of these, 4,610 were upregulated, and 3,528 were downregulated. KEGG enrichment was carried out, and the top 20 enriched pathways, including the phenylpropanoid-and isoflavone-biosynthesis pathways, are shown in Fig 2. A total of 4,381 DETs were found when the roots and stems were compared. Among these, 2,687 were upregulated, while 1,694 were downregulated. The phenylpropanoid-and isoflavone-biosynthesis pathways were also enriched (S2 Fig). Of the 9,225 total DETs, 36 were annotated as encoding structural biosynthesis genes (S8 Table), and 32 of these were potentially involved in isoflavone synthesis because they are more highly expressed in roots than in leaves or stems.
In addition to the structural genes, the data were further analyzed to find other regulatory genes, such as TFs, that were potentially involved in the isoflavone biosynthesis. Among the 9,225 total DETs, 508 were predicted to be TFs. We hypothesized that isoflavone-related TFs should share highly correlated expression profiles with those of structural genes. Therefore, we calculated Pearson's correlation coefficients (r) of the expression profiles of the 508 TFs with those of CHS (PB5670), CHI (PB36892), HID (PB16218), IFS (PB13397), PlUGT2 (PB16388), IF7MAT (PB14269) and PlUGT43 (PB16388). As a result, we found a total of 437 TFs that were coexpressed with at least one of the 7 structural genes involved in isoflavone biosynthesis using the cutoff r > 0.8 or r < -0.8 (S15 Table). Among the 437 TFs, 57, 12, 94, 173, 74, 20, and 7 were positively or negatively correlated with the 7, 6, 5, 4, 3, 2, and 1 structural gene, respectively (S15 Table). The 437 TFs could be classified into 47 families, and the MYB, HB-HD-ZIP, bHLH, C2H2, WRKY, and bZIP TF families were enriched. These data provided a foundation for further studies of the TFs involved in isoflavone biosynthesis.

Predicted microRNAs and their targets
In order to decipher the regulatory role of miRNAs in isoflavone biosynthesis in P. thomsonii, we identified the miRNAs and analyzed their target genes. Because P. thomsonii small RNAs were not sequenced, based on the conservation of miRNAs, we used psRobot software to predict the P. thomsonii miRNAs by comparing the known miRNAs in miRBase (version 22) to the nonredundant transcripts to find matches with no more than three mismatches and less than two bulges in the hairpin structures. In total, we identified 32 miRNAs that belong to 12 families (Table 2).
Among the identified miRNAs, MIR156, MIR171, MIR398, MIR1507 and MIR2119 had two or three members, while MIR167, MIR171, MIR319, MIR390, MIR396 and MIR408 had only one member. Some hairpin structures of miRNA precursors are shown in Fig 4. Most miRNA precursors had MFEIs higher than 0.8, which was consistent with previous reports [36]. Interestingly, we found that the precursors of MIR398c and MIR2119b clustered at the PB273 transcript ( Fig 5).
Moreover, these two miRNAs had no sequence similarity. In plants, clustering of miRNAs was frequently found in miRNAs from the same family or from different families but with sequence similarity [42].
The identified miRNAs targeted 805 transcripts (S16 Table), 608 of which were annotated. GO enrichment showed that the top 10 gene categories targeted by the identified miRNAs were those associated with ATP binding ( ). Interestingly, we found that SBP (PB22265, PB8085) was targeted by MIR156a. The expression profile of PB22265 was negatively correlated with that of IFS, and PB8085 was negatively correlated with CHS and IFS (S15 Table). Therefore, we inferred that MIR156a promotes isoflavone biosynthesis by repressing SBP (PB22265, PB8085). We also found that TCP (PB17199) and HB-HD-ZIP (PB19367) were targeted by MIR319. The expression profile of PB17199 was negatively correlated with that of CHS, CHI, HID, IFS, PlUGT2, IF7MAT and PlUGT43, and the expression profile of PB19367 was negatively correlated with that of CHS, CHI, and IFS (S15 Table). Therefore, we inferred that MIR319 promotes isoflavone biosynthesis by repressing TCP and HB-HD-ZIP.

AS identification
In order to mine the role of AS in the biosynthesis of isoflavone and starch, we identified and analyzed AS in P. thomsonii. Compared to assembling unigenes using the Illumina short reads by Trinity, PacBio Iso-Seq can detect AS from the nonredundant transcripts using a de novo alignment method, therefore identifying true AS events. We first constructed the full-length UniTransModels using Cogent (https://github.com/Magdoll/Cogent). Then, AS events were detected by aligning the nonredundant transcripts back to the UniTransModels with SUPPA using the default setting. In this study, we found a total of 1,482 UniTransModels occurrences of AS, involving 2,690 AS events (S17 Table). We found 1,558 instances of intron retention (IR), 143 of exon skipping (ES), 525 of alternative 3' (A3), and 464 of alternative 5' (A5). Among the 1,482 UniTransModels, 876 had only one AS instance, while in 606 UniTransModels, AS occurred more than once. We found 812 UniTransModels that could be assigned GO terms. GO enrichment showed that ATP binding (GO:0005524), integral component of membrane (GO:0016021), zinc ion binding (GO:0008270), nucleus (GO:0005634) and protein phosphorylation (GO:0006468) ranked in the top five categories of AS events, followed by oxidation-reduction process (GO:0055114), membrane (GO:0016020), protein serine/threonine kinase activity (GO:0004674), metabolic process (GO:0008152) and nucleotide binding (GO:0000166). Interestingly, we found that the structural genes of UniTransModel 8313_0| path1 encoding CA4H had IR AS occurrence, and UniTransModel 1409_0|path1encoding PUL had A3, A5 and IR AS events. CA4H and PUL are potentially involved in the biosynthesis of isoflavone and starch, respectively. Three UniTransModels encoding TFs that are potentially involved in isoflavone biosynthesis had AS events (S15 and S17 Tables). 5170_2|path7 encoding WRKY incurred IR, while 8167_0|path0 encoding MYB59-like protein had both A5 and IR AS events. 1759_0|path1encoding HB also had an A5 AS occurrence.

Discussion
In this study, we sequenced the full-length transcriptome of P. thomsonii for the first time and analyzed the gene expression profile. In total, we obtained 44,339 nonredundant transcripts, 43,195 of which were annotated. We found 9,225 DETs, and 32 of these were structural genes related to isoflavone P. lobata and P. thomsonii are the most important and widely used Pueraria plants. These two species were separated in 2005 because of differences in their isoflavone compounds. The roots of P. lobata and P. thomsonii are named Ge-Gen and Fen-Ge, respectively, in the Chinese Pharmacopoeia [1]. On the one hand, Ge-Gen contains higher levels of isoflavone compounds than Fen-Ge. Du et al found 14 major compounds in the roots of P. lobata and P. thomsonii by UHPLC-DAD-TOF-MS and found that the contents of puerarin ranged from 3.64-6.98 mg/g in P. thomsonii and 7.56-61.31 mg/g in P. Lobata [2]. Puerarin, mirificin and daidzin may serve as chemical markers to distinguish the two Pueraria species [2]. On the other hand, it was also reported that tectorigenin was primarily derived from the flowers of P. thomsonii, which had a potential beneficial role in Parkinson's disease via oxidative stress inhibition [43] and in Attenuates palmitate-induced endothelial insulin resistance [44]. Therefore, it is important to look for genes that may allow for breeding of P. thomsonii cultivars with high contents of special isoflavone compounds to extend its usage. In this study, we found structural genes that were potentially involved either upstream or downstream of the isoflavone pathway (Fig 3). One of these genes is plUGT43, which had been identified to function in C-glucosylation in P. lobata [5], indicating a common pathway in these two species. However, some structural genes in P. lobata were not found in P. thomsonii. One such gene is 3'-OMT, which converts 3'-hydroxy-daidzein to 3'-methoxy-daidzein [3], indicating the existence of some individual pathways in the two species or different spatiotemporal gene expression. Improving starch content is another breeding objective for P. thomsonii because starch is an important resource in food, feed and industrial material. In this study, we found 10 families of structural genes involved in starch synthesis. Because of the spatiotemporal expression specificity, some previously described starch-related genes [40] were absent from our transcriptome data. Future studies should integrate multiple 'omics' at different stages of tube development to reveal the mechanism of starch synthesis and to improve the yield of P. thomsonii.
In addition to the structural genes, we also studied regulatory genes, including TFs and miRNAs. TFs can regulate the expression levels of multiple flavonoid biosynthesis genes, which have drawn more attention in recent years. For example, in Arabidopsis, the expression levels of PAL1, CHS and FLS in GmMYB12B2-overexpressing plants were significantly higher than those in wild-type plants [45]. In soybean, GmMYB29 promoted isoflavone biosynthesis by activating the IFS2 and CHS8 gene promoters [9]. GmMYB39 inhibited isoflavone biosynthesis by repressing the transcript levels of PAL, C4H, CHS, 4CL and CHR in soybean [46]. Here, we found that the expression profile of MYB39 (PB15357) was negatively correlated with that of CHS, CHI, IFS, IF7MAT and plUGT43 in P. thomsonii. In addition to MYBs, we also found TFs of WRKY, C2C2, HB, GRAS, bHLH, GARP and SBP that were potentially involved in isoflavone biosynthesis by activating or repressing multiple structural genes (S15 Table). In this study we only sequenced the transcriptome, which contains at most the 5' untranslated regions of genes. Because the genome of P. thomsonii has not been released to date, we did not know the promoter sequences of interesting genes so that we could not analyze TF binding elements in their promoters region.
In this study, we also identified 33 miRNAs that targeted 805 transcripts involved in multiple biological processes. Of the 608 annotated targets, 169 were related to disease resistance, indicating the important role of P. thomsonii miRNAs in the defense response. Moreover, we found that MIR156a potentially promoted isoflavone biosynthesis by repressing SBP, while MIR319 promoted isoflavone biosynthesis by repressing TCP and HB-HD-ZIP.
The identification of these structural genes and their associated regulatory genes could allow for the future gene engineering via synthetic biology to increase isoflavonoid content in P. thomsonii. Taken together, these results provide comprehensive insight into the gene expression and regulation of P. thomsonii.