Combined analysis of transcriptome and metabolite data reveals extensive differences between black and brown nearly-isogenic soybean (Glycine max) seed coats enabling the identification of pigment isogenes

Background The R locus controls the color of pigmented soybean (Glycine max) seeds. However information about its control over seed coat biochemistry and gene expressions remains limited. The seed coats of nearly-isogenic black (iRT) and brown (irT) soybean (Glycine max) were known to differ by the presence or absence of anthocyanins, respectively, with genes for only a single enzyme (anthocyanidin synthase) found to be differentially expressed between isolines. We recently identified and characterized a UDP-glycose:flavonoid-3-O-glycosyltransferase (UGT78K1) from the seed coat of black (iRT) soybean with the aim to engineer seed coat color by suppression of an anthocyanin-specific gene. However, it remained to be investigated whether UGT78K1 was overexpressed with anthocyanin biosynthesis in the black (iRT) seed coat compared to the nearly-isogenic brown (irT) tissue. In this study, we performed a combined analysis of transcriptome and metabolite data to elucidate the control of the R locus over seed coat biochemistry and to identify pigment biosynthesis genes. Two differentially expressed late-stage anthocyanin biosynthesis isogenes were further characterized, as they may serve as useful targets for the manipulation of soybean grain color while minimizing the potential for unintended effects on the plant system. Results Metabolite composition differences were found to not be limited to anthocyanins, with specific proanthocyanidins, isoflavones, and phenylpropanoids present exclusively in the black (iRT) or the brown (irT) seed coat. A global analysis of gene expressions identified UGT78K1 and 19 other anthocyanin, (iso)flavonoid, and phenylpropanoid isogenes to be differentially expressed between isolines. A combined analysis of metabolite and gene expression data enabled the assignment of putative functions to biosynthesis and transport isogenes. The recombinant enzymes of two genes were validated to catalyze late-stage steps in anthocyanin biosynthesis in vitro and expression profiles of the corresponding genes were shown to parallel anthocyanin biosynthesis during black (iRT) seed coat development. Conclusion Metabolite composition and gene expression differences between black (iRT) and brown (irT) seed coats are far more extensive than previously thought. Putative anthocyanin, proanthocyanidin, (iso)flavonoid, and phenylpropanoid isogenes were differentially-expressed between black (iRT) and brown (irT) seed coats, and UGT78K2 and OMT5 were validated to code UDP-glycose:flavonoid-3-O-glycosyltransferase and anthocyanin 3'-O-methyltransferase proteins in vitro, respectively. Duplicate gene copies for several enzymes were overexpressed in the black (iRT) seed coat suggesting more than one isogene may have to be silenced to engineer seed coat color using RNA interference.

the black (iRT) seed coat suggesting more than one isogene may have to be silenced to engineer seed coat color using RNA interference.

Background
Flavonoids represent a class of plant secondary metabolites that have evolved a variety of physiological functions including pigmentation, pathogen defense, and UV protection [1]. Additionally, metabolic engineering of flavonoids has become an important target for plant biotechnology, as flavonoids provide health benefits to foods, favorable agronomic traits to crops, and may be used in the future to color commercial transgenic materials such as grains to facilitate their identification and monitoring [2][3][4].
Commercial soybean (Glycine max (L.) Merr.) has a yellow grain. However rare spontaneous mutants exist that have black (iRT) or brown (irT) seed coat (testa) color phenotypes. Black (iRT) and brown (irT) soybean seed coats contain proanthocyanidins (PAs, a.k.a. condensed tannins) but differ in the presence/absence of anthocyanins [5]. A goal for biosafety is to engineer a novel red seed coat color as a marker for transgenic soybean grains to facilitate their identification [4], and could potentially be achieved by the suppression of anthocyanin-specific genes that are overexpressed in the black soybean seed coat. However the genes have not yet been identified.
Six genetic loci (I, R, T, Wp, W1, and O) [6] identified by classical genetics control flavonoid-based seed coat color in soybean. The I locus controls the presence or absence and spatial distribution of flavonoid pigments and has four alleles (I, I i , I k , i); I gives completely nonpigmented seed coat, I i restricts pigment to the hilum and I k to a saddle-shaped region, whereas the i allele results in a fully pigmented seed coat [6]. The recessive i allele results from spontaneous deletion of CHS4 or CHS1 promoter sequences and results in the abolishment of a posttranscriptional RNA silencing mechanism that results in the increased accumulation of chalcone synthase (CHS) transcripts in the seed coat [7,8].
The T locus is a pleoiotropic locus that controls the type and abundance of flavonoid pigments in the seed coat in addition to other traits such as seed coat cracking and trichome pigmentation [5,9,10]. Genetic polymorphisms that affect the expression or function of the flavonoid 3'-hydroxylase gene (F3'H1) have been shown to co-segregate with recessive t alleles [11,12].
The W1 locus controls flower color and affects seed color only in an iRt background; where W1 and w1 alleles give imperfect black and buff seed coat colors, respectively [6]. The W1 allele for purple flower color was shown to code flavonoid 3',5'-hydroxylase (F3'5'H), as a 65-bp insertion in the gene (F3'5'H) co-segregated with white flower color (w1) [13].
The Wp locus was suggested to code the flavonone 3hydroxylase gene (F3H1) by microarray analysis as high levels of F3H1 transcripts co-segregated with purple flower color (Wp), and low levels with pink (wp) flowers [14]. In the seed coat, recessive wp resulted in a change from black (iRTWp) to a lighter grayish (iRTwp) color [14].
The O locus affects the color of brown (irTO) seed coats, with the recessive o allele giving a red-brown (irTo) phenotype [6]. The O locus has been suggested to code the proanthocyanidin (PA, a.k.a. condensed tannin) biosynthesis gene anthocyanidin reductase (ANR), as the gene was located between markers that flank the O locus on the soybean physical map [15]. However molecular genetics analyses have not yet demonstrated the identity of the O locus gene.
Finally, the R locus controls the presence (R) or absence (r) of anthocyanins in black (iRT) or brown (irT) seed coats, respectively [16]. Despite the identification of several pigment genes from soybean, only transcripts for anthocyanidin synthase (ANS) genes (ANS23-1, ANS27-1 and ANS100) have been shown to be overexpressed in the black (iRT) seed coat compared to the brown (irT) nearly-isogenic line by northern blot [17]. The upregulation of several ANS genes suggests the R locus to code a regulatory factor, and raises the possibility that other isogenes for anthocyanin biosynthesis may be upregulated.
Recently, a cDNA coding the UDP-glycose:flavonoid-3-O-glycosyltransferase (UF3GT) gene (UGT78K1) was isolated from the black (iRT) seed coat and shown to function in anthocyanin biosynthesis in vitro and by complementation of a gene mutation in Arabidopsis [18]. However UGT78K1 expressions have not been investigated in relation to seed coat color.
The soybean genome sequence Glyma1 was predicted to code 46,430 protein-coding genes with nearly 75% of the genes present in multiple copies [19]. This may suggest a relatively high frequency of functional redundancy and increased difficulty in identifying soybean genes by traditional approaches. However, using transcriptome analysis tools such as the Soybean GeneChip equipped with 37,500 probe sets in combination with broad-coverage metabolite analysis methodologies such as LC-MS/MS, gene functions could potentially be efficiently predicted. The combined analysis of transcriptome and metabolite data has been shown to be a powerful approach for the functional identification of unknown genes [20][21][22]. Metabolite differences caused by the overexpression of a transcription factor, the exposure to a nutritional stress, and by species differences have been correlated with differences in transcriptome profiles to successfully predict the functions of unknown genes in flavonoid, glucosinolate, and alkaloid biosyntheses [21][22][23].
In the present study we have employed targeted and non-targeted metabolite analysis methodologies and have demonstrated that black (iRT) and brown (irT) nearly-isogenic soybean seed coats do not just differ in the presence/absence of anthocyanins, and have extensive differences in procyanidin, (iso)flavonoid and phenylpropanoid compositions. The underlying differences in gene expressions were then identified by microarray analysis, and the putative functions of 20 unknown genes were assigned by comparison to metabolite data. From the set of differentially regulated genes, two putative late-stage anthocyanin genes were selected and the functions of their coded enzymes were validated in vitro. In addition, a set of gene candidates potentially coded by the R locus have been provided.

Results
qRT-PCR indicates differential expressions of anthocyanin and proanthocyanidin genes in nearly-isogenic black (iRT) and brown (irT) soybean seed coats Among the flavonoid genes identified from soybean seed coats, only ANS transcripts were found to be overexpressed in the seed coat of black (iRT) soybean compared to the brown (irT) isoline, however no RFLP difference was observed when probing ANS genes [17]. This suggests that ANS overexpression is not the only molecular difference between black (iRT) and brown (irT) seed coats. To examine whether other anthocyanin genes are upregulated, transcript profiles of the recently identified UGT78K1 [18] and the DFR1 gene [GenBank: AF167556] were measured from seed coat cDNA by quantitative RT-PCR (qRT-PCR) using ANS2/ANS3 transcripts [GenBank:AY382829/GenBank: AY382830] as a positive control. Phosphenol pyruvate carboxylase (PEPC) [GenBank: D10717] was used as an endogenous reference for normalization of the qRT-PCR measurements of target genes across seed coat samples, as PEPC is expressed at similar levels in a wide range of soybean tissues and has been used previously as a reference for gene expressions during soybean seed coat development [8,24].
qRT-PCR demonstrated UGT78K1 to be significantly overexpressed in the black (iRT) seed coat at the transition stage (300 mg -400 mg FSW) and early maturation stage (400 -500 mg FSW) of seed development compared to the brown (irT) isoline, and to be almost undetectable at earlier stages ( Figure 1). Stereomicroscopy and photospectroscopic measurements confirmed these developmental stages to coincide with the stages of anthocyanin biosynthesis in the black (irT) seed coat Relative expression Transcript profiles from black (iRT) and brown (irT) seed coats during development measured by qRT-PCR. Black (iRT) soybean (black bars) and brown (irT) soybean (white bars). Student's t test significant at P < 0.01 (**), student's t test significant at P < 0.05 (*).
( Figure 2). In contrast to UGT78K1, DFR1 was not differentially expressed at the stages of anthocyanin biosynthesis ( Figure 1). ANS2/ANS3 mRNAs were confirmed to be overexpressed in the black (iRT) seed coat at most stages ( Figure 1) as shown previously [17].
The seed coat palisade cells of black (iRT) soybean are unusual compared to those of other model plants used for the study of anthocyanin biosynthesis (e.g. Arabidopsis, Maize, and Petunia) because black (iRT) soybean palisade cells synthesize both anthocyanins and proanthocyanidins (PAs). ANR and LAR enzymes have been identified to catalyze the conversion of cyanidin to epicatechin and leucocyanidin to catechin, respectively, which are the putative initial biochemical steps of PA biosynthesis [25,26]. To investigate how a seed tissue may coordinate the biosynthesis of these parallel pigment pathways, the expressions of putative PA genes ANR1 [GenBank:JF433915] and LAR1 [GenBank: JF433916] were examined in relation to anthocyanin gene expressions. qRT-PCR demonstrated ANR1 expressions to decrease throughout seed coat development in both genotypes and to be almost undetectable by the stages of UGT78K1 overexpressions and anthocyanin biosynthesis ( Figure 1). By contrast, LAR1 expressions were downregulated at the stages of anthocyanin biosynthesis in the black (iRT) seed coat (Figure 1).
In conclusion, these results show that the recently identified UGT78K1 is upregulated in the black (iRT) seed coat, in addition to ANS2/ANS3, and that there is a simultaneous downregulation of the PA gene LAR1 at the stages of anthocyanin biosynthesis. These results demonstrate that expression differences between black (iRT) and brown (irT) seed coats are not limited to anthocyanin genes, and may suggest extensive differences in gene expressions and phenolic compositions.
Combined analysis of targeted and non-targeted methodologies indicate overaccumulation of anthocyanins, altered procyanidin, and reduced flavonol, benzoic acid, and isoflavone compositions in the seed coat of black (iRT) soybean The structures of anthocyanins from only a few (Korean) black soybean varieties have been described [27,28] and information regarding other phenolic compositions may be limited to proanthocyanidins (PAs), where acidcatalyzed hydrolysis of black (iRT) and brown (irT) seed coats showed the PAs to be 3',4'-hydroxylated [5]. A more extensive analysis of phenolic composition differences between black (iRT) and brown (irT) seed coats could greatly improve our understanding of seed pigmentation, and may suggest gene activities that function in pigment biosynthesis. For these reasons, black (iRT) and brown (irT) Clark seed coat metabolites were extracted at the transition stage (300 mg -400 mg FSW) of seed development with multiple solvents and analyzed by targeted high performance liquid chromatography-diode array detection (HPLC-DAD) and non-targeted HPLC-tandem mass spectrometry (HPLC-MS/ MS). To ensure seed coats were at the same stage of development, the days post anthesis, pod length, pod color, embryo morphology, and transcript levels of the developmental marker gene Gm-r1083-1191, a putative cutin biosynthesis gene [29], and the metabolism gene DFR1 were ensured to be equivalent between black (iRT) and brown (irT) tissues (Additional file 1: Table  S1).
To compare the structural features of PAs, black (iRT) and brown (irT) seed coats were extracted with 70% acetone and subjected to phloroglucinol cleavage analysis [30,31]. PA free monomers, polymer subunit compositions, and mean degree of polymerization (mDP) did not differ significantly between nearly-isogenic black (iRT) and brown (irT) seed coats. The soluble PA mDP was 4, and the subunit and monomer compositions consisted mainly of epicatechin with minor amounts of unidentified compounds (Additional file 2: Figure S1A -S1C). Reaction with DMACA reagent demonstrated similar amounts of soluble PAs from both black (iRT) and brown (irT) seed coats (P = 0.08) (Additional file 2: Figure S1F). Further extraction of seed coat residues at 50°C in the presence of acid and phloroglucinol yielded only low levels of epicatechin-phloroglucinol adduct, epicatechin terminal units, and several unidentified compounds (Additional file 2: Figure S1B), and acid catalyzed cleavage of seed coat residues at 95°C identified relatively high, but equivalent amounts of 3',4'-hydroxylated solvent insoluble PAs (Additional file 2: Figure S1E).
For a non-targeted analysis of soluble phenolic metabolites, nearly isogenic black (iRT) and brown (irT) Clark seed coats were extracted with acidified aqueous methanol and analyzed by HPLC-MS/MS over a mass range of 100 amu to 1000 amu. Identities were assigned to peaks by comparison of retention times and mass fragmentation patterns to authentic standards. Peaks with no corresponding authentic standards were identified by mass fragmentation patterns in comparison to the literature. A total of 177 peaks were detected, of which 78 peaks were assigned to 37 compounds ( Table  2). Twenty-eight compounds were detected from the black (iRT) seed coat ( Table 2); consisting mainly of unidentified compounds (35.7%) and anthocyanins (32.1%), followed by procyanidins (25.0%). By contrast, 23 compounds were detected from the brown (irT) seed coat ( Table 2); consisting mainly of unidentified compounds (56.5%), followed by flavonols (13.0%) and procyanidins (13.0%). The majority of compounds (62.2% of all compounds identified by HPLC-MS/MS) were exclusive to the black (iRT) or brown (irT) seed coat, suggesting extensively different biochemical specializations of these tissues. The distribution of compound classes from black (iRT) and brown (irT) Clark seed coats is illustrated in Figure 4.
Taken together, our metabolite analysis experiments show that nearly-isogenic black (iRT) and brown (irT) Clark seed coats have compositional differences not limited to the presence or absence of anthocyanins; with specific flavonol, isoflavone, and benzoic acids exclusive to the brown (irT) seed coat and anthocyanins and unique procyanidin oligomers exclusive to the black (iRT) seed coat. These results may suggest extensive differences in the underlying gene expressions.  Figure 3 Anthocyanins from the seed coat of black (iRT) soybean variety Clark. Numbers correspond to chromatographic peaks identified by HPLC-DAD (see Table 1). Nearly 75% of genes in the soybean genome are present in multiple copies [19], and previously only transcripts for ANS isogenes were found to be overexpressed in the black (iRT) seed coat compared to the nearly-isogenic brown (irT) tissue [17]. For a global analysis of isogene expressions, black (iRT) and brown (irT) Clark seed coats were dissected at the transition stage of seed development (300 -400 mg FSW) for microarray analysis using the Affymetrix Soybean GeneChip. Seed coats were ensured to be at the same stage of development as described above for the metabolite analyses (Additional file 1: Table S1). Of the 37,500 probe sets present on the Soybean Gen-eChip, 80 were found to be upregulated more than 2fold in black (iRT) soybean (Additional file 3: Table S2). Classification using the Gene-Bins ontology tool (http:// bioinfoserver.rsbs.anu.edu.au/utils/GeneBins/index.php) demonstrated the majority of upregulated genes to be "unclassified without homology" (28.7%) or "unclassified with homology" (24.8%), followed by "biosynthesis of secondary metabolites" (7.9%) (Additional file 4: Figure  S2). This latter class consisted exclusively of putative (and some characterized) phenylpropanoid, flavonoid, and anthocyanin isogenes. Interestingly, putative isogenes of the phenylpropanoid pathway (4CL-like and 4CL2), the flavonoid pathway (CHS4/5, F3H-like, DFR2, ANS2/ANS3), and the late-stage anthocyanin pathway (UGT78K1, UGT78K2, OMT-like, OMT5, GST21 and GST26) were all upregulated more than 2-fold in the black (iRT) seed coat, more than 234-fold in the case of the GST26 probe set (Table 3). A BLASTP search of the soybean genome sequence Glyma1 (http://www.phytozome.net/soybean) using the conserved plant secondary product glycosyltransferase signature sequence (PSPG box) [32] identified 214 annotated Family 1 glycosyltransferases (UGTs)(Additional file 5: Table S3). Of the 214 UGTs, only 2 were found to be upregulated more than 2-fold from the black (iRT) seed coat; the recently identified UF3GT gene  UGT78K1 [18] was upregulated more than 2.8-fold and a second uncharacterized glycosyltransferase, which we named UGT78K2 in accordance with UGT nomenclature [33], was upregulated more than 3-fold (Table 3). A search of the soybean genome sequence with the conserved OMT-domain sequence (PLN02589; NCBI) and the ontology 'O-methyltransferase', identified 84 annotated OMTs (Additional file 6: Table S4). Of the 84 OMT genes, only probe sets for OMT-like and OMT5 were upregulated in the black (iRT) seed coat, more than 2.4-fold and 2.5-fold, respectively (Table 3).
A total of 52 probe sets were downregulated more than 2-fold in the black (iRT) seed coat (i.e. black/ brown < 0.5 fold); among these, 2 probe sets by more than 10-fold, 12 probe sets by more than 3-fold, with the remainder between 2-and 3-fold (Table 4; Additional file 7: Table S5). Classification of down-regulated genes using the Gene-Bins ontology tool demonstrated the majority to be "unclassified without homology" (45.5%) or "unclassified with homology" (12.1%), followed by "biosynthesis of secondary metabolites" (9.1%) (Additional file 8: Figure S3). The latter class consisted primarily of characterized and putative (iso)flavonoid isogenes. Isogenes for common flavonoid biosynthesis (CHS1, CHS9, F3'H), PA biosynthesis (LAR1), and isoflavone biosynthesis (IFR1-like) were all down-regulated more than 2-fold in the black (iRT) seed coat, more than 35-fold in the case of the CHS1 (Table 4). The flavonoid genes CHS9 and F3'H were down-regulated more than 21-fold and 2.5-fold, respectively, the putative isoflavone gene IFR1-like by more than 3.6-fold, and the putative PA gene LAR1 by more than 2.2-fold (Table 4).
Of the differentially regulated probe sets, 12 represented transcription factor genes, 8 were upregulated and 4 downregulated (Additional file 3: Table S2 and  Additional file 7 Table S5, respectively), however none of these genes clustered with known flavonoid transcription factor orthologs by phylogenetic analysis (not shown).
A previous study mapped the R locus between markers A668_1 and K387_1 on MLG K (Chromosome 9) [34]. Nineteen differentially-regulated probe sets were associated with genes located on chromosome 9 of the soybean genome sequence Glyma1 (Additional file 9: Table S6). Interestingly, all genes were located in a 5. 16 Mb region of the distal arm of chromosome 9, and only the downregulated AP2/ERF (Glyma09g36840) and serine carboxypeptidase-like protein (Glyma09g36080) and the upregulated polyamine oxidase (Glyma09g36150) were located between markers (BARC-025669-04989 and Sat_293) that flank the putative position of the R locus (http://soybeanphysicalmap.org) ( Figure 5). To validate the microarray results, 28 differentially-regulated genes were selected and their relative expression Table 4 Putative phenylpropanoid/flavonoid gene probe sets downregulated more than 2-fold in the black (iRT) compared to the brown (irT) soybean seed coat a,b profiles were measured by semi-quantitative RT-PCR (semi-qRT-PCR) using black (iRT) and brown (irT) seed coat cDNA as templates. Phosphenol pyruvate carboxylase (PEPC) was used as the endogenous control to ensure equal loading and the flavonoid genes CHS7/ CHS8 and the putative PA gene ANR1 were used as negative controls as they were not found to be differentially expressed in the microarray analysis. Semi-qRT-PCR confirmed the differential-expressions of all 28 genes and demonstrated similar expressions for CHS7/ CHS8 and ANR1 (Additional file 10: Figure S4). Furthermore, as the upregulated probe set GmaAffx.42116.1.S1 did not distinguish between CHS4 and CHS5 expressions, semi-qRT-PCR was performed with gene specific primers to show the CHS4 gene, and not the CHS5 gene to be upregulated in the black (iRT) seed coat (Additional file 10: Figure S4). In summary, these results show that numerous specific anthocyanin, flavonoid and phenylpropanoid isogenes are upregulated and other flavonoid, benzoic acid, and isoflavone isogenes are downregulated in the black (iRT) seed coat transcriptome relative to that of the nearlyisogenic brown (irT) tissue.

Putative Assignment of Function to Genes Differentially-Regulated between Black (iRT) and Brown (irT) Seed Coats
Gene functions can be successfully predicted based on a combined analysis of transcriptome and metabolite data [21]. We compared differences in gene expressions determined by microarray to differences in metabolite compositions to assign putative functions to genes from the black (iRT) and brown (irT) seed coats ( Figure 6). In addition to 7 previously characterized soybean flavonoid genes, several uncharacterized genes belonging to large multi-gene families with possible roles in anthocyanin biosynthesis and transport were upregulated; including one glycosyltransferase (UGT78K2), two O-methyltransferases (OMT5 and OMT-like) and two glutathione Stransferases (GST21 and GST26) ( Table 3). Considering the accumulation of metabolites with specific molecular structures in the black (iRT) seed coat, putative functions of the differentially-expressed genes were assigned to specific structural modifications and transport functions ( Figure 6).
Anthocyanins in the seed coat of black (iRT) soybean are glycosylated at the 3-position of the C-ring ( Figure  3). The glycosyltransferase UGT78K2 is assigned to code a UDP-glycose:flavonoid 3-O-glycosyltransferase (UF3GT) protein ( Figure 6) and may have functional redundancies in anthocyanin biosynthesis with the previously characterized UGT78K1 [18], which was also upregulated in the black (iRT) seed coat (Table 3). Reducing the expressions of late-stage anthocyanin genes by RNA interference (RNAi) in the seed coat of black soybean may be an effective strategy to engineer seed coat color while minimizing unintended effects on other flavonoid subpathways. However, functional redundancies may require that multiple genes be silenced in order to reduce enzyme activity enough to reduce pigment levels. Two glycosyltransferases (UGT78K1 and UGT78K2) were upregulated in the black (iRT) seed coat relative to the nearly-isogenic brown (irT) tissue (Table 3) and the upregulation paralleled the accumulation of anthocyanins with 3-O-glycosylated structures (Figure 2), suggesting the function of these genes as UDP-glycose:flavonoid 3-O-glycosyltransferases (UF3GTs). We have characterized UGT78K1 previously and shown it to possess UF3GT activity towards anthocyanidins in vitro and in vivo [18]. To  Table S2 and Table S5, and for a complete list of differentially accumulated metabolites see Table 2. As we have performed the kinetics analysis for recombinant UGT78K1, we were able to employ the same experimental conditions to test the kinetics of UGT78K2. Interestingly, UGT78K2 converted cyanidin-3-O-glucoside with greater velocity than UGT78K1 and higher concentrations of cyanidin were required to inhibit its activity ( Figure 7D) despite the enzymes having 93% amino acid similarity (Additional file 11: Figure S5).
Methylation of the B-ring of anthocyanins has a reddening effect on flower color (reviewed by [35]) and thus engineering the expression of 3'-O-methyltransferases could change the redness of seed color. However, despite the identification of several flavonoid OMTs from soybean, none have been shown to accept anthocyanins as substrates. Our microarray analysis identified two O-methyltransferases (OMT5 and OMT-like) to be upregulated in the black (iRT) seed coat (Table 3)  To further investigate the function these genes, UGT78K2 and OMT5 expression profiles were measured from seed coat cDNA throughout black (iRT) and brown (irT) soybean seed development by quantitative RT-PCR (qRT-PCR) ( Figure 9) and expressions were compared to anthocyanin accumulation profiles throughout seed development (Figure 2). Similar to UGT78K1, UGT78K2 and OMT5 expressions peaked at the transition and early maturation stages of seed development (300 mg -400 mg FSW, 400 mg -500 mg FSW) in the black (iRT) seed coat (Figure 9), the stages at which anthocyanins accumulate (Figure 2), and were low or almost undetectable at early stages of seed development in the black (iRT) seed coat and at all stages in the brown (irT) seed coat.
Together the qRT-PCR and recombinant enzyme assays suggest that UGT78K2 and OMT5 code UF3GT and AOMT proteins, respectively, that function in anthocyanin pigment biosynthesis in the seed of black (iRT) soybean.

Discussion
The seed coat compositions of black (iRT) and brown (irT) soybean (Glycine max) were known to differ only by the presence/absence of anthocyanins, however our metabolite analysis demonstrated these nearly-isogenic tissues to have extensive differences in anthocyanin, proanthocyanidin (PA), (iso)flavonoid, and phenylpropanoid content ( Table 2). Of the 37 compounds detected, 24 compounds (62.2% of all compounds identified by HPLC-MS/MS) were exclusive to either the black (iRT) or the brown (irT) seed coat. Analysis of the distributions of compound classes between seed coats demonstrated the black (iRT) to a have a greater proportion of anthocyanins and PAs, and the brown (irT) to have a greater proportion of flavonol, benzoic acid, isoflavone, and unknown metabolite compositions (Figure 4). Metabolites exclusive to the black (iRT) seed coat included PA trimers (Procyanidin trimer 1, 2, and 4) and an A-  Figure 9 Seed coat transcript profiles of UGT78K2 and OMT5 during seed development measured by qRT-PCR. Black (iRT) seed coat (black bars) and brown (irT) seed coat (white bars). Student's t test significant at P < 0.01 (**), student's t test significant at P < 0.05 (*).
type procyanidin dimer in addition to various anthocyanins ( Table 2). The anthocyanins identified from variety Clark were mainly similar to those reported previously from various Korean varieties [27,28] however Clark was the first shown to contain a 3-deoxyanthocyanidin (diosmetinidin) in the seed coat (Table 2). 3-deoxyanthocyanidins are relatively rare intensely coloured pigments found in Sorghum bicolour in response to fungal attack [36], and from the cotyledons of yellow soybean in response to UV-C treatment [37]. The biosynthesis of 3-deoxyanthocyanidins from flavonones has been shown to be a minor activity of a DFR isogene in Sorghum [38]. It would be interesting to determine whether the DFR gene that was highly overexpressed in the black Clark seed coat (DFR2) had similar catalytic activity, and whether the Korean varieties contain genetic alterations related to DFR2.
Analysis of the black (iRT) seed coat transcriptome compared to that of the nearly-isogenic brown (irT) tissue identified 80 probe sets to be upregulated (Table 3). This is almost double the number of probe sets upregulated by overexpression of the MYB transcription factor PAP1 in Arabidopsis (38 probe sets) [21], similar to the number upregulated by overexpression of the anthocyanin factor LAP1 in M. truncatula (61 probe sets), and less than half the number upregulated by overexpression of LAP1 in M. sativa (270 probe sets) [39]. The higher numbers of genes upregulated in anthocyanin-overaccumulating tissues of black soybean, M. truncatula and M. sativa likely reflect the fact that legumes have undergone genome duplication and that genes with multiple copies are frequently retained [19]. Our data supported this hypothesis as duplicate copies of several genes were upregulated in the black (iRT) seed coat (e.g. 4CL2 and 4CL-like, UGT78K2 and UGT78K1, OMT5 and OMTlike, and GST21 and GST26).
Functional redundancies can complicate gene identification by traditional approaches. Previous to the present study, genes for only 5 flavonoid enzymes (CHS, F3H, F3'H, F3'5'H, and UF3GT) were identified to be involved in pigment biosynthesis in the soybean seed coat [7,[11][12][13][14]18]. In the present study, 20 isogenes for 14 proteins were identified to have putative roles in flavonoid biosynthesis in black (iRT) or brown (irT) seed coats using a combined analysis of microarray and metabolite data ( Figure 6).
Transcriptome analysis has previously demonstrated a critical role of CHS7 and CHS8 genes for isoflavonoid biosynthesis in developing embryos of soybean [40]. However, our microarray and semi-quantitative RT-PCR data demonstrated that CHS7 and CHS8 were not differentially regulated in black (iRT) and brown (irT) seed coats (Table 3). Interestingly, it was CHS4 that was upregulated and CHS1 and CHS9 that were downregulatedin the black (iRT) seed coat (Table 3 and Table 4 respectively). As CHS4 is upregulated with anthocyanin genes, it likely has a role in anthocyanin biosynthesis in this tissue. Consistent with this, spontaneous mutations of the dominant i i allele to the recessive i allele have been shown to affect the promoter region of the CHS4 gene [7]. By contrast, the downregulation of CHS1 and CHS9 may suggest that they have a role in the parallel downregulated biochemical pathways, such as flavonol and isoflavonoid biosynthesis, as these metabolites, and transcript levels of genes for their biosynthesis, were reduced or not present in the black (iRT) seed coat (Figure 6). These results emphasize the complexity of isogene expressions that exist to achieve a single enzyme function in soybean.
Microarray analysis identified differentially regulated probe sets for only 3 gene functions (LAR, GST, and MATE) that have been previously shown to be involved in PA biosynthesis and transport. Recombinant LAR enzymes from several legume species have been shown to convert leucocyanidin to (+)-catechin in vitro [25,26,41,42]. However, the in vivo role of LAR in PA biosynthesis remains questionable as endogenous expression has been shown not to correlate with PA levels in M. truncatula and heterologous expression has failed to increase PA accumulations [41]. Similarly, our metabolite data showed that soybean PAs lack catechin subunits (Additional file 2: Figure S1) and that LAR1 expressions (Figure 1) do not parallel PA accumulations during seed development (not shown). Interestingly, 4 PAs were found to be exclusive to the black (iRT) seed coat (Table 2) however the total subunit compositions, the mean degree of polymerization (mDP), and the total amounts of soluble and solvent-insoluble PAs were not different compared to the brown (irT) seed coat (Additional file 2: Figure S1). These results suggest the PA oligomers of black (iRT) and brown (irT) seed coats may differ only in their subunit linkages. However, the potential influence of LAR or other PA genes on PA subunit linkages (if any) remains to be determined.
From the set of differentially regulated genes identified by microarray, two putative late-stage anthocyanin genes were selected and the functions of their coded enzymes were determined in vitro. UGT78K2 was found to code a UF3GT enzyme for anthocyanin biosynthesis ( Figure  7). A UF3GT cDNA (UGT78K1) has recently been isolated from the black seed coat and functionally characterized [18]. UGT78K2 was 93% similar to UGT78K1 but possessed increased catalytic activity towards cyanidin ( Figure 7D). The identification gene redundancy has important implications for seed coat color engineering by RNA interference, as both copies of the gene may have to be silenced to achieve a significant reduction in UF3GT activity. Similarly, two OMT genes (OMT5 and OMT-like) were found to be overexpressed in the black seed coat (Table 3). OMT5 was cloned and found to code an AOMT enzyme in vitro (Figure 8). UF3GT and OMT genes may be ideal candidates for seed coat color engineering as suppression of UF3GT genes (UGT78K1 and UGT78K2) may result in a reduction in anthocyanin levels similar to that caused by a T-DNA insertion in the Arabidopsis UF3GT [21] and upregulation of an AOMT may result in a more red seed coat, as methylation of the B-ring of anthocyanins has a reddening effect on flower color (reviewed by [35]). However, it remains to be determined whether altering the expressions of these genes could be use to modify seed coat color without causing unintended effects on alternative biochemical pathways.
Prior to the present study, currently known soybean genes did not map between markers A668_1 and K387_1 on chromosome 9 of the soybean physical genomic map, where the R locus is located, leaving no obvious candidates for the R locus gene [15]. In our study, 14 genes identified to be differentially regulated in the black (iRT) soybean seed coat transcriptome were found to be located on chromosome 9 of the soybean genome sequence Glyma1 ( Figure 5). Among the gene families represented, only WD40 and MYB have known roles in flavonoid regulation. However the genes demonstrated did not cluster with known flavonoid regulators by phylogenetic analysis (not shown). A Mob1/phocein gene was commonly upregulated in the black (iRT) soybean seed coat and Arabidopsis seedlings overexpressing PAP1 [21]. Other regulatory gene families represented on chromosome 9 were AP2/ERF, serine/threonine protein kinase, calcium/calmodulin-dependent protein kinase, and ethylene-responsive element-binding protein.
Interestingly, all upregulated and downregulated genes on chromosome 9 were located in 5.16 Mb region that includes the nearest sequence markers to the R locus. Only the downregulated AP2/ERF transcription factor, the serine carboxypeptidase gene, and the gene moderately similar to the A. thaliana polyamine oxidase ATPAO1 were located between these sequence markers ( Figure 5). As a result our study provides a short list of gene candidates that may serve as the focus of future efforts to identify the R locus gene.

Conclusion
Metabolite composition and gene expression differences between black (iRT) and brown (irT) seed coats are far more extensive than previously thought. Putative anthocyanin, proanthocyanidin, (iso)flavonoid, and phenylpropanoid isogenes were differentially-expressed between black (iRT) and brown (irT) seed coats, and UGT78K2 and OMT5 were validated to code UDP-glycose: flavonoid-3-O-glycosyltransferase and anthocyanin 3'-Omethyltransferase proteins in vitro, respectively. Duplicate gene copies for several enzymes were overexpressed in the black (iRT) seed coat suggesting more than one isogene may have to be silenced to engineer seed coat color using RNA interference.

Plant Materials and Growth Conditions
Black (iRT) soybean (G. max (L.) Merr.) variety Clark (PI547438) and the nearly isogenic brown (irT) soybean (PI 547475) were obtained from the U.S. Department of Agriculture Soybean Germplasm Collection (Agricultural Research Service, University of Illinois at Champaign-Urbana). Seeds were surface-sterilized in 2% triton/70% EtOH-H2O (7:3, v/v) for 5 min on a mixer wheel, rinsed three times with EtOH, dried, and germinated in autoclaved vermiculite in a Conviron E15 cabinet with a photoperiod of 16 h light (590 μE m 2 s) at 25°C, and 8 h dark at 20°C. Plants were fertilized twice weekly with N:P:K 35-5-10 (1% w/v). After 12 days, seedlings were transplanted to autoclaved soil and 34 days later the photoperiod was changed to 12 h light/12 h dark to encourage reproductive development. For RNA isolation, seed coats were dissected from plants, immediately frozen in liquid N2, lyophilized, and stored at -80°C. Seed coats for comparative analyses were harvested on the same days, 4 h after the initiation of the light cycle to minimize for potential differences in transcript and metabolite accumulations that may be influenced by circadian rhythm.

Seed Coat Metabolite Analyses
Lyophilized seed coats (~40 mg) from each developmental stage were pulverized in HCO2H-MeOH-H2O (400 μL, 15:80:5, v/v) using a FastPrep FP120 Homogenizer (Savant) and incubated at 4°C for 2 h. The slurry was centrifuged at 20,000 g for 10 min, and 5 μl of supernatant from each sample was measured by photospectroscopy using a NANODROP 2000 (Thermo Scientific) according to the formula A530-0.25A657 to compensate for chlorophyll absorption at 530 nm (Mancinelli, 1990). The quantity of anthocyanins was determined by comparison to a standard curve and expressed as cyanidin 3-O-glucoside equivalents. For HPLC-DAD and LC-QTRAP analyses samples were extracted for an additional 24 h and 200 μl of supernatant was vortexed with HCO 2 H-MeOH (1 mL, 15:85, v/v) for 20 sec, centrifuged at 20,000 g for 10 min, and the supernatant was evaporated under a stream of nitrogen gas (to 50 μL), filtered through PVDF (0.45 μm; Millipore), and 20 μl and 5 μl aliquots were analyzed by HPLC-DAD and HPLC-MS/MS respectively, as described below.
For analysis of proanthocyanidins (PAs), lyophilized seed coats were ground to a powder, extracted with 70% acetone (1 mg mL -1 ) and analyzed as described previously [30,31] with (-)-epicatechin, (+)-catechin, and (-)-epicatechin gallate used as standards for the identification of free monomers, and procyanidin B2 used as a standard for the analysis of soluble and insoluble PA oligomer compositions. Quantitation of soluble PAs by reaction with DMACA reagent and insoluble PAs by acid-catalyzed cleavage was performed as described [43] using procyanidin B2 and cyanidin as standards for the respective assays.
Microarray Analysis, Semi-qRT-PCR, and qRT-PCR Total RNA was isolated from pigmented soybean seed coats as described previously (Wang and Vodkin, 1994). The quantity and purity of RNA samples were determined by spectrophotometry using a NANODROP 2000 (Thermo Scientific). RNA integrity was confirmed by microfluidics using a Bioanalyser 2100 equipped with an RNA 6000 Nano Chip (Agilent Technologies Inc., Montreal, QC, Canada).
Microarray was performed on RNA isolated from three plants of each genotype. Briefly, 100 ng of total RNA was in vitro transcribed for 16 h using the Gene-Chip 3'IVT Express Kit (Affymetrix, http://www.affymetrix.com). Labelled RNA was hybridized to GeneChip Soybean Genome Arrays as described by the manufacturer (Affymetrix). Statistical analyses of data for the identification of differentially regulated genes were performed using FlexArray software (M. Blazejczyk and associates; Genome Quebec, Montreal) that uses R and Bio-Conductor [44]. The raw data was adjusted for background signal, and normalized across all GeneChips using the Robust Multi-array Average (RMA) method [45]. To identify differentially expressed genes, the SAM algorithm, which minimizes variation across arrays and incorporates an estimate of false discovery rate (FDR) [46], was used. Only genes with a fold change > 2 or < 0.5 and a P value < 0.01 were considered to be differentially expressed. All materials and procedures comply with the MIAME standards for array data [47]. The  For quantitative RT-PCR (qRT-PCR), cDNA template was prepared as described above. Reactions (25 μl) consisted of 2 μl of first-strand cDNA (or untreated RNA), 250 nM of forward and reverse primers, and 12.5 μl of the iQ SYBR Green Supermix (BioRad). qRT-PCR of each target gene for each seed coat sample was performed in triplicate on cDNA samples or untreated RNA samples using an PTC-200 Peltier thermal cycler equipped with a Chromo4 continuous fluorescence detector (MJ research). PCR cycling was 95°C for 10 min, followed by 40 cycles of 95°C for 30 sec, 58°C for 1 min, and 72°C 1 min. qRT-PCR data and PCR efficiencies were analyzed using the Opticon Monitor 3 software (BioRad). To verify the specificity of the RT-PCR reactions, melting curves were performed subsequent to each reaction in addition to fractionation of RT-PCR products on agarose gels. Semi-qRT-PCR and qRT-PCR experiments were performed in triplicate. Primers used in this study can be found in Additional file 12: Table  S7.

Cloning of UGT78K2 and OMT5 cDNAs from the Seed Coat of Black Soybean
To clone the full-length coding sequences (CDSs) of UGT78K2 and OMT5, PCR primers were designed based on the transcript sequence predicted for Gly-ma08g07130 (http://www.phytozome.net/soybean) and based on the mRNA sequence of BT098523, respectively. Full-length CDSs for UGT78K2 and OMT5 were amplified from black (iRT) seed coat cDNA by end-toend PCR using primers UHF/UAR and HO5F/HO5R, respectively (Additional file 12: Table S7) and a highfidelity DNA polymerase (Invitrogen). The resulting UGT78K2 and OMT5 amplicons (1466 bp and 836 bp, respectively) were cloned into NdeI and BamHI sites, and NdeI and XhoI sites of the pET-14b vector (Novagen), respectively, and sequenced to confirm their identities.

Expression of Recombinant Proteins in E. coli
The full-length CDSs of UGT78K2 and OMT5 were independently fused in-frame to the N-terminal hexahistidine tag encoded by the pET-14b vector (above). Plasmids were transformed into the expression host E. coli BL21(DE3) pLysS (Novagen) and a single colony for each plasmid was selected for production of recombinant proteins. Soluble recombinant proteins were isolated following growth and induction of E. coli BL21 (DE3) pLysS at 16°C. Recombinant proteins were purified by ion metal-affinity chromatography (IMAC) using a kit (Novagen). To confirm the purity of the recombinant enzyme, the eluted fractions were visualized on 12.5% acrylamide gel stained with 0.25% Coomassie Blue. The amount of purified recombinant enzyme was determined using the BioRad reagent.

HPLC-DAD and HPLC-MS/MS Analyses
HPLC-DAD analysis of anthocyanins was performed as described previously [18]. Briefly, separations were achieved at 45°C on a Luna C18(2), 4.6 × 150 mm, particle size 5 μm fitted with a corresponding guard-column (Phenonenex Inc, Santa Ana, CA) using an Agilent 1100 series (Agilent Technologies Inc., Montreal, QC, Canada) equipped with an autosampler with a 100 μL loop, a quaternary pump (maximum pressure, 400 bars), a column thermostat, and a diode array detector (DAD). The mobile phase consisted of 5% HCO 2 H in H 2 O (solvent A) and MeOH (solvent B). Optimized elution conditions were a linear gradient of 10-100% B in 25 min with a flow rate of 1 ml min -1 , the column was washed for 10 min with 100% B, brought back to starting mobile phase composition in 0.1 min and equilibrated for 5 min prior to the next injection. The HPLC separations were monitored at 520, 476, 350, 280, and 254 nm. The relative quantities of anthocyanins were calculated based on percent area of each peak eluting at specific retention times.
The HPLC-ESI-MS/MS system consisted of an Agilent 1200 series (Agilent Technologies Inc., Montreal, QC, Canada) connected directly to a 3200QTRAP (ABI Sciex, Toronto, Canada). The software used for data acquisition and analysis is Analyst 1.4.2. The chromatographic conditions were the same as described for the HPLC-DAD analysis anthocyanins. The mobile phase was split 50/50 post-column. For enhanced mass scan (EMS), the MS was operated in positive polarity at a scan rate of 4000 amu s -1 within the mass range of 100 -1000 amu. The optimal source conditions of the mass spectrometer were: collision gas high, curtain gas (N2) 25 L min -1 , ion spray voltage 4500 V, source temperature 500°C, source gas 1 (at 50 psig), source gas 2 (at 55 psig). The optimal compound parameters were declustering potential (DP) +20 V, entrance potential (EP) +10 V, collision energy (CE) 10 V, ionization energy 1.0 eV and detector 2800. For enhanced product ion scan the information dependant acquisition (IDA) criteria was as follows: select 1 -2 most intense peaks which exceeds 10,000 cps, exclude targeted ions after 3 occurrences for 12 s. Product ions were scanned within the mass range of 100 -1000 amu. All source parameters were the same as for EMS. The optimal compound parameters were DP +70 V, EP +10 V, CE 65 V, and collision energy spread 40 V.

Additional material
Additional file 1: Supplementary Table S1. Developmental properties of black (iRT) and brown (irT) soybean seed coats at the 400 mg fresh seed weight stage. Additional file 3: Supplementary Table S2. Gene probe sets upregulated more than 2-fold in the seed coat of black (iRT) soybean compared to the brown (irT) isoline a .