The genome sequence of allopolyploid Brassica juncea and analysis of differential homoeolog gene expression influencing selection

Mingfang Zhang, Sally Mackenzie and colleagues report the genome sequence of allopolyploid Brassica juncea and through comparative analysis suggest that A-subgenome evolution contributes to differences in agricultural subvarieties. They find that differential homoeolog gene expression from the subgenomes helps to shape the selection that distinguishes vegetable- and oil-use Brassica. The Brassica genus encompasses three diploid and three allopolyploid genomes, but a clear understanding of the evolution of agriculturally important traits via polyploidy is lacking. We assembled an allopolyploid Brassica juncea genome by shotgun and single-molecule reads integrated to genomic and genetic maps. We discovered that the A subgenomes of B. juncea and Brassica napus each had independent origins. Results suggested that A subgenomes of B. juncea were of monophyletic origin and evolved into vegetable-use and oil-use subvarieties. Homoeolog expression dominance occurs between subgenomes of allopolyploid B. juncea, in which differentially expressed genes display more selection potential than neutral genes. Homoeolog expression dominance in B. juncea has facilitated selection of glucosinolate and lipid metabolism genes in subvarieties used as vegetables and for oil production. These homoeolog expression dominance relationships among Brassicaceae genomes have contributed to selection response, predicting the directional effects of selection in a polyploid crop genome.


2 2 6
VOLUME 48 | NUMBER 10 | OCTOBER 2016 Nature GeNetics A r t i c l e s a potential link between homoeolog expression dominance and trait improvement that might be extendable to other polyploid crops.

Genome assembly, scaffold anchoring and annotation
To distinguish among subgenomes in Brassica species, we redesignated the subgenomes in Brassica 23 as follows: B. rapa as BraA; B. nigra as BniB; B. oleracea as BolC; B. juncea A subgenome as BjuA and B subgenome as BjuB; and B. napus A subgenome as BnaA and C subgenome as BnaC.
We selected an advanced generation inbred line of B. juncea var. tumida (variety T84−66) for whole-genome sequencing. We estimated the size of the T84-66 genome at 922 Mb by flow cytometry (Supplementary Fig. 1 and Supplementary Table 1). We assembled the T84-66 genome using 176× Illumina shotgun reads and 12× PacBio single-molecule long reads (Supplementary Table 2a,b and Supplementary Fig. 2). The assembly spanned 784 Mb, 85% of the 922 Mb estimated by flow cytometry (Supplementary Table 3). The contig N50 value was 61 kb, and the scaffold N50 was 855 kb (Supplementary Table 3).
We collected 996,648 BioNano DNA molecules over 150 kb, which corresponds to 222 equivalents of the genome, the average of which exceeded 2 Mb in size (Supplementary Table 4). The genome map assembled de novo consisted of 922 constituent genome maps with average length of 1.19 Mb and N50 of 1.84 M (Supplementary Table 4). We used these assemblies to correct the genome assembly above (Supplementary Fig. 3). The final assembly by the BioNano approach spanned 955 Mb, and the scaffold N50 was 1.5 Mb (Supplementary Table 3). We constructed a high-resolution genetic map with 5,333 bin markers and 18 pseudo-chromosomes (10A and 8B subgenomes; Supplementary Tables 5 and 6). We then integrated a published B. juncea genetic map 24 (Supplementary Table 7). Finally, we anchored 91.5% and 72.3% of A-and B-subgenome assembly sequences onto the 10 and the 8 pseudo-chromosomes, respectively (Supplementary Table 8a and Supplementary Fig. 4). We sorted the B. juncea chromosomes into the 402.1 Mb BjuA and 547.5 Mb BjuB subgenomes based on this assembly (Supplementary Table 9).
We also sequenced the genome of a doubled haploid line of B. nigra (YZ12151) for comparative genomic study. We assembled a collection of 96× Illumina shotgun reads to generate a 396.9 Mb genome sequence for B. nigra, with a scaffold N50 of 557.3 kb, and 68% of the estimated 591 Mb B. nigra genome (Supplementary Tables 10 and 11, and Supplementary Fig. 5). We anchored the 66% scaffolds into pseudo-chromosomes for B. nigra, referring to the BjuB genetic map (Supplementary Table 8b).
To validate the genome assembly, we used subreads from PacBio, of which 10 subreads had more than 99.4% coverage and 92.3% identity, on average, with the assembled genome (Supplementary Table 12). We aligned 15 published bacterial artificial chromosomes (BACs) from B. nigra to the B. nigra genome assembly, and observed over 98.5% coverage and 99.8% identity on average to BAC clones (Supplementary Table 13 and Supplementary Fig. 6). We BLASTaligned 458 core eukaryotic genes (Cluster of Essential Genes (CEG) database) 25 to the genome assembly with core eukaryotic genes mapping approach (CEGMA) pipeline 26 , which showed high-confidence hits of 453 (98.8%) and 458 (100%) CEG proteins for all 458 essential genes in CEG with full length (>70% alignment) in the genome of B. juncea and B. nigra, respectively (Supplementary Table 14a). We validated the assembled genomes by matching expressed sequence tags (ESTs) downloaded from the US National Center for Biotechnology Information (NCBI) database, which indicated that 98.9% and 98.2% ESTs were supported by the assembled genomes of B. juncea and B. nigra (>50% alignments), respectively (Supplementary Table 14b).
We identified and compared repetitive sequences from syntenic regions of these genomes. We identified 316.  Table 15). Long terminal repeats (LTRs) are the predominant transposable element (TE) family identified in all sequenced Brassica genomes 6,7 . Copia-and Gypsy-type LTRs represent the two most abundant TE subfamilies. Using repetitive sequence from syntenic regions, we found that they constituted a similar percentage of all TEs in the BjuA and BjuB, and their respective ancestral genomes (Supplementary Fig. 7a). We observed similar repetitive sequence contributions in B. napus (Supplementary Fig. 7b). We identified TEs in the B. juncea and B. napus subgenomes that were newly formed after divergence from each ancestral genome (Supplementary Fig. 8 and Supplementary Table 16a). We confirmed five randomly selected newly formed TEs by PCR amplification from B. rapa, B. nigra and B. juncea (Supplementary Fig. 9). These newly formed TEs showed similar distribution and percentage between the B. juncea and B. napus subgenomes, and their respective ancestral genomes (Supplementary Table 16b and Supplementary  Fig. 10a,b). We observed 310 newly formed TEs to be active between   Table 21).
We extracted 28,228 and 28,917 syntenic ortholog gene pairs from the B. juncea subgenomes and their ancestral genomes to identify gene loss during the speciation process 8 . In total, we identified 562 and 545 genes lost from BjuA and BjuB, respectively, relative to their common ancestral genomes. This represents a higher percentage than the gene loss estimates for BnaA and BnaC, relative to their common ancestral genomes (Supplementary Table 22). We validated gene loss using PCR amplifications (Supplementary Fig. 11). Gene loss numbers of B. juncea and B. napus were consistent with their formation times. The identified genes lost in the B. juncea subgenomes of BjuA and BjuB are involved in different functions based on Gene Ontology ( Supplementary Fig. 12a,b). We mapped the distributions of genes, repetitive sequences, gene loss, pseudogenes, genome markers and genetic markers of the B. juncea subgenomes (Fig. 1).

Comparison of A subgenomes in Brassica
Synteny analysis among three A subgenomes of Brassica showed strong co-linearity, although chromosomal rearrangements have occurred between BjuA and BraA after their divergence from the common B. rapa ancestor ( Fig. 2a and Table 25). We constructed a neighbor-joining tree for A subgenomes in Brassica, and discovered that BjuA and BnaA had divergent origins (Fig. 2b). BjuA might derive from B. rapa ssp. tricolaris, which is distributed in Asia, whereas BnaA might derive from B. rapa ssp. rapa (European turnip), which is widely distributed in Europe (Fig. 2b). This discovery indicates that allopolyploids B. juncea and B. napus have independent geographical origins, deriving from Asian and European regions, respectively.
Furthermore, we found that all A subgenomes from B. juncea were rooted in the common ancestor, and evolved into different subvarieties for vegetable or oil use (Fig. 2b). Principal component analysis displayed that vegetable-and oil-use subvarieties of B. juncea were distributed nearby B. rapa ssp. tricolaris group and far from other subspecies of B. rapa, supporting the ancestor being closer to B. rapa ssp. tricolaris (Supplementary Fig. 13). Using the independent origin of A subgenomes in B. napus and B. juncea as a control, we compared the SNP variation characteristics between BjuA and BnaA, and that between A subgenomes of vegetable-and oil-use subvarieties of B. juncea. We found typical SNP polyphyletic origin pattern between BjuA and BnaA, and typical SNP monophyletic origin pattern for A subgenomes of vegetable-and oil-use subvarieties in B. juncea (Supplementary Fig. 14). In total, the results drawn from the phylogenetic tree, principal component analysis and SNP variation patterns point to a monophyletic origin and evolution into vegetable-and oil-use subvarieties for A subgenomes of B. juncea.
To estimate when B. juncea formed, we found that the synonymous nucleotide substitution rate was not accurate for estimating formation time of the post-neopolyploid species (Supplementary Fig. 15 and Supplementary Table 26a,b). We therefore used phylogenetic analysis and Bayesian method 27 to calculate when BjuA diverged from its closest relative genome (tricolaris; Fig. 2b), to set an upper limit for its time of formation. We considered the time between the divergence of BjuA and the earliest divergent B. juncea accessions (B. juncea; Fig. 2b) as the lower limit for its formation time. We deduced that B. juncea formed ~0.039−0.055 million years ago (Mya) (Fig. 2c). Similarly, to estimate when B. napus formed, we referred to BnaA and its closest relative genome (European rapa; Fig. 2b) to set an upper limit for its formation time, and to BnaA and the earliest divergent B. napus accessions (B. napus; Fig. 2b) to set a lower limit for its formation time.
Here we deduced that B. napus formed 0.038−0.051 Mya (Fig. 2c), which is slightly earlier than the previous estimate of ~7,500 years ago derived by synonymous substitution (Ks) estimation 8 .

Homoeolog expression dominance in allopolyploid B. juncea
To explore the transcriptional behavior of the allopolyploid subgenomes, we compared the genome-wide transcriptional levels of       19 . Transcriptional expression analysis in resynthesized Brassica allopolyploids showed that gene expression changes occurred soon after the initial genome merger and allopolyploidization 28 . These observations suggest that establishment of homoeolog expression dominance after the initial genome merger and allopolyploidization was immediate. During different developmental stages, 3,339 commonly expressed gene pairs showed homoeolog expression dominance, with 56% of gene pairs displaying dominance toward BjuB subgenomes ( Supplementary  Fig. 16). In different tissues, 2,251 commonly expressed gene pairs indicated homoeolog expression dominance, and 55% of gene pairs showed dominance toward BjuB (Supplementary Fig. 17). In all evolutionarily synthesized B. juncea, homoeolog expression dominant genes derived predominantly from BjuB, whereas one of the two transcriptomes from the resynthesized B. juncea types showed expression dominance by BjuA (Fig. 3a and Supplementary Table 27b).
We identified 5,632 gene pairs displaying homoeolog expression dominance in B. juncea (Supplementary Table 28). Using the KEGG database, we performed a pathway enrichment analysis for all homoeolog expression dominant genes. This analysis showed that genes showing homoeolog expression dominance were enriched for: cellular processes, environmental information processing, genetic information processing and metabolism and plant-pathogen interaction (Supplementary Fig. 18). Among these pathways, we found that metabolic and plant hormone signal transduction pathways were especially enriched.
We calculated the average non-synonymous/synonymous substitution (Ka/Ks) value of all genes among population accessions based on whole-genome sequencing and resequencing data. The results show significant difference between BjuA and BjuB using a permutation test, of which the BjuB has evolved faster than the BjuA, suggesting asymmetric evolution of the two subgenomes (Fig. 3b). To further analyze homoeolog expression dominant genes, we calculated the average Ka/Ks values among those genes expressed as dominant (higher expression level in homoeologous gene pair), subordinate (lower expression level in homoeologous gene pair) and neutral (equal expression level in homoeologous gene pair). The Ka/Ks values of dominant and subordinate genes (median: 0.31 and 0.35, respectively) were significantly higher than those of neutral genes (median: 0.25) using a permutation test (Fig. 3b). We also calculated the average Ka value among these genes, which indicated the same patterns with Ka/Ks values ( Supplementary  Fig. 19). This observation indicated that both dominant and subordinate genes evolved more rapidly than did the neutral genes, with subordinate genes being prone to selection in a homoelogous gene pair. Table 25), we estimated the average pairwise diversity (π) and population differentiation statistics (F ST ) between the vegetable-and oil-use varieties of B. juncea (Supplementary Tables 29 and 30). We identified selective sweep regions in vegetable-and oil-use B. juncea accessions by combining F ST < 0.05 and π ratio < 0.05 outliers (Fig. 4a and  Supplementary Table 31). In total, we identified 794 selected genes between the vegetable-and oil-use subvarieties of B. juncea, of which 36.3% (288) showed homoeolog expression dominance. A high proportion of genes with homoeolog expression dominance under selection imply their participation in agricultural trait improvement.

Selection in allopolyploid B. juncea Using SNPs from resequencing accessions of B. juncea (Supplementary
Vegetable-use B. juncea varieties have been selected based on their glucosinolate (GSL) content and composition for human nutrition and plant defense properties 29 . Oil-use B. juncea varieties have been subjected to intensive breeding to improve their lipid composition, including decreases in the levels of erucic acid and GSL, which are undesirable because they can produce toxic catabolic products in animal feed 8 . In total, we identified 13 GSL-metabolism-related genes and 22 lipid-metabolism-related genes that were differentially selected between the vegetable-and oil-use subvarieties of B. juncea (Supplementary Table 32). Of these selected genes, 6 GSL-metabolismrelated genes and 7 lipid-metabolism-related genes likewise exhibited homoeolog expression dominance (Supplementary Fig. 20 and Supplementary Table 33). One of these genes, BjuB021254, whose ortholog is AT1G04350 (AOP) in Arabidopsis, encodes a 2-oxoglutarate-dependent dioxygenase and has an essential role in GSL biosynthesis 30 . The gene BjuA030837, whose ortholog is AT1G18460 in Arabidopsis, encodes a lipase family protein that is important in the glycerol biosynthesis process 31 . The homoeolog expression dominance and selection of these genes suggests that their functions in GSL and lipid metabolism have been subjected to selection and improvement between vegetable and oil-use subvarieties of B. juncea.
With resequencing of vegetable-and oil-use subvarieties of B. juncea, we showed divergent genotypes with nonsynonymous mutation for GSL-and lipid-metabolism-related genes between vegetable-and oiluse subvarieties of B. juncea (Fig. 4b). Oil-use types displayed uniform genotypes for the GSL-and lipid-metabolism-related genes compared to vegetable-use types. In the vegetable-use groups, all but varieties CN59 and CN79 showed consistent genotypes for these genes. We constructed a phylogenetic tree for the vegetable-and oil-use types of B. juncea, in which accessions CN59 and CN79 were clustered into a subgroup independent from the vegetable-use subgroup (Fig. 2b), although we classified them into the vegetable-use subgroup on the basis of their edible organs. Transcriptomic analysis from selected GSLand lipid-metabolism-associated genes showed significant differences in expression using a two-tailed t-test between vegetable and oil-use subvarieties of B. juncea ( Fig. 4c and Supplementary Table 29). These observations indicate that genomic selection has diversified GSL-and lipid-metabolism-related genes between vegetable-and oil-use subvarieties of the plant, each in the direction of their respective agriculturally desirable traits. We also observed 24 selected genes involved in phytohormone metabolism, of which 12 genes exhibited homoeolog expression dominance (Supplementary Table 32). Transcriptomic analysis for selected phytohormone-associated genes showed significant differences in expression using a two tailed t-test between vegetableand oil-use subvarieties of B. juncea (Supplementary Table 33 and Supplementary Fig. 21). These differences may, likewise, contribute to the phenotypic deviations between the two types.

DISCUSSION
B. rapa, B. juncea and B. napus once comprised the three main Brassica oilseed crops worldwide, whereas at present B. rapa is selected primarily as a vegetable, B. juncea both as a vegetable and for oil use, and B. napus for oilseed 32 . Discovery of a possible A-subgenome-diversified origin for B. juncea and B. napus may shed light on the unusual features of selection divergence in Brassica. These insights appear promising for de novo synthesis of neo-polyploid species by introgression of individual A-subgenome types to achieve desired breeding purposes.
We demonstrated evidence of homoeolog expression dominance patterns distinguishing A-subgenome types in Brassica. Although homoeolog expression dominance has been observed in several polyploid species 17,19,22 , a correlation between homoeolog expression dominance and evolutionary rate has not been reported previously. This finding provides important evidence of agricultural selection behavior. More importantly, these observations may facilitate the improvement of agriculturally important traits by focusing selection on the transcriptionally dominant genes.
npg Selective sweep regions distinguishing vegetable-and oil-use subvarieties of B. juncea identified 794 loci, of which 36.3% showed homoeolog expression dominance. This high proportion of genes with expression dominance under selection implies their potential in agricultural trait improvement by precisely associating homoeolog expression dominance genes with target traits (36.3% selected genes of 16.2% homoeolog expression dominance genes). It is reasonable to assume that this methodology could be applied to a broader array of selected traits in other polyploid crops to provide insight into underlying physiological mechanisms.
Polyploidy is particularly common in flowering plants, is recognized as a characteristic of all angiosperm genomes during their evolution 33 and has an essential role in speciation and genomic plasticity 34,35 . The reprogramming of allopolyploid transcriptomes is shown to be triggered predominantly by interspecific hybridization 20 , displaying insight into homoeolog gene expression in phenotypic variability and plasticity. Homoeolog expression dominance or bias appears to be a consequence of genome merger and doubling 22 , but the underlying applicability of the homoeolog expression dominance in agriculture trait selection has not been substantiated to date. We found that homoeolog expression dominant genes have higher Ka/Ks than neutral genes in the allopolyploid B. juncea, consistent with these genes as targets of intensified selection for vegetable-and oil-use varieties of this agriculturally important plant. This observation implies that transcriptional dominance can predate trait selection. The potential linking of homoeolog expression dominance to trait improvement suggests that Brassica breeding programs, and those of other polyploid crops, might benefit from focusing their efforts on the subset of genes with transcriptional dominance, both as a means of enhancing response to selection and toward gaining mechanistic insights.

METHODS
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper. Accession codes. The genome assemblies of B. juncea and B. nigra have been deposited at GenBank: LFQT00000000 and LFLV00000000, respectively (BioProject PRJNA285130). (ref. 63). The gene expression level of individual genes was quantified using RPKM values (fragments per kilobase of exon per million fragments mapped) by Cufflinks 52 . Homoeolog expression dominance analysis was performed within syntenic gene pairs. Differentially expressed genes pairs with greater than twofold change were defined as dominant gene pairs. The dominant genes were the genes that were expressed relatively higher in dominant gene pairs, and the lower ones are the subordinate genes. The rest of the syntenic gene pairs that showed non-dominance were classified as neutral genes. To test whether the occurrences of BjuA dominant gene pairs and the occurrences of BjuB dominant gene pairs are equal, we performed double-side binomial tests on dominant gene pairs for all samples 64 (Supplementary Table 27a,b). Additional details are available in the Supplementary Note. Selective pressure on dominantly expressed genes and subgenomes. All SNP sets were called by GATK 41 for 17 B. juncea accessions with default parameters and filtered out with depth < 3×. Coding region sequence sets were then reconstructed based on high quality SNPs for each sample. To detect selective pressure of each coding gene, the rates of nonsynonymous (dN) and synonymous (dS) (ω = dN/dS) substitutions were estimated site-by-site using the YN00 program with default parameters from the PAML 4.2b package 58 . Each paired gene set of 17 samples was estimated repeatedly. All Ka/Ks of gene pairs were classified to three categories (dominant genes, subordinate genes and neutral genes). Meanwhile All Ka/Ks of gene pairs were separated into BjuA/BjuB subgenomes. To test statistical significance of different data sets, we performed a permutation test on them with 1,000 permutations ( Supplementary  Table 26a,b). Additional details are available in the Supplementary Note.

Detection of selective sweep signals.
Average pairwise diversity (π) and population differentiation statistic (F ST ) were calculated through Bio::PopGen of bioperl package 65 . Selective sweep regions were identified in the 10 vegetableand 7 oil-use B. juncea subvarieties by combining F ST outliers and π ratio outliers (θπ (vegetable-use/oil-use)) with 100 kb sliding windows and 10 kb steps. Adjacent windows extended to 10 kb likely represent the effect of a single divergence region and thus were linked to define a 'candidate gene region' (Supplementary Table 31). Additional details are available in the Supplementary Note.