Genome-Wide Association Study of Glucosinolate Metabolites (mGWAS) in Brassica napus L.

Glucosinolates (GSLs) are secondary plant metabolites that are enriched in rapeseed and related Brassica species, and they play important roles in defense due to their anti-nutritive and toxic properties. Here, we conducted a genome-wide association study of six glucosinolate metabolites (mGWAS) in rapeseed, including three aliphatic glucosinolates (m145 gluconapin, m150 glucobrassicanapin and m151 progoitrin), one aromatic glucosinolate (m157 gluconasturtiin) and two indole glucosinolates (m165 indolylmethyl glucosinolate and m172 4-hydroxyglucobrassicin), respectively. We identified 113 candidate intervals significantly associated with these six glucosinolate metabolites. In the genomic regions linked to the mGWAS peaks, 187 candidate genes involved in glucosinolate biosynthesis (e.g., BnaMAM1, BnaGGP1, BnaSUR1 and BnaMYB51) and novel genes (e.g., BnaMYB44, BnaERF025, BnaE2FC, BnaNAC102 and BnaDREB1D) were predicted based on the mGWAS, combined with analysis of differentially expressed genes. Our results provide insight into the genetic basis of glucosinolate biosynthesis in rapeseed and should facilitate marker-based breeding for improved seed quality in Brassica species.


Introduction
Glucosinolates (GSLs) are secondary metabolites comprising sulfur and nitrogen that are specially produced in Brassica species, providing these plants with their pungent odor [1][2][3]. GSLs also play important roles in plant defense against pests and in human health [4,5]. However, high levels of GSLs affect the quality of seed oil and the nutritional value of seed meal from rapeseed. In addition, some hydrolysates produced from GSLs, such as oxazolidin-2-thione, have toxic effects on human and animal health [6][7][8][9][10]. Thus, further reducing glucosinolate levels in seed meal is an important goal in rapeseed breeding.
Generally, GSLs share the same basic structure, including β-D-thiosaccharide and (Z)n-hydroxamic sulfate, but have variable R-side chain groups due to the precursor amino acids [11]. The lengths and modifications of these variable R-side chain groups determine the chemical properties of each GSL. Correspondingly, GSLs can be divided into three types based on the source of precursor amino acids, for example, aliphatic GSLs derived from alanine (Ala), leucine (Leu), isoleucine (Ile), methionine (Met) and valine (Val); indole GSLs derived from tryptophan (Trp); and aromatic GSLs derived from phenylalanine (Phe) and tyrosine (Tyr) [12][13][14]. To date, over 200 GSLs have been discovered [15,16], which were usually synthesized through three steps. The first step is the lengthening of the progenitor amino acid side chain. This mainly occurs during the biosynthesis of aliphatic and aromatic

Genome-Wide Association Study of Glucosinolate Metabolites (mGWAS)
To avoid identifying the false-positive associations in mGWAS, we selected six models to identify significant associations between phenotypes and genotypes. Based on the QQ plots of the six models ( Figure S1), we performed GLM with Q model and MLM with Q+K model for the GWAS to mine more candidate genes of GSL metabolites, respectively. Herein, we used 239,945 high-quality SNPs (minor allele frequency (MAF) > 0.05 and the call frequencies <0.8) for the association analysis. The association signals were determined using a p-value < 4.17 × 10 −6 . The results of mGWAS for the six glucosinolate metabolites in two years (2017cq and 2018cq) and BLUP (best linear unbiased prediction) values ( Figure 3, and Tables S1-S3) are summarized below. cq, Chongqing environment; 1 No., metabolite ID; 2 Env., environment; 3 Min., minimum; 4 Max., maximum; 5 Avg., average; 6 SD, standard deviation; 7 CV, coefficient of variation; 8 Skew., Skewness; 9 Kurt., Kurtosis; 10 H', Shannon-Wiener diversity index; 11 Fge, the F-values for G × E for glucosinolate content. Asterisks indicate significant differences (**, p < 0.01). 12 Her., heritability. Metabolite content is expressed in µg/g FW.

Genome-Wide Association Study of Glucosinolate Metabolites (mGWAS)
To avoid identifying the false-positive associations in mGWAS, we selected six models to identify significant associations between phenotypes and genotypes. Based on the QQ plots of the six models ( Figure S1), we performed GLM with Q model and MLM with Q+K model for the GWAS to mine more candidate genes of GSL metabolites, respectively. Herein, we used 239,945 high-quality SNPs (minor allele frequency (MAF) > 0.05 and the call frequencies <0.8) for the association analysis. The association signals were determined using a p-value < 4.17 × 10 −6 . The results of mGWAS for the six glucosinolate metabolites in two years (2017cq and 2018cq) and BLUP (best linear unbiased prediction) values ( Figure 3, and Tables S1-S3) are summarized below.  For the aliphatic GSL m145, we identified 1597, 2543 and 681 significantly associated SNPs based on 2017cq, 2018cqs and BLUP values, respectively, 389 SNPs of which were repeatedly detected in this study (Figure 3a, Figure S2a and Table S1). These significant SNPs primarily covered 79 candidate intervals located across the entire B. napus genome (Table S3). Importantly, two significant regions with high SNP densities were located on chromosomes A09 (17.88~22.59 Mb) and A06 (10.98~17.71 Mb) at the intervals designated qGSL-A09-5 and qGSL-A06-3, respectively. We found that the most significant SNPs (S6_16080408 and S9_18653335) could explain 49.67% and 43.75% of phenotypic variance, respectively. Therefore, we used these two intervals to predict the candidate genes to control the accumulation of the aliphatic GSL m145 in seeds (Table S3). In addition, 49 candidate regions were only associated with m145 in one or any two years and BLUP values. We believe that these remaining candidate regions represent minor-effect intervals influencing the accumulation of m145 (Table S3). These findings seem to suggest that m145 accumulation is strongly affected by environmental factors.
For the aliphatic glucosinolate, we identified m150, 2080, 2620 and 3444 significantly associated SNPs based on 2017cq, 2018cq and BLUP values, respectively, and 1431 SNPs were repeatedly detected in two years and BLUP values (Figures 3b and S2b and Table S1). These significant SNPs primarily covered 94 candidate intervals across the entire B. napus genome (Table S3). Most repeated significant SNPs (1168/1431) were located in the interval qGSL-A09-5 with the most significant SNPs (S9_18653335) that explained 47.71% of phenotypic variance. In addition, the SNPs in the intervals qGSL-A01-4, qGSL-A06-2 and qGSL-C08-5 were repeatedly detected, which all explained more than 35% of the phenotypic variance (Table S1). Therefore, we believe that these regions are important interval regions for identifying the candidate genes for controlling the accumulation of m150. Furthermore, in total, 51 candidate regions are detected for m150, at least in one or any two years and BLUP values, suggesting that these remaining candidate regions are minor-effect intervals influencing the accumulation of m150 (Table S3).
For the aliphatic glucosinolate m151, we identified 977, 629 and 747 significantly associated SNPs based on 2017cq, 2018cq and BLUP values, respectively, while only 225 SNPs were repeatedly detected in this study (Figure 3c and Figure S2c and Table S1). These significant SNPs primarily covered 67 candidate intervals across the B. napus genome, except for chromosome A01 (Table S3). Among the 225 SNPs, highly significant SNPs were repeatedly detected at intervals qGSL-A06-3, qGSL-A09-1, qGSL-A09-5 and qGSL-C07-6, which all explained more than 30% of phenotypic variance. Therefore, these regions should be considered as major interval regions controlling the accumulation of m151. In addition, 46 candidate regions with minor effects are detected for m151, at least in one or any two years and BLUP values (Table S3).
For the aromatic glucosinolate m157, we identified 311, 550 and 1403 significantly associated SNPs based on 2017cq, 2018cq and BLUP values, respectively, 19 SNPs of which were repeatedly detected in this study (Figure 3d and Figure S2d and Table S1). These significant SNPs primarily covered 99 candidate intervals across the entire B. napus genome (Table S3). One significant association locus on chromosome C04 (named qGSL-C04-2) was detected, with the peak SNP S14_6056639, which explained 41.07% of phenotypic variance and was repeatedly detected in different environments (Table S3). Therefore, we used this interval region of chromosome C04 (5.85~6.27 Mb) to predict the candidate genes for controlling the accumulation of m157 in seeds. The interval region on chromosome A09 (named qGSL-A09-5) was also detected in different years, with the peak SNP S9_18294706, which explained 37.58% of phenotypic variance. Furthermore, S7_8312132 explaining 40.84% of phenotypic variance was identified and located on chromosome A07, named qGSL-A07-3. Thus, we believe that the interval regions (qGSL-A09-5, qGSL-A07-3 and qGSL-C04-2) are also closely related to the accumulation of m157 in rapeseed. Correspondingly, 79 candidate regions with minor effects were also detected for m157, at least in one or any two years and BLUP values (Table S3).
For the indole glucosinolate m172, we identified 211, 132 and 1035 significantly associated SNPs based on 2017cq, 2018cqs and BLUP values, respectively, including 47 repeatedly detected SNPs (Figure 3e and Figure S2f and Table S1). These significant SNPs primarily covered 50 candidate intervals across the B. napus genome, except the chromosome C09 (Table S3). For 47 SNPs, 6 significant SNPs were mapped on chromosome A01 (named qGSL-A01-4) with peak SNP S1_19207516, which explained 42.78% of phenotypic variance, and the peak SNP S9_18611734 was mapped on chromosome A09 (named qGSL-A09-5), explaining 38.62% of phenotypic variation. Therefore, these interval regions were used to predict the candidate genes for controlling the accumulation of m172 in seeds (Table S3). In addition, 43 candidate regions were detected for m172, at least in one or any two years and BLUP values (Table S3).

Candidate Gene Mining
Based on the physical positions of the significant SNPs, we mapped the intervals to the corresponding chromosomes of the reference genome B. napus Darmor-bzh (v4.1) [49] and searched for candidate genes for each significant locus based on the mGWAS data combined with the differentially expressed genes in BnHG vs. BnLG. In total, 187 candidate genes were identified and predicted for GSLs in this study, including genes encoding enzymes in the GSL biosynthesis pathway and their homologs, transcription factor genes and some novel candidate genes, respectively (Table S4).
Based on the known GSL biosynthesis pathway, we identified 64 candidate genes in the GSL biosynthesis pathway (including genes encoding enzymes in this pathway and their homologs) among the intervals, which showed differential expression during seed development in BnHG vs. BnLG (Figures 4 and 5, Table S4).
Altogether, numerous key homologous genes, including known genes (such as IMD1, MAM1, IGMTs, MYB34, MYB51, MYB122 and so on) and novel genes (MYB44, ERF025, ARF3, NF-YB10, E2FC and so on) were identified within the confidence intervals of GSLs. Our findings not only demonstrate the reliability of the association genetics approach, but also provide new insight into elucidating the biosynthesis of these GSLs in B. napus seeds.

Discussion
GSLs are well-known secondary metabolites that play important roles in plant defense against diseases and insects and in human nutrition/health [1][2][3]72]. However, some glucosinolates in seed meal have deleterious effects on poultry and livestock, leading to efforts to develop low-glucosinolate Brassica crops [43,73]. Based on the functional group of amino acids, GSLs are divided into three major types, namely aliphatic GSLs, indole GSLs and aromatic GSLs, respectively [12][13][14]. Aliphatic GSLs comprise a major proportion of total GSLs in seeds [74][75][76]. In this study, we identified six major GSL metabolites in rapeseed, including three aliphatic GSLs (m145 gluconapin, m150 glucobrassicanapin and m151 progoitrin), one aromatic GSL (m157 gluconasturtiin) and two indole GSLs (m165 Indolylmethyl glucosinolate and m172 4-hydroxyglucobrassicin) (Figures 1 and 2, Table 1), respectively. However, among the six GSLs examined, four showed highly significant differences in abundance in BnHG vs. BnLG seeds, whereas two indole GSLs did not (Figure 2a). It seems that aliphatic GSLs account for most of the total GSL content, comprising an average of 1539.35~1975.56 µg/g FW in both years of the study. We detected a higher positive correlation between the contents of aliphatic GSLs and aromatic GSLs vs. indole GSLs (Figure 1c), supporting the notion that aliphatic GSLs and aromatic GSLs are somewhat correlated in B. napus seeds. However, deeper knowledge of the specific GSLs in B. napus seeds is required to help breeders improve the current varieties and select plants with advantageous properties. Unlike in the model plant Arabidopsis thaliana, many copies of homologous genes involved in the GSL biosynthesis pathway are present in the rapeseed genome, as B. napus is an allopolyploid plant with a complex genome [49]. However, numerous studies, including quantitative trait locus (QTL) mapping and candidate gene identification, have investigated the mechanisms involved in GSL production in rapeseed. For example, numerous QTL for GSLs in leaves and seeds of B. napus have been detected [77,78]. For seed GSL content, Xu et al. [79] and Wang et al. [80] also identified the significantly associated sites located on chromosomes A9, C2 and C9, respectively. We previously identified 11 significant SNPs associated with seed GSL accumulation in B. napus located on chromosomes A08, A09, C03 and C09 [81]. Tan et al. [82] recently detected 15 reliable quantitative trait loci (QTLs) for seed GSL content via a GWAS. With the rapid development of gene chip and genome sequencing technology, mGWAS combined with genomic and transcriptomic analysis is suitable for studying the metabolism, genetic characteristics and biochemical properties of plants, and has been widely used for structural and functional analysis of metabolites and functional genomics [83][84][85]. In the current study, we detected 113 candidate intervals with significant associations with glucosinolate metabolites in rapeseed via mGWAS ( Figure 3, Table S3). Of these, 13 candidate intervals were significantly associated with only a single metabolite, while 100 were associated with multiple metabolites. Importantly, we repeatedly detected the significant interval regions on chromosomes A02, A08, A09, C03, C07 and C09, which is consistent with previous studies [77,78,81,82,[86][87][88][89]. We also identified many new candidate intervals located on chromosomes A04, C01, C05, C06 and C08, further demonstrating the power of our approach (Figure 2, Table S3).
In conclusion, we identified 113 interval regions for six glucosinolate metabolites in B. napus seeds (Figure 3, Table S3). We also identified 187 candidate genes for GSL biosynthesis and accumulation. Our results enrich our knowledge of the GSL biosynthesis pathway and provide candidate genes for improving the compositions of specific GSLs in rapeseed, which could be precisely modified by gene editing in the future.

Glucosinolate Matbolite Extraction
The raw metabolites were extracted from fresh seeds described in our previous research [52], with minor modifications. In short, about 100 mg of fresh seeds was crushed into powder using high-throughput tissue grinder (Tissuelyser-192, Shanghai, China). Subsequently, we added the 500 µL extract solution (80% aqueous methanol with 0.1% formic acid) and homogenized by vortex for 10 s. The homogenized extraction buffer was extracted using sonication (KQ-100E, Kunshan, China) at 4 • C for 1 h, followed by centrifugation at 12,000× g at 4 • C for 10 min. Then, the residues were repeatedly extracted. Eventually, the mixed liquid supernatants were used for UPLC-HESI-MS/MS analysis after being filtered by a 0.22 µm nylon filter. All experiments were performed in at least three replicates for each accession.
The glucosinolate metabolites were detected in two consecutive years (2017 and 2018) with replications; we further obtained the best linear unbiased prediction (BLUP) of six glucosinolate metabolites per accession using a linear model using an R script (http://www.eXtension.org/pages/61006, accessed on 14 October 2020), respectively. The content of six glucosinolate metabolites in the 2017 and 2018 growing seasons and resulting values of BLUP were used as phenotypes for GWAS, respectively. The heritability was calculated by using the multi-year repeated model. In addition, the quantitative data of glucosinolate metabolites were divided into 10 grades, and Shannon-Wiener Diversity Index was used to calculate metabolites [112][113][114]. The Pearson correlation coefficients among six glucosinolate metabolites were calculated by the R language psych package and significance tests (Student t-test) were performed [115]. Statistical analysis of glucosinolate metabolites was performed using Microsoft Office Excel 2009. The G × E analysis of glucosinolate metabolites was performed using AMMI model by R package "agricolae".

GWAS of Glucosinolate Metabolites
Detailed methods used for SNP genotyping and mapping were previously described [46][47][48]. In brief, DNA libraries with a mean insert size of 350 bp were constructed, and 125 bp paired-end reads were generated using an Illumina HiSeq 4000 instrument at the Biomarker Technologies Corporation (Beijing, China). Low-quality bases from paired-end reads were trimmed using Trimmomatic (version 0.33) and mapped to the rapeseed genome 'Darmor-bzh' [49] using the Burrows-Wheeler Aligner (version 0.7.10-r789). Then, we used Picard (release 2.0.1, http://broadinstitute.github.io/picard/, accessed on 26 July 2018) and GATK (version 3.2) to process local realignment and base quality detection for the alignment results, sequentially. Further, we then used AMtools mpileup (version 0.1.19-44428cd) and GATK to perform SNP calling. In total, 239,945 high-quality SNPs with a minor allele frequency (MAF) <5% were used for further analysis. The six models, including naïve, Q, K, PCA, K + Q and K + PCA model, were applied to determine the statistical associations between phenotypes and genotypes. Quantilequantile (QQ) plots were used for false-positive correction for association analyses. In this study, genome-wide association analysis for six glucosinolate metabolites was carried out using the GLM with Q model and MLM with Q + K model by TASSEL 5.2.1 software. The population structure Q matrix was completed by admixture_linux-1.3.0 software [116]. The significant signal of the associations between SNPs and the glucosinolate metabolites was assessed based on the threshold P < P = 1/N (where N is 239,945 SNPs in this study), and the threshold of significance was set to p < 4.17 × 10 −6 .

Annotation of Candidate Genes
The significant interval regions were mapped to the Darmor-bzh reference genome (version 4.1, http://www.genoscope.cns.fr/brassicanapus/data/, accessed on 26 October 2022) [49], which were anchored by the physical position of significant SNPs. Correspondingly, 200 kb flanking region of association SNPs was used as the candidate interval region. Then, the annotated genes in the interval regions were used for screening the candidate genes, and further confirmed by gene expression analysis. Herein, the gene expression levels were calculated as FPKM (Fragments Per Kilobase of transcript sequence per Millions base pairs) with the featureCounts tool in Subread [117]. Finally, correlations between the GSL phenotype and the gene expression profiles were detected to determine the candidate genes that were associated with the glucosinolate metabolites.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/plants12030639/s1, Figure S1: The QQ plots of 6 glucosinolate metabolites; Figure S2: Manhattan plots of association analysis for six glucosinolates using Q+K model; Figure S3: The significant SNPs associated with 6 glucosinolates in different environments; Figure S4: Correlation analysis of six glucosinolate metabolites in 2017cq and 2018cq; Table S1: Summary of repeatedly detected SNPs for 6 glucosinolate metabolites by mGWAS;  Table S7: Statistical analysis of 6 glucosinolate metabolites in high-, medium-and low-glucosinolate-content rapeseed