Genome-wide association study uncovers genomic regions associated with grain iron, zinc and protein content in pearl millet

Pearl millet hybrids biofortified with iron (Fe) and zinc (Zn) promise to be part of a long-term strategy to combat micronutrient malnutrition in the arid and semi-arid tropical (SAT) regions of the world. Biofortification through molecular breeding is the way forward to achieving a rapid trait-based breeding strategy. This genome-wide association study (GWAS) was conducted to identify significant marker-trait associations (MTAs) for Fe, Zn, and protein content (PC) for enhanced biofortification breeding. A diverse panel of 281 advanced inbred lines was evaluated for Fe, Zn, and PC over two seasons. Phenotypic evaluation revealed high variability (Fe: 32–120 mg kg−1, Zn: 19–87 mg kg−1, PC: 8–16%), heritability (hbs2 ≥ 90%) and significantly positive correlation among Fe, Zn and PC (P = 0.01), implying concurrent improvement. Based on the Diversity Arrays Technology (DArT) seq assay, 58,719 highly informative SNPs were filtered for association mapping. Population structure analysis showed six major genetic groups (K = 6). A total of 78 MTAs were identified, of which 18 were associated with Fe, 43 with Zn, and 17 with PC. Four SNPs viz., Pgl04_64673688, Pgl05_135500493, Pgl05_144482656, and Pgl07_101483782 located on chromosomes Pgl04 (1), Pgl05 (2) and Pgl07 (1), respectively were co-segregated for Fe and Zn. Promising genes, ‘Late embryogenesis abundant protein’, ‘Myb domain’, ‘pentatricopeptide repeat’, and ‘iron ion binding’ coded by 8 SNPs were identified. The SNPs/genes identified in the present study presents prospects for genomics assisted biofortification breeding in pearl millet.

Population structure and linkage disequilibrium. Dissection of the population structure of the association panel using SNP markers revealed a total of six (K = 6) genetic groups at the corresponding least cross validation error (CV error) of 0.659 (Fig. 3A). Among the six subgroups, group VI (orange) was the largest that consisted of 53 inbreds, followed by group I (blue) with 51, group III (red) with 50, group V (yellow) with 48, group IV (green) with 47 and the group II (purple) with 32 inbred lines (Fig. 3B). The linkage disequilibrium (LD) between each pair of SNPs across each chromosome was evaluated by the squared Pearson correlation coefficient (R 2 ). A set of 58,719 SNPs with identified physical positions were used for LD analysis (Fig. 4). The pairwise LD across each chromosome showed that the LD (R 2 ) ranged from 0 to 1 with the average LD across the genome being 0.116. Furthermore, chromosome-wise average LD varied in the order of 0.151 > 0.138 > 0.129 > 0.118 > 0.107 > 0.087 > 0.081 for chromosomes Pgl03, Pgl07, Pgl04, Pgl06, Pgl02, Pgl01, and Pgl05, respectively. The LD for 18,80,476 pairwise combinations obtained from 58,719 marker loci across the genome showed that 57% of SNP pairs showed < 0.01 R 2 , whereas 37% of SNP pairs showed 0.01-0.05 R 2 , and only 6% of SNP pairs showed 0.06-0.1 R 2 . Linkage disequilibrium-decay (LDD) across seven chromosomes was determined using the entire set of 58,719 DArT seq markers. The LDD was plotted as LD (R 2 ) between the adjacent pair of markers on the Y-axis against the distance in base pairs (bp) on the X-axis (Fig. 5). The R 2 threshold level was set to 0.2 and observed rapid LDD across the pearl millet genome with an average LDD of 2.9 kb (2900 bp). Among the seven chromosomes, the shortest LDD was observed in chromosome 1 with 0.2 kb (200 bp, R 2 = 0.2) and the longest LDD was observed in chromosome 6 with 9 kb (9000 bp, R 2 = 0.2). Table 1. Estimates of mean, variance, range and heritability for pooled analysis of phenotypic evaluation of 281 inbred lines across 2017 rainy and 2018 summer, ICRISAT, Patancheru. CV, coefficient of variation; SEm, Standard error of mean; * and **, F-values significant at 0.05, 0.01 probability level.  www.nature.com/scientificreports/ Genome-wide association study. A genome-wide association mapping was performed using 58,719 high-quality SNPs with less than 30% missing data having a call rate of more than 0.7. These SNPs covered around 301 Mb of pearl millet genome and were distributed across the seven chromosomes of pearl millet with a minimum of 6534 SNPs on chromosome 7 to a maximum of 10,942 SNPs on chromosome 2. SNP genotyping data of 58, 719 SNPs along with information on population structure and kinship matrix were used for genomewide association analysis against Fe, Zn, and PC in grains for the pooled data across the 2017 rainy season and 2018 summer season. Among two models used for GWAS, the general linear model (GLM) considering only population structure (Q) showed high genomic inflation (Fig. 6), whereas the mixed linear model (MLM) which considers both population structure and family relatedness (K) showed low genomic inflation and thus helped overcome the number of false-positive associations for Fe, Zn, and PC. Therefore, significant marker-trait associations (MTAs) finalized based only on MLM are presented here. The threshold level of 'P' value was set to 3.0, above which the SNPs are said to be significantly associated. However, a total of 43 significantly associated markers were identified for Zn with 'P' values ranging from 2.24 × 10 -5 to 9.78 × 10 -4 . Furthermore, the phenotypic variation explained by these SNPs ranged from 5.09 to    Candidate genes associated with grain Fe, Zn, and PC. Pearl millet genome sequencing reported a total of 69,398 genes and unraveled the involvement of several genes in the control of both agronomically and nutritionally important traits. The physical positions of each SNP marker from the present study were compared against the pearl millet genome sequence to determine the function of the gene underlying the respective SNP. A total of 18 SNPs associated with Fe were found linked ( Table 2 and Supplementary Table S3) to different genes viz., Like-Sm ribonucleoprotein (LSM) domain, late embryogenesis abundant protein, zinc finger, ankyrin repeat, leucine-rich repeat, pentatricopeptide repeat, oligopeptide transferase, and basic leucine zipper which were found to play a significant role in plant metabolism, including iron homeostasis. Similarly, the SNPs associated with the genes viz., protein kinase, Myb transcription factor, glycosyl transferase, chalcone/stilbene synthase, heat shock protein (HSP70), peptidase, copper domain, male sterility, etc., were found to be unique to Zn while protein binding, lipid binding, protein kinase activity, and iron ion binding genes were found associated with SNPs identified for PC.

Discussion
Developing biofortified hybrids in pearl millet requires high Fe and Zn content in both the parents since it's governed by additive gene 25 . It is highly feasible to develop biofortified inbred lines through inbreeding which accumulates more of additive variances in subsequent generations. The strong epigenetic influence on these traits expression and sample contamination during handling of breeding materials is a challenge for biofortification in pearl millet 26,27 . The process of identifying molecular markers, preferably SNPs tightly linked to genomic regions of Fe, Zn and PC, will enhance the efficiency of biofortification using genomics assisted breeding. Recently, several genomic regions controlling the inheritance of Fe and Zn have been identified through QTL mapping 28 using DArT and SSR markers and also through LD-based association mapping 29 by SSR markers in pearl millet. Though SSRs are preferred markers, their resolution is relatively low 17 . None of the previous studies have reached the gene level; therefore, the present study aimed to dissect the genetic nature of Fe, Zn and PC in pearl millet using GWAS by exploiting the DArT seq markers to discover the genomic regions and candidate genes influencing Fe, Zn and PC. Grain Fe and Zn content are strongly influenced by the available Fe and Zn content in the soil. The available soil Fe and Zn content in our experimental field was above the critical levels (2.6 to 4.5 mg kg -1 Fe and 0.6 to 1.0 mg kg -1 Zn) required for normal growth and development 30,31 . Three to fourfold significant variations for Fe (32-120 mg kg −1 ), Zn (19-87 mg kg −1 ) and twofold variation for PC (8-16%) in 281 elite inbred lines prospects the breeding feasibility (Supplementary Table S5). Similar variability for Fe/Zn has been reported among germplasm 32 , breeding lines 15 , and commercial cultivars 33 . High genetic variance for Fe/Zn indicates the least influence of G × E. Population structure along with shared co-ancestry coefficients between individuals of subdivisions of a population were estimated using ADMIXTURE 1.23 73 . A total of six genetic groups were  www.nature.com/scientificreports/ formed among 281 inbred lines with some admixtures indicating common allelic combinations in the genomic background of few genotypes. The availability of six subgroups and wide phenotypic variation observed for Fe, Zn, and PC indicated that the present GWAS panel is best suited for genome-wide association study to dissect the genetic basis of high Fe, Zn accumulation, and PC in pearl millet. LD is the non-random association of alleles at two or more loci and acts as a critical genetic force in determining population structure 34,35 . The LD of a population is the result of evolutionary changes in a population that would help in mapping quantitative traits such as Fe, Zn and PC more precisely while it also gives insights into the joint evolution of the linked sets of genes. The pattern of LD across the genome ultimately decides the success of association studies 36,37 . In the present study, the average pairwise LD (R 2 ) across the genome decreased rapidly against the increasing distance (bp). Rapid LDD has been reported in earlier studies in pearl millet 2,38 . Chromosomes Pgl01, Pgl02, Pgl03, Pgl05, and Pgl07 showed relatively more rapid LDD (~ 0.64 kb) compared to Pgl04 and Pgl06, suggesting that a larger number of markers are required for chromosomes Pgl01, Pgl02, Pgl03, Pgl05, and Pgl07 for GWAS. The gene-rich genomic region tends to have a higher rate of recombination. Thus the LDD would be higher in such genomic regions, requiring a higher marker density for LD analysis in such regions. Of 18,80,476 pairwise LD analysis, 57% of the SNP pairs showed an LD of less than 0.01 (R 2 = 1%), indicating that the LD in the current GWAS panel is relatively low. This could probably be because pearl millet is a highly cross-pollinated (> 80%) species, wherein some portion of the genome is bound to have heterozygosity (not every locus is heterozygous) as genetic load by the inbreeding process 39 . The low LD is also due to frequent recombination and higher inbreeding depression by virtue of being a cross-pollinated crop. The low value of LD in turn gives the high resolution of mapping but requires a large number of markers 40 .
While performing GWAS, care should be taken to avoid false associations arising from false positives (Type I error). In the present study, two extensively used statistical models, GLM 41 and MLM 42,43 , were used for the MTA. The MLM model is more efficient and superior in reducing false positive associations by correcting for both population structure (Q) and kinship matrix (K) which can be further visualized through Quantile-Quantile (Q-Q) plots to show low genomic inflation for MLM compared to GLM (Fig. 6). However, sometimes MLM tends to overcompensate for both population structure and kinship, which could lead to false negatives, type II   44,45 . This means that the identification of some MTAs depends on the model used 46 . Therefore, the present study used both GLM and MLM and found that > 70% of SNPs from MLM were common in GLM, with some additional markers that were absent in GLM. Therefore, the results obtained from the MLM model are presented. None of the MTAs met the Bonferroni criteria because of the utilization of 0.058 Million markers generated through the GBS method. The Bonferroni correction would be too stringent to use as not all the markers are independent 47 and may lead to false negatives 48,49 . Among the significantly associated SNPs for Fe, marker Pgl05_135500493 on chromosome Pgl05 explained the highest phenotypic variation (8.23%). For Zn, markers Pgl07_101483782, Pgl07_101483780, and Pgl07_147179490 exhibited more than 7.5% of phenotypic variation. However, the SNPs identified for PC explained the relatively lower phenotypic variation, wherein the highest phenotypic variation was explained by the SNP Pgl07_71295563 on chromosome Pgl07 (~ 6%). Interestingly, there were four SNPs discovered to be common for both Fe and Zn content on chromosomes Pgl04, Pgl05, and Pgl07 that cumulatively explain about 27.12% and 25.32% of phenotypic variation for Fe and Zn, respectively. The co-localization of both Fe and Zn and highly significant positive correlation between them further suggested some common genes and pathways involved in Fe and Zn homeostasis in plants i.e., from root absorption to till deposition in grains. A common set of markers for Fe and Zn has been reported in pearl millet 29 on LG 3, LG 5, and LG 7. QTLs responsible for Fe and Zn have been co-mapped on LG 1 and LG 7 28 ; these probably indicate that chromosome Pgl05 and Pgl07 are likely to control Fe and Zn transport and accumulation in pearl millet. Though no common MTAs were identified for PC with Fe and Zn, the positive significant correlation of PC with both Fe and Zn suggested that the selection for high Fe/Zn expected to increase PC as an associated trait. www.nature.com/scientificreports/ To know the conformity of the identified MTAs in this study, they were compared to previous genetic mapping studies for Fe and Zn in pearl millet. SNPs were identified for Fe and Zn in this study were concomitant to reported studies in pearl millet (Table 3). For instance, Anuradha et al. 29 reported that Fe was highly influenced by the genes on chromosomes Pgl05 and Pgl07, whereas Kumar et al. 28 identified genomic regions for Zn on chromosomes Pgl01 and Pgl04 in pearl millet. Zn content was also influenced by the SNPs on chromosome Pgl03, Pgl04, Pgl05, Pgl06 and Pgl07. Similar results were reported earlier by Anuradha et al. 29 while Kumar et al. 28 reported genomic regions on LG 1, 4, 5, and 7. This evidence suggests that the SNPs identified on chromosomes were consistent with the previously reported markers which might have a significant role to play in the expression of Fe and Zn content. This calls for fine mapping of these genomic regions that would ultimately provide candidate SNPs for use in marker-assisted breeding to improve grain Fe and Zn. Apart from pearl millet, genomic regions were also discovered for grain Fe and Zn content in other millets and cereals such as rice 22 , foxtail millet 50 , maize 21 , wheat 23 , through genome-wide association mapping. Genetic mapping studies have discovered genomic regions for grain Fe and Zn content in sorghum 51 , maize 52 , and wheat 53 . Hence different genomic regions in this study can be introgressed for trait improvement in pearl millet based on the targeted environment, depending on common MTAs. This is the first report on the discovery of genomic regions using GWAS for PC in pearl millet. The findings will generate research interest to further investigate the regulation of grain PC in pearl millet. A total of 17 MTAs were identified on six chromosomes (Pgl01, Pgl02, Pgl04, Pgl05, Pgl06 and Pgl07) of pearl millet, among which Pgl06_71295563 showed the highest phenotypic variation of 5.86% with a 'P' value of 3.46 × 10 4 . Similar genomic regions have been reported for PC in previous studies in maize 54 , rice 55,56 , and wheat 57 .
Gene annotation was performed by comparing the sequence reads of significantly associated SNPs at their respective physical positions against the reference genome of pearl millet. The genes identified in the present study and their functional roles in Fe and Zn metabolism in plants reported through previous studies are presented in Table 4. There were several genes identified, among which very few were involved in Fe transportation, accumulation, and homeostasis. The SNP Pgl07_147858723 corresponding to glutathione S-transferase plays a significant role in iron starvation in roots. In the roots of hexaploid wheat, a significant temporal increase in glutathione S-transferase was observed at both transcriptional and enzymatic activity levels, which established the foundation for designing breeding strategies to improve Fe nutrition in pearl millet. The SNP Pgl02_69256531 and Pgl06_231796045 were found in the region of the MYB-domain. Palmer et al. 58 observed that the MYBdomain plays a significant role in plant survival under Fe deficiency conditions, and is the most highly induced transcription factor which acted early in the Fe deficiency regulatory cascade to drive gene expression of NAS4. Shen et al. 59 isolated MYB gene MxMYB1 from Malus xiaojinensis. The expression of MxMYB1 was up-regulated by Fe starvation in the roots but not in the leaves, signifying that MxMYB1 likely to play more in iron absorption from soil to roots and not likely from root to leaves. The SNP Pgl04_190105720 corresponding to the Zinc finger plays a crucial role in preventing toxic ion damage and hence performs an important role in maintaining cellular osmotic adjustment and enzyme activities, leading to significantly improved salt stress tolerance 60 .
The significant phenotypic variability observed in the association panel coupled with high marker density across all chromosomes provided a strong case for whole-genome association mapping of the three (Fe, Zn, and PC) important nutritional traits in pearl millet. This GWAS study which identified marker-trait associations for Fe, Zn, and PC using the genotyping-by-sequencing platform presents greater prospects for utilization and traits mainstreaming. Rapid LDD observed in the current GWAS panel indicates that the SNPs identified through genome-wide association mapping are more reliable and complement previously reported QTLs in pearl millet. Pgl05_135500493 and Pgl05_144482656 SNPs for Fe; Pgl07_101483782, Pgl07_101483780 and Pgl07_147179490 SNPs for Zn, and Pgl06_71295563 SNPs for PC were found promising. Significant phenotypic correlations between Fe and Zn support simultaneous selection and improvement. This linkage and the identified co-localized MTA suggest there is a common physiological pathway. These MTAs help to move towards fine mapping and discovering a set of diagnostic markers to screen segregating population (F 2 /F 3 s) in order to avoid expensive phenotyping and G × E effects in future. Eight MTAs that were identified for Fe and Zn were found to be involved in Fe mobilization. Thus, the promising MTAs identified in the present study merit further validation in different genetic backgrounds of breeding lines and populations. Eleven inbred lines had ≥ 80 mg kg −1 of Table 3. QTLs reported from earlier studies for iron (Fe), zinc (Zn) in pearl millet and co-localized associated marker trait associations (MTAs) identified the same genomic region in present study.

SN
Genetic mapping Trait Linkage group (LG)/chromosome MTAs on respective chromosomes from current study Author www.nature.com/scientificreports/ Fe, > 60 mg kg −1 Zn, and > 13% of PC that meet global targets and will serve as trait sources in elite backgrounds. Such lines will be easily converted into CMS (maintainers) to make hybrids with high-Fe/Zn/PC restorers for fast-track product development. The inbred panel studied that is part of hybrid parents at ICRISAT. This will enhance the introgression of these traits to develop high-yielding hybrids through marker-assisted back-crossing (MABC) in India where hybrid cultivars are dominant, while in Sub-Saharan Africa where open-pollinated varieties (OPVs) are predominant, it will be done through marker-assisted recurrent selection (MARS) and marker-assisted population improvement (MAPI).  Table S1).

Estimation of grain iron and zinc content. For grain sampling, open-pollinated main panicles from
five representative plants per plot were harvested at physiological maturity (85-90 days after planting). These panicles were stored separately in a cloth bag and sundried for 10 to 15 days, and then hand threshed to produce clean grain samples for micronutrient analysis (Fe and Zn). Utmost care was taken to avoid contact with iron equipment while threshing and handling of threshed samples. Grain Fe and Zn content were analyzed using Inductively Coupled Plasma Optical Emission Spectrometry (ICP-OES) at Flinders University, Australia, following the method described by Wheal et al. 61 . Grain samples were finely ground and oven-dried at 60 °C for 48 h before analyzing them for Fe and Zn. A ground sample of 0.2 g was transferred into 25 ml polyprophelene PPT tubes with 2.0 ml of concentrated nitric acid (HNO 3 ) and 0.5 ml of 30% hydrogen peroxide (H 2 0 2 ). These samples were wetted and predigested overnight at room temperature. Samples were placed in the digestion block and heated at 80 °C for 1 h, followed by digestion at 120 °C for 2 h. After digestion, each sample digest was turned into 25 ml using distilled water. The digests were filtered using Whatman no.1 filter paper and the filtrate was used to estimate Fe and Zn content using ICP-OES.
Estimation of grain protein content. Grain protein content was analyzed using Near-Infrared Spectroscopy (NIRS) at ICRISAT. The quantified grain protein 62 content was measured in percentage. The grain samples collected were cleaned thoroughly and about two to three grams of whole grain samples were poured in a small cup. The cup was then placed in the NIRS machine and the sample was run for a minute. The readings were then noted.  www.nature.com/scientificreports/ DNA extraction and genotyping using DArT seq. Genomic DNA was isolated from tender leaf tissues of 30 day-old seedlings 64 . The quality and quantity of the extracted DNA were checked on 0.8% agarose gel using gel electrophoresis at 80 V using λ-DNA standards. The DNA was subsequently diluted to a volume 50 μl of concentration 50 ng/μl. The samples were then sent to the Diversity Arrays Technology (DArT) Pty Ltd, Australia 65 for genotyping using DArT markers. The DArT seq assay, an efficient genotyping-by-sequencing platform was employed in the present study. In brief, the DNA samples were digested and ligated primarily with two different adaptors accompanying to overhang by two different restriction enzymes 66  Phenotypic data analysis. The analyses of variance was performed over the rainy season 2017 and summer season 2018 using generalized linear model procedures following a random-effects model 69,70 in SAS University Edition (SAS/STAT, SAS Institute Inc, NC, USA) 71 . Heritability 72 was determined using the following formula:

Estimation of Fe and
where σ 2 g is the genotypic variance, σ 2 gs is the genotype × season interaction variance, and σ 2 e is the residual variance; 'r' is the number of replications, and 's' is the number of seasons. Mean and coefficient of variation (CV) were also determined using the standard procedure implemented in the SAS University Edition. Pearson's correlation coefficients among the traits were calculated using the PROC CORR procedure in R version 3.5.1 (R Project for Statistical Computing, (https ://www.r-proje ct.org). The standard error of the mean (SEm) was determined in a simple excel program using the following formula: where 'MSS' is the Mean sum of square and 'n' is the number of samples.
Population structure, kinship and genome-wide linkage disequilibrium. Population structure was determined using ADMIXTURE 1.23 software 73 . The number of genetic clusters (K) was predefined as 1 to 10 to explore the population structure of the tested accessions. This analysis provided maximum likelihood estimates of the proportion of each sample derived from each of the K populations. The optimum K value was selected based on the graph plotted using the respective K value from 1 to10 against cross-validation error (CV-error). The optimal number of sub-population (K) was determined with the lowest cross-validation error. Genetic relatedness or K matrix was generated from TASSEL V 5.3.1. 74 . LD was quantified as adjacent pairwise R 2 values (the squared allele frequency correlations among alleles at two adjacent SNP markers) 75 and was estimated for 58,719 SNPs in TASSEL V 5.3.1.
Genome-wide association analysis. Marker trait association was performed using two different models, GLM and MLM, as given below 76 : where, 'y' is phenotype vector, 'a' is a marker vector with fixed effects, 'b' is a vector with fixed effects, 'u' is a vector with random effects (kinship matrix), 'e' is a residuals vector, X denotes the accessions/genotypes at the marker, 'Q' is the Q-matrix, the result of ADMIXTURE software, and 'Z' is an identity matrix. www.nature.com/scientificreports/ The GLM principally considers the population structure (Q) while MLM considers both Q and Kinship (K). Further, among the different options available within MLM, the widely adapted approach called 'optimum levels of compression in combination with P3D' for variance component estimation was used for association analysis. For the MLM analysis, marker-based kinship matrix (K) obtained using TASSEL was used along with the Q matrix generated through ADMIXTURE to correct for both family and population structure and the phenotypic variation explained (R 2 ) by the marker is reported 74,77 . Quantile-Quantile (Q-Q) plots were developed by plotting observed negative Log 10 'P' values against expected negative Log 10 'P' values for all the available SNPs in R package CMplot 78 . A deviation from 'P' values at the initial stage may display the existing population stratification. Manhattan plots were used to visualize chromosome-wise SNPs obtained through the marker-trait association study performed across the genome. -Log 10 of the 'P' value for each SNP was plotted against seven chromosomes for the respective trait. Based on the SNP distribution, the threshold for significance of associations between SNPs and traits was fixed at [− log 10 (p) < 10 −03 ] which gave the optimum number of reliable SNPs. SNP density plots, Q-Q plots, and Manhattan plots were generated using R package CMplot v 3.4.0 78 .
The corresponding genes of associated SNPs or marker-trait associations were identified by using the physical positions of SNPs in gene annotations available in the pearl millet reference genome sequence 2 ; and thus the functions of the respective SNPs were determined.
Candidate genes discovery. The candidate genes corresponding to the significantly associated SNPs were identified using the pearl millet genome 2 sequence annotations. The SNP subsiding start and end positions of a gene or exons were explored for candidate genes based on their biological function annotation related to the trait of interest (Supplementary Fig. S2). It is possible to obtain multiple SNPs on a gene segment which are referred to as haplotypes 79 .