Genetic Loci Controlling Carotenoid Biosynthesis in Diverse Tropical Maize Lines

The discovery and use of genetic markers associated with carotenoid levels can help to exploit the genetic potential of maize for provitamin A accumulation more effectively. Provitamin A carotenoids are classes of carotenoids that are precursors of vitamin A, an essential micronutrient in humans. Vitamin A deficiency is a global public health problem affecting millions of people, especially in developing countries. Maize is one of the most important staple crops targeted for provitamin A biofortification to help alleviate vitamin A deficiency in developing countries. A genome-wide association study (GWAS) of maize endosperm carotenoids was conducted using a panel of 130 diverse yellow maize tropical inbred lines genotyped with Genotyping by Sequencing (GBS) SNP markers. Numerous significant association signals co-localizing with the known carotenoid biosynthesis genes crtRB1, lcyE and ZEP1 were identified. The GWAS confirmed previously reported large effects of the two major carotenoid biosynthesis genes lcyE and crtRB1. In addition, significant novel associations were detected for several transcription factors (e.g., RING zinc finger domain and HLH DNA-binding domain super family proteins) that may be involved in regulation of carotenoid biosynthesis in maize. When the GWAS was re-conducted by including the major effects of lcyE and crtRB1 genes as covariates, a SNP in a gene coding for an auxin response factor 20 transcription factor was identified which displayed an association with β-carotene and provitamin A levels. Our study provides a foundation for design and implementation of genomics-assisted selection strategies for provitamin A maize breeding in tropical regions, and advances efforts toward identification of additional genes (and allelic variants) involved in the regulation of carotenoid biosynthesis in plants.

deficiency, the leading cause of preventable blindness (Sherwin et al. 2012). Millions of child deaths annually are attributed to vitamin A deficiency coupled with other undernutrition problems (Black et al. 2003).
Genetic improvement of staple crops for improved nutritional quality (e.g., enhanced level of micronutrients) has been termed biofortification and is a promising approach for reducing vitamin A and other micronutrient deficiencies in human populations. Maize represents a significant proportion of the total calorie intake of people in many African countries, accounting for 30% of the per-capita calorie consumption in Eastern and Southern Africa, even reaching as high as 56% in some of the southern African countries (FAO, 2011). The biofortification of maize with higher levels of provitamin A carotenoids could play a significant role in reducing vitamin A deficiency in regions where maize is a major staple crop (Wurtzel et al. 2012;Burt et al. 2011;Meyers et al. 2014). While breeding lines of maize that can accumulate up to 26 mg/g b-carotene (and 30 mg/g of provitamin A carotenoids) in the endosperm have been reported , commonly cultivated maize varieties contain low levels of provitamin A carotenoids ranging from 0.5 to 1.5 mg/g (Harjes et al. 2008).
Understanding the genetic variation, genes and regulatory mechanisms controlling maize endosperm carotenoid levels is important for biofortification of maize with high levels of provitamin A carotenoids. Genome wide association study (GWAS) approaches are a powerful approach for ascribing gene-phenotype relationships (Huang and Han 2014;Yu and Buckler 2006;Zhu et al. 2008), while genotyping by sequencing (GBS) is a next-generation sequencing (NGS) based genotyping approach that has dramatically facilitated large-scale genomewide marker development and GWA studies in crop species (Chia et al. 2012;Davey et al. 2011;Elshire et al. 2011;Varshney et al. 2014).
A number of GWA studies have identified loci controlling agronomic traits such as plant height, yield and yield components, flowering time and plant architecture in a range of crops, including barley (Pasam et al. 2012), tomato (Shirasawa et al. 2013), wheat (Edae et al. 2014;Wang et al. 2014), and maize (Thornsberry et al. 2001;Wang et al. 2012). GWA studies have also identified loci associated with grain quality traits including oil content in maize , protein contents in wheat (Edae et al. 2014;Wang et al. 2014) and essential micronutrients such as a-tocopherol (vitamin E) and b-carotene in maize (Harjes et al. 2008;Lipka et al. 2013;Yan et al. 2010). In particular, two key carotenoid biosynthesis genes, namely LCYE and crtRB1 (HYD3), have been found to be significantly associated with accumulation of provitamin A carotenoids in maize endosperm (Harjes et al. 2008;Yan et al. 2010). Different allelic variants of these genes can affect the flux of substrates through the carotenoid biosynthesis pathway leading to synthesis of higher levels of provitamin A carotenoids (e.g., b-carotene). The total provitamin A carotenoid proportion in maize endosperm is affected by the level of total carotenoids accumulated in the endosperm which is a function of substrate flux into the carotenoid pathway and downstream catabolic steps involving degradation of carotenoids (Rodríguez-Concepción 2010; Vallabhaneni et al. 2010) There have been only a few GWA studies for carotenoid composition and content in maize endosperm to date (Owens et al. 2014;Suwarno et al. 2015). In our study, we have used a GBS-based GWA approach to identify loci associated with carotenoid content and composition of maize endosperm. Uniquely, our study used a collection of genetically diverse yellow maize inbred lines (with a mixed genetic background of both tropical and temperate germplasm) developed by the maize breeding program of the International Institute of Tropical Agriculture (IITA). In addition, we factored in the effect of already identified provitamin A alleles as covariates to detect additional association signals. Our findings contribute to ongoing efforts to identify allelic variants that can be used for genomic selection to develop maize lines with higher levels of provitamin A carotenoids.

Germplasm used for GWAS
A panel of 130 diverse yellow maize inbred lines previously described in Azmach et al. (2013) was employed for the GWAS. This panel had inbred lines with kernel colors ranging from light yellow to dark orange. White maize lines were not included, since our focus was to investigate the genetic variability underlying composition and content of the various carotenoids in maize endosperm. We did not investigate the variation between white and yellow lines, which is largely determined by a mutation in the psy1 locus (Palaisa et al. 2003). This maize germplasm panel was composed of inbred lines developed by IITA from eight bi-parental crosses, four broad based populations, and 28 backcrosses of tropical inbred lines, involving five temperate lines as donors of high b-carotene alleles (Azmach et al. 2013). The inbred lines were considered to represent the allelic diversity underlying the variation in carotenoid composition and content in both the temperate and tropical maize gene pools, since each line contained both tropical and temperate maize germplasm in its genetic background.
Field trial evaluation and analysis of carotenoids Field trial evaluation of the maize inbred lines was performed at IITA's research station, Ibadan, Nigeria (7°29911.99$N, 3°5492.88$E, altitude 190 m above sea level) for two seasons, in 2010 and 2011. The trial was arranged in (10,13) alpha-lattice design with two replications. Each line was planted in a 5 m row plot, with 0.75 m spacing between rows and 0.25 m within each row. The fields were managed as per the recommended agronomic practices (Menkir et al. 2008) which included fertilization at the rates of 60 kg N, 60 kg P, and 60 kg K ha 1 at the time of sowing, with an additional 60 kg N ha 1 applied as top dressing 4 wk later; plus weed control using Primextra and Gramazone herbicides applied as pre-emergence herbicides each at 5 liter ha 1 . Subsequent manual weeding was done to keep the trials weed-free. The environmental conditions during the first season were as follows: Total rainfall of 310 mm (supplemented with irrigation); temperature ranged from 19.4 to 33.8°with average 27.7°; relative humidity ranged 28-97% with average 67%; and solar radiation ranged from 19.4 to 21.2 MJ/m 2 /d. During the second season the total rainfall was 1022.5 mm; the temperature ranged from 21.7 to 32.4°with average 25.8°; the relative humidity ranged from 40 to 97% with average 78%; and the solar radiation was from 15.2 to 20.2 MJ/m 2 /d. The dominant soil type of the trial site is Ferric Lixisols (FAO 1991), which is a sandy loam soil, moderately drained with a PH of 6.2.
Seed samples for carotenoid analysis were generated by controlled self-pollination of all plants in each plot. The self-pollination protocol employed consisted of covering the shoots with shoot bags before emergence of the silks to avoid cross pollination, once the shoots were ready for pollination, the tassels were bagged with pollination bags a day before pollination. The next day fresh pollen was collected and applied on the silks of the same plant using the pollination bags, after which the shoots were covered with the same bag used for self-pollination. The shoots remained bagged until harvesting. The ears of each self-pollinated maize line in each plot were harvested, dried under ambient temperature with minimal exposure to direct sunlight, and separately shelled. Samples of 100 kernels were used from each seed lot for carotenoid analysis.
The carotenoids from kernel samples of each of the 130 maize inbred lines were extracted and quantified with HPLC at the University of Wisconsin, USA. The extraction protocol used was the method of Howe and Tanumihardjo (2006) for carotenoid analysis of dried maize kernels, as previously described in Azmach et al. (2013). Extraction was performed using finely ground 0.5 g samples of each inbred line's kernels. The internal standard consisted of 200 ml of b-Apo-8'-carotenal (Sigma-Aldrich, St. Louis, MO), which was added at the beginning of the analysis for calibrating losses of carotenoids during extraction and the entire work-flow process. Fifty microliter aliquots of each extract were injected into the HPLC system (Waters Corporation, Milford, MA). The gradient was applied for 30 min from 70% solvent A:30% solvent B, to 40% solvent A:60% solvent B. Each carotenoid type was quantified based on calibrations using its respective external standard. Total carotenoid content was calculated as the sum of concentrations of a-carotene, lutein, b-carotene, b-cryptoxanthin, zeaxanthin. Provitamin A was calculated by summing the concentrations of b-carotene, and half concentrations of each of b-cryptoxanthin and a-carotene, since b-cryptoxanthin and a-carotene can provide only one molecule of retinol each as opposed to two molecules of retinol for b-carotene (US Institute of Medicine 2001). Other derived carotenoid traits were also calculated as indicated in Harjes et al. (2008), Yan et al. (2010): i.e., ratio of carotenoids in b to a branch of the carotenoid pathway, ratio of b-carotene to b-cryptoxanthin and ratio of b-carotene to all carotenoids (b-carotene + a-carotene + lutein + zeaxanthin + b-cryptoxanthin). The data for the ratio traits were transformed using natural logarithm (log e ) before being subjected to statistical analysis to correct for the non-normal distribution of the data. All carotenoid concentrations were measured in microgram/gram dry weight (DW). BLUEs (best linear unbiased estimates) calculated for each trait based on the two season carotenoid data were used in the GWAS. BLUEs were calculated using the GLM option of TASSEL software version 4 (Bradbury et al. 2007) with a statistical model Y = Xb + e, where Y is matrix of the dependent or response variables, i.e., each carotenoid type; X is the design matrix; b is vector of fixed effect parameters, and e is vector of the random errors that are assumed to be normally distributed and independent of the other variances.
Genome wide SNP marker generation using GBS DNA samples were isolated from freeze-dried leaf samples of each inbred line using Qiagen DNeasy plant mini kit following the protocol supplied with the product. DNA samples were quantified using a NanoDrop 2000 Spectrophotometer. Samples having at least 10 ng/ml DNA each were prepared and sent to the Genome Diversity Facility (GDF), formerly Institute for Genomic Diversity (IGD), Cornel University, USA, for GBS genotyping. Genotyping by sequencing (GBS) libraries were prepared, analyzed and sequenced at GDF, according to Elshire et al. (2011). SNP calling from the sequenced GBS library was also performed at GDF using the GBS production pipeline (Version: 3.0.134), an extension of the Java program TASSEL (Bradbury et al. 2007;Glaubitz et al. 2014) which used aligned short reads of GBS (tags). The GBS pipeline options used for calling SNPs consisted of: 0.1 minimum locus coverage, 1 · 10 6 maximum number of SNPs per chromosome, duplicate SNPs above 0.05 mismatch rate were not merged, and 0.8 cutoff frequency between heterozygote vs. homozygote calls. Tags were aligned to the reference genome B73 refgen_v2 (Schnable et al. 2009).
The GBS pipeline generated a data set containing a total of 619,596 unfiltered SNPs. This SNP dataset had a total of 51% missing data points possibly caused by biological presence-absence of sequences between the reference and each test genome, or errors introduced in the GBS procedures (Glaubitz et al. 2014;Poland and Rife 2012). The dataset was further filtered in TASSEL 4 on the basis of missing data proportion and minor allele frequency (MAF) cutoff thresholds (Bradbury et al. 2007). The cutoff thresholds used to filter the dataset for the GWAS allowed only those SNPs showing a maximum of 20% missing data, and 1% minimum MAF (MMAF). This resulted in a dataset of 109,937 SNPs. The diversity and genome-wide Linkage Disequilibrium (LD) analysis were performed using datasets obtained by filtering with criteria of no missing data points and 1 and 10% MAFs which resulted in 3532 and 1658 genome-wide SNPs, respectively. SNP data summary and basic diversity parameters were calculated using TASSEL 4 (Bradbury et al. 2007) and PowerMarker 3.25 (Liu and Muse 2005) softwares.

Linkage disequilibrium (LD) estimation
The two commonly used measures of LD are Lewontin's D and the squared pairwise correlation coefficient R 2 (Chen et al. 2006;Flint-Garcia et al. 2003). Although D' is a good measure of recombination history, it is severely affected with reduced sample size. R 2 summarizes both recombination and mutation history (Flint-Garcia et al. 2003). In our study, LD was estimated using R 2 , since it helps detect LD with minimal error despite small sample size and low MAF (Khatkar et al. 2008;Yan et al. 2009). In addition R 2 is a more relevant measure of LD for conducting association analysis between genotype and traits (Flint-Garcia et al. 2003).
To determine the degree of resolution achieved in the association analysis (Yu and Buckler 2006), both genome and chromosome wide linkage disequilibrium (LD) were estimated using the squared allele frequency correlation coefficient (R 2 ) for all possible pairs of SNPs in a dataset. For genome-wide LD, SNP datasets of the 10 maize chromosomes were combined and filtered with cutoff threshold of no missing data and 10% MMAFs yielding 1658 SNPs typed across all inbred lines. On the other hand, LD estimation within each chromosome was performed using the SNP data of each chromosome filtered at 10% maximum missing data per marker and 10% MMAFs. Missing data in all the SNP datasets used for chromosome wide LD analysis were not imputed. The software used to estimate LD was TASSEL 3 (Bradbury et al. 2007), which uses permutation tests to determine the P-values for each pairwise correlation. LD estimate significance levels were considered at a = 0.001 (Pasam et al. 2012). Genome-wide and chromosome wide rate of LD decays were estimated by plotting localized regression curves (LOESS) of the R 2 values vs. the corresponding physical distances between the SNP pairs, followed by observation of the intersection point between the fitted LOESS curve and a critical R 2 values (Cleveland and Devlin 1988;Breseghello and Sorrells 2006). Two background critical R 2 values for estimating LD decays within and across chromosomes were considered in the present study to offer comparison. The first baseline critical R 2 was determined by taking the parametric 95 percentile of distribution of R 2 values for unlinked SNPs, taking SNPs on different chromosomes and SNPs beyond 50 Mbp apart on the same chromosome as unlinked (Breseghello and Sorrells 2006;Pasam et al. 2012). The second baseline R 2 value was 0.2, an arbitrary value often used to describe LD decay (Zhu et al. 2008). Scatter plots and fitted smooth curves for estimating LD decay were plotted using a base scatter plot function of R version 3.0.3, "scatter.smooth" (R Core Team 2014). The function plots and adds a smooth curve to a scatter plot computed according to LOESS (R Core Team 2014). LD patterns of all SNPs significantly associated with carotenoids and local LD patterns in regions surrounding significant genes were visualized using LD plots generated with HaploView (Barrett et al. 2005).

Genome wide SNP-trait association
Associations between genome-wide SNPs and carotenoid content was identified using the R (R Core Team 2014) package GAPIT (Genetic Association and Prediction Integrated Tools) (Lipka et al. 2012. GAPIT package uses a unified mixed linear model (MLM) to calculate genome-wide association between traits and large number of markers by employing methodologies that maximize statistical power, provide high prediction accuracy, and run in a computationally efficient manner (Kang et al. 2010;Lipka et al. 2012Lipka et al. , 2013Zhang et al. 2010). A unified MLM incorporates both population structure (Q) and relative kinship (K) inferred from marker data into the GWAS to control for the confounding effect of Q and K and thus minimize spurious associations due to both type I and type II errors . Since the panel used in this study was composed of groups of inbred lines that were extracted from many backcrosses and single crosses involving diverse parental germplasm, multiple level relative kinship and non-random population structure was expected. Thus, the unified MLM model was applied to compute accurate associations. The analysis was executed mainly with the default settings of the software which automatically calculated both K and Q using the entire SNP marker data. The default setting implements VanRaden's algorithm option (VanRaden 2008) to calculate the K matrix, and uses principal component analysis (PCA) to define Q. It applies optimum compression levels using default kinship clustering and grouping values "average" and "mean," respectively. The model selection option was used to estimate the optimum number of principal components (PC) covariates using Bayesian Information Criterion (BIC) (Schwarz 1978).
n Table 1 Summary of extent of chromosome and genome-wide LD estimates plus information on the SNP datasets used to calculate the LD  The variation explained for a trait by the model and a particular SNP in question were determined using the likelihood R 2 statistics calculated in GAPIT. SNP data used for GWAS was filtered in TASSEL 4 with maximum missing data of 20% and MMAF of 1%. Missing data were imputed automatically within GAPIT using the conservative option of "major allele," which replaces missing data points with the major allele of the SNP. Different significance cut-off thresholds were used to assess the effect of the SNPs on carotenoids. The statistical significances of the SNPs were evaluated at 5 and 1% critical thresholds of the false discovery rate (FDR) adjusted P-values (Benjamini and Hochberg 1995) and the Bonferroni procedure was used to control the experimentwise type I error rate at both a = 0.05 and a = 0.01. FDR values generated with the GWAS result in GAPIT were used.
GWAS including allele specific markers of lcyE and crtRB1 as covariates Variations in carotenoid content and composition of the association panel caused by allelic variants in the two genes, lcyE (Harjes et al. 2008) and crtRB1 (Yan et al. 2010) were accounted for by including marker score data of the three allele-specific markers of each gene as covariates. These marker data were scored in the same inbred line panel used in the current study to validate the allele specific markers by Azmach et al. (2013). To incorporate these markers into the GWAS the six allele specific marker data were first transformed to principal components. Components explaining the largest proportion of the variation were then included as covariates in the unified mixed model for calculating the second GWA using GAPIT.

Data availability
The genotype and phenotype data for this GWAS population are available in Supplemental Material, File S1 (hapmap file containing GBS generated SNP data for the 130 maize inbred lines, filtered with maximum of 20% missing data and 1% minimum allele frequency) and File S2 (an excel file containing BLUEs of carotenoid contents for the 130 maize inbred lines).

Carotenoid profile
The carotenoid composition and content of maize lines used for this GWA study has been described in Azmach et al. (2013). The panel displayed considerable diversity in carotenoid profile. The ranges of least square means of the carotenoid concentrations (over two growing seasons) are presented in Table S3 in File S3 (can also be referred in detail in Azmach et al. (2013)). The BLUEs of the carotenoids are available in File S1. The concentration of a-carotene was low across the inbred lines, with poor repeatability, and hence a-carotene was not included in the GWA study.

SNP diversity
The summary of the 110 k SNP data set used for the GWAS and its diversity parameters are presented in Table 1. The average missing data for this data set was 10%. SNP distribution across the genome was not uniform but attained significant coverage ( Figure S1). The MAF displayed a uniform distribution across the 10 maize chromosomes n SNPs associated with multiple carotenoids were included in the counts in this Table. Hence, sums of SNPs may not tally with those indicated in Table 2. FDR thresholds of 5% were considered only for traits for which no significant SNPs could be obtained at the stringent thresholds. Bon, Bonferroni; FDR, false discovery rate.
n   a Representative significant SNPs selected based on their positions and approximate LD decay. Significant SNPs were selected at FDR 1%, except for chromosome 8 SNPs associated with lutein -which were selected only at Bonferroni 1%. For zeaxanthin and total provitamin A the threshold was set at 5% FDR to be able to detect significant SNPs. Some SNPs may appear two to four times as they were associated with multiple related traits. bcar, b-carotene; bcryp, b-cryptoxanthin; lut, lutein; zea, zeaxanthin; pva, provitamin A; tcar, total carotenoid; b br/abr, ratio of carotenoids on b to a branch; Chr, Chromosome; MAF, minor allele frequency; FDR, false discovery rate. Allelic effects of SNPs indicated refer to the second allelic variants. The number after "S" before the underscore in each SNP's name refers to the chromosome number; The number indicated after the underscore is the concerned SNP's position.
n b S8_138888278 is not the most significant SNP linked with lcyE but was located within the gene, whereas SNPs S8_138938949/S8_138938983 were the most significant but were located in another gene 50 kb downstream of lcyE.
c Association with zeaxanthin detected only when accounting for the allelic effects of lcyE and crtRB1. d S5_78384689 detected in GWAS that considered effects of lcyE and crtRB1 as covariates on the bases of their allele specific markers. Genes and associated information retrieved from maizeGDB.org and gramene.org.
Physical positions of SNPs and coordinates of genes given according to B73 RefGen_2. The number after "S" before the underscore in each SNP's name refers to the chromosome number; and the number after the underscore is the concerned SNP's position. bar, b-carotene; bcryp, b-cryptoxanthin; lut, lutein; zea, zeaxanthin; pva, provitamin A; tcar, total carotenoid bbr/abr, ratio of carotenoids on a to b branch. and f had more or less uniform values across the chromosomes. The genome-wide polymorphic information content (PIC) of the SNPs ranged from 0.02 to 0.38, while the average was 0.18. PIC is one of the diversity parameters that is used to measure the informativeness of genetic markers. A large proportion (.40%) of SNPs used for diversity analyses in this study had PIC values higher than 0.2, suggesting informativeness of the GBS generated SNPs for the association study.
Population structure, kinship and LD The Bayesian Information Criteria (BIC) suggested the population structure calculated (based on PCA) had only a small contribution to the variation in carotenoid profile of the panel (Table S2 in File S3). The kinship heat map indicated a low level of overall relatedness in the panel ( Figure S3). The genome-wide extent of LD estimate was 0.83 Mbp at baseline R 2 = 0.2 and 0.65 Mbp at R 2 = 0.25 ( Figure S4, Supplemental Information in File S3, and Table 1). There was heterogeneous distribution of LD decay across the genome, as was evident from the pattern of LD heat-map generated using the same SNP dataset (Figure 1).

GWA analysis of carotenoid content and composition
Of the 110 k SNPs tested, 386 unique significant SNPs were detected at 5% FDR (Table 2). At this significance threshold, at least two significant SNPs were identified on each of the 10 chromosomes. The number of significant SNPs declined to 168 at 1% FDR correction rate, discarding all the significant SNPs on chromosomes 1, 5 and 7. Application of the conservative multiple comparison correction term, the Bonferroni test, at 5 and 1% levels further reduced the number of significant SNPs to 81 and 32, respectively. The vast majority of significant SNPs were found on chromosome 8 followed by chromosome 10, which were mainly associated with lutein and b-branch carotenoids, respectively. Except for significant SNPs on chromosome 6 and 9, the average MAFs of the significant SNPs at FDR 1 and 5% were above 10%. Only 5% of the significant SNPs at FDR 1% had their MAFs below 10% (Figure 1). The number of significant SNPs in relation to each carotenoid across each chromosome is summarized in Table 3. The allelic variants and effects selected for the most significant SNPs in the GWAS are indicated in Table 4, while the associated candidate protein coding genes along with their genomic positions are listed in Table 5. Figure 2 illustrates the GWAS result for each carotenoid trait, complemented by Figure S5 which summarizes the association using the lowest P-values attained at 5% FDR threshold. The strongest association was detected for lutein content. At the significance level of 1% FDR, a total of 129 SNPs The most significant SNPs are highlighted in orange color and labeled with the IDs of the putative genes. Light green horizontal lines represent 1% Bonferroni-adjusted significance threshold (2log 10 (p) = 7.04) except for chromosome 2, which refers to 5% Bonferroni significance threshold (2log 10 (p) = 6.34). Vertical orange lines show regions of the major carotenoid candidate genes lcyE, crtRB1, and ZEP1. Plots were made with Microsoft Excel. distributed on chromosomes 2, 3, 4, 6, 8 and 9 were associated with lutein levels, with the largest fraction of SNPs (.90%) located on chromosome 8. The most significant SNPs associated with this carotenoid scored the lowest of all the P-value (SNPs S8_138938983 and S8_138938949, P = 9.81E212). The model containing each of these SNPs explained 53% of the variation in accumulation of this carotenoid. The majority of significant SNPs that survived the stringent significance threshold of 1% Bonferroni were also associated with lutein (27 SNPs on chromosome 8). Many of these SNPs were also associated with the ratio of ato b-branch carotenoids at FDR 5%. The second most significant association was detected for the ratio of b-carotene to b-cryptoxanthin derived carotenoid trait. Twenty six SNPs were associated with this derived trait at FDR 1%, the most significant SNP (S10_136007578) scoring P-value of 6.75E210 and R 2 of 60%.
Using the Bonferroni approach to adjust the family-wise type I error rate at a = 0.01, 13 SNPs on chromosome 10 were associated with carotenoids of the b branch and some of the derived ratio traits (b-carotene, b-cryptoxanthin, b-carotene to b-cryptoxanthin and/or b-carotene to zeaxanthin). The ratio of a to b branch carotenoid was significantly affected by 10 SNPs on chromosome 8 at FDR 5%, the most significant SNPs in the group accounting for 33 and 36% of the variations in the derived trait, respectively. These SNPs were also significantly associated with lutein.
Associations with zeaxanthin (12 SNPs) and provitamin A (3 SNPs) could only be detected when relaxing the significance cutoff threshold to 5% FDR. The variances explained by the model involving the most significant SNPs were 33% for zeaxanthin (SNP S10_136840488, P = 5.53E207) and 51% for provitamin A (SNP S10_134601800, P = 6.39E207). These SNPs were also associated with b-carotene and its derived ratio traits.

Genes in LD with significant SNPs
The genomic locations of significant SNPs were investigated to identify what protein-coding genes the SNPs were located in or adjacent to, zooming in based on SNP data retrieved from online databases for maize genome (http://www.maizegdb.org/ and http://ensembl.gramene.org/ Zea_mays/). The list of all annotated genes, including those encoding uncharacterized proteins, within circa 0.8 Mb of the most significant SNPs are presented in Table S4 in File S3. Here only those candidate genes the closest to the most significant SNPs listed in Table 5 are described.
The most significant SNP in the association signal for lutein content 16 Mbp on chromosome 8 (SNP S8_16743428, P-value = 8.65E209) was located within a putative gene GRMZM2G143211 (Figure 3a). This gene model contains a WD domain and displays sequence similarity to the yeast autophagy 18 (AtATG18) gene class in Arabidopsis thaliana. Two additional significant SNPs (S8_16444572 and S8_16444587) in this region were located within another candidate gene GRMZM2G380414, which encodes a protein called Ultraviolet-B-repressible which is likely involved in photosynthesis. The association peak between 110 and 144 Mbp on the same chromosome for the same trait contained three highly significant SNPs, namely S8_111289041 (P-value = 5.82E209); S8_124434722 (P-value = 3.97E210), and S8_138938949/S8_138938983 (P-value = 9.81E212) that were in strong LD to one another, with R 2 value ranging from 0.31 to 0.67 (Figure 4). These SNPs were located within three different protein-coding putative genes GRMZM2G333079, GRMZM2G330693 and GRMZM2G463133, respectively with the first and third genes having some evidence of expression in maize endosperm (Sekhon et al. 2011). The two SNPs, S8_138938949 and S8_138938983, are 50 kb distal from one of the major carotenoid biosynthesis genes, lcyE. Pairwise LD among these highly significant SNPs on chromosome 8 varied from R 2 = 0.23 to 0.67 (Figure 4).
On chromosome 10, the strong association peak surrounding the 138 Mbp region for b-carotene (Figure 2 and Figure 3b) contained two closely spaced and significant SNPs S10_136007575 and S10_136007578, P = 6.75E210. These SNPs are the closest significant SNPs to the major candidate gene of carotenoid crtRB1 (40 kb distal), but are physically located within a putative RING zinc finger domain protein coding gene, GRMZM2G397684 (Figure 3b). The other significant SNPs in this region were S10_134650981 (P-value = 1.99E208) and S10_139877594 (Pvalue = 5.12E208), residing within candidate genes GRMZM2G018314 and GRMZM2G080516, respectively. The latter encodes an AP2-EREBP transcription factor which is expressed in maize seed endosperm (Sekhon et al. 2011). LD among the peak SNPs on chromosome 10 ranged from 0 between SNPs S10_134650981 and S10_139877594, to 1, between SNPs S10_136007575 and S10_136007578 (Figure 4).
A smallbut significant SNP association was detected on the short arm of chromosome 2 coinciding with a gene involved in the conversion of carotenoids to abscisic acid, namely zeaxanthin epoxidase 1 (ZEP1, GRMZM2G127139). Two of the six SNPs that were significantly associated with total carotenoids at FDR 5% (S2_44448438, P = 1.11E206) and S2_44448432 P = 1.11E206) were physically located within the ZEP1 gene (Figure 3c). However, the most significant SNP, S2_44473758, P = 1.78E207, was located circa 33 kb downstream of the ZEP1 gene within another protein-coding gene GRMZM2G062559 that encodes an uncharacterized protein. All of these SNPs were in high LD forming a haplotype block (Figure 4) when considered without the non-significant SNPs in the region. We consider that the significant effect most likely arises from ZEP1 (or some of these SNPs could be linked with regulatory regions or control elements). The other SNP on the same chromosome around position 139 Mbp that showed a strong association with total carotenoid (S2_139644276, P-value = 2.53E207) was located in a putative gene GRMZM2G066213 (Figure 3c).
Strong and extensive pairwise LD was observed among the significant SNPs selected at 1% FDR (Figure 5a and Table 6). Seventy percent of the pairwise comparison among the SNPs led to statistically significant LD (P , 0.001) of which 21% was comprised of inter-chromosomal correlations. LD for within chromosome Figure 5 LD plots of significant SNPs and LD blocks surrounding the genes lcyE, crtRB1 and ZEP1. (a) LD plot of all significant SNPs selected at FDR 1%. Labels 2, 3, 4, 8 and 10 refer to the chromosomes of the SNPs that reached significance at this threshold; (b) an LD block on chromosome 2 surrounding the gene ZEP1; (c) an LD block on chromosome 8 surrounding the gene lcyE; (d) an LD block on chromosome 10 comprising the gene crtRB1; (e) LD plots that included non-significant SNPs in regions +/2 of crtRB1, lcyE and ZEP1 where significant associations were detected. Haplotype blocks were defined with the option of confidence interval (Gabriel et al. 2002). Green highlighted SNPs are the closest SNPs to the carotenoid genes indicated, with the most significant ones enclosed with oval shapes. The grayscale shading pattern of LD plot reflects the strength of linkage as it increases from the lightest to the darkest shaded cells paralleling the range of no LD (R 2 = 0%) to absolute LD (R 2 = 100%). Plots generated using HaploView software (Barrett et al. 2005). comparisons ranged from 0.37 to 1, both on chromosome 3, with genome-wide average of 0.42. For inter-chromosomal comparisons, LD ranged from 0.18 in chromosome 10 to 0.5 in chromosome 3, with a genome-wide average of 0.25. Significant SNPs on chromosomes 3 and 4 displayed strong inter-chromosomal LD with those on chromosome 8, but was negligible for SNPs on chromosome 10. The confidence interval algorithm deployed in HaploView software generated 11 haplotype blocks based on LD of the significant SNPs on chromosome 8, five blocks for those on chromosome 10 and one block for those on chromosome 2. Haplotype blocks were identified for each of the three carotenoid genes crtRB1, lcyE and ZEP1 when analyzing the significant SNPs in regions surrounding their corresponding genomic locations. Further analysis of LD for regions comprising these three genes, with the inclusion of non-significant SNPs revealed heterogeneous LD. This suggests that the LD among the significant SNPs residing in regions of these major genes could be functional, rather than tight genetic linkage occurring as a result of long range average LD decay in the association panel.
GWA re-calculated with the allele specific markers of crtRB1 and lcyE included as covariates GWA was re-calculated by incorporating the allele specific markers of the two genes lcyE and crtRB1 as additional fixed effect covariates in the MLM model. As expected, the number of SNPs significantly associated with the traits in this analysis was drastically reduced from 386 in the previous analysis to 38 SNPs (excluding the four SNPs significant at 10% FDR), at a cut-off threshold of 5% FDR (Table 7). Numerous SNPs on chromosome 8 and 10 previously associated with lutein and b-carotene (plus its derived traits) became statistically non-significant, even at a lower significance threshold of 10% FDR ( Figure 6 and Table  7). Using this approach, chromosome 10 was devoid of significant SNPs, and only two SNPs on chromosome 8 (SNP S8_138938949 and S8_138938983) were strongly associated with lutein (P-value = 7.66E208; R 2 = 0.53). These SNPs were also the most significant SNPs in the initial GWAS result. The SNPs were physically located within a putative gene GRMZM2G463133 encoding an HLH binding domain protein. Since these two SNPs were in high LD with SNPs in the lcyE region (Figure 5c), it is possible that the significant effect arises from linkage to this known carotenoid biosynthesis gene. However, functional studies would be required to unequivocally ascribe the significant effect to these SNPs.
On the other hand, the re-run GWAS detected new significant associations on chromosome 5 for b-carotene and provitamin A. In particular, SNPs S5_78384689 and S5_78427240 were associated with provitamin A at 5% FDR (P-value = 1.81E207; R 2 = 68) and one of these SNPs S5_78384689, was associated with b-carotene at 10% FDR (P-value= 7.08E207; R 2 0 = 81). The SNP S5_78384689 lay within an auxin-response factor 20 gene (GRMZM2G102845, 5:78,381,834-78,389,884, Table 5). In addition, seven SNPs on chromosome 2 were significantly associated with zeaxanthin content at FDR 1% ( Table 7). Three of the zeaxanthin associated SNPs (44,473,748, S2_44473758, and S2_44473801) were located within the gene ZEP1 while the other two were located 23 kb upstream of the ZEP1 gene.

DISCUSSION
Genome-wide and candidate-gene based association studies are powerful approaches to identify nucleotide variants that functionally underlie important agronomic and nutritional traits. Such nucleotide variants can be harnessed in breeding programs to develop improved cultivars through marker-or genomics-assisted selection Hamblin et al. 2011). Association mapping using large population sizes and high marker densities can be used for successful and reliable prediction of LD and associations between alleles and target phenotypes (Yu and Buckler 2006;Khatkar et al. 2008;Zhu et al. 2008).
As a major staple crop, maize has been the focus of both candidate gene and GWA studies for a number of agriculturally and nutritionally significant phenotypes (Cook et al. 2012;Li et al. , 2013Lipka et al. 2013;Yan et al. 2010;Yu and Buckler 2006). In our GWA study, we have used a panel of 130 diverse and partially related inbred lines of maize where we have used genome-wide GBS to generate a highly-dense SNP map for association analyses. Our use of inbred lines combining the genomes of both temperate and tropical maize germplasm has allowed us to capture small to large effect carotenoid allelic variants that are present in the two gene pools within IITA's maize breeding program.
Despite its predisposition to large levels of missing data (Heslot et al. 2013), GBS generates large number of SNPs with dense coverage and potentially less ascertainment bias, which is ideal for consistent GWAS (Crossa et al. 2013;Elshire et al. 2011). The SNP data set used in our GWA study had acceptable level of missing data of only 10% (which was predicted with conservative imputation criteria in GAPIT genetic analysis software to allow reliable genome-wide associations). A minimum MAF criteria of 1% was used to filter out potential spurious SNPs stemming from sequencing error (Glaubitz et al. 2014).
The frequency of minor alleles is an important factor that can affect the accuracy of LD analysis and GWAS especially when using small samples (Tabangin et al. 2009;Yan et al. 2009). The filtered data set had a large proportion of MAFs distributed uniformly across the genome, frequencies ranging between 1 and 5% accounting for the largest proportion. However, the MAFs of the vast majority of the significant SNPs were above 10% which might be indicative of the positive detection power of the GWAS as the biasness associated with rare alleles when using small sized samples for association mapping was eliminated (Schnable et al. 2009). This could suggest that alleles associated with n carotenoid content and composition may be segregating in our panel at frequencies higher than 10% (Hamblin et al. 2011). The average genome-wide LD decay in our study was estimated at circa. 830 kbp at a background critical R 2 = 0.2. Previous studies in maize reported LD extent to be ,1000 bp for maize landraces, .2000 bp for diverse breeding lines, and 100 kb for commercial elite inbred lines (Yu and Buckler 2006). Although it can lack the power for high precision mapping, a mapping panel with persistent LD can be considered ideal, if low-resolution mapping is targeted (Flint-Garcia et al. 2003). The long range LD in this panel was expected since such extensive LD is characteristics of advanced maize inbred lines that have experienced strong recent selection (Yu and Buckler 2006). Also, small populations are prone to genetic drift leading to loss of rare alleles and increased LD (Flint-Garcia et al. 2003). Nonetheless, there was considerable localized variation in LD structure across the genome suggesting that the map-ping resolution also vary. The extensive LD in our study could lead to the identification of SNPs in genes that are either causal or contributory to the carotenoid phenotype, or which act as linked markers associated with the carotenoid phenotype.
The two MLM GWAS models we employed detected a number of small to large effect known carotenoid biosynthesis genes, as well as several putative genes encoding characterized or uncharacterized proteins. The first MLM GWAS considered population structure (Q) and relative kinship (K) only, while the second incorporated the allele specific markers of lcyE and crtRB1 major carotenoid genes as additional fixed effect covariates. The identification of known carotenoid biosynthesis genes in our study indicates that our study had sufficient power to identify causal or contributory genes.
The vast majority of highly significant hits in the first GWAS were on chromosome 8 associated with lutein, followed by chromosome n   Table 4, and Table 5). The large effects of these genes on carotenoids within the maize panel used in our study were expected, as the markers designed to detect the allelic variants of these genes were previously confirmed (Harjes et al. 2008;Yan et al. 2010) to Figure 6 Manhattan and QQ-plots for GWAS conducted with allele specific markers of crtRB1 and lcyE genes as covariates. The association signals were significant after 5% FDR correction of the P-values.
have significant impact in the same mapping panel (Azmach et al. 2013), indicating the successful introgression of the favorable alleles of these two genes into the tropical yellow maize genetic background. While the most significant SNP (P = 9.81E212; Table 4 and Table 5) on chromosome 8 was located 49 kb downstream of the lcyE gene, another significant SNP (SNP: S8_138888278, P = 3.19E28) was detected within the gene 1 kb upstream of the 39indel functional polymorphism that was previously reported by Harjes et al. (2008). Despite no SNP was found within the gene crtRB1, the closest SNPs (S10_136007575 and S10_136007578) significantly associated with b-carotene to b-cryptoxanthin ratio (P-value = 6.75E210) and other ratio involving b-carotene were located 50 kb upstream of this gene.
The inclusion of the allele-specific marker information for lcyE and crtRB1 as additional fixed effect covariate allowed to control for the large effects of the two genes. Using this approach, 10% of the significant SNPs detected at 5% FDR in the first GWAS survived the correction for the allele specific markers. The two most significant SNPs on chromosome 8 detected in the first GWAS still displayed a strong association with lutein levels, which could be due to a stronger LD of the SNPs with larger-effect functional polymorphisms in the 39TE untranslated region of lcyE not captured with the present genotyping and possibly different from the polymorphisms previously described by Harjes et al. (2008). This could explain the relatively low effect of the known allele-specific markers of lcyE in our previous marker validation study (Azmach et al. 2013), although the strongest association was still detected in this gene region in our GWA study.
The controlling of the effects of lcyE and crtRB1 in the GWAS including marker covariates led to the detection of significant associations for zeaxanthin levels on chromosome 2 at 1% FDR. The significant SNPs co-localized with a known downstream carotenoid biosynthesis gene ZEP1 (chromosome 2: 44,440,449,237). These SNPs were detected in the first MLM GWAS, but they were then significantly associated with total carotenoid content at a 5% FDR. In the GWAS without covariates, zeaxanthin was significantly affected by SNPs only from the large association signal detected on chromosome 10. This could suggest that increases or decreases in the rate of conversion of b-carotene to zeaxanthin through b-cryptoxanthin may be more pronounced than that of decreases in the rate of conversion of zeaxanthin to violaxanthin by ZEP1 in the maize inbred line panel used in our study. This would provide a reason for the greater impact of crtRB1 on the level of zeaxanthin than ZEP 1. This could be interpreted as a scenario where reduced function of crtRB1 leads to accumulation of b-carotene at the expense of zeaxanthin synthesis, reflecting the larger effect of crtRB1 on the concentration of zeaxanthin.
Indeed, recent association studies have reported similar associations of SNPs within the gene ZEP1 with zeaxanthin content (Owens et al. 2014;Suwarno 2012). In addition a small effect QTL underlying kernel color close to the gene ZEP1 has also been reported and suggested as a target for allele mining by Chandler et al. (2013). This locus can therefore be considered as one of the loci potentially contributing to the variation in total carotenoid in the mapping panel used in this study and can be the next target gene for allele mining. Allele-combinations of ZEP1 and other genes in the biosynthesis pathway can be used in breeding programs to increase accumulation of provitamin A and total carotenoid in maize endosperm.
Our GWAS including marker covariates also identified an association between SNPs on chromosome 5 and provitamin A at 5% FDR and b-carotene at 10% FDR. These SNPs were co-localized with a gene encoding an auxin-response factor 20 (arf20) protein (5:78,381,834-78,389,884). Auxin-response factors are transcription factors that target auxin response DNA elements (AuxRE) in the promoters of auxinregulated genes (Li et al. 2016). The key enzymes involved in carotenoid biosynthesis in cereals are well known and have been previously reviewed; e.g., figure 1 in Zhai et al. (2016). This family of transcription factors is known to have a role in conditioning carotenoid biosynthesis through coordinated regulation of transcription of genes involved in the pathway (Meier et al. 2011). As the ARF20 gene is highly expressed in the maize endosperm (Sekhon et al. 2011), the gene may constitute a novel target for further unraveling of the regulatory mechanism of carotenoid biosynthesis in maize endosperm.

Conclusions
Using a panel of IITA's tropical maize inbred lines that incorporated high b-carotene alleles introgressed from a temperate maize germplasm, our study detected SNPs co-localizing with known major and small effect carotenoid biosynthesis genes, demonstrating the detection power of our GWA analyses. In addition, a number of associations were detected for novel candidate genes encoding transcription factors, which might have roles in regulation of the carotenoid biosynthesis in maize endosperm. As our study is based on IITA's tropical maize breeding program, it can contribute to transitioning of the maize biofortification efforts of the breeding program toward molecular-marker assisted approaches. Our findings pave the way for additional allele mining efforts and greater understanding of the genes involved in regulation of expression of carotenoid biosynthesis genes, which is necessary to further exploit the genetic potential of maize in accumulating provitamin A in maize endosperm.