Genomic Prediction and Association Mapping of Curd-Related Traits in Gene Bank Accessions of Cauliflower

Genetic resources are an important source of genetic variation for plant breeding. Genome-wide association studies (GWAS) and genomic prediction greatly facilitate the analysis and utilization of useful genetic diversity for improving complex phenotypic traits in crop plants. We explored the potential of GWAS and genomic prediction for improving curd-related traits in cauliflower (Brassica oleracea var. botrytis) by combining 174 randomly selected cauliflower gene bank accessions from two different gene banks. The collection was genotyped with genotyping-by-sequencing (GBS) and phenotyped for six curd-related traits at two locations and three growing seasons. A GWAS analysis based on 120,693 single-nucleotide polymorphisms identified a total of 24 significant associations for curd-related traits. The potential for genomic prediction was assessed with a genomic best linear unbiased prediction model and BayesB. Prediction abilities ranged from 0.10 to 0.66 for different traits and did not differ between prediction methods. Imputation of missing genotypes only slightly improved prediction ability. Our results demonstrate that GWAS and genomic prediction in combination with GBS and phenotyping of highly heritable traits can be used to identify useful quantitative trait loci and genotypes among genetically diverse gene bank material for subsequent utilization as genetic resources in cauliflower breeding.


2014), barley
, and wheat (Edae et al. 2014). In Brassicaceae, GWAS have mainly been conducted in rapeseed (Brassica napus) to dissect the genetic basis of disease resistance (Jestin et al. 2011), seed oil content and quality (Zou et al. 2010;Rezaeizad et al. 2011), seed weight and quality (Li et al. 2014), seed glucosinolate content (Hasan et al. 2008), and several morphological and phenological traits (Cai et al. 2014). Recently, GWAS was used in cauliflower to identify flowering-related quantitative trait loci (QTL) (Matschegewski et al. 2015), which are useful for breeding varieties adapted to different temperature regimes. For example, expression analysis identified a homolog of the Arabidopsis thaliana CDAG1 (Curd Development Associated Gene 1) gene in cauliflower, whose overexpression increased curd yield (Li et al. 2017).
In addition to genetic mapping, genomic selection (Meuwissen et al. 2001) has become an important method in plant breeding (Schmid and Thorwarth 2014). Genomic selection is based on the prediction of breeding values based on marker information alone. A training population with genotypic and phenotypic information is used to estimate marker effects with a statistical model, which is then applied to calculate the breeding values of the potential selection candidates in the breeding population (Heffner et al. 2009). Cross-validation allows selection of the best model for a certain population, trait, and genetic architecture (Crossa et al. 2010). Genomic selection provides several benefits in both animal and plant breeding (Hayes et al. 2009;Heffner et al. 2009;Jannink et al. 2010), including more rapid breeding cycles and fewer field trials, leading to increased genetic gain per time unit at a lower cost (Schaeffer 2006;König et al. 2009;Heffner et al. 2010). Genomic prediction has been successfully applied in wheat (Poland et al. 2012b), maize (Crossa et al. 2013), and soybean breeding (Jarquín et al. 2014). In Brassicaceae, one study investigated the potential of genomic selection in rapeseed (Würschum et al. 2014) and concluded that it is a promising method for rapeseed breeding.
Advances in sequencing technology allow the detection of singlenucleotide polymorphisms (SNPs) from large and diverse germplasm collections (Crossa et al. 2013). Genotyping-by-sequencing (GBS) generates tens of thousands of molecular markers at low cost (Elshire et al. 2011;Poland et al. 2012b;Sonah et al. 2013) and is an efficient tool for plant genetics and breeding (Poland and Rife 2012;Fu et al. 2014;Tardivel et al. 2014). It has been successfully used for genomic selection in wheat, maize, and soybean (Poland et al. 2012a;Crossa et al. 2013;Jarquín et al. 2014), and in GWAS of morphological traits and flavonoid pigmentation in maize and sorghum, respectively (Romay et al. 2013;Morris et al. 2013). Recently, Zhao et al. (2016) used specificlocus amplified fragment sequencing to create a genetic map of cauliflower. One disadvantage of GBS is a high proportion of missing data, which may be alleviated by genotype imputation (Poland and Rife 2012), although the value of imputation for association mapping and genomic prediction is debated (Marchini and Howie 2010;Rutkoski et al. 2013).
In this study, we combined GBS with a phenotypic analysis of cauliflower genetic resources to investigate the potential of GWAS for the identification and genomic prediction for the selection of useful genetic variation in cauliflower breeding. More specifically, the main objectives of this study were (1) to identify SNP markers that are associated with phenotypic variation in curd-related traits using GWAS, (2) to quantify the predictive ability of genomic prediction models for curd-related traits in diverse cauliflower genotypes obtained from ex situ gene banks, and (3) to evaluate the effect of data imputation on association mapping and genomic prediction. We show that GWAS identifies genomic regions harboring potentially useful variation and that genetic resources are suitable for genomic prediction of phenotypic variation in highly heritable traits.

Plant materials and phenotyping
A total of 192 cauliflower accessions representing a wide range of morphological diversity and geographical origins were obtained from the gene banks of the United States Department of Agriculture (USDA) and from the Leibniz-Institut für Pflanzengenetik und Kulturpflanzenforschung (IPK) Gatersleben, Germany. Detailed information about accessions is given in Supplemental Material, Table S1 in File S1. All accessions were phenotyped for curd-related traits in replicated field trials at two locations (experimental stations: Heidfeldhof and Kleinhohenheim in Stuttgart, Germany), for three successive growing seasons (June 2011, April 2012, and August 2012. The field experiment was conducted in randomized complete block design with two replications (Yousef et al. 2015). Five ripened curds were harvested from each plot and used to measure six traits that reflect various aspects of curd development and morphology according to Lan and Paterson (2000). where the closest first-rank branch originated from the main stem. 5. Nearest branch length (cm): length of the branch that is nearest to the apical meristem. 6. Days to budding: number of days from planting to appearance of the first floral bud.

Genotyping and marker imputation
We genotyped the material with the original GBS protocol using the ApekI restriction enzyme (Elshire et al. 2011). The sequencing and bioinformatic analyses of our material are described in Yousef et al. (2017). Briefly, 192 genotypes were sequenced as two sequencing libraries of 96 individuals on an Illumina HiSeq 1000, using a set of in-house scripts and public sequence analysis tools including bwa (Li and Durbin 2009)  We imputed missing genotypes with fastPHASE (Scheet and Stephens 2006) and BEAGLE (Browning and Browning 2007); both use a hidden Markov model to cluster haplotypes but they differ in the underlying model. The fastPHASE method uses an expectation-maximization algorithm for parameter estimation and fixes the number of haplotype clusters in the model. By contrast, BEAGLE uses empirical frequencies and allows the cluster number to be changed at each locus for a better fit to the localized linkage disequilibrium (LD) (Pei et al. 2008). We used fastPHASE as the main imputation method because we expected it to perform better than BEAGLE with our data set, which is characterized by a low LD, small sample size, and high marker density. The advantages of imputation methods under different scenarios were described by Browning and Browning (2011). We included BEAGLE for comparison for some analyses, but used only fastPHASE-imputed data for the GWAS to keep the number of analyses manageable. SNP markers with minor allele frequency (MAF) , 0:05 and any missing values were excluded from further analyses, which resulted in a total of 675 markers (unimputed data set). Next, SNP alleles were imputed with fastPHASE (Scheet and Stephens 2006) and markers with a MAF , 0:05 were excluded, which resulted in a total of 64,372 SNPs (imputed data set with fastPHASE). Genomic prediction was conducted with a third data set, in which SNP alleles were imputed with BEAGLE 4 (Browning and Browning 2007) and markers with a MAF , 0:05 were excluded, which resulted in a total of 62,566 SNPs (imputed data set with BEAGLE).

Analysis of phenotypic variation
The effects of the genotype and environment (environment was treated as the combination of location and season) and their interactions with phenotypic variation were evaluated using analysis of variance with the aov function of R (R Core Team 2015). Details of the phenotypic data analysis are described in Yousef et al. (2015). A mixed-effects model was fitted using restricted maximum likelihood (REML) with the lmer function from the R package lme4 (version 1.0-5) (Bates et al. 2014): (1) where y ij are the adjusted means of the i th genotype in the j th environment, m is the overall mean, G i is the effect of the i th genotype, E j is the effect of the j th environment, and GE ij the genotype · environment interaction. All effects were considered as random, except the intercept, which was treated as a fixed effect. Variance components of this model were used to calculate broad sense heritability for each trait according to Nyquist and Baker (1991) as where H 2 is the broad sense heritability, s 2 g is the genetic variance, s 2 ge is the genotype-by-environment variance, s 2 e is the error variance, r is the number of replications, and e is the number of environments. Best linear unbiased predictors (BLUPs) for the genotypic effects were extracted from model (1) and used to calculate the genetic correlation (r G ) among all traits. The genetic and phenotypic correlation coefficients are based on the Pearson correlation coefficient.

Population structure and LD
We used a discriminant analysis of principle components (DAPC) to infer clusters of genetically related individuals by using a k-means algorithm as implemented by Jombart and Ahmed (2011). LD between adjacent markers was calculated and the LD decay over distance for each chromosome was assessed. To identify differences in LD levels between the complete sample and the clusters identified by k-means, we used PLINK (Purcell et al. 2007) to calculate LD as: where p ab is the frequency of haplotypes with allele a at one locus and allele b at the other locus (VanLiere and Rosenberg 2008). The extent of background LD was estimated as the correlation of the 95% percentiles of all pairwise markers between chromosomes (Breseghello and Sorrells 2006). Additionally, we analyzed the persistence of linkage phase between DAPC-inferred clusters and the whole sample to validate whether a marker effect estimated in one cluster will contribute to the prediction ability in other clusters. Persistence of linkage phase is calculated as correlation coefficient r by: where D ¼ p ab 2 p a p b : As a measure of LD, r ranges from 21 to 1 (De Roos et al. 2008). Persistence of linkage phase is expressed as correlation of r between the same chromosomes of each cluster. The number of markers differed between clusters; therefore, we averaged the correlation of r values between clusters over groups of 50 markers.
where y is the vector of phenotypes; X is a matrix containing the markers; b is the vector of fixed-effects coefficients; Z is an incidence matrix; u is the random effect, where Var u ð Þ ¼ s 2 a K, with K representing the relationship between genotypes inferred from genetic markers; and e is the residual effect with e Nð0; s 2 e IÞ. Additive genetic variance (s 2 a ) and environmental variance (s 2 e ) are derived from the REML estimates. Variance components are only calculated once and are taken as fixed in EMMAX, which speeds up computation (Kang et al. 2010). MLMM uses the same linear model as EMMAX, but additionally includes significant SNPs as covariates in the model by using a forward-backward stepwise algorithm with reestimation of variance parameters (s 2 a and s 2 e ) at each step.
Following Utz et al. (2000), the proportion of phenotypic variance explained by a QTL was calculated as: where R 2 is the coefficient of determination, z is the number of predictors (number of significant SNPs in GWAS), and N is the number of observations. The proportion of genotypic variance explained was calculated as:r where b H 2 is the heritability of a given trait as defined in equation (2). Confounding effects due to population structure were evaluated with the inflation factor l, which is the ratio of the observed median to the expected median of a test statistic distribution (Devlin and Roeder 1999). Values close to 1 indicate no inflation. The significance threshold was set for the unimputed and imputed data for each trait separately using the false discovery rate (FDR; Benjamini and Hochberg 1995), where FDR values are computed from the P-values. The FDR was set to 0.2 for all data sets.

Genomic prediction
We evaluated three different genomic prediction methods, namely genomic BLUP (GBLUP), ridge regression BLUP (RRBLUP), and BayesB. The GBLUP model uses a realized relationship calculated from genetic markers (Habier et al. 2013) and is defined as: where g is an n · 1 vector of random effects and Z is the design matrix. Genetic values are modeled as random effects with g Nð0; Gs 2 g Þ; with s g as the genetic variance and G the realized relationship matrix.
An RRBLUP model (Meuwissen et al. 2001) was used to estimate the marker effects and to calculate the prediction ability. The model is of the form: where y is the vector of n phenotypic records, m is a vector of fixed effects that represents the overall mean, b is an n · 1 vector of random effects, and X is the marker matrix. The residuals follow a normal distribution e Nð0; Is 2 Þ; where I is the identity matrix. The GBLUP and RRBLUP implementations of the synbreed R package (Wimmer et al. 2012) were used. Additionally, a BayesB (Meuwissen et al. 2001) model as implemented in the BGLR R package (Pérez and de los Campos 2014) was used to estimate marker effects and to calculate prediction ability. BayesB uses the same linear model as RRBLUP, but a prior for the marker effects is modeled as mixture of a point of mass at zero and a slab that has a scaled-t density. Following Pérez and de los Campos (2014), the notation of the prior distribution is: where b represents the vector of regression coefficients and s 2 b is the respective variance. Parameter p is the proportion of nonzero effects and follows a b prior distribution, which implies the possibility of GðS b r; s j Þrepresent the normal, chi-squared, b, and g density distributions, respectively (Pérez and de los Campos 2014). BayesB was chosen over other Bayesian models such as BayesA and BayesC because it assumes an a priori distribution of marker effects following a mixture distribution with point mass at zero and a scaled-t slab similar to BayesA, and utilizes both shrinkage and variable selection, similar to BayesC (Pérez and de los Campos 2014). The hyper-parameters were chosen according to the default values in BGLR.
Since population structure can inflate prediction ability (Guo et al. 2014), we estimated the effect of population structure on prediction ability by directly correcting the kinship matrix in the GBLUP model for confounding effects of population structure (Thorwarth et al. 2017). The corrected kinship matrix was calculated with PC-Relate (Conomos et al. 2016). Prediction ability was defined as correlation between observed phenotypic and predicted genotypic values [corðy;ĝÞ] and was calculated by a five-fold cross-validation with 10 replications for each trait. To test whether genotypic effects in the genomic prediction and GWAS were caused by the same QTL, the position of SNPs with the greatest effects in genomic prediction models were compared with the most significant SNPs in the GWAS.

Phenotypic analysis of the six yield-related traits
The 174 gene bank accessions were evaluated for six curd-related traits and exhibited a large phenotypic variation in all traits (Yousef et al. 2015). For example, number of days to budding ranged from 45 to 118 d with an average of 75:24611:68 d, and curd width ranged from 11.11 to 15.29 cm with an average of 13:5260:73 cm (Table S2 in File S1). Analysis of variance showed that all traits were strongly affected by genotype (G), environment (E) and genotype by environment interaction (G · E; P , 0:001). Broad-sense heritability (H 2 ) differed strongly between traits. The traits of cluster width and number of days to budding showed moderate (56%) and high (94%) heritabilities, respectively. Furthermore, the traits of curd width and cluster width showed high phenotypic (r p ¼ 0:69) and genotypic (r g ¼ 0:59) correlations (Table S3 in File S1), as did apical length and nearest branch length (r p ¼ 0:79 and r g ¼ 0:71). Number of days to budding was negatively correlated with number of branches (r p ¼ 2 0:22 and r g ¼ 2 0:23).

Analyses of population structure and LD decay
We inferred five genetic clusters ( Figure 1A) by k-means clustering ( Figure S1 in File S1). The clusters are differentiated by geographic origin and flowering time. Cluster 1 (n = 25) consists of geographically diverse accessions (Table S4 in File S1) without a distinct geographic origin, but differs from the other clusters by its high average time to flowering ( Figure 1B). Clusters 2 (n = 56) and 4 (n = 40) consist of predominately European accessions that differ by their mean time to budding. Clusters 3 (n = 24) and 5 (n = 22) consist mainly of Asian accessions that also differ by mean time to budding.
We next tested whether the five clusters also differ by the extent of genome-wide LD, which may reflect differences in the breeding history. Based on the method described by Breseghello and Sorrells (2006), we estimated background LD as r 2 ¼ 0:17 intersecting with the nonlinear regression curve at 151 kbp ( Figure 2). The extent of LD is influenced by population structure and history, and we therefore calculated LD parameters for each cluster. Clusters 1, 3, and 5 show a rapid decay in comparison with the whole population, with an average background LD of r 2 ¼ 0:20; 0:20, and 0.24, extending to 98, 83, and 41 kbp (Figure 2). Clusters 2 and 4 have lower background LD values of 0.17 and 0.16, but higher long-range LD with averages of 280 and 231 kbp.
The average persistence of linkage phase is moderate between all clusters except for cluster 5, which reaches the expectation value of independent segregation (50%) for unlinked markers between clusters already at close distances in comparison with cluster 2 and cluster 4 ( Figure S2 in File S1), indicating a stronger differentiation between those clusters.

GWAS of six yield-related traits
Since the accessions show a strong population structure, we conducted a GWAS with models that correct for population structure and used the l parameter to assess how well a correction was achieved. Overall, both GWAS methods showed l values close to 1 (reflecting a good correction) for all traits and data (imputed vs. unimputed; Table S5 in File S1), which shows that the two methods account sufficiently well for population structure.
In total, 24 SNPs are associated with at least one of the six curd-related traits ( Figure S3 in File S1; FDR¼ 0:2). With EMMAX, six of 675 unimputed SNPs were associated with the traits of curd width, cluster width, number of branches, and number of days to budding. Only three SNPs of the imputed data set were associated with number of days to budding using EMMAX. Significant SNPs between the imputed and unimputed data sets differ from each other; using the imputed SNPs we identified an additional putative major QTL on chromosome O6 ( Figure S4 in File S1 and Table 1). MLMM identified nine SNPs associated with apical length, number of branches, nearest branch length, or number of days to budding using the unimputed data set, whereas in total six SNPs from the imputed data set were associated with either apical length or number of days to budding (Table 1).
Of the 24 SNPs significantly associated with the six traits, six QTL were identified by EMMAX and by MLMM. Moreover, one SNP (C02:5063181) was associated with both curd width and cluster width. Another SNP (C07:41524584) was associated with number of days to budding and number of branches (Table 1). We identified minor-effect QTL related to apical length, number of branches, and nearest branch length. Taken together, the results indicate that the imputation slightly increased the number of significantly associated SNPs.
Further, we tested whether SNPs identified as significantly associated with phenotypic variation also contribute to population structure (expressed as high DAPC loadings using only the imputed data; Figure S5 and S6 in File S1). For the traits of curd width and cluster width, no significant SNP detected by the GWAS was linked to a SNP with a high DAPC loading ( Figure S5 in File S1), but for number of days to budding the significant association detected on chromosome O6 is located close to SNPs with high DAPC loadings ( Figure S6 in File S1). This suggests that SNPs linked with flowering time variation also contribute to population differentiation.

Genomic prediction with GBLUP and BayesB
Finally, we tested whether genomic prediction of phenotypic traits can be carried out with genetically diverse gene bank accessions to select new genetic resources for breeding purposes. The average prediction ability for each trait was calculated using five cross-validations with 10 replications; it ranged from 0.13 to 0.65 with GBLUP (Table 2) and from 0.09 to 0.66 with BayesB (Table 3) for the different traits and data sets. The prediction ability for RRBLUP was the same as for GBLUP, and we used RRBLUP estimates of marker effects for comparison with GWAS results. For all traits and both methods (GBLUP and BayesB), average prediction abilities were higher with imputed than with unimputed data, but the differences were minor (0.42 vs. 0.39).
With BayesB, prediction ability ranged from 0.09 to 0.66 (Table 3). Imputation resulted in slightly higher prediction abilities for all traits, except for number of days to budding. The prediction ability of fastPHASE was slightly higher than with BEAGLE for all traits except number of days to budding and apical length (Table 3). To compare the GWAS with the genomic prediction results, we ranked the marker effects estimated by RRBLUP and BayesB in descending order and compared them with the 24 significant associations detected by EMMAX and MLMM. Of these 24 SNPs, 18 also produced the largest marker effects and were among the top eight SNPs with the highest P-values in the GWAS (Table 1).
We assessed the influence of population structure on prediction ability using cross-validation. A correction for population structure resulted in a minor decrease in prediction ability for most traits, except curd width. The decrease in prediction ability was substantial for number of days to budding (0.64 to 0.39; Table 2). To test the influence of population structure on prediction ability, we randomly sampled subsets of markers and observed fairly high prediction abilities for small marker numbers (,100; Figure 3). Another approach to characterize the effect of population structure on prediction ability is to estimate the phenotypic variance explained by the first three principal components of a principal component analysis (Table S6 in File S1). According to this analysis, population structure had a strong effect on the genomic prediction ability of cluster width (40.01% variation explained by principal component 1), curd width (16.37% by principal component 1), and number of days to budding (27.66% by principal component 2). The variance explained by principal components was marginal for the other traits.

Phenotypic variation
Our sample of cauliflower accessions could be grouped into distinct genetic clusters that also differed in phenotypic traits. Heritability estimates of the traits of cluster width, curd width, and number of days to budding were similar to previous studies (Lan and Paterson 2000;Matschegewski et al. 2015) and indicate a sufficient quality of the field trial data for GWAS and genomic prediction. Heritabilities for number of branches, apical length, and nearest branch were very low, which reflects a low data quality or a more complex genetic architecture.
Strong population structure and differences in LD decay A comparison of the population structure inferred by genetic markers and the passport data indicates limitations of available passport data. Among the 93 USDA accessions in the sample, 49 have a European origin according to their passport data. Of these, 24 (49%) cluster with the European accessions from IPK and the remaining 25 (51%) are distributed between cluster 1 (late flowering, geographically diverse accessions) and cluster 5 (predominately Asian accessions). This is unexpected under the assumption of a strong correlation between geographic origin and genetic relationship, but may be explained by incomplete passport data. Data about the seed donor but not the collection site were available for 79 of 93 USDA accessions (85%), and in these cases the true geographic origin could not be verified. For this reason, the analysis of genetic relationship allows a putative geographic origin of accessions to be assigned and can be used to complement missing passport data.
Genome-wide LD levels in the complete sample, which we measured as LD decay, were fairly high for an outcrossing species. The maximum physical distance for genetic linkage is reached at 151 kbp using the whole population. This value is comparable with that of self-fertilizing species, but lower than a previous estimate for cauliflower of up to 400 kbp (Matschegewski et al. 2015). A comparison of LD decay and background LD for the complete sample and each inferred cluster revealed differences between these groups. For example, the high LD in clusters 2 and 4 that consist mainly of European cultivars suggests that breeding and small population size likely contributed to the observed differences.
The highest level of long-range LD is located on chromosome O4 in accessions of clusters 2 and 4 (data not shown). This chromosome harbors multiple paralogs of the BolbZIP gene family, which is involved into cold stress tolerance (Hwang et al. 2016). Although this trait was not evaluated in our field trials, we hypothesize that selection for cold tolerance in European varieties may explain the higher level of LD.
Finally, we compared the persistence of linkage phase between clusters ( Figure S2 in File S1). The proportion of markers in the same linkage phase was moderate between most clusters, but slightly lower between clusters 2 and 5 and clusters 4 and 5, which indicates that genomic prediction ability is mainly influenced by the genetic relationship of individuals and not by a persistent linkage of marker alleles with causal QTL in different clusters.
Significant marker-trait associations in a GWAS of traits with a high heritability The phenotypic differences in flowering time and curd width between genetic clusters suggest a genetic basis for these differences. We used two n The Rank column indicates which rank the significant association had among marker effects in the genomic prediction with ridge regression (RR) or BayesB methods. The last letter in the Method column indicates in which data set the QTL was discovered (U, unimputed, I, imputed). Phenotypic and genotypic variance indicate the percentage of variance explained by the respective SNP. Chr, chromosome; MAF, minor allele frequency; Pos, position.
GWAS methods for the imputed and unimputed SNP data to map QTL controlling these traits and identified 24 SNPs that were associated with at least one of the six phenotypic traits. The genomic locations of SNPs associated with flowering time (e.g., time to budding) coincide well with QTL regions identified in previous studies. For example, a SNP on chromosome O6 explained 26.8% of variation and was close to a QTL detected by Hasan et al. (2016). Another SNP found on chromosome O2 explains 16.4% of the phenotypic variation for the same trait and maps to a known QTL (Okazaki et al. 2007;Uptmoor et al. 2008;Matschegewski et al. 2015). The last of these studies also demonstrated the influence of QTL on chromosome O2 on time to flowering. The QTL region harbors a homolog of the A. thaliana Flowering Locus C (FLC) gene, which controls vernalization and has a major influence on flowering time. A third SNP on chromosome O1 explained 13.5% of genetic variation. It is located in a genomic region that harbors an ortholog to the A. thaliana the flowering time gene VRN1 (Matschegewski et al. 2015). Additional SNPs that explain a minor proportion of the genetic variation for flowering time are located on chromosome O7 but have not been described in literature.
The GWAS also identified three polymorphisms on chromosome O2 that are associated with curd width and cluster width. Strongly associated polymorphisms overlapped for these traits, which can be explained by the fact that the two traits are highly correlated with each other (Table S3 in File S1). The SNPs explain between 27.4 and 34.1% of the genetic variation and therefore seem to be linked to a major QTL. These SNPs are located in genes whose A. thaliana orthologs regulate response to heat stress, which is consistent with the observation that temperature has a strong effect on cauliflower curd development (Matschegewski et al. 2015). Hasan et al. (2016) identified a QTL on the same chromosome (O2), which was only expressed under high-temperature conditions (27°).
The differences between the two GWAS methods (EMMAX and MLMM) may result from the small sample size. Sample size has a great influence on detection power for complex traits in GWAS (Korte and Farlow 2013). However, small samples are sufficient to detect major QTL, because simulations show that QTL explaining 10% of genetic variance can be identified in a sample of 100 genotypes (Gawenda et al. 2015). As an example, a flowering time QTL caused by variation in the vernalization response gene FRIGIDA was found in a sample of only 107 natural accessions of A. thaliana (Atwell et al. 2010). We therefore consider our sample of 174 individuals sufficiently large to reliably detect major QTL for highly heritable traits such as flowering time regulation, curd width, and cluster width.

Evaluation of genomic prediction in cauliflower genetic resources
Genomic prediction is a useful method for characterizing gene bank accessions, because it may allow phenotyping to be restricted to a subset of accessions in order to predict trait values for the complete collection. We used GBLUP and BayesB for prediction, because these models have a good performance, stable prediction ability, and are suited for different genetic architectures (Meuwissen et al. 2001;Heslot et al. 2012). Genomic prediction worked best for traits with a high heritability such as number of days to budding and cluster width, and less well for the traits of apical length and length of nearest branch (Table 2 and Table 3). In addition to a low heritability, the precise phenotyping of the latter two traits is challenging, and measurement errors result in biased estimates of variance components and adjusted means that affect prediction ability.
Since this is the first study that uses genomic prediction in cauliflower, we can only compare our results with those for other Brassica species. Prediction abilities for number of days to budding and curd width are similar to those for flowering time (0.70) and grain yield (0.50) in rapeseed (B. napus) (Würschum et al. 2014), a close relative of B. oleracea, which suggests that genomic selection is a robust and promising breeding method for Brassica species. Prediction ability is influenced by the presence of population structure (Thorwarth et al. 2017). If both training and validation sets for genomic selection show a population structure, a correction for structure reduces prediction ability (Guo et al. 2014). The effect of population structure on prediction ability depends on the trait. For the trait days to budding, prediction ability decreased from 0.64 to 0.39 after correction, whereas a correction had no effect on prediction ability of the other five traits.
The strong effect of structure correction on prediction ability for flowering is consistent with our observation that this trait differs significantly between genetic clusters. Mainly loci close to the putative QTL on chromosome O6 seem to contribute to population differentiation (measured as high loading values in the DAPC analysis) and are close to SNPs linked to flowering time QTL in the GWAS. This indicates that selection for different flowering time or adaption to different areas contributed to the genetic population structure (Matschegewski et al. 2015). For the other traits such as curd with or cluster width there is no such overlap, which may be explained by a more complex genetic architecture of these traits.
Genomic prediction models simultaneously utilize all SNPs for calculating the breeding value. RRBLUB and BayesB estimate marker effects, which can be used for comparison with GWAS models that consider each SNP separately. We found strong overlaps of significant SNPs in the GWAS, with the highest marker effects obtained from the genomic prediction models (Table 1). This overlap reflects the similarity of the statistical models used for GWAS and genomic prediction, and n suggests that these SNPs are linked to robust QTL. In summary, a comparison of putative causal SNPs identified by GWAS, the estimation of marker effects in genomic prediction, and the analysis of allele contribution to population structure (DAPC loading) may be considered as validation of significant marker-trait associations and provides useful information to improve genetic analysis (Spindel et al. 2016;Bian and Holland 2017). The success of genomic prediction depends on the method used and the size of the training set. We observed only minor differences between GBLUP and BayesB regarding their mean prediction ability over all traits and data sets ( Table 2 and Table 3). This observation confirms earlier work that Bayesian models do not outperform GBLUP (Bao et al. 2014). Since GBLUP is computationally less expensive and much simpler to implement than BayesB, our results suggest that GBLUP can be recommended for genomic selection in cauliflower breeding. The size of the training set is another point to consider, because larger training sets improve prediction ability and allow a robust estimation of marker effects (Windhausen et al. 2012). We provide a first assessment of genomic prediction in a small sample of genetically diverse cauliflower gene bank material, but larger training sets are required, especially for traits with a complex genetic architecture (.1000; Thorwarth et al. 2017).

Imputation effect on GWAS and genomic prediction results
Although GBS is an efficient method for obtaining large numbers of polymorphisms, the resulting data have a high proportion of missing values. Imputation of missing data may be used to overcome this limitation. In our GWAS analysis, we obtained fewer significant associations (10) with the imputed data than with unimputed data (14; Figure S3 in File S1). However, imputation may improve the power of GWAS because in the imputed data only we identified one additional QTL for flowering time on chromosome O6, which was also found by Hasan et al. (2016). However, several QTL observed in the unimputed data were not identified in the imputed data. One explanation for this discrepancy is significance levels that are based on marker number. For quantitative traits that are influenced by QTL with small effects, a Bonferroni correction of P-values in GWAS may be too conservative and result in a high proportion of false negatives (Perneger 1998). Therefore, we used the FDR, which is the expected proportion of significant associations that are false positives. It is defined as i n · Q, where i is the rank of the ascending P-values, n is the number of markers (tests), and Q is the FDR (Benjamini and Hochberg 1995). We used a FDR of 0.05, which implies that 20% of the observed significant associations are false positives. A smaller or higher Q value will lead to a more stringent or a more relaxed threshold, respectively. For example, with an FDR value 5% we observed 10 and with an FDR of 30% we obtained a total of 35 significant associations with EMMAX and MLMM. It should be noted that FDR thresholds are specific for each combination of traits and data sets, and for this reason the identification and implementation of optimal significance thresholds is still debated (Sham and Purcell 2014). Thus, a careful assessment of different thresholds together with the ranking of marker effects and a comparison of the DAPC loadings can help to improve the conclusions drawn from GWAS studies and validate the robustness of putative QTL.
The difference in the number of significant associations between the unimputed and imputed data sets is also influenced by the imputation method. The fastPHASE method is not well suited for GBS data, because a parameter vector of allele frequencies has to be estimated for each SNP. Figure 3 Effect of increasing the number of markers, included in a five-fold cross-validation with 10 replications using a standard GBLUP model, on prediction ability. Values represent averages of 100 runs. 10, 25, 50, 100, 250 and 500 markers, respectively, were sampled randomly for each run.
n As fastPHASE performs best with SNPs genotyped in many individuals (Scheet and Stephens 2006), the high proportion of missing data reduces the average number of individuals available per SNP. As the second imputation method, we used BEAGLE, which performs well with medium to large sample sizes (.1000 individuals), but not as well as fastPHASE when compared for small samples of 100 individuals, as demonstrated by Browning and Browning (2011). The clustering approach of BEAGLE flexibly changes cluster numbers to better accommodate local LD patterns, but in general neither method can cope very well with a large amount of missing data as observed in this study Xavier et al. 2016). The imputation of missing markers slightly improved prediction ability for most traits, consistent with previous studies that revealed only minor advantages of imputation, particularly in self-fertilizing crops with a slow LD decay (Rutkoski et al. 2013;Poland et al. 2012a;Jarquín et al. 2014). Prediction ability is mainly determined by relatedness and less by linkage between marker and QTL, as indicated by the linkage phase analysis. For this reason, imputation has little influence on prediction ability, because small numbers of high-quality SNPs are sufficient to capture the relationship structure among individuals.
In summary, there was only a minor advantage of imputation for GWAS and genomic prediction with our sample. However, the rapid development of genome sequencing technologies and rapidly decreasing costs will alleviate the problem of missing data for genomic prediction, as genome resequencing will uncover most genetic variation, in particular for crops with small genome sizes such as B. oleracea (Golicz et al. 2016).

Implications for the utilization of genomic resources
Breeding populations of cauliflower are characterized by a low genetic diversity (Golicz et al. 2016), which limits the potential for improving varieties that meet the expectations of growers and consumers. We showed that both GWAS and genomic prediction contribute to the genetic analysis of complex traits and to the identification of novel and potentially useful genetic variation in gene bank material ). Our study also provides a perspective with respect to the utilization of ex situ conserved gene bank accessions. Our sample was randomly selected and represents a broad genetic and geographic diversity. We achieved reasonable genomic prediction abilities, although genetic clustering inflates prediction ability if the cluster structure is correlated with trait distribution. For this reason, genotyping whole collections of gene bank accessions and the phenotyping of a sufficiently large subset allows the prediction of relevant phenotypic traits in the whole collection and the subsequent selection of accessions for further use as genetic resources. This will contribute to an efficient description and utilization of ex situ conserved germplasm resources.