Introduction

Olive is among the most important trees worldwide. Current world production of table olive and olive oil is over 19.21 and 32 million tons, respectively. Olive tree products are some of the main components of the Mediterranean diet that contribute to good health3. Olive fruits and leaves contain various functional compounds, such as hydroxytyrosol and oleuropein, beneficial for human health4,5. Olive trees are predominantly located in Mediterranean and Asian countries though there is increasing cultivation in Argentina, the United States, Chile and Australia1 due to high consumption of olive products. It is thought that olive domestication began in the region closest to the border between Turkey and Syria about 6000 years ago, before olive cultivars spread throughout the Mediterranean countries via different routes6. Turkey has seen continuous cultivation of olive trees since then. In Turkey, olive growing regions occupy a large area including western and southern coastal strips of the country. This has led to the rich variety of cultivars in Turkey. In the long-term conservation efforts of plant genetic resources, olive germplasm collections play an important role7. There are more than 100 olive germplasm collections at international, national and regional levels in mostly Mediterranean countries for conservation and breeding purposes8. These collections have been extensively used in molecular studies including identification, molecular characterization and also mapping studies9,10,11.

In olive breeding programs, clonal selection and cross-breeding have been conducted for developing novel cultivars12, however, these efforts are slowed down by olive’s juvenile period and complex genome13. As a result, very few mapping studies have been reported for dissecting agronomic traits in olive and limited numbers of markers have been identified as related to fruit traits14,15, flower traits, tree growth traits such as trunk diameter14 and olive oil quality traits16.

Genome-wide association studies (GWAS), which emerged as an alternative to classical linkage mapping17, utilizes historical recombination in a diverse population18. Compared with QTL mapping, GWAS mostly provides a higher mapping resolution19 and does not need to use an experimentally developed segregating population. GWAS has evolved as a powerful tool to dissect the genetic architecture of complex traits in large germplasm sets. It does, however, require a large number of markers for whole genome scans in crops with low linkage disequilibrium (LD) and high haplotype diversity20.

Next generation sequencing (NGS) technologies have allowed discovering and genotyping thousands of markers in large and diverse germplasm collections21. Single-nucleotide polymorphisms (SNPs) have become popular in QTL mapping and GWAS in plants20,22. They are co-dominant and bi-allelic markers that are distributed along the whole genome23. Genotyping by Sequencing (GBS) is a simple and inexpensive technique originally developed for high-resolution association studies in maize24, which involves reducing genome complexity25 by using restriction enzymes. GBS has been implemented in many crops such as maize24, barley26, wheat26,27, soybean28, rice29, oat30 and cassava31,32 for purposes of genetic characterization, GWAS, linkage analysis and genomic selection. The GBS technique, which does not require prior knowledge of the genome, is preferred for species that do not have reference sequence information25,30. Genetic mapping with SNPs generated by GBS has been extensively used in tree species including peach33, grapevine34, sweet cheery35, eucalyptus36, oil palm37, and apple38, and found to be effective to identify marker traits associations. In olive, GBS technology was used to assess the genetic diversity in Italian cultivars39 and to construct linkage maps in F140,41,42 and F243 populations. However, there are no reports of using GBS based SNP markers for association mapping study in diverse olive accessions.

Most genetic studies in olive genotypes have focused on characterization of this species and QTL mapping44. GWAS in olive has been used by our group in Turkish olive genotypes using AFLP, SSR and SNP markers45. Here, we report the development and application of GBS in a diverse set of olive germplasm from Turkey and the USA. Our objectives were to (1) identify SNPs within olive genotypes based on GBS analyses and (2) perform a comprehensive GWAS to identify significant marker trait associations. Successful application of GBS in olive would suggest that the method can be used in other tree species.

Results

Evaluation of phenotypic data

The descriptive statistics of leaf length (LL), leaf width (LW), fruit weight (FW), stone weight (SW) and fruit flesh pit ratio (FFPR) showed substantial variation was observed in all traits (Supplementary Table S1). Trait phenotypes ranged from 40.51 to 77.37 mm for LL, 7.30 to 26.10 mm for LW, 0.99 to 16.33 g for FW, 0.21 to 4.72 g for SW and 72.29 to 92.57 for FFPR. The statistical distribution of traits divided over geographical origin of accessions showed that year had a substantial effect on traits of accessions from NCGR, more so than TOGR (Supplementary Fig. S1). Relatively high H2 was calculated for FW and SW, 0.73 and 0.74, respectively (Supplementary Table S1). Heritability was moderate (0.52) for LW. The H2 estimates of LL (0.36) and FFPR (0.43) were low compared with other traits. The BLUPs of phenotypic values exhibited a near normal distribution for FULL, TOGR and NCGR panels (Fig. 1). Pearson’s correlation among the phenotypic traits showed that the highest degree of correlation was observed between FW and SW (r = 0.89). LL correlated positively with FW and SW (0.33 and 0.37, respectively). FW and SW also correlated positively with FFPR (0.6 and 0.28, respectively). There was no significant correlation between LW and other traits (Supplementary Fig. S2).

Figure 1
figure 1

The distribution of BLUPs for phenotypic values used for the GWAS. Distribution of BLUPs of phenotypic values in TOGR (Turkish Olive GenBank Resource) and NCGR (National Clonal Germplasm Repository) panels. Black and red lines are normal distribution approximations for the TOGR and NCGR panels, respectively.

Genotyping-by-sequencing and SNP detection

The fragment size distributions of GBS libraries from olive genomic DNA digested with EcoT22I and PstI restriction enzymes are shown in Supplementary Fig. S3. The size distribution curve was smoother for EcoT22I (majority of fragments are <500 bp) than for PstI and there was no highly repetitive DNA amplified (as would be evidenced by the presence of strong, discrete peaks or bands). The PstI library, however, contained a large amount of repetitive DNA (discrete peaks/bands). EcoT22I was chosen for reducing genome complexity in olive. The sequencing produced a total of 469,721,669 raw reads, an average of 2.56 million reads per sample, and those reads produced 3,415,115 tags. A set of 61,892 unfiltered SNPs were obtained from the FULL panel. The percentage of missing data and minor allele frequencies for all accessions are shown in Supplementary Fig. S4. SNPs were filtered based on minor allele frequency (MAF > 0.05) and missing rate (<0.20). A final set of 24,977 SNPs were obtained and used for genetic diversity, population structure and GWAS.

Genetic diversity and structure analysis

The genetic structure of the FULL panel was estimated using two complementary approaches. We estimated the marker-based kinship and found 67.2% of the kinship coefficients ranged from 0 to 0.2, indicating that most accessions have weak genetic relationship with the other accessions. This wide genetic diversity among olive accessions was also supported by their broader Euclidean genetic distance (28.74 to 162.28, mean: 120.61) (Supplementary Table S2). The maximum genetic dissimilarity between genotypes was 162.28 for Samsun Yaglik (GENO2) and Halhali 1 (GENO62). The minimum genetic dissimilarity was 28.74 for Gordal Sevillana (DOLE 13) and Koroneiki (DOLE 149). The genetic relationship for the FULL panel was visualized in the heatmap of the distance matrix (Fig. 2).

Figure 2
figure 2

Heatmap based on genotypic pairwise Euclidean distances.

PCA of the SNP dosage matrix was used to assess the clustering of genetic variation in olive accessions (Fig. 3a). PC1 explained 8.56% of the variation in the genotypic data, whereas PC2 and PC3 explained 5.68% and 5.37% of the variation, respectively (Fig. 3b). Although PCA analysis did not sort accessions based on their geographical locations, a subtle geographical pattern of distribution among Turkish genotypes may be deduced.

Figure 3
figure 3

PCA of olive FULL panel a) Scatter plot of the first two principal components (PC1 and PC2). (a) The genetic variation explained by the first ten 10PCs (b).

The optimum number of clusters (K) in the population was inferred to be six based on maximum likelihood and delta K (ΔK) values (Supplementary Fig. S5). When using a probability of membership threshold of 70%, 106 accessions were assigned into the six subgroups (Supplementary Fig. S6), while the remaining 77 accessions were classified into a mixed subgroup (Supplementary Table S3). Most accessions of Group 1 came from NCGR with only 2 from TOGR. Group 2 had the highest number of accessions among the groups with 23 and 22 accessions from TOGR and NCGR, respectively. The accessions of Group 3 were only from TOGR while the accessions of Group 6 included only accessions from NCGR. The accessions of Group 4 were primarily from TOGR with just six accessions from NCGR. Group 5 contained the fewest genotypes, with one from TOGR and four from NCGR. The Mixed group contained accessiong from both TOGR (36 accessions) and NCGR (41 accessions).

To investigate the extent of population differentiation between groups from STRUCTURE analysis, Fst values were calculated using the filtered markers (Supplementary Fig. S7). Genetic differentiation was higher between Group 5 and Group 6 (Fst = 0.417). The lowest degree of differentiation was found between Group 2 and Group 4 (Fst = 0.170). Fst values between the groups suggested that there was significant divergence across all groups.

Linkage disequilibrium

A total of 20,799 (4.81%) pairs of markers showed a significant LD value (D’) at P < 0.01 while 2,678 pairs of markers showed a significant LD at P < 0.001. Based on r2 estimates, 28.4% and 14.7% of the marker pairs showed a significant LD value of r2 ≥ 0.05 and r2 ≥ 0.01, respectively. The r2 values for all significant loci ranged from 0.05 to 1. The mean r2 and D’ for all pairs was 0.05 and 0.01, respectively. Supplementary Fig. S8 shows the distribution of the r2 values of all (left panel) and r2 ≥ 0.1 (right) for all marker pairs.

Genome-wide association study

Of the 3 three statistical models tested, the MLM_PCs + K model and the MLM_K model had similar power and showed a significant improvement in goodness of fit compared with the MLM_Q + K model. Increasing the number of PCs in the models did not decrease the type I error inflation (Supplementary Fig. S9). Hence, we kept only two PCs in the MLM_PCs + K model. Based on this information and on QQ plots of observed vs. expected P-values (Fig. 4), we chose the MLM_K model for association analysis and all subsequent results are based on it. Multiple testing burden was controlled using FDR correction46 at a 5% rate.

Figure 4
figure 4

Comparison of QQ plots obtained with different GWAS models for five traits in the FULL panel.

In total, 52 significant associations were detected in the FULL panel (Supplementary Table S4). Among these associations, 12, 19, 18 and 4 markers were associated with LL, FW, SW and FFPR respectively. No significant associations were detected for LW. The data listed in Supplementary Table S4 also showed that some of the markers were associated with more than one trait, e.g., S1_904125, S1_12591134, S1_1899635, S1_4122458 and S1_9030959 markers were associated with FW and SW. The most significant marker (S1_13767032) had a P-value of 9.11E-08 and was associated with SW (Fig. 5).

Figure 5
figure 5

Manhattan plots summarizing genome-wide association results for LL, FW, SW and FFPR in the FULL panel. The FDR significance threshold is shown in black.

The twelve significant SNPs, which were identified for LL, explained 13.7% of phenotypic variance on average (8.5–20.4% for different loci). FF exhibited 19 significant SNPs that explained 5.27–27.8% of the phenotypic variance with an average 15.08%. A total of 18 significant SNPs was identified for SW, which explained 6.29–27.4% of the phenotypic variance with an average 13.18%. Variance explained by significant SNPs for FFPR ranged from 11.6 to 18.6% with an average of 14.7%.

Two recently published olive genomes were used to align sequence reads of significant SNPs. Out of the 53 significant associations in our study, 27 SNPs (51%) were mapped in the wild olive reference genome41 while 40 SNPs (75%) were mapped in the genome of Olea europaea L. subsp. europaea var. europaea cv. ‘Farga’47 (Supplementary Tables S5 and S6). Chromosomal positions of significant SNPs determined according to wild olive reference genome41 located SNPs far from each other. Four SNPs associated with LL were identified on chromosomes 2, 4 and 23. The 8 SNPs associated with FW were present on chromosomes 1, 6, 9, 11, 15 and 19 while the 10 SNPs associated with the SW were located on chromosomes 6, 12, 16, 18, 21, 22 and 23. Lastly, the 2 SNPs associated with the FFPR content were present on chromosomes 10, and 19. Among the significant SNPs, 5 were associated with both FW and SW. Three of them were mapped on chromosomes 8, 10, and 11. The most significant SNP marker (S1_13767032) explaining 19.8% of the phenotypic variance was located on chromosome 16 (Supplementary Table S5).

To assess the extent of association mapping, triangle plots for pairwise LD between significant markers were created for each trait (Fig. 6). The pattern of LD blocks shows that significant LD was not only detected between significant markers located on same chromosomes but also between significant markers on different chromosomes (chromosome information in Supplementary Table S5). The highest LD was obtained between two significant markers (S1_6412238 and S1_11607122, r2 = 0.88) associated with FW which were not aligned to the wild olive reference genome.

Figure 6
figure 6

A heatmap of LD (r2) between significant SNPs, where darker blue colors represent stronger positive correlations between SNPs, (a) LL, (b) FW, (c) SW, (d) FFPR.

In the TOGR and NCGR panels, GWAS detected a total of 15 and 23 significant SNPs respectively (Supplementary Table S7 and S8, Supplementary Figs. S10 and S11). No significant SNP was detected for LW in either panel. The most significant SNP marker (S1_4640124) in TOGR panel had a P-value of 1.82e-05 and explained 25.5% of the phenotypic variation. Out of the 15 significant associations in TOGR panel, 7 SNPs (47%) were mapped in the wild olive reference genome (Supplementary Table S9)41 while 14 SNPs (93%) were mapped in genome of Olea europaea L. subsp. europaea var. europaea cv. ‘Farga’ (Supplementary Table S10)47. In the NCGR panel, the SNP marker (S1_13473561) was the most significant SNP with a P-value of 7.15e-07 and explained 18% of the phenotypic variation. Twelve (52%) and 11 (48%) significant SNPs in NCGR panel were mapped in wild olive reference genome41 (Supplementary Table S11) and genome of Olea europaea L. subsp. europaea var. europaea cv. ‘Farga’47 (Supplementary Table S12) respectively.

Candidate genes

Significant SNP tags were aligned against the wild olive reference genome. Regions within 10 kb were searched for candidate genes. A total of 19 unique genes were identified within these regions (Table 1), most of which are annotated to a protein that are responsible for developmental and physiological processes. For LL, two particular genes were found close to marker S1_1842014 on chromosome 2. We discovered 8 candidate genes for FW on chromosome 1, 9, 10, 11, 15 and 19. Two candidate genes for FW were present within a 2 kb window of S1_7336035 marker on chromosome 15. For SW, 8 genes were predicted on the chromosome 10, 12, 16, 18, 21 and 22 and three genes were found close to marker S1_6292562 on chromosome 22. On chromosome 10, 2 candidate genes were identified at upstream and downstream of S1_138350 marker that was significant for FFPR.

Table 1 List of the SNPs and nearest gene(s) for LL, FW, SW and FFPR.

Discussion

Understanding the genetics behind fruit, endocarp and leaf related traits is a key element for the improvement of olive accessions for breeding purposes. This study is the first comprehensive report of association analysis on olive using GBS markers. We used a diverse panel of 183 olive accessions from two different Genbank resources (TOGR and NCGR) to identify significant markers associated with LL, LW, FW, SW and FFPR. The large variation in traits observed among accessions as well as the significant correlations between some traits indicates the large phenotypic trait diversity among accessions. Three traits (LL, FW and SW) showed a larger than 3-fold difference between minimum and maximum values (Table S1). Previous studies in cultivated and wild olive genotypes indicated similar high correlations between FW and SW14,48 and large variation in fruit, leaf and endocarp related traits49. Arias-Calderon et al.50 observed significant phenotypic variability in traits such as fruit weight, stone weight and flesh/stone ratio among progenies, which agrees with our findings. Phenotypic measurements were carried out for two and four years in TOGR and NCGR accessions, respectively and considerable phenotypic variation was found for each year in all traits (Supplementary Fig. S1). Similar year variations for fruit and endocarp related traits have also been reported in other olive studies14,48. BLUPs were used in GWAS to reduce environmental deviation in association analysis as suggested by Piepho et al.51.

The estimates of H2 for SW, FW, LW, FFPR, and LL were high to low, ranging from 0.74 to 0.36. SW and FW were among the highly heritable (0.74 and 0.73 respectively) traits in this study (Table S1). High heritability estimates are indicative of high quality of the data obtained52. Moderate to high H2 estimates of LW, FW and SW obtained in this study imply that these traits are under strong genetic control. Heritability estimates are critical in plant breeding and genetics, but experimental approaches are difficult to implement especially in long-lived plants such as trees53. Only a limited number of studies that estimate broad sense and narrow sense heritability of tree, fruit, endocarp and oil related traits have been published in olive50,54,55,56,57,58,59 and, consistent with our results, relatively high heritability estimates for fruit and endocarp characteristics were obtained. Arias-Calderon et al.50 reported high narrow sense heritability (0.82) while Fanizza et al.58 reported a moderate heritability estimate (0.6) for FW. Zeinanloo et al.59 obtained higher H2 estimates (0.85) for FFPR than we did, and they also obtained H2 estimates for FW (0.42) and SW (0.31). Contrary to our study, Padula et al.55 reported higher H2 estimates for FFPR than FW. These findings imply that accessions had abundant genetic variation and were suitable for marker-trait association mapping. Previous studies indicated that core collections of different numbers of olive genotypes from Genbank collections showed abundant phenotypic and genetic variation9,10,11. To investigate suitability of the olive core collections for association mapping studies, different sampling approaches and different numbers of genotypes from the World Olive Germplasm Bank (WOGB) in Cordoba, Spain9,10 and the WOGB in Marrakech, Morocco11 were analyzed. The studies reported that both core collections contained mostly Western Mediterranean cultivars9,10 but core collections with cultivars that reflect the full geographic distribution of olive11 are suitable for association mapping.

We present the first application of GBS in diverse olive accessions from two different Genbank resources. The few GBS studies in olive to date have focused on genotyping F1 and F2 individuals40,41,42,43 and Italian cultivars39. SNP calling in highly heterozygous species such as olive is more difficult than inbred lines34,38. The GBS protocol we implemented enabled the discovery of thousands of SNPs. The EcoT22I restriction enzyme was used for the reduction of genome sequence complexity. Ipek et al.40 used ApeKI while Unver et al.41 selected a combination of PstI-MseI restriction enzymes to perform GBS in olive. D’Agostino et al.39 used EcoT22I restriction enzyme in GBS of Italian cultivars. A total of 24,977 SNPs were obtained after filtering which is higher than the number of SNPs detected in other GBS studies in olive40,41. The average number of sequence reads per sample we obtained (2.56 million) was similar to what was reported in other olive studies by Ipek et al.40 (2.1 million) and D’Agostino et al.39 (2.6 million).

Analysis of the population structure and genetic relatedness between accessions in a GWAS has critical importance for elimination of spurious marker-trait associations18,20. The PCA visualization did not show separation of accessions into subpopulations on the basis of geographic origin (East, Central, and West Mediterranean Basin, North America and Japan). Predictably, most Turkish genotypes were clustered together however, some fell into clusters comprised of European genotypes from NCGR. Diez et al.9. reported an indistinct geographical pattern of distribution among olive accessions from WOGB in Cordoba. Contrary our study, Belaj et al.10 and El Bakkali et al.11 showed that PCA analyses clustered olive accessions based on their geographic origin (western, central, and eastern Mediterranean).

The FULL panel in this study was divided into 6 groups by STRUCTURE analyses (Supplementary Fig. S6). Seventy-seven accessions (42%) were categorized as admixed with varying levels of membership in the 6 groups. Previous research has reported genetic admixture on olive60,61,62,63. Differentiation between groups due to genetic structure was measured with Fst values and an Fst value greater than 0.15 can be considered significant64. The pairwise Fst values between all groups were higher than 0.15, indicating high genetic differentiation in our FULL panel. We observed concordance between distance-based cluster analysis (Fig. 2) and model-based STRUCTURE analysis (Supplementary Fig. S6). Neither cluster nor STRUCTURE analysis distinguished the accessions based on their geographic origin. The lack of concordance between geographic and genetic distance may come from olive trees being transported among ancient civilizations around the Mediterranean basin65.

Other studies on population structure and genetic diversity of olive also reported high genetic variation among olive accessions, supporting our findings7,60,66,67. Precise evaluation of population structure and genetic diversity of germplasm collections is crucial for not only GWAS studies but also for efficient management of accessions in terms of conservation of genetic variability. Similar to other fruit trees, use of synonyms and homonyms are among the most common issues in cultivar designation of olive. Synonyms and homonyms in olive were widely reported using various marker techniques such as AFLP, SSR, SNP68,69. Koehmstedt et al.69 and Barranco et al.68 stated that ‘Oblonga’ and ‘Frantoio’ accessions were synonymous according to their findings obtained using limited number of SSRs. In our study, two separate approaches (model-based STRUCTURE and distance-based clustering) located these accessions close to each other. Frantoio (Dole 181 from Albany) and Oblonga (Dole 3 from France) accessions were closer to each other than any other pair of accessions. Also, the other Frantoio accession (Dole 2 from Albany) was located in same cluster with those accessions. Finally, some accessions that had the same accession name but different code numbers (DOLE 118 and DOLE 124, DOLE 10 and DOLE 28, DOLE 114 and DOLE 115, DOLE 6 and DOLE 13, DOLE 54 and DOLE 174, DOLE 11, DOLE 49 and DOLE 58) collected from different locations were located side by side in the heatmap (Fig. 2).

Linkage disequilibrium between markers is one of the critical factors in association mapping studies since it provides information related to mapping resolution and strength70. Different descriptors of the amount of LD, D’ and r2, were estimated for every pairwise combination of SNPs. In all, 28.4% of the marker pairs showed significant LD at r2 > 0.05. To date, very few studies attempted to estimate LD in olive using various numbers of accessions. El bakkali et al.11 obtained significant LD scores in 59.5% and 26.5% of the pairwise comparisons analyzing different number of accessions. LD measured in Turkish Genbank accessions45 was quite low in terms of r2. Low LD scores in a small data set of olive accessions also reported by Reale et al.71. The mating system of the species is one the most important factors that affects LD72. The creation of new recombination leads to low LD in out-crossing species70. The low LD we observed is similar to those other outcrossing tree species such as conifers73, almond74, eucalyptus75 and coffee76.

Fruit weight, stone weight and pulp stone ratio are agronomically important traits in olive tree similar to other fruit trees77. After proving that olive leaves are also a rich source of secondary metabolites78, leaf traits have also been included among important traits in olive. Limited information exists in olive on QTL linked to agronomic traits including flowering related traits15, fruit related traits14, trunk diameter and oil content14,16. To the best of our knowledge only three studies have been reported on association mapping in olive for fruit related traits45,79, oil content80 and plant vigor45. In two of these studies only 1879 and 2280 olive accessions were used while a study published by our group45 included 94 olive accessions. We tested three models on the FULL panel. The MLM_K model and MLM_PCs + K showed a similar expected distribution of P-values. We used the MLM_K model since it showed a significant improvement in goodness of fit. The MLM_K model also has a shorter computational time and it does not need any additional steps such as obtaining population structure81,82. Comparisons of different statistical models for GWAS were also conducted in other tree species such as apple83, pine84, and almond74. Previous studies in almond74 and grapevine85 reported that The MLM model with kinship matrix had a better fit by controlling population structure and relatedness.

We found 53 significant markers associated with four traits in the FULL panel, including 12 associations with LL, 19 associations with FW, 18 associations with SW and 4 associations with FFPR. Two significant markers (S1_6412238 and S1_11607122) associated with FW were in high LD (r2 = 0.88, Fig. 6). The most significant marker (S1_13767032, P-value = 9.11E-08), associated with SW, was in relatively high LD with marker S1_12591134 (r2 = 0.65). GWAS was also conducted for TOGR and NCGR panels separately. Fifteen and 23 significant markers were detected in TOGR and NCGR panels, respectively. None of these were common with each other or with significant markers found in FULL panel.

We identified significant SNP marker locations using two recently published olive reference genomes. Among the 53 significant SNPs, 27 SNPs (51%) and 40 SNPs (75%) were mapped in the wild olive reference genome41 and the genome of Olea europaea L. subsp. europaea var. europaea cv. ‘Farga’47, respectively. We mapped more significant SNPs in genome of Olea europaea L. subsp. europaea var. europaea cv. ‘Farga'47 which is likely due to its higher genome coverage (95%) compared to wild olive reference genome (42%). However, this reference genome47 does not have chromosome assignments so mapped significant SNPs could not be assigned to a particular chromosome using this reference genome (Supplementary Table S6). Chromosomal positions of 27 significant SNPs according to wild olive reference genome41 located them far from each other. Some significant markers (S1_12085523 and S1_13002224, S1_11607122 and S1_7336035) associated with FW were found on chromosome 11 and 15, respectively. Also, 6 significant markers associated with SW were found on chromosomes 12 (S1_11074838 and S1_13555831), 18 (S1_984251 and S1_2244037) and 23 (S1_7495520 and S1_8195416). Although these pairs of markers were located on same chromosome, they were distant from each other. Comparison of chromosomal locations between this study and previously published studies could not be done due to the use of different molecular marker techniques. It is also important to explore whether significant markers we found are located in the same regions as in the previously reported QTL studies. Limited information exists in olive identifying QTLs linked to the traits analyzed. In a QTL mapping study published by Sadok et al.15, 8 QTLs linked to fruit weight were identified on different 7 linkage groups using ISSR, SSR and AFLP markers. In another QTL mapping study14, one QTL was identified for fruit weight on linkage group 17, while 3 QTLs were identified linked to pulp/stone ratio on linkage groups 10 and 17 using DArT-SSR markers.

We found 19 candidate genes close to significant markers in the FULL panel within a 10-kb region window in either direction of a significant SNP. The most significant SNP marker, S1_13767032, was located on 0.497 kb upstream of the Oeu046142.1 gene. This gene is annotated as a PHD finger transcription factor and the family to which this gene belongs, plays a key role in regulating plant growth and development86. The analysis of transcriptomes provides genomic resources for functional annotation to discover genes for olive breeding87. To date, several transcriptome studies have been performed for olive using different organs at different developmental stages41,47,87,88,89. There have been attempts to identify candidate genes associated with important traits such as plant architecture90 and juvenility91 in olive, but no putative candidate genes underlying QTL have been reported. More studies are still required to facilitate validation of these results in different olive populations.

Materials and Methods

Plant materials

We used 94 accessions from Turkish Olive GenBank Resources (TOGR) panel in Izmir, Turkey and 89 accessions from the USDA, ARS, National Clonal Germplasm Repository (NCGR) panel in Davis, CA, USA. Detailed information of these accessions (FULL panel) is provided in Fig. 7 and Supplementary Table S13. The map in Fig. 7 was generated using the ‘ggmap’ package92 in R version 3.4.293. Fresh leaf tissue was harvested from the youngest leaves of each tree in the leaf shooting stage. Leaf tissue samples were stored at −80 °C until DNA was extracted.

Figure 7
figure 7

Geographical distribution of accessions. Origin of accessions is represented by a dot on the world map. Accessions without origin information (4 accessions from NCGR) are not shown in here.

Phenotypic data

Phenotypic data for the following five traits was evaluated: leaf length (LL), leaf width (LW), fruit weight (FW), stone weight (SW) and fruit flesh to pit ratio (FFPR). For 94 accessions from TOGR, phenotypic data measurements (FW and SW) were carried out during 2011 and 2013 and this data was obtained from a previously published study by Kaya et al.45. LL, LW and FFPR measurements were carried out according to the methodology proposed by the International Olive Oil Council (IOOC). Phenotypic data for 89 accessions from NCGR in Davis was downloaded from USDA-GRIN (Germplasm Resources Information Network system) website for four seasons from 2005 to 2008 (Available at www.ars-grin.gov/npgs and accessed May 2017). To show the statistical distribution of traits divided over geographical origin of accessions and year, box plots were generated using R93. Phenotypic data from different years was averaged and used for descriptive statistical analysis using the pastecs package in R.93.

To eliminate the effect of environment variation, the best linear unbiased prediction (BLUP) values of lines were calculated for each trait using mixed linear model (1):

$${\bf{Y}}={\rm{X}}{\bf{b}}+{\rm{W}}{\bf{p}}+{\rm{Z}}{\bf{a}}+{\rm{e}}$$
(1)

where X, W and Z are incidence matrices, b is the vector of fixed effect for country and p and a are vectors of random effects for genotype and year, respectively. The random effects and residual errors are assumed to be normally distributed independent of each other and each of them have covariance structure proportional to an identity matrix. We fitted the model using the lmer function from the lme4 R package94. The predictions for the random effects for GIDs (BLUPs) from this model were used as phenotypic data for the GWAS.

Variance components were extracted from the lmer output and broad-sense heritability (H2), based on clone means, was estimated according to Hallauer et al.95. The H2 for each trait was estimated by the Eq. (2) using variance components from lmer.

$${H}^{2}={\rm{VG}}/({\rm{VG}}+{\rm{VE}})$$
(2)

where VG and VE represent estimates of genetic and environmental variance, respectively. In addition, heritability value estimates were calculated for TOGR and NCGR panels, separately. Pearson correlations between traits were estimated from BLUPs using the “cor” function in R and distributions of BLUPs in the two locations were plotted in R.

DNA extraction and genotyping-by-sequencing

Genomic DNA was extracted by the CTAB method of Doyle96 with some minor modifications. The Chloroform: Isoamyl Alcohol (24:1) extraction step was applied twice to improve removal of phenolic compounds. DNA was quantified with Qubit dsDNA BR Assay Kit (Invitrogen) on a Qubit 2.0 Fluorometer (Invitrogen) according to the manufacturer’s instructions. DNA samples were diluted to 20 ng/μl and subsequently used for GBS library preparation.

The GBS libraries were constructed in 96-plex where each plate included a single random blank well as control. For choosing appropriate restriction enzymes for olive genotypes, two different restriction enzymes, EcoT22I (ATGCAT) and PstI (CTGCAG) (both 6-base cutters), were tested to make GBS libraries and EcoT22I was selected. PCR amplification was performed to generate the GBS libraries and DNA was sequenced on an Illumina HiSeq. 2000 (Illumina Inc., USA). GBS was carried out at the Institute of Genomic Diversity (Cornell University, Ithaca, NY, USA) as described by Elshire et al.24.

SNP calling, filtering and imputation

SNPs were identified using the TASSEL UNEAK (non-reference) GBS pipeline97 in the TASSEL 3 bioinformatics analysis package. SNPs were filtered to remove markers with more than 80% missing data and genotypes with more than 80% missing SNP calls in using R version 3.4.293. After filtering, the SNP dataset was converted to numeric coding (1, 0, −1) from nucleotide coding for statistical analysis in R version 3.4.293. The A.mat function from rrBLUP package in R was used to remove markers with minor allele frequency (MAF < 0.05) and impute the missing marker data based on expectation maximization (EM) algorithm.

Genetic diversity and population structure analysis

To assess genetic structure, we applied both model-based and distance-based approaches. The Admixture-based clustering model we applied was STRUCTURE v.2.3.498. Each simulation included 10,000 burn-in and 50,000 iterations. Ten independent runs were performed for each K value ranging from 1 to 10 with an admixture model and correlated allele frequencies. The optimal K was chosen based on Evanno’s methods99 using the STRUCTURE HARVESTER software100. To visualize the population structure, a bar plot was obtained with sort by Q option based on the optimum K value. Genotypes with membership probabilities higher than 0.7 were assigned to one of the subpopulations. Otherwise, they were considered to be admixed. The membership coefficient matrix (Q matrix) that shows the percentages of admixture of each accession given by the STRUCTURE software was used as cofactors in the association analyses. Calculation of pairwise genetic differentiation (Fst) between the groups from STRUCTURE was performed using VCFtools101. Principal component analysis (PCA) was also carried out to study the structure of the genotypes using the function prcomp in R93. First two principal components (PCs) were plotted using the ggplot2 R package102. To apply the distance-based approach, a Euclidean marker distance matrix was obtained using the dist function and a graphical representation of distance matrix was created using the heatmap function in R.

Estimation of linkage disequilibrium

The LD between marker pairs was calculated based on D’ and r2 using TASSEL. Permutation testing was applied to examine the significance level of LD between loci. Pairwise LD analysis between significant markers for the best GWAS model based on marker score correlations (r2) was performed separately on the FULL panel and was visualized as a heatmap using heatmap function in R.

Association analyses

Genome wide association analyses were performed using the SNP dataset consisting of 24,977 SNPs in FULL, TOGR and NCGR panels. For association analysis, three different models were tested for controlling population structure with the R package rrBLUP103. Mixed Linear Model (MLM) that accounts only for relative kinship (MLM_K model), MLM that accounts for both relative kinship and model-based population structure (MLM_Q + K) and MLM with first two PCs and K-matrix as correction for population structure (MLM_PCs + K) were compared. Q is the matrix of sub-population membership probabilities obtained from STRUCTURE98, K is the kinship matrix calculated using the A.mat function in the rrBLUP R package103 and the number of PCs was selected based on the scree plot of the variance explained by the first 10 PCs. We also examined the effect of including different numbers of PCs (first two PCs, first three PCs, first four PCs, first five PCs, first 10 PCs, first 15 PCs) as covariates in MLM_PCs + K model.

Correction for multiple testing was carried out using the false discovery rate (FDR) values according to the procedure by Benjamini and Hochberg46. Markers with FDR < 0.05 were considered significant. The proportion of phenotypic variance explained by each significant marker was estimated via R2 by fitting a regression between phenotypes and marker profiles using R93. The quantile–quantile (QQ) plots were used for selecting the best GWAS model. The QQ plots were produced using the R qqman package104 and Manhattan plots were visualized using R93.

Mapping the significant hit to the olive reference genome

The significant GWAS hits were mapped to their positions in the two recently published olive genomes. The first genome, Olea europaea var. sylvestris,41 includes ~1.1 Gb of sequence and is available at https://phytozome.jgi.doe.gov. Only ~573 Mb of this assembly is mapped to its chromosomal position (n = 23). The second olive genome, Olea europaea L. subsp. europaea var. europaea cv. ‘Farga’47 has a total length of 1.31 Gb which represents 95% of the genome’s estimated size (1.38 Gb). This genome has no chromosome assignments and is divided in more than 50k scaffolds with an N50 of 443Kb.

Briefly, the significant GWAS hits were linked to their 64mer sequence tag using the TOPM file produced by the UNEAK GBS calling pipeline97. A multi-fasta was created using the marker identifier and the 64mer sequence. We then used Blast +105 to map each GWAS hit with its most probable location in both genomes. The best blast hit was chosen based on the percentage of alignment and E-value.

Identification of candidate genes

To find candidate genes associated with significant SNPs, the Jbrowse feature of Phytozome v.12.1 (http://phytozome.jgi.doe.gov/pz/portal.html)106 was used to browse the wild olive reference genome41. Candidate genes were searched within 10 kb upstream and downstream of each significant SNP region in the genome browser.