Population size in QTL detection using quantile regression in genome-wide association studies

The aim of this study was to evaluate the performance of Quantile Regression (QR) in Genome-Wide Association Studies (GWAS) regarding the ability to detect QTLs (Quantitative Trait Locus) associated with phenotypic traits of interest, considering different population sizes. For this, simulated data was used, with traits of different levels of heritability (0.30 and 0.50), and controlled by 3 and 100 QTLs. Populations of 1,000 to 200 individuals were defined, with a random reduction of 100 individuals for each population. The power of detection of QTLs and the false positive rate were obtained by means of QR considering three different quantiles (0.10, 0.50 and 0.90) and also by means of the General Linear Model (GLM). In general, it was observed that the QR models showed greater power of detection of QTLs in all scenarios evaluated and a relatively low false positive rate in scenarios with a greater number of individuals. The models with the highest detection power of true QTLs at the extreme quantils (0.10 and 0.90) were the ones with the highest detection power of true QTLs. In contrast, the analysis based on the GLM detected few (scenarios with larger population size) or no QTLs in the evaluated scenarios. In the scenarios with low heritability, QR obtained a high detection power. Thus, it was verified that the use of QR in GWAS is effective, allowing the detection of QTLs associated with traits of interest even in scenarios with few genotyped and phenotyped individuals.

The world's population reached 7.7 billion inhabitants in 2019 and may reach 9.7 billion by 2050 1 . To the increase in population is added the growing concern about environmental impacts and the limitations of arable areas, which culminates in the demand for increased productivity of agronomic species 2 . In recent years, it is estimated that about 50% of the increase in productivity of several species was driven by genetic breeding, which has been seeking new strategies to obtain more adapted, resistant, and productive cultivars 3,4 .
In GWAS, a classic and widely used statistical method is single markers regression. This method estimates the individual effect of each marker on the phenotype of interest, and, subsequently, multiple hypothesis tests are performed in order to detect which marker effects are statistically significant 27 . When the correction for population structure is added to the single markers regression model, this model is called General Linear Model (GLM) 28 . However, the estimation of parameters via single markers and GLM are based on conditional means, which may be inadequate when the errors do not follow a normal distribution 29 and in the presence of heteroscedasticity. An alternative and still little explored methodology for GWAS studies is Quantile Regression (QR) 30 . This methodology, unlike methods based on means, allows adjusting regression models for different levels (quantiles) of the distribution of the phenotype of interest, does not require assumptions about the error distribution, and is robust to discrepant points 31 . QR has already been successfully applied in GWAS studies on real data by 32 for traits related to the flowering time of common beans. These authors evaluated 80 common bean genotypes and Genome and simulated populations. An advanced generation composite was obtained from two random mating populations in linkage equilibrium, which were crossed to generate a population of 5,000 elements from 100 families using linkage disequilibrium (LD), subjected to five generations of random mating without mutation, selection, or migration.
From the advanced generation of the composite, 1000 individuals from the same generation and from 20 families of full siblings, each consisting of 50 individuals, were simulated. The simulated genome was composed of ten chromosomes with a size of 200 centimorgans (cM) each and comprised 2000 bi-allelic single nucleotide polymorphisms (SNPs) separated by 0.1 cM across the ten chromosomes. The LD value in a composite popula- , where a and b are two SNPs, two QTLs, or one SNP and one QTL, θ is the frequency of recombinant gametes, and p 1 and p 2 are the allele frequencies in the parental populations (1 and 2). The LD value depends on the allele frequencies in the parental populations. Thus, regardless of the distance between the SNPs and/or QTLs, if the allele frequencies are equal in the parental population, Δ = 0. The LD is maximized (|�| = 0.25) when θ = 0 and p 1 − p 2 = 1 . In this case, the LD value is positive with coupling and negative with repulsion 36 .
Simulation of traits and the phenotypic values. Two genetic architectures were simulated, representing different scenarios, with heritabilities of 0.30 and 0.50 and with 100 and 3 numbers of quantitative trait loci (QTLs), distributed randomly in the regions covered by the SNPs. The first scenario follows the infinitesimal model and the other (second scenario) with three major effects genes accounting for 50% of the genetic variability. For the former, to each of 100 QTLs one additive effect of small magnitude on the phenotype was assigned (under the Normal Distribution setting). For the latter, small additive effects were assigned to the remaining 97 loci. The effects were normally distributed with zero mean and variance, allowing the desired heritability level. The phenotypic value was obtained by adding to the genotypic value a random deviate from a normal distribution N 0, σ 2 e , where the variance σ 2 e was defined according to two levels of broad-sense heritability, 0.30 and 0.50.
The data set was simulated using the Real Breeding program 37 . More information can be found detailed in 38 . Subsequently, in order to evaluate the effect of population size reduction, populations were defined with numbers of individuals ranging from 1,000 to 200 individuals. According to 39 , 200 individuals are considered as being sufficient for the construction of reasonably accurate genetic maps. A random reduction of 100 individuals was defined in each scenario, respecting the proportionality of individuals removed from each family. Thus, in all, thirty-six distinct scenarios were evaluated. These scenarios correspond to the combination of two levels of heritability, two genetic architectures, and nine variations in population size.
Linkage disequilibrium. A linkage disequilibrium (LD) analysis was performed to determine the markers associated with QTLs. Specifically, the LD decay pattern between marker pairs across the genome was obtained using a figure in which the square values of the correlation coefficient r 2 were plotted against the genetic distance between markers (in cM). Subsequently, a local polynomial regression (LOESS) [40][41][42] was fitted to the data and a horizontal straight line was plotted with a critical value of r 2 = 0.20 43,44 . The window distance, defined as the intersection of the fitted LOESS curve and the horizontal straight line, will be used to determine which markers are associated with QTLs. Thus, all markers that distance the value of the window obtained (depending on the scenario evaluated) in relation to each QTL are considered as markers associated with the QTLs. The square of the correlation coefficient r 2 was estimated using the LD.decay function of the sommer package 45 and the fit of the polynomial regression model using the loess function, both from the R software 46 .
Genome-wide association study. To perform the genome-wide association analysis, first, the correction for population structure was performed through principal component analysis (PCA) of the genomic relatedness matrix (G) 20 www.nature.com/scientificreports/ software 49 , selecting 300 markers in linkage equilibrium, aiming to ensure that these markers are not associated. A cluster number (K) ranging from 1 to 21 was tested, with ten independent replicates for each K value. In order to identify the optimal number of K, 10,000 iterations were run, with 1,000 burn-in. Then, the ∆K index 50 implemented in Structure Harvester software 51 was calculated to determine the choice of the most likely value of K. Subsequently, the K first principal components (CP) were used as fixed effect covariates in the GWAS model. The GWAS model was defined by: where Y is the vector of phenotypic information; μ is the population mean; α j is the effect of the j-th marker considered as fixed, j = 1, . . . , 2000 ; SNP j is the incidence vector of the j-th SNP marker; β k is the fixed effect of the k-th principal component, adjusted as a covariate; CP k is the vector of the k-th principal component; ε is the vector of random errors. The vector θ = µ, α j , β 1 , ..., β k ′ represents the unknown parameters, being estimated by means of QR and the GLM.
The methods estimate the individual effect of each marker on the phenotype of interest and then perform multiple hypothesis tests in order to detect which marker effects are statistically significant. The parameters were estimated via QR for different levels (quantiles) of the distribution of the phenotype of interest 30,32 . This methodology consists of estimating the parameters at the τ quantile by solving the following optimization problem: where τ ∈ (0, 1) indicating the quantile of interest, N indicates the population size evaluated, and ρ τ (·), denoted check function by 30 , is defined by: In this study, three quantiles (τ = 0.10, 0.50 and 0.90) were evaluated. For model fitting, the rq function from the quantreg package 52 of the R software was used. The individual coefficients (effects) of each marker are estimated by summing the weighted absolute errors. For estimation, it is necessary to use linear programming algorithms. One of the methods used is the Simplex Method 53 .
The parameters were also estimated using GLM. This methodology consists of estimating the parameters in average terms and solving the following optimization problem: For model fitting, the individual coefficients (effects) of each marker were estimated by minimizing the sum of squared errors by the ordinary least squares method using the GAPIT R package 54 of the R software 46 .
Hypothesis testing. After estimating the effects of individual markers through QR and GLM, multiple t-student tests were performed according to the methodology used, in order to analyze the existence of significant associations between the marker and the phenotype of interest. In the general linear model, the standard error estimate used was the usual, while in the quantile regression it was based on rank 53,55,56 . However, due to the high density of markers, performing multiple tests can lead to an increase in false positive associations 27 . An alternative to controlling this rate is the False Discovery Rate (FDR) 57,58 . One way to consider the FDR in hypothesis testing is through a correction in the p-value associated with the test, called the q-value 59 . In this study, a significance level of 0.01 ( α = 1% ) corrected by the FDR was used.
Comparison between methodologies. In order to evaluate the efficiency of the analyzed methodologies, the QTL detection power and the false positive rate were calculated and defined below: i) The power of QTL detection corresponds to the proportion of pre-established windows (intervals) (by means of LD analysis) that contain at least one marker considered significant by means of the statistical methods evaluated. ii) The false positive rate corresponds to the ratio between the number of markers that were significant by the evaluated statistical methods and are not associated with QTLs and the number of markers that are not associated with QTLs.

Results and discussion
Population structure. According to the method of 50 , ∆K was plotted against the number of clusters (k).
The maximum value of ∆K occurred at K = 19 and K = 18 for the scenarios of 3 QTLs and 100 QTLs, respectively (Fig. 1). Thus, 19 and 18 principal components were used as covariates in the GWAS analyses. According to the principal component analysis, 19 and 18 PCs accounted for explanation percentages of the variance present in the genotypic data between 85 and 96%, depending on the scenario evaluated. This result is in agreement with the simulated data of this study, where populations were simulated from 20 full sib families.
Linkage disequilibrium. The LD was calculated for all marker pairs in the same linkage group by means of r 2 . Figures 2 and 3 graphically represent the decay of LD as a function of genetic distance according to the  (Fig. 3).
After obtaining these values, it was determined that all markers that are less than the distances mentioned above (depending on the scenario evaluated) from each QTL are considered as markers associated with the QTLs.
Genome-wide association. The general linear model obtained a low power of detection of QTLs in all scenarios evaluated (Table 1). In the scenarios with 3 QTLs, regardless of heritability and population size, this methodology showed power values equal to or less than 0.03 (Table 1). In the scenarios with 100 QTLs with 1000 individuals and a heritability of 0.30, the GLM obtained a power of detection on average of 0.21 ± 0.07 and with heritability 0.50, the power of detection was on average 0.56 ± 0.09. As the population size was reduced, the detection power was reduced until it reached zero in all scenarios evaluated (Table 1). This result was already expected and can be corroborated by several studies in the literature. For example, in the study by 60 , in which the authors evaluated the effect of population size in GWAS, considering data from barley germplasm. In this study, the authors used a base population consisting of 766 individuals, and population size reduction was achieved by random resampling without replacement, forming populations with 96, 192, 288, 384, 480, 576, and 672 individuals, and observed that the detection power of QTLs decreased according to population size reduction 61 . Also evaluated the power of GWAS to identify true significant associations using simulated Arabidopsis data set with 200, 400, and 800 individuals. As a result, the authors observed that the power of identifying true associations decreased as the number of individuals decreased. In addition to these, 62 evaluated the influence of sample size in GWAS using simulated data from a Chinese soybean germplasm population consisting of 200, 400, 600, and 800 individuals randomly sampled from an ideal base population. As a result, the authors observed that the detection power of true significant associations decreased, and the false positive rate increased with decreasing sample size. Furthermore, according to 63 and 64 , the efficiency of GWAS requires large population sizes.
However, the pattern reported by the authors mentioned above and those observed here for the GLM was not observed when using the QR models. In general, the QR, in all scenarios evaluated, obtained high detection power (Table 1). Additionally, unlike the results obtained using GLM, the detection power of QTLs did not reduce with the decrease in population size (Table 1). This result may be related to the way in which the standard error is calculated by the two methodologies. In the GLM, the standard error estimate used was the usual one, while in the QR it was based on the rank statistic. The rank statistic is greatly influenced by the sample size 53,55 . Thus, the statistic of the test used generally presents higher values and, therefore, a greater number of QTLs being considered significant.
In scenarios with 3 QTLs, at quantiles of 0.10 and 0.90, regardless of heritability and population size variation, QR detected almost all simulated QTLs ( Table 1). As for the scenarios with 100 QTLs, QR at the extreme quantiles (τ = 0.10 and 0.90) obtained higher or equal QTL detection power when compared to QR (τ = 0.50) ( Table 1). In terms of population size, independent of heritability and quantile evaluated, QR detected all QTLs of interest considering population sizes equal to that of 200 and 300 individuals to QR (Table 1).
In general, the use of QR obtained a high QTL detection power independent of the population size, and especially in the extreme quantiles. This result is reasonable since QR uses the same idea of sampling for extremes 65 . Sampling extreme phenotypes samples individuals at the extremes in the hope that rare causal variants will be enriched among them 32 . However, unlike the extreme phenotype sampling approach, the use of QR does not require any assumptions about the distributions of traits, is robust to outliers, and uses all individuals in the   www.nature.com/scientificreports/ end of flowering-DFF). As a result, the authors found no significant associations using GLM. On the other hand, when using QR at the 0.10 quantile, one and six significant SNPs were found for DPF and DTF, respectively. Although the work of 66 and 67 was not conducted in the context of genome-wide association, the authors also evaluated the performance of QR on simulated data set with small population sizes and concluded that QR is a robust technique in these situations. This result is very promising in breeding programs that have a reduced number of available genotypes.
Regarding the rate of false positives, we have found that the GLM, in all scenarios evaluated, presented low values for this rate. This result may be related to the low detection power of QTls by this methodology ( Table 2). The false positive rate obtained by the QR methodology is relatively low in the scenarios with a higher number of individuals. QR (τ = 0.50) was the methodology that presented lower false positive rates. In scenarios where the QR detection power in the three quantiles evaluated was equal, the QR (τ = 0.50) showed better results than in the extreme quantiles QR (τ = 0.10 and 0.90) since the false positive rate was lower ( Table 2). Regarding the reduction in the number of individuals, the false positive rate increased substantially according to the reduction in population size, a result that may be related to the observed increase in the number of QTLs detected in these scenarios.
Finally, it was observed that the decrease in the heritability of the trait implies a lower power of detection of QTLs when using the GLM in all scenarios evaluated (Table 1). This result is similar to that found by 62 , in which the authors compared the detection power of true significant associations using five GWAS methods. This was done using simulated data from a Chinese soybean germplasm population with different levels of heritability (h 2 = 0.20, 0.50 and 0.90) and two genetic architectures with 10 and 100 QTls. As a result, the authors observed that the detection power was dramatically reduced for all methods and scenarios evaluated when the heritability of the trait was reduced. On the other hand, this behavior was not observed when using the QR methodology. The QR obtained greater or equal powers of detection of true significant associations in scenarios with lower heritability (h 2 = 0.30) regardless of the number of QTLs and sample size (Table 1). This result is interesting since it indicates that QR is an interesting methodology for GWAS studies in both low and moderate heritability scenarios.
Overall, these results indicate that using quantile regression to perform GWAS in the identification of QTLs is an interesting approach. QR proved to be efficient both in scenarios with many individuals and in scenarios with a reduced population size. Additionally, this methodology also proved to be interesting for GWAS studies in which the traits have low and moderate heritabilities.

Conclusion
The use of Quantile Regression models in genomic association studies on simulated data proved to be effective. Since its use, it allows a high power of detection of QTLs in all the scenarios analyzed in relation to the GLM. In scenarios with larger population sizes, the QR in the extreme quantiles (τ = 0.1 and 0.9) were the most efficient models in the simulated conditions because they were the ones that obtained the highest QTL detection powers. In the scenario where the detection power of the QR in the three evaluated quantiles was equal, the QR (0.50) was more efficient, as the false positive rate was lower. In the low heritability scenarios, QR obtained a high detection power of QTLs. The false positive rate obtained by the QR methodology in the scenarios with many individuals is relatively low. QR proved to be efficient both in scenarios with many individuals and in scenarios with a small population size.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.