Deciphering the mode of action and position of genetic variants impacting on egg number in broiler breeders

Aim of the present study was first to identify genetic variants associated with egg number (EN) in female broilers, second to describe the mode of their gene action (additive and/or dominant) and third to provide a list with implicated candidate genes for the trait. A number of 2586 female broilers genotyped with the high density (~ 600 k) SNP array and with records on EN (mean = 132.4 eggs, SD = 29.8 eggs) were used. Data were analyzed with application of additive and dominant multi-locus mixed models. A number of 7 additive, 4 dominant and 6 additive plus dominant marker-trait significant associations were detected. A total number of 57 positional candidate genes were detected within 50 kb downstream and upstream flanking regions of the 17 significant markers. Functional enrichment analysis pinpointed two genes (BHLHE40 and CRTC1) to be involved in the ‘entrainment of circadian clock by photoperiod’ biological process. Gene prioritization analysis of the positional candidate genes identified 10 top ranked genes (GDF15, BHLHE40, JUND, GDF3, COMP, ITPR1, ELF3, ELL, CRLF1 and IFI30). Seven prioritized genes (GDF15, BHLHE40, JUND, GDF3, COMP, ELF3, CRTC1) have documented functional relevance to reproduction, while two more prioritized genes (ITPR1 and ELL) are reported to be related to egg quality in chickens. Present results have shown that detailed exploration of phenotype-marker associations can disclose the mode of action of genetic variants and help in identifying causative genes associated with reproductive traits in the species.

estimated as high as 0.32, while respective estimates are in the range from 0.13 to 0.36 when using genomic relationship matrices [3,4]. The contribution of dominance may also be of importance for the trait, as estimates of the genomic dominant heritability has been found as high as 0.06 [3].
High-density SNP (single nucleotide polymorphism) genotyping arrays have greatly facilitated the detection of candidate causal variants in genome-wide association studies (GWAS) for various traits related to egg production and egg quality. Most GWAS have, so far, focused on the detection of additive SNPs for egg production [5][6][7] and egg quality traits [5,[7][8][9][10]. It is noted that these studies have been focusing on EN in layer chickens and not broiler breeders. Moreover, to our knowledge, there is only one study [7] that aimed at identifying dominant SNPs for egg production and quality traits in chickens.
Driven from the scarcity of published reports for broiler breeders, we elaborated the present study with the primary aim to detect genetic variants impacting on EN. Next, we sought to describe the mode of gene action of the significant genetic variants and finally attempted to provide a list with most likely candidate genes for the trait under investigation. Current findings are expected to contribute to a better understanding of the genetic mechanism(s) underlying the EN phenotype in the species.

Significant SNPs and PVE
Additive and dominant genomic heritability estimates were identical and equal to 0.167 (SE = 0.03) for the trait. The Q-Q plots (see Supplementary Fig. 1, Additional file 1) of the expected and observed SNP p-values along with estimates of the genomic inflation factors (λ = 0.95 and 0.97 for the respective additive and dominant genetic model) were indicative of no systematic bias due to population structure or analytical approach. Profiles of the SNP p-values (expressed as -log 10 ) for the additive and dominant genetic model are presented in form of circular Manhattan plots in Fig. 1. No SNP was found to reach genome-wide significance (p < 2.09E-07) using the Bonferroni correction method. Nevertheless, using the same correction method, a total number of 17 SNPs reached chromosome-wide significance across four autosomes (12, 22, 26 and 28) (Table 1). Specifically, one marker (rs313298834) was detected on GGA12 (threshold p = 0.05/7475 = 6.68896E-06), one (rs314011910) on GGA22 (threshold p = 0.05/1870 = 2.6738E-05), one  Table 1, 7 SNPs were associated with additive, 4 SNPs with dominant and 6 markers with both gene actions. Of the additive SNPs, one marker (rs313045367) resided on GGA26 while 6 were located on GGA28. One dominant SNP (rs314011910) was detected on GGA22 while 3 dominant SNPs (rs15250929, rs314052602 and rs318126353) were located on GGA28. Of markers displaying both gene actions, one marker (rs313298834) resided on GGA12 and 5 were located on GGA28 ( Table 1). Note that the 14 significant SNPs residing on GGA28 were co-localized in a region spanning 240,432 bp (3,818,934-4,059,366 bp) with high LD (r 2 ) levels. A detailed view of these SNPs along with LD (r 2 ) levels between markers is depicted in Fig. 2. As the LD heatmap shows, there are two haplotype blocks (r 2 > 0.70) formed by marker pairs rs15250929-rs16212041 and rs314418757-rs318126353 (Fig. 2). PVE by significant markers ranged from 0.70% (rs10724922, rs317783777) to 0.85% (rs314418757) for the additive markers and from 0.69% (rs314011910) to 0.84% (rs16212031, rs16212040, rs16212041) for the dominant markers (Table 1). All together, the significant additive and dominant SNPs explained a considerable part i.e. 60 and 47% of the additive and dominant genomic heritability, respectively. Nevertheless, as many of the significant markers were localized in nearby locations on GGA28, PVE by markers are biased upwards.

Estimation of degree of dominance
Application of the LASSO method on the 14 colocalized SNPs on GGA28 resulted in selection of two markers i.e. rs16212040 and rs318126353 each one residing per different LD block (Fig. 2). Of these, rs16212040 was associated with both gene actions while rs318126353 was associated only with dominant gene action. Two more SNPs i.e. rs313298834 (GGA12) and rs314011910 (GGA22) were detected as additive/dominant or dominant markers, respectively. Estimates of a, d and |d/a| for the four SNPs (rs16212040, rs318126353, rs313298834 and rs314011910) are shown in Table 2. In line with a purely dominant model where genotypic values are solely determined by the presence or absence of the dominant allele, genotypic means of the minor homozygous and minor heterozygous were found to significantly differ from the major homozygous genotypic means ( Table 2). Degree of dominance for the four SNPs ranged from 0.42-0.76 (partial dominance, markers: rs16212040, rs313298834 and rs318126353) to 1.1 (complete dominance, marker: rs314011910). Notably, no marker was associated with over-dominance. We furthermore sought to quantify the joint effect of the combined genotype of the two markers (rs16212040 and rs318126353) retained by LASSO on GGA28 by estimating LS means for the combined genotypes (Table 3). This exercise delivered interesting results as highest EN values were attained for AABB (μ = 138.8, n = 9) and ABAA (μ = 138.9, n = 71) that could not be attributed to additive allelic effects of individual markers. Specifically, the highest EN estimate for AABB is suggestive of additive-by-additive (AABB) interaction (epistasis) while that of ABAA of additive-by-dominance (ABAA) epistasis. However, due to limited number of observations,  especially for the AABB combined genotype (n = 9), current results should be treated with caution.

Functional enrichment analysis
A total number of 50 out of the 57 positional candidate genes were recognized by the DAVID tool and used for functional enrichment analysis. The latter analysis revealed the 'entrainment of circadian clock by photoperiod' (GO:0043153) as the only significantly (p = 0.028) enriched BP with two participating genes (CRTC1 and BHLHE40) (results not shown).

Prioritized genes
Results of PA are displayed on Table 4. A total number of 10 out of the 43 positional candidate genes were prioritized (overall p-value< 0.05) according to the semantic annotations imposed. The majority (n = 7) of the prioritized genes resided on GGA28, followed by two genes (BHLHE40 and ITPR1) on GGA12 and one (ELF3) on GGA26. On GGA28, the first ranked gene was GDF15, followed by JUND, GDF3, COMP, ELL, CRLF1 and IFI30. Notably, two highly ranked genes i.e. GDF15 (1st) and GDF3 (4th) belong to the transforming growth factor beta (TGF-β) superfamily. The two genes (BHLHE40 and CRTC1) that participated in GO:0043153 'entrainment of circadian clock by photoperiod' were also prioritized and ranked 2nd and 13th, respectively.

Mode of gene action
This is the first GWAS enlisting a significant number of animals (n~2600) and reporting on genetic variants implicated in the genetic control of EN in broiler breeders. Present results have demonstrated the need to thoroughly exploring the applicability of all possible genetic models when conducting a GWAS. This is particularly important when analyzing quantitative traits such as EN where not only additive but also non-additive e.g. dominant gene action of the causative loci may be fairly anticipated [3]. In line with this expectation, 4 of the 17 significant variants were dominant while 6 more were additive and dominant associations. The latter seems to be a controversial finding, but it can be fairly explained by examining the genotypic means across the examined variants of Table 2. A 'complete dominant' genetic model is when |d| = |a| meaning equal genotypic values for the minor homozygous (μ ΑΑ ) and the minor heterozygous genοtypes (μ ΑΒ ) that both differ from the major homozygous genotypic mean (μ ΒΒ ). This was exactly the case for marker rs314011910 that was detected only as dominant variant. But what happens in the case of partial dominance (0 < |d| < |a|)? In such cases (see markers rs313298834 and rs16212040 in Table 2) all genotypic means differ (μ ΑΑ ≠ μ ΑΒ, μ ΑΑ ≠ μ BΒ and μ ΑB ≠ μ BΒ ) meaning that apart from the dominant model, a linear relationship between the genotypic mean values and the number of copies of the minor allele i.e. the additive genetic model may also be applicable. For an excellent interpretation of how least squares regression performs in GWAS in additive and dominant models we refer to Huang and Mackay [12]. So far, we have discussed the applicability of the additive and dominant model, but we have neglected the case of over-dominance (|d| > |a|). In the latter case, μ ΑB > μ ΑA and μ ΑB > μ BB implying the need of using a different model parameterization by coding the heterozygous genotypes as 1 and the two homozygous genotypes as 0. Due to model parameterization difficulties we could not explore the validity of an overdominant genetic model here and this may be the reason why no marker has been associated with overdominance in the current study. While estimates of genetic effects (additive and/or dominant) are expected unbiased for a few, independent variants, this may not be the case for multiple, highly correlated variants residing on the same haplotype block(s) since the effect(s) may be 'shared' by many SNPs. For this reason, it is important to have a parsimonious model involving limited number of regressors (SNPs). To this end, application of the LASSO technique has proved particularly helpful as it has selected only two markers, each one residing in the two LD blocks on GGA28. Then, the next step was to explore whether the two variants interact and, if yes, to portray the exact type of interaction. This exploration has delivered interesting results since non-additive genetic interaction(s) between the two variants could also be detected. Although these findings are based on limited number of observations, they are indicative of potential importance of epistasis in the inheritance of the trait.

Functional candidate genes
Another intriguing problem that needed to be addressed in the present study was as how to narrow down the list with the 43 positional candidate genes. This post-GWAS step presents an important problem, because the experimental validation of the true causal genetic variants requires considerable costs, effort and time. To address this issue, we performed in silico prioritization analysis (PA) using explicitly two semantic annotations: GO: BP and co-expression. This approach was based on the assumption that co-expressed genes tend to be involved in the same biological process and that expression of functionally related genes should vary concordantly across the various tissues. Typically, gene co-expression networks do not provide information about causality, but they can serve as first proof of their involvement in a particular biological process [13] and can be effectively used for the identification of regulatory genes underlying phenotypes [14]. Following this approach, 10 prioritized genes (GDF15, BHLHE40, JUND, GDF3, COMP, ITPR1, ELF3, ELL, CRLF1 and IFI30) with interesting biological properties were highlighted. Genes GDF15 (growth differentiation factor 15, placed 1st) and GDF3 (growth differentiation factor 3, placed 4th) serve as good examples here since they both belong to the TGF-β superfamily genes. In rodents and humans, many factors belonging to the TGF-β superfamily are expressed by ovarian somatic cells and oocytes in a developmental manner and function as intraovarian regulators of folliculogenesis [15]. In humans, GDF15 is involved in placentation [16], while GDF3 might affect folliculogenesis by inhibiting the bone morphogenetic protein cytokines [17]. In chickens, GDF3 (also known as cVg1) is expressed at the early blastoderm stages of embryonic development [18] while another TGF-β member i.e. GDF9 is expressed in the ovary and functions on hen granulosa cell proliferation as in mammals [19]. Expression of BHLHE40 (basic helix-loop-helix family member e40) in the mouse ovary leads to a circadian gating of cellular processes in the ovary as well as in the hypothalamus during ovulation [20]. JUND (JunD proto-oncogene, AP-1 transcription factor subunit) is important for maturation of human ovarian cells [21]. COMP (cartilage oligomeric matrix protein) is involved in ovarian follicle development in mice [22] while mutations of COMP gene affect chondrogenesis in chickens [23]. ITPR1 (inositol 1,4,5-trisphosphate receptor type 1) is involved in the Ca 2+ transport for supplying eggshell mineral precursors in chicken uterus [24,25] while ELF3 (E74 like ETS transcription factor 3) has been related to the development of chicken oviducts [26] and ELL (elongation factor for RNA polymerase II) has been associated with yolk weight [27] in chickens. Notably, the final two nominated candidates i.e. CRLF1 (cytokine receptor like factor 1) and IFI30 (IFI30, lysosomal thiol reductase) had no documented involvement in reproduction. Such a finding underscores the limitations of in silico PA. In almost every guilt-by-association (GBA)-based prioritization tool, functional annotations of genes refer mainly to human and mouse PPINs (protein-protein interaction networks) [28] neglecting relevant information on livestock species [29] such as that examined here. One more limitation of GBA-based networks relates to their degraded predictive performance for genes with unknown or multiple functions [28].
Of particular interest in this study were genes BHLHE40 and CRTC1 (CREB regulated transcription coactivator 1). Both genes were enriched in the BP of 'entrainment of circadian clock by photoperiod' raising the intriguing question as what might be the exact mechanism of their implication in egg production. To answer the question, first we have to provide a short description of the molecular mechanism underlying circadian rhythms (CR). CR are controlled by a pacemaker located within the suprachiasmatic nucleus of the hypothalamus that is entrained to the external light-dark cycle via light input from the retina conveyed via the retinohypothalamic tract [30]. In hens, as in many avian species, exposure to photoperiods of longer than 11.5 h/day causes development and growth of testes and ovarian follicles via rapid induction of the hypothalamo-hypophysialgonad axis [31]. At the intracellular level, four clockgene families are reported to be involved in a transcription-translation feedback loop that generates the CR. Gene products of Clock and Bmal1 act as positive components, whereas those of the Per and Cry genes act as negative ones [32]. With regard to our candidate genes, BHLHE40 (also known as BHLHB2) acts as a suppressor of Clock and Bmal1 genes [33] while an entrainment stimulus causes CRTC1 to induce expression of Per1 and Sik1 [34]. As the molecular bases for circadian clocks are highly conserved across species, it is very likely that the avian molecular mechanisms are similar to those expressed in mammals, including humans [31].
In total, 7 (GDF15, BHLHE40, JUND, GDF3, COMP, ELF3 and CRTC1) of the prioritized genes were associated with reproductive traits while 2 (ITPR1 and ELL) were related to egg quality traits. From the above, only 3 genes i.e. COMP, ELL and CRTC1 included significant SNPs. We finally, compared our candidate genes list (Supplementary Table 1, Additional file 2) to a compiled gene list including 271 genes (Supplementary Table 2, Additional file 3) identified in previous GWASs for chicken egg and reproductive traits. This comparison highlighted two common genes i.e. ELL and ARL8A. Note that the first is among the prioritized candidates (ranked 8th) while the second ARL8A (ADP ribosylation factor like GTPase 8A) has been associated with eggshell thickness and eggshell formation [5] in chickens.

Conclusions
Current results have shown that apart from the additive also the dominant genetic model was of importance for EN in broilers. These results underline the need to thoroughly exploring the applicability of all possible genetic models when performing GWAS for a trait such as that examined here. Detailed follow-up studies are warranted to verify whether the identified genomic markers and the associated candidate genes present true causal genetic entities impacting on the trait. Such studies would entail targeted re-sequencing and molecular characterization of the candidate variants to facilitate the identification of true causal variants.

Data
Genotypic and phenotypic records for 2992 female broiler breeders from a purebred line were provided by Aviagen Ltd. EN records (28 to 50 weeks of age) ranged from 26 to 196 eggs per hen with an average of 132.4 (SD = 29.8). The 600 k Affymetrix® Axiom® high density genotyping array [35] was employed for animal genotyping with a total number of 544,927 SNPs available, dispersed in 29 autosomes (GGA1-28 and GGA33). Quality control (QC) assessment was performed at both sample and marker level. At a sample level, 406 animals were excluded when they had a call rate lower than 0.99 and an autosomal heterozygosity outside the 1.5 interquartile range (IQR: 0.013). At a marker level, a number of 305,660 SNPs were removed, based on: call rate (lower than 0.95), minor allele frequency MAF (lower than 0.05) and linkage disequilibrium (LD) pruning (r 2 values greater than 0.99 within windows of 1 Mb intermarker distances). Finally, a total of 2586 samples and 239,267 SNPs across 28 autosomes (GGA1-28) were retained for further analyses. QC was performed using the SNP & Variation Suite software (version 8.8.3).

Marker-trait association analysis
Stepwise regression with forward inclusion and backward elimination of multiple markers (SNPs) [36] was applied to identify SNPs associated with the trait, assuming first an additive and second a dominant gene action for the SNP effects.
Specifically, the following mixed model was used for EN data: where y is the n × 1 vector of phenotypic values of EN for n female broilers, X is the n × 53 matrix of fixed effects: hatch (36 classes) and mating group (17 classes), β is the 53 × 1 vector of corresponding coefficients of fixed effects, a is the fixed effect for the minor allele of the candidate SNP to be tested for association, w is the incidence vector relating observations to SNP effects with elements coded as 0 for the major homozygous genotype, 1 for the heterozygote genotype and 2 for the minor homozygous genotype (additive genetic model) and 0 for the major homozygous genotype and 1 for the heterozygous and minor homozygous genotypes (dominant genetic model). Z is the incidence matrix relating observations to the polygenic random effects, u is the vector of polygenic random effects and e is the vector of random residuals.
The random effects were assumed to be normally distributed with zero means and the following (co)variance structure: where σ 2 u and σ 2 e are the polygenic and error variance components, I is the n x n identity matrix, and G is the n x n genomic relationship matrix (GRM [37]) with elements of pairwise relationship coefficient using the 239,267 SNPs. Τhe genomic relationship coefficient between two individuals j and k, was estimated as follows: where x ij and x ik represent the number (0, 1, 2 in the additive model and 0, 1, 1 in the dominant model) of the minor allele of the i th SNP of the j th and k th individuals, and p i is the frequency of the minor allele [37]. Statistically significant SNPs per genetic model were selected at the optimal step of the stepwise regression according to the extended Bayesian Information Criterion (eBIC [38]). SNP p-values were subsequently corrected for multiple comparisons using the Bonferroni correction method. A SNP was considered as significant at the genome-wide level when its pvalue was lower than the threshold value 2.09E-07 (0.05/239, 267) while a chromosome-wide significant SNP had a p-value lower than 0.05/N, where N is the number of SNPs on a given chromosome. All analyses were performed using the SNP & Variation Suite software (version 8.8.3). SNP positions were based on GRCg6a assembly (https://www.ncbi.nlm.nih. gov/assembly/GCF_000002315.6 [39], https://www.ncbi.nlm. nih.gov/genome/annotation_euk/Gallus_gallus/104/ [40]).

Quantile-quantile plots and genomic inflation factors
To characterize the extent to which the observed distribution of the test statistic followed the expected (null) distribution, quantile-quantile (Q-Q) plots were used. These plots in combination with estimates of the genomic inflation factor (λ) were used to assess potential systematic bias arising from population structure or the analytical approach [41]. Estimation of λ was performed using the SNP & Variation Suite (version 8.8.3).

Estimation of genomic heritability and proportion of variance explained
Estimation of the genomic heritability was implemented via the realized GRM of 2586 animals derived from the 239,267 SNPs.
The proportion of variance explained by a SNP k (PVE k ) was also calculated as follows: where mrss h0 is the Mahalonobis root sum of squares (mrss) for the null hypothesis and mrss k is the same for marker k. All above estimations were performed using the SNP & Variation Suite software (version 8.8.3).

Identification of significant SNPs under multicollinearity conditions
When multiple markers were present in a specific genomic region, a variable selection method i.e. the Least Absolute Shrinkage and Selection Operator (LASSO) [42] as implemented in procedure GLMSELECT in SAS 9.3 (2012) was applied to identify the most representative markers in the area.

Estimation of the degree of dominance
Significant SNPs associated with dominant or dominant and additive gene action(s) were further analysed toward estimation of additive allelic effects, dominance deviation and degree of dominance. This analysis was based on estimates of genotype least squares (LS) means by application of a mixed model to the EN data fitting hatch, mating group and the marker as fixed effects and the animal as a random effect. Degrees of freedom were estimated using the Satterthwaite method while the Tukey-Kramer method was used to adjust the p-values because of multiple means comparisons. Results of the mixed model analysis are presented as LS means (μ) with standard errors (SE). Additive allelic effect (a) was defined as half the difference between LS means of the two homozygous genotypes, using the minor homozygous genotypes as reference. Dominance deviation (d) was the heterozygous genotype LS mean minus the average of the two homozygous genotype LS means. Finally, degree of dominance was determined as |d/a|, where additive = 0-0.20, partial dominance = 0.21-0.80, complete dominance = 0.81-1.20 and over-dominance> 1.20 [43,44]. This analysis was performed by the MIXED procedure in SAS 9.3 (2012).

Detection, functional characterization and prioritization of positional candidate genes
We searched within 50 kb downstream and upstream flanking regions of each significant marker for positional candidate genes using the NCBI database (https://www.ncbi.nlm.nih. gov/gene/ [45]) and the GRCg6a assembly (https://www.ncbi. nlm.nih.gov/assembly/GCF_000002315.6 [39], https://www. ncbi.nlm.nih.gov/genome/annotation_euk/Gallus_gallus/104/ [40]). Subsequently, the total number of positional candidate genes was subjected to the following analyses: Gene Ontology (GO) Biological Process (BP) enrichment analysis and gene prioritization analysis (PA). GO enrichment analysis for BP was carried out using the DAVID functional annotation tool (https://david. ncifcrf.gov/, version 6.8) [46] and the Gallus gallus species for the input gene list and as genome background. During enrichment analysis, the following settings were used: EASE score (modified Fisher's exact p-value [47]) cutoff = 0.05 and minimum number of genes per GO BP term = 2. GO BPs with p-values lower than 0.05 were considered as significantly enriched.
Gene prioritization analysis (PA) of the positional candidate genes followed, using the ToppGene portal (https://toppgene.cchmc.org/prioritization.jsp [48]). PA was based on the functional similarity of the positional candidate genes (test genes) to a training gene list including a total number of 31 genes (Supplementary  Table 3, Additional file 4). The latter genes were retrieved from the NCBI database (https://www.ncbi.nlm. nih.gov/gene/ [45]) using the search terms 'reproduction' and 'egg production' in Gallus gallus. Candidate gene prioritization is based on fuzzy functional similarity measures between any two genes and specific semantic annotations imposed. In our study two semantic annotations: 'GO: Biological Process' and 'Coexpression' were used. A p-value for each annotation of a test gene was derived by random sampling of 5000 genes from the whole genome. The partial p-values were finally combined into an overall p-value using the probability density function. For gene prioritization, there were 30 training genes (ZNF764L was omitted) and 43 test genes (positional candidate genes). Not all of the 57 positional candidate genes were included in the analysis because the human homologs could not be found for all of them, especially for LOC genes (n = 14). Genes with an overall p-value lower than 0.05 were considered as prioritized.