Accelerated discovery of functional genomic variation in pigs

The genotype-phenotype link is a major research topic in the life sciences but remains highly complex to disentangle. Part of the complexity arises from the number of genes contributing to the observed phenotype. Despite the vast increase of molecular data, pinpointing the causal variant underlying a phenotype of interest is still challenging. In this study, we present an approach to map causal variation and molecular pathways underlying important phenotypes in pigs. We prioritize variation by utilizing and integrating predicted variant impact scores (pCADD), functional genomic information, and associated phenotypes in other mammalian spe-cies. We demonstrate the efficacy of our approach by reporting known and novel causal variants, of which many affect non-coding sequences. Our approach allows the disentangling of the biology behind important phenotypes by accelerating the discovery of novel causal variants and molecular mechanisms affecting important phenotypes in pigs. This information on molecular mechanisms could be applicable in other mammalian species, including humans.


Background
Closing the gap between genotype and phenotype is a major goal in the life sciences, but remains extremely challenging [28]. Part of the complexity stems from phenotypes being influenced by many genes. Genome-wide association studies (GWAS) have been instrumental in statistically linking genotypes and phenotypes. These studies, resulting in identification of quantitative trait loci (QTL), have resulted in better understanding of the genomic architecture of complex traits [65]. However, the resolution of GWAS is limited since it only requires correlation to phenotypes by neighbouring markers in linkage disequilibrium (LD) Hence, unravelling the molecular drivers (causal variants) underlying phenotypes of interest requires further fine mapping [23]. The majority of causal variants are expected to reside in the noncoding regions of the genome, in particular in transcriptional regulatory regions [59], which can be very difficult to predict.
In human genetics, a combination of statistical fine-mapping methods and expression QTL (eQTL) studies are used to decrease the number of candidate genes and causal variants [8]. Further, functional annotation, facilitated by large consortium efforts including the Encyclopaedia of DNA Elements (ENCODE, [18]), can be applied to prioritize variants based on the likelihood of the variant(s) affecting gene expression. Despite this effort, identifying causal variants remains difficult, partly because of the fundamental complexity of phenotypegenotype relations, in which the environment also plays an important role.
In livestock, economically important phenotypes are typically determined by many genes, each explaining a small fraction of the phenotypic variation. However, for many traits it is now known that QTLs exist that explain more than 1% of the variation. For these relatively large QTLs it is of interest to identify the underlying causal variation to better understand the biology of traits. Due to intense selection, the effective population size (N e ) of most livestock populations is small [33]. This often leads to a high degree of LD. This results in QTL comprising up to millions of base pairs (Mb) in length, especially in regions with low recombination rates [77]. High LD yields an additional layer of complexity to fine-map GWAS results in livestock populations, and the use of crossbreeding to break down the LD is a costly, labourintensive and time-consuming procedure to fine map the QTL. In contrast, livestock populations are less confounded by population stratification (i.e. ancestry differences between cases and controls), which can be a major problem in human GWAS studies [35].
Similar to human medical studies, further functional genomic information can help to prioritize the variants underlying the phenotypes of interest in livestock [63]. In pigs, the level of functional genome information, in contrast to human genome information, is limited. Fortunately, recent advances have been achieved in pigs by developing the pig Combined Annotation-Dependent Depletion (pCADD) tool [31], providing impact scores of all possible single-nucleotide substitutions in the pig genome. The CADD tool was originally developed to score variants with respect to their putative deleteriousness to prioritize potentially causal variants in human genetic studies [61]. This tool is frequently used to score variants in human GWAS studies [8]. Subsequently, other species-specific CADD tools were developed [30]. This tool scores the deleteriousness (or functional impact) of single nucleotide variants (SNPs) and it is built on a number of layers of annotations including sequence context, conservation scores, gene expression data, non-synonymous mutation scores, and epigenomic data, if available for the investigated species. The pCADD scores are the -10log 10 of the relative rank of the investigated SNP among all possible SNPs in the Sus scrofa reference genome, giving the predicted 90% least impactful SNPs a pCADD score between 0 and 10, the least 99% a score between 0 and 20 and so forth.
Pig populations have been under a long-term selection process performed by animal breeders to constantly improve their stock [41]. Therefore, such commercial breeds can be seen as a long-term controlled biological experiment. In general, genomic selection uses SNP chip variant panels to associate genomic regions with traits of interest. The variants of the panel are distributed across the genome and allows within-population genetic variation to be captured [50]. However, genomic selection uses the genome as a "black box", as the SNPs on the chip are mostly not causal, but genetically linked (by LD) to the actual causal variants and genes [32]. Therefore, the efficacy of genomic selection can be substantially improved by adding new genetic markers comprising the actual causal variation [29], providing insight into the exact molecular drivers involved in the phenotype.
The objective of this study is to bridge the genotype-phenotype gap in pig populations by pinpointing causal variants that are selected by genomic selection. More specifically, we demonstrate that pCADD scores can be used to identify causal variants underlying important phenotypes. Being able to identify causal variants will have major implications for genomic selection, and we show that CADD scores are promising in identifying causal variants in more neutral phenotypes as well. This study provides insights into the molecular biology and pathways affecting important phenotypes in pigs, that can also be transferred to human phenotypes.

Results
Here we present an approach to identify causal variants underlying important phenotypes in pigs. Our approach starts with a large scale GWAS study to identify loci associated with important phenotypes. Subsequently, we integrate the population whole genome sequence, the impact scores (pCADD), and further functional genomic information to fine map and report known and novel causal variants.
2.1. Genome wide association studies in four elite pig populations reveal many QTLs affecting production, reproduction, and health We analysed comprehensive genotype and phenotype data in four purebred pig populations: two boar breeds (Duroc and Synthetic), and two sow breeds (Landrace and Large White). In pigs, purebred populations are the units in which selection is applied, while the final production animals are derived from three-way crosses. First, F1 crossbred sows are created by mating two sow breeds selected for high reproductivity and mothering ability, which are subsequently crossed with a boar breed especially selected for meat production traits. The examined traits can be grouped in three classes: (1) traits focussing on production traits, including backfat, intramuscular fat, growth rate and feed efficiency; (2) reproduction traits, mainly focussing on litter size, number of liveborn, survival, and mothering abilities; and (3) health and welfare traits including disease resistance, osteochondrosis, congenital defects, and other conformation traits. A total of 129,336 animals with 552,000 imputed SNPs were subjected to a GWAS analysis for 83 traits. The analysis revealed 271 QTL regions with a genome-wide association significance threshold of -log10(P) > 6.0, and significant associations were observed for the majority of examined traits. The 'lead' SNP that showed the strongest association signal was used as a starting point for further analysis.

pCADD evaluates all possible substitutions from the Sscrofa11.1 pig reference genome
In our approach the first step entails identification of the SNP of highest significance in a GWAS peak. Subsequently, all variants that are in high LD (r 2 > 0.7) were extracted. The variants were extracted from a total of 428 animals (Duroc: 101, Synthetic: 71, Landrace: 167, Large White: 89), sequenced to an average depth of 11.82. Next, all high-LD variants were annotated for their pCADD scores, to prioritize them for likely impact on phenotype. The sequence variants were also annotated for variant effect type, using the Ensembl Variant Effect Predictor (VEP, release 98) [49]. The distribution of the pCADD scores for a set of variants depends on their functional class, and non-coding variants have on average lower scores compared to coding variants. The quantiles and further class statistics for the pCADD scores are presented in Supplementary Table 1. pCADD provides an independent impact score and both, deleterious and functional SNPs, will be enriched for high pCADD scores, because they have impact (either negative or positive). The assumption is that if a variant has impact on a trait (either regulatory or coding), it likely falls within a rather evolutionary conserved region, leading to generally higher pCADD scores. In addition, three liver histone modification datasets were used (for modifications H3K27Ac and H3K4me3) to mark variation overlapping with regulatory sequences, including likely active promoter and enhancer elements in pig liver tissue [78].

Phenotype and pathway information provides further evidence of gene causality
Functional annotations, including pathways and gene-ontology information for the examined pig genes associated with the top-ranked variants, were extracted from the Uniprot database [74]. Moreover, we extracted associated phenotypes from orthologous genes from the Ensembl database for human (Homo sapiens), mouse (Mus musculus), and rat (Rattus norvegicus). The phenotypes are mainly based on (disease) association studies in humans, and gene knockouts in mice and rats [83]. A complete overview of the pipeline is presented in Fig. 1.

Gene expression information allows identification of possible expression quantitative trait loci
The combination of genotype and gene expression data provides an additional layer of evidence to find causal variants, as differences in expression of genes can be associated with a specific variant (expression quantitative trait loci; eQTL). In this study we use 59 RNA-sequenced samples [75] from Landrace (n = 34) and Duroc (n = 25) to test for differential expression between the genotype classes (homozygous reference, heterozygous, homozygous alternative) to associate the expression of genes with the genotypes. The sequenced RNA samples were derived from testis. Further details about the sequenced samples and alignment depth are provided in Supplementary Table 2. The combination of epigenomic marks (liver) and gene-expression data (testis) can, in addition to the pCADD scores, facilitate the discovery of functional variants.

Accelerated discovery of potential causal variants from GWAS results
To demonstrate the power of our approach we first analysed several QTL regions (per breed) with known causal variants reported in literature. This list includes 1) a missense mutation in MC4R affecting production traits [39], 2) a variant in the promoter of the VRTN gene affecting number of teats [76], and 3) a missense mutation in PRKAG3 affecting meat quality [51]. Despite the fact that hundreds of variants were found to be in LD at each of the GWAS peaks, the method returned the causal variant as top ranked for both the MC4R missense mutation  Table 3).
The mutation in the PRKAG3 gene identified by Milan et al. [51] does not segregate in our sequenced animals. However, we identified another missense variant (15:g.120865869C > T) in the PRKAG3 gene that is a strong candidate variant for affecting meat quality in both boar breeds (Fig. 2), as described by [73]. The causal missense variant is highlighted in green, and the lead SNP in the GWAS in blue in Fig. 2b. The variant results in a substitution of glutamic acid for lysine (ENSSSCP00000030896:p.Glu47Lys) (Supplementary Table 4). PRKAG3 regulates several intracellular pathways, including glycogen storage [19]. The specific isoform (ENSSSCT00000036402.2) affected by the Glu47Lys missense mutation has a role in the metabolic plasticity of fast-glycolytic muscle and is primarily expressed in white skeletal muscle fibers [48]. Gain of function mutations in the PRKAG3 gene have been correlated with increased glycogen content in skeletal muscle in pig, negatively affecting meat quality [15]. The Lys47 variant likely causes a gain-of-function of the 5'-AMP-activated protein kinase subunit gamma-3 enzyme, resulting in increased glycogen content causing lower water holding capacity resulting in low meat quality.

Integrated analysis reveals several novel variants with pleiotropic effects on important phenotypes
We systematically analysed all significant QTL regions (− log10(P) > 6.0) to identify novel causal variants affecting important phenotypes. We highlight several of the most striking variants in the subsequent paragraphs, and a summary of likely causal loci is given in Table 1. In addition, we show several examples of highly significant QTL regions for which we were unable to assign a candidate causal variant at the end of the results.

Promoter variants in the HMGA1 and HMGA2 genes affect fat deposition and growth in pigs
A strong QTL on chromosome 7 affects backfat, intramuscular fat, growth, feed intake and loin depth in Duroc (Fig. 3a). This QTL is among the most significant region affecting production traits (together with the MC4R QTL on chromosome 1 and a strong QTL on chromosome 10). The lead SNP in the GWAS result is located at position 7:30,116,227 with a -log 10 (p) > 20 for backfat, feed consumption, and intramuscular fat ( Supplementary Fig. 4). The analysis returned 485 variants in high LD with the lead SNP (Fig. 3b). The two variants with the highest pCADD scores are upstream of the HMGA1 gene, 566 bp apart (Fig. 3b, Supplementary Table 5). Both mutations are in the promoter region of the HMGA1 gene. A promotor function at this position is supported by signals on the H3K4me3 and H3K27Ac histone marks ( Supplementary  Fig. 5). The A allele, segregating at 36% allele frequency, is associated with less backfat, faster growth, but also smaller loin depth and decreased intramuscular fat. The expression of the HMGA1 gene was investigated in twenty samples. In all samples genotype and gene expression, as normalized fragments per kilobase per million (FPKM), were both available within the three genotype classes GG, AG, and AA.
The A allele is strongly associated with increased expression of the gene. The expression level furthermore appears to be affected in an additive manner (P = 0.041, Supplementary Fig. 6). The correlation of additive increase of expression of the HMGA1 gene, and the same additive effect on backfat and growth, and negative on intramuscular fat, suggests a causal relation between gene expression and phenotype. In addition, we find two variants affecting the promoter region of the HMGA2 gene, associated with less backfat in the Synthetic breed (Table 1, Supplementary Table 6). Both HMGA1 and HMGA2, belong to the High Mobility Group A gene family, are well-known to affect growth and stature in pigs [14,36,40], although a causal relationship had not been reported thus far. Our results suggest that the causal variants for both genes are regulatory.

A novel missense mutation in SCG3 likely to affect backfat and growth rate
A strong QTL on chromosome 1 affects backfat, intramuscular fat, and drip loss in the Synthetic breed (Fig. 4a). Despite the presence of more significant QTL regions we focus on this QTL given the large extend of LD. Large LD blocks can hamper accurate fine mapping of causal variants. The lead SNP in the GWAS result is located at position 1:115,884,118. The analysis returned 874 variants in high LD with the lead SNP (Supplementary Table 7). The SNP with the highest pCADD score (1:g.120074006G > A, pCADD = 30.28), a single missense variant affecting the SCG3 gene, was identified as potentially causal (Fig. 4b).
The variant substitutes a threonine for a methionine at position 386 in the Secretogranin-III protein (ENSSSCP00000044507:p.Met386Thr). The Met386 allele is associated with increased intramuscular fat, more backfat and lower meat quality. Several variants altering the SCG3 protein have been associated with obesity in humans [69], supporting its likely causality for the fat-associated phenotypes in pigs.

Genes affecting meat quality involved in muscle glycogen storage
A substantial number of candidate causal variants affecting meat quality were identified. In the Synthetic breed, a boar line hence particularly selected for its meat quality traits, 26 loci were significantly associated with drip loss (− log 10 (p) > 6). Drip loss is trait that measures the water holding capacity of the meat (Fig. 5). The top ranked pCADDscored genes show a strong enrichment for pathways involved in glycogen synthesis and storage (Table 1). Increased levels of muscle glycogen are known to lead to increased drip loss, which is considered to negatively affect meat quality [64]. Examples discovered in the present study include regulatory variants affecting the MEF2C, SCG3, and GBE1 genes. MEF2C knockout mice accumulate glycogen in their muscles [2], while GBE1 codes for a glycogen branching enzyme associated with glycogen storage disease, if mutated [22]. Moreover, we identify two missense variants affecting the NEU3 (ENSSSCP00000034065:p. Pro419Ser) and MAP1A (ENSSSCP00000005070:p.Gly1904Ser) genes, both directly involved in the glycogen deposition [34,81]. Not only does our study demonstrate the central role of glycogen-based pathways in an important meat quality trait, it also highlights that a combination of regulatory and protein altering variants are involved.

Genes affecting growth and fat deposition traits are involved in energy metabolism and adipogenesis
A number of likely causal variants and genes affecting other important production traits were found (Table 1), although the underlying pathways initially appeared to be less obvious. The top-ranked genes were found to be enriched in energy reserve metabolic processes, glycogen metabolic process, regulation of lipid biosynthetic process, and homeostasis (Supplementary Table 8). More specifically, two regulatory variants in the SOD1 and PRKCE genes likely affect backfat. SOD1 is involved in glucose metabolism and prevents oxidative damage associated with obesity in humans [46], while mutations in PRKCE decrease the amount of body fat in humans [11]. Furthermore, we identified one regulatory variant in the CACUL1 gene likely affecting intramuscular fat. This gene inhibits adipogenesis via the peroxisome proliferatoractivated receptor γ (PPARγ) [38]. In addition, two missense variants likely affect intramuscular fat via the LNPEP (ENSSSCP00000051249:p. Leu334Ser) and ABCA12 (ENSSSCP00000058038:p.Gly1693Cys) genes. LNPEP attenuates diet-induced obesity in mice through increased energy expenditure, and decreases the amount of adipose tissue [55], while the ABCA12 gene plays an important role in lipid transport, affecting carcass fat content in pigs [58]. We further identified potential regulatory variants in the NR1H3, NR1H4, and PRCP genes, all likely affecting growth ( Table 1). Note that the 3'UTR variant with the highest pCADD score in the NR1H3 gene is also causing a missense variant in the partly overlapping MADD gene. NR1H3 and NR1H4 are paralogous genes both involved in lipid homeostasis [66,84], while reduced levels of PRCP expression promote obesity by regulating the α-melanocyte-stimulating hormone (α-MSH) that regulates feeding behaviour. Finally, we found a missense variant in the SLC46A1 gene associated with increased intramuscular fat (ENSSSCP00000020843:Gly131Arg) in pigs, known to affect glucose and fat levels in knockout mice [7]. The various pathways identified to be involved in the physiology of growth and metabolism demonstrate that indeed selection traits involved in growth, energy efficiency, and fat deposition are complex and consisting of many genes, which is congruent with the known highly quantitative nature of these traits. However, identifying these underlying pathways enable identifying at which genes and parts of gene networks these pathways intersect, which provides valuable insight into pleiotropy and their trade- Table 1 List of potential causal variants identified from the pipeline. Table shows the variants type, potential overlap with promoter or enhancer region (from liver, [78]), the change in amino acid (for missense mutations) and the pCADD score for variants affecting one or more important selection traits (BFE: backfat, IMF: intramuscular fat, TGR: growth rate, DRY: drip loss, NTE: number of teats). The causal variant for genes in bold has already been reported in literature. A minus sign stands for a negative effect of the alternative allele in the table (orange), a plus sign stands for the positive effect of the alternative allele on the indicated trait (green). NS indicates that the variant has no significant effect on the trait. Variant IDs are given in Supplementary table 16. [1,4,10,16,20,21,45,53,54,56,62,67,72,82] offs.

Balancing selection for causal variants in the breeding program
Interestingly, a number of the variants that are likely directly affecting phenotypes under selection in commercial breeding programs, exhibit pleiotropic effects. This particularly applies to genes HMGA1, SCG3, and MC4R ( Table 1). Variants that positively affect backfat often have negative consequences for growth, while variants that positively affect intramuscular fat often show detrimental effects on backfat. The observed pleiotropic effects cause the variants to be under balancing selection in the breeding program, preventing population fixation of individual variants underlying strong QTL regions. Without balancing selection, causal variants of this magnitude of impact would have been selected to fixation in modern breeding programs in a very low number of generations.

Examination of three highly significant across-breed QTL regions
We examined three other striking QTL regions (from Figs. 2-4) to identify potentially causal mutations. First, we examined the QTL region for drip loss on chromosome 3 (Supplementary Table S12). The lead SNP (3:16839270) is in high LD (r 2 = 0.92) with the splice variant known to affect the expression of the PRHG1 gene affecting meat quality [47]. However, the splice variant is not highly scored according to pCADD (score = 2.13) and is thereby not among the top variants. Next, we examined the QTL region on chromosome 10 affecting the production traits backfat, feed intake, and intramuscular fat (Fig. 3, Supplementary  Fig. 4). The top 5 candidate variants affect the ZNF367 (missense variant), CDC14b (intronic), and AAED1 (splice donor) genes (Supplementary Table S14). The ZNF367 increases creatine levels in knockout mice [7]. The CDC14b gene is involved in DNA damage repair and aging [12], while the AAED1 gene is involved in glycolysis [85]. Further analysis is required to fine map the causal variant at this locus. Next, we examined the QTL region for backfat on chromosome 5 in the Synthetic breed (Fig. 4), the top SNPs are intergenic or annotated within the intronic regions of the CCND2 gene (Supplementary Table S13). CCND2 is involved in the cell cycle and has been reported to affect fat deposition traits in pigs [42].

Discussion
The aim of this study was to identify causal variants under selection in pig breeding programs, and to identify the molecular pathways involved in the traits. Including the pCADD scores is particularly relevant because genomic variation underlying phenotypic variation mostly affects the non-coding part of the genome [59], and GWAS results often point to regions outside gene boundaries [5]. Furthermore, the extensive LD in regions under selection makes it hard to pinpoint any single variant since there may be several candidates that all may be significant in a GWAS. pCADD scores [31] allow the prioritization of any single nucleotide substitution variant in the genome based on the likelihood of being functional. This is a major step forward in livestock, as thus far only variation in the coding region could be scored. On top of the pCADD scores, we use epigenomics and gene expression data to annotate regulatory sequences and associate gene expression to the trait of interest. In human, many transcriptomic and epigenomic marks have already been incorporated in the CADD scores [61]. However, the pCADD scores are built on far less (epi)-genomics data, but with the accumulation of functional genomic data in pigs [27], these pCADD scores will further improve. One drawback of pCADD is that it is not able to score structural variants yet. Advances have been made for the human pCADD to include scoring of structural variants [24] and this feature might be added in Livestock populations generally have small effective population sizes (N e : , far less compared to e.g. human (N e ~ 10,000), leaving much longer blocks of variants in high LD. This high level of LD increases the power to detect QTL regions, even with relatively low SNP density. However, within large LD blocks, many variants will be associated, and a thorough variant prioritization should be performed to point to likely causal variants within the (often) large variant set. For example, the LD block for the number of teats in Landrace spans about 1.8 Mb, leaving many thousands of variants in linkage, which increases the level of noise and hampers the detection of the causal variant ( Supplementary Fig. 3). Nevertheless, in Large White and Duroc, which have smaller LD-blocks  (100-500 kb), the causal VRTN promoter SNP is among the top SNPs. In that sense, integrating the results from multiple breeds provides additional power to further narrow down the list of candidates, assuming that the same causal variant is segregating, but likely with a very different underlying haplotype structure. GWAS analysis are generally performed within breed. However, a multibreed GWAS (potentially including crossbred information) could be performed to disentangle the haplotype structures in different breeds, and to facilitate fine mapping of the causal variant. Hence, future GWAS analysis and finemapping strategies could benefit from a multi-breed approach. Another way to provide further evidence of causality would be to fit the SNP into the GWAS model and check whether the QTL peak disappears completely. This would however require imputation to sequence and will only benefit if the LD between the causal SNP and the lead SNP in the 660 K GWAS is significantly below 1.0. This example shows that integrating GWAS and pCADD scores can be very powerful to prioritize variants. There is, however, a trade-off: although the homogenous populations lend themselves well for finding associations by GWAS, the extensive linkage disequilibrium results in too many high-pCADD SNP candidates. But this does also highlight a second aspect that is often not considered: that strong selection in small, homogeneous populations may lead to strong hitchhiking effects. The type of analysis presented in this study provides a strong method for inferring potential hitchhiking and inbreeding effects, given the knowledge on the functional variant and the surrounding (possible deleterious) variants in high LD. Note that in pig breeding the production animals are crossbreds not affected by inbreeding that benefit from the heterosis effect. However, we predict that it will become very valuable to genotype causal variants in crossbred animals, especially because the LD between the selection markers and the causal variants might be substantially lower in crossbred animals. Hence, we believe that causal variants could significantly improve across-breed genomic prediction.
Although the development of genomic selection has revolutionized animal breeding, the lack of functional genomic information currently limits further development [26]. The framework and associated pCADD scores provided within this study will accelerate the discovery of functional variants, which can be directly implemented in genomic selection by adding the causal variants to the SNP chips used for genomics selection. Moreover, the results provide further knowledge of the biological pathways associated with important phenotypic variation in livestock. This is vitally important, since breeding goals are in practice often mutually exclusive. Understanding how at a fundamental biological level, pathways under selection are intersecting, can provide a better formulation of selection criteria.
Integrating GWAS based on ongoing commercial selection and functional appraisal of variations in the populations under selection provides a powerful framework to study the genetic architecture of the traits under selection. Comparing the pathways and genes found to be important in these traits in this manner, reveals a striking functional overlap in similar phenotypes in other mammals. For example, we report the GBE1 gene affecting meat quality in pigs by accumulating glycogen in the muscle, a gene associated with glycogen storage disease in human [3]. Moreover, several of the identified genes affecting growth and fat deposition traits in pigs are involved in energy metabolism, glucose homeostasis, and adipogenesis, often associated with metabolic disease in human (e.g. HMGA1, SCG3 genes). In human, however, environmental factors play a very large role in the formation of metabolic disease, while in pigs the animals are kept under relatively uniform conditions, which could make the pig an ideal model to study the effects of specific genic variants on these analogous phenotypes [57]. Pig breeding has led to extreme changes in animal production and efficiency, with little negative consequences on health [41]. This remarkable robustness of the animals, and the molecular mechanisms involved, may aid in understanding metabolic disease in human.
Ultimately, quantitative selection seeks to perturb the underlying pathways in commercial traits. Our study suggests that, despite the complexity of pathways and the high number of genes potentially involved in any one trait, there may be a small number of genes that are exceptionally suited as 'entry points' into those pathways. These genes have a large effect that are more likely to be under selection than other genes in the same pathway. Understanding these 'key' genes, and how they function together would further help to unravel the (molecular) consequences of selection.

Conclusion
This study integrates pig CADD scores and various sources of functional data to provide a framework to pinpoint causal variation associated with important phenotypes in pigs. We demonstrate our method by identifying novel causal mutations or substantially narrow down the list of potential causal candidates in various strong QTL regions, affecting both production and reproduction phenotypes. The new regulatory variants can be utilized directly in the breeding program to improve selection substantially, and to better understand the biology and molecular mechanisms underlying the selected traits. Finally, the pig populations under study provide an interesting framework to study common pathways and molecular mechanisms involved in analogous phenotypes between humans and pigs.

Ethics statement
Samples collected for DNA extraction were only used for routine diagnostic purpose of the breeding programs, and not specifically for the purpose of this project. Therefore, approval of an ethics committee was not mandatory. Sample collection and data recording were conducted strictly according to the Dutch law on animal protection and welfare (Gezondheids-en welzijnswet voor dieren).

Genotype data and breeds
The genomic dataset consists of 15,791 (Duroc), 28,684 (Synthetic), 36,956 (Large White), and 41,865 (Landrace) animals genotyped on the (Illumina) Geneseek custom 50 K SNP chip with 50,689 SNPs (50 K) (Lincoln, NE, USA) and imputed to the Axiom porcine 660 K array from Affymetrix (Affymetrix Inc., Santa Clara, CA, United States). The chromosomal positions were determined based on the Sscrofa11.1 reference assembly [79]. SNPs located on autosomal chromosomes were kept for further analysis. Next, we performed per-breed SNPs filtering using following requirements: each marker had a MAF greater than 0.01, a call rate greater than 0.90, and an animal call rate > 0.90. SNPs with a pvalue below 1 × 10 − 12 for the Hardy-Weinberg equilibrium exact test were also discarded. All pre-processing steps were performed using Plink v1.90b3 [60].

Phenotypes
A total of 1,360,453 animals with phenotypic records for at least one of the 83 evaluated traits were available for this study. These animals were either purebred (Duroc, Synthetic, Landrace and Large White) or crossbred originated from the crosses between these purebred populations. The phenotypic records were used in the estimation of breeding values for all evaluated traits. The estimated breeding value (EBV) of each animal was obtained from the routine genetic evaluation by Topigs Norsvin applying the single-step approach [13,52], which allows the simultaneous evaluation of genotyped and non-genotyped animals, using the software MiXBLUP [70].

Genome wide association study
A single SNP GWAS was performed with the software GCTA [80] by applying the following model: where EBV j is the EBV of the genotyped animal j, μ is the overall EBV mean of the genotyped animals, SNP i is the genotype of the SNP i coded as 0, 1 or 2 copies of one of the alleles, a j is the additive genetic effect and e ij the residual error. Association results were considered significant if -log10(p) > 6.0.

Population sequencing and mapping
Sequence data was available for 101 (Duroc), 71 (Synthetic), 167 (Landrace), and 89 (Large White) animals from paired-end 150 bp reads sequenced on Illumina HiSeq. The sequenced samples are frequently used boars, selected to capture as much as possible of the genetic variation present in the breeds. The sequence depth ranges from 6.6 to 22.2, with an average depth of 11.82 (Supplementary Table 10). FastQC was used to evaluate read quality [6]. BWA-MEM (version 0.7.15, [43]) was used to map the WGS data to the Sscrofa11.1 reference genome. SAMBLASTER was used to discard PCR duplicates [21], and samtools was used to merge, sort, and index BAM alignment files [44].

Promoter and enhancer elements from ChipSeq data
We retrieved three H3K27Ac, and three H3K4me3 libraries (ArrayExpress accession number: E-MTAB-2633) from liver tissue from three male pig samples described by [78] [78]. Data was aligned using BWA-mem [43] and visualized in JBrowse [68]. Coverage information on variant sites was obtained using PyVCF [9] and the PySAM 0.15.0 package.

Phenotypes and gene ontology
Phenotype information from genes orthologous to pigs in humans, mice, and rats were retrieved from the Ensembl database ( [37], release 97) using a custom bash script. Gene ontology and pathway information was obtained from the UniProt database [74].

RNA-sequencing and differential expression
We used 25 Duroc and 34 Landrace boars selected based on high and low sperm DNA fragmentation index, a measure of well packed doublestranded DNA vs single-stranded denatured DNA, which is an important indicator of boar fertility [75]. The boars were all born in the same period of time and a broad range of semen quality tests were conducted on ejaculates of these boars. Sequencing was done in two batches. Library preparation and sequencing strategy of the first batch can be found in [75]. The second batch was prepared using TruSeq mRNA stranded HT kit (Illumina) on a Sciclone NGSx liquid automation system (Perkin Elmer). A final library quality check was performed on a Fragment Analyser (Advanced Analytical Technologies, Inc) and by qPCR (Kapa Biosciences). Libraries were sequenced on an Illumina HiSeq 4000 according to manufacturer's instructions. Image analysis and base calling were performed using Illumina's RTA software v2.7.7. The resulting 100 basepair single-end reads were filtered for low base call quality using Illumina's default chastity criteria. We mapped the RNA-seq data to the Sscrofa11.1 reference genome using STAR [17] and called transcripts and normalized FPKM expression levels using Cufflinks and Cuffnorm [71]. We assigned the genotype class (homozygous reference, heterozygous, homozygous alternative) for each RNA-sequenced individual using the 660 K genotype of the lead SNP in the GWAS result. We tested for differential expression between three genotype classes using the oneway ANOVA test. The Welch t-test was used to evaluate the differences between two genotype classes. A p value <0.05 was considered significant.

Funding
This research was funded by the STW-Breed4Food Partnership, project number 14283: From sequence to phenotype: detecting deleterious variation by prediction of functionality. This study was financially supported by NWO-TTW and the Breed4Food partners Cobb Europe, CRV, Hendrix Genetics and Topigs Norsvin. In addition, this study was supported by the IMAGE project (Horizon 2020, No. 677353). Mirte Bosse was financially supported by NWO-VENI grant no. 016. Veni.181.050. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The use of the HPC cluster was made possible by CATAgroFood (Shared Research Facilities Wageningen UR). The data and some of the analyses used in this study was partly financed by the Research Council of Norway through the projects "Precision whole genome sequence to precision breeding" -NFR no. 255297/E50 and "Investigation of boar fertility by genetic characterization and detection of traits important in sperm production and quality" -NFR no. 207568/O99.

Author contributions
MD conceived the study. MD and CG planned the analysis framework, MD developed the analysis pipeline, CG provided scripts for the graphical output and to annotate candidate genes with phenotypes of orthologous genes in human, mouse and rat. MD took the lead in writing the manuscript and was supported by CG. ML conducted the GWAS and provided data. HJM, AG and MG contributed to the interpretation of the results. MR, MB and DdR helped to supervise the study. All authors provided critical feedback and helped shape the research, analysis and manuscript.

Declaration of Competing Interest
ML and AG are employees of Topigs Norsvin, a research institute closely related to one of the funders (Topigs Norsvin). All authors declare that the results are presented in full and as such present no conflict of interest. The other Breed4Food partners Cobb Europe, CRV, Hendrix Genetics, declare to have no competing interests for this study.