Dissecting closely linked association signals in combination with the mammalian phenotype database can identify candidate genes in dairy cattle

Background Genome-wide association studies (GWAS) have been successfully implemented in cattle research and breeding. However, moving from the associations to identify the causal variants and reveal underlying mechanisms have proven complicated. In dairy cattle populations, we face a challenge due to long-range linkage disequilibrium (LD) arising from close familial relationships in the studied individuals. Long range LD makes it difficult to distinguish if one or multiple quantitative trait loci (QTL) are segregating in a genomic region showing association with a phenotype. We had two objectives in this study: 1) to distinguish between multiple QTL segregating in a genomic region, and 2) use of external information to prioritize candidate genes for a QTL along with the candidate variants. Results We observed fixing the lead SNP as a covariate can help to distinguish additional close association signal(s). Thereafter, using the mammalian phenotype database, we successfully found candidate genes, in concordance with previous studies, demonstrating the power of this strategy. Secondly, we used variant annotation information to search for causative variants in our candidate genes. The variant information successfully identified known causal mutations and showed the potential to pinpoint the causative mutation(s) which are located in coding regions. Conclusions Our approach can distinguish multiple QTL segregating on the same chromosome in a single analysis without manual input. Moreover, utilizing information from the mammalian phenotype database and variant effect predictor as post-GWAS analysis could benefit in candidate genes and causative mutations finding in cattle. Our study not only identified additional candidate genes for milk traits, but also can serve as a routine method for GWAS in dairy cattle. Electronic supplementary material The online version of this article (10.1186/s12863-019-0717-0) contains supplementary material, which is available to authorized users.


Background
Over the last decade, the development of high density single nucleotide polymorphism (SNP) arrays and next-generation sequencing (NGS) technologies have made genome-wide marker sets available in many organisms [1,2]. Combining these with phenotypic records on many individuals, genome-wide associate studies (GWAS) present a powerful tool to undercover genetic variants associated with variation in the phenotype [3]. In human, numerous studies successfully identified causal variants for traits such as height [4], bodyweight [5] as well as several complex diseases [6]. However, in livestock, long range linkage disequilibrium typically results in imprecise determination of quantitative trait loci (QTL) locations and the associated genomic regions containing several positional candidate genes. In addition, two or more QTL located close to each other may be misidentified as one QTL. In such situations, additional analyses need to be performed to distinguish multiple QTL located close to each other.
To resolve these issues, we need additional information over and above association statistics. For traits with Mendelian inheritance, techniques such as homozygosity mapping and studies of recombinant haplotypes provide important clues due to the unambiguous association of at least some genotypes with phenotypic differences. For quantitative traits, no such close associations exist. However, genomic information of various types do allow relative prioritization among candidate variants. The challenges are which information to consider post-GWAS and how to combine them with GWAS statistics. Expression quantitative trait loci (eQTL) mapping can help; expression profiles as the dependent trait in a GWAS have identified causal genes in some studies [7]. Nevertheless, eQTL studies are time consuming and expensive. Therefore, alternative approaches to incorporate gene expression data into GWAS are needed. Other sources of additional information like variants' annotation [8] and evolutionary conservation scores [9] have been used. Unfortunately, these analyses need to be designed on a case-by-case basis [10]. Their implementation is challenging in livestock due to the sparsity of annotation data.
In this study, we used an approach to separate multiple closely linked QTL in dairy cattle by fixing the lead SNP as a covariate. This approach detects QTL chromosome by chromosome, and generates a list of lead SNPs for each QTL. The method is demonstrated by application to three milk yield traits in Nordic Holstein cattle. Many previously identified loci were also confirmed here. Furthermore, we used the mammalian phenotype database to help to find the candidate genes and Variant Effect Predictor (VEP) annotations to screen for possible causal mutations.

Results
We applied a GWAS analysis approach that automatically and iteratively accounts for the effects of QTL identified in previous iteration(s), a similar approach to conditional analysis implemented in GCTA [11]. The impact of pre-correction on type I error rate was assessed by analyzing simulated data with the effect of a quantitative trait nucleotide (QTN) added to the real phenotypic data (for details on simulation, see Method section). The candidate genes were picked as the closest genes to the lead SNP and listed in Tables 1, 2 and 3. The search for candidate genes started with the top SNP location. However, the whole genomic region showing strong associations with the trait was searched, as the top SNP may not be always located closest to the causal gene due to differences in: LD, imputation accuracy and minor allele frequency. Therefore, we included discussion on other relevant genes (based on association results, known gene function etc.) which could be candidate genes underlying the QTL.
Our approach of including associated SNPs as covariates in subsequent rounds of analyses did not increase the type I error rates. We simulated one SNP as a QTN and considered 10 other SNPs with different levels of LD (r 2 ) with the QTN in order to test whether our method introduces type I error into analysis when fixing lead SNPs detected in previous iterations as covariates [12]. We generated new phenotypes from the real phenotypic value plus the simulated QTN effects. The QTN's contribution to individuals' phenotypes was obtained by multiplying the genotype dosage of the QTN with the allele substitution effect which was drawn from a normal distribution with a mean 20% of the standard deviation (SD) of the phenotype and variance as 1% of the phenotypic variance. The simulation was repelicated 100 times. We detected the simulated QTN as the lead SNP in the first round of all 100 replicates. When the simulated QTN was included in the model as a covariate, we did not observed any of the 10 SNPs in LD with QTN to be significant (i.e., no false positives detected).

The GWAS of fat yield
Analyzing milk fat yield, our approach detected nine additional QTL over and above the QTL detected in the first round ( Fig. 1 and Table 1). In Table 1, the first SNP on each chromosome is the lead SNP from the first round of GWAS analysis, the rest are the additional SNP(s) detected on a chromosome. Sixteen SNPs on chromosome 14 have the same P-value in the first round, and these SNPs are in high LD with the two known causative polymorphisms in DGAT1 [13], BTA14: 1802265 (rs109234250) and BTA14: 1802266 (rs109326954) (Additional file 1: Figure S1). The variant effect predictor (VEP) [14] annotation showed these two variants in DGAT1 are missense mutations. The second strongest association signal was located on chromosome 5 with lead SNP, BTA5: 93948357 (rs209372883) located within the intron of MGST1. MGST1 was previously reported associated with the milk fat content [15]. On chromosome 26, our lead SNP pointed to COX15. In a human study, this gene was proposed involved in biosynthesis of heme A [16]. Even though this gene is a promising positional candidate gene, no biological information currently links this gene to milk fat yield. Another gene known to affect milk fat content is SCD1 [17] (Table 4).

The GWAS for protein yield
We ran the analysis on the milk protein yield (Fig. 2), and found 34 lead SNPs ( Table 2), 12 of which were detected in the second or third round. The strongest association signal for protein yield was on BTA14 with lead SNP BTA14:1835440 (rs208567981), located within BOP1. The annotation of BTA14:1835440 (rs208567981) is a missense mutation, and the SIFT annotation is tolerant. However, this signal is most likely due to the known mutation in DGAT1. The lead SNP (rs208567981) was in strong LD with SNPs located within DGAT1 and the -log 10 (P) value of these 19 SNPs within DGAT1 were larger than 47.99 (including two known causative variants in DGAT1, Additional file 1: Figure S2). This result shows that the causal mutation may not necessarily be the SNP in highest association. The second lead SNP of this analysis is BTA6: 88477501, which is located near the well-studied casein genes CSN1S1, CSN1S2, CSN3 and CSN2 [18]. We estimated the variance explained by QTL. The QTL (22 QTL) found only from the first round explained 12.52% of the DRP variance for protein yield and all QTL (34 QTL) explained 16.76% of the DRP variance ( Table 4).

The GWAS for milk yield
We applied our analysis to milk yield (Fig. 3). A total of 26 lead SNPs (Table 3) were detected, out of which six were detected in the second, third or fourth round. The most significant association signal was in the DGAT1 gene. The second most significant association signal was at BTA20:29996719 (rs43116343), which is close to

MRPS30.
A recent study showed MRPS30 to be associated with lactation persistence in Canadian Holstein cattle [19]. This lead SNP is also close to the growth hormone receptor, GHR [20]. The causative mutation of GHR is BTA20:31909478 (rs385640152), and is known to affect milk yield [20]. The third strongest lead SNP was BTA5:93953487 (rs210234664). This SNP is close to MGST1. A previous eQTL study showed MGST1 may affect milk composition [21]. With our approach, we detected BTA6: 38027010 (rs43702337) in the third round, located in ABCG2. ABCG2 was previously reported to affect milk yield in dairy cattle [22]. This SNP is a missense variant; its SIFT annotation is "deleterious" and has previously been proposed as a causative mutation [23]. We estimated the variance explained by QTL. The QTL (20 QTL) found from the first round explained  (Table 4).

Post-GWAS analysis using the mammalian phenotype database
The criteria for selecting positional candidate genes was the gene located closest to the lead SNP. For future identification and research on genes biologically associated with milk traits, we tried to find whether there are other genes which should be considered as potential candidate genes other than the candidate gene lists (Tables 1, 2 and 3). Considering the high LD structure of cattle population, the causal genes may be located within the genome region in LD with lead SNPs. One source of additional information that may help to prioritize genes, is to find the link between the gene and the possible function in the mammalian phenotype database related to milk and milk-organ related traits [24]. Therefore, we extracted genes which overlap with the LD region of the lead SNP and search them in the mammalian phenotype database [24]. We only paid attention to two kinds of phenotypes: "abnormal mammary gland development" or "abnormal milk composition". Ten genes from the GWAS hits were also annotated as related to these two types of phenotype. This annotation appears to have biological relevance, although the enrichment of these 10 genes in the mammalian phenotype database analyzed by Fishers' exact test was not significant. The results showed five genes were reported to be related with "abnormal milk composition" ( Table 5). Out of this list, CSN1S1, CSN2, CSN3 and DGAT1 were reported in dairy cattle and also identified in the present study. Furthermore, we identified six genes related to "abnormal in Eight additional SNPs on chromosome 14 had same highest P value. Note, b indicated this SNP was found on second round, c indicated this SNP was found on third round, d indicated this SNP was found on fourth round mammary gland development" (Table 6) in mammalian phenotype database. In this list DGAT1 showed abnormal phenotype in both kinds of phenotype description we searched. In addition to the well-studied genes (CSN1S1, CSN2, CSN3 and DGAT1), the remaining five genes are ELF5, CAT, STK3, CHUK, and WNT1. ELF5 is one of the candidate genes proposed by the closest genes to lead SNP (BTA15: 65891100) associated with the fat yield (Table 1). ELF5 was previously found related to mouse mammary development [25] and may also influence the milk content through milk protein synthesis in cattle [26]. CAT is also located close to the same lead SNP as ELF5. CAT is involved in several biological processes including GO term 'responds to fatty acid' [27]. CHUK, close to BTA26: 20547445, is associated with fat yield (Table 1). This gene is known as a key gene involved in mammary development in mice [28]. STK3 is the nearest gene to the second lead SNP (BTA14: 67981742) on the same chromosome associated with milk protein yield ( Table 2). This gene was found to play a pivot role in controlling cell proliferation [29] and tumor suppression [30] in human studies. WNT1 is the nearest gene to the second lead SNP of milk yield ( Table 3).

Annotation of SNPs in LD with lead SNPs
As shown before, the causative mutation maybe located in the neighboring region of the lead SNP. Therefore, we extracted all SNPs in LD with leading SNPs (r 2 > 0.2) and annotated them using VEP [14]. We extracted 27,612 SNPs and obtained 29,249 annotations (because some genes or transcripts overlap). The majority of these SNPs are intergenic variants or intron variants (Fig. 4a). Among the SNPs that changed the coding sequence of the protein, most of them were synonymous variants (Fig. 4b). Using this result, we checked if we could prioritize candidate mutations in the candidate genes. For example GHR, the well-known causative mutation for GHR is BTA20:31909478 (rs385640152, F279Y) [20]. The annotation for this SNP is a missense mutation and the SIFT score is 0.02 which is 'deleterious'. Further, we checked whether we can detect some candidate mutations in the new candidate genes. Four genes (CSN1S1, CSN2, CSN3 and DGAT1) were found related to abnormal milk composition and DGAT1 related to mammary gland development (Table 5 and Table 6) as reported previously. In addition to DGAT1, we found

Discussion
Although functional gene clustering is weaker in eukaryotes genomes than in prokaryotes genomes, functional grouping of the genes with same or similar function still exists [31]. Therefore, in GWAS analysis, we may fail to detect the nearby genes and may treat them as one significant signal. In our study, we used an analysis approach to detect multiple nearby QTL by iteratively fixing the lead SNP as covariate. However, such an approach can inflate type I error rate [12]. To avoid introducing additional type I errors, we placed a condition that the lead SNPs detected in the second and subsequent rounds must be found to be genome-wide significant in the first round (i.e., significant according to conventional GWAS criteria). In addition, we tested our approach on simulated data with a simulated QTN and multiple SNPs with various levels of LD with the QTN.
In 100 replicates, we found no additional SNP in LD with the QTN other than the simulated causative variants. By using this analysis, we were able to detect multiple QTL (as well as designating the lead SNP for each QTL) on a chromosome automatically. For example, we detected a known QTL on BTA6 (BTA6:38027010, rs43702337) in the third round and also another QTL at 46 Mb (in the second round). This SNP is located in the gene ABCG2 which was previously reported to affect milk yield in dairy cattle [22] and this lead SNP was the most  probable causative mutation [23]. Furthermore, our approach also showed the potential to distinguish closely linked QTL. For example the lead SNPs on chromosome 6 of protein content, we detected the first association signal at BTA6: 88477501 and the third association signal at BTA6: 88749792. Similar conditional analyses were also applied in human and other livestock studies [32][33][34].
Here, we analyzed one lead SNP at a time, as opposed to Bolormaa et al. [34] who included all lead SNPs simultaneously in the model. We also compared the genetic variants explained by the QTL found by first round and all the QTL found by our approach. The results showed the QTL found at second and third round did explain more phenotype variants (Table 4). Post GWAS, we face the challenge of identifying the candidate genes. The conventional method is to use the nearest gene, but this may miss the target as many-a-time the lead SNP may not be from the causal gene. This could be due to imputation inaccuracies, multiple QTL in the vicinity or random chance factor. Therefore, we need to use additional information to prioritize the candidate genes. In this study, we used the mammalian phenotype database to search for candidate genes from the genes located in association regions. The mammalian phenotype is based on mouse mutation lines. As a test, we extracted all genes located within LD of the lead SNPs for all three milk yield traits and searched for related phenotype terms. Here, we searched for two phenotype terms 'abnormal mammary gland development' and 'abnormal milk composition'. We successfully identified some well-known genes affecting milk related traits in cattle as well as new candidate genes (Table  5 and Table 6). For the term 'abnormal milk composition' , we identified five genes. Four of them were reported previously in different studies [35,36], and only DGAT1 is the nearest gene to the lead SNP on chromosome 14. Another term we searched is 'abnormal in mammary gland development' and found six genes. ELF5, STK3 and WNT1 are the nearest genes to the lead SNPs. However, differences between mice and cattle may introduce some false positives. In all, using this strategy we not only found some well-studied genes missing from the nearest genes method (pick the gene which is nearest to lead SNP as candidate genes), but also identified new candidate genes which may be helpful in finding causal factors.
We also face another challenge of identifying the causative variant once the causal gene is identified as levels of linkage disequilibrium in cattle are high [37]. In many cases the causative variant is not the lead SNP [38] but another SNP hidden within the LD of the lead SNP. In human studies, there are different strategies to prioritize variants [10]. In this study, information from Ensembl [14] was used to prioritize potential causative variants. In our case, the DGAT1 and ABCG2 can be detected in our lead SNP list, and the causative mutation of both can be detected in VEP annotation as missense variants. GHR was found nearby the location of lead SNPs. For ABCG2 and GHR, the SIFT score show these mutations as 'deleterious'. For DGAT1, even though the SIFT showed these two mutations are tolerated the amino acid of the protein is changed. Therefore, the impact of moderate and high reported by VEP can be considered as possible causative mutations, while SIFT score can be used to provide additional support.
In summary, our analysis approach can distinguish nearby association signals of multiple QTL. In our study, we found candidate genes reported by previous studies. Followed by searching genes within the LD region of the lead SNPs, we can find high confidence candidate genes. Lastly, using VEP can help us to find putative causative mutations within candidate genes and provides a good source for further functional validation. However, our approach will not be able to pinpoint causal variants Table 5 Genes related to "abnormal milk composition" phenotype in the mammalian phenotype database [24]

Conclusion
In this study, we designed an approach for detecting closely linked multiple association signals and performed the analysis in Nordic Holstein cattle for milk, fat and protein yields. The results showed we not only detected most of the well-known genes affecting these three milk yield traits but also detected additional candidate genes.
Post-GWAS, we used information from the mammalian phenotype database and variant effect predictor to confirm known genes and causative mutations. In the meanwhile, we detected additional genes which might be contributing to variation in milk traits in Nordic Holstein cattle. Therefore, we concluded our approach can be used routinely for GWAS studies in dairy cattle.

Phenotype and genotype data
No animal experiments were performed in this study, and therefore, approval from the ethics committee was not required. Phenotypic records for Nordic Holstein cattle are kept in a centralized database (Nordic Cattle Genetic Evaluation, NAV. http://www.nordicebv.info/). Breeding values for milk, fat and protein yield (MY, FY and PY) are based on production figures expressed in kilograms taken from routine milk records and then combined into an index for each trait. For details on genetic evaluation for milk yield traits in Nordic countries see (http://www.nordicebv.info/ production). The breeding values used for association analysis were de-regressed proof breeding values [39,40] from the routine genetic evaluation by NAV and were available for 5043 progeny tested Holstein bulls.
The association study was carried out by using imputed WGS data, as previously described by Iso-Touru et al. [41] and Wu et al. [42]. A total of 4921 bulls were genotyped with the Illumina BovineSNP50 BeadChip (54 k) ver. 1 or 2 (Illumina, San Diego, CA, USA). The 54 k genotypes were imputed to WGS variants by a 2-step approach [43]. First, all animals were imputed to the high-density (HD) level by using a multibreed reference of 3383 animals (1222 Holsteins, 1326 Nordic Red Dairy Cattle, and 835 Danish Jerseys), which had been genotyped with the Illumina BovineHD BeadChip. Subsequently, these imputed HD genotypes were imputed to the WGS level by using a multibreed reference of 1228 animals from Run4 of the 1000 Bull Genomes Project Different variant calling pipelines were used for the 1000 Bull Genome Project data and the in-house Nordic data at Aarhus University. The whole genome sequence data at Aarhus University was analyzed as described by Brøndum et al. [44]; while the same for 1000 Bull Genome Project was described by Daetwyler et al. [1]. Detailed guidelines are available at http://www.1000bullgenomes. com. Data from both sources were available as VCF files. The data from the two sources were combined using Picards MergeVCFs (https://broadinstitute.github.io/pic ard/). As the 1000 Bull Genomes Project only shares data after variant calling, some markers were not called for all animals in the combined dataset. To avoid large gaps of missing markers in the dataset, only markers that were called in both the Nordic and the 1000 Bull Genomes Project datasets were kept. For positions containing both a SNP and an INDEL, the INDEL was discarded, as the imputation methods rely on unambiguous sequences of variants. Positions with disagreements between alleles for sequence and HD data were also deleted. Reference genotype probability data was run through BEAGLE [45] and all markers with an R 2 value (squared correlation between the true and imputed allele dosages) below 0.9 were removed from the original sequence data. This was done in order to remove poorly imputed markers that might have adverse effects on the imputation procedures.
Imputation from 54 k to HD genotypes to HD and imputation to the WGS level were undertaken with IM-PUTE2 v2.3.1 [46] and Minimac2 [47], respectively. The imputation to whole genome sequence was done in chunks of 5 Mb with the length of buffer region of 0.25 Mb on either side. A total of 22,751,039 biallelic variants were present in the imputed sequence data. After excluding SNP with a minor allele frequency below 1% or with large deviation from Hardy-Weinberg proportions (P < 1.0 − 6 ), 15,512,960 SNPs for fat yield, 15,551,720 SNPs for protein yield and 15,551,614 SNPs for milk yield on 29 autosomes in Nordic Holstein cattle were retained for association analyses. The average accuracy (r 2 -values from Minimac2) was 0.85 for across breed imputation. Information on the distribution of imputation accuracy as a function of minor allele frequency has previously been published [42].

The methodology of multiple QTL detection
We developed an analysis approach to run the conditional GWAS analysis, similar to the GCTA-COJO approach in GCTA [11]. However, GCTA-COJO uses GWAS summary data while we have reanalyzed the data after fitting only the lead SNP(s) on a chromosome. Furthermore, we used imputed dosage data instead of number of copies of the reference allele. This takes account of inaccuracies in genotype imputation. We first performed a single SNP GWAS analysis using GCTA [11] for each chromosome as the first round. Then we ranked the SNP based on their -log 10 P value in the GWAS. The SNP with the largest -log 10 P value, the lead SNP, within each chromosome was identified. An experiment-wise 0.05 type I error rate after Bonferroni correction for 15,512,960~15,551,720 simultaneous tests corresponds to a threshold of -log 10 P ≈ 8.5. If thelog 10 P value of the lead SNP exceeded 8.5; we extracted the lead SNP's genotype dosage, fitted it as a covariate, and scanned the whole chromosome again as the second round. If the result of second round detected another SNP with a -log 10 P value exceeding 8.5 and this SNP also was significant in the first round (-log 10 P > 8.5), we extracted the allele dosage of this SNP and fixed it as another covariate and scanned the chromosome in a third round. This same procedure was iterated until no additional SNP remained significant. The lead SNP in each round were collected to build a lead SNP list. Moreover, in each round solo SNP, that is, SNP with no other significant SNP within a 1 Mb region were removed. A boundary for each QTL peak was defined as follows: for each QTL, we scanned the 1 Mb region up-and down-stream of each lead SNP, if SNP -log 10 P value decreased by more than 3 units compared to the value at the leading SNP and the region is larger than 0.25 Mb we set this SNP as a boundary, otherwise we set ±0. 25 Mb as the boundary. The list of candidate genes were generated from the closest annotated genome feature to the lead SNP list.
Testing the type I error rate using simulation data We used simulated phenotype data to test whether our approach to detect multiple QTL on a chromosome by incorporating previously identified QTL as covariates, inflates the type I error rates [12]. We selected a SNP randomly from the genome as a causative mutation (QTN) with a MAF (Minor Allele Frequency) between 0.05 and 0.10 and in Hardy Weinberg equilibrium. Ten additional SNP with different levels of LD (linkage disequilibrium, r 2 ) with the simulated QTN were selected. These 10 SNPs have different r 2 with the QTN as follows: one with 0.9-1, one with 0.8-0.9, one with 0.8-0.7, one with 0.7-0.6, one with 0.6-0.5, one with 0.5-0.4, one with 0.4-0.3, one with 0.3-0.2 and two with less than 0.2. Allele substitution effects at the QTL were sampled from a univariate normal distribution with mean of 20% of the standard deviation of phenotype and variance equal to 1% of the phenotypic variance. We repeated this simulation and applied our analysis 100 times. Lastly, we investigated how many times we found a SNP in LD with the simulated QTN after we fix the simulated causative mutation as a covariate i.e. false positive detection.

LD calculation and annotation
We calculated the pairwise r 2 between lead SNP and all other SNPs on the same chromosome using PLINK [48] and extracted all the SNPs which have r 2 > 0.2 with the lead SNP. All these SNPs were annotated by VEP (Variant Effect Predictor) [14]. To find the candidate genes, we extracted all the genes which overlap with LD regions of the lead SNP and searched these gene entries in the Mammalian Phenotype database [24]. We collected all the lead SNPs and calculated the pairwise r 2 with SNPs in the chromosome. The boundary was set to the last SNP that has r 2 > 0.2. Then we extracted all the genes overlapping these regions and searched them in the database. We found 417 genes located in the LD regions, of which 388 have gene symbols. These 388 genes were searched in the database and 375 have mutation lines with phenotype descriptions in the Mammalian Phenotype database. We refined results using two terms for phenotypes: 'abnormal in mammary gland development' and 'abnormal in milk production'.