Multi-strategy genome-wide association studies identify the DCAF16-NCAPG region as a susceptibility locus for average daily gain in cattle

Average daily gain (ADG) is the most economically important trait in beef cattle industry. Using genome-wide association study (GWAS) approaches, previous studies have identified several causal variants within the PLAG1, NCAPG and LCORL genes for ADG in cattle. Multi-strategy GWASs were implemented in this study to improve detection and to explore the causal genes and regions. In this study, we conducted GWASs based on the genotypes of 1,173 Simmental cattle. In the SNP-based GWAS, the most significant SNPs (rs109303784 and rs110058857, P = 1.78 × 10−7) were identified in the NCAPG intron on BTA6 and explained 4.01% of the phenotypic variance, and the independent and significant SNP (rs110406669, P = 5.18 × 10−6) explained 3.32% of the phenotypic variance. Similarly, in the haplotype-based GWAS, the most significant haplotype block, Hap-6-N1416 (P = 2.56 × 10−8), spanned 12.7 kb on BTA6 and explained 4.85% of the phenotypic variance. Also, in the gene-based GWAS, seven significant genes were obtained which included DCAF16 and NCAPG. Moreover, analysis of the transcript levels confirmed that transcripts abundance of NCAPG (P = 0.046) and DCAF16 (P = 0.046) were significantly correlated with the ADG trait. Overall, our results from the multi-strategy GWASs revealed the DCAF16-NCAPG region to be a susceptibility locus for ADG in cattle.

Phenotype Data. The resource population consisted of 1,173 Simmental cattle that were born between 2008 and 2013 in Ulagai, Inner Mongolia. After weaning, all calves were transferred to a fattening farm in Beijing and fattened in the same pens for 8-12 months. All cattle were fed with identical feed, which consisted of silage, brewer's grain, bean dregs, breadcrumbs, and maize. We measured each bull's body weight at the following five time points: birth, upon entering the fattening farm, 12 months of age, 18 months of age, and before slaughter. The growth curve analyses closely followed the linear regression during the fattening period (see Supplementary Fig. S1), and the slope of the regression line therefore represented the average daily gain (ADG) during the fattening period.
Genotype Data. The genotypes of the 1,173 beef cattle were obtained by Illumina BovineHD BeadChip, which included 774,660 SNPs. Quality control procedures were carried out using PLINK 1.7 software 31 to remove SNPs with a call rate less than 95%, a minor allele frequency (MAF) less than 0.05 and a significant deviation from the Hardy-Weinberg equilibrium (P < 10 −5 ); moreover, animals with more than 10% missing genotypes were removed from the dataset. Missing alleles were imputed using Beagle 4.1 software 32 to guarantee the accuracy and effectiveness of the statistics 33,34 .  Table 1. Average daily gain (ADG)-associated quantitative trait loci (QTL) in cattle. Note: 1 "GWAS" refers to the genome-wide association study, "Association" refers to the candidate gene association analysis, and "QTL" refers to the QTL mapping linkage association analysis. 2 QTL positions that were reported as associated with the ADG trait. 3 Candidate genes that were reported as associated with the ADG trait. 4 The M1 line of Beefbooster Inc. was developed from an Angus base and has been under selection for over 30 years. 5

SNP-based GWAS.
A standard MLM for GWAS was performed by extending the Henderson notation as follows: i where y represented a vector of ADG, μ represented the population mean, v represented a vector of fixed effects, β i denoted the effect of the ith SNP, u represented a vector of the polygenic effects and e represented the residual. W, X and Z represented the incidence matrices for v, β i and u. Z was the genetic additive matrix constructed by SNPs, termed as kinship. As described by Lopes 36 , we built kinship using 50,000 random SNPs across autosomes.
In this model, we considered sex, birth year, calving season and population stratification as fixed effects. The percent phenotypic variance that was explained by a single significant SNP was calculated as follows: where p i and q i represented the allele frequencies for the ith SNP, β i denoted the effect of the ith SNP, and σ p 2 represented the phenotypic variance. The R package heritability (https://cran.r-project.org/web/packages/heritability/index.html) was used to estimate the ADG-associated heritability and genetic variance.
Haplotype-based GWAS. Haplotype-based GWAS was performed using the method proposed by Gregersen VR et al. 10 . Haplotype blocks were established based on pairwise measures of the linkage disequilibrium (LD) 37 and implemented using the PLINK 1.7 software with a block window that was less than 100 kb. The haplotype block estimation option was --blocks --ld-window-kb 100. After the haplotype block partitioning, haplotypes for each sample were calculated using a standard expectation-maximum (EM) algorithm, and the program was conducted using the R package haplo.stats (URL: https://cran.r-project.org/web/packages/ haplo.stats/index.html). Haplotype association analyses were implemented in the R package lme4 (URL: https:// cran.r-project.org/web/packages/lme4/index.html) using the MLM equation as follows: where y represented a vector of ADG, μ represented the population mean, v represented a vector of fixed effects, β ij represented the effect of the ith haplotype in the jth block (which contained t haplotypes), u denoted polygenic effects for each individual, and e represented the residual. W, H i and Z were the incidence matrices for v, β ij , and u. A Chi-square hypothesis test with df = 1 was used to calculate the significance level of the haplotype block as follows: The percent phenotypic variance (V pj ) explained by the jth block was calculated using a two-step approach. Firstly, the effect of haplotypes at the jth block was estimated using least square (LS) method and all jth block haplotypes were clustered into two groups (G1 and G2) based on the estimated effects. Each sample was defined as 0, 1, and 2 (G1/G1, G1/G2 and G2/G2) according to the EM results. We then calculated V pj as follows: where β represented the regression coefficient of the phenotype on the indicator (0, 1 and 2), and p and q indicated the frequencies of G1 and G2, respectively.
Gene-based GWAS. We conducted a gene-based GWAS method using a principal component analysis (PCA) according to the method proposed by Kai Wang et al. 38 . First, principle components (PCs) were constructed based on an intragenic SNP indicator, and we selected the PCs based on a cumulative contributed proportion > 85%. Second, the estimate breeding value (EBV) was calculated based on genomic best linear prediction (GBLUP) with fixed effects (sex, birth year, calving season and population stratification) and random effects (polygenic effects). Third, the effectiveness of each PC and statistical hypothesis test was calculated. The general linear model was: Scientific RepoRts | 6:38073 | DOI: 10.1038/srep38073 where b i represented the regression coefficient of the phenotype on the PC, X represented the vector of the PC, and e represented the residual. The following Chi-square hypothesis testing (df = 1) formula was used: For each gene, we selected the minimum P-value for the PCs when the PC number exceeded two. The significant threshold was set based on the permutation testing to overcome false positive discovery 39 . Thus, 1,000 permutation cycles were performed (23,856,000 multiple tests), and the 240,000 th highest value represented the cut-off point for the 1% level of significance.

Gene expression level.
To validate whether the explored gene resulting from the three GWAS methods was associated with ADG trait, transcript abundance in longissimus dorsi muscle tissue was measured. We selected 28 steers randomly in 2014. Longissimus dorsi muscle samples were collected from steers at slaughter and stored in liquid nitrogen. Total RNA was isolated using the TRIzol Reagent total RNA extraction kit (Invitrogen, Carlsbad, CA, USA) and precipitated with ethanol.
Primers were designed using the Primer 5 software and were approximately 200 bp in length (Supplementary Table S1). Real-time PCR was performed to examine the expression level of selected genes using the SYBR ® Fast qPCR Mix (Takara Bio, Otsu, Japan) with the Applied Biosystems ® 7500 Real-Time PCR Systems (Applied Biosystems, Foster City, CA, USA). Expression values were normalized to GAPDH as the internal control. The mean fold change in expression of the target genes was calculated using the 2 −ΔΔCt method.
Correlation analyses were conducted using R version 3.2.2 (https://cran.r-project.org, 18/3/2016). Correlations were derived for all candidate genes expression and phenotypic data with 28 random steers from the same year. General linear model (GLM) was used and the fixed effects included calving season and population stratification effects.

Results and Discussion
Phenotype description and genetic parameters. The phenotypic distribution followed a Gaussian distribution with a mean of 0.98 kg/day, a maximum of 1.87 kg/day, a minimum of 0.54 kg/day, and a standard deviation (SD) of 0.16 kg/day. The heritability (h 2 ) of the average daily gain (ADG) was 0.48, with an additive genetic variance (Va) equal to 0.012.
Following the quality control and imputation, 1,141 samples with 669,742 SNPs remained. Cleaned SNPs were uniformly distributed over the whole bovine genome with a mean inter-marker space of 4.52 kb.

SNP-based GWAS results.
In this study, we used three strategies to perform a genome-wide association study (GWAS) for the ADG trait in beef cattle (Fig. 1). In the SNP-based association, we identified 40 distinct SNPs (Supplementary Table S2) that exceeded the suggested significance thresholds (P < 10 −6 ), 38 of which were located within BTA6 (Fig. 1a). Here, we identified the most significant SNPs, rs109303784 and rs110058857, on BTA6 with identical P-values of 1.78 × 10 −7 . The distance between the two significant SNPs was 680 bp, which were in complete linkage disequilibrium (r 2 = 1) and explained 4.01% of the phenotypic variance. Rs109303784 and rs110058857 were both located upstream of NCAPG and downstream of the DCAF16 gene according to the Ensembl genome database (http://www.ensembl.org/index.html). Figure 2 showed the regional − Log10 (P-value) of the significant SNPs that surround the DCAF16-NCAPG locus on BTA6. We also calculated the LD levels, with the two peak SNPs denoted by different colors. Notably, we found that rs110406669 (P = 5.18 × 10 −6 ) had a low LD with the two peak SNPs and independently explained 3.32% of the phenotypic variance. Moreover, two other prominent SNPs, rs109028700 (BTA5:43111315) and rs137683327 (BTA5:84944556), were located on BTA5 and explained 2.59% and 2.87% of the phenotypic variance, respectively.
In contrast to the SNP-based GWAS results, no prominent block was found on BTA5, but 5 blocks were identified on BTA3, 7, 12 and 19. However, no gene regions or coding domains coincided with these blocks. Notably, Hap-3-N3218 (P = 1.7 × 10 −7 ) on BTA3 contained 3 extragenic SNPs (rs109934393, rs43349539 and rs43348574) that explained 6.22% of phenotypic variance. These results indicated that unknown functional regions or regulatory elements may exist around this identified block. Gene-based GWAS results. A total of 24,616 genes were annotated in ENSEMBLE database. For the gene-based association, 23,856 genes with an average 34.7 SNPs per gene were analyzed. And other 760 genes were excluded, since they included less than three SNPs or not were located in autosomes (sex chromosome or mitochondria DNA). The 1,000 permutation-cycle results suggested a set P-value of 10 −3 with a FDR < 1%. Seven genes were identified for ADG in this study (Table 3). Specifically, DCAF16 and NCAPG were implicated by the SNP-and Haplotype-based association results. We also found two small nucleolar RNAs, SNORD50 and SNORD87, with identical functions in the modification process of other small nuclear RNAs (snRNAs). Additionally, two uncharacterized proteins-ENSBTAG00000038625 and ENSBTAG00000024272-were obtained. These results indicated that the gene-based method can identify functional genes or loci which are previously unverified and provide a possible structural basis for further gene functional validation studies.
the ADG trait. We focused on the intersection of candidate SNPs identified by the three GWASs methods with the highest ADG trait-associated accuracy, which included 28 significant SNPs located at 38.6-39.0 Mb on BTA6. Figure 2 showed a schematic diagram of the region, which contains four annotated genes-FAM184B, DCAF16, NCAPG, and LCORL-from the Ensembl genome database. DCAF16, which was near to the peak SNPs for SNP-based GWAS approach, was the most significant gene (P = 6.45 × 10 −5 ) for gene-based GWAS analysis. Similarly, the most significant block, Hap-6-N1416 (P = 2.56 × 10 −8 ), was also located downstream of DCAF16 (physical distance = 19,663 bp) according to the Ensembl database. DCAF16 may function as a substrate receptor for the CUL4-DDB1 E3 ubiquitin-protein ligase complex, which is involved in two pathways that promote protein modifications and ubiquitination. NCAPG, which was also identified by three GWAS methods simultaneously, encodes a subunit of the condensin complex, which is responsible for the condensation and stabilization of chromosomes during mitosis and meiosis. The associated pathways involved the cell cycle, mitosis and the mitotic prometaphase. Numerous studies 25,26,[40][41][42][43][44][45][46][47][48][49][50][51][52] have confirmed that NCAPG has strong effects on the body sizes and growth traits of human and domestic animals. According to the association analyses from Lindholm-Perry's results 40 , 47 SNPs within or near the gene boundaries of the three candidate genes (NCAPG, LCORL and LAP3) were genotyped. Figure 4 showed a comparison of these association study results with our SNP-based GWAS results. In contrast to our results, the most significant SNPs were located in the LCORL gene. However, most of the significant SNPs from these two analyses were located around the BTA6: 38.78 (Mb) region near the downstream region of DCAF16, suggesting that this region might be a more effective QTL for ADG trait in cattle.
Additionally, a missense mutation (c.1326 T > G, indicated in Fig. 4 by a purple triangle) was identified in exon 9 of NCAPG by several association 26,45 and linkage analyses 25,41 . The resulting amino acid change of Ile442 to Met442 in the encoded protein has been shown to be a candidate causative variation of the growth trait in cattle. Significant selection regions that affect the statures of European and African cattle cohorts were identified in NCAPG by multiple signal selection analyses 49 . GWAS analyses in horses 42,43,46,47 and cattle 48,51 indicated that the NCAPG-LCORL locus or closed regions were significantly associated with body size and growth traits.
Based on our results and previous reports 44,52 , we tested DCAF16, NCAPG, and LCORL expression in muscle tissues. Longissimus dorsi muscle samples from 28 steers with ADG phenotypes were collected. General linear model (GLM) results showed DCAF16 and NCAPG expression is significantly associated with ADG trait (Table 4) and correlations between ADG and genes expression were presented in (Supplementary Fig. S3). No significant difference was detected for the LCORL gene. Our results were concordant with the results presented by Perry et al. 44 that abundance of NCAPG was associated with ADG in the muscle tissue muscle from cows.
In the NCBI database, the NCAPG gene has one reference transcript (Genebank accession number: NM_001102376) and two predicted transcripts (XM_005207785 and XM_015471561), which were derived by a computational analysis using transcriptome data from 11 Hereford cattle. The differences between the three  Table 3. Seven significant ADG-associated genes based on the gene-based GWAS. transcripts occur in exon 1 ( Supplementary Fig. S4). Three transcripts primers were designed using the Primer 5 software (Supplementary Table S3). We demonstrated the existence of three transcripts in Simmental cattle using reverse transcription polymerase chain reaction (RT-PCR) ( Supplementary Fig. S2), and the PCR production sequences were consistent with those reported in the NCBI database. To address the significant association between each transcripts abundance and the ADG trait, we also tested the expression levels of three transcripts. GLM results showed XM_005207785 (P = 0.050) expression was significantly associated with ADG, while no significant correlation were found in NM_001102376 (P = 0.597) and XM_015471561 (P = 0.074) transcripts (Table 4). Overall, DCAF16 and NCAPG have been simultaneously explored by the three GWAS methods, and statistical analysis have proven that DCAF16 and one of NCAPG transcripts (XM_005207785) abundance were associated with ADG trait, indicating that the DCAF16-NCAPG region is a susceptibility locus for the ADG trait in cattle.
Furthermore, we noticed that the independent and significant SNP (rs110406669) from the SNP-based GWAS was located 5′ upstream with a distance of 30,695 bp to XM_005207785. Two peak SNPs were located in intron 1 of XM_005207785 and upstream with a distance of 6,970/7,650 bp to DCAF16. We then searched the transcription factor-binding (TF) site around candidate regions using the Tfsitescan software on the MIRAGE WWW server (http://www.ifti.org/cgi-bin/ifti/Tfsitescan.pl). The regions, which contained ± 5 Kb flanking sequences of the obtained significant SNPs (rs109303784, rs110058857 and rs110406669), were analyzed, and Table 5 showed the Tfsitescan results. The distances between the two most significant TF sites identified here-Nmp4-COL1A1-sit and AT2-VIRE-and the significant SNPs were 130 bp and 178 bp, respectively. It has been shown that Nmp4-COL1A1-sit influences cell structure and function during extracellular matrix remodeling in osteoblasts 53 . The protein product of AT2-VIRE, the AT2 receptor, is widely and abundantly expressed in fetal tissues and plays a pivotal role in cell differentiation and growth 31 . Moreover, similar TF site sequences were found upstream of the NCAPG gene in various species (Supplementary Table S4). Taken together, we proposed that Nmp4-COL1A1-sit, AT2-VIRE or other TF sites are probably involved in the regulation of DCAF16 or NCAPG transcript expression in association with the ADG trait.

Conclusion
In this study, we performed multi-strategy GWASs to investigate average daily gain (ADG) in the Simmental beef cattle. Forty significant SNPs in the SNP-based GWAS, 14 significant haplotype blocks in the haplotype-based GWAS, and 7 prominent genes in the gene-based GWAS were identified. Two genes, DCAF16 and NCAPG, were demonstrated to be associated with ADG by all three GWAS methods. Most importantly, the significant SNPs within the NCAPG-DCAF16 region were strongly associated with the ADG trait, with phenotypic variance of approximately 4%, suggesting the existence of causal variants in this region. Moreover, we have also shown that DCAF16 and NCAPG expression were significantly associated with ADG. Our findings provide insights into the understanding of the genetic mechanisms underlying ADG trait in cattle, and these results inform future NGS-GWAS analyses of causal variants for the ADG trait. Moreover, multi-strategy GWASs represents a powerful approach to the search and analysis of susceptibility loci-related traits.  Table 5. List of transcription factor-binding (TF) sites around the NCAPG-LCORL locus. Note: 1 "Position" indicates distance (in nucleotides) relative to the beginning of XM_005207785; negative indicates upstream of XM_015471561. 2 Experience Value. 3 Significance level is represented by an asterisk (*), where *indicates 10 −4 < P-Value < 10 −3 , and ***indicates P-Value < 10 −4 . 4 Distance between TF and significant SNPs.