Identification of SNPs Associated with Somatic Cell Score in Candidate Genes in Italian Holstein Friesian Bulls

Simple Summary Mastitis is a worldwide diffused disease usually treated with an excessive use of antibiotics. Therefore, antimicrobial resistance is an important issue to be addressed by scientists. One of the possible solutions to decrease the use of drugs is genetic selection of resistant animals, that is, individuals that can be more resistant to mastitis. In our survey we analyzed Single Nucleotide Polymorphisms (SNPs) in genes known to be involved in both infection resistance and immune system activity. We found a group of SNPs that can be associated to mastitis related phenotypes (namely SCS) and that can be used for selecting resistant animals. An efficient selection is able to improve both animal welfare and quality and safety of animal products Abstract Mastitis is an infectious disease affecting the mammary gland, leading to inflammatory reactions and to heavy economic losses due to milk production decrease. One possible way to tackle the antimicrobial resistance issue stemming from antimicrobial therapy is to select animals with a genetic resistance to this disease. Therefore, aim of this study was to analyze the genetic variability of the SNPs found in candidate genes related to mastitis resistance in Holstein Friesian bulls. Target regions were amplified, sequenced by Next-Generation Sequencing technology on the Illumina® MiSeq, and then analyzed to find correlation with mastitis related phenotypes in 95 Italian Holstein bulls chosen with the aid of a selective genotyping approach. On a total of 557 detected mutations, 61 showed different genotype distribution in the tails of the deregressed EBVs for SCS and 15 were identified as significantly associated with the phenotype using two different approaches. The significant SNPs were identified in intergenic or intronic regions of six genes, known to be key components in the immune system (namely CXCR1, DCK, NOD2, MBL2, MBL1 and M-SAA3.2). These SNPs could be considered as candidates for a future genetic selection for mastitis resistance, although further studies are required to assess their presence in other dairy cattle breeds and their possible negative correlation with other traits.


Introduction
Mastitis is an infectious disease that affects the mammary gland of cattle, leading to an inflammatory reaction and therefore to negative economic consequences due to a marked decrease in milk production [1]. This infection is usually caused by microorganisms penetrating the mammary gland via teat canal [2]: pathogens can be transmitted either between cows (e.g., Staphylococcus aureus) or picked up from the environment (e.g., Escherichia coli) [3]. Bovine mastitis is considered as one of the costliest diseases affecting dairy cattle worldwide, with antimicrobial therapy representing the major impact on sustained costs [4]. Furthermore, given the high mastitis frequency worldwide, the subsequent antibiotic use in dairy cows is under constant control due to its association with antimicrobial resistance increase [5].
One possible way to tackle the antimicrobial resistance issue is to select animal with a higher genetic resistance to this disease [6]. Starting from the 50's, along with its genetic relationship with additional infection-related phenotypes, the possible inheritance of genetic resistance to mastitis was studied through the years [7,8]. However, the first traits to be selected, like mammary gland characteristics and somatic cell count (SCC), revealed to have low to moderate heritability [9]. Therefore, new genetic approaches to find markers able to allow a faster and more accurate selection are requested, with two potential available candidates: genome scanning and single nucleotide polymorphisms (SNPs) in candidate genes [6]. A holistic approach such as genome scanning and/or a Genome Wide Association Study (GWAS) requires a large effect and/or a large number of animals to detect loci associated with traits of interest, while complex diseases are determined by many loci or genes with a small or almost negligible effect [10]. Therefore, the SNP approach in candidate genes involved in organism recognition, leukocyte recruitment, pathogen elimination and resolution seem to be more direct and reliable.
At first, we selected nine genes that are all involved in the immune response to mastitis infection according to literature. Pentraxin 3 (PTX3) gene maps on Bos taurus autosome 1 (BTA1) and its encoded 45 kDalton glycosylated protein is expressed by mononuclear phagocytes, dendritic and endothelial cells in response to primary inflammatory signals, due to complement activation and pathogen recognition [11]. Chemokine (C-X-C motif) receptors 1 and 2 (CXCR1 and CXCR2, respectively) are paralogous genes coding for major proinflammatory cytokine receptors [12]. They both map on BTA2 and are separated by a 23 Kb fragment. CXCR1 and CXCR2 have been considered as prospective genetic markers for mastitis resistance in dairy cows [13,14]. Deoxycytidine kinase (DCK) maps on BTA6 and has a functional role in drug resistance and sensitivity [15]. Toll like receptor 4 (TLR4) maps on BTA8 and is associated with the early innate immune response. Specifically, TLR4 is recognized as the key transmembrane receptor for the detection of gram-negative bacteria [16]. Nucleotide binding oligomerization domain containing 2 (NOD2) maps on BTA18 and is a key component in the innate immune system, inducing the activation of proinflammatory signaling pathways [17]. Mannose binding lectin 1 and 2 (MBL1 and MBL2, respectively) map on BTA28 and BTA26, respectively. MBLs are collagenous lectins involved in the innate immune response to various microbial pathogens and are potential candidate gene to mastitis resistance [18]. Lastly, mammary Serum amyloid A3.2 (M-SAA3.2) maps on BTA29 and belongs to a superfamily of apolipoproteins expressed in bovine mammary gland as a response to pathogens associated to mastitis [19]. Another candidate gene found in a genomic region close to DCK, namely the immunoglobulin J (joining) chain gene (JCHAIN), was then included in the re-sequencing to find new possible candidate SNPs: this gene plays an important role in the assembly of polymeric immunoglobulins (dimeric IgA and pentameric IgM) and in their selective transport across epithelial cell layers [20].
The aim of this research was to analyze the genetic variability of these genes in Italian Holstein bulls and to study the effects of their SNPs in relation to resistance to mastitis [21].

Animal Data
The deregressed Estimated Breeding Values for Somatic Cell Score (dEBVs) were obtained by the National Breeding Association (ANAFIJ) for a total of 15,562 Holstein bulls. The dEBVs were calculated as for the genomic national evaluation with the algorithm developed by ANAFIJ. We chose to analyze individuals within the two extreme tails of the dEBVs distribution (±1 SD) of bulls born between 2002 and 2012. Given the 11 years range of bulls' birth year, dEBVs were expressed as deviation from the mean dEBVs updated in April 2014. Only bulls with a reliability index > 90 and with availability of semen were retained, resulting in the two following subsets: 37 bulls with low dEBVs (group L, <95) and 58 bulls with high dEBVs (group H, >105). Group L dEBVs ranged from 85 to 94 (92.0 ± 2.3, mean ± SD), while group H was composed by bulls with dEBVs values ranging from 106 to 112 (107.9 ± 1.8, mean ± SD). Reliability of the two groups was (mean ± SD) 92.4 ± 0.2 and 92.6 ± 0.2 (L and H, respectively). The mean number of daughters for each bull was 310 ± 74 and 300 ± 57 (L and H, respectively). Semen doses of the selected bulls were obtained from specialized laboratories for semen cryo-conservation.

Genes Data and Re-Sequencing
The DNA of the 95 bulls was re-sequenced in order to detect all the SNPs present in the sequences of the investigated genes. Information were obtained from NCBI database [22], "Gene" section (available from https://www.ncbi.nlm.nih.gov/gene/).
The assembly of the UMD Bovine Genome 3.1.1 was taken as a reference for all the genes ( Genomic DNA was extracted from semen by using NucleoSpins Tissue kit (Macherey-Nagel, Düren, Germany). The primers were designed using the Design Studio web application by Illumina ® to sequence the entire genes and about 10,000 bp of the upstream regions to search for polymorphisms that could be responsible of the gene expression and be related with resistance to mastitis also in the 5 UTR regulatory regions. The maximum length of the amplicons for each gene was 450 bp, trying to maximize the coverage of the target region. The primers generated by the software were included in a TruSeq ® kit custom amplicon. All the obtained amplicons were sequenced by Next-Generation Sequencing technology on the Illumina ® MiSeq platform at the IGA technology Services (Udine, Italy), which also performed the output data processing the variant and genotype call and generated a Variant Call File (VCF) for each gene. Polymorphisms were filtered on the base of locus GQX (genotype quality assuming position, <10,000), GQ (genotype quality, <30.00), R8 (indel repeat length, >8), and MQ (mapping quality, <0.00). Other considered parameters were indel, site conflicts, and read depth. The genotype table was set up in R environment inferring all the identified allelic variants from the VCF files. Allelic frequencies were calculated and mutations with Minor Allele Frequency (MAF) < 0.05 were removed from the dataset.

SNP Analysis
An investigation in the National Center for Biotechnology Information (NCBI, https: //www.ncbi.nlm.nih.gov) was carried out on the SNPs with MAF ≥ 0.05 to verify if they were already present in available databases, to define their gene position and, if detected in exon regions, to evaluate their effect on protein translation. The Wilcoxon-Mann-Whitney (WMW) non-parametric test was used to check if there was a difference between the two groups in terms of genotype distribution for all the loci. Subsequently, the generalized heteroscedastic effects regression model (HEM) was used to estimate the contribution of each SNP to the differentiation of the individuals into the two tails of the distribution of the dEBVs [23]. All the SNPs were also simultaneously tested with a multiple gene approach (MG), where association analysis under an additive model was performed considering the phenotype as binary trait (high or low dEBVs) and using the GRAMMAR approach as implemented within the GenABEL package v.1.8 [24] for R [25]. Since the analyzed SNPs were not actually scattered all over the genome, but in 10 selected genes (considering the introduction of JCHAIN), polymorphisms with a correlation higher than 0.80 with any others were excluded. Also, the SNPs not in Hardy-Weinberg Equilibrium (HWE) were excluded. Moreover, four individuals having identity by state (IBS) > 0.95, as revealed by the genomic relationship matrix calculated using the entire dataset, were not included in the following analysis. Then, a polygenic analysis was conducted using a genomic kinship matrix based on SNP genotypes to account for relationship between individuals. Residuals from the polygenic analysis were then used as dependent, quantitative variables in single marker, linear regression analyses to test the significance of marker effects.
In order to understand the possible role of the significantly associated SNPs a first check was performed on NCBI database to see if the SNPs fell within regions from which an RNA transcription through RNAseq analysis was obtained so far. Then, short interspersed nuclear elements (SINE), long terminal repeat elements (LTR), and RNA repeats were analyzed using UCSC Genome Browser [26]. Finally, RNA Central was used to search for potential similarity with non-coding RNA [27], while miRbase was used to search for micro-RNA (miRNA) [28].

Results
The final re-sequenced regions are listed in Table 1. Only for TLR4, NOD2, MBL2, and MBL1 we obtained sequences for about 10,000 bp upstream the selected genes, whereas for M-SAA3.2 and DCK we obtained reads for about 5600 bp upstream. In the remaining genes, namely PTX3, CXCR1, CXCR2, and JCHAIN, we could re-sequence about 1000 bp of the upstream region. For PTX3 gene we could not obtain the last 1000 bp of the gene. This occurred mainly because some regions were highly repeated while for other regions the software failed in finding specific primers. For the same reasons we could not obtain the 100% coverage, but only a mean gene coverage of 78.6% of the selected regions, with PTX3 and MBL2 sequenced with 64% and CXCR1 and MBL1 with 88% of coverage respectively. Despite gene selection, other genes are included in the sequenced regions of the bovine genome assembly: PTX3 gene is included in a region within the ventricular zone expressed PH domain homolog 1 gene (VEPH1) and the collectin surfactant protein A (SP-A) gene (SFPTA1) is located downstream MBL1. Moreover, one significant SNP in M-SAA3.2 was mapped in an intergenic region that seems to be closer to SAA4 than to M-SAA3.2, about 14,000 bp far from the re-sequencing chosen region. Considering the high similarity of the SAA superfamily, further work is needed to verify if there were mapping errors or we obtained re-sequencing of non-specific products. Therefore, we considered the discovered SNPs as belonging to the selected genes and specified in the annotation column of Table 2 their position referring to other genes.   Supplementary Table S8). After pruning for correlation and HWE, 182 of these 557 polymorphisms were tested for associations with the MG approach.
WMW test revealed 61 SNPs with a significant different genotype distribution in the two tails of the dEBVs: 12 SNPs out of 20 in PTX3, 1 SNP out of the 91 on BTA2 in CXCR1, 12 SNPs (three in intron 2 of JCHAIN, three in intergenic regions of DCK, and six in DCK introns) out of 76 on BTA6, six SNPs out of 81 in TLR4, 18 SNPs out of 54 in MBL2, 3 SNPs out of 64 in MBL1, four SNPs out of 128 in M-SSA3.2. Both the HEM and the MG analysis showed nine SNPs significantly associated with the phenotype each. Since two SNPs obtained by HEM and one SNP obtained by MG approach had no significant differences in genotype distribution by WMW, the total number of SNPs included in Table 2 is 64. The SNPs position refers to the UMD Bovine Genome 3.1.1 assembly and, where available, the SNPs rs are reported together with the description of the type of variant, the level of significance of the three tests and the effect of the SNP for HEM and MG approaches. The HEM approach revealed significant SNPs in DCK (4), NOD2 (2), MBL2 (2), MBL1 (1), whereas the MG approach revealed significant SNPs in CXCR1 (1), DCK (3), NOD2 (3), and M-SAA3.2 (2). None of the SNPs in PTX3, CXCR2, JCHAIN and TLR4 resulted significantly associated with the SCS levels in the analyzed population. Nevertheless, one SNP in PTX3 mapped within 3 UTR and others two corresponded to the non-synonymous mutations responsible for the aminoacidic exchanges D 341 E and E 347 K, therefore they could lead to changes in the expressed protein and be matter of interest although they were not significantly associated with the SCS levels in the analyzed population.

CXCR1
The only SNP showing a different genotype distribution in L and H groups by WNW test, rs109694601, also resulted significantly associated with SCS levels by MG approach. This SNP, located within intron region in CXCR1, is mapped 24 pb upstream a Mammalianwide Interspersed Repeats (MIR) element of the MIR3 subfamily. MIRs represent an ancient family of tRNA-derived Short INterspersed Elements (SINEs) found in all mammalian genomes, whose core region may serve some general function, although they have ceased to amplify by retro-transposition. Despite their massive presence in mammalian genomes, their contribution to the transcriptome is still largely unexplored even in humans, and in particular, elements controlling their transcription have never been systematically studied [29], thus we cannot even hypothesize a role of this SNP yet.

DCK
None of the SNPs located in JCHAIN showed significant associations with the considered phenotype, whereas a total of five SNPs in DCK were found significantly associated with dEBVs levels with HEM (four SNPs) and MG (three SNPs) approach. It has to be noted that despite the pruning of the high correlated SNPs and the removal of four individuals because of high IBS by MG approach, which could lead to slightly different results with respect to HEM, the two SNPs not found on NCBI databases (6: 88,044,420 and 6: 88,069,428) resulted significant with both HEM and MG approach ( Table 2). The intergenic region comprising the SNP at position 88,044,420 and SNP rs1115177107 includes a transcribed RNA, as shown by NCBI database, and could therefore have some not yet described regulatory effects. In DCK intron 1, where two significant SNPs (rs43472176 and rs43472180) were described, two ncRNA-like sequences were also found. Moreover, in-between rs43472177 and rs43472180 we found a sequence showing high homology with BTA-mir-2452, a miRNA found expressed upon induced viral infection [30], indicating that it should have a role in the immune response.

NOD2
Four out of the six SNPs with significant genotype distribution in H and L groups, were also found significantly associated with dEBVs levels using MG (rs109352180, rs209159307, and rs110918103) and HEM (rs110918103, and rs210362219) approaches. As for DCK also for NOD2 there was at least one SNP in common between the two approaches (rs110918103).
The intergenic region comprising rs109352180 shows the presence of transcribed RNA and could therefore have regulatory effects. Moreover, rs109352180 flanks a region partially aligning with three human miRNAs and four bovine lncRNAs, three of which are precursors of miR-185, a miRNA that could influence the expression of different genes [31,32]. The region is also rich in SINEs, whereas rs210362219 is in a sequence aligning with 3 bovine lncRNA and one human miRNA (68% identity).

MBL2
None of the 18 SNPs that could be of interest on the basis of WMW test resulted significantly associated with SCS levels by MG approach, neither the three mapped within exon regions (rs209940244, rs210820536, rs463533307), all synonymous mutations, nor the three in proximity of 5 UTR (Table 2). With HEM approach two SNPs resulted significantly associated with the phenotype: rs110884426, mapped in the intergenic region and rs136687134, located about 1000 bp downstream the 3 UTR. Both SNPs are found in a transcribed RNA, as shown by NCBI database. Moreover, rs136687134 is included in a region of Long INterspersed Elements (LINEs), transposable elements that take a large proportion of eukaryotic genomes, once regarded as nonfunctional sequences, and now considered to play pivotal roles in gene regulation [33,34].

MBL1
None of the 3 SNPs that could be of interest on the basis of WMW test resulted significantly associated with SCS levels by MG approach, whereas rs211629255, the SNP that falls into intron 2 of SFPTA1, resulted significant by the HEM approach (Table 2).

M-SAA3.2
The WMW test showed that genotypes distribution in the two sample groups was significantly different for three SNPs, namely rs136687125, rs137746604, and rs210417381. Of the three SNPs, only rs210417381, with also rs42175273, was found to be significantly associated with the phenotype in the MG approach, ( Table 2). It has to be considered that the three SNPs located near the 5 UTR of M-SAA3.2 have a mean correlation of about 0.77 and were therefore all included in the MG analysis, but due to the high correlation among them it is difficult to define which SNPs has actually an effect or if they are all involved in the response to mastitis resistance. The 3 SNPs are included in an RNA transcribed region containing 4 ncRNAs and also a long terminal repeated (LTR) element that should belong to endogenous retrovirus KERVK family. There are evidences that LTR sequences derived from distantly related endogenous retroviruses (ERVs) act as regulatory sequences for many host genes in a wide range of cell types throughout mammalian evolution [35,36]. Re-activation of ERVs is often associated with inflammatory diseases, thus the region could actually be related with mastitis resistance/susceptibility.

Discussion
Starting from the knowledge of the correlation between SCC and mastitis [37], the aim of this study was to analyze the genetic variability of candidate genes and to investigate the association of the identified SNPs with SCS EBVs. As reported by Koeck et al. [38], SCS EBVs are a valuable alternative predictor for mastitis resistance. Candidate genes selected for this study were already known in literature to be related to immune response to mastitis infections [11][12][13][14][15][16][17][18][19]. To maximize the effects of the different SNPs, bulls to be sequenced were selected between the low and high tails of distribution for SCS. We chose to use the selective genotyping approach: only individuals from the high and low extremes of the trait distribution are selected for genotyping, reducing genotyping work and costs maintaining nearly equivalent efficiency to complete genotyping [39,40].
An overall total of 64 SNPs showed significantly different genotype distributions in the H and L groups or were identified as significantly associated with dEBVs for SCS. Among these SNPs, 15 resulted significantly associated with HM or MG approach and 3 of them with both approaches. The main difference between the two approaches used in this study lies in MG that takes into account the genomic relationship matrix and the correlation between SNPs. Indeed, when correlation is considered, several SNPs associations could actually belong to correlated SNPs that were excluded from the analysis rather than to the reported SNP. Moreover, with the MG approach, given the general small allele substitution effect when considering simultaneously SNPs in different genes, and using a correction for the genomic relationship, a low number of significant SNPs is expected. The remaining SNPs should be, therefore, the ones where the association is stronger. Of the 15 SNPs identified by both approaches, the most significant were located inside intronic regions within DCK (rs43472176 and one at position 6: 88,069,428 not previously available on NCBI SNPs databases) and NOD2 (rs110918103). The newly discovered SNP at position 6: 88,069,428, together with rs110918103, are two of the three SNPs resulted significantly associated with both HM and MG approach. No significantly associated SNPs were found in exon regions, while eight SNPs were located in the intergenic regions and could possibly be related to distal regulatory element. Six genes (CXCR1, DCK, NOD2, MBL2, MBL1 and M-SAA3.2) included SNPs that were identified as significantly associated by HEM and/or MG and that are known to be key components in the immune system. As reported by Siebert et al. [41], several studies identified different SNPs in gene associated with mastitis and, furthermore, with SCS. Similarly, DCK gene was suggested as candidate gene associated with mastitis [42]. On the other hand, the direct association between mastitis and MBL1, MBL2, NOD2 and M-SAA3.2 genes has not been fully explored so far. Nevertheless Wang et al. [43], using a GWAS approach found a significant SNP in SAA2 gene, indicating an important role of the superfamily of these apolipoproteins. These results suggest that, just like CXCR1 and DCK genes, also MBL1, MBL2, NOD2 and M-SAA3.2 genes could be eligible as candidate genes for genetic selection of mastitis resistant cows. Many studies demonstrated that SNPs associated with diseases are mainly located in non-coding regions, making it difficult to link them to specific biological pathways. A recent study, in which a genotyping-by-sequencing approach was used to find novel SNPs associated with milk traits, showed that the majority of identified SNPs were located within intergenic regions (69%), followed by intronic regions (25%), with only 3.46% of SNPs being coding variants [44]. Moreover, different studies found that conserved non-coding regions in introns and near genes show large allelic frequency shifts, similar in magnitude to missense variations, suggesting that they are critical for gene function regulation and evolution in many species [45,46]. Most of the significant SNPs detected in our investigation are located inside or in proximity to complete or partial lncRNA-like or miRNA-like sequences, or to repeated regions containing SINEs or LINEs, and it is now established that both intronic and LINE/SINEs repeats could lead to transcriptional regulation of the affected genes [47]. Thus, although the role of some of these elements, especially in the bovine species, has still to be verified, and further studies are needed to better understand the role of these SNPs, the results obtained here are a starting point. Contrary to the findings of Welderufael et al. [48], no SNPs were identified as significantly associated with dEBVs level within PTX3 gene in our population. Similarly, no significant SNPs were detected within candidate genes TLR4 [49].

Conclusions
In this study, next generation sequencing technology was used to discover new SNPs in candidate genes related to mastitis resistance and to identify those associated with dEBVs for SCS. We analyzed the genetic variability of several SNPs found in candidate genes and identified 15 that are associated with the phenotype. Two of them maps within DCK gene and were not previously available on NCBI database, whereas other SNPs strengthen the role of CXCR1, NOD2, MBL1, MBL2, and M-SAA3.2 as candidate genes. These results suggest that the possibility to use SNPs as markers for genetic selection of mastitis resistant cattle is plausible.