Stable bull fertility protein markers in seminal plasma

Bull fertility is an important trait in breeding as the semen of one bull can, potentially, be used to perform thousands of inseminations. The high number of inseminations needed to obtain reliable measures from Non-Return Rates to oestrus creates difficulties in assessing fertility accurately. Improving molecular knowledge of seminal properties may provide ways to facilitate selection of bulls with good semen quality. In this study, liquid chromatography mass spectrometry (LC-MS/MS) was used to analyze the protein content from the seminal plasma of 20 bulls with Non-Return Rates between 35 and 60%, sampled across three seasons. Overall, 1343 proteins were identified and proteins with consistent correlation to fertility across multiple seasons found. From these, nine protein groups had a significant Pearson correlation (p < 0.1) with fertility in all three seasons and 34 protein groups had a similar correlation in at least two seasons. Among notable proteins showing a high and consistent correlation across seasons were Osteopontin, a lipase (LIPA) and N-acetylglucosamine-1phospho-transferase subunit gamma. Three proteins were combined in a multiple linear regression to predict fertility (r = 0.81). These sets of proteins represent potential markers, which could be used by the breeding industry to phenotype bull fertility. Significance: The ability of bull spermatozoa to fertilize oocytes is crucial for breeding efficiency. However, the reliability of this trait from field measures is relatively low and the prediction of fertility given by conventional methods to evaluate sperm quality is currently not very accurate. In this work, we identify sets of proteins in bull seminal plasma from repeated samples collected at different times of the year that correlate to fertility in a consistent way. We combined these individual proteins to build a molecular signature predictive of fertility. This study provides an overview of proteins linked to fertility in seminal plasma, thereby increasing knowledge of the bull seminal plasma proteome. Protein signatures from the latter, potentially related to fertility, may be of use to predict fertility for individual bulls.


Introduction
Bull fertility is a crucial trait for breeding companies, with a considerable impact on the dairy herd economy.In cattle, bull fertility is usually determined from the results of artificial inseminations (AI), estimated from Non-Return Rates to oestrus (i.e. the proportion of cows not coming back in heat after insemination and therefore deemed to be pregnant).Due to numerous confounding factors and biases related mostly to the type of female inseminated and the effects of the environment, high numbers of AIs are necessary to obtain reliable estimates for bull fertility [1].This was previously a source of limitation for the breeding companies as reliable fertility estimates were obtained from accumulating data over several years, by which time the bull was no longer in use for AI.Thus, it was difficult to define efficient sperm handling or genetic selection strategies.In the context of genomic selection where cohorts of young bulls are used, the number of AIs per individual are reduced compared to previously.However, since semen from some bulls with a superior genetic potential can still be used for thousands of AI, it is crucial to maintain high fertility standards even for younger bulls.However, for these young individuals, only limited data from AIs are available.This results in fertility estimates of poor predictive value.Identifying key markers of fertility in semen may allow for better reproducibility in handling bulls and bull spermatozoa, and in some cases, improved fertility.They could also be used as phenotypic markers for fertility in genomic selection schemes to avoid possible unfavorable evolution of sire fertility.In the past, differences in fertility between males were often attributed to variation in sperm quality [2].However, today, the predictive value of conventional sperm quality traits is usually considered to be low [3].
Seminal plasma (SP) has been shown to have an assisting role for spermatozoa and fertility, improving sperm longevity and ability to move within the female tract [4].In cases where the difference in fertility could not be linked to sperm quality, differences in SP have been observed with effects on the spermatozoa and the genital tract of the female [5,6].Other effects that have been shown to be associated with seminal plasma proteins include sperm motility, resilience to freezing and survival in the genital tract in the sheep [7,8].Following the pioneering work of Kilian [9], some specific candidates such as Osteopontin have been associated with fertility [10].Nowadays, there is an increased interest in seminal plasma proteomics based on global approaches of protein patterns.For instance, in two recent studies, comprehensive profiling of the bull seminal plasma proteomics were established [11,12].However, analyses were limited to a single time point measurement, and thus did not assess the stability of the markers over time.In another study, proteomic patterns were used to identify the role of seminal plasma proteins added to spermatozoa by comparing epididymal spermatozoa to ejaculated spermatozoa but they were not specifically linked to fertility [13].
Here, we describe proteomic patterns in seminal plasma from bulls of different fertility, using label-free LC-MS/MS measurements carried out on semen samples collected in different seasons.We provide a general proteomic overview, increasing the molecular knowledge of the bull SP proteome.Identifying proteins that are stable in repeated samples obtained in different seasons and correlated to fertility could allow their use as biomarkers of fertility.

Sample collection and processing
Semen ejaculates (total n = 109) were obtained from 20 Holstein bulls at an AI center (Animal Breeders' Association of Estonia, Raplamaa, Estonia, 58 • 56 ′ 11.4 ′′ N 24 • 53 ′ 14.5 ′′ E).Ejaculates were collected at three time points: Winter (2016 / 2017), Spring (2017), and Summer (2017).Temperature humidity indices (THI) calculated for the summer indicated no risk for heat stress (temperatures and humidity indices for the collection dates are available in Supplementary Materials S1).Seventeen of these bulls were sampled in all three seasons.At each time point, two ejaculates were collected two weeks apart to allow the study of the stability of proteomic patterns within and through seasons to be assessed.Three samples were lost in the sample handling, leading to a single ejaculate for a certain bull and season.The distribution of samples across season, ejaculates, repeated injections and birth dates are described in the table Supplementary Materials S2 and summarized in Fig. 1.The fertility performance of these bulls (as estimated by nonreturn rates at 90 days post-AI, based on the outcome from the first insemination) ranged from 35% to 60% based on a total number of inseminations per bull between 42 and 9384 (median 643.5).These values are presented in the table Supplementary Materials S3.Samples from a separate set of Holstein bulls from the same AI center (n = 17) were collected in 2018 as a separate cohort for comparison.Estimated nonreturn rates for these bulls ranged between 34% and 51% based on 316 to 9244 inseminations per bull (median 989).
Bulls were prepared for semen collection by allowing them to follow each other around a circular race one by one for 30-40 min before collection, and making false mounts, to increase libido.Approximately 70% of bulls false mounted once, and 30% twice.The first ejaculates were collected for the experiment.Following collection, ejaculates were centrifuged at 1800g for 10 min to pellet the spermatozoa [2] and aliquots of the resulting SP samples (5 μl) were then deep-frozen (− 196 • C) and later transferred on dry ice to the laboratory for proteomic analyses.
Protein extraction was performed by mixing 1 μl of SP with 20 μl of SDS-PAGE sample buffer with DDT (58 mM Tris, pH 6.8, 5% glycerol, 1.7% SDS, 10 mM DDT) followed by a 10 min incubation at 95 • C. Subsequently, iodoacetamide was added to a final concentration of 55 mM and the samples were incubated for 30 min in the dark.The samples were then analyzed on SDS-PAGE until the front had migrated approximately 5 mm into the gel.The pieces were excised, washed and the protein was digested with trypsin in-gel.The tryptic peptides were cleaned with C18 columns (UltraMicro spin columns, Nest Group Inc., Fig. 1.Study overview.Samples were taken at three different time points from the same set of bulls.For each round, two ejaculates were taken.For the first season, double injections were performed on the mass spectrometer.Next, proteins were selected for having a consistent Pearson correlation across two or three of the seasons (p < 0.1).Finally, an optimal subset of these was used to produce a predictor.
Southborough, MA, USA) and resuspended in a solution of 5% formic acid in water.

Mass spectrometry data acquisition
LC-MS/MS data for the seasonal protein digests were analyzed on an Orbitrap Elite mass spectrometer (Thermo Fisher Scientific, Germany) interfaced with Easy-nLC liquid chromatography system (both Thermo Fisher Scientific, Waltham, MA, USA).Peptides were analyzed using NanoViper Pepmap pre-column (100 μm × 2 cm, particle size 5 μm, Thermo Fischer Scientific) and an in-house packed analytical column (75 μm × 30 cm, particle size 3 μm, Reprosil-Pur C18, Dr. Maisch) using a linear gradient from 5% to 35% B over 100 min followed by an increase to 100% B for 5 min, and 100% B for 15 min at a flow of 300 nl/ min.Solvent A was 0.2% formic acid in water and solvent B was 0.2% formic acid in 80% acetonitrile.
Precursor ion mass spectra were acquired at 120 K resolution and MS/MS analysis was performed in a data-dependent mode where the 5 most intense precursor ions were selected for fragmentation using CID at a collision energy of 35.Charge states 2 to 4 were selected for fragmentation, and dynamic exclusion was set to 60 s.
For the set of samples corresponding to the first season, replicate injections per sample was performed to assess the degree of technical variation from the mass spectrometer.Four samples from the first seasonal sampling were rerun together with the samples from the second season to assess how much technical variation came from the downstream processing of the samples.
Peptide digests from the follow-up bulls (n = 17) were analyzed by LC-MS/MS using an Easy-nano LC system (Thermo Fisher Scientific, Germany) coupled with a QExactive HF-X mass spectrometer (Thermo Fisher Scientific, Germany) operating in positive ion mode for data dependent acquisition (DDA).The analytical column was 15 cm long fused silica capillary (75 μm 16 cm Pico Tip Emitter, New Objective), packed in house with C18 material ReproSil-Pur 1.9 μm (Dr.Maisch GmbH, Germany).Peptides were separated by an 80 min gradient from 5% to 90% solvent B (80% ACN, 0.1% FA) at a constant flow rate of 250 nl/min.The Orbitrap acquired the full MS scan with automatic gain control (AGC) target value of 3 × 10 6 ions and a maximum fill time of 50 ms.The 20 most abundant peptide ions were selected from the MS for higher energy collision-induced dissociation (HCD) fragmentation (collision energy: 40 V).Fragmentation was performed at 15,000 FWHM resolution for a target of 1 × 10 5 and a maximum injection time of 20 ms using an isolation window of 1.2 m/z.The software Xcalibur v3.0 (Thermo Fisher Scientific, Germany) was used both to control the nLC system and the MS.Here, it was used to acquire and visualize the raw data.

Data processing and analysis
Raw MS data were converted to mzML, an open format for mass spectrum, using Proteowizard version 3.0.9220(3.0.11841 for the follow-up data; [14] with MS-Numpress compression [15], and MS1 peptide features were extracted using Dinosaur [16] version 1.1.4.. Data processing to a peptide abundance matrix was performed in the Proteios software environment ( [17], 2.20.0-devbuild 4643).X!Tandem ( [18], version 2017.2.1.4)and MS-GF+ ( [19], v2016.06.29) were used for peptide spectrum matching.Precursor mass error tolerance was set to 7 ppm and for X!Tandem fragment monoisotopic mass error was set to 0.4 Da (0.02 Da for the QExactive data), while for MS-GF+ the instrument setting was used to set the fragment tolerance.Fix carbamidomethylation of cysteine, and variable oxidation of methionine was set as modifications, and protein N-terminal acetylation was set as additional potential modification in MS-GF+.As a search database, proteins annotated as Bos taurus were retrieved from UniProt (2017-05-19) and concatenated with the cRAP database containing common contaminant proteins (version 1.0, 116 entries).Reverse entries were added to the database as decoy sequences, giving a total of 92,984 protein entries.The search results were subsequently combined in Proteios.The built-in algorithm in Proteios was used for matching MS1 peptide features to MS2 identifications (plugin version 0.5.1, peptide spectrum match FDR cutoff 0.001, secondary hit FDR cutoff 0.01, m/z tolerance 0.005, RT tolerance 0.1) and alignment and propagation of feature identifications between samples using default settings.The resulting peptide matrix was used for subsequent analysis.The datasets from different seasons were processed jointly to facilitate the analysis of the same bull individuals across all time points.Raw peptide abundances were log2transformed and normalized using cyclic Loess normalization in Nor-malyzerDE ( [20], version 1.5.4), which is a normalization technique adjusting for systematic differences in abundance between samples at different abundance levels [21].For a given season, the two measurements from the same bull were combined to provide the median value for each peptide, giving one value per bull and season.This dataset was used for season-specific studies.It was further reduced across batches, to give one value per bull, which was then used to calculate a combined per-bull view.The peptides were summarized into protein groups using an R implementation (https://github.com/ComputationalProteomics/ProteinRollup, version 1.0.0) of the DanteR ( [22], v1.1.6970)R rollup, retaining proteins supported by at least two peptides.Protein level FDR was estimated by calculating the number of decoy entries passing through the protein rollup and was found to be <0.01FDR (8 decoys and 1343 targets passing in the seasonal dataset, with 13 decoys and 1640 targets in the follow-up dataset).The mass spectrometry proteomics data and the protein abundance tables have been deposited to the ProteomeXchange Consortium via the PRIDE [23] partner repository with the dataset identifiers PXD021241 and PXD021245.
The dataset was further explored using OmicLoupe ( [24], v0.9.0) for a comprehensive exploration of outliers and general trends.One bull was identified as a consistent outlier and was excluded from all the analyses.Further, a single sample had a distorted profile, as identified by density plots, and was also excluded (Supplementary Fig. 1 in Supplementary Materials S4 where Supplementary Figs.1-5 are presented).The most abundant proteins were estimated within each sampling by retrieving the three peptides with the highest average intensity and calculating the average of these, as proposed by Silva [25].The 10 most abundant proteins were collected within each seasonal sampling.A gene ontology summary was generated using a Bioconductor based database for Bovine (10.18129/B9.bioc.org.Bt.eg.db, 3.10.0)and enriched using the ClusterProfiler ( [26], 3.14.3)R package using FDR < 0.1 (Benjamini-Hochberg corrected; [27]) to determine significantly enriched groups.UpSet plots were generated to show overlapping correlated proteins between seasons in using the UpSetR package [28].
Pearson correlations between protein abundance and fertility estimates were calculated in R (version 3.6.3)for proteins within each of the seasonal batches as well as the validation dataset using the cor.test function in the stats R package.Protein groups having a p-value <0.1 across all three samples and detected in both datasets were selected.Correlations were also weighted for the number of inseminations from which fertility rates were obtained, but results from non-weighted correlations are presented here, as the r 2 values resulting from the two approaches were almost identical.In addition, an extended list of protein groups with Pearson correlation p < 0.1 in two of the three seasons with a similar correlation pattern in the third was retrieved.A list of proteins with Pearson correlation p < 0.05 in at least one of the seasons was included to facilitate further explorations.Pearson correlations between protein abundances and age were calculated to see if age might have a confounding effect for certain proteins.To determine if the protein differences seen might be related to freezability differences in the spermatozoa, information obtained from a recent study investigating this phenomenon in a comprehensive proteomics dataset [12] was summarized for the top-listed proteins.
For the study including repeated times / seasonal effects, repeatedmeasures ANOVA was carried out using the aov function in the stats R package, with batch as a fixed effect and bull as a random effect, thus accounting for the individual variation in the model.The resulting pvalues were adjusted to FDR values using the Benjamini-Hochberg procedure [27].These values were filtered using FDR < 0.05 and a minimum log2 fold change between any two of the three groups.
Ways to build signatures combining multiple proteins or peptides were explored.Multiple linear regression models were computed using the regsubsets command from the leaps R package (version 3.1, https:// cran.r-project.org/web/packages/leaps/index.html).The calculations were based on the proteins with Pearson correlation p < 0.1 across all seasons, selecting the protein variant with the lowest median p-value in cases where multiple had the same annotation, resulting in a reduced list of 6 protein groups consistently correlated across seasons, as illustrated in Table 1.This reduced list was produced by filtering redundant proteins matching to the same annotation, selecting the protein with the lowest median p-value.Individual protein abundances (previously Loess-normalized and log2 transformed on sample-basis) were normalized further to mean 0 and standard deviation 1. Subsequently, all combinations of predictors were tested systematically.The model with the highest adjusted r 2 value was selected.Further, the same procedure was carried out in the extended list of protein candidates (as illustrated in Table 2), limiting the maximum number of independent proteins to three to avoid overfitting.Finally, we explored with the same procedures a combination of proteins from the extended list showing a similar correlation pattern in the additional set of bulls, resulting in a final list of 11 proteins.The full R analysis, including code for generating most of the figures present in this article, is available as R notebook documents and HTML-files in the Supplementary Materials S5-S8.

Seminal plasma proteome overview
An overview of the seminal plasma proteome was obtained from the analysis of proteomics data from the three seasons.Globally, 23,990 different peptides were found, representing 1343 protein groups that were each identified by two or more peptides (Fig. 1) at FDR < 0.01.At a quantitative level, the technical replicates run in the first of the three seasons showed high technical reproducibility in the mass spectrometry (Supplementary Fig. 2).Furthermore, in Principal Component Analysis (PCA), samples from the rerun of the first season with the second season grouped primarily with the first season, indicating that the major part of the variation between seasonal datasets is not due to the preparation process or measurement, but due to variation between biological samples (Supplementary Fig. 3).PCA from the full dataset revealed that season has the strongest main effect.A separation related to fertility measurement can be seen from further principal components (PC) with axis explaining less variation as PC7 and PC8 (Fig. 2).This indicates a systematic relationship between the fertility and the overall expression seen in the proteomic data but also shows that this effect is overshadowed by other effects such as season of sampling and other sources of variation related to sample biological heterogeneity.
Gene ontology distributions of identified proteins were generally similar to the gene ontology distribution of the predicted Bos taurus proteome, with certain differences which were further highlighted by a gene ontology enrichment analysis.The highest enriched cellular component term was as expected "extracellular region part".Further, the highest enriched biological process was "carbohydrate metabolic processes" and "hydrolase activity acting on ester bonds".The overall distribution of gene ontology terms compared to the predicted proteome and enrichments comparing proteins observed in the seminal plasma proteome to the general background are displayed in Fig. 3.
Proteins estimated among the top 10 most abundant in at least one sampling (14 in total) are illustrated in Supplementary Fig. 4 with the top abundant proteins being Seminal plasma protein BSP-30 K, Spermadhesin-1 and Seminal plasma protein PDC-109, as shown in the table Supplementary Materials S9.

Proteins variable over seasons
To detect the proteins that were most variable between seasons, a comparison of seasonal variation using repeated measures ANOVA was performed.The analysis yielded 403 proteins when features with FDR below 0.05 and a log2 fold difference of 1 between the two groups with the greatest difference were selected.These protein groups are presented in the table Supplementary Materials S10, and the 30 top hits are illustrated in Supplementary Fig. 5.

Correlation to fertility reveals protein markers stable over seasons and biological repeated samples
Nine protein groups showed consistent correlation trends (p < 0.1) across the three seasons (Fig. 4), out of which 3 had a significant negative correlation with fertility estimates with p < 0.05, namely Placenta-expressed transcript 1 protein (PLET1), Lipase (LIPA) and Nacetylglucosamine-1-phosphotransferase subunit gamma (GNPTG).All of these had similar correlation trends across all three seasons.The extension of this list, made by including proteins with correlation p < 0.1 in at least two seasons, resulted in 34 protein groups.Out of these, all but two protein groups displayed the same correlation direction across all three seasons.These proteins were not among the most abundant ones, and the ten most abundant proteins within each season showed either a low or no correlation with fertility in subsequent tests.Principal component plots using proteins with p-value <0.1 in at least one of the three seasonal samples show clustering in all three batches, as well as in the combined dataset.These trends are generally clear, except for bull 16.Correlation overlaps and PCAs are illustrated in Fig. 5.The overlapping correlations are illustrated as an UpSet plot (panel A) which List of protein groups identified as correlated with fertility (Pearson's correlation p < 0.1) across all three seasons.Proteins with p < 0.05 are indicated with *.For cases with multiple proteins matching the same annotation, the one used for building multiple linear regressions is indicated with (*).The AgeSig column indicates for how many seasons the protein was found correlated with age (Pearson, p < 0.1).The Freeze column indicates whether the protein was found to be linked to different freezability of sperm in a recent study by Gomes et al. [12] (NonSig = non significant, Missing = not detected) .List of protein groups identified as correlated with fertility (Pearson's correlation p < 0.1) across at least two out of three seasons.Proteins with p < 0.05 are indicated with *.For cases with multiple proteins matching the same annotation, the one used for building multiple linear regressions is indicated with (*).The AgeSig column indicates for how many seasons the protein was found correlated with age (Pearson, p < 0.1).The Freeze column indicates whether the protein was found to be linked to different freezability of sperm in a recent study by Gomes et al. [12] (NegRel = negative relation, NonSig = non significant, Missing = not detected).displays the overlaps among features, divided into groups with no overlap, overlapping in at least two conditions and the same correlation direction, and overlapping in at least two conditions with different correlation direction.It appears that many proteins are identified as correlated with fertility in only one season, which could be due to random associations related to sample variability.However, in cases where a protein group is found to be correlated with fertility in two or more seasons, we observe that their correlation with fertility tend to be similar.At the peptide level, 28 peptides were found with a consistent correlation (p < 0.1) across all three seasons (batches of analysis), matching to 19 different protein groups.Protein groups with consistent correlation to fertility in at least two seasons are illustrated in Table 1, Table 2 and Supplementary Materials S11.The proteins in these tables were also checked for potential correlation with age and annotated for potential relation to freezability, as found in the recent study by Gomes et al. [12].Among the top-listed proteins, one was found related to age differences in the shortlisted proteins and 11 in the extended list.Furthermore, seven proteins in the extended list have been shown to be    The best performing models' predictions illustrated both on the combined dataset, and its performance on individual seasons.The model demonstrates a consistent predictive ability across all seasons, while not fully capturing individual variation (most notably seen in bull 16).The correlation is calculated as Pearson's r. linked to differences in freezability.Peptides with consistent correlation to fertility across all seasons are shown in the table Supplementary Materials S12.105 protein groups had p < 0.05 in at least one season.These are presented in the table Supplementary Materials S13.

Single-and multiple-regression models for predicting fertility
A predictor for fertility was built based on both the list of six protein groups consistently correlated to fertility (p < 0.1 across seasons) and the extended list of 28 protein groups (p < 0.1 across at least two seasons), identifying the combination of proteins producing the predictor with the highest adjusted r 2 value with a maximum of three predictive variables.The adjusted r 2 penalizes the inclusion of many features in the model by only increasing when the added proteins improve the model more than by chance.Values are reported as unadjusted r 2 .The best performing combination of predictors based on six protein groups reached an r 2 value of 0.65 (r = 0.81) and was built from three proteinslipase (LIPA), a cartilage acidic protein 1 (CRTAC1), and an N-acetylglucosamine-1-phosphotransferase subunit gamma (GNPTG).Fertility predictions obtained from this model for the combined seasonal bull values are illustrated in Fig. 6.The best performing combination of predictors based on the extended list reached an r 2 value of 0.85 (r = 0.92), but had worse predictive performance when applied to individual seasons, and was thus deemed as less robust.
A follow-up set of seminal plasma samples from an independent cohort of bulls with mainly mid-range non-return rates (34-51%) was further analyzed, resulting in 23,423 peptides, and 1640 protein groups identified (FDR < 0.01).The linear regression model generated for the seasonal samples was applied to these samples but showed low predictive ability in this fertility range.However, the lipase (LIPA) used in the predictor showed a similar trend in both datasets, both on the protein group level and for all shared four underlying peptides identified in both datasets.Trends of all eight (four shared) peptides identified for LIPA are illustrated in Fig. 7.

Discussion
In the current study, the bull seminal plasma proteome was explored using a label-free single-shot LC-MS/MS approach.The overall profiling of gene ontology terms revealed a rich proteomic landscape covering a wide range of functions.Gene ontology enrichment against the general background revealed enrichment of extracellular proteins, which is to be expected when studying an extracellular substrate.Furthermore, as expected, there was an enrichment of proteins involved in biological processes related to fertilization and spermatozoa.As for molecular function, peptidase action was enriched, which has been discussed as potentially involved in seminal plasma physiology [29].The label-free approach allowed the comparison of many samples, and the proteome depth of 1343 proteins reached is similar to a recent study where 1445 proteins were detected [12], and slightly higher than previously reported [11,13], where 685 and 1159 proteins, respectively, were detected.This increased depth provides opportunities for exploration of proteins beyond the most highly abundant and readily measured.This is a key point that stands in contrast to prior studies with less powerful methods, for instance, those carried out using 2D gel electrophoresis coupled to MS, which provides a valuable view of protein patterns, but due to technical limitations, addresses a limited number of proteins [30,31].Inspection of the most highly abundant proteins shows similarity with previous literature.Soleilhavoup carried out a proteomic study of seminal plasma identifying over 700 proteins, where BSPs belonging to the Spermadhesin family were described as the predominant proteins in bovine SP, with LEG1/C6orf58 also being prominent Fig. 7. LIPA peptides.Illustration of individual peptide correlations to fertility in the lipase protein.Peptide abundances from the second set of bulls are illustrated in orange.All underlying peptides show a negative correlation to fertility in both the three seasonal datasets and in the validation dataset.[31,32].Another recent study found BSP-1 and Spermadhesin-1 to have the highest abundances [12].In our study, the seminal plasma protein BSP-30 K, Spermadhesin-1 and Seminal plasma protein PDC-109 were found to be the most abundant.
Our results confirm that the technical variation associated with repeated analyses by mass spectrometry is relatively small.This is in line with other proteomic studies, where protein extraction and digestion were identified as the major sources of variability [33].
In the present work, the NRR was calculated without adjusting for environmental factors that may influence fertility.The fertility rates from bulls with low numbers of AI are the most susceptible to the above limitation.However, when attempts were made to weight the correlations by the number of inseminations performed for each bull, there was only a very minor impact on the observed protein signatures correlated to fertility.Furthermore, attempts to account for age were performed by calculating the correlation of the top proteins to age.One of the shortlisted proteins (A5D7U1, Placenta-expressed transcript 1 protein) had a strong correlation with age, which should be considered when interpreting the data.Any possible inaccuracies in fertility estimates due to other factors would be likely to contribute by increasing the background noise, reducing the sensitivity of the study and leading to missed proteins rather than misleading findings.
To our knowledge, this is the first time that relationships between SP proteomic patterns and fertility were approached from multiple measurements performed from several ejaculates from the same bulls collected at different times of the year.Our results show that using only one time point may lead to misinterpretation as some of the candidates display opposite relationships with fertility estimates at different time points.Season has been reported to impact sperm quality [34], and freezability [35] and our observations suggest that some of the above differences may be related to variations in protein content of ejaculates/ seminal plasma between seasons.Although our study was not initially designed to focus on the effect of season, we could identify a series of proteins for which abundance had a relationship to this factor.These results show the importance of analyzing different ejaculates at different time points when aiming at identifying protein markers for fertility.

Proteins showing consistent relationships with fertility across samples
This study allowed the identification of several proteins or protein groups, where respective abundance was significantly correlated with fertility consistently across all samples.Some of the proteins have been reported previously to have a role in reproductive function and/or fertility.Osteopontin (https://www.uniprot.org/uniprot/P31096), is a glycophosphoprotein which was shown to be involved in many different functions [36].This protein takes part in a mineralized matrix by binding hydroxyapatite and is important for cell-matrix interactions.Another function of this protein which could be of importance for the interactions between spermatozoa and the female genital tract, is its immune function through acting as a cytokine enhancing the production of interferons, correlating with TNF α and being an attractant for macrophages and T cells [36].In addition, Osteopontin has been reported to play a positive role in sperm-egg binding and embryo development as reported in several studies [37][38][39][40].High concentrations of Osteopontin in SP were also positively related to the freezability of bull semen [41] and Osteopontin gene polymorphism was associated with semen production traits in the buffalo [42].Our results showing a positive correlation between several peptides of this protein and fertility may be related to the above-mentioned functions.These results are consistent with results previously obtained in bulls [10,43], horses [44] and camels [45].
In humans, EDIL3 is involved in the adhesion of endothelial cells through interaction with a receptor (https://www.uniprot.org/uniprot/O43854).This protein may also have an inhibitory role while regulating the formation of vascular-like structures.However, in cattle, no annotation has been reported so far (https://www.uniprot.org/uniprot/E1BPX2), and there is some uncertainty in the annotation due to missing conserved residues compared to other similar proteins.
Leahy et al. [46] mention that this membrane protein limits leukocyte recruitment during inflammation, and speculate that the addition of this protein to the ram sperm membrane at ejaculation [13] may help to modulate the female immune system and aid sperm survival in the ewe reproductive tract.So far, no relationship between EDIL3 abundance and fertility has been reported.Its negative correlation with fertility, found consistently across seasons in our study, and the above regulatory role on the immune response of the female genital tract deserves further investigation.
A lipase (LIPA, long name: Lysosomal acid lipase (LAL)/cholesteryl ester hydrolase) consistently showed a negative correlation to fertility in both our seasonal dataset and in the follow-up dataset.This protein is annotated in the human species (https://www.uniprot.org/uniprot/P38571) and is known to be involved in intracellular hydrolysis of cholesteryl esters and triglycerides internalized via endocytosis of lipoprotein particles.It is also reported to be important in mediating the effect of LDL (low-density lipoprotein) uptake on suppression of 'hydromethylglutaryl-CoA' and on the activation of endogenous cellular cholesteryl ester formation.In both cases, it is involved in pathways related to the control of cholesterol levels.In the human, LAL is encoded by the LIPA gene, and low activity of this enzyme causes lipid accumulation and reduction of free fatty acids and cholesterol in the cytosol.This reduction influences numerous downstream genes via transcription factors, resulting in higher expression of the low-density lipoprotein receptor, acetyl-coenzyme A acetyltransferase.This effect results in amplified lysosomal lipid accumulation, increased levels of serum very low-density lipoproteins, and decreased serum high-density lipoproteins, also influencing free cholesterol.
This phenomenon may be of importance for fertility as the male reproductive function is influenced by cholesterol homeostasis [47].Loss of sperm membrane cholesterol is involved in the process of capacitation [48].Membrane cholesterol content has been related positively to sperm freezability [49,50].Low cholesterol in the sperm tail has been associated with poor motility parameters as observed in samples collected during the summer when compared to samples collected during the winter [51], and cholesterol has been presented as a predictive tool for semen quality evaluation [52].However, despite LAL having been shown to have an essential role in lipid metabolism [53], very little information exists on the relationship between LIPA and male fertility.Other lipases such as the Hormone Sensitive Lipase (HSL/LIPE) has been favorably associated to sperm traits [54].To explain the negative relationships observed here with LIPA in light of the above information would need further investigation.It would be interesting to document further the precise impact of high LIPA on the distribution of cholesterol in different sperm compartments, especially in the membrane, to explain the negative relationship to fertility found here.

Proteins with trends for relationships
Among proteins found correlated to fertility in two seasons, attention could be given to Prostaglandin D synthase.This protein is an extracellular transport protein with a high binding affinity for specific cell receptors and small hydrophobic ligands [55].Its isoform lipocalin-type prostaglandin D synthase was previously thought to be related to fertility [56].Prostaglandin D synthase was reported to be associated with poor semen parameters such as decreased sperm count, motility, and normal morphology [57,58].It has been used as a biomarker for 'obstructive azoospermia' in men, a condition where spermatozoa are absent from the ejaculate despite normal spermatogenesis, and an assay diagnostic purposes has been developed [59].The results of our study, showing a strong and consistent trend across seasons for a negative correlation of this protein with fertility, fit well with the abovementioned studies and also with previous results obtained in camels [45].Still, Prostaglandin D synthase has also been reported to have a positive trend with fertility (as discussed by Saito et al. [60]), showing the complexity of the system.Further studies would be needed to explain the mechanisms behind these trends in different biological systems.
Two other proteins identified in the extended list correlated with fertility in two out of three seasons are GNPTG and CRTAC1.The negative correlation between GNPTG and fertility is consistent with the work of Liu et al. [61], showing a higher abundance of this protein in semen from poorly performing bulls.Furthermore, we found here that CRTAC1 had a positive correlation with fertility but more investigations about its role may be needed as different trends were previously found when studying its relationship with freezing resilience and liquid preservation [8,31].
Numerous studies have described the relationships between the most abundant proteins in SP and fertility, such as BSPs and Spermadhesins.The BSPs (1, 3,5) have been either negatively or positively correlated with fertility [4] or freezability [12], whereas for Spermadhesins (SPADH1 and SPADH2) which may have a protective role against oxidative stress, positive correlations have been reported with freezability [62] and fertility of frozen semen [63,64].
Beyond the proteins directly included in the shortlist, other proteins of interest showing trends for a correlation with fertility were ZAG, showing a negative correlation with fertility, as in previous studies [31], and Spermadhesin Z13, which in our study, showed a trend for a positive relationship with fertility, similar to previous observations [65].

Prediction of fertility from a combination of protein markers
Due to its importance for the breeding industry, the ability to predict bull fertility has long been sought using both conventional and molecular methods.One early attempt using proteomics was carried out by Killian [9], where it was found that bovine seminal plasma contained proteins associated with bull fertility.A regression model with a positive correlation with fertility was developed (r = 0.89), but its robustness when confronted to multiple samplings was not tested.In a recent study also using liquid chromatography-mass spectrometry, more than 1000 proteins were identified, yielding a predictor for fertility (Spearman's rho of 0.94 / Pearson's correlation of 0.83) [11].However, predictors based on proteomics in linear regression models could be subject to overfitting [66] as the number of proteins used in the models by Viana et al. was high (six proteins) when compared to the number of individuals from which the models were developed (10 bulls).Here, to limit overfitting to the dataset at hand, the protein candidates included in predictive models were chosen by selecting a limited number (in this case, a maximum of three) of those showing a consistent correlation to fertility across repeated samples collected in different seasons, and using adjusted r 2 as selective criteria.This lowers the r 2 value of the prediction brought by our model when compared to other studies.This effect was observed in the data presented here where the inclusion of more proteins led to a model with improved performance in the combined data, but with reduced performance when applied to individual seasons.The development of a predictive signature is difficult due to the many sources of variation influencing molecular data, stemming from both biological factors (season) and technical factors (sample preparation, sample storage, mass spectrometry processing).
To increase the robustness of the signature, we explored further their pertinence by using an independent set of bulls.A limiting factor in this approach was that the fertility range of this second set of bulls was considerably narrower than the first.This contributed to reducing the significance of relationships previously established for most of the proteins, which showed a consistent correlation with fertility in the first study.Despite this reduction, our correlations with fertility, for instance, the negative correlation found between fertility and the amount of lipase (LIPA), and a trend for a positive correlation with one of the Osteopontin protein groups, were confirmed by this additional set of bulls.Moreover, in the extended list of bulls, 11 out of 34 showed similar trends in the independent dataset, albeit not significant, and would need further verification.We also attempted to explore the behavior of our predictor proteins when applied to the proteins in the study by Viana et al.Here, two of the proteins were identified (Osteopontin and N-acetylglucosamine-1-phosphotransferase) but showed no significant trends in the dataset by Viana et al, and lack of quantitative data did not permit further explorations.Many factors may influence the results, such as various software and normalization strategies used in the different studies.Future work would ideally involve testing the signature in a larger set of bulls with a wide range of fertility.This is complicated by the low availability of bulls of low fertility.Due to the progression of genomic selection, only sufficient semen is collected from each for a small number of inseminations.Still, this type of validation would be essential to confirm the reliability of the identified markers.

Conclusions
We identified sets of proteins robustly correlated with fertility and established a model providing consistent results for this group of bulls across repeated samples collected at different seasons.It was explored in an independent set of bulls and a reduced number of candidates were found to show consistent results for the two sets of bulls.However, considering the relatively low number of bulls, this model and its associated protein candidates would need further validation, ideally from a set of bulls with a wider range of fertility.Still, we find consistent relationships between the abundance of most of the protein candidates and fertility, as well as other features of male reproductive function.Moreover, there were consistent correlations across samplings.Thus, these results pave the way for functional studies to further decipher the mechanisms underlying male fertility.

Fig. 2 .
Fig. 2. Principal component on season and fertility.Principal component illustration of samples from all three seasons.A) illustrates the first two principal components and is colored by the season they were sampled.B) is colored on fertility and illustrates principal components 7 and 8, which were selected based on a strong correlation to fertility.C) illustrates the amount of variation explained by the principal components and is colored on their respective correlation to fertility.

Fig. 3 .
Fig. 3. Gene ontology and enrichment.Gene Ontology (GO) overview based on second-highest level GO terms, categorized into biological process (BP), cellular compartment (CC), and molecular function (MF).Panel A illustrates the distribution of functions of proteins observed in the dataset compared to the distribution seen in the full predicted proteome.B-D displays top enriched terms in the different ontology categories when comparing observed proteins with the full annotation.A strongly enriched term indicates comparably more proteins related to this function is present in the observed data.The gene ratio indicates the fraction of proteins annotated with that GO term in the annotation that is also identified in the proteomics data.

Fig. 4 .
Fig. 4. Consistently correlated protein profiles.Protein groups found to be correlated to fertility across all three seasons (Pearson's correlation p < 0.1 in all cases).The multiple combinations of IDs mapping to the annotation Osteopontin could be due to protein isoforms.

Fig. 5 .
Fig. 5. Correlation pattern overlaps.Study of features found to be correlated to fertility in at least one season (p < 0.1).The A) UpSet plots illustrate how these features are distributed between seasons and further between positive and negative correlation to fertility.The top bars show the number found in each overlap, and the linked dots illustrate which season/ direction of correlation is involved.As can be seen, all except two proteins (94% of overlapping proteins) found to be correlated across multiple conditions had the same correlation direction.The B) panel shows the principal component analysis for the same set of protein groups (p < 0.1 correlation with fertility in at least one season), both for individual seasons and after combining all data for each bull.

Fig. 6 .
Fig. 6.Predictor results.The best performing models' predictions illustrated both on the combined dataset, and its performance on individual seasons.The model demonstrates a consistent predictive ability across all seasons, while not fully capturing individual variation (most notably seen in bull 16).The correlation is calculated as Pearson's r.

Table 1
Consistently correlated protein groups.

Table 2
Protein groups with correlation across two seasons.