Machine Learning Predictive Models for Evaluating Risk Factors Affecting Sperm Count: Predictions Based on Health Screening Indicators

Abstract: In many countries, especially developed nations, fertility and birth rates have declined continually. Taiwan's fertility rate has followed this trend and reached its nadir in 2022, prompting the government to adopt many strategies to encourage married couples to have children. However, couples who marry at an older age may have declining physical status, hypertension and other metabolic syndrome symptoms, and possibly excess weight, all of which have been studied for their influence on male and female gamete quality. Many previous studies were based on infertile people and are therefore not truly representative of the general population. This study proposes a framework using five machine learning (ML) predictive algorithms (random forest, stochastic gradient boosting, least absolute shrinkage and selection operator regression, ridge regression, and extreme gradient boosting) to identify the major risk factors affecting male sperm count based on a major health screening database in Taiwan. Unlike traditional multiple linear regression, ML algorithms do not need statistical assumptions and can capture non-linear relationships or complex interactions between dependent and independent variables to generate promising performance. We analyzed annual health screening data of 1375 males from 2010 to 2017, including data on health screening indicators, sourced from the MJ Group, a major health screening center in Taiwan. The symmetric mean absolute percentage error, relative absolute error, root relative squared error, and root mean squared error were used as performance evaluation metrics. Our results show that sleep time (ST), alpha-fetoprotein (AFP), body fat (BF), systolic blood pressure (SBP), and blood urea nitrogen (BUN) are the top five risk factors associated with sperm count. ST is a known risk factor influencing reproductive hormone balance, which can affect spermatogenesis and final sperm count. BF and SBP are risk factors associated with metabolic syndrome, another known risk factor for altered male reproductive hormone systems. AFP, however, has not been the focus of previous studies on male fertility or semen quality. BUN, an index of kidney function, is also identified as a risk factor by our established ML model. Our results support previous findings that metabolic syndrome has negative impacts on sperm count and semen quality. Sleep duration also has an impact on sperm generation in the testes. AFP and BUN are two novel risk factors linked to sperm count. These findings could help healthcare personnel and lawmakers devise strategies and create environments that increase the country's fertility rate. This study should also be of value to follow-up research.


Introduction
Population aging is one of the by-products of a country's economic development. It increases the burden on the younger generation and diminishes the time available to raise the next generation. Fertility and birth rates have been continually declining in many countries. Taiwan's fertility rate reached its lowest point in 2022 at 1.08 children born per woman, which is lower than the 2.1 needed to maintain the population [1]. Therefore, it is crucial to ensure that married couples wanting to raise the next generation are able to conceive successfully. However, around 15-20% of couples are unable to conceive within one year of unprotected intercourse. Male factors contribute to 50% of all infertility cases. Although advances in assisted reproductive techniques (ARTs) help many couples to conceive successfully, the success rate of ARTs still depends on semen quality [2].
The decline in fertility has coincided with the falling trend in semen quality in recent years. Sperm count and sperm concentration, two determinants of semen quality, were found to be declining in a meta-analysis of 61 studies published between 1938 and 1990 comparing men with no history of infertility [3]. Since that finding, multiple studies have confirmed this worrying trend of decreasing sperm count and sperm density. In a more recent study, the proportion of men with normal total motile sperm count (>15 million) was found to have declined by about 10% over the past 16 years [4]. Although this trend was found within the subfertile male population, it implies that more couples need ARTs to help them to conceive.
Many risk factors, ranging from the patient's genetic background [5], maternal exposure [6], environmental pollutants [7], metabolic syndrome (MetS) [8], and obesity [9] to the patient's lifestyle [10], have been recognized to affect sperm count. Sperm count is further associated with sperm quality and could determine male fertility [11]. However, the extent to which these factors influence semen quality remains to be clearly determined because no experiment can be designed to account for all possible confounding factors. In addition, many previous study populations were recruited from infertility centers, so their conclusions were not representative of the general population. Therefore, to gain more insights into the interplay between these factors and male fertility in the general population, this is the first study to analyze annual health screening data from the MJ health-check-up-based population database (MJPD), maintained by a major health screening center in Taiwan. The MJPD is widely used in healthcare and medical informatics studies [12]. Subjects with metabolic syndrome, hyperlipidemia, or different lifestyles were included in this study to analyze the impacts of these risk factors on sperm count.
Most existing studies have used traditional multiple linear regression (MLR) to analyze the relationship between risk factors and sperm count [13][14][15]. MLR assumes that the dependent variable is linearly related to the independent variables and that the independent variables are not collinear [16][17][18]. MLR is therefore limited when the data contain non-linear relationships or complex interactions between variables [16]. Machine learning (ML) methods are data-driven algorithms and do not require statistical assumptions. They can capture non-linear relationships between variables or those with complex interactions [19][20][21][22]. As ML methods can handle collinearity more effectively than MLR and generate promising performances, they have been widely used for prediction issues in the field of healthcare/medical informatics, while MLR is used as a baseline for comparison [23][24][25][26]. However, only a few studies have utilized ML for sperm-count-related research [27][28][29]. Five effective ML methods with different modeling mechanisms, namely, random forest (RF), stochastic gradient boosting (SGB), least absolute shrinkage and selection operator regression (Lasso), ridge regression (Ridge), and extreme gradient boosting (XGBoost), are used in this study since they have been successfully utilized in many healthcare or medical informatics studies to provide promising results [24][25][30][31][32][33][34][35][36][37][38][39]. Thus, this study aims to construct a framework based on RF, SGB, Lasso, Ridge, and XGBoost prediction models to identify the major risk factors affecting male sperm count, adding to the sperm-count-related research that utilizes ML in the field of reproductive biology.

Data Material
Subjects were identified by reviewing the health screening indicators and questionnaire records of 71,108 members of the MJPD for the period 2005-2017. The study selected 30 health screening indicators and questionnaire variables relevant to the investigation. As the database might contain multiple annual screening records for each member, only the most recent annual record of each subject was analyzed. Subjects who lacked data on the main study variables were excluded, leaving 30,255 individuals who met the study eligibility criteria. We then excluded 6 subjects who were older than 50 years and not evenly distributed in the study groups, and 28,874 subjects who were not male or for whom sperm count or motility tests were not performed in their annual health examination. We finally identified 1375 eligible male subjects, of whom 686 (49.89%) were married and 619 (45.02%) were unmarried, with an average age of 33.22 ± 4.36 years.
In Taiwan, many studies using the MJPD are listed on the website (http://www.mjhrf.org/main/page/resource/en/#resource07; accessed on 1 October 2022). The MJPD includes data collected from four MJ clinics that provide health screening to the center's members. All the datasets used were authorized by MJ Health Research Foundation (Approval No.: MJHRF-2016005A). The data application procedures are described at http://www.mjhrf.org/main/page/release1/en/#release01 (accessed on 1 October 2022). The MJPD is accessible to academic researchers upon request. The protocol of this study was evaluated for ethical issues regarding the use of data in the database and was deemed acceptable by the Research Ethics Review Committee of Far Eastern Memorial Hospital (FEMH-IRB-107127-E, Protocol Version 1, 15 February 2022) and the MJ Health Research Foundation; it was approved by ClinicalTrials.gov (ID: NCT05225454). The study was conducted according to the guidelines of the Declaration of Helsinki, and all data were anonymized before analysis in accordance with the ethics requirements of the institutional review board.
Figure 1 illustrates the sperm count distribution in different age groups in the sample, while Figure 2 shows the subject identification process for selecting the sample in this study. Table 1 provides the sample attributes of the subjects, including descriptive statistics of the independent and dependent variables. Figure 3 presents the Pearson correlation coefficients between the 20 numerical independent variables and sperm count. Figure 3 shows that 3 risk factors have a positive linear correlation with the dependent variable, namely, UA, HDL-C, and AFP. A total of 16 risk factors have a negative linear correlation with the dependent variable, namely, Age, BMI, BF, WC, WHR, SBP, DBP, FPG, SGOT, SGPT, BUN, e-GFR, TG, T-Cho, LDL-C, and C/H. Hb has no linear correlation with the dependent variable. Although none of the numerical independent variables has a strong linear correlation with the dependent variable, there may be non-linear relationships or complex interactions between variables. Therefore, the five ML predictive algorithms were used in this study as they can analyze data with non-linear relationships or complex interactions between variables [19][20][21][22].
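To illustrate why a weak Pearson coefficient does not rule out a relationship, the following minimal sketch computes the coefficient for two made-up toy series (hypothetical values, not MJPD data). The function name `pearson_r` and the toy variables are assumptions introduced only for this illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a symmetric predictor with two different responses.
x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]   # perfectly linear in x
y_ushaped = [v ** 2 for v in x]     # fully determined by x, but not linearly

r_linear = pearson_r(x, y_linear)    # linear dependence: coefficient of 1
r_ushaped = pearson_r(x, y_ushaped)  # U-shaped dependence: coefficient of 0
```

In this toy case the U-shaped response depends entirely on the predictor, yet its linear correlation is zero, which is the kind of pattern a purely linear analysis would miss.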

Proposed Framework
In this study, a framework was constructed using the five ML prediction models to identify the important risk factors (independent variables) affecting sperm count, integrate the results of the models, and discuss them. The proposed ML prediction model-based risk factor evaluation framework is shown in Figure 4.
In the proposed framework, the first step involved selecting subjects from the MJPD for the analysis. In the second step, candidate risk variables were chosen and the target variable was defined. Twenty-nine risk factors were used as predictor (independent) variables and sperm count was the target (dependent) variable. In the third step, the sperm count of each subject was identified. After the data were organized, the fourth step involved constructing prediction models for sperm count using the five ML techniques: RF, SGB, Lasso, Ridge, and XGBoost.
RF is a technique that integrates decision tree methods [40]. It randomly generates multiple different, unpruned decision trees, each of which is grown based on the Gini index, and integrates all the generated trees into a forest. It then averages the predictions of the trees in the forest, or takes a vote among them, to produce a stable ensemble model, thereby reducing the correlation between trees and the generalization error. SGB implements a combination of bagging and boosting [41,42] to generate numerous additive regression trees over multiple iterations. Each tree is trained on the residuals of the previous iteration [42]. The final number of additive regression trees is determined by reaching the maximum number of iterations or satisfying the convergence condition. Finally, the cumulative result of the trees is obtained by weighted summation to determine the final stable model.
Lasso is an extension of the conventional regression method. It uses the least absolute shrinkage and selection operator (L1 regularization) to reduce overfitting by forcing the coefficients that contribute less to the model to exactly zero, thereby obtaining a lower variance [43,44]. Ridge has the same basic concept as Lasso, with the main difference being that Ridge uses L2 regularization to shrink the coefficients in the model. Ridge adds an appropriate L2 penalty to the model so that all coefficients are shrunk toward zero but remain non-zero, and then minimizes the sum of squared errors to control the trade-off between bias and variance and reduce overfitting [45].
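The difference between the two penalties can be seen in a simplified one-coefficient setting. The sketch below is an illustration under stated assumptions, not the fitting procedure used in the study: it assumes a single standardized predictor whose unpenalized least-squares coefficient is `b_ols`, and the objectives written in the docstrings. The function names, the penalty strength `lam`, and the example coefficients are all hypothetical.

```python
def lasso_shrink(b_ols, lam):
    """Minimizer of 0.5*(b - b_ols)**2 + lam*abs(b): soft-thresholding."""
    magnitude = max(abs(b_ols) - lam, 0.0)
    return magnitude if b_ols >= 0 else -magnitude

def ridge_shrink(b_ols, lam):
    """Minimizer of 0.5*(b - b_ols)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return b_ols / (1.0 + lam)

# Hypothetical unpenalized coefficients and penalty strength.
ols_coefs = [2.0, -0.8, 0.3, -0.1]
lam = 0.5

lasso_coefs = [lasso_shrink(b, lam) for b in ols_coefs]
ridge_coefs = [ridge_shrink(b, lam) for b in ols_coefs]
```

With these toy values, the L1 penalty sets the two smallest coefficients exactly to zero, while the L2 penalty shrinks every coefficient by the same factor and leaves all of them non-zero.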

XGBoost is an optimized gradient-boosting decision tree method. It generates multiple decision tree models sequentially: each model is fitted to the residuals of the previous model, and a regularization term controls the complexity of each model. All the generated decision trees are eventually combined to improve the accuracy of the prediction [46].
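The residual-fitting idea shared by SGB and XGBoost can be sketched with one-split regression trees (stumps) on a single predictor. This is a deliberately reduced illustration: it omits the random subsampling of SGB and the regularization term of XGBoost, and the function names, learning rate, number of rounds, and toy data are assumptions made for the example.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression stump that minimizes squared error."""
    best = None
    # Skip the largest x so that the right-hand leaf is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Fit stumps sequentially, each to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps, mse_history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
        mse_history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return base, stumps, mse_history

# Hypothetical toy data with a non-linear shape.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [9.0, 4.0, 1.0, 0.0, 1.0, 4.0, 9.0, 16.0]
base, stumps, mse_history = boost(xs, ys)
```

Each round corrects part of what the previous rounds left unexplained, so the training error recorded in `mse_history` never increases from one round to the next.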
When constructing each ML model, the data were randomly divided into a training data set with 80% of the data and a test data set with 20% of the data. The training data set was used to tune the hyperparameters and validate the model using 10-fold cross-validation. The model with the best hyperparameters was then selected as the final model, and the importance of each variable was obtained from it. Finally, the predictive performance of the best model of each ML method was evaluated on the test data set. To verify the accuracy of the models generated, the performance of each model was measured using four key evaluation metrics: symmetric mean absolute percentage error (SMAPE), relative absolute error (RAE), root relative squared error (RRSE), and root mean squared error (RMSE) (Table 2).
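The data partitioning described above can be sketched as follows. This is a minimal illustration using only index lists; the function names, the random seed, and the way folds are formed are assumptions, since the study does not describe these details.

```python
import random

def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seed is an arbitrary assumption
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(indices, k=10):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, validation

# 1375 subjects, as in the study sample.
train_idx, test_idx = train_test_split_indices(1375)
cv_splits = list(k_fold_indices(train_idx, k=10))
```

Only the training indices enter the cross-validation folds, so the test set stays untouched until the final evaluation of each tuned model.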
Table 2. Equations of the performance evaluation metrics (SMAPE, RAE, RRSE, and RMSE). In the equations, ŷ_i and y_i represent predicted and actual values, respectively; n stands for the number of instances.
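For readers who want to compute the four metrics, the sketch below uses their commonly cited definitions. These are assumptions: the exact forms in Table 2 may differ, for example in whether SMAPE is multiplied by 100. The function names and toy values are hypothetical.

```python
import math

def smape(actual, predicted):
    """Symmetric mean absolute percentage error, expressed as a fraction."""
    # Undefined when an actual and predicted value are both zero.
    return sum(abs(p - a) / ((abs(a) + abs(p)) / 2)
               for a, p in zip(actual, predicted)) / len(actual)

def rae(actual, predicted):
    """Total absolute error relative to that of predicting the actual mean."""
    mean_a = sum(actual) / len(actual)
    return (sum(abs(p - a) for a, p in zip(actual, predicted))
            / sum(abs(a - mean_a) for a in actual))

def rrse(actual, predicted):
    """Root of squared error relative to that of predicting the actual mean."""
    mean_a = sum(actual) / len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted))
                     / sum((a - mean_a) ** 2 for a in actual))

def rmse(actual, predicted):
    """Root mean squared error, in the units of the target variable."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))

# Hypothetical toy values: a predictor that always outputs the actual mean.
actual = [10.0, 20.0, 30.0]
mean_only = [20.0, 20.0, 20.0]
scores = {f.__name__: f(actual, mean_only) for f in (smape, rae, rrse, rmse)}
```

Under these definitions, a predictor that always outputs the mean of the actual values scores RAE = RRSE = 1, which gives a reference point for reading such values.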
After constructing valid RF, SGB, Lasso, Ridge, and XGBoost predictive models, the fifth step involved obtaining the relative importance value that each method assigned to each predictor variable (risk factor) from its converged model. Within each method, the most important risk factor had an importance of 100 and the least important had an importance of 0.
In the sixth step, the importance values were integrated. Because the methods have individual characteristics, each generated different importance values for each predictor variable. To combine the advantages of these methods and obtain more stable results, the importance values were averaged across methods, and the averages were used to compare the predictor variables and identify those that were more important overall, thus improving stability and completeness. In the seventh step, a final analysis was performed and the results were discussed to reach the final conclusion.
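The scaling and averaging of importance values can be sketched as below. This is an illustration under assumptions: min-max rescaling is one plausible way to obtain the 0 to 100 range described above, and the model names, factor names, and raw values are made up rather than taken from the study.

```python
def scale_to_0_100(raw):
    """Min-max rescale one model's importances: largest -> 100, smallest -> 0."""
    low, high = min(raw.values()), max(raw.values())
    return {name: 100.0 * (value - low) / (high - low)
            for name, value in raw.items()}

def average_importance(per_model):
    """Average scaled importances across models and rank in descending order."""
    names = next(iter(per_model.values())).keys()
    averaged = {name: sum(scores[name] for scores in per_model.values())
                / len(per_model)
                for name in names}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)

# Hypothetical raw importances from two models (not study results).
raw_importances = {
    "model_a": {"factor_x": 0.50, "factor_y": 0.30, "factor_z": 0.10},
    "model_b": {"factor_x": 2.0, "factor_y": 8.0, "factor_z": 4.0},
}
scaled = {model: scale_to_0_100(raw) for model, raw in raw_importances.items()}
ranking = average_importance(scaled)
```

Rescaling before averaging puts the models on a common scale, so a model whose raw importances happen to be numerically large does not dominate the combined ranking.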

Results
We mainly targeted the younger health screening group for our study sample; therefore, the average age of the sample is relatively low (33.22 ± 4.36 years) and the descriptive statistics show that the study group consists of relatively young healthy and subhealthy individuals (Table 1). Although the study was based on a one-time semen analysis, the different ML algorithms allowed us to identify risk factors that may affect semen quality, which could contribute to the prevention of poor sperm quality in unmarried men. We used five ML techniques, RF, SGB, Lasso, Ridge, and XGBoost, to construct predictive models for sperm count. Each method was evaluated based on four performance indicators (SMAPE, RAE, RRSE, and RMSE); for all four, smaller values indicate better predictive performance. Table 3 compares the predictive performance of the five models. Ridge shows the best performance for SMAPE (0.530) and RAE (0.964), and Lasso shows the best performance for RRSE (1.005) and RMSE (52.608).
Overall, although the predictive performance of the ML algorithms differs slightly, the five models perform similarly well. The five ML methods use different concepts to obtain the variable importance of each risk factor. Therefore, to integrate the variable importance information generated by the methods and obtain more robust results, we average the importance values generated by the five methods for the same risk factor and rank the risk factors in descending order of average variable importance, which yields the top 10 important risk factors for predicting sperm count. To investigate the variables with greater clinical relevance, we focus on the top five important risk factors identified in this study, namely, ST, AFP, BF, SBP, and BUN.

Discussion
Both too-short and too-long sleep durations result in poor-quality semen [52]. Sleep disturbance is also associated with parameters indicating poor semen quality; men suffering from disturbed sleep show lower total sperm count, percentage of total and progressive motility, and percentage of morphologically normal spermatozoa compared to men enjoying high-quality sleep [53]. Sleep deprivation in rats increases stressful stimuli, which leads to the activation of the hypothalamus-pituitary-adrenal axis and causes elevated serum corticosteroid levels and decreased testosterone levels [54]. However, no difference in sperm count or sperm motility was found in this sleep-deprived animal model compared to the control groups. Therefore, whether sleep duration affects sperm quality through changing reproductive hormone levels or through different pathways affecting gene expression patterns related to spermatogenesis remains inconclusive.
Our study indicates that a shorter sleep duration has adverse effects on sperm count. It is possible that with a shorter sleep duration, reproductive hormone levels might be changed to a level that causes lower spermatogenesis. Further investigations into the link between sleep duration and sperm count are needed.
Alpha-fetoprotein is another risk factor identified by our established model. Few studies highlight this link between AFP and semen quality. In experiments with cryptorchid mice, AFP is specifically expressed in spermatocytes and secreted into the circulation [55]. Injection of AFP into the seminiferous tubules of normal mice could block spermiogenesis, the final step of spermatogenesis. A recent study found high serum AFP in male patients with aberrant sperm counts [56].
However, some of these studies were based on injecting AFP into the seminiferous tubules of animals, and the resulting AFP concentration is likely to be much higher than that found in healthy men. In our current study, we find a positive relationship between AFP and male sperm count. We suspect that there may be a U-shaped relationship between AFP and sperm count, meaning that both too-low and too-high levels have negative impacts on sperm count. AFP may nevertheless be required for maintaining a normal sperm count, and more studies are needed to clarify its relationship with male sperm count.
BF, SBP, and other factors in our top 10 list of risk factors (BMI, C/H, T-Cho, and WHR) are related to metabolic syndrome, which has become a global epidemic. Metabolic syndrome has been linked to male infertility and poor semen quality [57], and many studies show that reproductive hormones are altered in males with the syndrome [58][59][60]. Our results support the view that more severe metabolic syndrome has an adverse effect on sperm count.
In the case of BUN, the fifth risk factor in our ranking, no investigations to date have been performed to find its direct link with male fertility or semen quality. However, chronic kidney disease (CKD) has been found to be associated with poor semen quality by affecting spermatogenesis and sperm motility [61]. The link between CKD and semen quality could be multifactorial. Most of these studies were based on the analysis of advanced CKD or patients under hemodialysis. However, in relatively healthy male patients, higher BUN levels seem to have a negative effect on sperm count. Therefore, the link between elevated BUN and sperm count in the healthy population or prior to the development of CKD requires further detailed study.
In summary, the established ML model successfully reproduces the findings of previous studies that sleep duration, BF, SBP, and BUN negatively affect sperm count. AFP is a lesser-known risk factor and more studies are needed to identify its relationship with male sperm count.

Limitations
This study was a cross-sectional study investigating the links between health examination data and sperm count of middle-aged males in Taiwan. The participants included 686 (49.89%) married and 619 (45.02%) unmarried males. Our study used five ML methods to analyze the risk factors affecting sperm count in healthy males, and we ranked these risk factors according to their importance in affecting sperm count. Because our study was based on a single semen analysis, it does not truly reflect the participants' fertility, which would require multiple semen analyses at different time points. With enough participants, a cross-sectional study could more comprehensively identify risk factors linked to sperm count changes. In addition, ML enables the analysis of non-linear relationships and complex interactions between multiple predictor variables in this study. All of the top five risk factors except AFP have a negative impact on male sperm count. AFP shows a positive influence on male sperm count; however, there may be a U-shaped relationship between AFP and sperm count. AFP may be necessary for maintaining sperm count, yet both too much and too little may have adverse effects on sperm production. To test this hypothesis, more sophisticated algorithms are needed to identify such U-shaped relationships with sperm count.

Conclusions
Using health screening data from 1375 males in Taiwan, the established ML models identify several risk factors affecting male semen quality. Some of these risk factors are consistent with previous results and have been thoroughly studied. Specifically, ST is recognized by the different algorithms and is the highest-ranking risk factor after sorting. Once a country becomes developed, late marriage and a low birth rate become important problems that need to be addressed. Based on our study and previous research, a regular lifestyle and sufficient sleep duration are strongly suggested to improve semen quality and indirectly decrease the risk of male infertility.
The different algorithms in this study found sleep time to be the most important variable for predicting semen quality after joint ranking. Most residents of cities in developed countries, with a similar demographic and economic environment to that of Taiwan, tend to marry late and have fewer children. In view of the preliminary results of this study and its corroboration of the findings of previous investigations, we suggest that the relevant government departments or health authorities in Taiwan should promote appropriate health information to the male population of reproductive age and advocate normal workloads and sufficient sleep and rest. This may help to avoid the risk of decreased sperm count or an indirect negative impact on male fertility.

Data Availability Statement: All of the datasets were collected from the MJ Health Research Foundation. Use of the data requires application and authorization; the application procedures can be accessed at http://www.mjhrf.org/main/page/release1/en/#release01 (accessed on 1 October 2022).

Conflicts of Interest:
The authors declare no conflict of interest.