Construction and external validation of a 5-gene random forest model to diagnose non-obstructive azoospermia based on the single-cell RNA sequencing of testicular tissue

Non-obstructive azoospermia (NOA) is among the most severe factors for male infertility, but our understandings of the latent biological mechanisms remain insufficient. The single-cell RNA sequencing (scRNA-seq) data of 432 testicular cells isolated from the patient with NOA was analyzed, and the cell samples were grouped into 5 cell clusters. A sum of 455 cell markers was identified and then included in the protein-protein interaction network. The Top 5 most critical genes in the network, including CCT8, CDC6, PSMD1, RPS4X, RPL36A, were selected for the diagnosis model construction through the random forest (RF). The RF model was a strong classifier for NOA and obstructive azoospermia (OA), which was validated in the training cohort (n = 58, AUC = 1) and external validation cohort (n = 20, AUC = 0.9). We collected the seminal plasma samples and testicular biopsy samples from 20 OA and 20 NOA cases from the local hospital, and the gene expression was detected via Real-Time quantitative Polymerase Chain Reaction (RT-qPCR) and Immunohistochemistry. The RF model also exhibited high accuracy (AUC = 0.725) in the local cohort. In summary, a novel gene signature was developed and externally validated based on scRNA-seq analysis, providing some new biomarkers to uncover the underlying mechanisms and a promising clinical tool for diagnosis in NOA.


INTRODUCTION AGING
of OA, while intracytoplasmic sperm injection (ICSI) and testicular sperm extraction (TESE) are more recommended for NOA [6]. Hence, the differential diagnosis of OA and NOA is of great significance because it is directly related to the choice of treatment methods [7].
With the rise and advancement of gene sequencing and big-data analysis, the development of gene-based models for disease diagnosis has attracted increasing attention. These genetic models acted as valuable tools to guide clinical practice and provided potential clues for investigating pathogenesis [8,9]. Among the highthroughput sequencing methods, single-cell RNA sequencing analyzed the transcriptomics at single-cell resolution, assessing the cell heterogeneity and diversity with high efficacy [10]. The scRNA-seqbased models have been successfully established in various diseases, such as bladder cancer [11], pancreatic ductal adenocarcinoma [12], and skin cancer [13], showing the tremendous advantages of scRNA-seq to achieve greater understandings of disease initiation and progression. However, no genetic diagnostic model for NOA based on scRNAseq has been constructed.
The present study analyzed the scRNA-seq data of 432 testicular cells isolated from the patient with NOA and screened the marker genes among different cell clusters. Subsequently, a protein-protein interaction (PPI) network was established, and the hub genes of the network were identified. The random forest algorithm was utilized for diagnostic model construction for NOA, and two independent NOA datasets were downloaded from Gene Expression Omnibus (GEO) as the training and external validation cohorts, respectively. The collected samples from The Third Affiliated Hospital of Southern Medical University were also utilized for validation through Real-Time quantitative Polymerase Chain Reaction (RT-qPCR).

Data collection
The scRNA-seq matrix of 432 testicular cells from an NOA patient (GSE157421) was directly downloaded on GEO (https://www.ncbi.nlm.nih.gov/geo/). GSE9210, including 11 OA and 47 NOA samples, and GSE145467, including 10 OA and 10 NOA samples, were also obtained as the training and external validation datasets, respectively. GSE9210 and GSE145467 were both Agilent microarray experiments for human testicular tissues. The probe IDs were converted into gene symbols using R software (version 4.1.0).

Processing of scRNA-seq data
The Seurat package of R was used to normalize the scRNA-seq data and to perform the quality control [14]. The filtering criteria were set as follows: nFeature_RNA > 50 and percent.mt < 5, which meant the cells with detected gene numbers ≤ 50 and the proportion of mitochondria ≥ 5% were excluded from the present study. The Top 10 genes exhibiting the most variable among the cell samples were identified with the FindVariableFeatures function of Seurat. Subsequently, the cell samples clustering was conducted via principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) based on the Top 1500 most variable genes. The markers genes of various cell clusters were screened with |logarithmic fold change [logFC]| > 0.8 and adjusted P < 0.05. The cell types were annotated via the SingleR and celldex packages. The monocle package was adopted for pseudotime analysis, which re-verified the correctness of the cell type annotation.

Construction of the protein-protein interaction network
The cell markers extracted from the scRNA-seq analysis were then used to establish a protein-protein interaction (PPI) network to identify the possible hub genes associated with the pathogenesis of NOA through the STRING database (https://string-db.org/). The confidence score was set to 0.9 to ensure the reliability of the established network as possible. The cytoHubba plug-in of Cytoscape software (version 3.8.0) was used to measure the importance of the genes in the network via degree algorithm.

Functional enrichment
Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment were conducted to annotate the biological functions of the genes in the network through the clusterProfiler R package. The gene sets with P < 0.05 and Q < 0.05 were considered to be statistically significant.

Development of a random forest model
The Top 5 core genes in the network were chosen as the variables for model construction. The diagnostic model was developed through the randomForest R package. Ntree =500 and mtry=3 was set as the arguments for the random forest. Mtry is defined as the number of variables sampled per iteration and ntree refers to the number of decision trees contained in the random forest. The pROC package was utilized to draw the receiver operating curves (ROCs) and to evaluate the 95% confidence interval (CI) of areas under curves (AUCs) based on 2000 bootstrap sampling, which were applied to measure the random forest's predictive performance in the training and external validation datasets. The confusion matrices were visualized via R software. Mean Decrease Accuracy and Mean Decrease Gini were used to calculate the importance of the variables in the random forest model, which were positively associated with the importance. Mean Decrease Accuracy meant the degree of reduction in the accuracy of random forest prediction after changing the value of a variable into a random number, and Mean Decrease Gini meant the influence of each variable on the heterogeneity of observations at each node of the classification tree [15]. We also compared the AUCs of the single gene and the 5-gene RF model via Delong' test to check whether the AUCs have been significantly altered. The expression divergence of the hub genes in each cell cluster was visualized via a bubble plot and a scatter plot with the Seurat package.

Clinical sample collection
This study protocol was approved by the Medical Ethics Committee of The Third Affiliated Hospital of Southern Medical University, and written consent was obtained from all patients. The patients with OA and NOA between October 2019 and September 2021 were enrolled in this study, and the diagnosis of OA or NOA relied on the testicular biopsy. The biopsy samples were immediately fixed with 4% paraformaldehyde (ThermoFisher Scientific, China) overnight, embedded in paraffin, and sectioned 8-10 μm thick. Age, Johnsen's Score, follicle-stimulating hormone (FSH), luteinizing hormone (LH), and testosterone (T) of the cases were also collected.
All the study subjects were abstinent for 3-5 days before the semen collection. The semen samples were obtained by masturbation. The semen was liquified for 20-30 minutes at room temperature. The seminal plasma was collected by centrifuging the semen at 4° C at 10000 x g for 10 minutes, and the precipitate was discarded. The seminal plasma was stored at -80° C for further study.

RT-qPCR
The total RNA was extracted with the Trizolchloroform method (Trizol reagent, Invitrogen, USA) after keeping the seminal plasma gently thawed on ice. The cDNA was synthesized with PrimeScript RT Reagent Kit (Takara, China) and amplified by SYBR Premix ExTaq kit (Takara, China) following the manufacturer's recommendations. The qualification of the RNA expression was based on ABI 7600 system (Applied Biosystems, USA). GAPDH was chosen as the internal reference gene. The 2−ΔΔCt methods were utilized to calculate the gene expression value. All the PCR experiments were repeated three times. The primer sequence was synthesized by the TSINGKE company (Guangzhou, China), as shown in Table 1.

Immunohistochemistry
The slides were washed with xylene and added to the ethanol as follows: 100% ethanol for 4 minutes; 90% ethanol for 4 minutes; 80% ethanol for 4minutes; 70% ethanol for 4 minutes. The sections were repaired in antigen repair solution (ThermoFisher Scientific, China) for 10 minutes at 95° C. 5% bovine serum albumin (BSA) in phosphate buffered saline (PBS) was used to block the non-specific binding sites for 1 hour. During the immunohistochemical staining with RPS4X (1:100, Proteintech, China), the slides and the antibody were incubated for 2 hours at room temperature in a humidified chamber. After the slides were washed 3 times with PBS, the anti-rabbit secondary antibody (Proteintech, China) was added to the slides for 1 hour at room temperature. Images were acquired by standard microscopy (Nikon Eclipse 90i, Nikon, Japan). The gray-scale of the images were analyzed according to the integral optical density (IOD), which was calculated by Image-Pro Plus (version 6.0, Media Cybernetics, USA).

Statistical analysis
The statistical analyses were based on R software (version 4.0.3, R core team) and GraphPad Prism 8 (version 8.4.3, GraphPad, USA). All the data of this study was presented as mean ± standard deviation (SD). The two-tailed Student's t-test was performed for the variance detection for the RT-qPCR and immunohistochemistry data, and P < 0.05 was considered as statistically significant. Welch-corrected t-test was utilized to compare the difference of age, Johnsen's Score, FSH, TH, and T between OA and NOA cases. Delong's test was conducted to compare the AUCs of different ROCs by means of roc.test function of R pROC package. *P < 0.05, **P < 0.01, ***P < 0.001.

Identification of cell markers via scRNA-seq analysis
The workflow of the present study was shown in Figure 1. First, the scRNA-seq data of an NOA patient's testicular sample was analyzed, and a sum of 432 testicular cells was acquired. The quality control of the detected gene numbers, gene sequencing count, and the percent of mitochondrial genes was indicated in Figure 2A. The percent of mitochondrial genes was  www.aging-us.com AGING negatively associated with detected gene counts (Pearson r = -0.53, Figure 2B); meanwhile, the high positive correlation between sequencing depth and detected gene counts was found (Pearson r = 0.94, Figure 2C). The Top 10 genes, including HIST1H4C,  MT-RNR2, PRM1, COX1, TEX101, ND4, ND4L, and  PRM2, showing the most significant expression difference among all cell samples, were revealed in Figure 2D. Subsequently, PCA was conducted to preliminarily classify the cell samples ( Figure 2E), and the P-value distribution in each principal component (PC) was shown in Figure 2F. The Top 20 and Top 30 genes associated with PC1-4 were illustrated in the dot plot (Supplementary Figure 1A) and the heat map (Supplementary Figure 1B), respectively. With the t-SNE dimension-reduction algorithm, 432 testicular cells were divided into 5 different cell clusters ( Figure 2G). Cell cluster 1 was annotated as induced pluripotent stem (iPS) cells, and the remaining 4 cell clusters were all annotated as gametocytes ( Figure 2H). Ultimately, a total of 456 cell markers were identified with the limma package (Supplementary Table 1). The heat map displayed the expression level of Top 10 differentially expressed genes (DEGs) among the cell clusters ( Figure 2I).

Cell trajectory analysis
In scRNA-seq analysis, the correct cell type annotation has always been a difficult point. We conducted the pseudotime analysis to confirm whether the cell type annotation was right. As shown in Figure 3, iPS cells gradually differentiated into gametocytes over time, which was reasonable and logical. The cell trajectory analysis validated the annotation results.

PPI network construction and functional enrichment
Compared with the DEGs extracted from the tissue with different statuses, such as OA and NOA, the DEGs between different cell clusters, also known as cell markers, could reflect the pathogenesis with a higher resolution. Hence, the cell markers from the scRNA-seq analysis were then used to construct the PPI network.
With the confidence score > 0.9 filtering, 30 genes were included in the network, as displayed in Figure 4A. The Top 10 hub genes ranked by degree were shown in Figure 4B and Supplementary Table 2. KEGG ( Figure  4C) and GO ( Figure 4D) functional annotation indicated the genes in the PPI network were mostly enriched in ribosome, tight junction, DNA replication, and many other critical pathways involved in cell activities.

Establishment and validation of a random forest model
Here, we detected the diagnostic value of the hub genes of NOA from the scRNA-seq and PPI analyses. Compared with the single gene, a multi-gene combination would be more potent for prediction, which has been demonstrated in many previous studies [16]. With the rapid development of computer technology, machine learning is increasingly applied to disease diagnosis [17,18]. Hence, we implemented random forest, a widely used and powerful machine learning algorithm, to construct the diagnosis model [19,20]. GSE9210 was set as the training dataset, and AUC of the random forest model was 1.000 (95% CI = 1.000-1.000), as displayed in the ROC ( Figure 5A) and confusion matrix ( Figure 5B). The performance of the diagnostic model in the external validation was also favorable with the AUC = 0.900 (95% CI = 0.769-1.000). Figure 5C, 5D showed the ROC and confusion matrix of the established model in the external validation cohort.
The diagnostic value of the genes in the diagnostic model was also detected. First, the Mean Decrease Accuracy ( Figure 6A) and Mean Decrease Gini ( Figure  6B) of each gene were calculated, and RPS4X was found to serve as the most important variables in the random forest model. Besides, ROC analyses indicated the AUCs of RPS4X were 0.932 and 0.920 in the training and external validation datasets, respectively, suggesting RPS4X was a promising biomarker for NOA ( Figure 6C, 6D). In addition, Delong's test between the AUCs of the 5-gene RF model and the single gene indicated that RPS4X and RPL36A were important variables in the model, as displayed in Table 2. The expression level of RPS4X in human testicular samples from NOA and OA patients was also detected via immunohistochemical staining. It was found that RPS4X was significantly up-regulated in the testes of 20 NOA patients compared with that in 20 OA patients, as displayed in Figure 6E.
In addition, the expression level of the hub genes in the cell clusters was also compared. As indicated in Figure 7A, 7B, RPS4X was significantly up-regulated in cell cluster 1, which was annotated with iPS cells, implying RPS4X might exert their pathogenetic functions during the differentiation of iPS cells into gametocytes.    Table 3. The merge of multiple groups with different means and SDs from GSE9210 cohort into one group was conducted via an online tool (http://www.obg.cuhk.edu.hk/ResearchSupport/StatTool s/CombineMeansSDs_Pgm.php). We analyzed the mRNA expression level of CCT8, CDC6, PSMD1, RPL36A, and RPS4X, which comprised the RF model, in the seminal plasma of OA and NOA patients. It was found that CCT8 (P < 0.01, Figure 8A) and CDC6 (P < 0.05, Figure 8B) were significantly upregulated in the OA samples, while PSMD1 (P < 0.05, Figure 8C), RPL36A (P < 0.01, Figure 8D), and RPX4X (P < 0.05, Figure 8E) were obviously decreased in NOA patients' seminal plasma. Table 4 indicated the AUCs and corresponding 95% CI of each gene in GSE9210 cohort, GSE145467 cohort, and local cohort.
The RF model was also a promising classifier in the seminal plasma from the ROC analysis (AUC = 0.725, 95% CI = 0.589-0.861, Figure 8F). Figure 8G displayed the confusion matrix of the model in local cohort. The accuracy, sensitivity, specificity, positive predictive value, and negative predictive value of the RF model in each cohort were shown in Table 5.

DISCUSSION
NOA includes changes in spermatogenesis caused by various hypothalamic and pituitary diseases, as well as primary spermatogenesis failure caused by different etiologies, and usually has a poor prognosis [21]. Such patients have no obvious signs of obstruction in the ultrasound examination of the reproductive system, but the obvious feature is that the patient's testicular volume is often small and cannot produce sperm or produce very few sperm. Patients with OA can often be diagnosed with ultrasound of the reproductive system, but some azoospermia patients exhibited both spermatogenic dysfunction and reproductive tract obstruction, which is known as mixed azoospermia. Therefore, imaging methods such as the ultrasound are far from sufficient to confirm the diagnosis of NOA. Many biomarkers associated with NOA have been discovered, such as folliclestimulating hormone [22], serum inhibin B [23], and anti-Mullerian hormone [24], but more studies on the biomarkers are helpful for clinicians to achieve a more precise diagnosis.
In addition, from the perspective of pathogenesis, although many theories have been proposed to explain the pathogenesis of NOA, our understandings of the biological processes associated with NOA remains insufficient. Nowadays, the widespread applications of gene sequencing, especially scRNA-seq, have deepened the knowledge of NOA [25,26]. For instance, Wang et al. had disclosed the unique role of autophagy homeostasis in the spermatogenesis of NOA cases through scRNA-seq analysis [27]. Liu et al. utilized scRNA-seq to detect the genetic change of ACE2 in testicular cells of normal and NOA patients, uncovering the possible mechanisms of how SARS-CoV-2 affected testicular cells [28]. These studies strongly demonstrated the usefulness of scRNA-seq for NOA's mechanism detection. However, the scRNA-seq-based diagnostic model for NOA has not been reported.
Here, the scRNA-seq data of the testicular cells extracted from an NOA patient was analyzed. The cell markers, which were defined as the DEGs among different cell clusters, were used to construct a PPI network. GO and KEGG enrichment indicated the genes in the network were mainly involved in the cell cyclerelated pathways, such as DNA replication, translational initiation, and regulation of G2/M transition of mitotic cell cycle. Subsequently, the Top 5 hub genes in the PPI network were chosen for diagnostic model development. GSE9210 and GSE145467 were utilized to construct and externally validate the predictive model, respectively, and a sum of 78 cases was enrolled, including 57 NOA and 21 OA patients. We also collected the seminal plasma samples of 20 OA and 20 NOA patients from The Third Affiliated Hospital of Southern Medical University, and detected the diagnostic efficacy of the RF model via RT-qPCR. Another important highlight of the research was that the random forest algorithm, a dimension reduction machine learning technique, was adopted for predictive model construction. The ROC analyses in the training cohort (AUC = 1.000), external validation cohort (AUC = 0.900), and local cohort (AUC = 0.725) demonstrated the feasibility and effectiveness of the strategy.
Some novel biomarkers for NOA were also screened. Ribosomal Protein S4 X-Linked (RPS4X) was essential for the formation of cytoplasmic ribosomes and participated in the initiation and progression of multiple diseases [29][30][31]. RPS4X acted as the strongest predictor in the random forest model with the highest Mean Decrease Accuracy and Mean Decrease Gini. ROCs also indicated that RPS4X was a promising diagnosis biomarker for NOA both in the training (AUC = 0.932) and external validation (AUC = 0.920) cohorts. RPS4X was significantly up-regulated in the iPS cells of testicular tissue isolated from the patient with NOA. AGING All the evidence suggested RPS4X played an important role in NOA. However, the association between RPS4X and NOA has never been reported, and how RPS4X regulated spermatogenesis of NOA patients remains unclear. Ribosomal Protein L36a (RPL36A) was also played an important role in the exertion of ribosomal function. The association of RPL36A with infertility has been reported. Selvaraju et al. has found RPL36A was up-regulated in the highfertile bulls' sperm, but the roles of RPL36A in human infertility are still unknown [32]. Overall, our findings helped to identify novel biomarkers, providing the possible cut-in for further elucidation of the mechanisms in NOA.   The limitations of the present study should be acknowledged. First, the research is retrospective, and a large-scale, multi-center, and prospective clinical trait would be beneficial to confirm the usefulness in clinical practice. Second, several novel biomarkers were identified, but their biological functions in NOA are unknown, and a series of experimental exploration ought to be conducted.
In this paper, we presented a random forest diagnosis model to distinguish NOA from OA, which was based on scRNA-seq analysis and externally validated, providing novel insights into the underlying mechanisms of NOA.

AUTHOR CONTRIBUTIONS
CDL designed the whole study and provided financial support. RRZ developed the algorithm and drew the plots. XYL wrote the original draft and performed the RT-qPCR experiments. TLC collected the human testicular biopsy samples and conducted the immunohistochemical staining. QC collected the human semen samples. HT, CY, and WBG did help to editing and reviewing.

CONFLICTS OF INTEREST
The authors declared that they have no conflicts of interest.