Comprehensive analysis of individual variation in the urinary proteome revealed significant gender differences

Abbreviations: 2DLC-MS/MS, two dimensional liquid chromatography - tandem mass spectrometry; ACPP, prostatic acid phosphatase; ANXA1, annexin A1; APOD, apolipoprotein D; AUC, area under curve; AZGP1, zinc-alpha-2-glycoprotein; AZU1, AZU1 protein; BMI, body mass index; CI, confidence interval; CV, coefficient of variation; eGFR, estimated glomerular filtration rate; FABP5, fatty acid binding protein 5; FASP, filter-aided sample preparation; IGKC, Ig kappa chain C region; KNG1, kininogen-1; LC −MS/MS, liquid chromatography - tandem mass spectrometry; MRM, multiple reaction monitoring; MSMB, beta-microseminoprotein; PERMANOVA, permutational multivariate analysis of variance; PSA, prostate specific antigen; PTGDS, prostaglandin-H2 D-isomerase; QC, quality control; RI, reference interval; ROC, receiver operation curve; S100A9, protein S100-A9; SPP1, osteopontin; SPRR3, small proline-rich protein 3; TC, total cholesterol; TG, triglyceride; TIMP1, tissue inhibitor of metalloproteinase-1; TXN, thioredoxin; UMOD, uromodulin; VMO1, vitelline membrane outer layer protein 1 homolog


Introduction
alignments were automatically conducted between the runs of the same offline HPLC fractions based on nonlinear mapping of the extracted features. Peptide identifications were then transferred between the aligned features. The abundances of peptide ions based on the robust estimation of the mean log peak area ratio of all peptide ions in the target run were normalized against an automatically selected reference run. Only features of peptide ions with charge states from +2 to +5 were selected for quantification. Protein abundances were calculated by the sum of the normalized abundances of their unique peptides.

MRM experiments and data analysis
Data derived from a spectral library of the normal urine proteome generated by conventional 1D LC-MS/MS and 2D LC-MS/MS using HCD collision were imported into Skyline version 3.5 [33]. Skyline was employed to manually select the most intense peptide transitions. Up to five transitions per peptide were traced on a QTRAP 6500 mass spectrometer (AB Sciex). Peptides with potential modification sites (cysteine and methionine) and those with missed cleavage sites were excluded. A total of 1-3 peptides from one protein were selected for quantification. All samples were loaded onto a self-peak area of all selected transitions.
To observe the stability of the MS signal, the mixed peptide sample was used as a QC sample and analyzed 3 times every two days during the process of MS analysis.

Estimation of absolute protein concentrations and reference intervals
First, a protein MS abundance measurement similar to iBAQ [34] was calculated for each protein. The abundance was defined as (protein molecular weight) * (summed peptide peak areas) / (number of theoretical peptides that could be detected by mass spectrometry). As reported in an epidemiological study, the mean concentration of urinary albumin among healthy 40-59-year-old Chinese individuals was approximately 2.675 mg/L [35]. A conversion factor between the protein MS abundance measurement and its absolute concentration was then calculated by dividing 2.675 by the mean albumin abundance among samples within the same age range. Consequently, the concentration of each protein in each sample could be estimated by multiplying its MS abundance measurement by the conversion factor. For each protein, the 2.5% and 97.5% quantiles of concentration values were calculated to estimate the 95% reference interval of protein concentrations in the normal population.

Evaluation of the effects of potential influencing factors on urinary proteome
Hierarchical clustering was employed to evaluate and visualize the overall similarities between samples. This analysis was performed on log2 transformed protein abundance data using the Pearson correlation coefficient for similarity calculation and the Ward agglomeration method. Only proteins with the biological coefficients of variation (CVs) above the median were used for clustering. PERMANOVA by guest on April 28, 2019 http://www.mcponline.org/ Downloaded from (permutational multivariate analysis of variance) was employed to test whether there was a significant effect of each potential influencing factor on the urinary proteome.
All tests were performed with the adonis2 function in the vegan package of R using the same distance matrix as in the hierarchical clustering, and the number of permutations was set to 999.

Protein co-expression network analysis
Network analysis was performed to identify co-expressed urinary protein clusters in different individuals. First, the Spearman correlation coefficients between proteins were calculated, and tests of significance were conducted. P values were then corrected for multiple hypothesis testing using the method described by Hochberg [36]. A protein co-expression network was constructed based on the correlation analysis. In this network, each vertex denotes a protein; two vertices were linked by an edge if and only if the correlation between the two proteins was significant (adjusted p-value < 0.05), and the weight of this edge was determined by the absolute value of the correlation coefficient of the two proteins. Densely connected subnetworks were discovered by the fast greedy community detection algorithm in the iGraph package of R [37, 38]. Each subnetwork was required to contain at least 7 vertices and contain more edges than vertices. To identify hub proteins, Kleinberg's hub centrality scores considering edge weights were calculated for all vertices in a subnetwork [39]. For the convenience of visualization, the high dimension protein abundance information in each subnetwork was reduced into a one-dimensional array (an eigengene in co-expression network analysis) by principal component analysis. In detail, the raw protein abundances were by guest on April 28, 2019 http://www.mcponline.org/ Downloaded from log2 transformed and scaled to a mean of 0 and standard deviation of 1. Then, the eigengene for each subnetwork was calculated as its first principle component, which explained 49.65% to 75.59% of the total protein variance. Thus, the major information of protein abundance distribution in a subnetwork could be represented by the distribution of its eigengene.

Identification of differentially expressed proteins between male and female urine samples
To determine the gender-related proteins in urine, a t-test with the assumption of unequal variances was performed on the log transformed abundance value for each protein. Adjusted p values for multiple hypothesis tests were then calculated using the Benjamini-Hochberg method [40]. Proteins with an adjusted p value less than 0.05 and a fold change over 1.5 were considered to be significantly differentially expressed between the two groups.

Discriminate model
In the training set (Set I), the peak area of each candidate protein was log2 transformed and then standardized (mean = 0 and standard deviation = 1). The LASSO logistic regression model was performed with the glmnet package in R [41]. Ten-fold cross-validation was used to determine the lambda value for the LASSO model. The lambda value was the largest value with an error within 1 standard error of lambda that gives the minimum mean cross-validated error. The fitted "urinary gender model" included a panel of proteins that perfectly separated male and female samples. In the test set (Set II), the abundance of each protein was calculated as the weighted sum of by guest on April 28, 2019 http://www.mcponline.org/ Downloaded from peptide peak areas. The weight of a peptide was defined as its median peak area among all samples divided by the sum of median peak areas of the peptides belonging to the same protein. The protein abundances were log2 transformed and standardized in the same way as the training set. The performance of the "urinary gender model" was evaluated by the ROC curve with the pROC package in R [42].

Functional analysis
The subcellular localization of proteins was annotated by the IPA software (Ingenuity Systems, Mountain View, CA). The significance of enrichment for each subcellular localization term was then calculated by one-sided Fisher's exact test, using all urinary proteins identified in this study as the background. Overrepresentation analysis of protein functions and pathways was performed by IPA and Reactome [43], respectively.

Data availability
All raw MS data are stored at the iProx (www.iprox.org) repository (Project ID: IPX0001256000).

Protein identification and label-free quantification
Urine samples from 49 healthy donors aged 20-69 years in sample Set I were comprehensively analyzed by 2D LC-MS/MS. Donors were assigned to five categories based on age, and each category spanned ten-year interval. Thus, the whole sample set was classified into ten groups based on donor age and gender. A total of 3008 proteins were identified with a protein FDR of <1% and with at least two unique peptides. For A total of 1872 proteins were quantifiable (Supplemental Table 2b). A total of 1738 of these with quantitative data in more than half of the samples were selected for further analysis. Technical variation of this analysis was evaluated by calculating the CV of the protein abundances among the five QC replicates. The median and 75% quantile of technical CVs were 0.23 and 0.39, respectively. Considering that the whole mass spectrometry analysis process included 432 LC-MS/MS runs and lasted for 60 days, the technical reproducibly with the label-free quantification used in this analysis is acceptable.

Gender-related urinary proteins
A total of 35 and 76 proteins that were significantly increased in the male and female urine samples, respectively, with changes larger than 1.5-fold and p values <0.05 were identified (Supplemental Table 3 and Figure 5a). This gender-related protein list includes 6 proteins in Subnetwork No. 7 and 39 proteins in Subnetwork No. 2.
First, proteins that might originate from the reproductive system were analyzed. The molecular functions of the gender-related proteins were analyzed by IPA ( Figure 5c). An interesting phenomenon is that the proteins that were increased in the different gender groups showed an opposite effect on cell invasion (Supplemental Figure 2). In the group of proteins that were increased in males, 4 of the 5 proteins related to cell invasion were predicted to have an inhibitory effect, whereas in the group of proteins that were increased in females, 10 of the 13 proteins related to this function showed a potential activating effect. The two protein groups also display opposite effects on apoptosis, which was predicted to be inhibited by the proteins that were increased in males and slightly activated by the proteins that were increased in females.
Additionally, proteins with several molecular functions, such as the migration of cells, catabolism of protein, and cross-linking of peptides, were significantly enriched only in the group of proteins that were increased in females.
Pathway analysis was further performed by the online analysis tool of the REACTOME database (Figure 5d). The immune system is the most significantly enriched category, containing nearly half (36/76) of the proteins that were increased in females. The markedly enriched immunological pathways included neutrophil degranulation, which contained 24 and 6 proteins that were increased in females and males, respectively, and antimicrobial peptides, containing 9 proteins enriched in females. The differential proteins in urine might be a reflection of the gender-related immunological differences across the whole body. The second enriched pathway category is keratinization, which contains 15 proteins that were increased in females that are specifically involved in the formation of the cornified envelope. This observation might suggest more shed keratinocytes in women's urine than in men's urine.
Nine proteins with interesting functions were chosen for further validation ( Table   2). An independent set of urine samples from 85 donors (Set II) was employed for validation by the MRM approach. The result of the validation analysis is shown in illustrating the discriminant power is displayed in Figure 6c. The majority of the samples were correctly classified except for three outliers.

Reference intervals of urinary proteins
In addition to gender, many other factors might also influence the urinary proteome. The reference interval for each protein provides a baseline to discover abnormalities in urinary biomarker analysis. Therefore, absolute concentrations of urinary proteins were estimated based on the mean value of urinary albumin concentration in the Chinese population. The 95% reference interval for each protein was calculated (Supplemental Table 5). Table 3 lists the concentrations of the top 10 most abundant proteins in the normal urinary proteome. None of these proteins had significant differences in abundance in female and male urine. Most of these proteins are extracellular glycoproteins existing in plasma. Among them, uromodulin (UMOD) is the only one that is exclusively secreted by the kidney.

Discussion
The characterization of the normal urine proteome provides fundamental information for the diagnosis and monitoring of diseases. In this study, we found that in addition to age, gender is a crucial factor that contributes to individual variation. The However, none of the other proteins have been observed to be highly expressed in the kidney. The functional associations of proteins in this subnetwork remain to be discovered.
Finally, we estimated reference intervals for 1640 urinary proteins quantified in Set I among normal males and females providing a baseline for detecting abnormalities in urine. We previously used the same approach to determine the absolute concentration of 2571 urinary proteins in a pooled sample and showed a medium correlation with immunoassay results [3]. It is an interesting phenomenon that UMOD exhibited much broader intervals than the high abundant proteins that originated from the plasma. This suggests that the levels of renal secretion are variable among healthy donors in our study. We hope that the omic-scale reference interval would be helpful for urinary proteome clinical applications and related research.             AF1  AF2  AF3  AF4  AM5  AM6  AM7  AF8  AF9  AF10  BF1  BF2  BF3  BF4  BF5  BM6  BM7  BF8  BM9  BM10  CM1  CM2  CF3  CM4  CF5  CF6  CM7  CF8  CM9  CM10  DF1  DM2  DM3  DF4  DF5  DF6  DF7  DF8  DF9  DM10  EM1  EF2  EM3  EF4  EM5  EM6  EM7  EF8  EF9  mixA  mixB  mixC  mixD  mixE   0