Sewage Protein Information Mining: Discovery of Large Biomolecules as Biomarkers of Population and Industrial Activities

Wastewater-based epidemiology has emerged as a powerful approach for surveying the health and lifestyle of a population. In this context, proteins have been proposed as potential biomarkers that complement the information provided by currently available methods. However, little is known about the range of molecular species and the dynamics of proteins in wastewater, and the information hidden in these protein profiles is still to be uncovered. In this study, we investigated the protein composition of wastewater from 10 municipalities in Catalonia with diverse populations and industrial activities at three different times of the year. The soluble fraction of this material was analyzed by liquid chromatography high-resolution tandem mass spectrometry using a shotgun proteomics approach. The complete proteomic profile, distribution among different organisms, and semiquantitative analysis of the main constituents are described. Excreta (urine and feces) from humans, and blood and other residues from livestock were identified as the two main protein sources. Our findings provide new insights into the characterization of the wastewater proteome that allow for the proposal of specific bioindicators for wastewater-based environmental monitoring. This includes human and animal population monitoring, most notably for rodent pest control (immunoglobulins (Igs) and amylases) and livestock processing industry monitoring (albumins).


Sample collection
Twenty-four-hour composite wastewater samples were collected at the inlets of 10 wastewater treatment plants (WWTPs) located in the Girona and Barcelona provinces in Catalonia (Supplementary Figure 1). An automatic water sampler was used at all sites. The samples were then transferred to the laboratory at 4 °C.
Three collection campaigns were conducted, on the 14th of December 2020, the 19th of April 2021, and the 26th of July 2021 (winter, spring, and summer campaigns, respectively). For the study of the particulate fraction, samples were collected on three different days at the inlets of the Besòs and Vic WWTPs in May 2022.
Data on water inflow measured on the day of collection were provided by the WWTP operators.

Sample preparation
Soluble Proteins. The collected samples were filtered immediately after arrival at the laboratory. For this, up to 100 mL of 24-h composite wastewater sample was centrifuged at 4000 × g (10 °C, 20 min), and the supernatant was filtered through 0.2 μm filters (VWR, North American, USA). The filtered samples were lyophilized using a freeze-dryer (TELSTAR LyoAlfa 6, PA, USA).
For the analysis, lyophilized samples were reconstituted in 20 mL MilliQ water and concentrated using a 10 kDa cutoff device (Amicon®, NMWL 10 kDa), with a filter that was previously passivated to minimize protein adsorption. Passivation was performed by washing the filter with 2.4 mL of 0.1 M NaOH, which was removed by centrifugation at 4000 × g (13 °C, 10 min), followed by a second wash with 2.5 mL MilliQ water and centrifugation. Finally, filters were immersed overnight in Tween-20 (5% in MilliQ water), extensively washed with MilliQ water, and centrifuged at 4000 × g (13 °C, 5 min). Samples were concentrated to approximately 400 μL, then evaporated to dryness using a SpeedVac. Proteins in the sample were cleaned and concentrated in the heads of sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) gels (5% stacking and 12% resolving) at 50 V for 40-50 min. Bovine serum albumin (BSA) was used as a reference marker. After electrophoresis, the gels were stained with Coomassie Blue and scanned. The bands of concentrated proteins were excised and digested with trypsin using an automatic device (DigestPro MS, Intavis), as previously described [25].
Thereafter, the pellets were washed with phosphate-buffered saline using ultracentrifugation under the same conditions as described before.
After washing, the pellets were lysed with beads as described by Casas et al. [26].
Approximately 100 μL of each sample (25% of the total) was concentrated using SDS-PAGE gels (5% stacking and 12% resolving) at 50 V for 40-50 min. BSA was used as a reference marker. After electrophoresis, the gels were stained with Coomassie Blue and scanned. The bands with concentrated proteins were excised and digested with trypsin using an automatic device (DigestPro MS, Intavis), as previously described [25].

LC-HRMS/MS and database search
The LC-HRMS/MS system consisted of an Agilent 1200 Series Gradient HPLC (consisting of a capillary nanopump, binary pump, thermostatic micro injector, and micro-switch valve) coupled to an Orbitrap-Velos High-Resolution Mass Spectrometer (ThermoFisher) equipped with a nanoESI ion source.
For the analysis, the tryptic digests of the sample extracts were evaporated until dry and re-dissolved in 50 µL of 0.5% TFA 5% methanol with gentle agitation in a Thermomixer (5 min, at 22 °C, 900 rpm). Five microliters of this solution was injected into the HPLC system.
The Orbitrap-Velos was operated in positive ion mode with a spray voltage of 1.7 kV.
Spectrometric analysis was performed in data-dependent mode, acquiring a full scan followed by 10 MS/MS scans of the 10 most intense signals detected in the MS scan.
Full MS (m/z range 400-1650) was acquired in the Orbitrap with a resolution of 60,000.
MS/MS spectra were obtained in a linear ion trap.
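The precursor selection rule of the data-dependent mode described above can be illustrated with a minimal sketch. It only shows the "10 most intense signals within the full-scan range" logic; the function name, the peak representation, and the example values are hypothetical, and features of real instrument control such as dynamic exclusion or charge-state filtering are not modeled.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity); illustrative representation


def select_top_n_precursors(
    peaks: List[Peak],
    n: int = 10,
    mz_min: float = 400.0,
    mz_max: float = 1650.0,
) -> List[Peak]:
    """Return the n most intense peaks inside the full-scan m/z range.

    Simplified illustration of top-N data-dependent selection; it is not
    the instrument's actual acquisition software.
    """
    in_range = [p for p in peaks if mz_min <= p[0] <= mz_max]
    # Sort by intensity, highest first, and keep the first n peaks.
    return sorted(in_range, key=lambda p: p[1], reverse=True)[:n]


# Hypothetical full scan: 15 peaks inside the range, one intense peak outside.
full_scan = [(400.0 + 10 * i, float(i)) for i in range(15)] + [(300.0, 1000.0)]
precursors = select_top_n_precursors(full_scan)
```

In this toy scan, the peak at m/z 300 is ignored despite its intensity because it lies outside the acquired range, and the ten remaining most intense peaks would each trigger one MS/MS scan.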

Data treatment and semiquantitative analysis
Overall descriptions of the soluble wastewater proteome were obtained from the protein identification output of a Proteome Discoverer Multiconsensus analysis, including all protein identifications from the different sites and campaigns. For discussion purposes, only proteins assigned as master proteins, with at least two peptides pointing to them, were considered. Estimation of the relative abundance of proteins was based on normalized spectral counts (NSCs). NSCs correspond to the total number of peptide-spectrum matches (PSMs) obtained using Proteome Discoverer, normalized to the mass of the protein. This normalization accounts for the fact that the number of tryptic peptides produced by a protein, and thus the total number of PSMs measured, increases with its size.
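As an illustration of the NSC calculation, the following minimal sketch divides the PSM count of each protein by its molecular mass. The function names, the mass unit (kDa), the accession labels, and the numbers are assumptions made for the example; the additional conversion to a fraction of the total is one possible way to express relative abundance and is not stated in the text.

```python
from typing import Dict


def normalized_spectral_counts(
    psm_counts: Dict[str, int],
    masses_kda: Dict[str, float],
) -> Dict[str, float]:
    """NSC = total PSMs of a protein divided by its mass (here in kDa)."""
    return {acc: psm_counts[acc] / masses_kda[acc] for acc in psm_counts}


def fraction_of_total(nsc: Dict[str, float]) -> Dict[str, float]:
    """Assumed convenience step: express each NSC as a share of the summed NSCs."""
    total = sum(nsc.values())
    return {acc: value / total for acc, value in nsc.items()}


# Hypothetical example: two proteins with the same PSM count but different sizes.
psms = {"PROT_A": 100, "PROT_B": 100}
masses = {"PROT_A": 50.0, "PROT_B": 25.0}

nsc = normalized_spectral_counts(psms, masses)
shares = fraction_of_total(nsc)
```

With equal PSM counts, the smaller protein receives the higher NSC, which reflects the reasoning that a larger protein yields more tryptic peptides for the same molar amount.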
The soluble and particulate proteomes of the material in this study were compared with the proteome of the material found in the polymeric probes [23] as described above. The comparison included all soluble extracts, two replicates of particulate samples from the Vic and Besòs WWTPs, and all the probe-derived sample analyses from the inlet of the WWTPs (site 1 of the three WWTP samples in our previous work). Owing to the large number of files to process, multiconsensus protein identification was performed on five data groups: soluble data combined by campaign (three groups), combined particulate data, and combined probe data.
For the semiquantitative determination of proteins such as amylases and albumins, we selected peptides with an unambiguous match to the protein (no other proteins in the Protein Group) and with at least two PSMs. Unselected peptides pointing to a protein in the unambiguous set were then recovered and added to that set. Protein areas were calculated as the sum of the areas of all selected peptides pointing to the protein and were normalized to the wastewater flow measured at the WWTP inlet when the sample was collected.
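The peptide selection and flow normalization can be sketched as follows. This is one possible reading of the procedure, not the actual processing code: the `Peptide` record, the function name, and the example values are hypothetical. Two points are assumptions because the text does not specify them: a recovered peptide is credited to every protein of the unambiguous set that it points to, and normalization is implemented as multiplication of the summed area by the daily inflow (a load-like value).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Peptide:
    sequence: str
    proteins: Tuple[str, ...]  # accessions in the peptide's protein group
    psms: int
    area: float


def flow_normalized_protein_areas(
    peptides: List[Peptide], flow_m3_per_day: float
) -> Dict[str, float]:
    # Step 1: peptides with an unambiguous match and at least two PSMs.
    selected = [p for p in peptides if len(p.proteins) == 1 and p.psms >= 2]
    unambiguous = {p.proteins[0] for p in selected}

    # Step 2: recover unselected peptides that point to a protein in that set.
    recovered = [
        p for p in peptides
        if p not in selected and any(acc in unambiguous for acc in p.proteins)
    ]

    # Step 3: sum peptide areas per protein.
    areas: Dict[str, float] = defaultdict(float)
    for p in selected + recovered:
        for acc in p.proteins:
            if acc in unambiguous:
                areas[acc] += p.area

    # Step 4 (assumed direction): scale by the inflow on the collection day.
    return {acc: area * flow_m3_per_day for acc, area in areas.items()}


# Hypothetical peptide table for one sample.
example = [
    Peptide("AAAK", ("ALB",), 3, 100.0),       # selected
    Peptide("BBBK", ("ALB",), 1, 10.0),        # recovered (single PSM)
    Peptide("CCCK", ("ALB", "AMY"), 2, 5.0),   # recovered for ALB only
    Peptide("DDDK", ("XYZ",), 1, 50.0),        # discarded
]
result = flow_normalized_protein_areas(example, flow_m3_per_day=2.0)
```

In this example only the hypothetical protein "ALB" enters the unambiguous set, so the shared peptide contributes to "ALB" but not to "AMY", and the single-PSM peptide of "XYZ" is discarded.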
To ensure reproducibility and traceability, Proteome Discoverer output data were processed and documented using Jupyter notebooks and Python.