An integrated metaproteomics workflow for studying host-microbe dynamics in bronchoalveolar lavage samples applied to cystic fibrosis disease

ABSTRACT Airway microbiota are known to contribute to lung diseases, such as cystic fibrosis (CF), but their contributions to pathogenesis are still unclear. To improve our understanding of host-microbe interactions, we have developed an integrated analytical and bioinformatic mass spectrometry (MS)-based metaproteomics workflow to analyze clinical bronchoalveolar lavage (BAL) samples from people with airway disease. Proteins from BAL cellular pellets were processed and pooled together in groups categorized by disease status (CF vs. non-CF) and bacterial diversity, based on previously performed small subunit rRNA sequencing data. Proteins from each pooled sample group were digested and subjected to liquid chromatography tandem mass spectrometry (MS/MS). MS/MS spectra were matched to human and bacterial peptide sequences leveraging a bioinformatic workflow using a metagenomics-guided protein sequence database and rigorous evaluation. Label-free quantification revealed differentially abundant human peptides from proteins with known roles in CF, like neutrophil elastase and collagenase, and proteins with lesser-known roles in CF, including apolipoproteins. Differentially abundant bacterial peptides were identified from known CF pathogens (e.g., Pseudomonas), as well as other taxa with potentially novel roles in CF. We used this host-microbe peptide panel for targeted parallel-reaction monitoring validation, demonstrating for the first time an MS-based assay effective for quantifying host-microbe protein dynamics within BAL cells from individual CF patients. Our integrated bioinformatic and analytical workflow combining discovery, verification, and validation should prove useful for diverse studies to characterize microbial contributors in airway diseases. Furthermore, we describe a promising preliminary panel of differentially abundant microbe and host peptide sequences for further study as potential markers of host-microbe relationships in CF disease pathogenesis. 
IMPORTANCE Identifying microbial pathogenic contributors and dysregulated human responses in airway disease, such as CF, is critical to understanding disease progression and developing more effective treatments. To this end, characterizing the proteins expressed from bacterial microbes and human host cells during disease progression can provide valuable new insights. We describe here a new method to confidently detect and monitor abundance changes of both microbe and host proteins from challenging BAL samples commonly collected from CF patients. Our method uses both state-of-the-art mass spectrometry-based instrumentation to detect proteins present in these samples and customized bioinformatic software tools to analyze the data and characterize detected proteins and their association with CF. We demonstrate the use of this method to characterize microbe and host proteins from individual BAL samples, paving the way for a new approach to understand molecular contributors to CF and other diseases of the airway.


Understanding the role of microbiota in airway conditions such as CF
In order to characterize the microorganisms from samples such as BAL that might play a role in diseases such as CF, metagenomic approaches have been commonly used (13), including small subunit ribosomal RNA (SSU-rRNA) gene sequencing for classifying the bacterial taxa present in the sample (14)(15)(16). Although metagenomics provides highly valuable information on the bacterial composition of clinical samples (17)(18)(19), even the most advanced metagenome sequencing methods (20) only provide predictions of the potential functional molecules (proteins) expressed by the microbiota (21). Mass spectrometry (MS)-based metaproteomics offers a means to detect and quantify the proteins expressed by complex microbiota, thereby revealing functional markers of translationally active microbes present in the samples while simultaneously connecting their expression to specific bacterial taxa (22,23). As such, when coupled with metagenomic information to define the organisms (and expressed proteomes) present, metaproteomics offers a powerful, complementary approach to fully characterizing bacterial communities within complex clinical samples (24)(25)(26). MS-based proteomic analysis of clinical samples such as BAL also offers the possibility of characterizing the response of the human proteins in parallel to those expressed by the microbiota, providing unique information on potential mechanisms of microbe-host interactions (27,28).

Challenges of metaproteomics in BAL
Despite its value, MS-based metaproteomics faces challenges when analyzing complex biological samples, particularly those derived from human clinical specimens such as BAL. A main challenge is the low biomass of microorganisms relative to the human host, which makes microbial peptides harder to detect than the more prominent human peptides when analyzing tryptic peptide mixtures by liquid chromatography-tandem mass spectrometry (LC-MS/MS) (29,30). Adding to this challenge, generation of peptide-spectrum matches (PSMs) to microbial sequences necessitates matching MS/MS spectra to extremely large databases comprising all proteomes of microbiota present in the sample, which increases potential for false positives and decreases sensitivity (31)(32)(33). Finally, assigning function and taxonomy to the identified peptide sequences can be a challenge due to conservation of protein sequences across taxa and lack of confident annotation of encoded proteins (34). Many of these challenges can be addressed using specialized metaproteomic bioinformatic tools (35)(36)(37)(38). The upshot of successful metaproteomic analysis is the generation of unique information on proteins expressed by translationally active bacteria, along with a profile of human host proteins, within clinically important samples. Not surprisingly, metaproteomics has shown value in studying the role of microorganisms in CF airways in recent years (3,39).

Applying an advanced MS-metaproteomics workflow to BAL samples to study microbe-host dynamics in CF
We describe a novel workflow utilizing metaproteomics in BAL CF samples and demonstrate its effectiveness in characterizing microbial and human proteins compared between CF and disease control (DC) patients with non-CF airway conditions. This advanced workflow is composed of these steps: (i) deep quantitative MS-based metaproteomics analysis of pooled BAL protein samples classified by disease status and SSU-rRNA-derived bacterial diversity; (ii) customized metaproteomic bioinformatic analysis to identify, verify, and quantify peptides from microbial proteins across pooled samples, as well as identify differentially abundant human proteins corresponding to disease status; (iii) prioritization of microbial and human peptide candidates to develop a panel for targeted MS-based assays using parallel-reaction monitoring (PRM) in individual patient BAL samples; and (iv) PRM analysis to quantify and determine differential abundance of microbial and host peptides between CF and DC samples, generating a promising and unique microbe-host peptide panel for future investigation in larger patient cohorts.
We demonstrate the effectiveness of this workflow and, in doing so, highlight a number of novel aspects: (i) the first demonstration of an end-to-end analytical and bioinformatic workflow for the joint analysis of microbial and human peptides in challenging BAL samples; (ii) a first-of-its-kind demonstration of PRM analysis to validate differential abundance of microbe-host peptide targets in individual CF or DC BAL samples; and (iii) a promising preliminary panel of microbe-host peptides, derived from some bacteria and human proteins with known associations to CF, as well as a number of novel proteins with new possible roles in pathogenesis. We provide all necessary information for others to utilize this peptide panel in their work studying progression and/or treatment of CF in clinical BAL samples, as well as an in-depth description of the workflow to enable its application to other clinical studies of microbe-host relationships in airway disease.

Clinical BAL sample cohort
The clinical BAL sample cohort included 67 samples from individuals with CF and 115 from people without CF who had other respiratory diseases and served as DC samples. All BAL samples were collected for clinical purposes, and additional BAL available after all clinically indicated tests had been run was saved for research purposes. BAL samples were stored at −80°C until shipment to the University of Minnesota for analysis. See Data S1 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s1) for details on all relevant information about these samples, which were collected using institutional review board-approved protocols for the protection of human subjects.

SSU-rRNA data generation
The SSU-rRNA data acquired from the sample cohort were previously published (40) and utilized in this experiment. Briefly, an approximately 315-nucleotide sequence encompassing the V1/V2 region of the rRNA gene was amplified by PCR using indexed primers (40). Unique sequences were assigned taxonomic identification using SILVA Incremental Aligner (41,42). Operational taxonomic units (OTUs) were generated by adding all counts with the same taxonomic identity. Based on the OTUs identified, the Shannon Diversity Index, a measure of microbiome diversity (43,44), was calculated for each of the 116 samples with sufficient bacterial load and used as a criterion for selection of samples used in this experiment (see Fig. S1).
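The OTU collapsing and diversity calculation described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the use of the natural logarithm is an assumption, since the log base used for the published index values is not stated here.

```python
import math
from collections import Counter


def collapse_to_otus(assignments):
    """Sum read counts that share a taxonomic identity.

    assignments: iterable of (taxon, count) pairs (hypothetical input shape).
    """
    otus = Counter()
    for taxon, count in assignments:
        otus[taxon] += count
    return dict(otus)


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with nonzero counts.

    Natural log is an assumption; other bases only rescale H.
    """
    total = sum(otu_counts.values())
    if total <= 0:
        raise ValueError("sample has no reads")
    h = 0.0
    for count in otu_counts.values():
        if count > 0:
            p = count / total
            h -= p * math.log(p)
    return h
```

A sample dominated by a single taxon gives H near 0, whereas an even mixture of many taxa gives a higher value, which is the property used to rank samples as low or high diversity.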

Preparation of BAL cell pellets from individual samples
BAL samples were thawed and spun at 4,000 × g for 20 min at 4°C. The supernatants were removed. The pellets were rinsed twice with cold PBS, centrifuging and removing the supernatant each time. Freshly made urea-based lysis buffer was added to the pellets, which were vortexed and set on ice for 5 min. The pellets were probe sonicated for 7 s to break up DNA and subjected to 60 cycles of pressure cycling (35,000 psi for 20 s, then 0 psi) at 37°C. Protein amounts for each sample were quantified with Precision Red Advanced Protein Assay reagent (Cytoskeleton Inc., cat. # ADV02).

Selection of patient samples for pooling
With the goal of generating an initial, deep profile of detectable proteins from BAL samples, pooled patient samples were generated in order to increase the amount of starting protein available for in-depth analysis. Samples were selected for pooling by first grouping by clinical diagnosis (CF or DC) and then sorting by their Shannon diversity index values, with each sample ranked as having high or low overall relative biodiversity. Microbial diversity ranking information was applied in the generation of pooled samples to reflect possible protein profile differences in disease progression influenced by the microbiome present. Using a total of 64 individual samples, samples from each of the CF and DC groups were selected to represent both the highest and lowest Shannon index values for four unique and separate pooled groups: cystic fibrosis-high diversity (CF-HD, n = 16), cystic fibrosis-low diversity (CF-LD, n = 16), disease control-high diversity (DC-HD, n = 16), and disease control-low diversity (DC-LD, n = 16). For the purposes of this study, categorizing the sample pools based on their Shannon index was done with the objective of obtaining a deep profile of proteins present in either the CF or DC patient groups, rather than drawing conclusions about differences corresponding to bacterial diversity between these groups. The sample selection process can be seen in Fig. S1. Data S1 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s1) shows all clinical information related to the samples selected for pooling, and a summary of the details can be found in Table S1.
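A minimal sketch of this selection logic is shown below, assuming that "highest and lowest" means taking the 16 samples at each extreme of the Shannon ranking within each diagnosis group. The record fields (`id`, `diagnosis`, `shannon`) and the function name are hypothetical; the actual selection is documented in Fig. S1 and may have involved additional criteria.

```python
def select_pools(samples, pool_size=16):
    """Assign samples to CF-HD, CF-LD, DC-HD, and DC-LD pools.

    samples: list of dicts with hypothetical keys 'id', 'diagnosis'
    ('CF' or 'DC'), and 'shannon' (Shannon diversity index).
    """
    pools = {}
    for diagnosis in ("CF", "DC"):
        group = [s for s in samples if s["diagnosis"] == diagnosis]
        if len(group) < 2 * pool_size:
            raise ValueError(f"not enough {diagnosis} samples for two pools")
        ranked = sorted(group, key=lambda s: s["shannon"])
        # Lowest-diversity samples form the LD pool, highest form the HD pool.
        pools[f"{diagnosis}-LD"] = [s["id"] for s in ranked[:pool_size]]
        pools[f"{diagnosis}-HD"] = [s["id"] for s in ranked[-pool_size:]]
    return pools
```

With at least 32 eligible samples per diagnosis, this yields four non-overlapping pools of 16 samples each, or 64 samples in total.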

Pooling of samples
Each pooled sample group (CF-HD, CF-LD, DC-HD, and DC-LD) was composed of 2.5 μg of total protein from each of the 16 samples within the group mixed together. Each pooled mixture was brought to equal volume of urea lysis buffer and diluted by 4× volumes of H2O. The samples were digested with a 1:40 trypsin-to-sample mass ratio at 37°C overnight.
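As a quick check of the amounts implied by these numbers, the following worked calculation assumes that the entire pooled mixture was digested. The symbols $m_{\text{pool}}$ and $m_{\text{trypsin}}$ are introduced here for illustration only.

```latex
\begin{align*}
m_{\text{pool}} &= 16 \times 2.5\,\mu\text{g} = 40\,\mu\text{g} \\
m_{\text{trypsin}} &= \frac{m_{\text{pool}}}{40} = 1\,\mu\text{g}
\end{align*}
```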

High pH offline reverse-phase liquid chromatography
Each pooled sample was reconstituted in 50 μL of 50 mM ammonium formate and injected separately for fractionation by high-pH Shimadzu offline reverse-phase liquid chromatography (RPLC) using a 90-min gradient, 200 μL/min, with a fraction collected every 2 min (45). Thirty-two fractions were collected for each pooled sample, and 15-mAU-equivalent aliquots were concatenated into 10 tubes to mix early, middle, and late timepoints of the gradient fractionation. The concatenated samples were cleaned with MCX stage tips and dried with a speed vacuum.
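One common way to mix early, middle, and late fractions is round-robin concatenation, sketched below. The exact assignment of fractions to tubes is not given in the text, so the modulo scheme and the function name are assumptions for illustration.

```python
def concatenate_fractions(n_fractions=32, n_tubes=10):
    """Assign fraction numbers (1-based) to tubes in round-robin order.

    Assumed scheme: fraction k goes to tube (k - 1) mod n_tubes, so each
    tube combines fractions spaced n_tubes apart along the gradient.
    """
    tubes = [[] for _ in range(n_tubes)]
    for fraction in range(1, n_fractions + 1):
        tubes[(fraction - 1) % n_tubes].append(fraction)
    return tubes
```

Under this scheme the first tube would hold fractions 1, 11, 21, and 31, so each concatenated sample spans the full elution range rather than one narrow window.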

LC-MS/MS analysis
The concatenated fractions were reconstituted in 97.9:2.0:0.1 H2O:acetonitrile (ACN):formic acid (FA) load solvent, and direct injections of ~400 ng of each fraction were analyzed by nanocapillary LC-MS with a Thermo Fisher Scientific (Waltham, MA) Dionex UltiMate 3000 RSLCnano system online with an Orbitrap Eclipse mass spectrometer (Thermo Fisher Scientific) with high-field asymmetric waveform ion mobility (FAIMS) separation. Gradient separation was performed on a self-packed C18 column (Dr. Maisch GmbH ReproSil-PUR 1.9-μm 120-Å C18aq, 100-μm ID × 40-cm length) at 55°C with the following profile using 0.1% FA in H2O (A) and 0.1% FA in ACN (B): 5% B solvent from 0 to 2 min, 8% B at 2.5 min, 21% B at 135 min, 34% B at 180 min, and 90% B at 182 min, with a flow rate of 325 nL/min from 0 to 2 min and 315 nL/min from 2.5 to 180 min. The FAIMS nitrogen cooling gas setting was 5.0 L/min; the carrier gas was 4.6 L/min; and the inner and outer electrodes were set to 100°C. Compensation voltages (CVs) were scanned at −45, −60, and −75 V for 1 s each with a data-dependent acquisition method. The following MS parameters were used: electrospray ionization (ESI) voltage +2.1 kV; ion transfer tube 275°C; no internal calibration; Orbitrap MS1 scan 120K resolution in profile mode from 400 to 1,400 mass-to-charge (m/z) with 50-ms injection time; 100% (4E5) automatic gain control (AGC); higher-energy collisional dissociation MS2 activation was triggered on precursors with two to six charges above 2.5E4 counts; monoisotopic peak determination was set to peptide; MS2 settings (all CVs) were 1.6-Da quadrupole isolation window, 30% fixed collision energy, Orbitrap detection with 30K resolution at 200 m/z, first mass fixed at 110 m/z, 54-ms maximum injection time, 100% (5E4) AGC, and 45-s dynamic exclusion duration with ±10-ppm mass tolerance; and exclusion lists were shared among CVs. See details in "Data Availability" for accessing raw MS files within the ProteomeXchange PRIDE repository.
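For readers who want to reproduce the gradient table programmatically, the breakpoints listed above can be encoded as follows. Linear ramps between the listed time points are an assumption (the profile is given only as breakpoints), and the names `GRADIENT` and `percent_b` are hypothetical.

```python
# (time in min, % solvent B) breakpoints taken from the profile in the text.
GRADIENT = [(0.0, 5.0), (2.0, 5.0), (2.5, 8.0),
            (135.0, 21.0), (180.0, 34.0), (182.0, 90.0)]


def percent_b(t_min, gradient=GRADIENT):
    """Return % B at time t_min, assuming linear ramps between breakpoints."""
    if not gradient[0][0] <= t_min <= gradient[-1][0]:
        raise ValueError("time is outside the programmed gradient")
    for (t0, b0), (t1, b1) in zip(gradient, gradient[1:]):
        if t0 <= t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
```

For example, under the linear-ramp assumption the composition halfway between 135 and 180 min would be 27.5% B.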

Custom protein sequence database generation
To create a customized protein sequence database, microbial genera determined by SSU-rRNA sequencing data were used, considering only those within the top 99% of relative abundance in each of the samples used to create the pooled samples. Using software tools available in the Galaxy for Proteomics (Galaxy-P) tool suite (46), the tabular files were used as input for importing corresponding proteome sequences and merging these together with the human UniProt proteome sequence (2021-12-10, 101,014 protein sequences) along with common contaminants (116 sequences). This created a sequence database for each of the pooled samples containing these numbers of distinct sequences: CF-HD, 18,474,828; CF-LD, 7,219,650; DC-HD, 26,029,550; and DC-LD, 16,067,437. The MS raw files were first searched against this large protein sequence database using MetaNovo (47) (Galaxy v.0.1.1). PSMs generated were used to extract the parent protein sequences from the original database and generate a reduced database containing 238,382 total human and bacterial protein sequences. Files used throughout the process to generate the reduced database can be found at this link: https://usegalaxy.eu/u/galaxyp/h/cf-discoveryreduceddatabasegeneration.
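The final database-reduction step (keeping only parent proteins of matched peptides) can be illustrated with the sketch below. It is not MetaNovo's implementation: the function names are hypothetical, peptides are matched by simple substring search, and isoleucine/leucine ambiguity is ignored.

```python
def parse_fasta(text):
    """Parse FASTA text into {accession: sequence}; accession is the first
    whitespace-delimited token after '>'."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            accession = line[1:].split()[0]
            records[accession] = ""
        elif accession is not None:
            records[accession] += line
    return records


def reduce_database(records, matched_peptides):
    """Keep only proteins containing at least one matched peptide sequence."""
    return {acc: seq for acc, seq in records.items()
            if any(pep in seq for pep in matched_peptides)}
```

Searching the much smaller reduced database in the subsequent steps is what limits the loss of sensitivity associated with very large search spaces.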

Verification of candidate microbial peptide sequences
Matches to microbial sequences across the three programs were combined for further verification using the PepQuery tool (59), which further evaluates the quality of PSMs to putative microbial proteins by testing these against sequences in the reference human proteome, considering possibilities such as PTMs and single amino acid substitutions to reference sequences as potential better matches to MS/MS compared to the initial microbial sequence match. Those MS/MS spectra that do not show a better match to the reference sequences were also reevaluated against their original microbial sequence, assigned a score and P value, and given a "yes" or "no" for overall confidence based on these criteria. The usage of the PepQuery tool, including input and output files, can be viewed at https://usegalaxy.eu/u/galaxyp/h/cf-pepquery-validation. Those microbial peptide sequences passing this verification step and assigned a "yes" for confidence were quantified between the four pooled sample groups using the MaxQuant, IonQuant, and FlashLFQ (Galaxy v.1.0.3.1) programs for intensity-based label-free quantification to determine those showing differential abundance between DC and CF sample pools. Based on the consistency of fold changes among CF and DC samples and PepQuery verification results, we selected 87 microbial peptides for further interrogation. Microbial peptides and their parent proteins were further annotated for taxonomy and function (if known) using BLAST-P to confirm their mapping to bacterial proteins and the Unipept (60) (Galaxy v.4.5.1) and MetaTryp tools (61). Quantification and detailed characterization of microbial peptides can be found in Data S3 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s3).
For analysis of human proteins, PSMs and inferred proteins from the FragPipe pipeline were quantified by normalized spectral counting in order to determine proteins showing potential differential abundance between sample groups (Data S2, https://usegalaxy.eu/u/galaxyp/h/cfdata-file-s2). Total spectral counts for each protein were divided by the total PSMs within the sample and multiplied by 10^6 to generate a quantitative value for each protein in each of the four pooled samples. Values from the high or low diversity pools were combined for the CF and DC pools, respectively. Fold changes for CF:DC were calculated for each protein; those proteins with a normalized quantitative value of at least 100 in either the CF or DC samples and a fold change of at least twofold in either direction were further considered for analysis. These proteins were inputted into the STRING-database (db) resource (62) to characterize enriched functional networks. Networks were visualized within STRING using the built-in network visualization functions. From this analysis, proteins belonging to enriched functional networks, along with the 10 proteins with the highest fold changes measured between the CF and DC groups, were further considered for targeted validation in individual patient samples.
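The normalization and filtering described above can be written compactly as follows. This is a sketch with hypothetical function names; it takes already-combined CF and DC values as input, because how the high- and low-diversity pools were combined (for example, summed or averaged) is not specified here.

```python
def normalized_counts(spectral_counts, total_psms):
    """Spectral counts per protein scaled to counts per million PSMs."""
    return {protein: count * 1e6 / total_psms
            for protein, count in spectral_counts.items()}


def select_candidates(cf, dc, min_value=100.0, min_fold=2.0):
    """Return {protein: CF:DC fold change} for proteins passing both filters.

    cf, dc: normalized values per protein for the combined CF and DC pools.
    A protein absent from one group is treated as 0 there (an assumption),
    giving a fold change of infinity or 0.
    """
    selected = {}
    for protein in set(cf) | set(dc):
        cf_value = cf.get(protein, 0.0)
        dc_value = dc.get(protein, 0.0)
        if max(cf_value, dc_value) < min_value:
            continue
        fold = float("inf") if dc_value == 0 else cf_value / dc_value
        if fold >= min_fold or fold <= 1.0 / min_fold:
            selected[protein] = fold
    return selected
```

For instance, a protein with 50 spectral counts in a sample of 100,000 PSMs would receive a normalized value of 500.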

Selection of peptide targets
From the microbial peptides and human proteins quantified in the discovery workflow described above, peptide targets were selected for validation using targeted parallel-reaction monitoring (63,64). Quantified microbial peptides were manually curated, selecting those with strong signal-to-noise for MS1 and MS/MS signals, visually confirmed quality MS/MS matches to the verified peptide sequences, and fold-change differences between CF and DC pools of at least twofold. For human proteins, peptides were manually selected from proteins spanning the enriched functional networks determined by STRING, based on MS/MS signal strength, with no trypsin miscleavage, and minimizing amino acid sequences that might become modified (oxidation of methionines and alkylation of cysteines).
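Although the selection above was manual, the sequence-based criteria for human peptides lend themselves to a simple pre-screen, sketched below. The function names are hypothetical, the rule that trypsin does not cleave before proline is a common convention assumed here, and strict exclusion of methionine and cysteine is a simplification of "minimizing" such residues.

```python
def has_missed_cleavage(peptide):
    """True if an internal K or R is not followed by P (assumed trypsin rule)."""
    for i, residue in enumerate(peptide[:-1]):
        if residue in "KR" and peptide[i + 1] != "P":
            return True
    return False


def is_preferred_target(peptide):
    """No missed cleavage and no residues prone to the modifications named
    in the text (M oxidation, C alkylation)."""
    return not has_missed_cleavage(peptide) and not set(peptide) & {"M", "C"}
```

Peptides passing such a pre-screen would still need the manual review of MS/MS signal strength described above.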

Targeted analysis of peptide targets in individual BAL samples
BAL samples from five individuals with CF and five DC individuals were selected for demonstrating targeted validation of the selected microbial and human peptides. These individual samples were selected from the samples used to create the pooled samples in the discovery analysis. A summary of the individuals' clinical details can be found in Table S2, with additional information at https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s1. A total of 87 microbial peptides and 106 human peptides, selected as described above, were used to create an inclusion list for an initial analysis of a pool of peptides created by combining the CF and DC individual samples (10 total pooled together). Using this inclusion list, we analyzed the pooled sample by LC-MS/MS in order to first confirm identification of these peptides within the sample pool and to establish retention times for the detected peptides.
For this inclusion list-based analysis, we performed analytical separation and detection on an UltiMate 3000 RSLCnano UHPLC system (Thermo Fisher Scientific) interfaced to an Orbitrap Fusion Tribrid mass spectrometer (Thermo Fisher Scientific, San Jose, CA). Dried peptide samples were reconstituted using a load solvent mixture of 97.9:2:0.1 H2O:ACN:FA. Peptide mixture (400 ng) in 4 µL was injected on the analytical platform equipped with a 10-µL injection loop. Chromatographic separation was performed using a self-packed C18 column (Dr. Maisch GmbH ReproSil-PUR 1.9-µm 120-Å C18aq, 100-µm ID × 45-cm length) maintained at 55°C for the duration of the experiment. The liquid chromatography (LC) solvents were 0.1% FA in H2O (A) and 0.1% FA in ACN (B) solutions. Chromatographic separation was performed using a linear gradient as follows: 5% B solvent from 0 to 2 min, 8% B at 2.5 min, 21% B at 30 min, 35% B at 45 min, and 90% B from 47 to 55 min followed by a return to starting conditions. The flow rate was operated at 400 nL/min for 0-2 min, 315 nL/min for 2.5-45.0 min, and 400 nL/min for 47-55 min. A Nanospray Flex ion source (Thermo Fisher Scientific) was used with a source voltage of 2.1 kV and ion transfer tube temperature of 250°C. Results from this analysis were manually reviewed using Skyline (65), selecting only peptides with at least three co-eluting transition peaks with no interfering, non-aligned transition signals; a Skyline dot product of at least 0.2 was required as an additional metric. Ultimately, a total of 133 peptides from both microbial and human proteins were confirmed using the inclusion list analysis in pooled samples and passed on for PRM analysis in individual samples.
After confirming the detection and relative LC retention time of microbial and human peptides, targeted LC-MS/MS analyses were performed by PRM analysis on the same Orbitrap Fusion system. Retention time-based scheduling of MS/MS generation by PRM was accomplished using Orbitrap MS1 detection at a resolution of 120,000, AGC targeted setting of 4E5, and a maximum ion injection time of 50 ms. Scan ranges of 380-1,580 m/z were used for full-scan detection. A total of 133 peptide targets were simultaneously monitored via LC-PRM-MS/MS using a 3-min LC retention time scheduling window in a single experiment. MS/MS spectra were acquired with quadrupole isolation of 1.6 m/z, Orbitrap detection at a resolution of 30K and an MS2 AGC of 5E4, and a 54-ms maximum injection time. The analysis of peptides utilized collision induced dissociation (CID) fragmentation at a constant collision energy of 30%.

Evaluation and curation of PRM results and statistical analysis of relative peptide abundance
Peptides were quantified in Skyline (65), and specificity of MS/MS matches to expected sequences was assessed by matching PRM-acquired MS/MS spectra to predicted spectral libraries generated for the peptide sequences using the Prosit tool (66). PRM results were manually inspected for quality, taking into account dot product values assigned by Prosit within Skyline and only considering those results with at least three co-eluting peptide fragments detected with high-quality signal-to-noise in the reconstructed LC chromatograms at expected retention times and Prosit dot products of at least 0.5. Using standard functions available in Skyline, area under the curve values were calculated for detected fragments and normalized to the total ion current (TIC) corresponding to the sample to generate abundance values for each PRM-detected peptide. For analysis of relative abundance levels between DC and CF groups, statistical calculations using non-parametric Mann-Whitney U-tests and data visualization using box-plot graphs were performed within the GraphPad Prism (v.9.5.0) software.
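The normalization and group comparison described above can be sketched in plain Python as follows. This is an illustration rather than the Skyline or GraphPad Prism implementation: the function names are hypothetical, and the P value is computed by exact enumeration of all group assignments, which is practical for five samples per group.

```python
from itertools import combinations


def tic_normalize(fragment_areas, tic):
    """Summed fragment peak areas divided by the run's total ion current."""
    if tic <= 0:
        raise ValueError("TIC must be positive")
    return sum(fragment_areas) / tic


def mann_whitney_exact(x, y):
    """Return (U for group x, exact two-sided P value) using midranks."""
    values = list(x) + list(y)
    n, m = len(x), len(y)
    ordered = sorted(values)
    rank_of = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    ranks = [rank_of[v] for v in values]

    def u_stat(indices):
        return sum(ranks[k] for k in indices) - n * (n + 1) / 2

    u_obs = u_stat(range(n))
    center = n * m / 2
    observed_dev = abs(u_obs - center)
    # Enumerate every way of assigning n of the n + m ranks to group x.
    splits = list(combinations(range(n + m), n))
    extreme = sum(1 for idx in splits
                  if abs(u_stat(idx) - center) >= observed_dev - 1e-9)
    return u_obs, extreme / len(splits)
```

With five CF and five DC samples there are 252 possible assignments, so the smallest achievable two-sided P value in this sketch is 2/252 (about 0.008), which occurs when the two groups do not overlap at all.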

Integrated analytical and bioinformatic workflow for host-microbe protein discovery, verification, and validation
This work is built on a foundational, customized workflow combining instrument-based analytical methods with bioinformatic tools developed to address challenges of metaproteomic analysis in clinical BAL samples. These challenges include (i) detection and quantification of microbe-expressed proteins, which are relatively low in abundance compared to the human host proteins in BAL; (ii) verification of the accuracy of putative PSMs identifying microbial peptides; and (iii) validation of peptides of highest interest in individual CF BAL samples, ultimately offering a quantitative assay for investigating host-microbe protein dynamics across larger patient cohorts. Our integrated workflow is composed of three main modules (Fig. 1): (i) discovery, (ii) verification and annotation, and (iii) validation and results reporting.

Discovery module
The relatively low abundance of bacterial proteins compared to host proteins generally found when working with clinical human samples (67,68) challenges their detection via MS-based methods. The clinical samples available for this work (Data S1; https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s1) had previously been analyzed by SSU-rRNA sequencing and categorized by bacterial diversity based on assigned Shannon diversity index values. These samples were derived from either CF patients or DC patients, who were diagnosed with non-CF diseases that impact respiratory health, such as interstitial lung disease, asthma/reactive airway disease, and various immunosuppressive disorders. In order to maximize the detection of proteins from BAL samples, four pooled samples were generated, combining equal amounts of proteins from 16 patient samples with either CF or DC diagnoses, and high diversity or low diversity of bacteria as determined by SSU-rRNA data. See Data S1 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s1) for information on samples used for pooling. Our ultimate goal was to demonstrate the effectiveness of our workflow to determine differentially abundant peptides between CF and DC samples; therefore, the separation of pooled samples based on bacterial diversity was only done to increase our depth of identification of peptides across CF and DC pools. Each separate pooled sample (CF-HD, CF-LD, DC-HD, DC-LD) was fractionated via semipreparative high pH RPLC, and these fractions were analyzed by LC-FAIMS-MS/MS, with ion filtering by FAIMS supporting acquisition of high-quality peptide MS/MS spectra and sensitive generation of confident PSMs.
In order to identify proteins from microbes and host, as well as provide some initial quantitative information on peptide abundance differences between the CF and DC sample pools, the MS/MS data were analyzed using a suite of tools developed for metaproteomic analysis. The taxonomic composition from the SSU-rRNA data was used to generate an initial, large protein sequence database of potentially expressed proteins for each of the pooled samples (average of about 17M sequences, Table 1). The MetaNovo (47) software was used to initially match MS/MS to the corresponding database for each pooled sample. PSMs generated by MetaNovo from each pool were used to build a reduced, composite database composed of 238,382 bacterial protein sequences containing the identified peptides.
To maximize PSMs from bacterial sequences, we utilized several sequence database-searching platforms (SearchGUI/PeptideShaker, MaxQuant, and FragPipe), matching MS/MS data against the sequence database of the bacterial protein sequences appended to UniProt human protein sequences. These three algorithms produced complementary results (Table S4; Fig. S2), identifying 2,292 unique bacterial peptide sequences of high quality based on the database searching scoring assignments from the different algorithms and controlling for PSM false discovery rate (estimated at 1% using target-decoy methods) (69). For label-free quantification of microbial peptides, the FlashLFQ tool (58,70) was used for SearchGUI/PeptideShaker results, MaxLFQ (55) was used for MaxQuant results, and IonQuant (51) was used within FragPipe, as seen in Data S3 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s3). For human proteins, the results from FragPipe were used, along with normalized spectral counting to quantify proteins showing abundance differences between the CF and DC samples analyzed.
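The target-decoy principle behind the 1% PSM false discovery rate can be illustrated with the generic sketch below. Each search platform named above has its own scoring and FDR procedure, so this is not their implementation; the function name and the simple decoys/targets estimate are assumptions for illustration.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: list of (score, is_decoy) pairs, higher score meaning a better match.
    FDR at a cutoff is estimated as (#decoys) / (#targets) at or above it.
    Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

PSMs to target sequences scoring at or above the returned cutoff would then be accepted as confident identifications.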

Verification and annotation module
This module focused on verifying identified bacterial peptides from the discovery module and prioritizing their value as candidates for further validation based on abundance comparisons between CF and DC sample pools. The PepQuery software (59) was used as a means to independently verify the most confident PSMs to bacterial sequences, using its well-described routine for testing the quality of putative PSMs against other possible alternatives, such as matches to human sequences, including those carrying post-translational modifications. Those peptides passing the stringent PepQuery verification were further annotated by taxonomy using the Unipept tool (71) and BLAST-P to confirm their matching to bacterial proteins. Finally, those bacterial peptides showing potential differential abundance based on intensity-based LFQ values compared between the pooled CF and DC samples were further considered, resulting in 87 bacterial peptides of highest interest for advancement to the third analysis module (Data S3, https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s3).
The human proteins quantified via normalized spectral counting were also further annotated functionally, with an aim of prioritizing potential pathways or functional networks enriched in abundance between the pooled DC and CF samples. For this, only proteins identified with at least 100 normalized spectral counts in at least one of the pooled samples and with twofold or greater abundance differences between combined CF and DC sample pools were considered. Proteins passing these criteria were inputted into the STRING-db bioinformatics resource (62), which provided an analysis of enriched subnetworks of functionally or physically interacting proteins. The interacting proteins composing these subnetworks, along with the top 10 most differentially abundant proteins in CF or DC that did not fall into these networks, were retained as candidates for further validation in the next module. The Galaxy-P history (https://usegalaxy.eu/u/galaxyp/h/cfdata-file-s2) provides information on these prioritized human proteins.

Validation and results reporting module
The verified and quantified bacterial and human peptides of highest interest were advanced to a final analysis module, focused on developing PRM methods targeting these peptides. The goal of this analysis was primarily to demonstrate the effectiveness of PRM to validate abundance levels of bacterial and human peptides of interest in individual patient BAL samples. This work would also deliver a preliminary panel of host-microbe peptides, validated in a small number of individual samples, useful for future studies of CF. To demonstrate this validation approach, five individual CF samples and five individual DC samples were selected from the samples used for the discovery module, and their proteins were digested with trypsin and prepared for PRM analysis. A pool of peptides from these 10 samples was initially analyzed by LC-MS/MS using an inclusion list of m/z values of bacterial and human peptides to verify the detection of these peptides and establish LC retention times. These data provided the necessary information to develop a scheduled PRM method for analysis of the individual CF or DC samples.
The PRM results were analyzed via the Skyline software. We utilized the Prosit resource (66) for predicting MS/MS fragmentation patterns of our bacterial or human peptides of interest, using this as a spectral library to evaluate the quality of detection of peptides using Skyline (65). Skyline mapped PRM data to expected fragmentation patterns, assigning Prosit-calculated dot product confidence scores and offering a means for visualizing the quality of results in individual samples. The abundance of detected peptides was measured in Skyline using area under the curve calculations and normalization to the TIC for each run. The TIC values can be found in Table S5, and chromatograms for CF and DC samples are displayed in Fig. S3 and S4, respectively. Differential abundance of quantified peptides between CF and DC patients was determined using Mann-Whitney U-tests (Fig. 2 and 3). Human and microbial peptides passing quality criteria thresholds from Skyline analysis, along with normalized abundance values and associated details, are viewable in Data S4 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s4-1) and Data S5 (https://usegalaxy.eu/u/galaxyp/h/cf-data-file-s5-2), respectively. Detected peaks used for quantitation and associated numerical data for each detected peptide can be viewed using the information in our ProteomeXchange PRIDE repository (see "Data Availability"), which includes a comparison of detected microbial and human peptides and proteins in 10 individual samples analyzed with PRM (Table S6). Figure 2A shows a selection of differentially abundant peptides derived from human proteins measured between CF and DC samples, as well as functional networks corresponding to these proteins as determined by STRING-db (Fig. 2B). Figure 3 shows a selection of differentially abundant peptides derived from bacterial proteins measured between CF and DC samples. In both Fig. 2A and 3, the quantified PRM-detected peptides had passed our stringent manual quality curation and had maximum dot products of at least 0.5 in the patient sample group (CF or DC) in which they were detected with increased abundance, consistent with other studies using PRM for peptide validation (72). Figures S5 and S6 show human and microbial peptides, respectively, that either were not differentially abundant, did not pass quality assessment, or both. The bacterial and human peptides validated by PRM analysis provide a unique panel of rigorously characterized peptides. Table 2 provides a summary of all of the peptide targets shown in the plots in Fig. 2 and 3. The table shows the information necessary to develop targeted methods quantifying these peptides in MS-based assays seeking to understand host and microbe dynamics in clinical samples. In addition to the results shown in Fig. 2A and 3, we also quantified a number of microbial and human peptides with strong PRM signals but lower dot products. These are shown in Table S7.
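The quantification steps described above (peak area normalized to the run's TIC, followed by a two-group comparison) can be sketched as follows. The figure captions name Mann-Whitney U-tests; the exact enumeration used here is one simple way to compute that test for small groups and is an assumption, not necessarily the procedure used in the study. All peak areas and TIC values are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def tic_normalize(areas: Sequence[float], tics: Sequence[float]) -> List[float]:
    """Divide each run's peak area by that run's total ion current (TIC)."""
    if len(areas) != len(tics):
        raise ValueError("one TIC value is required per run")
    return [a / t for a, t in zip(areas, tics)]


def u_statistic(x: Sequence[float], y: Sequence[float]) -> float:
    """Mann-Whitney U for x: each (x, y) pair scores 1 if x > y, 0.5 if tied."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def mann_whitney_exact(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Two-sided exact p-value obtained by enumerating every group relabeling."""
    pooled = list(x) + list(y)
    n, total_n = len(x), len(x) + len(y)
    center = n * len(y) / 2.0
    observed = abs(u_statistic(x, y) - center)
    extreme = count = 0
    for idx in combinations(range(total_n), n):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(total_n) if i not in chosen]
        if abs(u_statistic(gx, gy) - center) >= observed - 1e-12:
            extreme += 1
        count += 1
    return u_statistic(x, y), extreme / count


# Hypothetical peak areas and TIC values for five CF and five DC runs.
cf = tic_normalize([8.0e6, 9.5e6, 7.2e6, 1.1e7, 6.8e6],
                   [2.0e9, 2.5e9, 2.0e9, 2.2e9, 2.0e9])
dc = tic_normalize([1.0e6, 2.0e6, 1.5e6, 0.9e6, 1.2e6],
                   [2.0e9, 2.1e9, 1.9e9, 2.3e9, 2.0e9])
u, p = mann_whitney_exact(cf, dc)
print(u, p)
```

In this illustrative data set every CF value exceeds every DC value, so U equals 25 and the exact two-sided p-value is 2/252 (about 0.0079), the value this enumeration gives for complete separation of two groups of five.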
The PRM-verified human proteins displaying differential abundance between CF and DC samples fell into several categories. A number of these were part of a network of neutrophil-associated proteins (Fig. 2B; Table 2). Some of these are well-described proteins related to inflammatory response, known to be increased in abundance in CF (e.g., neutrophil elastase [ELANE]; the S100 proteins; neutrophil collagenase [MMP8]; and myeloperoxidase [MPO]) (73-75). Some others within this network, such as tyrosine protein kinase (FGR) (76) and serine protease 57 (PRSS57) (77) (see Fig. S5; Table S7), are less well described as markers of CF, although they are known to be inflammation regulators abundant in neutrophils.
The association of other differentially abundant proteins with CF has not been as well described. Among these, two proteins involved in cell motility and cilia regulation (WD35 and IFT80) showed increased abundances in CF patients, along with the kinase and phosphatase proteins PHKB and PPP1CB. A subnetwork of three apolipoproteins (APOE, APOC3, and APOC2) showed decreased abundance in CF compared to DC patients. A number of other proteins were verified as differentially abundant but did not show known interactions with each other when analyzed via the STRING-db knowledge base.
In total, 11 PRM-verified peptides from microbial proteins showed significant differential abundance in CF samples as compared to DC samples (Fig. 3; Table 2). Nine of these microbial peptides each mapped to a distinct protein accession number (Fig. 3A), while one protein accession contained two verified peptides mapping to its sequence (Fig. 3B). For taxonomic and functional characterization, we subjected the peptides to BLAST-P analysis, Unipept analysis, and Metatryp analysis. For taxonomy (see Table 2), one peptide was assigned at the species level: Streptococcus agalactiae (A0A2G3DJD5_STRAG); two peptides were assigned at the genus level: Pseudomonas (A0A2S8YSC0_9PSED) and Mycobacterium (A0A1A2YS21_9MYCO). Four peptides were assigned to higher taxonomic levels, while four peptides were uncharacterized at the taxonomic level (see Table 2).
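Assigning a peptide to the species, genus, or a higher level reflects how many organisms share its sequence; Unipept, for instance, reports a lowest common ancestor (LCA) for each peptide. The following is a simplified sketch of the LCA idea only, not Unipept's implementation, and the lineages are abbreviated, illustrative examples rather than results from this study.

```python
from typing import List, Optional, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> Optional[str]:
    """Return the deepest taxon shared by all root-to-leaf lineages, or None."""
    if not lineages:
        return None
    lca = None
    # Walk the ranks from the root downward until the lineages disagree.
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            lca = ranks[0]
        else:
            break
    return lca


# Illustrative lineages of organisms whose proteins contain a given peptide.
strep: List[str] = ["Bacteria", "Bacillota", "Bacilli", "Lactobacillales",
                    "Streptococcaceae", "Streptococcus", "Streptococcus agalactiae"]
pseudo_a: List[str] = ["Bacteria", "Pseudomonadota", "Gammaproteobacteria",
                       "Pseudomonadales", "Pseudomonadaceae", "Pseudomonas",
                       "Pseudomonas aeruginosa"]
pseudo_b: List[str] = pseudo_a[:-1] + ["Pseudomonas fluorescens"]
human: List[str] = ["Eukaryota", "Metazoa", "Chordata", "Mammalia",
                    "Primates", "Hominidae", "Homo sapiens"]

print(lowest_common_ancestor([strep]))               # species-level assignment
print(lowest_common_ancestor([pseudo_a, pseudo_b]))  # genus-level assignment
print(lowest_common_ancestor([strep, pseudo_a]))     # higher-level assignment
print(lowest_common_ancestor([pseudo_a, human]))     # no shared taxon in this scheme
```

A peptide found in only one species resolves to that species, one shared by several species of a genus resolves to the genus, and one shared across domains has no informative assignment under this simplified scheme.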
Interestingly, one of the differentially abundant microbial proteins, identified by two distinct peptides (Fig. 3B; Table 2), had ambiguous taxonomic assignment. For this protein, the PSMs were initially matched to microbial sequences in the FASTA database; however, subsequent taxonomic analysis indicated that these sequences can also be found in eukaryotes. Specifically, peptides initially matched to A0A246E3Y5_9MICO in the sequence database search were assigned to human immunoglobulin after further taxonomic characterization (Table 2).
Functional analysis of the proteins containing the PRM-verified microbial peptide sequences (see Table 2) assigned distinct functions to seven of these proteins. Proteins for two of the detected peptide sequences could not be assigned any function (denoted as "uncharacterized" in the function column of Table 2). This highlights the lack of functional characterization for many proteins of microbial origin commonly encountered in microbiology research (78).

DISCUSSION
In this study, we sought to develop a modular, integrated analytical and bioinformatic workflow to study the dynamic microbe-host relationships in BAL samples, offering a means for new discoveries into this increasingly important aspect of CF pathogenesis (79). Direct detection of the microorganisms via metaproteomics can provide unique insights into the identity of biologically active, co-infecting organisms, as well as their functional state, which may be indicative of their interactions with the human host cells. However, in samples commonly collected in the clinic for studying airway disease, such as BAL, characterizing the relatively low-abundance microbial proteins alongside the more prominent human proteins presents a challenge. Here, we have shown how our modular workflow (80) can overcome these challenges. The discovery module analyzed proteins extracted from pooled cellular BAL samples, grouped based on SSU-rRNA information for determining bacterial diversity within each patient sample. Our focus on the insoluble cellular pellet fraction of BAL helped overcome the challenge of dynamic range suppression due to high-abundance proteins found in BAL fluid (45). The samples were collected from patients with either clinically diagnosed CF or other respiratory tract conditions, the latter serving as DC samples, and were separated into pools based on bacterial diversity in order to detect peptides more deeply by LC-MS/MS analysis. The taxonomic information from the SSU-rRNA data also enabled the construction of an aggregated protein sequence database of potential bacteria-expressed proteins for each sample pool, albeit a very large one containing millions of sequences. Our deep analysis of these four separate pooled samples via high-resolution MS and customized metaproteomic informatic tools provided a sensitive profile of the detectable bacterial and human proteins, along with initial measures of their relative abundance levels between the different patient groups.
Notably, our analysis of pooled samples amplified protein signals, such that we were able to select proteins and peptides of interest that were reliably detected across the pooled samples. This obviated the need for more complicated quantitative analyses, such as imputation of missing values, which are more commonly needed in discovery studies analyzing single patient samples. We also utilized multiple sequence database searching algorithms, which generated complementary results (Fig. S2) and maximized the number of bacterial peptide candidates identified in the discovery module. For the human proteins, our relatively simple design analyzing pooled samples also lent itself to the use of the FragPipe tool suite to infer proteins and quantify them using spectral counting. We note that other more sophisticated algorithms for quantifying bottom-up proteomics data, such as linear mixed models (81), could be used in this discovery step. We also utilized tools such as Unipept (60,71) and BLASTP to annotate the taxonomy and functions of microbe-expressed peptide sequences. Other options for such annotation do exist (82) and could be used for this analysis.
The verification and annotation module provided the next essential steps to ensure confident matches of MS/MS spectra to bacterial peptide sequences. This analysis included a rigorous verification of PSMs to bacterial sequences using the PepQuery tool (59), as well as assignment of these sequences to bacterial taxa and biochemical function, if known. This module also measured relative abundance levels of identified bacterial peptides and human proteins, as a means to further prioritize those features of highest interest for final validation.
The validation and results reporting module utilized PRM analysis, which leveraged the information on detected peptides generated in the discovery phase of the work. The PRM-based detection, coupled with data analysis using the popular Skyline tool (65) and Prosit (66), offered a means to confirm the presence of bacterial and host peptides in individual patient samples and to quantify their abundance more confidently. Although we used a predicted spectral library, use of an empirically generated library from the samples of interest may further increase the quality of the peptide identification and scoring (83,84). It should be noted that our workflow is amenable to the use of spectral libraries generated using either approach, or even a combination of the two.
Our results demonstrate the effectiveness of our modular workflow for metaproteomic characterization of clinical CF samples, with several notable accomplishments.
• First, we were able to confidently identify hundreds of potential bacterial peptides from within BAL cellular samples using our discovery module. The most confident of these peptides were further confirmed via the verification and annotation module, along with measures of their abundance between pooled samples using label-free quantification methods (51,55,58,70).
• Notably, we demonstrated, for the first time, the ability of PRM analysis to detect and quantify bacterial and human peptides in individual patient samples. This is a significant finding, opening the way to high-throughput, quantitative assays using targeted MS-based methods to measure the dynamics of microbial protein markers in larger patient cohorts. Moreover, the workflow also provided a deep characterization of human proteins and their relative abundance between sample pools, as well as detection and quantification within individual samples.
• Although previous studies have described MS-based proteomic analysis of human (85,86) or microbial (87,88) proteins from clinically relevant CF samples, ours is the first to simultaneously characterize microbe and host proteins underlying disease pathogenesis. This should open up new possibilities for studying dynamic interactions between microbes and host that may contribute to pathogenesis. These may include combinatorial statistical methods using a composite panel of microbe and host peptide abundance values to better discriminate clinically distinct patient samples (89), or an intentional focus on peptides from microbial pathogenic marker proteins, along with host proteins with immune-response functions, to understand the dynamics of infection and host defense in disease progression.
• Lastly, all of the bioinformatic tools supporting this workflow are publicly available, and many of the workflows used are available, complete with analysis settings and parameters, via the Galaxy ecosystem (46,80). It is our hope that this workflow can be used by others seeking to study CF or other respiratory diseases.
The rigorously characterized panel of peptides from both bacterial and human proteins (Table 2) is also a key deliverable from our study. The information provided on these sequences, as well as demonstration of their detection by PRM in individual samples, should enable their use as markers for targeted MS studies in larger cohorts of samples. Given that the focus of our work was to demonstrate the effectiveness of PRM in individual BAL specimens, this panel was technologically validated in a relatively small number of samples and should be seen as a preliminary panel needing further study. It is our hope that these promising peptides can serve as molecular markers for studying diverse questions related to CF, such as disease progression or response to potential therapies.
Utilizing our workflow, we were able to deliver results on human protein abundance differences in both the pooled and individual samples when comparing CF and DC patients. Although these findings were validated in a relatively small number of individual patient samples, our results profiling these targets offer some interesting potential. Some of these peptides were derived from proteins with known associations in CF, with others having more novel associations to CF lung disease pathogenesis. A number of neutrophil-expressed proteins were observed with large relative abundance in CF samples, which is unsurprising based on the established enrichment of neutrophils in the inflamed CF airway (90). Accordingly, the neutrophil proteases elastase (ELANE) and collagenase (MMP8) showed highly increased abundance in CF vs. DC patients, consistent with well-established evidence of their prominence in the CF airway (91,92). PRSS57 (also known as neutrophil protease 4) was also detected with high abundance (albeit with a single peptide; see Fig. S5; Table S7) and enrichment in CF samples in discovery experiments. PRSS57 is a more recently characterized neutrophil protease that recognizes arginines as cleavage sites (93). Our results provide the first evidence that this protease may also contribute to CF.
In addition, other markers of inflammation associated with CF showed expected differential abundance in our studies, including MPO (94) and multiple S100 proteins (95). The heterodimer of the S100A8/A9 proteins, calprotectin, is involved in host response to infection through activation of neutrophils (96). The S100A12 protein and peptides are thought to reduce colonization of Pseudomonas by inhibiting its growth and pathogenicity (97). Two other proteins with ties to inflammatory lung disease were CLCA1, a regulator of mucus production in inflammatory airway disease that may be a target for CF intervention (98), and pendrin, which interacts with cystic fibrosis transmembrane conductance regulator (CFTR) (99) and is a proposed therapeutic target of airway inflammation (100).
Our findings also point to a number of proteins with potential roles in CF that have not been studied in this disease context previously. The phosphatase PPP1CB (101) and the interacting kinase PHKB (102) have metabolic signaling roles. Although these proteins have not been linked with CF, regulation of phosphorylation of CFTR has been proposed as a mediator of therapeutic efficacy (103). Additionally, a network of apolipoproteins had lowered abundance in CF compared to DC samples. This novel finding may be connected to dyslipidemia that has been associated with CF (104) but will require further study.
In our current study, we detected differentially abundant peptides from a number of bacteria with known associations in CF, as well as others with potentially more novel roles. For example, a differentially abundant peptide came from Streptococcus agalactiae, which has been previously reported in respiratory secretions in slightly older patients (105) and is a "Group B" Streptococcus that has been found in sputum of CF patients (106). S. agalactiae has also been suggested as a biomarker for squamous cell lung carcinoma (107) but does not have a known association with CF.
The detection of a differentially abundant peptide from Pseudomonas (Fig. 3B; Table 2) was not a surprise since it is a prevalent pathogen in advanced CF disease (108) with increased abundance in CF adults (109). Interestingly, the detected peptide was from the Pseudomonas general secretion pathway protein F, a component of the extracellular type II secretion system (T2SS), which is involved in the transport of virulence factors thought to be mobilized with a rotary mechanism within the pilus (110,111). It has been postulated (112) that the T2SS has potential as a target for inhibitory therapeutics of this pathogen. We also detected a mycobacterial peptide in our analysis. An antibiotic-resistant Mycobacterium abscessus has been reported as an opportunistic pathogen in some CF infections (113). A peptide from Actinobacteria was detected (Fig. 3B; Table 2); this taxon is known to be overrepresented in the CF pulmonary microbiome (114), with a significant increase in CF as compared to DC samples (115). A differentially abundant peptide was detected from Ralstonia, whose prevalence in CF patients has been increasingly demonstrated in studies over the years (116-118). Among the other peptides detected, specific taxonomy was difficult to ascertain, as these could only be assigned to higher taxonomic classes.
Although most bacterial proteins were detected with one high-confidence peptide, we did identify and quantify one bacterial protein with increased abundance in CF with two peptides matching its sequence (Fig. 3B; Table 2). The protein (A0A246E3Y5_9MICO), although initially annotated as bacterial in origin from the discovery module sequence database, has a human IgG domain, thus making it difficult to assign to the bacterial kingdom. In fact, BLAST-P and Unipept analysis indicated that these peptides could be of human origin, from an immunoglobulin variable region. Given these discrepancies, we decided to describe this protein (along with A0A228ZSI6_ECOLX; see Fig. S6; Table S7) as being of ambiguous taxonomy. Nevertheless, it is important to note that these proteins are differentially expressed in CF, and further characterization will shed light on their origin.
Despite the achievements demonstrated by our results, both in terms of the novel methodological approach and a promising panel of human and microbial peptide markers, there are some limitations to note. For one, adding new peptides of interest to our panel would most likely require repeating the discovery and verification steps of our workflow. Although effective, this would take time and dedicated effort. A potential solution would be to incorporate emerging methods for data-independent acquisition (119,120), which collects a deep, quantitative digital archive of all detectable peptides within complex samples. Such a data archive could be mined for the presence of peptide candidates revealed from ongoing research studies and may offer a more efficient means for verifying the presence of proteins of interest in patient samples and developing targeted methods for their analysis. Additionally, although rigorously characterized in our discovery module, the differential abundance results from our peptide panel were validated in a small number of individual samples. Thus, although promising, these should be viewed as preliminary biomarker candidates for CF. Consequently, we could not draw conclusions on differences between samples with high or low microbial diversity within our study or correlate our results with other clinical variables. A study assaying these peptides in a larger cohort of patient samples is necessary to draw such conclusions and should make it possible to correlate differential abundance measurements with clinical variables or other metagenomic information, such as microbial diversity. The information we provide on this peptide panel (Table 2) should make such studies readily achievable for anyone with appropriate clinical samples and access to contemporary MS-based proteomics technologies.
In conclusion, we have demonstrated a powerful and novel workflow for discovery, verification, and validation of host and microbe peptide markers in cells collected from BAL samples of CF patients. Our well-characterized panel of peptides should provide an important starting point for larger studies profiling dynamic changes to the proteome and metaproteome in clinical samples, as a response to various experimental variables (e.g., disease progression and therapeutic treatments). As such, the work presented here should contribute significantly to the ongoing progress toward alleviating the burdens of CF lung disease.

T.J.G., P.D.J., and T.A.L.: Conceptualization; P.D.J., T.J.G., M.E.K., S.M., J.K.H., J.B.O., K.M., and L.H.: Design and methodology; M.E.K., K.M., L.H., P.D.J., S.M., K.D., J.E.J., and R.W.: Data acquisition and analysis; P.D.J., T.J.G., J.K.H., T.A.L., C.W., and M.E.K.: Data interpretation; T.J.G., P.D.J., M.E.K., S.M., J.K.H., and T.A.L.: Manuscript drafting and editing; T.J.G. and T.A.L.:

FIG 1 Overview of the analytical and bioinformatic workflow for discovery (steps 1-3), verification (step 4), and validation (steps 5 and 6) of host-microbe peptide targets from the clinical BAL cellular samples.

TABLE 2 Summary of host-microbe peptide panel verified and quantified using PRM analysis d
a Taxonomy was assigned using the (a) Unipept and (b) BLAST-P tools.
b Functions were assigned using (c) GO, (d) EC number, and (e) InterPro annotations.
c PRM chromatograms were manually verified to ensure acceptable signal-to-noise and intensity of detected transitions for each peptide, with special focus on those peptides shown with a maximum dot product (dotP) value above 0.5 as described in the text.
d m/z, mass-to-charge; GspF, general secretion pathway protein F.
Methods and Protocols mSystems July 2024 Volume 9 Issue 7 10.1128/msystems.00929-2314

FIG 2 Human peptides detected in individual cystic fibrosis lung disease (n = 5) or disease control (n = 5) BAL samples with a targeted PRM method using a Fusion Orbitrap mass spectrometer. Some peptides were analyzed in a second technical PRM replicate analysis, which included a modified list of target peptides. (a) Differentially abundant human proteins, each represented by two unique peptides quantified using abundances normalized to the TIC of the corresponding sample. Results were analyzed using Mann-Whitney U-tests. ns = P > 0.05, **P ≤ 0.01, ***P ≤ 0.001, ****P ≤ 0.0001. (b) Enriched functional networks of proteins with differentially abundant peptides. Thicker lines indicate higher confidence in network connectivity determined by STRING-db. The color scale visualizes the relative abundances. ns, not significant.

FIG 3 Differentially abundant microbial peptides detected in individual cystic fibrosis lung disease (n = 5) or disease control (n = 5) BAL samples with a targeted PRM method using a Fusion Orbitrap mass spectrometer. Some peptides were analyzed in a second technical PRM replicate analysis, which included a modified list of target peptides. Abundance values of (a) one or (b) two peptides were normalized to the TIC of the corresponding sample. Results were analyzed using Mann-Whitney U-tests. **P ≤ 0.01, ****P ≤ 0.0001.
The samples were cleaned with mixed-mode cation exchange (MCX) stage tips packed with styrene divinylbenzene-reverse phase sulfonate (Empore, cat. # 2241) extraction disks. Using centrifugation at 500 × g after each step, the tips were conditioned and equilibrated with AcN and H2O with 0.2% formic acid (FA) (at pH 2-3), respectively, and the samples reconstituted with H2O with 0.2% FA were dispensed onto the tip assembly. The samples, now bound within the stage tips, were washed with 95.0/5.0/0.2 H2O/AcN/FA, washed with 100% AcN, and eluted for collection using 40:55:5 AcN:H2O:NH4OH. Peptide amounts were measured with a Pierce Quantitative Colorimetric Peptide Assay kit (cat. # 23275).

TABLE 1 Summary of results generated by discovery and verification modules a
a CF-HD, cystic fibrosis-high diversity; CF-LD, cystic fibrosis-low diversity; DC-HD, disease control-high diversity; DC-LD, disease control-low diversity.
