PGFinder, a novel analysis pipeline for the consistent, reproducible, and high-resolution structural analysis of bacterial peptidoglycans

Many software solutions are available for proteomics and glycomics studies, but none are ideal for the structural analysis of peptidoglycan (PG), the essential and major component of bacterial cell envelopes. PG comprises glycan chains and peptide stems, both containing unusual amino acids and sugars. This has forced the field to rely on manual analysis approaches, which are time-consuming, labour-intensive, and prone to error. The lack of automated tools has hampered the ability to perform high-throughput analyses and prevented the adoption of a standard methodology. Here, we describe a novel tool called PGFinder for the analysis of PG structure and demonstrate that it represents a powerful tool to quantify PG fragments and discover novel structural features. Our analysis workflow, which relies on open-access tools, is a breakthrough towards a consistent and reproducible analysis of bacterial PGs. It represents a significant advance towards peptidoglycomics as a full-fledged discipline.


Introduction
The characterisation of bacterial cell walls started with the development of electron microscopy techniques (Mudd, et al., 1941), and it has ever since been the focus of countless studies. The major and essential component of the bacterial cell envelope is called peptidoglycan (PG). It confers cell shape and resistance to osmotic stress and represents an unmatched target for antibiotics (Mainardi, et al., 2008, Vollmer, et al., 2008). Some of the most widely used antibiotics to date (beta-lactams and glycopeptides) inhibit the polymerisation of PG.

PG (murein; originally known as mucopeptide) is a giant, insoluble, bag-shaped molecule, and its composition was characterised soon after its discovery (Cummins, et al., 1956, Rogers, et al., 1959, Weidel, et al., 1964). It is composed of glycan chains containing alternating N-acetylglucosamine (GlcNAc) and N-acetylmuramic acid (MurNAc) residues linked by β-1,4 bonds. The lactyl group of MurNAc residues is substituted by pentapeptide stems, which often have the sequence L-Ala¹-γ-D-Glu²-L-DAA³-D-Ala⁴-D-Ala⁵, where DAA is a diamino acid such as meso-diaminopimelic acid (mDAP) or L-lysine (Figure 1a) (Vollmer, et al., 2008). In some species, a lateral chain (with variable composition and length) can be found attached to the amino acid in position 3. Peptide stem composition and polymerisation can vary amongst bacterial species (Schleifer, et al., 1972). Whilst PG building blocks produced in the cytoplasm are always the same, the final structure undergoes constant hydrolysis and modification, a process referred to as "remodelling". Both remodelling and alternative polymerisation modes (Figure 1b) lead to a considerable variation in PG structure during cell growth and division.
PG structural plasticity plays a critical role in adaptation to environmental conditions during host-pathogen interaction (Boneca, et al., 2007, Juan, et al., 2018) or in surviving exposure to antibiotics (Mainardi, et al., 2008).

PG material is straightforward to purify, but the structural analysis of this molecule is challenging and remains a time-consuming and labour-intensive process. The intact molecule must be broken down into soluble fragments by enzymatic digestion with a glycosyl hydrolase (lysozyme), and individual building blocks (disaccharide-peptides, also called muropeptides) are analysed to gain insight into the structure of the intact molecule. A transformative step for the characterisation of disaccharide-peptides

Finally, we provide evidence that PGfinder can be used in conjunction with freely available MS data deconvolution software, making PG analysis possible using entirely open access tools. We propose that our approach represents a significant advance towards a consistent and reproducible analysis of PG structure, allowing peptidoglycomics to take the crucial first leap to parity with other omics disciplines.

Results

PGfinder, a dedicated script for bottom-up identification of PG fragments
No pipeline is currently available for the automated analysis of MS PG data. Therefore, we sought to replicate a shotgun proteomics approach to create an analysis pipeline dedicated to PG analysis, referred to as "peptidoglycomics" (Wheeler, et al., 2014).

To limit misidentifications due to mass coincidences, we established a search strategy relying on an iterative process (Figure 2). A first search was carried out using a database made of reduced disaccharide-peptides (monomers) and their theoretical monoisotopic masses. MS data were deconvoluted using the Protein Metrics Byos® software to generate a list of observed monoisotopic masses alongside other parameters including retention times and signal intensity. Individual theoretical masses contained in the monomer database (Figure 2, database 1) were compared with observed masses in the experimental dataset. Any observed mass within a 10 ppm tolerance was considered a match, and the corresponding inferred structure and theoretical mass were then added to a list of matched structures (Figure 2, library 1). As a second step, we used the list of matched monomers to build another database in silico (Figure 2, database 2), corresponding to dimers and trimers and their theoretical masses. Two types of polymerisation events are included in the original PGFinder version, depending on the type of crosslink: either through peptide stems or through glycan chains. Individual theoretical masses from the in-silico database were compared to observed masses to generate a list of matched dimers and trimers (Figure 2, library 2). As a third step, we combined the lists of matched monomers and multimers to generate a final library of modified muropeptides (Figure 2, library 3). The final library contained only modified muropeptides corresponding to matched monomers, dimers, and trimers.
The modifications accounted for include the presence of anhydro groups, deacetylated sugars, amidated amino acids and modifications resulting from N-acetylglucosaminidase or amidase activities (loss of GlcNAc and lack of peptide stems, respectively). In-source decay products (loss of GlcNAc) and Na+/K+ salt adducts were also added to library 3. All three libraries corresponding to observed monomers, dimers, trimers, and their modified variants were combined to search the MS data for masses matching theoretical values within a 10 ppm mass accuracy window. This search generated results processed by PGfinder to carry out a "clean up step". The intensities of in-source decay products and salt adducts were combined with those of parent ions when found within close retention time (a 0.5 min time window). The output of this final step is a matched table written to a csv format file. It contained all the inferred structures identified within the specified mass and retention time windows with an extracted-ion chromatogram (XIC) signal intensity for quantification.

Using PGfinder to investigate PG structure and identify low abundance muropeptides
The performance of the matching script was tested using the well-characterised PG from E. coli as a proof of concept. UHPLC-MS/MS data were acquired for three independent PG samples (biological replicates; Table 1-supplement 1). Following MS1 spectral deconvolution (a process calculating masses from observed m/z values), observed masses were matched to theoretical muropeptide masses according to the strategy described above (Figure 2). A first search was carried out using a minimal mass library made of 10 simple PG fragments including three glycan chains (di-, tetra- and hexasaccharides) and seven monomers (Table 1-supplement 2).
The output of the automated search is a csv file per dataset; all files corresponding to biological replicates were collated into one Excel file (Table 1-supplement 3). Each search output contained approximately 3,000 rows of masses and corresponding parameters. Depending on the dataset analysed, 41-48% of the total ion intensity was assigned to PG structures. As anticipated, inferred structures were frequently found with multiple retention times, reflecting the existence of stereoisomers, with one species accounting for most of the intensity. In some cases, observed masses matched with more than one inferred structure. The output of the automated search was consolidated as described in Table 1 (Figure 3). Based on the abundance of multimeric PG fragments, we report a crosslinking index of 15.69%, which is slightly lower than the previously reported value of 23.1% (Glauner, 1988).

The automated and unbiased search revealed several muropeptides that were expected but never reported to date. These included (i) PG fragments resulting from amidase activity (4.55%), found as "denuded glycans" (disaccharides and tetrasaccharides) and modified variants or muropeptide stem with an extra disaccharide residue; (ii) low-abundance (0.23%) PG fragments containing deacetylated GlcNAc residues; (iii) PG fragments resulting from glucosaminidase activity (0.12%). Deacetylated muropeptides were not expected since no PG deacetylase has been identified in E. coli to date. All the structures identified for the first time (glycan chains, monomers containing deacetyl groups and muropeptides lacking a GlcNAc residue) were confirmed by MS/MS analysis (Table 1-

We showed with E. coli data that PGfinder is suitable to characterise the high-resolution structure of PGs using a "bottom-up" approach.
However, this requires a careful analysis of the search output to confirm the identity of the muropeptides identified and to discriminate between multiple structures that can be assigned to a single observed mass. A more basic application is the use of PGfinder in organisms that have already been studied in detail to either compare PG composition or quantify the abundance of specific structures. This application accounts for most PG analyses described in the literature, respectively; Student's t-test) (Figure 4b). We next performed a Student's t-test using permutation-based FDR to identify statistically significant differences in the abundance of individual muropeptides between the two strains. The P value was plotted on a volcano plot against the fold change in abundance between the two samples (Figure 4c). Two muropeptides were significantly less abundant Table 2). The eight muropeptides that were not identified were absent from the list of deconvoluted masses, indicating that the problem was not associated with the script and highlighting that the deconvolution step is a source of variability. Interestingly, the observed masses calculated using Byos® Feature Finder were closer to the theoretical value (6.5 ppm versus 10.7 ppm on average), reflecting another source of variability associated with data deconvolution. Four muropeptides previously identified as containing the AEJAG pentapeptide stem were matched with distinct structures due to a mass coincidence between K and AG. A careful analysis of MS/MS spectra suggested that these muropeptides contained a K residue rather than the AG dipeptide (the y2 ion being 143 ppm away from the expected mass), showing the added value of an unbiased search.
This conclusion is supported by the retention times of the corresponding muropeptides since the tetrapeptide AEJK elutes before EAJA whilst the pentapeptide AEJAG elutes later in the chromatography (Bern, et al., 2017). It is worth noting that our search also identified a large number of muropeptides that were not reported previously (Table 2- deconvolution software (ProteinMetrics Byos® or Agilent MassHunter, respectively). We sought to identify a free alternative software for mass deconvolution to make our PG analysis pipeline accessible to everyone. MaxQuant was selected as the tool of choice since it is a widely used software package to analyse high-resolution mass-spectrometric data for shotgun proteomics (Cox, et al., 2008). Table 2-supplement 2), processed it using MaxQuant for mass deconvolution (Table 2-supplement 3) and analysed it using PGfinder for PG structure and composition identification. We were able to identify all the expected muropeptides (

This study describes a workflow for the unbiased and automated analysis of bacterial PG using freely available resources. We analysed high-resolution mass spectrometry datasets corresponding to PG fragments and demonstrated that this approach is a powerful tool to identify muropeptides and carry out comparative analyses based on the MS1 data. Another advantage is that the only information provided by the user is a restricted monomer database rather than a comprehensive library. This avoids a time-consuming operation that is prone to human error.

PGfinder uses XIC for quantification of PG fragments, providing high resolution, sensitivity, and reproducibility, as indicated by the comparisons across biological replicates (Table 1-supplement 4 and Figure 4a).
It enables the accurate quantification of molecules with overlapping retention times and those present in very low abundance, with a large dynamic range (typically six orders of magnitude). Although a MATLAB-based software package (Chromanalysis) has been described to automate the detection and quantification of UV peaks through Gaussian fitting (Desmarais, et al., 2015), quantification using XIC is more straightforward.

Our E. coli PG analysis confirmed that PGFinder is a powerful tool that provides a much improved qualitative and quantitative PG analysis (Kuhner, et al., 2014, More, et al., 2019). Combining an unbiased search with highly sensitive detection of individual structures is important for two reasons. Firstly, it opens the possibility to identify subtle modifications of PG structure, resulting from either a transient or a localised enzymatic activity such as that taking place at the septum. Secondly, it will permit the identification of previously undetected modifications that may provide new insights into our understanding of PG composition and dynamics. For example, we showed that E. coli PG contains a low abundance of de-acetylated sugars. This observation is puzzling because no canonical PG deacetylase genes have been identified in this organism. Although the biological relevance of this property remains to be established, we cannot exclude the possibility that PG deacetylation in E. coli may contribute to PG homeostasis. Another striking outcome resulting from our automated search is the identification of a slightly higher amount of muropeptides containing anhydromuramic acid (4.55%) as compared to 2-3% (Glauner, et al., 1988, Liu, et al., 2020).
To explain the discrepancy between our work and data from the literature, it is tempting to assume that most of the muropeptides containing anhydromuramic acid identified with PGFinder were simply not searched for in previous studies. It is worth pointing out that none of the papers describing PG analysis published to date has reported the list of structures searched in the MS data analysed.

One of our objectives was to create an automated PG analysis tool accessible to the broadest audience possible, including people with no prior experience with programming or coding languages. Therefore, we shared PGfinder as a Jupyter Notebook allowing users to customise the search strategy depending on both the question asked and the instrument accuracy. PGfinder is particularly suitable for the characterisation of novel PGs with unknown composition or structural modifications and can be modified by users to add novel functionalities. However, a current limitation of this workflow is that it does not process MS/MS data. Therefore, the fragmentation spectra of individual monomers must be checked using dedicated tools to validate that the inferred structures are correct. We are currently working towards an integrated pipeline that adds MS/MS analysis to our PGfinder pipeline. The ability to disable some PG modifications means that the complexity of the search can be adjusted to focus on specific properties (e.g., the occurrence of acetylation/deacetylation, or amidation) or specific muropeptides resulting from hydrolytic activities (e.g., unsubstituted MurNAc residues resulting from amidase activity). For PGs that have already been well characterised (E. coli, P. aeruginosa or C. difficile), the search parameters are already established, allowing a very straightforward analysis to be performed.
Therefore, access to a custom, semi-quantitative, sensitive analysis is ideal for comparing PG dynamics or differences in PG structure between a reference strain and isogenic mutants. Both reduced disaccharide-peptides and lactyl-peptides (generated by beta-elimination) are identified using PGfinder.

We anticipate that open access to PGfinder, in conjunction with freely available deconvolution tools, will allow researchers to carry out comparative MS1 analyses. The pipeline defined in this work enables reproducible and consistent data analysis. This represents the first step towards a standardised approach to PG analysis, opening the possibility to re-analyse datasets in repositories. The modular structure of the open-source PGfinder code can be easily integrated into any specific workflow for the automated processing of PG MS data.

Materials and Methods
Bacterial strains and culture conditions
E. coli BW25113 was grown at 37 °C in LB under agitation (250 rpm). C. difficile strains were cultured in heart infusion supplemented with yeast extract, L-cysteine and glucose in an atmosphere of 10% H2, 10% CO2 and 80% N2 at 37 °C in a Coy chamber or Don Whitley A300 anaerobic workstation.

PG purification
PG was purified from exponential (E. coli) or late exponential (C. difficile) phase as described previously (Eckert, et al., 2006, Glauner, 1988), freeze-dried, and resuspended in distilled water at a concentration of 5 mg/ml.

Preparation of soluble muropeptides
PG (1 mg) was digested overnight with 25 µg of mutanolysin at 37 °C in 150 µl of 20 mM sodium phosphate buffer (pH 5.5). Soluble disaccharide-peptides were recovered in the supernatant following centrifugation (20,000 x g for 20 min at 25 °C). They were reduced with sodium borohydride or beta-eliminated as described previously (Arbeloa, et al., 2004, Eckert, et al., 2006).

The reduced muropeptides were desalted by reverse-phase high performance liquid chromatography (rp-HPLC) on a C18 Hypersil Gold aQ column (3 µm, 2.1 x 200 mm; Thermo Fisher) at a flow rate of 0.4 ml/min. After 1 minute in water-0.1% formic acid (v/v) (buffer A), muropeptides were eluted with a 6-minute linear gradient to 95% acetonitrile-0.1% formic acid (v/v). Muropeptides were freeze-dried and resuspended in 100 µl. An aliquot of the desalted samples was analysed by rp-HPLC on the same column to measure the UV absorbance of the most abundant monomer (no isocratic step; muropeptides were eluted with a 30 min linear gradient to 15% acetonitrile-0.1% formic acid (v/v)). Samples were diluted to contain 150 mAU/µl of the major monomer. Based on the dry weight of the PG sample, we estimated that this corresponded to approximately 50 µg of material.
The crosslinking index and glycan chain length were calculated as described previously (Glauner, 1988). Label-free relative quantitation of muropeptides from triplicate C. difficile clinical isolates (R20291 and M7404) was performed using Byos 3.11, and statistical analysis of the quantitative data was performed using Perseus v. 1.6.10.53 (Tyanova, et al., 2016). Briefly, muropeptide intensities were log2 transformed and normalised by subtraction of the median value. A two-sample Student's t-test was performed with a permutation-based FDR of 0.05 to determine statistically significant quantitative differences between the strains. Comparisons between R20291 and M7404 muropeptide distributions (mono-, di-, tri-, tetra- and pentamers) were evaluated for statistical significance using GraphPad Prism (unpaired t-test).
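The normalisation described above (log2 transformation followed by subtraction of the median) can be illustrated with a minimal Python sketch. This is not the Perseus implementation: the function name `log2_median_normalise`, the assumption that the median is taken per sample, and the intensity values are all hypothetical and serve only to show the arithmetic. The permutation-based FDR step is not reproduced here.

```python
import math
import statistics


def log2_median_normalise(intensities):
    """Log2-transform intensities, then subtract the median of the logged values.

    Assumption: the median is computed over one sample (one list of
    muropeptide intensities); non-positive intensities are rejected
    because their logarithm is undefined.
    """
    if any(value <= 0 for value in intensities):
        raise ValueError("intensities must be strictly positive")
    logged = [math.log2(value) for value in intensities]
    centre = statistics.median(logged)
    return [value - centre for value in logged]


# Hypothetical XIC intensities for three muropeptides in one sample.
sample = [1.0e6, 4.0e6, 1.6e7]
normalised = log2_median_normalise(sample)
```

After this step the median of each sample sits at zero, so a fourfold difference from the median intensity appears as a value close to 2 or -2 on the log2 scale.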

Runtime environment
Code is available at https://github.com/Mesnage-Org/PGFinder (https://github.com/Mesnage-Org/pgfinder/releases/tag/v0.01 will take you to the archived release used in this manuscript). We used Python 3 to write the MS1 package and demonstrate its functionality using demo scripts. PGfinder can be run through an interactive Jupyter notebook hosted on mybinder for ease of use by those less familiar with Python code. A conda environment is provided to ensure reproducible execution. Regression testing has been implemented to ensure changes to code do not cause changes to important results. The GitHub repository contains an interactive version to run a user's analysis and an end-to-end demo using sample data provided with the script (Interactive mybinder). The sample data is a MaxQuant deconvolution output from the E. coli MS data analysed in the paper. The current version of the script can handle both .txt (MaxQuant) and .ftrs (Byos®) deconvoluted data and offers the possibility for the user to include several modifications in the search. The time window for the "clean up step" (in-source decay and salt adducts) as well as the ppm tolerance for matching can also be defined by the user; the default values corresponding to these parameters used in this work are 0.5 min and 10 ppm.
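To make the role of the ppm tolerance concrete, the following minimal Python sketch shows how an observed mass can be compared with theoretical masses within a user-defined tolerance. It is an illustration only and does not reproduce the PGFinder code or its API: the function names `ppm_error` and `match_masses`, the structure names and the mass values are hypothetical placeholders, not real muropeptide masses.

```python
def ppm_error(observed, theoretical):
    """Signed mass error of an observed mass, in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_masses(observed_masses, database, tolerance_ppm=10.0):
    """Return (observed, structure, theoretical) for every pair within tolerance.

    `database` maps a structure name to its theoretical monoisotopic mass.
    An observed mass may match more than one structure (mass coincidence),
    so all matches are kept.
    """
    matches = []
    for observed in observed_masses:
        for structure, theoretical in database.items():
            if abs(ppm_error(observed, theoretical)) <= tolerance_ppm:
                matches.append((observed, structure, theoretical))
    return matches


# Hypothetical database and observations (placeholder values).
database = {"structure_A": 1000.0, "structure_B": 2000.0}
observed_masses = [1000.005, 1000.02, 2000.01]

matches = match_masses(observed_masses, database, tolerance_ppm=10.0)
```

In this example the first and third observed masses lie about 5 ppm from a theoretical value and are retained, whereas the second lies about 20 ppm away and is discarded with the default 10 ppm tolerance.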

The identification of muropeptides was carried out using four successive steps, indicated by different colours (orange, green, blue, red, respectively). As a first step, observed masses in the dataset are compared to a list of theoretical masses corresponding to monomers (database 1). Matched masses within the ppm tolerance set (10 ppm for Orbitrap data) are used to build a list of inferred monomeric structures and their corresponding theoretical masses (library 1). This is then used to generate a list of theoretical multimers (dimers and trimers) and their masses (database 2). A second matching round is carried out to build a list of inferred multimers (library 2). At this stage, matched monomers and multimers are combined to generate a list of modified muropeptides (library 3). Two libraries of matched theoretical masses (monomers and dimers, trimers) and a third library (their modified counterparts) are used to search the dataset. Muropeptide structures are inferred from a match within tolerance between theoretical and observed masses. This data is then "cleaned up" by combining the intensities of ions corresponding to in-source decay and salt adducts to those of parent ions. The final matched MS data is then written to a .csv file.
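The "clean up" step described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions and is not the PGFinder implementation: the row layout (`structure`, `rt`, `intensity`, `parent`), the function name `clean_up`, and the choice to keep adduct rows that have no parent within the time window are all hypothetical.

```python
def clean_up(rows, time_window=0.5):
    """Add adduct and in-source decay intensities to nearby parent ions.

    Each row is a dict with keys 'structure', 'rt' (retention time, min),
    'intensity', and 'parent' (None for a parent ion, otherwise the name
    of the parent structure). A derived ion is merged into the closest
    parent of the right structure eluting within `time_window` minutes;
    derived ions without such a parent are kept unchanged (assumption).
    """
    parents = [dict(row) for row in rows if row["parent"] is None]
    unmerged = []
    for row in rows:
        if row["parent"] is None:
            continue
        candidates = [
            parent for parent in parents
            if parent["structure"] == row["parent"]
            and abs(parent["rt"] - row["rt"]) <= time_window
        ]
        if candidates:
            closest = min(candidates, key=lambda p: abs(p["rt"] - row["rt"]))
            closest["intensity"] += row["intensity"]
        else:
            unmerged.append(dict(row))
    return parents + unmerged


# Hypothetical matched rows: one parent ion and two salt adducts.
rows = [
    {"structure": "monomer_X", "rt": 10.0, "intensity": 100.0, "parent": None},
    {"structure": "monomer_X + Na", "rt": 10.2, "intensity": 20.0, "parent": "monomer_X"},
    {"structure": "monomer_X + K", "rt": 12.0, "intensity": 5.0, "parent": "monomer_X"},
]

cleaned = clean_up(rows, time_window=0.5)
```

Here the sodium adduct elutes 0.2 min from the parent ion and its intensity is added to it, whereas the potassium adduct elutes 2 min away and is left as a separate row.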