Development of an inexpensive matrix-assisted laser desorption—time of flight mass spectrometry method for the identification of endophytes and rhizobacteria cultured from the microbiome associated with maize

Many endophytes and rhizobacteria associated with plants support the growth and health of their hosts. The vast majority of these potentially beneficial bacteria have yet to be characterized, in part because of the cost of identifying bacterial isolates. Matrix-assisted laser desorption-time of flight (MALDI-TOF) has enabled culturomic studies of host-associated microbiomes but analysis of mass spectra generated from plant-associated bacteria requires optimization. In this study, we aligned mass spectra generated from endophytes and rhizobacteria isolated from heritage and sweet varieties of Zea mays. Multiple iterations of alignment attempts identified a set of parameters that sorted 114 isolates into 60 coherent MALDI-TOF taxonomic units (MTUs). These MTUs corresponded to strains with practically identical (>99%) 16S rRNA gene sequences. Mass spectra were used to train a machine learning algorithm that classified 100% of the isolates into 60 MTUs. These MTUs provided >70% coverage of aerobic, heterotrophic bacteria readily cultured with nutrient rich media from the maize microbiome and allowed prediction of the total diversity recoverable with that particular cultivation method. Acidovorax sp., Pseudomonas sp. and Cellulosimicrobium sp. dominated the library generated from the rhizoplane. Relative to the sweet variety, the heritage variety c ontained a high number of MTUs. The ability to detect these differences in libraries, suggests a rapid and inexpensive method of describing the diversity of bacteria cultured from the endosphere and rhizosphere of maize.

222 Data). Raw mass spectra and Bruker system identifications are available as dataset 223 MSV000086274 in MassIVE (Deutsch et al. 2016). Representative isolates of MTUs that were not identified at the species level using the Biotyper 231 database and software (Biotyper score < 2), were identified by 16S rRNA gene sequencing. DNA 232 was extracted from isolates cultured overnight in tryptic soy broth (30 °C, 200 rpm) with the 233 Puregene kit, following the manufacturer's protocol (Qiagen, Carlsbad CA). A near intact 234 fragment of the 16S rRNA gene was amplified with ReadyMade™ Primers 16SrRNA For and 235 16srRNA Rev supplied by IDT (Coralville, IA) using DreamTaq Hot Start DNA polymerase, 236 following the manufacturer's protocol (ThermoFisher, Waltham, MA). PCR reactions were 237 preheated (94 °C, 2 m) and then cycled 32 times through the following steps: denaturing (94 °C, 238 30 s), annealing (56 °C, 30 s) and extension (72 °C, 90 s). The final extension step was extended 239 for 5 minutes and product size was confirmed by electrophoresis with a FlashGel™ DNA Kit 240 (Lonza, Basel, Switzerland). PCR products were then purified with the DNA Clean & 241 Concentrator kit, following the manufacturer's protocol (Zymo, Irvine, CA). Purified fragments 242 were sequenced with the above ReadyMade™ Primers, using Sanger technology, by Lone Star 243 Laboratory (Houston, TX).  The majority of bacteria isolated from maize did not match reference spectra in the Bruker 257 system. Of mass spectra generated from the 132 representative colony morphologies cultured 258 from the rhizoplane and endosphere of two varieties of maize (Table 1), ten failed the initial 259 quality control of the Bruker system and eight failed a SNR threshold we describe below. Of the 260 remaining 114 isolates, 36 were identified at the species level with probable confidence, as 261 defined by a classification score of > 2 with the Biotyper system. Mass spectra from these 114 262 isolates were clustered into 60 MTUs with the following pipeline. These MTUs agreed with 263 operational taxonomic units assigned by 16S rRNA gene sequencing.

264
265 Iteratively trying different parameters, like tolerance and SNR for aligning mass spectra and 266 detecting peaks respectively, identified a set that optimized the number of peaks shared between 267 spectra generated from BTS references spotted on separate targets. The number of peaks shared 268 between reference spectra reached an asymptote after a dozen iterations (Fig. S1). Extending the 269 number of iterations of the optimization loop to 2,000 identified parameters that corresponded to 270 the highest Jaccard similarity coefficients (0.952), in terms of peaks matched between spectra 271 generated from BTS. Pairwise comparisons of these BTS spectra shared 20/21 peaks and showed 272 cosine similarities that ranged from 0.963 -0.970. Jaccard similarity coefficients between these 273 reference spectra showed a modal relationship with the total number of peaks detected in this set 274 of samples (Fig. S2 top). The zenith of this relationship occurred between 389 -406 total peaks 275 detected from the 174 spectra that passed initial quality control within the Bruker system. The 276 parameters that yielded the most reproducible peaks were selected for identification of quality 277 spectra, as defined by the number of the SNR for the ten largest peaks and the number of peaks 278 detected. Quality control analysis, using a threshold of a median SNR of 15 for the 10 largest 279 peaks and detection of at least 13 peaks, pruned mass spectra generated from 8 of 122 isolates 280 from subsequent cluster analysis. The second optimization loop found, for pairs of isolates that 281 had practically identical 16S rRNA sequences (> 99 %), the average Jaccard coefficient for 282 comparisons reached a zenith at 498 peaks detected from 114 isolates (Fig S2 bottom).

283
284 Cosine similarities calculated with the parameters that showed the greatest discrimination 285 between comparisons within species than between species, suggested a hyperbolic relationship 286 with the Jaccard similarities ( Fig. 1). Most (6472/6670) of the pairwise comparisons showed 287 Jaccard similarity coefficients of less than 0.2. In that range, cosine similarities averaged (±SD) 288 0.27 ± 0.10. For the 192 pairwise comparisons that showed Jaccard similarities greater than or 289 equal to 0.2, cosine similarities averaged 0.81 ± 0.16. Jaccard similarities of 0.2 corresponded to 290 cosine similarities of about 0.69. This suggests that spectra that share very few peaks may show 291 relatively high cosine similarities, which may reflect the variability in peak heights. Accordingly, 292 we weighted cosine similarities by Jaccard coefficients (see Methods).

294
The percent identities of pairwise comparisons 16S rRNA genes fit a sigmoidal model with 295 weighted cosine similarities (Fig. 2). The fit had a modest correlation coefficient (r 2 = 0.445) and 296 analysis of variance suggested the trend was highly probable (P < 0.0001); however, most 297 (259/325) of the pairwise comparisons showed percent identities of 16S rRNA genes of less than 298 90%. The midpoint of this sigmoidal model corresponded to a cosine similarity of 0.66. Cosine 299 similarities of > 0.66 consistently corresponded to a percent identify of ribosomal sequences 300 ranging from 99 -100 % (Fig. 2). 301 302 Inspection of Figure 2 suggested that cosine similarity of 0.65 would distinguish species. Using 303 that threshold, clustering of a library of 114 isolates into MALDI-TOF taxonomic units (MTUs) 304 defined 60 MTUs. Of these, 25 clusters contained more than one isolate (replicated MTUs) and 305 35 were singletons. Extrapolation from rarefaction analysis predicted that a library of 228 306 isolates would contain 84 MTUs, with a confidence interval of 70 -98 (Fig. 3 top). This library 307 of isolates corresponds to 70 % coverage, with a confidence interval of 62 -77 % (Fig. 3  308 bottom). In other words, as defined by MTUs, the pooled library contained the majority of 309 bacterial species that could be readily isolated from the rhizoplane and endosphere of both maize 310 varieties we sampled. Further extrapolation predicted that a library of 228 isolates would provide 311 78 -95 % coverage. 312 313 Partial sequencing of 16S rRNA genes yielded sequences that were used to classify isolates that 314 were not matched to the commercial database into 18 phylotypes that corresponded to different 315 genera (Fig. 4). Phylotypes defined by rRNA gene sequencing generally appeared robust, as 316 assessed by bootstrap values > 90% (Fig. 4). The exception was separation of the branch 317 containing isolates that classified as Paenibacillus and Bacillus species, which bootstrapping 318 supported only 76% of iterations. Within this Bacillales order, and in terms of rRNA gene 319 sequences, isolate H1OZ89 showed high similarity to a phosphate solubilizing Paenibacillus 320 species previously isolated from rape (Ding et al. 2019). Also, within this order, isolate 321 H3OZ122 showed high similarity to a microplastics degrading Bacillus strain recently isolated 322 from the Yellow Sea (Wang et al. 2019), isolate S4OZ125 showed high similarity to Bacillus 323 aciditolerans, a recently described novel species isolated from a rice field (Ding et al. 2019) and 324 H3OZ107 showed similarity to a Staphylococcus hominis strain isolated from a blood sample 325 (Fig. 4). These isolates (H1OZ89, H3OZ122, S4OZ125 and H3OZ107) clustered into separate 326 MTUs, which suggests the similarity threshold defined MTUs corresponds to species that occupy 327 different niches.  Figure 4, the threshold for clustering appeared conservative. The 333 rRNA sequences generated from seven isolates showed > 99 % identity with species of 334 Acidovorax, and each other, but were separated into three MTUs (29, 35 and 54). MTU 35 335 corresponded to a branch in Figure 4 supported with 94 % bootstrapping value, which suggests 336 that cluster is coherent. On the other hand, MTUs 25 and 54 appear to have isolates of the same 337 species, split into two clusters. 338 339 Matching to reference spectra did not appear to depend on the quality of the spectra, as 340 classification scores showed no relationship with the SNR of spectra. The set of identified 341 species contained 35 isolates that classified as bacteria and one isolate that classified as a fungal 342 species (Trichosporon mucoides). Of the 35 bacterial isolates 6 were singletons. The 29 343 replicated isolates were classified into 11 species. These isolates generally corresponded to 344 specific MTUs. For example, the 6 isolates reliably identified as Pseudomonas corrugata all 345 clustered into MTU 20 (Fig. 5). Similarly, isolates identified by the Bruker system as 346 Pseudomonas mendocina, Enterobacter cloacae, Serratia marcescens, Bacillus firmus, 347 Cellulosimicrobium cellulans, Pseudomonas citronellolis, Rhizobium radiobacter and 348 Acidovorax facilis clustered within MTUs 34, 33, 3, 21, 8, 4, 25 and 29 respectively (Fig 5). 349 Bootstrap values, for iterations of cluster analysis, generally supported the congruence of these 350 MTUs with species identifications. Bootstrap values for these clusters exceeded 95 %, with the 351 exception of MTU 34 (Fig. 5). However, the dendrogram was not coherent with respect to the 352 phylogeny of the isolates. In particular, Pseudomonas species were dispersed throughout the tree 353 and appeared within a branch that contained Gram positive species (Fig. 5) and, for the three 354 isolates the system identified as P. putida, the threshold used to define MTUs appeared 355 conservative. These isolates were split into two MTUs (Fig. 5). The lone isolate in MTU 26 356 showed little similarity, in terms of cosine similarity, to any other isolate. 358 Machine learning resolved the apparent discrepancy for the isolates that clustered into different 359 MTUs but were identified as one species by the Bruker system. To train the machine learning 360 algorithm, MTUs were identified by the Bruker system and 16S sequencing. When the Bruker 361 and 16S identification differed, the 16S identification was used but this was rare. Identification 362 by the Bruker system and 16S sequencing agreed at the genera level, for Bruker scores greater 363 than or equal to 1.6; however, the two approaches showed some discordance at the species level. 364 For example, MTU 33 was reliably identified (Bruker score = 2.5) as Enterobacter cloacae but 365 these isolates clustered with Enterobacter ludwigii in Figure 4. MTU 34 was probably identified 366 (Bruker > 2) as Pseduomonas mendocina but these isolates clustered with P. alcaliphila in 367 Figure 4. MTU 35 were identified by the Bruker system as Acidovorax facilis with scores > 2.2 368 but a representative of that cluster showed high similarity, in terms of 16S sequence, to A. 369 wautersii (Fig. 4). This MTU was defined as A. wautersii. When species were split into two 370 MTUs, they were pooled. For example, the three isolates identified, at the probable level, as P. 371 putida by the Bruker system were defined as that species, even though they belonged to two 372 MTUs (Fig. 5). MTUs 54 and 29 were identified as one clade Acidovorax sp. based on Figure 4. 373 After discarding 34 singletons, this classification scheme provided at least genera level identities 374 for all but two of the remaining isolates. The machine learning algorithm classified 100% of the 375 80 replicated isolates, and the two BTS references, correctly as 22 MTUs. This accuracy of 376 classification is reflected in a Cohen's Kappa coefficient of 1, where this coefficient ranges from 377 -1 to 1. 378 379 A combination 16S sequencing and MALDI TOF methods identified distinct MTUs that were 380 relatively abundant to particular niches or plant varieties. Libraries generated from the rhizoplane 381 contained the highest number of replicated MTUs overall (Fig. 6). Of those 20 MTUs, 14 did not 382 appear in libraries generated from the endosphere. These rhizoplane-specific MTUs included 383 numerous representatives of two Acidovorax sp. and Flavobacterium sp. Niveispirillum sp., 384 Pseudoxanthomonas sp., Rhizobium radiobacter and Shinella sp. (Fig. 6). MTUs specific to the 385 endosphere included a yet-to-be identified cluster (MTU02), Serratia sp., Arthrobacter sp., and 386 Pseudomonas sp. (Fig. 6). MTUs identified as Pseudomonas sp., Cellulosimicrobium sp., 387 Bacillus sp., and Enterobacter sp. were found in both the rhizoplane and the endosphere (Fig. 6). 388 Libraries generated from the heritage variety contained more MTUs (11) unique to that variety 389 than libraries generated from the sweet variety (Fig. S3). Optimization of cluster analysis of mass spectra generated from a library of bacteria isolated 394 from maize microbiomes produced coherent MTUs. These MTUs agreed with species defined by 395 16S rRNA gene sequencing and by the Bruker system and with niches typically occupied by 396 species related to these MTUs. The library of readily culturable bacteria isolated from the 397 rhizoplane of the heritage variety contained the most strains that were specific to a particular 398 niche. That is strains that abound in libraries generated from that sample and were not common 399 in libraries generated from the endosphere of that variety or from rhizoplane and endosphere of 400 the sweet variety. The rhizoplane and endosphere have many species in common, but it is logical 401 that the rhizoplane community is more diverse given its direct contact with soil, which are 402 diverse systems (Howe et al. 2014). In fact, endophyte communities may be a subset of the more 403 extensive rhizosphere communities (Long et al. 2010).

405
The relatively high abundance of Acidovorax and Flavobacteria in libraries generated from the 406 rhizoplane (Fig. 6) is consistent with previous culture-dependent surveys of plant 407 microbiomes. Acidovorax is commonly found in both soil and plants (Long et al. 2010) and 408 Flavobacteria genera includes a terrestrial clade associated with plant roots plant roots (Kolton 409 et al. 2013). However, these genera are not consistently associated with maize roots, as assessed 410 by metagenomic analysis (Niu et al. 2017). We also isolated Rhizobium radiobacter 411 (formerly Agrobacterium tumefaciens). This soil bacteria interacts with plants and strains of this 412 species show potential for bioremediation applications (Deepika et al. 2016) and novel species of 413 this genera have been isolated from maize roots (Gao et al. 2017). Microbacterium has been 414 isolated specifically from maize kernels (Rijavec et al. 2007). This endophyte may have been in 415 the seed when it was sown.    The pooling of isolates collected at different stages of growth, and isolated on different media, 508 into one library limits the scope of this study and precludes testing of hypothesis about the 509 controls of the diversity and limits our ability to estimate the coverage these libraries provided. 510 Defining the diversity of maize microbiome, within different locations of the host and between 511 different varieties of maize, would require generation of much larger libraries then is feasible 512 with culture-dependent techniques. Metagenomic analysis is better suited testing hypotheses and 513 provides a more complete census of the microbial communities in nature (Handelsman 2004); 514 however, this dataset revealed trends consistent with metagenomic studies, ecological theory and 515 published observations. 516 517 In this study, we used the maize microbiome as model and demonstrate that MALDI-TOF 518 systems provide a relatively low-cost method of identifying bacteria isolated from plant 519 microbiomes. This proteomics approach is available to the general scientific community through 520 core facilities, like the Huck Institute a Pennsylvania State University, which processed the 521 samples herein. The costs of analysis by vendors is about a dollar per isolate for pre-spotted 522 targets. The ethanol treatment / formic acid extraction applied herein is robust. Undergraduates 523 in a teaching laboratory at the University of Houston -Clear Lake generated the protein extracts 524 presented herein and dozens of students in this laboratory routinely generate quality spectra the 525 first time they try this technique. The primary cost associated with this approach is purchasing of 526 the reusable targets (~ $500 each) and the scripts presented herein allow users to process the data 527 with packages freely available in R.

529
The availability of MALDI-TOF analysis, through core facilities and commercial vendors, as 530 well as the development of data analysis pipelines, like the one presented herein, should allow 531 investigators to test the coherence of MTUs in terms of phenotypic traits and genotypes. 532 However, to fully realize the potential of MALDI-TOF for identifying bacteria associated with 533 plants, we need to build public databases of both raw and processed mass spectra and the 534 metadata associated with the isolates. In other words, the community needs a resource analogous 535 to the short archive available for next and third generation sequencing data. MassIVE provides 536 such a resource (Deutsch et al. 2016) but only a handful of mass spectra of bacteria isolated from 537 the plant microbiome have been archived in that database to date. We echo the call for the 538 development and use of such a central depository of mass spectra generated from plant-539 associated microbes (Ahmad et al. 2012) and environmental isolates in general.