Structure and function of bacterial metaproteomes across biomes

Soil microbes, and the proteins they produce, are responsible for a myriad of soil processes that are integral to life on Earth, supporting soil fertility, nutrient fluxes, trace gas emissions, and plant production. However, how and why the composition of soil microbial proteins (the metaproteome) changes across wide gradients of vegetation, climatic and edaphic conditions remains largely undetermined. By applying high-resolution mass spectrometry to soil samples collected from four continents, we identified the most common proteins in soils, and investigated the primary environmental factors driving their distributions across climate and vegetation types. We found that soil proteins involved in carbohydrate metabolism, DNA repair, lipid metabolism, transcription regulation, tricarboxylic acid cycling, nitrogen (N) fixation and one-carbon metabolism dominate soils across a wide range of climates, vegetation types and edaphic conditions. Vegetation type and climate were important factors determining the community composition of the topsoil metaproteome. Moreover, we show that vegetation type, climate, and key edaphic properties (mainly soil C fractions, pH and texture) influenced the proportion of important proteins involved in biogeochemical cycles and cellular processes. We also found that protein-based taxonomic information has a greater resolution than 16S rRNA gene sequencing with regard to the ability to detect significant correlations with environmental variables. Together, our work identifies the dominant proteins produced by microbes living in a wide range of soils, and advances our understanding of how environmental changes can influence the structure and function of the topsoil metaproteome and the soil processes that it supports.


Introduction
Topsoil proteins are the ultimate catalysts of a multitude of biological functions (Hettich et al., 2013; Starke et al., 2019) that allow microbial communities to drive fundamental ecosystem services, including soil fertility, climate change regulation, pollutant degradation and waste decomposition (Bardgett and van der Putten, 2014; Crowther et al., 2019; Delgado-Baquerizo et al., 2020). Previous studies have investigated the cross-biome drivers of taxonomic and functional diversity across contrasting vegetation and climatic types using DNA amplicon sequencing and metagenomics (Bahram et al., 2018; Delgado-Baquerizo et al., 2018; Fierer et al., 2012). However, DNA-based methods are limited by their reduced capacity to account for active microbial communities and processes, which restricts the capacity of metagenomics to efficiently predict ecosystem functions (Carini et al., 2016).
Proteomics has been proposed as a suitable approach for assessing the functionality of soil microbial communities by identifying the real catalysts of soil processes (the proteins) at the local scale (Bastida et al., 2016; Hultman et al., 2015; Liu et al., 2019; Starke et al., 2017). Yet, we know very little about how and why the structure and function of the topsoil metaproteome change across contrasting, globally distributed vegetation types and climates. Advancing our knowledge of the structure and function of the topsoil metaproteome is integral to predicting shifts in important microbially driven processes on a changing planet. Here, we conducted a survey of soils collected from 60 sites that span broad gradients in environmental context (edaphic conditions, climate, and vegetation types) (Fig. S1; Table S1), and characterized the topsoil metaproteomes and the relative abundance of individual protein categories using high-resolution mass spectrometry. For instance, mean annual precipitation and temperature ranged from 81 to 2161 mm and from −2.8 to 21.0 °C, respectively. Soil pH and carbon content ranged from 3.7 to 9.1 and from 0.09 to 37.81%, respectively. We focused on bacterial proteins because bacteria are among the most abundant and diverse organisms on Earth (Bardgett and van der Putten, 2014; Delgado-Baquerizo et al., 2018).
Recent evidence suggests that microbial activity is particularly responsive to changes in plant communities and associated shifts in soil organic carbon (C) availability (Bastida et al., 2016; Lladó et al., 2019; Žifčáková et al., 2017). Here, we test the hypothesis that the metaproteome of bacterial communities differs across vegetation and climate types. This hypothesis is based on the assumption that the same factors that are known to be important in structuring bacterial communities, including soil pH and the quantity and quality of soil organic C (Fierer, 2017; Bahram et al., 2018), also vary across vegetation and climate types (Jobbágy and Jackson, 2000). Because of this, we hypothesized that vegetation structure and contrasting C fractions (e.g., light vs. mineral-associated carbon) should be important drivers of the soil metaproteome. As a whole, our work aims to identify the most common protein groups in soils and advance our knowledge of the major environmental drivers regulating the distribution of the soil metaproteome across broad gradients of vegetation, climatic, and edaphic conditions.

Soil sampling
Soil and vegetation data were collected between 2016 and 2017 from 60 locations distributed across climate and vegetation types (SI Appendix, Fig. S1). These locations include a wide range of globally distributed soil, vegetation (forests, prairies and shrublands) and climate (mesic and drylands, based on aridity index) types. Sampling was designed to obtain wide gradients of edaphic characteristics (Bastida et al., 2019; Delgado-Baquerizo et al., 2019) (Table S1). For example, precipitation and temperature ranged from 81 to 2160 mm per year and from −2.80 to 21.0 °C, respectively. A standardized protocol was used for the soil survey (Maestre et al., 2012). In each location, we surveyed a 50 m × 50 m plot. Three parallel transects of the same length, spaced 25 m apart, were added. The cover of perennial vegetation was measured in each transect using the line-intercept method (Maestre et al., 2012). Plant cover ranged between 6.7 and 100%. One composite topsoil sample (five 0-10 cm soil cores) was collected under the dominant ecosystem features across our plots. Following field sampling, soils were sieved (<2 mm) and frozen at −20 °C. Climatic information was extracted from WorldClim (Hijmans et al., 2005).

Soil chemical, physical and microbial analyses
For all soil samples, we measured electrical conductivity, pH, texture, soil organic C (soil C) content and available P (Olsen P) content. Soil properties were determined using standardized protocols (Maestre et al., 2012). Soil pH was measured in all the soil samples with a pH meter, in a 1:2.5 mass:volume soil and water suspension. Soil texture (% of fine fractions: clay + silt) was determined according to Kettler et al. (2001). Soil organic C was determined colorimetrically after oxidation with K2Cr2O7 and H2SO4 at 150 °C for 30 min (Anderson and Ingram, 1993). Total N was obtained using a CN analyzer (LECO CHN628 Series, LECO Corporation, St Joseph, MI, USA). The content of Olsen P was analyzed following the method of Olsen and Sommers (1983). Soil C content ranged between 0.1 and 38%, pH between 3.8 and 9.1, and the percentage of clay + silt between 0.3 and 62%.
A soil sample representative of each site was fractionated to isolate soil organic C pools that differ in stability and in their mechanisms of protection from decomposition. We used the density-based fractionation scheme developed by Golchin et al. (1994) with slight modifications (Plaza et al., 2019; Sohi et al., 2001) to isolate two different soil organic C fractions: a free organic C fraction (FR_OC), located between soil aggregates and accessible to decomposers; and a mineral-associated organic C fraction (MA_OC), protected from microbial decomposition by sorption to mineral surfaces. The light fraction (FR_OC) separated here by density has been suggested to be similar to the particulate organic matter (POM) fraction separated by size (Lavallee et al., 2020; Mikutta et al., 2019). The isolated fractions were oven-dried at 60 °C, weighed, ground with a ball mill and analyzed for organic C concentration. The organic C concentration was determined by dry combustion using an elemental analyzer; before the analysis, the MA_OC fraction was fumigated with HCl to remove carbonates (Harris et al., 2001).
The composition of soil bacterial communities was analyzed through amplicon sequencing using the Illumina MiSeq platform. Ten grams of frozen soil (from the composite soil samples described above) were cooled using liquid nitrogen and ground using a mortar and pestle. Soil DNA was extracted using the Powersoil® DNA Isolation Kit (MoBio Laboratories, Carlsbad, CA, USA). A portion of the bacterial 16S rRNA gene was sequenced using the 515F/806R primer set (Lauber et al., 2009; Ramirez et al., 2014). Bioinformatic analyses were carried out with QIIME (Caporaso et al., 2010), USEARCH (Edgar, 2010) and UNOISE3 (Edgar, 2016). Phylotypes (i.e., ASVs) were identified at the 100% identity level. A detailed description of the DNA approaches can be found in Delgado-Baquerizo et al. (2019).

Protein extraction from soil and mass spectrometry analysis
Protein extraction was performed according to the method described elsewhere (Bastida et al., 2014; Chourey et al., 2010). Cell lysis and disruption of soil aggregates were performed by boiling at 100 °C for 10 min in sodium dodecyl sulfate (SDS) buffer. The proteins were separated by 12% SDS-PAGE and, after electrophoresis, the gels were stained using colloidal Coomassie brilliant blue. The gel area containing the protein mixture of each sample was cut out as a single piece. The samples were further processed by in-gel reduction and alkylation of cysteine residues, in-gel tryptic cleavage, and elution as well as desalting of tryptic peptides (Bastida et al., 2016). The peptide lysates were reconstituted in 0.1% formic acid prior to LC-MS measurement. Separation of peptide lysates was performed using an 85-min, non-linear gradient from 3.2% to 80% acetonitrile, in 0.1% formic acid, on a C18 analytical column (Acclaim PepMap100, 75 μm inner diameter, 25 cm, C18, Thermo Scientific) in a UHPLC system (Ultimate nanoRSLC 3000, Thermo Fisher Scientific, Idstein, Germany). Mass spectrometry was performed on a Q Exactive HF mass spectrometer (Thermo Fisher Scientific, Waltham, MA, USA) coupled with a TriVersa NanoMate (Advion, Ltd., Harlow, UK) source in LC chip coupling mode. The full scans were measured in the Orbitrap mass analyzer within the mass range of 350-1550 m/z, at 60,000 resolution, using an automatic gain control target of 1 × 10^6 and a maximum fill time of 100 ms. The MS/MS isolation window for ions in the quadrupole was set to 1.4 m/z. The MS/MS scans were acquired using the higher-energy collisional dissociation mode at a normalized collision energy of 28%, within a scan range of 200-2000 m/z and using a resolution of 15,000. The exclusion time to reject masses from repetitive MS/MS fragmentation was set to 30 s. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE (Perez-Riverol et al., 2018) partner repository with the dataset identifier PXD018448.
For data analysis, the raw files from the mass spectrometer were converted into peak lists in the Mascot generic file format (MGF) using MSConverter. Protein database searches and subsequent functional and phylogenetic analyses were carried out using the MetaProteomeAnalyzer (Heyer et al., 2019) (MPA, version 2.14) with the OMSSA and X!Tandem search engines. Search parameters were set to 10 ppm peptide ion tolerance and 0.1 Da fragment ion tolerance, with carbamidomethylation as a fixed modification and methionine oxidation as a variable modification, using UniProtKB/SwissProt as the protein database (roughly 556,000 proteins, 11/2017). The database searches were performed in target-decoy mode and a false discovery rate of 10% was used. This high rate was chosen because this study focuses on broad protein categories and covers a wide variety of heterogeneous soils. Proteins were grouped if they shared at least one peptide across all samples. A comparison matrix was created listing the non-redundant spectrum count per protein group per sample. The comparison matrix was exported and used for subsequent data analysis. From each soil sample, we obtained the relative abundance of different protein groups, as well as their taxonomic origin at phylum and order levels (Supporting Information, Dataset).
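The grouping and normalization steps described above can be illustrated with a minimal sketch. It is not the MetaProteomeAnalyzer implementation: the function names `group_proteins` and `relative_abundance` are hypothetical, the example identifiers are made up, and the sketch assumes that grouping is transitive (if A shares a peptide with B, and B with C, all three end up in one group), which the text does not specify.

```python
from collections import defaultdict


def group_proteins(protein_peptides):
    """Merge proteins that share at least one peptide (union-find).

    protein_peptides maps a protein identifier to the set of peptide
    sequences identified for it. Grouping is assumed to be transitive.
    """
    parent = {p: p for p in protein_peptides}

    def find(p):
        # Follow parent links to the group representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    first_owner = {}  # peptide -> first protein seen with that peptide
    for protein, peptides in protein_peptides.items():
        for pep in peptides:
            if pep in first_owner:
                parent[find(protein)] = find(first_owner[pep])
            else:
                first_owner[pep] = protein

    groups = defaultdict(set)
    for protein in protein_peptides:
        groups[find(protein)].add(protein)
    return [frozenset(g) for g in groups.values()]


def relative_abundance(spectral_counts):
    """Convert spectral counts per group into percentages of the sample total."""
    total = sum(spectral_counts.values())
    if total == 0:
        return {k: 0.0 for k in spectral_counts}
    return {k: 100.0 * v / total for k, v in spectral_counts.items()}


if __name__ == "__main__":
    # Made-up identifiers and counts, for illustration only.
    demo = {"P1": {"pepA", "pepB"}, "P2": {"pepB", "pepC"}, "P3": {"pepD"}}
    print(group_proteins(demo))
    print(relative_abundance({"group1": 3, "group2": 1}))
```

Expressing each group as a percentage of the total spectral count in a sample makes samples with different numbers of identified spectra comparable, which is the form of data used in the statistical analyses.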

Statistical analyses
We used permutational analysis of variance (PERMANOVA) to test for significant differences in the relative abundance of protein groups across climate (dryland vs. mesic) and vegetation (forests, prairies and shrublands) types. In these analyses, each plot is considered a statistical replicate. Put simply, we use Earth as a grid across which we collect data from different plots or sites (replicates) belonging to different ecosystem types. Taking more than one sample within each plot would have constituted pseudo-replication, as our question concerned differences in the abundance of protein groups across vegetation and climate types globally rather than across plots within a given ecosystem type. Further, gradient designs, such as the one used here, are powerful tools for detecting patterns in ecological responses to continuous and interacting environmental drivers, as they generally outperform replicated designs in terms of prediction success of responses (Kreyling et al., 2018). Spearman correlations and Robust Linear Models (RLM) were used to investigate the relationships between the relative abundance of protein groups or microbial populations and environmental variables. We used Spearman correlations because they provide useful information on the direction and strength of the association between two variables, do not require normality of data, and do not strictly assume linearity. Non-metric multidimensional scaling (NMDS) was used to study the structure of bacterial metaproteomes across vegetation and climate types.
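For readers unfamiliar with PERMANOVA, the following sketch shows the core of the method in a simplified one-way form: a pseudo-F statistic computed from a distance matrix, with a P value obtained by permuting group labels. It is an illustration under stated assumptions, not the analysis pipeline used here: the study applied a two-way design with an interaction term, the software used is not named in the text, Bray-Curtis is assumed as the distance measure, and all function names and the example data are hypothetical.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def distance_matrix(samples):
    """Pairwise Bray-Curtis dissimilarities between all samples."""
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]


def pseudo_f(dist, labels):
    """One-way PERMANOVA pseudo-F computed from squared distances."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2
                         for k, i in enumerate(idx)
                         for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    a = len(groups)
    if ss_within == 0:
        return float("inf")  # no within-group variation at all
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dist, labels, permutations=999, seed=0):
    """Return (pseudo-F, permutation P value) for a one-way design."""
    rng = random.Random(seed)
    observed = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Made-up relative abundances of three protein groups in six plots.
    samples = [[10, 0, 5], [9, 1, 5], [8, 1, 6],
               [0, 10, 5], [1, 9, 5], [1, 8, 6]]
    labels = ["forest"] * 3 + ["shrubland"] * 3
    print(permanova(distance_matrix(samples), labels))
```

Because significance comes from shuffling labels rather than from a parametric distribution, the test makes no normality assumption, which suits compositional data such as relative protein abundances.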

Results and discussion
We quantified the proportion (relative abundance; %), based on their spectral abundances, of a total of 5286 proteins that were assigned to 193 different functional groups (dataset, Supporting Information). Among them, the most abundant proteins were those involved in protein biosynthesis, transcription, amino acid biosynthesis, stress response, transcription regulation, glycolysis, tricarboxylic acid cycle, ion transport, DNA repair, lipid metabolism, N fixation, antibiotic resistance, one-carbon (C1) metabolism, cell cycle and cell division, lipid biosynthesis, carbohydrate metabolism, fatty acid metabolism, and DNA replication (Fig. S2). Some continental and global metagenomic soil surveys have also reported a high abundance of genes involved in protein metabolism, amino acid and carbohydrate transport and metabolism, replication, DNA metabolism, energy production, transcription and cell division, among others (Fierer et al., 2012; Bahram et al., 2018). Our study reflects the dominance of these protein groups (Fig. S2), but also the high proportion of proteins involved in N fixation, C1 metabolism and antibiotic resistance. These results may indicate that proteins involved in rhizosphere colonization, where C1 metabolism (Knief et al., 2012) and microbial competition (i.e., antibiotic resistance mechanisms; Hou and Kolodkin-Gal, 2020) are critical traits, are among the most important proteins in a wide variety of soils across biomes.
Our results indicate that metaproteomic analyses complement DNA-based analyses of taxonomic composition (proportion of amplicon sequence variants, ASVs, or phyla determined from 16S rRNA gene amplicon sequencing; relative abundance, %) and provide a unique perspective on soil bacterial communities. In particular, Mantel tests provided evidence that there is no significant correlation between the distance matrix (Bray-Curtis) based on the taxonomic composition of bacterial communities (proportion of ASVs) and that based on the community composition of the soil metaproteome (proportion of 193 groups of proteins). This absence of correlation was also confirmed when we repeated Mantel analyses based on the proportion of taxa within the most abundant bacterial phyla (Fig. 1 and Fig. S2). These analyses suggest that our results are consistent across contrasting phylogenies and unlikely to be biased toward the preferential annotation of proteins from any given dominant phylum (Fig. 1). This lack of correlation between 16S rRNA gene amplicon analyses of DNA and metaproteomics could derive from the large abundance of inactive DNA in soil (Carini et al., 2016), while some proteins can remain active through their stabilization in organic matter and clay (Burns et al., 2013); from different methodological biases; or from different turnover times of the biomolecules (Starke et al., 2019). Another explanation is that protein abundances may simply be poorly correlated with taxon abundance, either because of the large variation in protein production or turnover rates across different taxa, or because similar proteins may be produced by very different taxa. Together, these findings suggest that metaproteomics provides complementary information to that of DNA-based marker gene analyses when evaluating the links between environmental factors and soil microbial communities. Our results are in agreement with earlier studies showing that DNA-based surveys likely miss considerable portions of active microbial populations when compared to RNA approaches (Baldrian et al., 2012), and with local studies that have highlighted distinct microbial community compositions when comparing DNA- and protein-based approaches (Bastida et al., 2017; Thorn et al., 2019).
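The Mantel comparison of two distance matrices described above can be sketched as follows. This is a minimal illustration, not the software used in the study: it assumes a Pearson correlation between the upper triangles of the two matrices (the text does not state which correlation coefficient was used), a two-sided permutation test, and hypothetical function names and input values.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the entries above the diagonal of a square matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def mantel(d1, d2, permutations=999, seed=0):
    """Return (r, P) for the association between two distance matrices.

    Rows and columns of d2 are permuted together, which preserves the
    internal structure of the matrix while breaking the pairing of sites.
    """
    rng = random.Random(seed)
    n = len(d1)
    x = upper_triangle(d1)
    observed = pearson(x, upper_triangle(d2))
    order = list(range(n))
    hits = 0
    for _ in range(permutations):
        rng.shuffle(order)
        permuted = [[d2[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if abs(pearson(x, upper_triangle(permuted))) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


if __name__ == "__main__":
    # Made-up dissimilarities among four sites, for illustration only.
    taxa = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]
    proteins = [[0, 2, 1, 3], [2, 0, 3, 1], [1, 3, 0, 2], [3, 1, 2, 0]]
    print(mantel(taxa, proteins))
```

Permuting whole rows and columns, rather than individual distances, is what distinguishes a Mantel test from an ordinary correlation test: the pairwise distances within a matrix are not independent of each other.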
Overall, proteins produced by Proteobacteria, and Alphaproteobacteria in particular, dominated the metaproteome profiles in terms of their relative abundances, followed by Actinobacteria and Firmicutes (Fig. 2A). Other relatively abundant phyla represented in the topsoil metaproteome included Bacteroidetes, Tenericutes, Cyanobacteria, Acidobacteria, Chlorobi and Spirochaetes (Fig. 2A). These results provide a complementary picture of the dominant phyla in soils across the globe, compared with observations of bacterial community composition inferred from 16S rRNA gene amplicon sequencing. For example, in terms of protein abundance, the relative abundance of Proteobacteria (ranging between 60 and 70% of the identified proteins) was greater than that observed from sequencing-based taxonomic approaches in our soils (30-40%, Fig. 2B), and in other large-scale genomic soil surveys (Bahram et al., 2018; Delgado-Baquerizo et al., 2018). Interestingly, a greater proportion of Proteobacteria proteins relative to the abundance of this group in 16S rRNA gene sequencing has also been noted in other metaproteome studies using different protein extraction methods and downstream bioinformatics pipelines (Thorn et al., 2019), indicating that the contribution of this phylum to the overall soil metaproteome is consistent among studies. This difference between genomic and proteomic approaches could be a product of the reference protein databases being biased towards well-characterized taxa (such as Proteobacteria), in contrast to other taxa, such as Acidobacteria, that are underrepresented in the reference protein databases. Consequently, further investigations are needed to provide a more comprehensive picture of the taxonomy of bacterial communities based on soil metaproteomes, in particular for uncharacterized taxa. For this purpose, it would be important to improve the annotation of proteins that likely originate from poorly characterized taxa.
We also found that protein-based taxonomic information (proportion of phyla/classes; relative abundance, %) has a greater resolution than 16S rRNA gene sequencing with regard to the ability to detect significant correlations with environmental variables (Fig. 2A and B). The number of significant correlations between environmental variables and the proportion of each phylum derived from protein data was higher than that found for taxonomic information from amplicon sequencing at a comparable phylum level (Fig. 2). For example, we found that the proportion of Alphaproteobacteria proteins (including Rhizobiales, Fig. S3) correlated positively with plant cover, mean annual precipitation, soil C content and the light mineral-free C fraction (FR_OC; the fraction of organic matter not associated with minerals and thus more likely to be accessible to decomposers), while these correlations were not apparent from comparable analyses of alphaproteobacterial proportions as determined from 16S rRNA gene sequencing (Fig. 2; Fig. S4). Similarly, previous studies have revealed a high relative abundance of proteins from Rhizobiales in soils with greater organic matter availability (Bastida et al., 2015; Starke et al., 2016). Likewise, the relative abundance of cyanobacterial proteins correlated negatively with plant cover and the accessible C fraction (FR_OC), while this correlation was not evident from the 16S rRNA gene analyses. Cyanobacteria have been shown to be proportionally important in soils with low carbon content (Kuske et al., 2002; Belnap and Lange, 2003). Moreover, in agreement with DNA-based studies highlighting an important role of texture in the composition of the soil microbial community (Kallenbach et al., 2016), our metaproteome study also captured some significant effects of soil physical parameters (i.e., clay plus silt content) on the proportion of microbial proteins from Bacteroidetes, Planctomycetes and Gammaproteobacteria (Fig. 2; Fig. S4).
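The Spearman correlations used to relate taxon or protein proportions to environmental variables amount to a Pearson correlation computed on ranks. The sketch below shows this with average ranks for ties. It is a generic illustration with hypothetical function names and made-up values, not the authors' code, and it omits the significance test that underlies the P < 0.05 filter.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation; assumes neither input is constant."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


if __name__ == "__main__":
    # Made-up values: plant cover (%) and a protein group's proportion (%).
    plant_cover = [10, 35, 50, 80, 100]
    proportion = [1.2, 1.9, 1.7, 3.5, 9.0]
    print(spearman(plant_cover, proportion))
```

Because only the ordering of values matters, a monotonic but strongly non-linear relationship still yields a coefficient close to 1 or −1, which is why the method does not require linearity or normally distributed data.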
We then conducted a two-way PERMANOVA to investigate the role of climate (dryland and mesic) and vegetation (forests, prairies and shrublands) in driving the distribution of the metaproteome composition (proportion of 193 protein groups) across biomes. We found that both vegetation (P = 0.002) and climate (P = 0.034) influenced the community composition of the soil metaproteome, and further identified an interaction between both factors (vegetation × climate interaction, P = 0.016), suggesting that the effects of climate on soil metaproteomes might be vegetation-type dependent (Fig. S5). These results suggest that shifts in vegetation, for example through land use change, encroachment, or deforestation associated with desertification (Berdugo et al., 2020), and climate change might have noticeable impacts on the overall community composition of the topsoil bacterial metaproteome (Bastida et al., 2018). Some of the effects of vegetation on the topsoil metaproteome might be indirectly driven by soil edaphic conditions. For example, vegetation types could have important effects on the soil metaproteome by regulating the amount and quality of organic matter inputs to the soil system (Jobbágy and Jackson, 2000; Quideau et al., 2001). Indeed, soil C content and the most accessible C fraction (FR_OC) were lower in shrublands than in prairies and forests (Table S1), and the composition of the soil metaproteome of shrublands differed from that found in prairies and forests, particularly those in mesic climates (Fig. S5).
We next sought to determine the specific groups of soil proteins that were significantly associated with contrasting climates or vegetation types (Fig. 3). For this purpose, we performed a two-way PERMANOVA, and selected those protein groups that were significantly influenced (P < 0.05) by either climate or vegetation type, but lacked a significant interaction between both factors (P > 0.05) (Fig. 3; Table S2). By doing so, we can interpret the effects of climate and vegetation on the soil metaproteome independently, without the complication of interactions between climate and vegetation types. We conducted further Spearman correlation analyses to identify the main climatic and soil properties associated with these protein groups (Fig. 4). Our results provide solid evidence that the proportions of important groups of proteins involved in the metabolism and/or biosynthesis of carbohydrates, fatty acids, lipids and amino acids, and in transcription regulation, differ significantly across vegetation types and/or climate (Fig. 3). Many of these protein groups have been found in high relative abundances in continental-scale metagenomic surveys (Bahram et al., 2018; Fierer et al., 2012; Noronha et al., 2017). Further, our topsoil metaproteome survey allowed us to investigate important groups of proteins associated with particular biogeochemical processes. For example, the proportion of proteins involved in C1 metabolism (i.e., formaldehyde-activating enzyme, methanol dehydrogenase, methylamine metabolism) was influenced by vegetation, but not by climate type, and was higher in prairies than in forests or shrublands worldwide (Fig. 3). The metabolism of plant cell wall compounds, such as lignin, can produce methanol (Galbally and Kristine, 2002), which might be metabolized as a source of carbon and energy by some microbes producing these enzymes (Sy et al., 2005). Further, C1 metabolism has been suggested to be important during bacterial colonization of the phyllosphere, plant residues and the rhizosphere (Knief et al., 2012). Our results also indicate that the proportion of C1 metabolism proteins correlated significantly and positively with the fraction of organic C associated with minerals, which is more protected from microbial decomposition (Lavallee et al., 2020), but did not correlate with total soil C content (Fig. 4).
Furthermore, the proportion of proteins associated with carbohydrate metabolism, including xylose metabolism and glycolysis, was higher in drylands than in more mesic ecosystems (Fig. 3). The relative abundance of this protein group correlated positively with the fraction of mineral-associated C and was higher in fine-textured soils with high pH, and correlated negatively with soil C/N ratio. However, there was no correlation with soil C content (Fig. 4). These results suggest that the quality (rather than quantity) and microbial availability of C, which are regulated by organo-mineral interactions (Lavallee et al., 2020; Six et al., 2002), may be important factors shaping the relative abundance of proteins involved in carbohydrate metabolism. Under conditions of intermediate or low C availability, such as those usually found in drylands (Maestre et al., 2012) (Table S1), a higher proportion of carbohydrate-metabolizing proteins might help microbes to access this source of mineral-protected C (MA_OC) (Fig. 4). In contrast, we found that the relative abundance of fatty acid and lipid metabolism proteins was greater in more mesic soils (Fig. 3). Further, our metaproteomic analyses also suggest that the microbial communities from more mesic soils with greater organic C accessibility tend to have a higher relative abundance of proteins associated with the biosynthesis of polyhydroxybutyrate (PHB) (Figs. 3 and 4), a C storage polymer that is used as an energy reserve by some soil microbes (Wang et al., 2006).
Environmental context also had an important role in regulating the relative abundance of N-fixation proteins across contrasting soil conditions. Our study revealed that vegetation type is an important factor shaping the proportion of this fundamental group of proteins (Fig. 3). The proportion of proteins involved in N fixation was highest in forest soils, followed by prairies, with shrubland soils having the lowest relative abundance of N-fixation proteins. These results are in agreement with the greater estimated biological N fixation in forests than in grasslands (Yu and Zhuang, 2020). The relative abundance of this protein group correlated positively with organic C accessibility (FR_OC fraction), soil C content, and aridity index, and negatively with soil pH (Fig. 4). A national-scale metagenome study in the United Kingdom highlighted that soils with low pH and high organic matter content supported microbial communities with more abundant genes for N fixation (Malik et al., 2017). Moreover, our metaproteome survey shows some additional functional adaptations of microbial communities to particular environmental conditions. For example, the relative abundance of proteins involved in iron storage was higher in drylands than in mesic soils, and particularly in shrublands (Fig. 3). We found negative correlations between the relative abundance of proteins involved in iron storage and aridity index, and with edaphic variables such as C/N ratio and FR_OC (Fig. 4). The greater relative abundance of proteins involved in iron storage in drylands can be seen as a bacterial adaptation to the low iron availability associated with high soil pH in these environments (Moreno-Jiménez et al., 2019).
Soil metaproteomics is a promising approach to study microbial functionality. However, it is still in its relative infancy, and such analyses are associated with some important caveats that need to be considered. First, the absolute amount (mass) of extracted proteins was not quantified, and thus there may be differential extraction efficiencies across soils. Second, we acknowledge that higher protein identification rates would likely be obtained if we had paired the metaproteomic data with metagenomic information (Starke et al., 2019), and that general databases (such as the one used here, UniProtKB) can be biased towards dominant cultured and sequenced bacteria. Moreover, current metaproteome assessments do not properly identify soil extracellular proteins, including extracellular hydrolases, which are crucial for element cycles and soil fertility. This is likely because current methods do not adequately extract and identify extracellular proteins that are stabilized in organic matter and on soil particles (Bastida et al., 2018; Starke et al., 2019). Conversely, the preferential extraction of intracellular and membrane proteins can favour the identification of active proteins over relic ones, which are released and stabilized in soil after cell death and lysis. Further, our metaproteomic approach might be selective for certain microbial groups, such as Proteobacteria. This can bias the results towards functional traits of this phylum, including a high proportion of proteins involved in N fixation by Rhizobiales and in C1 metabolism. However, our results focus on broad protein categories, and despite these limitations, we were able to identify the most dominant proteins in soils with contrasting properties across different environmental contexts. Future work should, however, seek to improve this methodology in order to advance our understanding of both dominant and rare proteins in soils globally.
Together, our topsoil metaproteome study helps advance our knowledge of the environmental drivers of the structure and function of the topsoil metaproteome, and provides critical insights into the adaptation of soil bacterial communities to their environment across climate and vegetation types. Importantly, we highlight the absence of correlation between community composition based on a DNA approach (16S rRNA gene sequencing) and that based on protein analyses, which shows that both approaches provide different information on the soil microbiome. Although metaproteomic approaches are still evolving to overcome technical and methodological limitations, our study provides an important step toward building a predictive understanding of the environmental factors controlling topsoil metaproteomes. Vegetation structure, climate, and key soil properties played fundamental roles in shaping the distributions of proteins involved in biogeochemical cycles. This information is integral to managing and supporting the functioning of sustainable terrestrial ecosystems in a world subject to important environmental changes.

Funding
Data deposition: the primary data used in this paper will be deposited in Figshare and provided as a Supporting Information dataset. The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE repository with the dataset identifier PXD018448.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Fig. 1. Relationship between the topsoil microbiome and metaproteome across locations. Correlations between community composition (via 16S rRNA gene sequencing) and functional community composition (via metaproteomics). The x-axes represent the community composition of bacteria (Bray-Curtis dissimilarity) and the y-axes represent the community composition of soil proteins (Bray-Curtis dissimilarity).

Fig. 2. Taxonomic information based on the topsoil metaproteome. Relative abundances of microbial taxa through proteomics (A) and 16S rRNA gene sequencing (B) across climate and vegetation types. The right part of each panel includes Spearman correlations between microbial abundances and environmental variables. Only significant correlations are included (P < 0.05) and the color scale represents the correlation coefficient. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

Fig. 3. Key functional proteins across climates and vegetation types. Relative abundance (%; mean ± SE) of functional groups of proteins across climate (A) and vegetation (B) types. For each factor, only biological functions with significant differences (PERMANOVA, P < 0.05) and no significant interaction (vegetation × climate, P > 0.05) are shown.

Fig. 4. Environmental drivers of the topsoil metaproteome. Spearman correlations between functional groups of proteins and environmental variables (A). Only significant correlations are included (P < 0.05) and the color scale represents the correlation coefficient. Relationships between environmental variables and the relative abundances of different protein groups (B). Circle color: green, forests; yellow, prairies; brown, shrublands; black, dryland; white, mesic. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)