Dataset for effects of the transition from dry forest to pasture on diversity and structure of bacterial communities in Northeastern Brazil

The data in this article supplement the research article titled “Forest-to-pasture conversion modifies the soil bacterial community in Brazilian dry forest Caatinga” (manuscript ID: STOTEN-D-21-19067R1). This data article includes the analysis of 18 chemical variables in 36 composite soil samples (with 4 replicates) from the Microregion of Garanhuns (Northeast Brazil), as well as partial 16S rRNA gene sequences from genomic DNA extracted from 27 of these samples (the 3 best-quality replicates), obtained by paired-end sequencing (up to 2 × 300 bp) on the Illumina MiSeq platform (NCBI BioProject accession: PRJNA753707). Soils were collected in August 2018 in a tropical subhumid region of the Brazilian Caatinga, along with 27 composite samples from the aboveground part of pastures to determine nutritional quality based on leaf N content. The tables present the analysis of variance (ANOVA) and post-hoc tests of the environmental data and of the main alpha-diversity indices, based on linear mixed models (LMM). In these models, the collection region (C1 – Brejão, C2 – Garanhuns, and C3 – São João) was the random-effect variable, and the adjacent habitats, formed by a forest (FO) and two pastures (PA and PB, which succeeded this forest), composed the fixed-effect variable (land cover), nested within C. In addition, a table shows the similarity percentages breakdown (SIMPER), a procedure to assess the average percent contribution of individual bacterial phyla and classes. The figures show the study location, the sampling procedure, and the vegetation status through the Normalized Difference Vegetation Index (NDVI), in addition to the general abundance and composition of the main bacterial phyla.

Value of the Data
• The dataset provides relevant information about the main effects of conversion from native dry forest to pasture on chemical and biological variables of the soils, especially on enzyme activity and on the structure, composition, and diversity of bacterial communities in the first 10 cm of soil depth.
• Researchers interested in bioinformatics, soil fertility, environmental conservation, microbial ecology, and remote sensing aimed at pasture recovery and monitoring of successional forest areas will find this dataset valuable.
• The data can be used to study changes in bacterial community structure due to changes in land cover and land use in semi-arid regions. In addition, the data can be used in metagenomic predictions based on the 16S rRNA gene, ecological model building, and various bioinformatics applications.

Data Description
The raw data deposited include the panchromatic image at 2 m resolution and the multispectral compositions of the study area at 8 m resolution. Both covered the same frame, imaged by the Multispectral and Panchromatic Wide-Scan Camera (WPM) sensor of the CBERS-04A satellite, with radiometric and geometric system corrections refined by the use of control points and a digital elevation model (level 4 processing). The imaged swath of this frame was 92 km, with a raw data rate of 1800.8 Mbps for the panchromatic image and 450.2 Mbps for the spectral images. The spectral bands provided were: PAN – Panchromatic (B0: 0.45-0.90 μm); B – Blue (B1: 0.45-0.52 μm); G – Green (B2: 0.52-0.59 μm); R – Red (B3: 0.63-0.69 μm); and NIR – Near Infrared (B4: 0.77-0.89 μm). These data were used to study and choose the collection areas (Fig. 1A), to calculate the Normalized Difference Vegetation Index (NDVI) for pastures and forests (Fig. 1B), to differentiate the productivity levels of the studied pastures (Fig. 2A), to detect the variables most influential on NDVI through linear models (Fig. 2B), and to support the statistical design and sampling procedures (Fig. 3). Under these conditions, 36 composite soil samples were equally distributed (12 samples each) among three habitats: forest (FO), less productive pastures (PA), and more productive pastures (PB), according to NDVI values. All habitats were nested within three distinct cities (Fig. 3), constituting 3 habitats × 3 cities × 4 replicates.
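As a minimal sketch of the NDVI calculation mentioned above, the standard definition NDVI = (NIR − R) / (NIR + R) can be applied pixel by pixel to the red (B3) and near-infrared (B4) bands. The function names and the toy reflectance values below are illustrative assumptions and do not reproduce the processing done for this dataset.

```python
def ndvi(nir, red):
    """Return NDVI for one pixel, or None where both bands are zero."""
    total = nir + red
    if total == 0:
        return None  # undefined: no signal in either band
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() element-wise to two equally shaped 2-D lists."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Hypothetical 2 x 2 reflectance values (not taken from the dataset).
nir_band = [[0.50, 0.40], [0.30, 0.0]]
red_band = [[0.10, 0.20], [0.30, 0.0]]
result = ndvi_grid(nir_band, red_band)
```

Values close to 1 indicate dense green vegetation, while values near 0 indicate sparse cover, which is how NDVI can separate less productive from more productive pastures.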
The genetic sequences available were from 27 of the 36 total samples, representing the samples with the highest concentration and quality of purified genomic DNA after extraction from soil, evaluated in a NanoDrop® 2000 spectrophotometer (Thermo Fisher Scientific Inc., Waltham, MA, USA). These samples are identified in the file "Soil Chemistry and Enzymes.xlsx", available at the Mendeley Data link. The sequences consist of 16S rRNA libraries amplified with the primers Bakt_341F and Bakt_805R [1] and sequenced on the Illumina MiSeq platform (paired-end: 2 × 300 bp). The reads were trimmed and filtered, the forward and reverse reads were merged, and chimeric sequences were removed. The structure, composition, and relative abundance of the main bacterial phyla detected in the nine distinct environments are presented (Fig. 4), as well as the statistics of the assumptions (Table 1b) and of the significant differences in α-diversity indices between the forest and the two pastures (Table 2b), also according to LMM. In addition, ANOVA and post-hoc tests were done for the relative abundance data of the phyla (Table 3), and the contribution of major phyla and classes to the dissimilarity between FO, PA, and PB was calculated to weigh the participation of their respective components in these niches (Table 4).
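The per-taxon contribution to between-habitat dissimilarity (Table 4) is the kind of quantity a SIMPER-style breakdown produces. The following is a minimal sketch, assuming Bray-Curtis dissimilarity and hypothetical relative abundances; the function name `simper` and the data are illustrative and do not reproduce the software implementation used for this dataset.

```python
from itertools import product


def simper(group_a, group_b):
    """Average per-taxon contribution to Bray-Curtis dissimilarity.

    group_a, group_b: lists of samples; each sample is a list of
    abundances given in the same taxon order.
    Returns (average contribution per taxon, percentage contribution).
    """
    n_taxa = len(group_a[0])
    sums = [0.0] * n_taxa
    pairs = list(product(group_a, group_b))  # all between-group pairs
    for a, b in pairs:
        denom = sum(a) + sum(b)
        for i in range(n_taxa):
            # Taxon i's share of the Bray-Curtis dissimilarity of this pair
            sums[i] += abs(a[i] - b[i]) / denom
    avg = [s / len(pairs) for s in sums]
    total = sum(avg)  # overall average dissimilarity between the groups
    pct = [100 * c / total if total else 0.0 for c in avg]
    return avg, pct


# Hypothetical relative abundances (%) of three taxa in two habitats.
forest = [[60, 30, 10], [50, 40, 10]]
pasture = [[30, 30, 40], [20, 40, 40]]
avg, pct = simper(forest, pasture)
```

In this toy example the first and third taxa account for most of the dissimilarity, while the second taxon, which is similarly abundant in both habitats, contributes little.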

Experimental design
The study was conducted in August 2018 in a tropical subhumid region of Pernambuco state, Northeastern Brazil. Soil bacterial communities and soil variables were assessed using a sampling design based on a linear mixed model (LMM), where the sampling geographic region (C1 – Brejão, C2 – Garanhuns, and C3 – São João) was the random-effect variable (secondary factor) and the habitats formed by a forest (FO) and two pastures (PA and PB) composed the fixed-effect variable (land cover as the main factor), nested within geographic region. The study of independent variables via LMM is considered a weighted approach for biological systems because it demonstrates the overall response of fixed effects (land cover) nested within the random effect (geographic regions), where the latter absorbs variation in the intercepts of the statistical model [2]. Four 2.5 ha quadrats (replicates) were randomly located at each of the nine sampling sites (3 cities × 3 habitats), totalling 36 composite soil samples for chemical and genetic analyses and 24 composite pasture aerial samples for foliar nitrogen determination.
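The nested layout described above can be enumerated in a few lines. This is a sketch only; the sample-ID format (`C1-FO-1`) is a hypothetical naming scheme, not the labels used in the deposited files.

```python
from itertools import product

REGIONS = ["C1", "C2", "C3"]    # Brejão, Garanhuns, São João (random effect)
HABITATS = ["FO", "PA", "PB"]   # forest and two pastures (fixed effect)
REPLICATES = [1, 2, 3, 4]       # quadrats per sampling site

# One composite soil sample per region x habitat x replicate combination.
soil_samples = [
    f"{region}-{habitat}-{rep}"
    for region, habitat, rep in product(REGIONS, HABITATS, REPLICATES)
]

# Number of sampling sites and of samples per habitat.
sites = {(region, habitat) for region, habitat in product(REGIONS, HABITATS)}
per_habitat = {
    habitat: sum(1 for s in soil_samples if f"-{habitat}-" in s)
    for habitat in HABITATS
}
```

The enumeration gives nine sites, 36 soil samples, and 12 samples per habitat, matching the 3 habitats × 3 cities × 4 replicates design.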

Sample collection
Each of the 36 soil samples and each of the 27 pasture samples was composed of 10 subsamples randomly collected in each quadrat to ensure homogeneity. The pasture samples were cut 10 cm above the surface, and the soil samples were collected from the 0 to 10 cm layer, placed in plastic bags, and preserved on site in thermal boxes with ice. The samples were then taken to the Microbiology and Enzymology Laboratory of the Federal University of Agreste Pernambuco (Garanhuns, PE, Brazil), where part of each soil sample was separated and preserved in an ultra-freezer at -80 °C for further chemical and enzymatic analysis and genomic DNA extraction.

Analytical approaches
The physicochemical properties of the soils were determined according to the methodologies provided in the EMBRAPA manual [3], verifying soil texture, pH in water (1:2.5 v:v), pH in CaCl2 (1:2.5 v:v), and Al, H + Al, P, Ca, K, Mg, and Na content. The methodologies for determining the total organic carbon (TOC) and microbial biomass carbon (MBC) and for quantifying the activities of the enzymes β-glucosidase (Beta), acid phosphatase (Aci.P), alkaline phosphatase (Alk.P), and urease (Ure) in the soils are detailed in the main research article related to this data article.

Map editing and NDVI calculation
The maps of the studied region were edited based on the panchromatic and multispectral images from the WPM sensor of the CBERS-04A satellite (L4) made available on the INPE website (http://www.dgi.inpe.br). The images were processed using QGIS 3.10.3 software (http://www.qgis.org), using the coordinate system SIRGAS 2000 / UTM zone 24S (EPSG:4674). The RGB bands were merged to assess vegetation cover and the suitability of the areas for collection. Then,

Determination of foliar nitrogen in pasture
The leaf nitrogen was estimated by adapting the sulfuric digestion method of Malavolta et al. [5]. The digest solution was prepared in a 1000 mL beaker by adding the substances in the following order: 175 mL of distilled water, 3.6 g Na2SeO3, 21.39 g Na2SO4, 4.0 g CuSO4·5H2O, and finally 200 mL of concentrated H2SO4. The ground samples of plant material (sieved on 2 mm mesh) were weighed (100 mg) and digested in tubes with 7 mL of the digesting solution, raising the temperature of the digester block by 50 °C every 30 minutes until it reached 350 °C and keeping it at this temperature until the solution became colorless or slightly greenish. Next, the

Table 4. Contributions of the main phyla and classes of bacteria (%) to the dissimilarity (AD) between the three environments.

Sequences processing
A total of 1,997,557 raw sequence pairs (forward and reverse) read by Illumina MiSeq sequencing were analyzed using the 'DADA2' pipeline version 1.16 [6] in R version 3.6.3 [7] in conjunction with RStudio 1.4.1717 [8]. The FIGARO tool [9] was used to optimize the truncation-length parameters of the "filterAndTrim" R function (276 bases for forward reads and 209 bases for reverse reads). According to this tool, forward and reverse reads with more than 4 and 3 expected errors (maxEE), respectively, were discarded. Next, the error rates of the sequences were calculated with the "learnErrors" function, a machine learning-based algorithm; the amplicon sequence variants (ASVs) were inferred using the "dada" function; and the paired reads were merged by passing the outputs of the previous functions to "mergePairs". Chimeric sequences were identified using the "removeBimeraDenovo" function, and taxonomic assignments were then given to the remaining sequences based on the SILVA SSU 132 (modified) database [10], using the "IdTaxa" algorithm from the 'DECIPHER' v 2.20 R library [11], a method whose classification performance is considered better than the standard set by the naive Bayesian classifier [12].
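The maxEE criterion above relies on the expected number of errors in a read, computed from its Phred quality scores as EE = Σ 10^(−Q/10). The sketch below illustrates truncation followed by the expected-error filter in plain Python. It is not the DADA2 implementation, and the function names and toy quality scores are assumptions for illustration.

```python
def expected_errors(quals):
    """Sum of per-base error probabilities derived from Phred scores."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_filter(quals, trunc_len, max_ee):
    """Truncate a read to trunc_len bases, then keep it if EE <= max_ee.

    Reads shorter than trunc_len are rejected.
    """
    if len(quals) < trunc_len:
        return False
    return expected_errors(quals[:trunc_len]) <= max_ee


# Toy forward reads with made-up Phred scores (not from the dataset).
good_read = [30] * 276               # EE = 276 * 0.001 = 0.276
poor_read = [30] * 200 + [10] * 76   # EE = 0.2 + 7.6 = 7.8
short_read = [30] * 250              # shorter than the truncation length

kept = [
    passes_filter(read, trunc_len=276, max_ee=4)
    for read in (good_read, poor_read, short_read)
]
```

With the forward-read settings reported above (truncation at 276 bases, at most 4 expected errors), only the first toy read is kept.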

Statistical analysis
Statistical analyses were also done in R version 3.6.3 [7] in conjunction with RStudio 1.4.1717 [8]. The natural log (ln) transformation was applied to the raw data so that they approximated a normal distribution with constant variance; a small adjustment (0.001) was added to all observations to avoid errors in the ln transformation before the analysis of variance and the checking of the assumptions of normality and homoscedasticity. Variables expressed as percentages (y%) were transformed by the function sin⁻¹[√(y%/100)] × 180/π. These transformations are recommended to control error rates in biological data; they yielded acceptable residual-versus-fitted plots and p-values similar to those of the original data [13]. Analyses were conducted with linear mixed-effects models (LMM) fitted with the 'lmer' function of the 'lme4' R package [14], together with the 'stats' R package [7]. The similarity percentages breakdown (SIMPER) was used to calculate the contribution of phyla and classes to the dissimilarity between habitats (forest and pastures) using Past 4.0 software [15]. Analysis of deviance was done using ANOVA type III Wald F tests with Kenward-Roger degrees of freedom (df) for both fixed and random effects. All chemical and enzymatic analyses used 36 composite soil samples.
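The two transformations described above can be sketched as follows: the ln transformation with the 0.001 adjustment, and the arcsine square-root transformation of percentages expressed in degrees. The function names and example values are illustrative assumptions; the original analysis was done in R.

```python
import math


def ln_transform(value, offset=0.001):
    """Natural log after adding a small offset, so zeros stay defined."""
    return math.log(value + offset)


def arcsine_sqrt_degrees(percent):
    """Compute asin(sqrt(y% / 100)) * 180 / pi for y% between 0 and 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return math.degrees(math.asin(math.sqrt(percent / 100)))


# Hypothetical values (not taken from the dataset).
raw_values = [0.0, 1.5, 20.0]
percentages = [0.0, 25.0, 50.0, 100.0]

ln_values = [ln_transform(v) for v in raw_values]
angles = [arcsine_sqrt_degrees(p) for p in percentages]
```

The arcsine transformation maps 0 to 100% onto 0 to 90 degrees (for example, 25% becomes 30 degrees and 50% becomes 45 degrees), stretching the ends of the percentage scale where variances are typically compressed.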

Ethics Statements
There is no ethical issue for this study as no animals or patients were involved in data acquisition.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.