New approach for predicting nitrification and its fraction of N2O emissions in global terrestrial ecosystems

Nitrification is a major pathway of N2O production in aerobic soils. Measurements and model simulations of nitrification and associated N2O emission are challenging. Here we integrated data mining and machine learning to predict nitrification rate (Rnit) and the fraction of nitrification as N2O emissions (fN2ONit). Using our global database on Rnit and fN2ONit, we found that the machine-learning based stochastic gradient boosting (SGB) model outperformed three widely used process-based models in estimating Rnit and N2O emission from nitrification. We then applied the SGB technique for global prediction. The potential Rnit was driven by long-term mean annual temperature, soil C/N ratio and soil pH, whereas fN2ONit was driven by mean annual precipitation, soil clay content, soil pH and soil total N. The global fN2ONit varied more than 200-fold (0.006%–1.2%), which challenges the common practice of using a constant value in process-based models. This study provides insights into advancing process-based models for projecting N dynamics and greenhouse gas emissions using a machine learning approach.


Introduction
Nitrous oxide (N2O) is a potent greenhouse gas and can cause ozone depletion (Ravishankara et al 2009, Robertson and Vitousek 2009). Nitrification, a microbial process that converts ammonium (NH4+) into nitrite and subsequently nitrate, is a major pathway of N2O production, especially in aerobic soils. However, measuring nitrification rate (Rnit) and how much nitrified N is emitted as N2O is challenging. Many ecological process-based models, such as APSIM (Keating et al 2003, Holzworth et al 2014), DNDC (Li et al 1992, 2000, Zhang et al 2002), WNMM (Li et al 2007) and DAYCENT (Parton et al 1998, Del Grosso et al 2000), have embedded modules to predict Rnit and N2O from nitrification. These models adopt equations generated from limited empirical data, and nitrification is calculated as a function of soil water content, soil pH, soil temperature and soil NH4+ content. The models adopt a 'grey-box' approach to estimate N2O emission from nitrification by assuming that the fraction of nitrification as N2O emissions (fN2ONit) is fixed, and further constraining it by soil water content or soil temperature. The use of a fixed fraction is known to be problematic in predicting N2O emission because the proportion of N2O emission from nitrification varies with soil and environmental conditions (Farquharson 2016). In addition, process-based models may not perform well when extrapolating from site-specific to larger scales, mainly owing to their deficiency in capturing the key processes in response to driving factors, their need for detailed parameterization, limited data availability and their limited capacity in handling complex interacting factors (Giltrap et al 2015, Leng and Hall 2020, Saha et al 2021).
Because of the nonlinearity between input and output variables and the intricate interactions among these input variables (Butterbach-Bahl et al 2013), robust prediction of Rnit and associated N2O production is difficult, owing to uncertainty in existing process-based models and the limited measured data available to improve those models. To address this issue, we adopted machine learning [stochastic gradient boosting (SGB) modelling] with data technologies and high-performance computing to identify patterns from global datasets and predict Rnit and fN2ONit. Models based on machine learning, such as artificial neural networks (ANN), random forest (RF), support vector regression (SVR) and boosted regression trees (BRT), have recently been applied in predicting soil biogeochemical processes. Previous studies showed that machine learning models performed better than traditional models in predicting N2O and NO emission locally. Among the machine-learning based models, BRT (Friedman 2001, 2002, Elith et al 2008) relies on SGB, which is especially effective when quantifying nonlinear N transformations that are regulated by variables with complicated interactions, like Rnit and associated N2O emissions. Furthermore, SGB is useful in handling relatively small datasets (n < 100) by adjusting the weights of training datasets and tuning the learning rate based on the size of the datasets to reduce errors and overfitting (Shepherd et al 2003, Andonie 2010, Zhang and Ling 2018). As a variable selection process, SGB also excludes less important variables to achieve better generalization of the dependent variable (Xu et al 2014). From the functional perspective, process-based models need to define the function of each input variable. In contrast, response functions and variable interactions could be handled in SGB models, because they do not pre-set response functions (e.g. linear or exponential) but assume a certain interaction between variables (Leng and Hall 2020).
Thus, we hypothesized that these features of SGB models could improve the prediction of Rnit and associated N2O emissions at a large scale.
In this study, using a comprehensive database compiled from global literature, we first established SGB models to predict Rnit and fN2ONit and compared their performance with existing process-based models. The objectives of the study were: (a) to compare the performance of SGB and process-based models in predicting nitrification rate and associated N2O production; and (b) to extend spatially the simulation of Rnit and fN2ONit using the better performing model.

Database compilation
Extensive keyword searches were performed in databases (Web of Science (ISI), SCOPUS, CAB Abstracts (ISI), Academic Search Complete (EBSCO) and Google Scholar) and in the reference lists of cited papers. The keywords used in the search were nitrification; N2O/nitrous oxide emission/pathways; agriculture; cropping; pastures; forest and their combinations.
Studies were included if they met the following criteria: (a) 15N tracers or the acetylene blockage technique were used to determine N2O emission produced during nitrification; (b) gross nitrification rates (Rnit) were provided; (c) paired nitrification rates and associated N2O emission were reported; (d) details on experimental location, design and conditions were given to enable cross-checking for duplicate publication; (e) measurements were not instantaneous, because instantaneous measurements were excluded owing to the high heterogeneity of outcomes. We identified 186 observations from 25 papers published prior to 2018 that satisfied these criteria. These observations were used to train the SGB models. All studies were conducted under laboratory incubation conditions, probably owing to the technical limitations of simultaneous measurements of Rnit and fN2ONit under field conditions. Data were log transformed to reduce the impact of outliers and improve normality for further analysis.

Comparing the performance of process-based models and SGB models
The algorithms in APSIM (Holzworth et al 2014), Crop-DNDC (Zhang et al 2002) and WNMM (Li et al 2007) simulate Rnit using the same input variables, viz. soil NH4+ content, soil pH, soil water content and soil temperature. The responses of nitrification to soil pH, soil water content and soil temperature vary between these three models (figure 1). Crop-DNDC covers a wide range of soil water conditions with an increased response up to 80% water-filled pore space (WFPS). In the APSIM and WNMM models, the thresholds of soil water content for nitrification are higher than that in Crop-DNDC. Nitrification increases with soil temperature in the WNMM and APSIM models but decreases in Crop-DNDC when soil temperature is over 35 °C. These models estimate N2O production from nitrification by multiplying the simulated nitrification rate by a fixed fN2ONit, and/or combining it with functions of WFPS and soil temperature (figure 1). Specifically, APSIM adopts a default fN2ONit of 0.002. WNMM considers soil temperature and soil water content for the calculation but sets a threshold fN2ONit of 0.002. Crop-DNDC includes soil WFPS and temperature and adopts a constant fN2ONit of 0.0006. The compiled database (section 2.1) was used to calculate the nitrification rate and N2O from nitrification based on the equations (see supplementary information 1 (available online at stacks.iop.org/ERL/16/034053/mmedia)) of WNMM, Crop-DNDC and APSIM.
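The fixed-fraction step described above can be sketched as follows. This is a minimal illustration rather than code from any of the three models: the function name is hypothetical, the fractions are the values quoted in the text (for WNMM the quoted value is a threshold, treated here as a constant for simplicity), and the WFPS and temperature modifiers are deliberately omitted.

```python
# Fractions of nitrified N emitted as N2O, as quoted in the text.
# Assumption: the WNMM threshold value is used as if it were a constant.
QUOTED_FRACTION = {"APSIM": 0.002, "WNMM": 0.002, "Crop-DNDC": 0.0006}


def n2o_from_nitrification(r_nit, model):
    """Return N2O-N produced by nitrification, in the same units as r_nit.

    Hypothetical helper: it multiplies a nitrification rate by the fixed
    fraction only; soil moisture and temperature modifiers are left out.
    """
    if r_nit < 0:
        raise ValueError("nitrification rate must be non-negative")
    return r_nit * QUOTED_FRACTION[model]


# Illustrative call with an arbitrary nitrification rate of 1.0 kg N ha-1 d-1.
for name in QUOTED_FRACTION:
    print(name, n2o_from_nitrification(1.0, name))
```

Because the fraction is a constant, the simulated N2O emission in this simplified form scales linearly with the nitrification rate regardless of soil or climate, which is the behaviour the SGB approach is meant to relax.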
An SGB model is a tree-based model; each tree is built in a sequential error-correcting process to converge to an accurate model. The SGB model was proposed and modified based on the gradient boosting decision tree (Breiman 1996, Friedman 2001, 2002). The SGB model includes random subsampling, which draws a subsample of the training data randomly at each iteration instead of boosting all the sample data (see supplementary information 3). This modification reduces the computational complexity and improves the model learning speed. The SGB model can also handle both numerical and categorical variables as features (Friedman 2001, 2002, Wang et al 2017).
To compare with the three process-based models, we adopted the same input variables, i.e. soil NH4+ content, soil pH, soil water content and soil temperature, to build an SGB model, referred to as SGB1. Four parameters can be configured in an SGB model: the learn rate (LR), subsample fraction (SF), number of trees (NT) and minimum terminal node (MINCHILD). The learn rate controls the contribution of each tree to the final model. The subsample fraction is involved in the learning process to prevent the model from overfitting. The number of trees is based on the learn rate. The minimum terminal node size is regulated by the size of the dataset. To perform the regularization, we tested different combinations of LR (0.001, 0.005, 0.01), SF (0.3, 0.6, 0.8, 0.9), NT (1000, 2000, 3000) and MINCHILD (3, 6, 9), and found that the optimal configurations for SGB1 were 0.01, 0.8, 3000 and 3 for LR, SF, NT and MINCHILD, respectively (Friedman 2001, Yang et al 2016, Wang et al 2018) (figure S1). The accuracy of the model was evaluated by 10-fold cross-validation (CV), where the entire dataset was first used for learning purposes, and subsequently partitioned into ten bins. In each of the ten rounds, nine bins were used as the training set and the remaining bin as the test set. The test results from each fold were averaged to estimate the whole model performance. CV is particularly useful for small datasets when one cannot afford to reserve some data for testing (Kohavi 1995). The regression coefficient of determination (R2), root mean square error (RMSE) and the Nash-Sutcliffe model efficiency (NSE) (Nash and Sutcliffe 1970) (see supplementary information 2) were used to measure the percentage of variation explained by the model and the model accuracy for SGB1 and the three process-based models.
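The tuning grid, the 10-fold partition and the evaluation metrics described above can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, RMSE and NSE follow their standard definitions, and R2 is taken here as the squared Pearson correlation, which may differ from the definition in the supplementary information.

```python
import math
import random
from itertools import product

# Parameter values listed in the text: LR x SF x NT x MINCHILD.
GRID = list(product((0.001, 0.005, 0.01), (0.3, 0.6, 0.8, 0.9),
                    (1000, 2000, 3000), (3, 6, 9)))  # 108 combinations


def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def nse(obs, pred):
    """Nash-Sutcliffe efficiency: 1 - SSE / spread of obs about their mean."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - sse / sum((o - mean_obs) ** 2 for o in obs)


def r_squared(obs, pred):
    """Squared Pearson correlation (an assumed definition of R2)."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    var_o = sum((o - mo) ** 2 for o in obs)
    var_p = sum((p - mp) ** 2 for p in pred)
    return cov * cov / (var_o * var_p)


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Average RMSE and NSE over k held-out folds."""
    scores = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        obs = [y[i] for i in test_idx]
        pred = [predict(model, X[i]) for i in test_idx]
        scores.append({"rmse": rmse(obs, pred), "nse": nse(obs, pred)})
    return {m: sum(s[m] for s in scores) / k for m in scores[0]}


# Toy check (invented data) with a baseline that predicts the training mean.
X_toy = [[float(i)] for i in range(30)]
y_toy = [float(i) for i in range(30)]
baseline = cross_validate(X_toy, y_toy,
                          fit=lambda X, y: sum(y) / len(y),
                          predict=lambda model, row: model, k=5)
print(len(GRID), baseline)
```

The two goodness-of-fit measures are not interchangeable: predictions that are perfectly correlated with the observations but biased give R2 = 1 under this definition while NSE can be strongly negative, which is one reason to report both.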

Prediction of Rnit and fN2ONit under different soil properties and climate conditions
We attempted to extend the prediction of Rnit and fN2ONit to a larger scale by applying the SGB technique to our global database. It is unrealistic to perform dynamic prediction of Rnit and fN2ONit owing to data limitation. Previous studies showed that soil properties (clay content, pH, etc) and climatic conditions (mean annual temperature, MAT, and mean annual precipitation, MAP) can affect microbial community biomass and composition in the soil (Avrahami et al 2003, Hu et al 2016, Liu et al 2017), which are the potential key drivers of the nitrification process and associated N2O emissions. We therefore predicted Rnit and fN2ONit using long-term edaphic and environmental conditions as input variables.
We followed a series of stepwise procedures (Thompson 1978, Moisen et al 2006) to eliminate redundant input variables and optimize the number of variables for this SGB model (referred to as SGB2). This improves the applicability of SGB2 without compromising much prediction capacity. The stepwise variable selection method is computationally efficient and has been widely used in machine learning variable selection (Guyon et al 2002, Rakotomamonjy 2003). Since SGB2 was built on incubation studies conducted under in vitro conditions, mostly with optimal soil temperature, soil moisture and N availability, the prediction reflects Rnit and fN2ONit under such conditions. We further included a set of constraints to regulate the prediction results of SGB2. First, we set soil total N and soil organic carbon as trigger conditions, which are the prerequisites for nitrification. Second, desert areas with low precipitation show limited potential for nitrification (Noy-Meir 1974, Skujiņš 1981). In this regard, typical desert areas (mean annual precipitation (MAP) <250 mm) (Marshak 2010) were excluded from the prediction. To adjust for the deviation between incubation temperature and long-term temperature, we incorporated a function of mean annual temperature (MAT) against nitrification rate to reflect the in situ situations globally. The complete model is described as follows:
Among these equations, M_SGB is the original SGB2 prediction, γ is the constraint coefficient of TN, SOC, MAP (P) and MAT (T), and F = {TN, SOC, P} is the set of total N, soil organic carbon and MAP. δ_F is the constraint coefficient of F, δ_α is the constraint coefficient involved in δ_F, where δ_α ∈ {δ_SOC, δ_TN, δ_P}, and f(T) is the function that determines the impact of MAT on Rnit. Apart from R2, RMSE and NSE, the mean absolute error (MAE) was additionally used to measure the accuracy of prediction when selecting the optimal SGB model for mapping the global prediction of Rnit and fN2ONit.
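The role of the constraints can be pictured with the sketch below. The model equations themselves are not reproduced in this text, so the sketch does not attempt to restate them: the binary form of the trigger coefficients, their multiplicative combination, the zero thresholds for TN and SOC, the function name and the placeholder temperature function `f_t` are all assumptions made for illustration. Only the 250 mm MAP cut-off for desert areas is taken from the text.

```python
DESERT_MAP_MM = 250.0  # desert cut-off quoted in the text (Marshak 2010)


def constrained_prediction(m_sgb, tn, soc, map_mm, mat_c, f_t=lambda t: 1.0):
    """Apply illustrative constraints to a raw SGB2 prediction `m_sgb`.

    Hypothetical reading of the constraint step:
      * cells with MAP below the desert cut-off are excluded (None);
      * TN and SOC act as on/off triggers (assumed binary, threshold 0);
      * `f_t` is a placeholder for the MAT response function, whose
        actual form is not given here (default: no adjustment).
    """
    if map_mm < DESERT_MAP_MM:
        return None  # excluded from prediction
    delta_tn = 1.0 if tn > 0 else 0.0
    delta_soc = 1.0 if soc > 0 else 0.0
    return m_sgb * delta_tn * delta_soc * f_t(mat_c)


# Illustrative calls with invented cell values.
print(constrained_prediction(2.0, tn=0.1, soc=1.0, map_mm=100.0, mat_c=15.0))
print(constrained_prediction(2.0, tn=0.1, soc=1.0, map_mm=800.0, mat_c=15.0))
```

Returning `None` for excluded cells, rather than zero, keeps "not predicted" distinct from "predicted to be zero" when the results are later mapped.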
To project Rnit and fN2ONit at a global scale using SGB, potential driving factors of soil and climate data were obtained from different sources and integrated into a spatial GIS database. The soil property database with a spatial resolution of 1 km2 was obtained from the World Inventory of Soil Emission Potentials (WISE) database developed by the International Soil Reference and Information Centre (ISRIC) (www.isric.org) (Batjes 2015).

Overview of the literature-based dataset
Study sites spanned from 122°W to 152°E and 43°S to 65°N. For most of the incubation experiments, the soil temperature was controlled at 20 °C–25 °C, with a range of 5 °C–45 °C, and the average soil moisture was around 45% WFPS. The average Rnit in the topsoil (0–20 cm) was 1.4 kg N ha−1 d−1 and varied widely. The average fN2ONit was 0.46% (0.004%–9.19%) (table 1 and figure S3). figure S4). As for the N2O emission during nitrification calculated using a fixed ratio and functions of soil water content and soil temperature, none of the three process-based models performed well (table 2). Crop-DNDC overestimated N2O emission from nitrification by 1.08 kg N ha−1 d−1. APSIM and WNMM underestimated N2O emission when it was <0.3 g N ha−1 d−1 but overestimated it when the N2O emission from nitrification was higher (figure S5).

Stepwise variable selection of optimal models
Nitrification rate showed a strong relationship with MAT and soil C/N ratio in model A1 (table 3). The highest R2 value of 0.8 was achieved when at least six variables were included (models A5-A8). When only MAT, soil C/N ratio and soil pH were included (model A2), both R2 and NSE reached 0.76 (table 3). We considered this model optimal for upscaled prediction, as a compromise between the R2 value and the number of variables included.
When predicting fN2ONit, the performance of model B3 increased markedly with four input variables (MAP, soil clay content, soil pH and soil TN), yielding higher R2 (0.55) and NSE (0.55) values, and lower MAE (0.26) and RMSE (0.40) values, than the other models (table 4). Therefore, model B3 (table 2, SGB2) was considered optimal for predicting fN2ONit.
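The trade-off between goodness of fit and the number of variables can be illustrated with a generic greedy elimination routine. This is a sketch under assumptions rather than the procedure of Thompson (1978) or Moisen et al (2006): the direction of the search (backward), the tolerance rule, the function names and the toy scores are all hypothetical, and the variable names `x1` to `x5` are placeholders rather than the predictors in tables 3 and 4.

```python
def backward_selection(variables, score, tolerance=0.05):
    """Greedy backward elimination.

    `score(subset)` returns a goodness-of-fit value where higher is better
    (for instance a cross-validated R2). At each step the variable whose
    removal costs the least is dropped, as long as the score stays within
    `tolerance` of the full-model score.
    """
    current = list(variables)
    full_score = score(current)
    history = [(tuple(current), full_score)]
    while len(current) > 1:
        candidates = [[v for v in current if v != drop] for drop in current]
        best = max(candidates, key=score)
        best_score = score(best)
        if full_score - best_score > tolerance:
            break  # dropping any further variable costs too much
        current = best
        history.append((tuple(current), best_score))
    return current, history


# Toy scoring function with invented, additive contributions per variable.
TOY_GAIN = {"x1": 0.40, "x2": 0.25, "x3": 0.10, "x4": 0.02, "x5": 0.01}


def toy_score(subset):
    return sum(TOY_GAIN[v] for v in subset)


selected, steps = backward_selection(list(TOY_GAIN), toy_score)
print(selected)
for subset, value in steps:
    print(len(subset), round(value, 2))
```

In a real application `score` would refit the SGB model on each candidate subset and return a cross-validated metric, so the cost grows with the number of candidate variables.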

Global mapping of the response of Rnit and fN2ONit to soil properties and climate conditions
We extended the prediction of Rnit spatially using SGB2 (model A2 in table 3) with soil pH, MAT and soil C/N ratio as input variables (figure 2). The highest Rnit was observed in subtropical and equatorial areas.
Higher nitrification rates generally occur in soils with a lower C/N ratio and a neutral pH (figure S6). fN2ONit covered a broad range from 0.006% to 1.24% (figure 3) with an average of 0.13%. Lower fN2ONit was noted in tropical and higher latitude regions in the northern hemisphere, with higher fN2ONit across the subtropical and around equatorial areas. A large proportion of N2O from nitrification was found in soils with a lower clay content (figure S7). In soils with a neutral pH, more N2O was released from nitrification than in either acidic or alkaline soils (figure S7). fN2ONit followed the pattern of MAP and soil total N (figure S7).

Better performance of SGB models
SGB1 predicted nitrification and associated N2O emissions better than the three widely used process-based models using the same input variables (soil NH4+ content, soil pH, soil water content and soil temperature). There are several possible reasons. First, limited site-specific experimental data were used for deriving process-based model parameters and equations, where less responsive or redundant variables might have been included. In contrast, the relationships between variables in SGB1 were more representative as their internal interactions were developed from a relatively comprehensive global database. Second, process-based models may have mis-specified the variable responses for Rnit and fN2ONit in their equations. For example, the relatively narrow range of the response of nitrification to soil water limited the prediction capacity of the APSIM and WNMM models. If soil water content was beyond the lower and upper limits set in the equations in these models, nitrification was assumed to stop. In particular, we found that APSIM and WNMM estimated that no nitrification occurred below 30% WFPS. However, the measured data from our global database indicated that nitrification did occur under 30% WFPS (Maag and Vinther 1996).
Third, the equations currently used in process-based models cannot account for the interactions of input variables (Butterbach-Bahl et al 2013) without comprehensive calibration using more site-specific detailed data. In contrast, SGB1 examines all possible nonlinear relationships and interactions among input variables and between input and output variables (Ryo and Rillig 2017).

The use of SGB models in spatial prediction globally
The spatial patterns of predicted Rnit and fN2ONit exhibited a large variation owing to the diverse soil and environmental conditions and their interactions (figures 2 and 3). This result further shows that using a constant fN2ONit to estimate the N2O emission from nitrification in existing process-based models is unsuitable (Chen et al 2008, Farquharson 2016).
The relationship between soil properties and Rnit has been well studied (Dancer et al 1973, Nyborg et al 1988, Bengtsson et al 2003, Zebarth et al 2015). A recent global-scale study reported that soil C/N ratio, soil pH and MAT are the key drivers of Rnit (Li et al 2020). A high soil C/N ratio may stimulate N immobilization, decrease the availability of NH4+ for nitrification, and subsequently lower nitrification rate (Bengtsson et al 2003). Larger microbial populations and activities (including nitrifiers) are generally found in neutral rather than in high or low pH soils (Tabatabai et al 1992). MAT can regulate soil nitrification rate directly and indirectly by changing microbial community structure and abundance in the long term (Avrahami et al 2003, Hu et al 2016). Besides, MAT
The spatial distribution of fN2ONit was consistent with land use distribution (figure S8), which reflects the edaphic and climatic variations across terrestrial systems. The range of fN2ONit estimated by B3 (table 4) in humid tropical areas was comparable to that of the DNDC and DLEM models examined by Inatomi et al (2019), but much lower than the VISIT (Inatomi et al 2010) model estimates. However, fN2ONit in semiarid regions was predicted to be lower by DNDC, DLEM and VISIT than by the B3 model. The differences in model predictions can be explained by three reasons. First, Inatomi et al (2019) adopted common protocols and initial and boundary conditions when upscaling the simulation for all three process-based models. Second, for the VISIT model, the parameterization of fN2ONit was dependent only on soil pH, which is insufficient to reflect the complicated relationship between fN2ONit and soil and environmental factors. Third, soil moisture and temperature in semiarid regions were below the threshold set for nitrification to occur in process-based models, which therefore underestimated nitrification and associated N2O emissions, whereas our A2 and B3 models cover a wider range of input variables.
In our study, subtropical and equatorial areas had higher Rnit and fN2ONit than other areas. Higher MAT, neutral soil pH and lower soil C/N ratio could increase the microbial populations in soils and nitrifier efficiency, thus increasing nitrification rate (Gilmour 1984, Tabatabai et al 1992). The highest Rnit and fN2ONit were observed in areas with intense human activities, i.e. cropland and grasslands (Ambus 1998). These could be attributed to long-term anthropogenic N input, which increases soil N availability for nitrification and associated N2O production (Tabatabai et al 1992). On the other hand, optimal MAP provided an ideal oxygen level and moderate moisture condition for nitrifiers (Yu and Zhuang 2019). Areas with extensive forest and savannas had the lowest Rnit and fN2ONit worldwide. This finding indicates that in temperate and tropical forest areas, nitrification is not a major source of N2O emissions, which is in agreement with previous studies (Stehfest and Bouwman 2006, Werner et al 2007, Cheng et al 2012, Zhuang et al 2012).
Responses and changes of Rnit and fN2ONit are closely related to long-term environmental and edaphic factors and land use. The fraction of nitrification as N2O emissions is clearly not a constant; instead it should be adjusted according to edaphic and environmental conditions when used in process-based models or global climate models in projecting N2O emissions. We are aware of the limited datasets used to develop the SGB2 models for global prediction and did not intend to accurately quantify Rnit and fN2ONit at any timepoint. Our objective was to demonstrate that SGB2 models can be used to map the global spatial patterns of potential Rnit and fN2ONit based on a few long-term soil and climate variables that are easily accessible from world databases. These potential Rnit and fN2ONit values, obtained under optimal soil temperature, soil moisture and sufficient N availability, could be used as a benchmark for the calibration of process-based models.
Our findings have important implications for the prediction of N losses in process-based models. The SGB model is able to derive new parameters that can be incorporated into process-based models for different N loss pathways under various edaphic and environmental conditions. Moreover, SGB could also be directly embedded in a process-based model or replace the existing unsatisfactory modules of a process-based model to capture the complex, dynamic and nonlinear processes more efficiently and effectively. By integrating with other modules of process-based models, SGB models can be used to develop decision support tools for sustainable N management.

Limitations
This study has several limitations. First, we conducted a comprehensive global literature search, but found a low number of observations. This indicates limited geographical coverage and the difficulties of simultaneously measuring nitrification and associated N2O emissions both in situ and in the laboratory. Second, while SGB models are suitable for handling small databases, the inclusion of more data in training machine learning models would likely improve their performance. Third, although we included constraint coefficients in SGB models to address the issues associated with artificial experimental conditions, this potential data bias can be reduced further only when more in situ data become available.

Conclusion
This study is the first attempt to use machine learning to develop SGB models to predict Rnit and fN2ONit in terrestrial ecosystems. Compared to three widely used process-based models, the SGB models were more accurate in predicting nitrification rate and associated N2O emissions. The SGB models can be extended to a global level with only a few input variables without compromising accuracy. In particular, nitrification rate was predicted by soil pH, MAT and soil C/N ratio whereas the fraction of nitrification as N2O emissions was predicted using soil pH, MAP, soil total N and clay content. Large spatial variation in nitrification and its fraction as N2O emissions was mainly driven by long-term environmental and edaphic factors. The fraction of nitrification as N2O emissions is clearly not constant; instead it should be adjusted according to edaphic and environmental conditions when used in process-based models or global climate models in projecting N2O emissions.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.