Maize Yield and Nitrate Loss Prediction with Machine Learning Algorithms

Pre-season prediction of crop production outcomes such as grain yields and N losses can provide insights to stakeholders when making decisions. Simulation models can assist in scenario planning, but their use is limited because of data requirements and long run times. Thus, there is a need for more computationally expedient approaches to scale up predictions. We evaluated the potential of five machine learning (ML) algorithms as meta-models for a cropping systems simulator (APSIM) to inform future decision-support tool development. We asked: 1) How well do ML meta-models predict maize yield and N losses using pre-season information? 2) How many data are needed to train ML algorithms to achieve acceptable predictions?; 3) Which input data variables are most important for accurate prediction?; and 4) Do ensembles of ML meta-models improve prediction? The simulated dataset included more than 3 million genotype, environment and management scenarios. Random forests most accurately predicted maize yield and N loss at planting time, with a RRMSE of 14% and 55%, respectively. ML meta-models reasonably reproduced simulated maize yields but not N loss. They also differed in their sensitivities to the size of the training dataset. Across all ML models, yield prediction error decreased by 10-40% as the training dataset increased from 0.5 to 1.8 million data points, whereas N loss prediction error showed no consistent pattern. ML models also differed in their sensitivities to input variables. Averaged across all ML models, weather conditions, soil properties, management information and initial conditions were roughly equally important when predicting yields. Modest prediction improvements resulted from ML ensembles. These results can help accelerate progress in coupling simulation models and ML toward developing dynamic decision support tools for pre-season management.


Introduction
Technological approaches to forecast weather and management impacts to crop yields and environmental quality are becoming more prevalent. They provide stakeholders with crucial information to support decision-making regarding the profitability and sustainability of crop production (Basso and Liu, 2018;Ansarifar and Wang, 2019). Seasonal crop forecasting often integrates data from multiple sources, such as in-situ sensors, remote-sensed imagery and crop simulation models (González Sánchez et al., 2014;Basso and Liu, 2018;Moeinizade et al., 2019;Togliatti et al., 2017) to, for example, detect potential water and nutrient limitations during the growing season and make supplemental irrigation or fertilizer recommendations. Yet, most of the management decisions (e.g. cultivar selection, fertilizer rates) are often made months before crops are even planted. In such cases, crop simulation models are better suited to assist with scenario planning, since they predict plant-soil processes by using soil characteristics, cultivar traits, management and weather information as input in mathematical equations that synthesize knowledge of crop ecophysiology and soil biophysics (Hoogenboom et al., 2004).
Although simulation modeling can achieve reasonable prediction accuracy, its application in actual farms is limited because of the substantial amount of expertise and data required for rigorous calibrations (Drummond et al., 2003;Puntel et al., 2019). Even with a well-calibrated simulation model, deployment for exploring potential management options under a range of possible weather conditions (i.e., scenario analysis) is often impractical due to long runtimes and data storage constraints. Critically, simulations have to be rerun to incorporate new information as it becomes available, or to extrapolate beyond the set of conditions which were originally simulated.
Meta-models are statistical models trained on more computationally expensive models (e.g., a crop simulator). By replacing a more detailed simulation model, a meta-model can provide faster execution, reduced storage needs, and added flexibility to extrapolate across temporal and spatial scales than a more detailed simulation model (Simpson et al., 2001). Meta-modeling has been widely implemented for extrapolating hydrological (Nolan et al., 2018;Fienen et al., 2015) and biogeochemical (Villa-Vialaneix et al., 2012;Britz and Leip, 2009) simulations across regional scales, as well as to streamline sensitivity analyses for parameterization of crop simulation models (Gladish et al., 2019;Stanfill et al., 2015;Pianosi et al., 2016). To our knowledge, meta-modeling approaches for decision-support and forecasting applications in crop production have not been previously evaluated.
The central question when developing a meta-model is how well the behavior of the simulation model is reproduced by the selected method. Reasonable predictions (e.g. < 20% error) made in real time can be more valuable for quick screening across large geographic regions or high-dimensional factorial spaces than running a more precise but slower simulator (Fienen et al, 2015). Answering this question requires both examining which predictive methods are best suited to emulate the crop model and determining the requirements of the size and type of the training data. Although the literature is rich on comparisons among the skill of various ML algorithms in predicting agricultural outcomes (e.g., González Sánchez et al., 2014;Morellos et al., 2016;Landau et al., 2000;Sheehy et al., 2006;Qin et al., 2018), these are largely based on empirical data. Sensitivity analysis studies have looked at the performance of several approaches to emulate simulated results (Gladish et al., 2019), but their focus is on approximating the distribution of continuous crop model parameters. Scenario analysis for crop forecasting often uses a combination of both categorical (e.g. crop cultivar, tillage and fertilizer application mode) and continuous (e.g. dates, input amounts, initial conditions) input variables, thus evaluation of ML metamodels with these types of datasets is needed.
In this article, we investigate the potential use of ML algorithms as meta-models for developing more computationally expedient and dynamic decision-support systems for crop production. Our goal is to develop a robust, fast and dynamic forecasting system that can provide pre-season (e.g., in April) predictions when information is most needed by farmers, targeting both production (maize yield) and environmental quality (nitrogen (N) loss) outcomes. Typically, forecasting systems based on crop models, remote-sensed imagery or surveys provide predictions months after planting (Basso and Liu, 2018;Togliatti et al., 2017). Given the uncertainty in weather, we examined the extent to which the maize yield and N loss target variables can be predicted with ML meta-models trained on pre-season weather information (October to April), initial conditions and management choices. We made use of a large scenario analysis dataset (n >3 million) generated by a well calibrated simulation model (APSIM) to: 1) Evaluate the performance of five different ML algorithms as meta-models; 2) Determine data requirements for achieving acceptable prediction accuracy; 3) Rank the importance of different data-types on yield and N loss prediction; and 4) Investigate whether ML ensembles offer better prediction than single algorithms.

Materials and Methods
We first calibrated the APSIM (Agricultural Production Systems SiMulator; Holzworth et al., 2014) cropping systems model using experimental data from seven locations in the US Midwest with observations of maize yield and N loss in drainage over 5-7 years and few management treatments. Second, we used APSIM to simulate maize yields and N loss responses to a factorial grid of management practices and initial conditions, resulting in more than 3 million simulated data points. Third, we used 88% of the simulated data to develop and train ML algorithms and 12% of the simulated data to evaluate ML algorithms prediction accuracies.
2.1. Experimental locations and data used to train and test APSIM Figure 1 illustrates the locations of the experimental data used to train and test APSIM. Experimental data for the KELLEY and NASHUA locations have been described in detail in previous modeling studies (Martinez-Feria et al. 2018, Dietzel et al. 2016, while data from the remaining locations (DPAC, HICKS.B, GLIMORE, SERF and STJOHNS) were extracted from the Sustainable Corn CAP Research Database (Abendroth et al, 2017). Soil information for each site was obtained from the SSURGO database (Soil Survey Staff n.d.). Soils in these sites are artificially drained using subsurface drain tubes. Daily weather  for all sites was retrieved from Daymet (Thornton et al 2018).

Development of the simulated dataset
We used the calibrated version of the model and performed a factorial simulation experiment to develop a simulated dataset for ML development and analysis. The factorial combinations (see Table 1) aimed to characterize the influence of initial conditions, crop management, soil management, and N fertilizer management. Within a factor we included three levels. For instance, for N rate we considered N rate at Maximum Return to Nitrogen (Sawyer et al., 2006) per location, 30% less and 30% more nitrogen (see Table 1). The crop, soil and N management factors were designed to represent levels of practice implementation. The combination of 2 crops, 26 total factor levels, 7 sites, and 34 years , resulted in more than 3 million simulated scenarios of yield and N loss data. Every model run corresponded to an instance of a full factorial design, so that all possible scenarios were simulated. The simulation process included a re-set on October 20 th (average maize harvest time) to decouple the effect of weather-year from the initial conditions. All simulations started on October 20 th and ended on October 19 th of the next year. Cumulative annual N loss refers to that period.
The developed database included the target variables (yield and N loss) and additional features (explanatory variables) such as data on soil, weather, and management (see supplementary table S2 for a list of variables). To make the ML meta-models more generalizable, the year and location features were replaced by soil characteristics such as texture, soil organic carbon and plant-available water holding capacity and weather data. Weather variables such as temperature and precipitation were integrated for the period October 20 th to April 10 th (fallow period). The fallow period was divided into five equal periods for each location and weather year to increase the weather-related features to a total of 35 for ML analysis. No growing season weather data was included in the database because the summer weather is an unknown factor at planting time.

Machine learning meta-models development and testing
Five machine learning meta-models were evaluated to predict simulated maize yield and N loss, including three types of linear regression regularizations (Ridge, LASSO, and Elastic Net), as well as two tree-based methods (random forests and Extreme Gradient Boosting (XGBoost)). Multiple linear regression was also employed as a baseline to compare performance among techniques. Description for each meta-model technique is provided in the supplementary materials.
Because preliminary analysis indicated that linear regression and regularization meta-models resulted in predictions with high variance (over-fitting) when including many explanatory weather features, we trained these meta-models using weather summaries for the whole fallow period instead of the fiveperiod summaries. About 88% of the data available were used for training ML meta-models. Fitting and testing procedures were performed using the R statistical software (version 3.5.0; R core team, 2018). We used the glmnet package (Friedman et al., 2010) to fit LASSO, Ridge and Elastic Net regression, ranger to fit random forests (Wright and Ziegler, 2017) and xgboost for Extreme Gradient Boosting (Chen and Guestrin, 2016).

Statistical indices for meta-model performance evaluation
Three measures were used to evaluate the ML meta-model performances. First, the root mean square error (RMSE) that is a measure of difference between predicted and observed values. Second, the Relative Root Mean Square Error (RRMSE) or Normalized RMSE that allows for a direct comparison between different meta-models and meta-model output variables that have different units. Third, the coefficient of determination (R 2 ) that is defined as the proportion of the variance in the response variable that is explained by independent variables. All the equations can be viewed in Archontoulis and Miguez (2014).

Machine learning meta-model validation
We divided the dataset to training (1983-2012) with 2.7 million data and test sets (2013-2016) with 0.4 million data on yield and N loss. An innovative time-wise 5-fold cross validation was conducted on the training set to tune the hyperparameters of random forests, XGBoost, ridge regression, LASSO regression, and elastic net (see supplementary Figure S3). The difference between this type of cross validation with the random k-fold cross validation is that the validation folds have been created manually, i.e. in the first fold, the data for years 1983-1988 shape the validation set, and we roll this validation set over to the second six years for the second fold, and so forth for other folds to cover all the range of the training set.
After tuning hyperparameters (ML specific parameter that are calibrated during the learning process) of each of the ML algorithms, a final meta-model for each of them was trained on the whole training set. The hyperparameters that were tuned were: "mtry" for random forests, "nrounds", "eta", and "gamma" for XGBoost, "lambda" for Ridge and Lasso regression, and finally "lambda" and "alpha" for Elastic Net. Note that multiple linear regression does not use hyperparameters. The hold-out test set (2013-2016; 0.4 million data points) was used to compute the prediction error for the test set which is an estimate of the true error.

Data size requirements and feature importance ranking
To evaluate how sensitive the developed ML meta-models are to the size of the training data, we partitioned the original training set (1983-2012) into subsets containing 5, 10, 15, 20 and 25 continuous weather-years, with 2012 as the last year included in each subset. Each 5 weather-years represents roughly 0.4 million data points. The subsets were used to train the different ML meta-models, which in turn were tested using the hold-out test set (2013-2016).
The importance of data input features (e.g., weather, initial soil N) were quantified for each ML metamodel separately. For random forests, variable importance was determined by ranking the input features based on the variance of the responses (Wright and Ziegler, 2017), while in XGBoost we use the fractional contribution of each feature to the model based on the total gain of this feature's splits (Chen and Guestrin, 2016). For multiple linear regressions, the features importance is ranked by computing the absolute value of the t-statistic for each variable. For regularization meta-models the absolute value of the coefficients corresponding to each input feature is used to rank them (Friedman et al., 2010).

How well do ML meta-models predict yield and N loss?
Across the entire dataset, maize yield varied from 0 to 15 Mg ha -1 , with an average of 9,845 kg ha -1 . Cumulative N loss ranged from 0 to 220 kg N ha -1 , with an average value of 14.8 kg N ha -1 . Across all locations, random forests outperformed other ML meta-models in terms of maize yield and N loss prediction in the testing dataset (Table 2). Random forests predicted N loss with a higher R 2 than N loss (0.77 versus 0.45, respectively), although RRMSE was lower for yield than N loss (13.7% versus 54.8%, respectively). This is not surprising given that N loss has much higher variation than yields. XGBoost was the second best-performing ML meta-model (Table 2). Analyzing ML performance within each location revealed interesting patterns (Figure 2b). The ML predicted yields and N losses better in some locations than others. In terms of yield prediction, RRMSE ranged from 7 to 22% among locations, while N loss prediction RRMSE ranged from 30 to > 100% ( Figure  2a). The highest RRMSE of yield prediction was obtained at one location (GILMORE) and it was consistent for all ML meta-models. In contrast, the highest RRMSE of N loss prediction was at two locations (SERF and HICKS.B), but not consistently among all ML meta-models.

How many data do we need to train ML meta-models?
Different ML meta-models showed distinct sensitivities to the size of the training data ( Figure 3). In terms of yield prediction, XGBoost was the most sensitive ML meta-model and random forests was the least sensitive ML meta-model to the size of the training dataset. For yield prediction, the maximum RRMSE decreased (thus lower error) when size of the training dataset increased from 0.4 to about 1.6 million data (Figure 3). Beyond that point, yield prediction did not benefit much by increasing the size of the training dataset.
In contrast to yield prediction, we did not observe a consistent relationship between ML meta-models RRMSE and size of the data for the N loss prediction (Figure 3). Four ML meta-models benefited by increasing the size of the training datasets (regression models) and two were negatively impacted (random forests and XGBoost; Figure 3).

Figure 3: The effect of changing study duration on maize yield (a) and nitrate loss (b) predictions (some lines are overlapping)
3.3. Which input data variables are most important for ML prediction?
Different ML meta-models showed different sensitivities to input data variables ( Figure 4). Considering maize yield predictions, random forests and XGBoost identified weather features, (i.e., temperature and rain) as the most important factors, followed by soil properties (i.e., soil organic carbon and plant available water), and finally by management (i.e., planting date). The other four ML meta-models had a 5-fold lower sensitivity to weather features than random forests and XGBoost in terms of yield prediction. LASSO regression and Elastic Net acted similarly in identifying the importance of features.
Averaging the sensitivities of all ML meta-models indicated that weather, soil, management and initial conditions input data were all roughly equally important (Figure 4). Random forests, the best N loss predictor (Table 2), indicated weather as the most significant input feature explaining 44% of the total variance and initial conditions as the second most important input feature explaining 37% of the total variance ( Figure 4). XGBoost, which is the second best N loss predictor identified initial conditions, namely initial soil N and water table, more important than others. Ridge regression found initial conditions followed by management decisions and soil properties the most important input features. LASSO, Elastic Net and linear regression similarly detected management and initial conditions as the first two most important features (Figure 4b).

Do ensembles of ML meta-models improve prediction?
Combining different meta-models with different weights resulted in ensembles that modestly improved maize yield and nitrate loss prediction than the best single performing meta-model ( Table 2). The ML yield ensemble meta-model considered only the random forests and XGBoost ML meta-models with weights of 60.2% and 39.8%, respectively. In the benchmark situation, in which all six ML meta-models received equal weights, prediction became worse for both yield and N loss (Table 2) compared to the single best performing meta-model.

Discussion
In this study, we quantified the role of ML algorithms as meta-models for predicting outcomes of high importance to stakeholders such as yield and N loss as early as planting time. We compared ML metamodels for their accuracy (Table 1; Figure 2), quantified data size requirements ( Figure 3) and importance of data input variables ( Figure 4) to inform experimentalists on future data collection protocols and guide modelers on which ML meta-models to choose for predictions. In view of increasing data availability in agriculture and the maturity of analytics from descriptive to prescriptive (National Academy of Sciences, 2018), more ML applications are currently taking place. For example, Lawes et al. (2019) used ML and APSIM modeling to predict optimum N rates for wheat, Puntel et al. (2019) and Qin et al. (2018) used ML and experimental data to predict optimum N rates to maize, while others are exploring coupling ML and simulation models to develop faster and more flexible tools for impact regional assessments (Fienen et al., 2015) and simulation model parameterization (Gladish et al., 2019).
The ML algorithms predicted end-of season yield with a RRMSE of 13-14%, which is comparable to the fit of the simulation model to the field data ( Figure S3) or even better considering that only information up to planting time was considered in this study (Martinez-Feria et al., 2018;Puntel et al., 2019). On the other hand, the RRMSE of cumulative annual N loss (harvest to harvest) was about 4 times higher than for yield and much greater than the error of the simulation model itself, indicating that annual N loss cannot be reliably predicted with information up to planting time. A different approach and most likely more in-season information is needed to reduce the RRMSE in N loss prediction. A common practice to guide management decisions is to use the yield average of previous years. For example, this information is used to estimate yield goals for N recommendations (Morris et al., 2018). By using this simple-mean approach, the RMSE associated with yield prediction for the unknown years in our dataset was 1,899 kg ha -1 . Use of ML decreased this error by 23 to 29% (Table 2), showing its potential added value. Random forests was the best performing ML meta-model for yield and N loss prediction among the ML meta-models evaluated (Table 1). Random forests is often regarded as a very flexible algorithm, often outperforming other ML techniques across a number of classification and regression problems (Vincenzi et al., 2011;Fukuda, et al., 2013;Jeong et al., 2016;Mutanga et al., 2012). Improvements in prediction with random forests were much greater in N loss than for yield (Table 2), when compared to a multiple linear regression. This could be an indication that N loss has a much greater degree of non-linearity than yields. Despite the modest difference in yield RRMSE, random forests did require less data for training than other ML meta-models ( Figure 3). Using more than 10 years of weather data did not change much the accuracy yield prediction. This data requirement is about half compared to all other meta-models.
In addition, different ML algorithms appear to have different sensitivities to the predictor features ( Figure 4). This may suggest that depending on data input availability for the studied environment, different ML meta-models can be selected. For example, if initial conditions are not of interest, then random forests appears to be the best choice for yield prediction. An interesting result from this analysis is that initial condition becomes very important when ML is to predict N loss and less important when ML is to predict yield. It is important to note that deriving variable importance metrics from ML procedures can produce biased results, depending on the type and the degree of collinearity among the predictors. The all of above needs to be considered designing protocols for data collection, as well as for future meta-model development.
In this study, we faced two major challenges while developing and training the ML meta-models. First, the high variance of the predictions and overfitting due to the dissimilar behavior of different locations in different years meant that the test dataset could not be explained well by the trained ML metamodels (this is called a mismatch problem). To overcome this challenge, we took the following actions: 1) weather and soil features were added to the dataset to make the meta-models more generalizable, 2) continuous versions of some of the input features were added to the data set, 3) the fallow season weather information was divided to five equal periods to provide more information for more complex meta-models, and 4) four years (2012-2015; 0.4 million of data) with different weather patterns were chosen as the holdout test set. The second major difficulty was fine-tuning the hyperparameters of the ML meta-models, given that the behavior of target variables in the different years and different locations were quite dissimilar. We solved this issue with a cross-validation approach in which we divided the training set to validation and train sets by using a manually adjusted cross validation with respect to years (supplementary Figure S3). With this approach ML meta-models captured the variations of the predictions of all years when those years had been chosen in the validation set.
Interestingly, while both random forests and XGBoost outperformed other meta-models across all locations (Table 2), they did not perform well in some locations ( Figure 2). We attributed this behavior to the existence of heavy outliers (extreme values) in these locations, given the higher propensity of random forests and XGBoost to learn from outliers (Reunanen, 2003). This was one of the reasons why we also examined the possibility of using meta-models. In crop and other modeling applications, model ensembles have been shown to be better predictors of yields than any single simulation model (Kimbal et al., 2019;Asseng et al., 2013;Wallach et al., 2018;Shahhosseini et al., 2019aShahhosseini et al., , 2019b. In this study, however, we did not find the same pattern for ML meta-models. In fact, equally weighted ensemble meta-models provided worst results than single models (Table 2). When different weighting factors were optimized for different ML meta-model combinations, the fit improved but only slightly compared to the best performing ML meta-model, which in our case was random forests (RRMSE of 13.5% vs. 13.7%). While ensembles can bring merits of multiple algorithms together, they need the predictions made by single ML meta-models as inputs and also creating ensembles requires more expertise and work. For future work, we recommend evaluating more diverse machine learning algorithms that explain different parts of the variation in the response variable. In doing this, it is expected that the constructed ensembles can show an even better performance. Additionally, optimizing weights of ensembles based on a validation set instead of the test set is another suggestion for future work.

Conclusion
We compared five ML algorithms as meta-models to predict simulated maize yield and N loss, trained with available information at planting time. Based on our results, we concluded that simulated yields can be more reasonably emulated than annual cumulative N loss. While the greater performance and lower data requirements of random forests make this the preferred technique among the meta-models evaluated in this analysis, as shown here datasets with heavy outliers might result in high error. Therefore, we recommend that various ML approaches be evaluated for specific datasets when developing meta-models. We did not find evidence that optimized ML ensembles can substantially outperform the single ML meta-model. This study demonstrates the potential role of meta-models towards developing dynamic recommendation systems for pre-season management decisions.