1 Introduction

Plant phenology is a major aspect of plant response to the environment and a key determinant of plant response to climate change. This includes the phenology of natural vegetation, which has been shown to be affected by warming (Piao et al. 2019; Menzel et al. 2020; Stuble et al. 2021), as well as the phenology of cultivated crops (Rezaei et al. 2018). For the latter, phenology must be taken into account in crop management (Sisheber et al. 2022), in the choice of cultivar or of cultivar characteristics adapted to a particular region (Zhang et al. 2022), and in evaluating the impact of climate change on crop production (Rezaei et al. 2018). It is thus important to be able to predict phenology as a function of environment, and in particular as a function of climate.

A number of mechanistic crop models have been developed that include simulation of phenology. Such models are regularly used to evaluate management options (McNunn et al. 2019) or the effect of climate change on crops, including wheat (Asseng et al. 2013), rice (Li et al. 2015), maize (Bassu et al. 2014), and soybean (Fodor et al. 2017). They are particularly important for taking into account an increasing diversity of combinations of weather events (Webber et al. 2020).

Mechanistic models in general, and models used to simulate crop phenology in particular, are based on our understanding of the processes and their inter-linkages that drive the evolution of the system. This conceptual understanding usually builds on detailed experiments that study specific aspects of the system (e.g., Brisson et al. 2003 for the crop model STICS). The set of model equations, which is the mathematical expression of our understanding, is referred to as “model structure” (Tao et al. 2018).

In addition to model structure, simulation requires values for all the model parameters. In essentially all uses of crop models, the model is first calibrated using observed data that is related to the target population for which predictions are required, for example, observations for the specific variety of interest and/or for the particular set of growing environments of interest. Calibration normally only concerns a fairly small subset of the parameters in a crop model, but is essentially always necessary because mechanistic models are only approximations, without universally valid parameter values (Fath and Jorgensen 2011; Wallach 2011).

There are therefore two main tracks to improving crop phenology model predictions. The first is improvement of model structure through better understanding of the underlying processes; the second is improvement of model calibration, which is the topic here. The specific context of interest is the use of a phenology model to predict crop phenology for new environments, given a sample of data from similar environments. The calibration problem is how best to use the sample of data in order to minimize prediction error. We do not consider the question of the most appropriate data sets for calibration. Rather, we assume that there are preexisting data, for example from variety trials, and one is limited to those data. Note that while better equations and better calibration are two separate paths to improving model predictions, they are complementary; if one wants to compare how well different equations predict phenology, one must first calibrate them.

Calibration of crop models is usually patterned on statistical methods used to estimate parameters in regression. However, the application of statistical methods to crop models is not straightforward. Major difficulties include the facts (i) that system models often have multiple output variables that can be compared to observed results (e.g., dates of heading and dates of flowering for crop phenology models), (ii) that errors for different variables in the same environment are often correlated, and (iii) that there are usually many parameters, often more than the number of data points available. While the details differ, these problems apply to essentially all system models. Probably as a result, there are no widely accepted standard methods for calibration of system models. It has been found, for example, that there is a wide diversity of calibration approaches between modeling teams furnished with identical data, even between teams using the same model structure (Confalonieri et al. 2016; Wallach et al. 2021a, b). By modeling team we mean a group of people working together on or with a crop model.

Because of the importance of calibration and the lack of standard approaches for calibration, there have been many studies published that make recommendations as to how to calibrate crop models or system models in other fields. One type of study is model-specific and identifies the most important parameters to estimate for a particular model (Ahuja and Ma 2011). Other studies have focused on the methodology of identifying the most important parameters through sensitivity analysis (Khorashadi Zadeh et al. 2022), on the choice between frequentist and Bayesian paradigms (Gao et al. 2021), on the form of the objective function, or on the numerical algorithm for searching for the best parameter values (Rafiei et al. 2022). A recent study has shown that different modeling teams make different choices for all the steps of the calibration procedure (Wallach et al. 2021c). That study showed that the modeling community is far from having a consensus on how to calibrate phenology models and that progress is required for multiple aspects of the calibration procedure.

The purpose of this study was to define and test an original, detailed, comprehensive procedure for crop phenology model calibration that could be applied to a wide range of models. We refer to this new procedure as a “protocol,” to emphasize that it contains detailed instructions for calibration. It builds on the recommendations in Wallach et al. (2021c) but goes beyond those more general recommendations, most importantly in proposing an original approach for choosing the parameters to estimate, which is arguably the most important calibration decision. A second major innovation of the protocol is the definition of documentation tables for each step of the calibration procedure, which can be used both for communication within a modeling team and to inform users of the calibrated model. We tested the protocol in a large multi-model ensemble study. Each modeling team first calibrated their model using their “usual” calibration procedure, and then using the protocol proposed here. This is thus a comparison, for each team, of usual versus protocol calibration. The results showed that the protocol reduced the variability between modeling teams compared to usual calibration approaches and, most importantly, that it significantly reduced prediction error compared to usual calibration approaches.

2 Materials and methods

2.1 Data sets

Three data sets for wheat phenology were used here, where a data set is data from fields representative of a specific target population. The target population for the first data set was wheat fields in the major wheat-growing regions of France sown with the winter wheat variety Apache under usual management. The data were from cultivar trials. They were separated into a calibration subset with data from 14 environments (6 different sites; sowing years 2010, 2011, and 2014-2016, though not every year was represented for every site) and an evaluation subset with data from 8 environments (5 different sites; sowing years 2012 and 2013). The target population for the second data set was identical to that for the first, but with the winter wheat variety Bermude in place of Apache. For both data sets, the calibration and evaluation subsets had neither site nor year in common, so the evaluation is a rigorous test of how well a modeling team can simulate phenology for out-of-sample environments. The observed data were days from sowing to the beginning of stem elongation (BBCH30 on the BBCH scale; Meier 1997) and to the middle of heading (BBCH55).

The target population for the third data set was wheat fields in the major wheat-growing regions of Australia, with usual management and sown with the spring wheat variety Janz. The data were from a multi-location, multi-year, multiple-sowing-date trial in Australia (Lawes et al. 2016; Wallach et al. 2021b). The calibration subset had data from four sites in 2010 and 2011, with three sowing dates per site (24 environments overall). The evaluation subset had data from six sites in 2012, with three sowing dates per site (18 environments overall). Once again, the calibration and evaluation subsets had neither site nor year in common. In the original trials, the BBCH development stage was observed once weekly in each environment. Based on those data, a graph of BBCH stage versus day was produced, and interpolation was used to obtain the day for each integer BBCH stage from the earliest to the latest recorded in each environment, as sketched below. Those dates were provided to the modeling teams.
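As an illustration of this interpolation step only, a minimal sketch in R (the column names and observation values are hypothetical) could use linear interpolation of day against stage with base R's approx():

```r
# Weekly observations for one environment (hypothetical values)
obs <- data.frame(day  = c(35, 42, 49, 56, 63),   # days after sowing
                  bbch = c(12, 21, 30, 39, 55))   # BBCH stage observed on each day

# Interpolate day as a function of BBCH stage, for every integer stage
# between the earliest and latest recorded stages
stages <- seq(min(obs$bbch), max(obs$bbch))
interp <- approx(x = obs$bbch, y = obs$day, xout = stages)

data.frame(bbch = interp$x, day_after_sowing = round(interp$y, 1))
```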

2.2 Modeling teams and model structures

Nineteen modeling teams, using 16 different model structures (Table 1), participated in this study, which was carried out within the Agricultural Model Intercomparison and Improvement Project (AgMIP; www.agmip.org). The modeling teams are identified only by a code (“M1,” “M2,” etc.) without indicating which model structure they used, since it would be misleading to give the impression that the results are determined solely by model structure. The participating teams represent an “ensemble of opportunity,” that is, an open call for participants was put out, and all teams that volunteered were accepted. An indication of the resulting variability is given by the diversity in the choice of parameters to calibrate under “usual” calibration (see Supplementary Table S4 in Wallach et al. 2021c). Most teams estimated some parameters that represent degree days to various stages, though the number of degree-day parameters varied between teams. A few models have parameters that represent development rates rather than degree days to each stage. Only a few teams estimated parameters related to time from sowing to emergence. About half of the teams estimated one or more parameters related to vernalization and to photoperiod sensitivity. A few teams estimated parameters related to the temperature response function, for example, minimum or optimum temperature for development, or related to tillering or leaf appearance rate. Finally, a few teams estimated parameters related to the effect of stress on the development rate.

Table 1 List of model structures used by participating modeling teams.

2.3 Goodness-of-fit and evaluation of predictions

Goodness-of-fit refers to how well a calibrated model fits the data used for calibration. Prediction accuracy refers to how well a calibrated model simulates environments different from those in the calibration data set. Since the evaluation environments here are for sites and years not represented in the calibration data, the test of simulated values against the evaluation data truly reflects how well a model can predict for new environments.

For both goodness-of-fit and out-of-sample prediction, our basic evaluation metrics are the sum of squared errors (SSE) and the related quantities mean squared error (MSE) and root mean squared error (RMSE), where

$$\begin{array}{l}SSE=\sum\limits_{i}\sum\limits_{j}{({y}_{ij}-{\widehat{y}}_{ij})}^{2}\\ MSE=SSE/n\\ RMSE=\sqrt{MSE}\end{array}$$
(1)

The sum in SSE is over environments and over development stages: days to BBCH30 and BBCH55 for the French data sets, and days to BBCH30, BBCH65, and BBCH90 for the Australian data set. Here, \({y}_{ij}\) is the observed value of variable i for environment j, \({\widehat{y}}_{ij}\) is the corresponding simulated value, and n is the number of terms in the sum. We also examine the decomposition of MSE into the sum of three terms, namely squared bias (bias2), a term related to the difference in standard deviations of the observed and simulated values (SDSD), and a term related to the correlation of observed and simulated values (LCS) (Kobayashi 2004).
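As an illustration, a minimal R sketch of these metrics and of the three-term decomposition, assuming the form MSE = bias2 + SDSD + LCS with population standard deviations, is:

```r
# Minimal sketch of MSE and its assumed three-term decomposition
mse_decomposition <- function(obs, sim) {
  mse   <- mean((sim - obs)^2)
  bias2 <- (mean(sim) - mean(obs))^2
  sd_o  <- sqrt(mean((obs - mean(obs))^2))   # population standard deviations
  sd_s  <- sqrt(mean((sim - mean(sim))^2))
  sdsd  <- (sd_s - sd_o)^2                   # difference in standard deviations
  lcs   <- 2 * sd_s * sd_o * (1 - cor(obs, sim))   # lack of correlation
  c(RMSE = sqrt(mse), MSE = mse, bias2 = bias2, SDSD = sdsd, LCS = lcs)
}

# Hypothetical observed and simulated days to a stage, one value per environment
obs <- c(150, 160, 155, 170, 165)
sim <- c(148, 163, 150, 175, 168)
mse_decomposition(obs, sim)   # bias2 + SDSD + LCS sums to MSE
```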

In addition, we compare the simulated results in this study with two simple benchmark models. The first (the “naive” model) is simply the average number of days to each stage in the calibration data of each data set. This average is used as the prediction for all environments of that data set. The often-used Nash–Sutcliffe modeling efficiency is one minus the ratio of the MSE of a model to the MSE of the naive model. The naive model ignores all variability between environments, so it is a very low bar as a benchmark. We therefore also use a more sophisticated benchmark, the “onlyT” model, as in Wallach et al. (2021a). This benchmark assumes that the sum of degree days above a threshold of 0°C from sowing to each stage is fixed for spring wheat. For winter wheat, a simple vernalization model is used to determine the start of development. Daily vernalization is 0 if daily mean air temperature is below −4°C, increases linearly to 1 at 3°C, remains at 1 between 3°C and 10°C, decreases linearly to 0 at 17°C, and is 0 above 17°C. When the sum of daily vernalization reaches 50, vernalization is complete (van Bussel et al. 2015; Wallach et al. 2021a). The fixed number of degree days is then accumulated after vernalization is complete. Both benchmark models are easily parameterized from the calibration data and then easily applied to new environments.
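For illustration, a minimal sketch of the “onlyT” benchmark for winter wheat, transcribing the vernalization function described above (the temperature series and degree-day requirement in the example are hypothetical), is:

```r
# Daily vernalization as a function of daily mean air temperature (degC)
daily_vernalization <- function(tmean) {
  ifelse(tmean < -4, 0,
  ifelse(tmean <  3, (tmean + 4) / 7,        # linear increase: 0 at -4 degC to 1 at 3 degC
  ifelse(tmean <= 10, 1,
  ifelse(tmean < 17, (17 - tmean) / 7, 0)))) # linear decrease: 1 at 10 degC to 0 at 17 degC
}

days_to_stage_onlyT <- function(tmean, gdd_required, vern_requirement = 50) {
  # Day on which cumulative vernalization reaches the requirement
  vern_day <- which(cumsum(daily_vernalization(tmean)) >= vern_requirement)[1]
  # Degree days (base 0 degC) accumulated after vernalization is complete
  gdd <- cumsum(pmax(tmean[vern_day:length(tmean)], 0))
  vern_day + which(gdd >= gdd_required)[1] - 1   # days after sowing
}

# Hypothetical daily mean temperatures from sowing onward, and a degree-day
# requirement taken from the calibration data
set.seed(1)
tmean <- c(rnorm(120, mean = 5, sd = 4), rnorm(120, mean = 15, sd = 4))
days_to_stage_onlyT(tmean, gdd_required = 900)
```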

2.4 Simulation exercise

The participants received input data (daily weather at the field, soil characteristics, management details, and, where possible, initial conditions) for all environments of every data set. Also, the observed data from the calibration environments were provided to all participants. The participants were asked to use those data to calibrate their models using the calibration protocol described in detail below and then to simulate and report days after sowing to stages BBCH10 (days to emergence), BBCH30, and BBCH55 for the French calibration and evaluation environments, and to stages BBCH10, BBCH30, BBCH65, and BBCH90 for the Australian calibration and evaluation environments. Days to emergence was included to have an example of a variable for which there were no calibration data. The BBCH stages 30 and 55 requested for the French environments represent stages that are used for fertilizer decisions in France. The BBCH stages 30, 65, and 90 requested for the Australian environments represent major transitions that are explicitly simulated by many models.

All teams had calibrated their model with the same data as here using their usual calibration approach, either in previous studies (Wallach et al. 2021a, b) or specifically for this study. It is the results of that usual calibration that are compared here to the results of using the proposed protocol. At no time were the evaluation data shown to participants, neither in previous studies nor in the present study.

The protocol does not impose a specific software solution. However, several participants used trial and error in their usual approach and requested help in finding and implementing an automated search algorithm, since that is required for the protocol. To answer this need, the CroptimizR R package (Buis et al. 2021) was modified to do the protocol calculations, and many of the participants used this software.

In addition to the individual models, we report on two ensemble models, created by taking the mean (the e-mean model) or the median (the e-median model) of the simulated values. These ensemble models were calculated both for the usual and protocol calibration results.
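For illustration, a minimal sketch of the two ensemble models, assuming a hypothetical matrix of simulated days to a stage with one row per environment and one column per modeling team, is:

```r
# Hypothetical simulated values: 3 environments (rows) x 4 modeling teams (columns)
sims <- matrix(c(150, 152, 149, 161,
                 170, 168, 173, 175,
                 158, 160, 156, 159), nrow = 3, byrow = TRUE)

e_mean   <- apply(sims, 1, mean)     # e-mean model: ensemble mean per environment
e_median <- apply(sims, 1, median)   # e-median model: ensemble median per environment
```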

2.5 AICc and BIC

The protocol prescribes a model selection criterion to decide which parameters to estimate. The corrected Akaike Information Criterion (AICc) and the Bayesian Information Criterion (BIC) are two different criteria that are often used for model selection (Chakrabarti and Ghosh 2011). Both are based on model error, with a penalization term that increases with the number of estimated parameters. Assuming that model errors are normally distributed, the criteria are:

$$\begin{array}{c}{AIC}_{c}=n\;\mathrm{ln}\left(MSE\right)+2p+\frac{2p(p+1)}{n-p-1}\\ BIC=n\;\mathrm{ln}\left(MSE\right)+p\;\mathrm{ln}(n)\end{array}$$
(2)

where n is the number of data points and p is the number of calibrated parameters. These criteria are only used for comparing models calibrated using the same data.
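For illustration, a minimal R sketch of Eq. (2), applied to a hypothetical comparison of two calibrated versions of the same model (here, adding a fourth estimated parameter reduces SSE from 520 to 470 for n = 28 data points), is:

```r
# Eq. (2): AICc and BIC from the sum of squared errors, number of data points n,
# and number of estimated parameters p
information_criteria <- function(sse, n, p) {
  mse  <- sse / n
  aicc <- n * log(mse) + 2 * p + 2 * p * (p + 1) / (n - p - 1)
  bic  <- n * log(mse) + p * log(n)
  c(AICc = aicc, BIC = bic)
}

# Is the reduction in SSE worth the extra parameter? The answer depends on the
# penalty term of the criterion used.
information_criteria(sse = 520, n = 28, p = 3)
information_criteria(sse = 470, n = 28, p = 4)
```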

There have been comparisons between these criteria, but neither seems to systematically perform better than the other for choosing the model that predicts best (Kuha 2004). In applying the protocol here, participants were asked to perform the calculations twice, once using the AICc criterion and once using the BIC criterion to choose the parameters to estimate. In almost all cases, the two criteria led to exactly the same choice of parameters. In the few cases where the criteria led to different choices, the final models had very similar RMSE for the evaluation data, with a very slight advantage to BIC (Supplementary Tables S24-S25). Therefore, all results shown here are based on the BIC criterion.

3 Results and discussion

3.1 Description of protocol

The protocol is based on the recommendations in Wallach et al. (2021c), and follows the same list of steps (Fig. 1), but has important additions, in particular for the choice of parameters to estimate and the documentation to be produced.

Fig. 1
figure 1

Schematic diagram of steps in proposed calibration protocol.

  • Step 1. Describe environments, choose default parameter values

It is important to describe the environments represented in the data and in the target population, in particular their temperatures and day lengths. Information for the data sets used here can be found in Wallach et al. (2021a, 2021b).

Since most parameters retain their default values, the choice of default values for those parameters that affect phenology is important. For phenology, one would want to have reasonable approximations to the cycle length for the cultivar in question, to photoperiod dependence, and to vernalization requirements. This information and more are usually available from the cultivar developer. The documentation for step 1 (see example in Table 2) specifies the characteristics of the cultivar being modeled and those of the cultivar used to provide default parameter values.

Table 2 Example of protocol documentation for Step 1, “Choose default parameter values”. The first row shows cultivar characteristics of the observed cultivar. The second row shows characteristics of the cultivar that provides the default parameter values. This example is for the French data set, variety Apache, modeling team M21.
  • Step 2. Identify correspondence between observed and simulated variables

In the simplest case, there is a simulated variable that corresponds directly to each observed development stage. The documentation for step 2 is a table with one row for each observed variable, showing the corresponding simulated variable if any (see example in Table 3).

Table 3 Example of documentation for protocol Step 2, “Identify correspondence between observed and simulated variables”. The table has one row for each measured variable, showing the corresponding simulated variable if any. This example is for the French data sets and modeling team M21.
  • Step 3. Define the objective function

The objective function of the protocol is squared error summed over development stages and environments, which is the objective function of ordinary least squares (OLS) regression and is often used in crop model calibration. A major choice of the protocol is to include in the objective function all the observed development stages that have a simulated equivalent, including stages that are not of primary interest. A first reason is that often the same calibrated model will be used for several different objectives, so measured variables that are not of central interest in the current study may be important in future studies. Furthermore, using more variables makes the model a better representation of multiple aspects of the system dynamics, which is likely to improve all simulations. The choice in the protocol is to use OLS and to avoid estimating additional statistical parameters representing variance and covariance of errors. However, one should check residual errors to evaluate the extent of heteroscedasticity or correlation of errors. Since the objective function is the sum of squared errors over the variables from step 2, no new decisions are required here and no additional documentation is required.
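As an illustration, a minimal sketch of this objective function, assuming observed and simulated values arranged with one row per environment and one column per development stage (NA where a stage was not observed), is:

```r
# Objective function of the protocol: squared error summed over all observed
# stages with a simulated equivalent and over all calibration environments
sum_squared_errors <- function(obs, sim) {
  sum((as.matrix(obs) - as.matrix(sim))^2, na.rm = TRUE)
}

# Hypothetical example: two stages (columns), three environments (rows)
obs <- data.frame(BBCH30 = c(150, 162, NA),  BBCH55 = c(210, 225, 218))
sim <- data.frame(BBCH30 = c(153, 158, 160), BBCH55 = c(207, 228, 214))
sum_squared_errors(obs, sim)   # = 3^2 + 4^2 + 3^2 + 3^2 + 4^2
```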

  • Step 4. Choose which parameters to estimate

This is arguably the most difficult and the most important decision of the calibration approach. Here we propose a novel approach which combines expert knowledge with a statistical model selection criterion. This approach distinguishes two categories of parameters to estimate: the nearly additive, obligatory parameters (those that will definitely be estimated) and the candidate parameters (those that will be tested, and only changed from the default value if the improvement in the fit to the calibration data is sufficiently large).

  • Step 4a. Identify the obligatory parameters

The obligatory parameters are parameters that are nearly additive, i.e., such that changing the parameter has a similar effect for all environments for some variable in the objective function. Usually, a parameter that represents degree days to a measured stage is a good choice as an obligatory parameter for time to that stage. Estimating a truly additive parameter, which adds the same constant amount to days to a stage for all environments, will exactly eliminate bias for that stage. That is, the mean of simulated values will exactly equal the mean of observed values (see Supplementary Eqs. 1-2). Estimating an almost additive parameter will nearly eliminate bias. Once bias is nearly eliminated, one may already have a fairly reasonable fit to the data. Each almost additive parameter must affect a different variable or combination of variables. There cannot be more almost additive parameters than the number of variables in the objective function. Otherwise, the parameters would be very poorly estimated, or non-estimable. The protocol does allow fewer almost additive parameters than observed variables. In that case, bias is only nearly eliminated on average over several variables, and not for each variable. The choice of obligatory parameters is up to the modeling team, based on knowledge of the model. However, the protocol gives fairly detailed recommendations, namely that they should be nearly additive, that degree days to stages are usually a good choice, and that the number of obligatory parameters cannot exceed the number of different variables. For each obligatory parameter, one must provide the default value and what one considers a reasonable range for that parameter (for an example of choice of bounds see Tao et al. 2018). The documentation for step 4a is a table with one row for each obligatory parameter (see example in Table 4).
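As a sketch of why a truly additive parameter eliminates bias (simplified notation here; Supplementary Eqs. 1-2 give the full argument), suppose that for some stage the simulated value in environment j satisfies \({\widehat{y}}_{j}(\theta)={\widehat{y}}_{j}({\theta}_{0})+(\theta-{\theta}_{0})\) for all j. Then ordinary least squares gives

$$\begin{array}{c}\widehat{\theta}={\theta}_{0}+\frac{1}{n}\sum\limits_{j}\left({y}_{j}-{\widehat{y}}_{j}({\theta}_{0})\right)\\ \frac{1}{n}\sum\limits_{j}\left({y}_{j}-{\widehat{y}}_{j}(\widehat{\theta})\right)=0\end{array}$$

so the mean residual for that stage, and hence the bias, is exactly zero after calibration.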

Table 4 Example of documentation for protocol Step 4a, “Identify the obligatory parameters.” These are parameters that are almost additive, i.e., that have nearly the same effect for all environments. There is one row for each obligatory parameter. The number of obligatory parameters cannot exceed the number of observed variables which have a simulated equivalent, and each obligatory parameter must be nearly additive for a different variable or combination of variables. This example is for the French data set, variety Apache for modeling team M21.
  • Step 4b. Identify candidate parameters

The role of the candidate parameters is to reduce the variability between environments that remains after estimation of the obligatory parameters. It is the role of the modeler to identify the candidate parameters and to order them by amount of variability likely to be explained. In the calculation step (step 5), each candidate parameter is tested, and only those that lead to a reduction in the BIC criterion are retained for estimation. Otherwise, the parameter is kept at its default value. The documentation for step 4b is a table with one row for each candidate parameter (see example in Table 5).

Table 5 Example of documentation for protocol Step 4b, “Identify candidate parameters.” These are parameters that seem likely to explain a substantial part of the variability between environments that remains after fitting the obligatory parameters. There is one row for each candidate, which should be in the order of presumed importance. This example is for the French data set, variety Apache for modeling team M21.
  • Step 5. Calculation of the optimal parameter values

The protocol prescribes the use of a simplex algorithm to search for the optimal parameter values. The Nelder-Mead simplex method (Nelder and Mead 1965) is a robust, derivative-free method, appropriate for crop models, which may have multiple discontinuities. The results of the simplex are sensitive to starting values (Press et al. 2007), so the protocol calls for multiple starting points. In the first calculation step, the obligatory parameters are estimated and the BIC value is calculated; the obligatory parameters form the initial list of parameters to estimate. Then each candidate parameter is tested in turn. If estimating the new candidate together with the previous list of parameters to estimate leads to a reduction in BIC, the candidate is added to the list of parameters to estimate. If not, the candidate returns to its default value and will not be estimated. The documentation for step 5 is a table with one row for each step in the calculation (see example in Table 6).

Table 6 Example of documentation for Step 5 “Calculation of the optimal parameter values”. The first line shows the optimization results for the obligatory parameters, and the resulting sum of squared errors and BIC criterion. Each subsequent line corresponds to a candidate parameter. If estimating the candidate together with the previously selected parameters leads to a decrease in BIC compared to the smallest value so far, the candidate is added to the list of parameters to estimate. If not, the candidate returns to its default value and is not considered further. In this example, the first candidate parameter (jvc) is accepted. All the subsequent candidate parameters increase BIC and are therefore rejected. The model finally chosen (minimum BIC) has three estimated parameters. This example is for the French data set, variety Apache, modeling team M21.

The first calculation step, searching for the optimal values of the obligatory parameters, involves only a relatively small number of parameters, and furthermore these parameters are chosen to be nearly additive. It is therefore expected that this step should not present serious numerical difficulties. Subsequently, only one new candidate is added at a time. The previously chosen parameters are also estimated, but their previous best values should be good starting values, so that once again there should not be serious numerical difficulties. Overall, the protocol takes advantage of the particular structure of crop phenology models, in particular the possibility of identifying the most important, almost additive parameters, in order to separate the search for optimal parameter values into relatively easy steps.
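A minimal sketch of this selection loop in R is given below. It is illustrative only: run_model is a hypothetical wrapper that runs the crop model with the given parameter vector and returns the sum of squared errors from Step 3, and in practice one of the starting points would be the previous best estimates, as noted above.

```r
bic <- function(sse, n, p) n * log(sse / n) + p * log(n)

estimate <- function(param_names, defaults, lower, upper, run_model, n_starts = 5) {
  # Nelder-Mead from several random starting points within the stated ranges;
  # keep the best fit found
  best <- NULL
  for (s in seq_len(n_starts)) {
    start <- runif(length(param_names), lower[param_names], upper[param_names])
    fit <- optim(start, fn = function(x) {
      p <- defaults
      p[param_names] <- x          # parameters not being estimated keep defaults
      run_model(p)
    }, method = "Nelder-Mead")
    if (is.null(best) || fit$value < best$value) best <- fit
  }
  best
}

protocol_step5 <- function(obligatory, candidates, defaults, lower, upper,
                           run_model, n_obs) {
  # Start with the obligatory (nearly additive) parameters
  selected <- obligatory
  fit      <- estimate(selected, defaults, lower, upper, run_model)
  best_bic <- bic(fit$value, n_obs, length(selected))
  # Test each candidate in turn, in the order of presumed importance
  for (cand in candidates) {
    trial     <- estimate(c(selected, cand), defaults, lower, upper, run_model)
    trial_bic <- bic(trial$value, n_obs, length(selected) + 1)
    if (trial_bic < best_bic) {    # candidate reduces BIC: add it to the list
      selected <- c(selected, cand)
      fit      <- trial
      best_bic <- trial_bic
    }                              # otherwise it returns to its default value
  }
  list(parameters = selected, values = fit$par, BIC = best_bic)
}
```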

  • Step 6. Examine goodness-of-fit

Many diagnostics are possible and useful. We particularly emphasize a graph of simulated versus observed values, the MSE and its decomposition for each variable (see example in Table 7), and comparison with the two benchmark models (see example in Table 8). Squared bias is expected to be small after calibration, and this is the case in the example of Table 7.

Table 7 First example of documentation for protocol Step 6, “Examine goodness-of-fit.” In this table, there is one row for each observed variable with a simulated equivalent, showing mean squared error (MSE) and its decomposition into three terms. Of particular interest is the bias2 contribution, which should be small if there is an almost additive obligatory parameter corresponding to this variable. This example is for the French data set, variety Apache, modeling team M21.
Table 8 Second example of documentation for protocol Step 6, “Examine goodness-of-fit”. In this table, there is one row for each observed variable with a simulated equivalent, showing root mean squared error (RMSE) for the calibrated model and for two benchmark models. The “naïve” benchmark assumes that all environments have the same number of days to the given development stage, equal to the average of the observed days to that stage. The “onlyT” benchmark assumes a constant number of degree days to the stage in question, with a simple vernalization calculation in the case of winter wheat. This example is for the French data set, variety Apache, modeling team M21.

3.2 Comparison of protocol and usual calibration

In their usual procedures, none of the modeling teams chose the parameters to estimate in the same way as the protocol. In most cases, for usual calibration, the choice of parameters was based solely on expert opinion. A few teams used expert opinion but tested a few alternative choices to see which gave the best fit to the data. Finally, some teams based the choice on sensitivity analysis (Wallach et al. 2021c). Using usual or protocol calibration led to important differences in the calibrated models. For example, the number of estimated parameters in the final model differed between protocol and usual calibration (Supplementary Figure S1, Table S1).

The differences between simulated values after usual and protocol calibration were small for BBCH10, for which there were no calibration data. For the other stages, the simulated values differed appreciably (Supplementary Figure S2, Table S2).

3.3 Goodness of fit and evaluation of prediction error for usual and protocol calibration

All nineteen modeling teams were able to implement the protocol, showing that the protocol, though detailed, is sufficiently flexible to be applicable to a wide range of models. The protocol was tested in comparison with the “usual” calibration procedure of each modeling team, which was possible because each team had previously calibrated its model using the same data as here. To our knowledge, this is the first example of such a stringent test of a new calibration procedure. It provides a realistic test of whether the proposed protocol really improves calibration.

Figures 2 and 3 show RMSE using usual and protocol calibration for the French and Australian data sets respectively. Table 9 shows RMSE values for each data set, averaged over modeling teams, for usual and protocol calibration for the calibration and evaluation data (results by modeling team are in Supplementary Tables S3-S8). The protocol reduces RMSE by 10–22% compared to the usual calibration method. The p-values for a one-sided paired t-test of the hypothesis that RMSE is larger for usual calibration than for protocol calibration are also shown. On average over stages other than BBCH10, all three data sets have significantly larger RMSE values with usual calibration than with protocol calibration for the calibration data (p<0.05). For the evaluation data, the reduction in RMSE is highly significant (p<0.01) for the two French data sets, but less significant (p=0.15) for the Australian data set. Table 9 also shows the proportion of modeling teams where RMSE is larger for the usual calibration than for the protocol calibration. Looking at the averages over stages and then averaging over data sets, 75% of models have lower RMSE for protocol calibration than for usual calibration for the calibration data (Supplementary Tables S3, S5, S7). For the evaluation data, 60% of models have lower RMSE for protocol calibration than for usual calibration (Supplementary Tables S4, S6, S8). Presumably, a major reason that protocol calibration reduces RMSE compared to usual calibration is that the protocol uses an improved method of choosing the parameters to estimate, which combines expert knowledge and a statistical model selection criterion. Using a model selection criterion has the advantage that it avoids overfitting and in general will avoid estimation of parameters whose estimators are highly correlated.
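For reference, the paired one-sided test reported in Table 9 can be computed as in the following sketch (the per-team RMSE values shown are hypothetical):

```r
# Hypothetical per-team RMSE values (days), averaged over stages other than BBCH10
rmse_usual    <- c(6.2, 7.5, 5.9, 8.1, 6.8, 7.0)
rmse_protocol <- c(5.8, 6.4, 6.1, 7.2, 6.0, 6.5)

# One-sided paired t-test of the hypothesis that RMSE(usual) > RMSE(protocol)
t.test(rmse_usual, rmse_protocol, paired = TRUE, alternative = "greater")

# Fraction of teams for which usual calibration gives the larger RMSE
mean(rmse_usual > rmse_protocol)
```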

Fig. 2
figure 2

Root mean squared error (RMSE) following protocol calibration versus RMSE following usual calibration, for each modeling group. Points below the diagonal indicate smaller RMSE for protocol calibration than for usual calibration. a French calibration data, variety Apache. b French evaluation data, variety Apache. c French calibration data, variety Bermude. d French evaluation data, variety Bermude.

Fig. 3
figure 3

Root mean squared error (RMSE) following protocol calibration versus RMSE following usual calibration, for each modeling group. Points below the diagonal indicate smaller RMSE for protocol calibration than for usual calibration. a Australian calibration data. b Australian evaluation data.

Table 9 Comparison of errors for usual and protocol calibration. The table shows root mean squared error (RMSE) averaged over modeling teams for each stage and for the average over stages, separately for the calibration and evaluation data. For both the calibration and the evaluation data, the first column is RMSE for simulations using the usual calibration method, the second column is RMSE using protocol calibration, and the third column is the p-value of a one-sided paired t-test that tests whether RMSE for usual calibration (RMSEu) is larger than for protocol calibration (RMSEp). Below the p-value is the fraction of modeling teams for which RMSE for usual calibration is larger than for protocol calibration.

Almost all modeling teams did better than the two benchmark models for all stages, both for usual calibration and protocol calibration, with slightly better results for protocol calibration (Supplementary Tables S3-S8). Since the protocol specifically aims to reduce bias, one would expect squared bias to be a smaller fraction of MSE for protocol calibration than for usual calibration, and this is the case, both for the calibration data and the evaluation data (Supplementary Tables S9-S23).

3.4 Between-model variability

The variability between simulated results for different modeling teams is shown in Table 10. The standard deviation is similar for usual and protocol calibration for BBCH10, for which there are no data for calibration, but is systematically smaller for protocol calibration for the other stages. Considering the average over stages other than BBCH10 and taking the average over data sets, protocol calibration reduced the standard deviation of simulated values by 31% for the calibration data and by 22% for the evaluation data (see Table 10).

Table 10 Variability of simulated values. This table shows the standard deviations of values simulated by the different modeling teams (days), for each simulated development stage and for the average over stages other than BBCH10. BBCH10 is not included in the average, since there were no observed values for BBCH10. Separate standard deviations are given for the calibration and evaluation data and for simulations using usual or protocol calibration.

3.5 Comparison of usual and protocol calibration for ensemble of models

The choice of usual or protocol calibration has little effect on the predictive accuracy of the ensemble models e-mean and e-median. Averaged over development stages and over data sets, for the evaluation data, RMSE for e-median is respectively 5.7 and 5.8 days for usual and protocol calibration. The values for RMSE of e-mean are 6.1 and 6.2 days for usual and protocol calibration, respectively (Supplementary Tables S4, S6, S8).

Recently, many crop model studies have been based on ensembles of models (Jägermeyr et al. 2021). Many studies have found that the ensemble mean and median are good predictors, sometimes better than even the best individual model (Martre et al. 2015; Wallach et al. 2018; Farina et al. 2021). It has thus become quite common to base projections of climate change impact on crop production on the ensemble median (e.g., Asseng et al. 2019). The e-mean and e-median results here, compared to the individual modeling teams, are in keeping with previous results. The e-median model is among the better predictors, though not the very best, and is somewhat better than e-mean. However, this does not imply that individual model results are unimportant. First, there will continue to be studies based on a single model or on a very small ensemble, and for those studies it is important to improve individual models. Also, even for larger ensembles, it is important to reduce variability between models because this reduces variability in the ensemble results.

4 Conclusions

We have proposed an original, detailed protocol for calibration of crop phenology models and shown that it can be applied to a wide range of wheat models and to data sets with different structures. The application here is to wheat phenology models, but the same protocol could undoubtedly be used for phenology models of other crops.

This protocol could also be the basis for a calibration protocol for general crop models using multiple kinds of data. The procedure proposed here for the choice of parameters to estimate could be applied in the more general case. However, there would be the additional problem of dealing with multiple types of data, for example, phenology, biomass, yield, etc.

The protocol was tested with data sets representing a diversity of conditions. Comparison with usual calibration practices showed that, on average over modeling teams, the protocol led to a better fit to the calibration data and to smaller prediction errors for out-of-sample data. The error of the ensemble mean or median of simulations was nearly identical for usual and protocol calibration, but the protocol substantially reduced between-model variability compared to usual calibration, which reduces the uncertainty of the mean or median. Thus, application of the protocol would be advantageous not just for individual modeling studies, but also for studies based on ensembles of models.