Same soil, different climate: Crop model intercomparison on translocated lysimeters

Crop model intercomparison studies have mostly focused on the assessment of predictive capabilities for crop development using weather and basic soil data from the same location. Still challenging is the model performance when considering complex interrelations between soil and crop dynamics under a changing climate. The objective of this study was to test the agronomic crop and environmental flux‐related performance of a set of crop models. The aim was to predict weighing lysimeter‐based crop (i.e., agronomic) and water‐related flux or state data (i.e., environmental) obtained for the same soil monoliths that were taken from their original environment and translocated to regions with different climatic conditions, after model calibration at the original site. Eleven models were deployed in the study. The lysimeter data (2014–2018) were from the Dedelow (Dd), Bad Lauchstädt (BL), and Selhausen (Se) sites of the TERENO (TERrestrial ENvironmental Observatories) SOILCan network. Soil monoliths from Dd were transferred to the drier and warmer BL site and the wetter and warmer Se site, which allowed a comparison of similar soil and crop under varying climatic conditions. The model parameters were calibrated using an identical set of crop‐ and soil‐related data from Dd. Environmental fluxes and crop growth of Dd soil were predicted for conditions at BL and Se sites using the calibrated models. The comparison of predicted and measured data of Dd lysimeters at BL and Se revealed differences among models. At site BL, the crop models predicted agronomic and environmental components similarly well. Model performance values indicate that the environmental components at site Se were better predicted than agronomic ones. The multi‐model mean was for most observations the better predictor compared with those of individual models. For Se site conditions, crop models failed to predict site‐specific crop development indicating that climatic conditions (i.e., heat stress) were outside the range of variation in the data sets considered for model calibration. For improving predictive ability of crop models (i.e., productivity and fluxes), more attention should be paid to soil‐related data (i.e., water fluxes and system states) when simulating soil–crop–climate interrelations in changing climatic conditions.


INTRODUCTION
Process-based crop growth models are an essential and invaluable tool to assess the impact of climate change on future crop yields Challinor et al., 2018). The crop models used in these climate change impact assessments studies generally assume an ecosystem response to changing climatic conditions. For instance, the response to rising temperature and CO 2 in plant productivity can be described with similar functional relationships and parameterization as under current climatic conditions (Ainsworth & Rogers, 2007;Asseng et al., 2015;Lenka et al., 2020). In addition, Boote et al. (2010) pointed out that in the past, developers of crop models have often assumed that these functional relationships are correct, and that insufficient effort has been made to improve these relationships or to test them in light of the latest scientific findings. Future crop growth assessments based on process-based crop model intercomparison studies have mostly been carried out using model parameters calibrated for soils under original climatic conditions in their original regions (Mulla et al., 2020;Ruane et al., 2017;F. Tao et al., 2020;Webber, Hoffmann, & Rezaei, 2018). All these studies were based on soils developed at a site under a specific past climate and management, assuming that the soil-crop-atmosphere system is adapted to the atmospheric and lower boundary conditions. The problem with this approach is that modelers calibrate and validate their specific model in a range of environments (i.e., soil and climate conditions) and use the parameterized models with given structures to predict crop development under environments not included in the calibration, as pointed out by Wallach, Palosuo, Thorburn, Gourdain, et al. (2021) and . Changes in climatic conditions (e.g., induced by future climate change) will affect not only crop growth but also the soils (e.g., Robinson et al., 2016Robinson et al., , 2019. Thus, the ability of individual crop models to predict effects of changing climatic conditions on soil ecosystems remains untested for the range beyond the measured sitespecific variability of environmental conditions. Furthermore, the analyses of crop model simulations mostly focus on the physiological processes such as growth in biomass, yield, and phenological development during the vegetation period (e.g., Martre et al., 2015;Wallach, Palosuo, Thorburn, Gourdain, et al., 2021;Yin et al., 2017). However, for the simultaneous prediction of agronomic (crop production) and environmental (boundary and internal fluxes and states) variables under changing climatic conditions, the soil-plant-atmosphere system must be considered as a whole Vadose Zone Journal (i.e., adequate attention should be paid to crops and soil water fluxes and states). This includes continuous simulations, including crop-free periods and lead times (e.g., Groh, Diamantopoulos, et al., 2020). Analyzing water and element fluxes, as well as soil water storage, becomes increasingly important when trying to predict crop development under changing climatic and environmental conditions because of the feedbacks of those fluxes and states with the crop development.
A common strategy to develop site-specific soil use and management strategies and predict agricultural productivity in the same soil under changing climatic conditions is to carry out crop model simulations for climate scenarios, generated based on global atmospheric circulation models (e.g., Abd-Elmabod et al., 2020;Chisanga et al., 2020;Zydelis et al., 2021). Typically, crop models have first been calibrated and validated based on multiannual observations for a specific crop (or crop rotation) at various sites (e.g., Kollas et al., 2015;Nendel et al., 2011) to represent a wide range of environmental conditions in these studies. In a second step, the calibrated models have been applied to predict crop productivity under future climatic conditions (e.g., Schleussner et al., 2018). However, the ideal situation regarding data quality for calibration and validation  has been realized only in a few studies (Archontoulis et al., 2020). Even more frequently, the models are often calibrated at a single site only, for example, due to other experimental constraints. Crop model intercomparison studies as performed by the Agricultural Model Intercomparison and Improvement Project (AgMIP) (Ruane et al., 2017) or by the Modelling Agriculture with Climate Change for Food Security (MAC-SUR) (Asseng et al., 2013;Bassu et al., 2014) among other initiatives, explained differences in predictions between the crop models at the same site by a differential response to increased temperature and atmospheric CO 2 concentrations in leaf area development due to different model formulations (e.g., F. Tao et al., 2020) or by the limited representation of differences in soil properties at heterogeneous landscapes (Groh, Diamantopoulos, et al., 2020). Well balanced, highquality datasets as demanded by Kersebaum et al. (2015) for model calibration and validation are rare, and in various cases, datasets have been used in which the crop has not entirely been free of stress. For the model calibration procedure, this means that modelers calibrate plant-specific parameters, while they should have also accounted for soil-related parameters (e.g., Asseng et al., 2013;Bassu et al., 2014). The resulting parameter set then often delivers a successful simulation in the application case, but for the wrong reasons (Wallach, Palosuo, Thorburn, Hochman, Gourdain, et al., 2021) The validation of crop model predictions has furthermore been limited by the availability of data from climate change experiments, especially from long-term monitoring studies (Peng et al., 2020) that also include water and matter fluxes

Core Ideas
• We demonstrate the use of high precision weighable lysimeter for full model calibration and validation. • Lysimeter data from translocated soils represent effects of changing climatic conditions. • We compare calibration with blind forward simulations (fixed soil and calibrated crop parameter). • We compare individual crop model predictions with multi-model mean. • We test the predictive ability of crop models and multi-model mean.
in the soil. Although future climatic conditions may be mimicked (e.g., Deltedesco et al., 2019;Köhler et al., 2019), it is widely unclear how to consider future soil conditions. So far, the coevolution of soil-crop-ecosystems under changing climatic conditions as a longer term process cannot be described yet, but models were developed that could be used for predictions. However, this is crucial, as the effect of soils on climate can lead to large-scale climatic changes. One step to test and develop simulation models or model components could be to compare simulated crop developments and fluxes obtained with calibrated crop model with experimental data from sites with different climates. However, this approach does not allow to disentangle the effect of soil from that of climate because it neglects that soil development is site-specific, depending on climate, vegetation, and parent material, and soils may reach another state with changing boundary conditions (Maurer & Gerke, 2016;Robinson et al., 2016;Zeng et al., 2004). Testing the ability of crop models for different climatic conditions but for the same soil characteristics requires the translocation of soils into climate zones where they did not develop. Soil monoliths in cropped weighing lysimeters have been proposed for this purpose . Thereby, the original soil structure of the soil profile during extraction is preserved. Furthermore, water balance components such as precipitation and evapotranspiration including nonrainfall water (i.e., dew and fog) can be determined more accurately with high-precision weighable lysimeters compared with other typical methods and instruments (Alfieri et al., 2012;Gebler et al., 2015;Groh et al., 2019;Haselow et al., 2019). In general, precise data help to improve model parameterization (Groh, Stumpp, et al., 2018). These lysimeters can be either installed in climate chambers as larger-scale laboratory systems (e.g., Roy et al., 2021) or moved to other regions exhibiting different climatic conditions (e.g., Pütz et al., 2016) to observe and compare the soil-ecosystem response under different climatic conditions, which perhaps includes the transition period towards arriving at a new state. In the TERENO SOILCan lysimeter project (TERrestrial ENvironmental Observatories; https://www.tereno.net), the transfer of intact soil monoliths between the lysimeter stations followed a space-for-time substitution approach, and sites are representative of agricultural regions in Central Europe with the highest vulnerability to expected climate change effects (Zacharias et al., 2011). The transfer of soils along environmental gradients therefore mimics expected climate changes effects.
Recent field experiments on ecosystem responses to climate change focused mainly on plant-or crop-related processes (e.g., Drought-Net) (Hilton et al., 2019;Loik et al., 2019) and environmental aspects such as soil water content; water and element fluxes received far less attention. Agronomic variables such as grain yield and aboveground biomass are inherently connected with the water and plant nutrient status of the soil (e.g., Kersebaum, 2007). Therefore, in addition to variables controlling crop growth, the soil status controlling the environmental fluxes, as well as the knowledge about the lower boundary conditions (i.e., groundwater level), are needed for a holistic crop model calibration and validation (Groh, Diamantopoulos, et al., 2020;Groh et al., 2016). So far, lysimeter data from the same site (climate) but for different soil profiles have been used for crop model predictions (Groh, Diamantopoulos, et al., 2020). Applying these same data from the SOILCan lysimeter network, especially from those translocated to regions with different climatic conditions, bears the opportunity to test whether the structure of the models is adequate to simultaneously predict crop development and environmental fluxes for same soil under different climatic conditions.
The overall aim of this study was to compare the ability of an ensemble of crop models to predict the ecosystem's agronomic productivity (grain yield and above ground biomass), environmental fluxes (evapotranspiration, drainage), and states (soil water content) for lysimeters containing the same soils but exposed to different climates. We hypothesize, that all the calibrated crop models when using identical soil hydraulic properties will be able to similarly well reproduce the agronomic and environmental variables at the calibration site, but model performance will diverge when predicting crop growth, yield, and soil water flux for conditions outside the range of site-specific variations. The objectives were (a) to assess the effect of changed climatic conditions on key agronomic and environmental variables and (b) to test the predictive capability of crop models to simulate this effect. The model comparison will contribute to test a widely used procedure in climate change impact assessments studies (e.g., Falconnier et al., 2020;Franke et al., 2020;Rosenzweig et al., 2014;Zydelis et al., 2021), in which calibration is based on a set of climatic conditions and prediction will be made for climatic conditions outside the calibration range.

Site and soil descriptions
The Dedelow (Dd; Figure 1) site (53˚22′2.45′′ N, 13˚48′10.91′′ W, 50-60 m asl) is located in the region of the younger moraines in northeastern Germany, which is an intensively cultivated hummocky arable soil landscape. The Dd lysimeter station is part of the Northeast German Lowland Observatory of TERENO (Heinrich et al., 2018) and the German-wide lysimeter network SOILCan . A total of nine intact eroded Luvisol soil monoliths (c.f., Groh, Diamantopoulos, et al., 2020) were extracted along an approximately 20-m transect within the experimental field in Dd in 2010 ( Figure 2). Following the concept of the TERENO SOILCan lysimeter experiment, three of the soil monoliths have been transferred from Dd to (a) a drier and warmer (D-W) region in the central (Bad Lauchstädt, BL) and three have been transferred to (b) a wetter and warmer (W-W) region in the western part of Germany (Selhausen, Se). Crop management remained similar at all sites except for regionally adopted fertilizer application, plant growth regulators, and pesticides (see Supplemental Tables S3-S6). Three soil monoliths remained at the Dd station. The experimental field station in BL (51˚23′37′′ N, 11˚52′41′′ E, 113 m asl) is located in central Germany and the station in Se (50˚52′7′′ N, 6˚26′58′′ E, 104 m asl) is located in the western part of Germany ( Figure 1). The BL site is part of the Harz/Central German Lowland Observatory, whereas the Se site is part of the Eifel/Lower Rhine Valley Observatory of the TERENO network (Bogena et al., 2018;Wollschläger et al., 2016). At each site, the air temperature (T a ), global radiation (R g ), wind speed (WS), rainfall (R), relative humidity, and barometric pressure data were monitored.
The climatic conditions at BL and Se ( Figure 1) differed from those of the (soil-origin) site in Dd with respect to annual mean T a , mean annual R, grass reference evapotranspiration (ET 0 ), WS, and annual R g . For 2014-2018, the soil monoliths from Dd were subjected at Se to higher annual average T a (+1.6˚C), R (+58 mm yr −1 ), and R g (+65.4 kWh m −2 yr −1 ), but to a lower WS (−0.3 m s −1 ), and ET 0 (−102 mm yr −1 ). At BL, the annual averages were higher than in Dd for T a (+1.0˚C), ET 0 (+28 mm yr −1 ), and R g (+31.3 kWh m −2 yr −1 ), but lower for WS (−0.1 m s −1 ), and R (−77 mm yr −1 , see Table 1).
The D-W climatic conditions at BL are characterized by a climatic aridity index (AI = ET 0 /R) of 1.47, and the W-W climate by an AI value of 0.96, as compared with the AI value of 1.23 at Dd (Table 1). The water balance components obtained from three lysimeters at each site (Supplemental Table S1) show the basic hydrological differences between T A B L E 1 Site specific mean air temperature (T a , annual and seasonal), mean rainfall (R, annual and seasonal), potential annual evapotranspiration determined as grass reference evapotranspiration (ET 0 ; Allen et al., 2006), aridity index (AI) defined as the ratio between R and ET 0 , annual global radiation (R g ), and daily mean wind speed (WS) during the observation period (November 2014-October 2018 Table S2), the highest grain yield (GY) for all crops was observed for the Dd lysimeters at the original Dd site; in terms of grain equivalent units, it outperformed the overall crop productivity of the Se lysimeters by almost 100%.

Model comparison strategy
Eleven models participated in the study to predict environmental and agronomic variables; AgroC (AC) (Klosterhalfen et al., 2017) Daily values of actual evapotranspiration (ET a ), R, and net drainage water flux at the bottom boundary of the lysimeter at 1.5-m depth (NetQ) were determined from the high precision weighing lysimeters directly (c.f., Groh, Stumpp, et al., 2018), which have a surface area of 1 m 2 . Volumetric soil water content was measured with time domain reflectometry (CS610, Campbell Scientific) and soil water pressure heads with tensiometers (TS1, METER), with pairs of both types of sensors installed at 0.1-, 0.3-, and 0.5-m depths, and a TS1 sensor at 1.4-m depth. Observations on volumetric soil water content were aggregated to a mean soil water content (SWC), which represents the soil's main root zone (from 0-to 0.6-m soil depth).
The crop management and development data for each site included sowing and harvest dates, crop type, information on F I G U R E 2 Photo of (a) the Dedelow site before, and (b) after lysimeter extraction July 2010. (c) Typical soil profile with Ap-, Bt-, and C-horizons of the Luvisol. In the back of Photos a and b, the instrument for lysimeter extraction can be seen. In Photo a, an air-suspended truck is shown, which was used to transport the excavated soil monoliths to the experimental research field station in Bad Lauchstädt and Selhausen. Photo by Wilfried Hierold, Leibniz Centre for Agricultural Landscape Research (ZALF), Müncheberg the previous crop, seed density, emergence, tillage operations, crop residual management (Supplemental Table S3), the crop development in BBCH stages (Biologische Bundesanstalt, Bundessortenamt and Chemical industry), and applications of fertilizers (Supplemental Table S4), plant growth regulators, and pesticides (Supplemental Tables S5 and S6) were provided to set up the simulation. Daily values of T a (minimum, maximum, and mean), WS, relative humidity, barometric pressure, and R g were also provided for the simulation. ET 0 was calculated for hourly time steps according to the FAO56 Penman-Monteith method (Allen et al., 2006) and summed up to obtain daily values. We have to note that R input provide from lysimeter observations also contained inputs from nonrainfall events (i.e., dew and hoar frost; Groh, Slawitsch, et al., 2018). Pressure heads (hPa) measured in each lysimeter by tensiometers at 1.4-m soil depth or the depth to water table (m; converted from pressure head observations) were used to define the lower boundary of the corresponding soil profile. Soil physical and chemical characteristics were obtained from laboratory data derived from intact core samples and disturbed bulk soil samples (Supplemental Tables S7 and S8). These data were used to parameterize the soil hydraulic properties and chemical characteristics (e.g., total N content, C/N ratio) of the soil profiles and were provided to set up the simulation. More details on the determination on those soil characteristics and initial conditions for the simulation can be found in Groh, Diamantopoulos, et al. (2020).
The calibration of the crop model parameters for each crop was based solely on the data from the Dd site. The provided data to calibrate the models included end-of season agronomic variables GY and aboveground biomass (AgBio), and in-season observations of leaf area index (LAI) (for measurement methods for LAI see Groh, Diamantopoulos, et al. [2020]) over four cropping seasons. Additionally, the daily main environmental variables (ET a , NetQ, and SWC) from lysimeters at Dd, including the three Dd lysimeters and two additional lysimeters of the present study (eroded Luvisols; c.f., Groh, Diamantopoulos, et al., 2020), were provided for the calibration process.
For the predictive simulations of the Dd soil-ecosystem under a D-W and W-W climate, the measured soil physical and chemical characteristics of each individual lysimeter and the atmospheric conditions, as well as lower boundary information for each lysimeter (site), were used. During model calibration, each modeling group identified and adjusted a single set of plant related model parameters for each crop (winter wheat [Triticum aestivum L.], winter rye [Secale cereale L.], and oat [Avena sativa L.]) for the soils at the reference site in Dd. This approach was chosen because the crop parameters in models are plant specific but remain the same for all soil types. We followed recommendations from Wallach, Palosuo, Thorburn, Hochman, Gourdain, et al. (2021) to document the calibration process for each model (see Supplemental Tables S9-S18), including information on the choice of fitting parameters, the estimated parameter values (i.e., optimal and range), and the software or method used to obtain the parameter values.
The crop model parameters determined during calibration to describe crop development and soil water fluxes for the different boundary conditions at BL and Se, but similar crop management (see Supplemental Table S3) was used in a forward simulation to validate the predictive ability of the models. Also, for these sites, climate and crop management data as well as the soil hydraulic properties and chemical characteristics were provided. The "Data set Classification software" developed by Kersebaum et al. (2015) was used to quantitatively analyze and classify our dataset for calibration and validation of crop models. The estimated ranking showed a score of >200 points ("Platinum"), which classifies our calibration dataset as highly suitable for parametrizing, calibrating, and improving crop models.
The individual model and MM performance for the calibration and prediction was evaluated for the agronomic related inand end-season variables (LAI, AgBio, and GY) and the main in-season environmental variables (ET a , NetQ, and SWC). The normalized root mean square error (nRMSE) was used as the objective function for model calibration and as criterion for model evaluation as where SD is the standard deviation of GY, AgBio, LAI, ET a , NetQ, and SWC; N is the number of observations; and Sim and Obs are the simulated and observed values. Note that we have used SD instead of the mean value to take into account variables such as NetQ, which include negative (drainage) and positive values (upward directed water flux (2) where M is the number of individual available measured variables (i.e., 6, if all measured variables were considered).
The water use efficiency (WUE) (kg m −3 ) at the ecosystem level was described as and represents the measure of consumed water during the growing season (i.e., from sowing until harvest) of the corresponding crop (Katerji et al., 2008). As a second crucial agricultural index, the harvest index (HI) was calculated (Hay, 1995) as The HI index was used here to compare how crop models perform under different environmental conditions (e.g. water stress) (Fereres & González-Dugo, 2009) and crop management decisions (e.g., growth regulator) (Hütsch & Schubert, 2018).

Model calibration
The mean deviation of model results from data in terms of nRMSE after calibration ( Figure 3) indicates characteristic differences between the tested models for the boundary fluxes (ET a , NetQ), crop development and yield (LAI, GY, AgBio), and soil state variables (SWC). Except for LAI, the model performances ( Figure 3) were relatively similar for all models, including the MM. For the crop related variables, the nRMSE values ranged from 15 to 121% for GY, 13 to 72% for AgBio, and 44 to 192% for LAI. The lowest nRMSE values were obtained with the MO model for GY and LAI and with the SP model for AgBio (Figure 3). For the water balance related variables, the nRMSE values range from 48 to 87% for ET a , 84 to 234% for NetQ, and 33 to 102% for SWC, with lowest nRMSE values achieved always by the MM (Figure 3). The standard deviation in nRMSE values between all models after calibration was relatively large for NetQ (15%), mainly caused by few models with low predictive performance (HE, MO, DC, and GE), smallest for ET a (2%), and relatively similar for GY (10%), AgBio (7%), LAI (7%), and SWC (8%). Performance for models describing the soil water balance through a capacitance approach (i.e., in the non-Richards-based models HE, MO, and DC) showed large variability between single soil profiles reflected by higher standard deviations of nRMSE values for NetQ, SWC, and GY ( Figure 3). Most models showed a good fit for the variables GY, AgBio, and ET a (i.e., nRMSE values ≤50%). On the other hand, nRMSE for NetQ and LAI were relatively large (≥100%). The overall agreement between observations and simulation after calibration for the Dd soil at the Dd site was best for the MM (see blue asterisks in Figure 3), based on the multiobjective criterion nRMSE Total .

Model prediction ability
For the evaluation of the predictive model performance (Figure 4), the deviations between simulated and measured data were used to calculate the MnRMSE, which is the variable specific total mean nRMSE averaged over the entire crop rotation and averaged over the three replicate lysimeters at each site. The MnRMSE values from the calibration at the Dd site were compared with MnRMSE values from predictions for BL and Se site ( Figure 3). Similar MnRMSE values in calibration and prediction models can also predict crop growth and environmental fluxes under a different climate. In case of stronger deviations, this means that calibrated model parameters are not able to describe the observed processes as well. As expected, the MnRMSE values averaged for each model over the three soil profiles per site ( Figure 4) are mostly larger for the validation at BL and Se sites than those obtained at the calibration site Dd.
The MnRMSE values for the prediction of GY, ET a , and NetQ at BL (Dd soil at W-D climate) were relatively similar for most models except for GE, HE, and DC ( Figure 4). For the latter models, the performance even improved for NetQ. The MnRMSE values for the prediction of AgBio and SWC, however, deviated largely from the 1:1 line for most models at BL site ( Figure 4). For Se, the MnRMSE values for most predictions of GY were relatively large, whereas AgBio was captured well (as compared with BL) by most crop models except for DC, DY, and SP (Figure 4). Note that results for LAI were not included in the analysis here, because of a lack in LAI observations at Se and BL. For ET a , the MnRMSE values were relatively similar among all three sites, with slight shifts from the 1:1 line indicating overestimation mainly at Se (Figure 4). For NetQ, most models achieved relatively similar or even better (i.e., DC, GE, HE) quality measures at the validation site. For SWC, the MnRMSE values were relatively large at the validation sites except for models AC, HE, TH, HG, and DY, where the prediction quality was similar to that obtained at the Dd calibration site. predicted the GY for oat and winter wheat relatively well, but not the GY for winter rye. The mismatch for the MM for GY was relatively small with average MnRMSE values over all crops of 40% at Dd and 49% at BL. However, at Se predictions of GY were clearly overestimated by the MM and achieved a MnRMSE value of 205%. The AgBio could be described well by most models at the calibration site Dd (Supplemental Figure S1). One exception is perhaps the winter wheat in the first growing season 2014-2015, which was underestimated by AC, DC, and DY models and overestimated by TH. Predictions of AgBio at BL were less accurate than those at Se. Here, AgBio was underestimated for winter rye by all models except for SP and for the winter wheat crop mainly by AC, DC, and MO models. In contrast to GY, the prediction of AgBio at Se were mostly in agreement with the observations (Supplemental Figure  S1), with only DC underestimating and AC and DY overestimating AgBio. AgBio predictions deviated remarkably for winter rye by SP model and for winter wheat by TH model (Supplemental Figure S1). The MM achieved lowest MnRMSE values at Se site and the SP model at Dd site and the SU model at BL site (both belonging to the Expert-N model family) performed best in simulating AgBio with regard to other models. However, MnRMSE values for AgBio for MM were higher at the validation sites BL (66%) as compared with the reference site Dd (30%) and validation site Se (29%).

Crop growth and yield
Crop model predictions can be used to calculate agronomic measures such as the HI (Figure 6). For the reference site Dd, HI values were obtained with reasonable accuracy by most models except for DC, DY, GE, and TH (MnRMSE > 150%). Note that the DC model used a fixed GY/biomass ratio for all crops. The predicted HI values were overestimated by most models for the W-D site BL compared with the reference site Dd, except for DC and SU models. Largest deviations of HI at BL were observed (MnRMSE = 472%) by the MO model (Figure 6), overestimating the HI for winter rye and underestimating for winter wheat for one of the soils. At the Se validation site (Figure 6), most models failed to predict HI values. The model performance was generally low and MnRMSE values ranged from 144 to 544% because of the overestimation of GY. The predictions by all models failed to describe

Crop water use and soil ecosystem fluxes
The majority of the models underestimated ET a (crop wise) for the growing seasons (Figure 7) at the three sites and achieved relatively similar MnRMSE values (e.g., for MM, 49% at Dd, 53% at BL, and 56% at Se). The ET a for oat in 2018 was only captured by DY. The models AC, MO, GE, and HG more or less systematically underestimated ET a during the whole crop rotation, whereas other models (DC, CE, SP, SU, TH, and the MM) for certain crops under and overpredicted measured ET a (Figure 7).
In contrast with ET a , the actual bare soil evaporation (E a ) during periods without vegetation (Supplemental Figure S2) is controlled by weather and the soil hydraulic properties, Vadose Zone Journal F I G U R E 6 Comparison of simulated vs. measured harvest index (HI; i.e., ratio between grain yield and total biomass) for lysimeters at the reference site Dedelow (Dd, green symbols; i.e., model calibration) with those transferred to the locations Bad Lauchstädt (BL, red symbols) and Selhausen (Se, blue symbols) for winter wheat (squares), winter rye (circles), and oat ( but not by plant properties. The E a was here assigned to a specific cropping period by cumulating daily E a values of the calendric periods before planting and after harvest of the crop. Simulated E a values agreed with observations fully for the HG model and MM approach, partly for DC, DY, HE, MO, TH, and AC (Supplemental Figure S2). The AC, HE, and TH models underestimated E a , whereas the Expert-N models (CE, GE, SP, and SU) overestimated E a , especially before and after oat cropping at both validation sites. These deviations of predicted E a from observation were largest for relatively long bare soil periods before and after the oat crop (August 2017 until March 2018).
The predicted WUE did not match well for most crop models at the reference site Dd, and model performance values even increase considerably for the validation sites ( Figure 8). For all crops, WUE was overestimated by AC and MO and underestimated by the DC model. For oat, CE and SP overestimated WUE, and for winter wheat, the DY model showed relatively large scatter (Figure 8). Predictions for the validation sites underestimated WUE for winter rye (AC, F I G U R E 7 Comparison of simulated vs. measured cumulative actual evapotranspiration (i.e., ET a ) for lysimeters at the reference site Dedelow (Dd; green symbols) and lysimeters transferred to Bad Lauchstädt (BL; red symbols) and Selhausen (Se; blue symbols) for winter wheat (squares), winter rye (circles), and oat ( DC, CE, SU, and HE), and overestimated WUE for winter wheat (AC, GE), winter rye (MO), and oat (AC CE, GE, and SP). The WUE was overestimated similarly to DC's GY for almost all crops. A similar overestimation of WUE for oat at Dd by CE and SP models was predicted at BL. Except for DC, the crop models mostly overestimated the WUE at Se (Figure 8), because WUE depends on GY, which was overpredicted by most models at this site. The MM was in better agreement with observations, in particular for values at Dd and BL, somehow reflecting a compensation of overand underestimations of WUE by the different crop models.
The predictions of the cumulative net drainage flux out of the lysimeters (NetQ) agreed with observations at the reference site Dd for AC, DY, CE, SP, SU, MO, TH, and HG (Supplemental Figure S3). For the validation site BL, simulated NetQ values fully agreed with observations for DC, DY, CE, SP, and HE model (Supplemental Figure S3). Note that NetQ simulated by the HE model was always zero because capillary rise was restricted for cases of deeper groundwater levels. For the second validation site Se, with a W-W climate, the same crop models either overestimated (AC, HG, TH, MO) or underestimated (CE, GE, SP, SU) NetQ. The mean seasonal SWC in the upper 0.6 m agreed with observations when looking at the values for the reference site Dd (Supplemental Figure S4), with a MnRMSE value across all models of 63%. The match for predicted SWC among the models at the validation site BL was less (MnRMSE of 97%) because DC, CE, SP, SU, and the MO models tended to underestimate the SWCs. Note, that the scatter between the SWC values of the three soil profiles (e.g., during the oat cropping season) at BL was related to a limited availability of SWC data. The tendency of the models to underestimate SWC was even more pronounced at the second validation site Se (MnRMSE value of 109%), indicating that these values were either not well represented in the measurements used for model calibration and that relevant processes describing the SWC dynamics outside the calibration range could not be parameterized during calibration. Both, the Richards-(AC, DY, TH) and the non-Richards-based (HE) soil water models seem to comparably describe the SWC dynamics at the Se site.

Combined assessment of ecosystem productivity and fluxes
The overall simulation performance of the models was separated in a crop-related part that consisted of the arithmetic mean nRMSE values describing productivity (GY, AgBio, LAI) and in an environmental part describing soil water fluxes and soil water content (ET a , NetQ, SWC). The model performance for the agronomic variables at reference site Dd (nRMSE range: 13-121%) and the site BL (nRMSE range: 12-140%) was for most models in a relatively similar range of nRMSE values (Figure 9). For the validation site Se, nRMSE values were larger (range nRMSE: 16-530%), especially for individual crop models and the MM (Figure 9). Here, individual models such as MO for reference site Dd, DY and SU for the validation sites BL and Se, outperformed the MM for the agronomic variables. Again, the agronomic nRMSE was only calculated for GY and AgBio for the validation sites, as no LAI observations were available. The variables ET a , NetQ, and SWC had a similar model performance in terms of the mean environmental nRMSE at the reference site as for the validation sites. For the environmental nRMSE, the MM outperformed the results of the individual models at both the reference and the validation sites. However, considering both agronomic and environmental nRMSE values, the MM performed better than any single model, except for the validation site Se. Here, the model SU showed the overall best performance and outperformed the MM, as the simulation results of other models for the agronomic part were partially very poor (nRMSE > 150%).

Model calibration
The use of a larger and more encompassing dataset containing in-and end-season agronomic variables as well as environmental data for model calibration improved the performance of crop models in general for the same sites and years ( Figure 9). It seems that the calibration of phenology, which controls important variables like GY, AgBio, and LAI, does not contain enough information to predict all agronomic and environmental states and fluxes well, when comparing calibration to phenology only, as discussed by Groh, Diamantopoulos, et al. (2020). This was already shown for GY of wheat (Asseng et al., 2013), but not for a more comprehensive Here, it has to be noted that it was not intended in our study to compare the different embedded functions describing the effect of the environment on plant production and environmental fluxes, rather to compare amongst different models. A comprehensive comparison and evaluation of the different model embedded functions and ways to calibrate them harmonically among different crop models is still ongoing research and would be out of scope here. The improvement by calibration based on all available data was particularly visible for the end-season-related agronomic variables GY ( Figure 4) and AgBio (Supplemental Figure S1). This can be explained by the fact that mainly crop-related parameters were calibrated within this study. The GY of the winter wheat between both cropping periods differed substantially (Supplemental Table  S2), which was captured only by some of the crop models (i.e., CE, SU, and MO). A recent study using two long-term experiments  suggested that the choice of cultivar can, apart from other management decisions (e.g., crop growth regulator) (Hütsch & Schubert, 2018) or weather conditions, substantially influence GY of the crop winter wheat (Aula et al., 2020). As only a single set of plant parameters was estimated for the two different wheat cultivars grown in the two cropping periods, our study confirmed that the choice of cultivar as well as the use of crop growth regulators is important to reproduce GY of both winter wheat cultivars. The predicted GY of the oat for the reference site Dd in 2018 agreed well with observations ( Figure 4), which showed that the model parameterization obtained by calibration was able to cope with the effect of water stress on GY for these soils. This is also confirmed by the improvement of the average nRMSE for oat by 182% as compared with simulations where solely phenological stages were used for calibration (Groh, Diamantopoulos, et al., 2020). However, other important parameters affecting the seasonal LAI development (Figure 3, especially maximum LAI, date of flowering) were not captured well by most of the Expert-N family models (e.g., SP, SU, and GE), resulting in low model performance values for LAI. The model performance for the environmental states and fluxes improved only slightly using the full dataset for calibration as compared with calibrating on phenology only (c.f., Groh, Diamantopoulos, et al., 2020). Only a significant improvement was achieved for NetQ by the DY and DC model. The variability of the prediction from each individual model, expressed as scatter of the nRMSE, highlights that the non-Richards-based models tend to vary much more for the variables NetQ, SWC, and GY compared with the Richards-based models ( Table 2). Note that one can expect that the agronomic variables would be less dispersed, as essential plant parameters were adjusted by calibration. All of the environmental variables were less affected because the hydraulic parameters were fixed and only parameters for the root water uptake (e.g., AC) or the actual crop coefficient factor (K c ) for ET 0 (e.g., AC) were estimated. The larger variability for NetQ and SWC might be related to the different implementation of the given soil hydraulic parameters, as the Richards equation-based models used the soil hydraulic parameters (Mualem-van Genuchten parameters), whereas the non-Richards equation-based models used the water contents at saturation, field capacity and wilting point instead. Additionally, the type of lower boundary conditions affects the prediction of NetQ. This agrees well with previous investigations, which demonstrate that describing variably saturated flow with the Richards equation improves predictions of the SWC dynamics compared with a buckettype approach (Longo et al., 2021;MacBean et al., 2020;Soldevilla-Martinez et al., 2014).
Most of the models greatly underestimated ET a , probably due to the way root water uptake, partitioning of ET 0 into transpiration and evaporation, or soil hydraulic properties were parameterized, or due to the way the K c factors, which scale ET 0 to a crop-type-specific ET a , were set or constrained. We can contest that the spread among the crop models in K c values was large. Some models used a constant value around one over the entire period for all crops (e.g., DC, TH, and all Expert-N related models), whereby in other models individual K c values for each crop was set (AC, DY, MO, and HE,). As mentioned, the K c factor was calibrated in for the AC model to obtain a better match with ET a , whereby the fitted mean cropping period K c factor was 1.09, 1.53, and 1.48 for oat, rye, and wheat, respectively. Hereby, the fitted K c factors for rye and wheat are at the maximum end or even exceed physically consistent or literature values (c.f., Allen et al., 1998;Pereira et al., 2021).
The mean nRMSE values of the MM were lower than for any other single crop models. The specific averaged nRMSE for the soils at Dd (Figure 9) confirms that the MM better predicts the overall combined assessment of agronomic productivity, as already reported by Martre et al. (2015) and Wallach, Martre, et al. (2018), and for environmental water fluxes and states as reported by Groh, Diamantopoulos, et al. (2020). This was also shown for MM and the variable NetQ, which was previously not included in crop model intercomparison studies due to the lack of data. Many recent studies, which compared the model performance of the MM with single crop models, have found that MM predicts single (Wallach, Palosuo, Thorburn, Gourdain, et al., 2021) or multiple target variables (Bassu et al., 2014;Martre, Wallach, et al., 2015) successfully well. The reason for the superiority of MM lies in the fact that different models have different errors, which are consequently averaged over different models when forming an average (Wallach, Martre, et al., 2018).
One may also consider that the observations used in this study differed in terms of the scale from other crop model intercomparison studies that used frequently data from larger scale field experiments (Wallor et al., 2018). The discrepancy between the lysimeter scale (1 m 2 ) and the field scale (>100 m 2 ) may have affected results. However, any such effect would influence all the results of models in the same way and would thus not affect the model comparison. It should be noted here that the recent study by Groh, Stumpp, et al. (2018) confirmed that lysimeters can be representative of a larger area with respect to environmental variables (e.g., NetQ, SWC, and tracers), but similar studies for agronomic T A B L E 2 Average standard deviation of normalized RMSE (nRMSE) values for each variable of the model calibration for all, non-Richards-based (number of models = 3), and Richards-based models (number of models = 8) Note. GY, grain yield; AgBio, aboveground biomass; LAI, leaf area index; ET a , actual evapotranspiration; NetQ, net drainage water flux at the bottom boundary of the lysimeter at 1.5-m depth; SWC, soil water content.

Avg. standard deviation of nRMSE
variables obtained from high precision weighable lysimeter are still lacking.

Modeling agroecosystems
Holistic agroecological modeling requires consideration of both, agronomic aspects (e.g., GY, LAI, AgBio) and environmental water fluxes and states (e.g., ET a , NetQ, SWC) simultaneously. This investigation is one of the rare studies that include both parts for calibration and validation. A previous study has shown that soil properties, such as rooting zone thickness, varying texture, and soil hydraulic properties need to be adequately parametrized in crop models (Groh, Diamantopoulos, et al., 2020). This highlights that the soil part including vertical soil heterogeneity (horizonation) should be considered when simulating crop ecosystems. As data for the detailed calibration are often missing, the results from this study provide some estimates for the expected prediction error. However, calibrating only plant-related parameters in crop models does not necessarily improve the simulation of soil water flux-related key variables. The existence of compensating effect during parameter calibration in crop models (Sima et al., 2020) leading to nonuniqueness, as defined already for hydrological models in the early 1990s (Beven, 1993) is still a challenging aspect in the calibration approach. The problem that models are often right for the wrong reason (Kirchner, 2006), when only a single criterion is considered in model calibration (Houska et al., 2021), is added here as a challenge to calibration. When the focus is on increasing food production (Hamidov et al., 2018), soil processes are often compromised although soil functions have important feedbacks on ecosystem productivity, nutrient cycling, energy, and water (Deckmyn et al., 2020;Zwetsloot et al., 2020). This emphasizes the need to better calibrate the soil part in crop models beyond a limited consideration of classical agronomic variables such as only GY, AgBio, LAI, HI, or plant nutrient uptake and instead including soil related observations with high temporal and if feasible spatial (depth) resolution .

Model prediction ability
A comparison of the predictive performance of agronomic variables of Dd soils between the original climate in Dd and a D-W climate in BL showed that under D-W conditions at the BL site (nRMSE = 88%), similar predictive performance was achieved as for calibration site Dd (nRMSE = 67%, see red lines in Figure 9). In contrast, predictions diverged strongly from the observations for the wet and warm region (W-W) at the Se site (nRMSE = 163%). The observed GY decreased substantially under W-W conditions compared with the calibration site at Dd or validation site at BL (D-W climate). Unfortunately, the crop models used in this study, including the MM, were not able to capture the effect of climate and management (e.g., different application of N fertilizer and crop growth regulator) on GY. The observations showed a higher water use efficiency (WUE) and reproductive output (i.e., HI) of plants growing under conditions with lower plant-available soil water, confirming previous results for several soils, including those from Dd translocated to BL and Se (Groh, Vanderborght, et al., 2020); in this study, the lower WUE at the ecosystem level (see Equation 3) was hereby mainly driven by higher evaporation under W-W climatic conditions, as differences in ET a were large during times where evaporation rather than transpiration is dominating ET a . This might explain the partially large underestimation of ET a or E a by some of the crop models, because most crop models were not able to simulate E a well. The E a occurs in two different stages. In Stage 1, E a is limited by the atmospheric evaporative demand. With further drying of the soil, when the surface soil water content is depleted, a change in the E a rate occurs and this rate drops below the potential evaporation rate (Stage 2 evaporation). Here, specific model parameters, weather conditions, intercept evaporation, and evaporation during wet soil conditions (Stage 1) and soil hydraulic parameters during dry soil conditions (Stage 2) are crucial. Thus, a better agreement of evaporation during bare soil conditions and vegetation period can be expected if in situ data are used to determine the soil hydraulic properties (e.g., Schneider et al., 2021). On the other hand, the ET a underestimation might be in addition partially related to oxygen stress under wetter climate conditions (i.e., W-W with high SWC) by water logging, as for the HE model transpiration is reduced when the air filled pores are temporarily below 8% according to Supit, Hoojer, and Diepen (1994). Other factors such as differences between stomata conductivity, resistance in stem, and other rhizosphere processes might explain the underestimation of ET a .
In any case, the question remains why crop models were not able to capture the reduction in GY under W-W climate. The simulation error for GY is likely caused by the soil status (SWC and nutrients) and altered thresholds under the new climatic regime. The climatic conditions at the site where the lysimeter have been transferred to were outside the range of historical observation and thus not accounted for in the calibration. This discrepancy between environmental conditions for calibration and validation can affect the functional relationships (e.g., R and GY) of the ecosystem as shown by Knapp, Carroll, et al. (2018) for grassland ecosystems. One further possible improvement would be a more comparable calibration of the crop models (e.g., choice of parameters, software to optimize parameter, etc.), but this was beyond the scope of this study. Several crop modeling studies have already used different ranges of environmental conditions for model calibration and evaluation (Asseng et al., , 2019Biernath et al., 2011;Hussain et al., 2018) to test models' ability to predict target variables outside the range of historical observations. Wallach, Palosuo, Thorburn, Gourdain, et al. (2021) referred to these types of studies as "extrapolation" studies, which is a common procedure in modeling crop growth under, for example, future climate conditions (Seidel et al., 2018). Thus, changing environmental conditions and differences in stress situations such as increasing temperature and heat stress Pszczółkowska et al., 2020;Rezaei et al., 2018;Webber, White, et al., 2018), temporal water limitations (Sadras, Villalobos, et al., 2016), or excess (e.g. water-logging Mäkinen, Kaseva, et al., 2018) during a specific crop development stages might have affected the crop development and consequently reduce GY of the specific crop under W-W climatic conditions. This explains why the calibrated models in Dd work well under similar conditions in BL, where the thresholds and their effects on the results were calibrated indirectly. In addition to the transpiration effect caused by higher air temperature, heat stress can serve as a possible explanation for lower GY, when a critical temperature is exceeded (Eyshi Rezaei et al., 2015). Such critical temperatures can be in particular harmful during anthesis, because the anthesis period will be shortened, which will lead to earlier flowering (L.-X. Tao et al., 2008). Additionally, higher temperatures near anthesis will reduce the grain number, because it affects pollen fertility, increases grain abortion, and reduces grains per spike due to accelerated crop development (Eyshi Rezaei et al., 2015).
Observations on GY and air temperature confirm for soils, which were transferred from Dd to BL and Se that the effect of heat stress days during flowering increased (Figure 10a). The average number of days with heat stress increase from 5 d at Dd to over 10 d at BL and Se. The average duration of heat stress during this period was in general longer in Se (2.4 d) than in BL (1.8 d). Hence, additional model parameters, or even model processes accounting for heat stress are needed. Secondly, plant parameters might not have been correctly estimated during calibration in the environment without or with less heat stress to be used to predict GY under W-W, because limited data range might lead to a miscalibration of the response curve or parameter sensitivity changes under different climatic conditions or climate change (Melsen & Guse, 2021). The GY simulation results support this assumption, as for the site BL, which has a similar climate to Dd, the influence of heat stress (days with max air temperature > 27˚C [Semenov & Shewry, 2011] for a period of 32 d after flowering) on the yield could be reproduced well but was significantly underestimated under a W-W climate at the Se site (see Figure 10a). Gilardelli et al. (2018) showed for the WOFOST (WOrld FOod STudies) model, that under unfavorable climate conditions for crop growth, predictions of GY reacts sensitively to certain crop parameters, such as those related to respiration ratio or specific leaf area. Observed GY and number of heat stress days at flowering confirms a significant decrease in GY across all sites (Figure 10a). In addition, the reduction of GY showed a high correlation (r = .87; p < .001; Figure 10b) with decreasing number of ears per m 2 . The observed 1,000-kernel weight remained similar at all sites, implying that the decrease in GY can be associated with the number of grains. This suggest that the higher temperatures and heat stress around anthesis negatively affected GY especially at Se, due to decreasing pollen fertility or abortion and sterile grains (Chaturvedi et al., 2021;Farooq et al., 2012;Shenoda et al., 2021).
The challenges discussed above indicate that the parameters fitted outside the prediction environment or processes not accounted for in the calibration environment might play a crucial role for agricultural projections, especially in the context of climate change and on the broader feedback that soils have on climate. F. Tao et al. (2018) already summarized that model structure and model parameters are an important source of uncertainty when predicting effects of climate change on future ecosystem services, which implies long-term F I G U R E 1 0 Relationship between (a) mean grain yield (observed and simulated) and the number of days with heat stress at each site for the period of 32 d after flowering. A threshold of 27˚C maximum daily air temperature was used to define days with heat stress (Semenov & Shewry, 2011). (b) Relationship between the observed grain yield and ear density. Mean values are indicated by symbols and standard deviation by error bars. Sites include Dedelow (Dd), Bad Lauchstädt (BL), and Selhausen (Se). Tmax is maximum temperature projections with altered climatic conditions outside the historical range where the models have been calibrated on. In consequence and consistent to our hypothesis, crop models are able to predict the site-specific temporal and spatial variables when calibrated on longer historical time series only when those "new" climatic conditions are within the range of the historically observed climate.

Modeling crop-soil-ecosystem response to changing boundary conditions
The study challenged the problem of improving predictions of agroecosystem productivity and environmental fluxes and states under changing climatic conditions for the most likely condition that changes of environmental conditions are outside the historical range of observations. This includes not only the absolute values but also the temporal distribution of rainfall events or temperature dynamics during the season. Such information seems essential for both the calibration and validation steps of models, but also for predicting ecosystem functioning under changing global climatic conditions. In addition to the space-for-time substitution approach, one possibility to deal with a limited site-specific range of observations for model calibration could be to use data from manipulative experiments to broaden the training dataset for model calibration. For example, Forstner et al. (2021) showed that only the combined use of observational and manipulative climate change experiments was able to identify how altering individual and multiple drivers (i.e., reduced R, elevated CO 2 , and surface temperatures, increase in ET 0 ) affect ecosystem functioning of grassland ecosystems on changing climate conditions. Still, many challenges remain in addition to those discussed above, such as consideration of changes in soil properties due to climatic changes and the effects of these changes on model parameters or hints on missing model structures, depending on changing climatic conditions (e.g., Robinson et al., 2016Robinson et al., , 2019.

CONCLUSIONS
The present study compared the performance of commonly used crop models for both agronomic crop-related and soil ecosystem-related datasets from weighing lysimeters with intact soil monoliths in situ with those translocated to a warmer and drier and a warmer and wetter region. The aim was to predict the crop development and ecosystem fluxes for lysimeters with the same soil that were transferred in regions with different climatic conditions after calibration at the original site. As expected, the degree of model complexity (i.e., non-Richard-vs. Richards-based models) in describing vadose zone processes largely explained variations in the model performance between individual crop models. The relatively similar crop model predictions at the validation site BL indicated that the climatic conditions of the calibration site Dd were in a similar range. The larger deviations of the model predictions compared with lysimeter data at the Se site suggested that either the effects of climatic conditions outside of the range of Dd site-specific variability could not be described by the calibrated crop models or that the underlying process representations have not been included in the models. The observed decrease in grain yield, not captured by most models, could possibly be related to an increase in heat stress. Thus, the presently tested crop models were not able to predict the effect of such effects of changes in climatic conditions represented in the space-for-time substitution approach. The results confirm the hypothesis that crop models are unable to predict site-specific temporal and spatial variability of agronomic crop and soil ecosystem variables and fluxes when climatic conditions are outside the range of the data time series used for calibration. The implications of this finding are important for future applications of crop models with respect to predictions of agronomic perspectives and interrelations with ecosystem response to changing climatic conditions. The study clearly demonstrated that a key for improving the predictive capability of crop models in the soil-vegetation-atmosphere continuum is the quantification of soil related data for calibration and model testing. This would improve predictions of above-and belowground ecosystem processes in crop models and enhance future risk assessment related to ecosystem functions. The results suggest the need to develop improved concepts for model calibration procedures that incorporate experimental data sets from scenarios with assumed climatic conditions.

A C K N O W L E D G M E N T S
We acknowledge the support of TERENO and SoilCan, which were funded by the Helmholtz Association (HGF) and the Federal Ministry of Education and Research (BMBF). Kurt-Christian Kersebaum acknowledges funding by the Ministry of Education, Youth and Sports of Czech Republic through SustES-Adaption strategies for sustainable ecosystem services and food security under adverse environmental conditions (CZ.02.1.01/0.0/0.0/16_019/000797). Evelyn Wallor was funded by the German Federal Ministry of Education and Research (BMBF) through the BonaRes project "I4S" (031B0513I). Jannis Groh acknowledges the German Research Foundation (DFG; Project no. 460817082) for funding. We thank Jörg Haase, Dr. Gernot Verch, Ingrid Onasch, Gudrun Buddrus, Ralf Gründling, Sylvia Schmögner, Werner Küpper, Ferdinand Engels, Rainer Harms, Philipp Meulendick, and Leander Fürst for data collection and maintenance of the experimental setup at the corresponding research station.