Evaluating five forest models using multi-decadal inventory data from mountain forests

Forest ecosystem models, being widespread science tools and used for forest management decision support are usually evaluated individually against field data sets, while model intercomparison and joint evaluation studies are rare. We tested five forest models according to a harmonized protocol against data from nine forest compartments in the Snĕžnik region, in Slovenia. The suite of models included standand landscape-scale, empiricaland process-based models used across Europe. The test dataset originated from inventory data covering 50 years (tree measurements 1963, 1983 and 2013) and included annual harvesting records at tree level. Uncertainties in data and forest conditions were considered by defining 12 scenarios varying initial regeneration, browsing pressure and harvest modalities. We evaluated the models` ability to initialize forest conditions accurately, whether management interventions could be implemented based on harvest records, and how well basal area and diameter structure could be predicted. Simulation results for basal area development showed good to satisfactory performance for all models, at which SAMSARA2, SIBYLA and PICUS showed the best agreement. Comparison of simulated and observed diameter distributions showed good performance of ForClim, PICUS, SAMSARA2 and SIBYLA. Model output variability was between 6% and 24%, indicating the relevance to consider uncertainties that can be attributed to specific sources. There was no clear hierarchy between more empirical or more processbased models regarding accuracy of stand development projections. The cohort-based landscape model LandClim showed the lowest stand-level accuracy and scenario sensitivity, but results nevertheless qualified it for complementary application at landscape scale. Within individual-based models, spatially explicit models seemed to be more suitable for heterogeneous mixed mountain forests. The findings demonstrated the usefulness of inventory datasets for model testing and intercomparison.


Introduction
Forest simulation models are powerful tools for testing and evaluating the mid-to long-term implications of different management strategies on future forest development and related ecosystem service provisioning (Söderbergh and Ledermann 2003;Schelhaas et al., 2014). Changing environmental conditions and intensifying disturbance regimes (e.g., Seidl et al., 2011;Temperli et al., 2013) have increased the complexity in forest resource planning and management, and consequently the role of model-based decision support has drawn a lot of attention recently (Muys et al., 2011;Linkevičius et al., 2019).
One response to this growing challenge has been an increased reliance on forest simulation models, which are potentially climate sensitive and allow for alternative and novel forest management strategies, to be evaluated. Concomitantly, there has also been an increase in testing the benefits, limits and credibility of forest models (Courbaud et al., 2015;Foster et al., 2017). Model evaluation studies are required to build trust and confidence in model outputs and are thus a prerequisite for any model application in practical decision support. In view of growing interest in complex forest structures, multi-species mixtures, the provisioning of various ecosystem services beyond timber production and the need to consider the effects of a changing climate, the demands being placed on forest models have grown considerably over the recent years. Silvicultural regimes that have been proposed to adapt forests to climate change often focus on small-scale silvicultural measures and the creation of heterogeneous stand conditions to foster forest resilience (Christensen 1997;Puettmann et al., 2009). Adequately representing these forest structures, and the underlying ecosystem processes that generate them, requires individual-based, climate-sensitive forest modeling approaches that allow the simulation of complex silvicultural tree selection and cutting patterns (Söderbergh and Ledermann 2003;Grimm et al., 2005;Mina et al., 2017).
From the perspective of a potential model user, it can be difficult to decide what the appropriate model for a specific task and location should be based on its original theoretical concept. The reason is that many models, being scientific tools under permanent development rather than ready products, are continuously refined, extended or hybridized with other models (Larocque et al., 2011). Therefore, to identify strengths and weaknesses of different models, intercomparison studies are recommended , where multiple models are tested within a harmonized framework against independent observational data. While model comparison studies are frequently published for carbon and water flux models (e.g., Ryan et al., 1996;Hanson et al., 2004;Jin et al., 2016;Thurner et al., 2017) multi-model evaluation studies for forest ecosystem models are rare. For instance, Badeck et al. (2001) tested six gap models against observed structure and species composition of a virgin forest. Other studies comparing models originating from the same region have been published by Härkönen et al. (2010), Huber et al. (2013) and McCullagh et al. (2017). Recently, Bugmann et al. (2019) compared the behavior of mortality algorithms implemented in several forest models. However, studies comparing forest models that originate from different countries and different ecological and management contexts against long-term observational data sets from managed forests are rare (but see e.g., Mäkelä et al., 2000;Lindner et al., 2005).
Consistent data sets from managed multi-species forests that extend over several decades are rare, particularly when tree-level data are required. Usually, data from silvicultural experiments are utilized for this purpose (Mäkelä et al., 2000;Yaussy 2000;Lindner et al., 2005;Seidl et al., 2005). When long-term observational data are used for a model evaluation study, the issue of information quality arises (Gadow 2000). Historical data from decades ago may be subject to uncertainty with regard to accuracy of measurements, and there may be gaps with regard to tree species-specific information, and unknown calipering thresholds. Moreover, usually no information about tree positions, forest structure or spatial species mixture types is available. The timing of harvests as well as the composition of harvested volume (species, dead and alive trees) may also not be known exactly. Given the relevance of legacies for future forest development, erroneous initial forest conditions may propagate over time and increase uncertainty, which limits the power of model evaluation studies.
In the FP7 project ARANGE (http://www.arange-project.eu), several forest models, originally developed for different European forest types and representing different conceptual modeling approaches, were employed to explore management alternatives for mountain forests in major European mountain ranges . In addition, multi-decade forest inventory data from the Dinaric Mountains in Slovenia were available within the project consortium. This setting provided the opportunity to compare five established models in a model intercomparison study.
Specific questions of the study were: (1) How well can forest models be initialized with historical inventory data? (2) How well can forest models implement historical management schemes derived from harvest records? (3) How well do observed stand trajectories and model simulations match with regard to volume, basal area and diameter structure?

Study area
The observational time series data comes from an area near the Snĕžnik Mountain (1796 m a.s.l.), in the northern part of the Dinaric Mountains, Slovenia, Europe. The Snĕžnik area is a karst limestone plateau, transformed in the last glacial period. The soils are mainly chromic Cambisols and rendzic Leptosols. The climate in the northern Dinaric Mountains has Mediterranean influences, with warm summer temperatures (long-term mean from July to August is 18.3 • C at 800 m a. s.l., and 14.9 • C at 1300 m a.s.l.) and low winter temperatures (mean January temperatures − 0.6 • C to − 4.1 • C). Mean annual temperature ranges from 6.8 • C at 800 m to 3.1 • C at 1300 m, with annual precipitation between 1670 mm and 1930 mm, respectively. Mean summer precipitation (May to September) ranges from 650 to 740 mm. The upper timberline is located at approximately 1550 m.
Mountainous silver fir (Abies alba Mill.) -European beech (Fagus sylvatica L.) -Norway spruce (Picea abies Karst.) forests are the prevailing natural forest type, with frequent occurrence of sycamore maple (Acer pseudoplatanus L.) and wych elm (Ulmus glabra Huds.), while smallleaved lime (Tilia cordata Mill.), rowan (Sorbus aucuparia L.), common whitebeam (Sorbus aria (L.) Crantz), yew (Taxus baccata L.) and some other species can also be found sporadically. The first major regular utilization of these forests started in the second half of the 19th century when silver fir was promoted, while in line with the economic principles of that time, beech was weeded out and used for charcoal and potash production and wood distillation (Perko 2002). At the beginning of the 20th century, an uneven-aged single stem selection system (i.e., plenter system) was introduced (Schollmayer 1906). Due to a noticeable decrease in fir vitality and its insufficient regeneration and recruitment (Klopcic et al., 2010), a combination of single stem and small-scale irregular shelterwood system was introduced in the 1960s. Afterwards it was adapted to a more flexible, site and stand specific continuous cover system labeled "free style silviculture", combining elements of single stem selection, irregular shelterwood and shelterwood approaches, which has been applied since then (Mlinsek 1968;Boncina 2011).
Within the Sneznik area a set of nine compartments with a total area of 60.0 ha was chosen for the model evaluation study. The sites are located apart from each other on elevations between 800 m to 1300 m a. s.l..

Forest inventory dataset
Along with the introduction of uneven-aged forest management at the beginning of the 20th century, a permanent division of forests into compartments was established and since 1912 eight forest inventories have been conducted. Before 1973 inventories were implemented by fully callipering the compartments. In 1973 and 1983, inventories were executed as full callipering of a sample of compartments, while in 1993 permanent sample plots were established and used since then. The fullcallipering data were available from inventories in 1963 and a follow-up measurement in 1973 (compartment 17A) or 1983 (all other compartments). Due to low sampling densities in the 1993 and 2003 inventories, this data could not be utilized to calculate reliable values for individual compartments. Thus, for the current study, an angle-count sampling inventory (Bitterlich 1952) was conducted in 2013 on a 50 × 50 m grid to gather data compatible with the historical surveys in 1963, 1973 and 1983. Thus, compartment polygons did not change throughout the observation period starting in 1963. From all inventories, stem numbers per hectare were available per tree species in 5 cm DBH (diameter at breast height)-classes, starting at 10 cm DBH. Only live trees were recorded in the inventories. For details about the nine study compartments see Table SM1.
Depending on the site index, two sets of height functions relating tree height to DBH for individual species were assigned to the compartments to calculate initial tree height in 1963 (Table SM2).
Starting in 1963, a historical register of annual harvests per compartment was available, documenting the harvests in 5 cm DBHclasses. Table SM1 presents selected information about the harvesting activities in the forest compartments.

Climate data
The forest models require daily or monthly climate data to drive the simulations (see Table SM3). For each of the nine forest compartments a daily time series of climate data covering the period 1963-2013 was prepared based on the nearest grid cell (Lat. 45.625, Long. 14.375) of the E-OBS data set (van den Besselaar et al. 2011). The MT-CLIM routines (Running et al., 1987;Thornton and Running 1999) were used to adjust the E-OBS climate record for elevation, slope and aspect of the nine sites and to estimate incoming global radiation of the daylight period and vapor pressure deficit (see Thornton et al., 2000).

Forest models
The five models were the gap model ForClim (Bugmann 1996), the landscape model LandClim (Schumacher et al., 2004), the hybrid 3D patch model PICUS Seidl et al., 2005) and the spatially explicit empirical models SAMSARA2 (Courbaud et al., 2015) and SIBYLA (Fabrika 2005). The models are briefly introduced and their key features summarized in Table SM3. For detailed descriptions, we refer to the original sources.

ForClim
ForClim is a climate-sensitive forest succession (gap) model that has been developed to simulate forest dynamics over a wide range of environmental conditions (Bugmann 1994). The model simulates establishment, growth and mortality of individual trees on small independent patches, using a minimum of ecological assumptions to capture the influence of climate and ecological processes on forest dynamics (Bugmann 1996;Didion et al., 2009b). ForClim is structured into four sub-models: weather, water, plant, and management. The PLANT sub-model is the core of ForClim, where establishment and growth of tree cohorts (i.e., trees of the same species and age) are simulated based on light availability, soil nutrients, browsing intensity and bioclimatic indices calculated within the sub-models WEATHER and WATER. Tree mortality is modeled as a combination of constant "background" mortality and a stress-induced component. The MANAGEMENT sub-model enables the simulation of a wide range of silvicultural treatments such as clearcutting, shelterwood, thinning, planting, and others. In this study, we applied ForClim version 3.0 (Rasche et al., 2011), complemented by an empirical harvesting algorithm for simulating removals of an exact number of stems for every tree species by diameter class (single stem removal; see description in Mina et al. (2017)). ForClim is currently parameterized for 31 European tree species and has been tested for the representation of natural forest dynamics of temperate forests of the Northern Hemisphere (e.g., Didion et al., 2009a).

LandClim
LandClim is a process-based forest landscape model (Schumacher et al., 2004;Schumacher et al., 2006) designed to simulate forest dynamics and disturbances at large spatial scales (10 3 to 10 6 ha) over long periods of time (hundreds to thousands of years). In LandClim, landscapes are represented as a 25 × 25 m grid with specific topographic and climatic input variables for each cell. Within each cell, a simplified forest gap model (Bugmann 2001) simulates establishment, growth, competition and mortality of trees on an annual time step. Similar to ForClim, trees are simulated using a cohort approach (i.e., a computational simplification where one representative individual is simulated for all trees of the same species and age within a cell (Bugmann 1996)). Tree growth is simulated using a logistic growth equation, where species-specific maximum growth rate and size are reduced by light availability, degree-day sum and a drought index (Schumacher et al., 2004). Establishment and mortality are stochastic processes. Each year, the potential for tree establishment is determined as a function of environmental filters (i.e., available light at the forest floor, minimum winter temperature, growing degree-day sum, drought index, and browsing). Mortality probability is determined as a combination of stress, density-dependent and intrinsic mortality. LandClim can simulate management in 10-year intervals on defined management areas by selecting and removing a percentage of trees fulfilling specified DBH constraints. The model has been tested and adapted to the European Alps (Briner et al., 2013;Elkin et al., 2013;Temperli et al., 2013), North American Rocky Mountains (Schumacher et al., 2006;Schwörer et al., 2016), and Mediterranean forests (Henne et al., 2015). For this study, the individual compartments were simulated in LANDCLIM as independent entities without landscape level interactions among them to produce results comparable to the stand-level models.

PICUS
The forest model PICUS version 1.5 Seidl et al., 2005;Irauschek et al., 2017a), henceforth referred to as PICUS, is a hybrid of classical gap model components and process-based stand-level NPP algorithms (Landsberg and Waring 1997). The spatial core structure of PICUS is an array of 10 × 10 m patches with vertical crown cells of 5 m in height. Interactions between patches are considered via a three-dimensional light model and spatially explicit seed dispersal. Stand-level NPP is estimated with a model of light use efficiency (Landsberg and Waring 1997), which depends on intercepted radiation, temperature, precipitation, vapor pressure deficit, soil water and nutrient supply. Distribution of assimilates to individual trees is based on the relative competitive success of the individual trees. Tree mortality depends on age and stress conditions. Natural tree regeneration considers seed production and distribution, germination and establishment. Up to the height of 130 cm seedlings are simulated with a height class approach. Beyond that threshold, they are considered as individuals in the tree population (Irauschek et al., 2017b). PICUS includes a flexible management module enabling the implementation of silvicultural treatments at tree level depending on tree attributes and patch location. The model includes 17 parameterized tree species and has been validated (Seidl et al., 2005;Didion et al., 2009a) and applied in numerous studies all over Europe (Lexer et al., 2002;Maroschek et al., 2015;Pardos et al., 2015;Zlatanov et al., 2017).

SAMSARA2
SAMSARA2 is an individual-based and spatially explicit model to simulate regeneration, growth and mortality of individual trees in mixed and uneven-aged mountain forest stands (Courbaud et al., 2015). The model builds on the theory that light interception by tree crowns is a key driver in uneven-aged stands, because they present a strong vertical heterogeneity favoring asymmetric competition between trees and between the canopy and seedlings. In SAMSARA2, competition for light within a stand is calculated based on light ray interception by tree crowns. SAMSARA2 has been calibrated empirically for silver fir and Norway spruce stands within the montane elevation belt of the Alps in France. Ecological factors other than light, such as climate and site conditions are not directly taken into account. Fertility of the site is taken into account indirectly through the value of the different demographic parameters. According to Courbaud et al. (2015) the model should be recalibrated if applied under different site conditions. Annual diameter increment of individual trees depends on their size and the amount of light intercepted by their crown during the growing season. Natural mortality depends on tree diameter and a competition index defined as the basal area of larger trees within a radius of 15 m. Seeds are produced by adult trees and germination, growth and survival of seedlings depend on the light reaching the ground calculated in the center of 25 m 2 cells. Trees participate in light interception above DBH of 7.5 cm. Specific management algorithms allow the simulation of detailed silvicultural strategies, varying both the characteristics of harvested trees and their spatial arrangement within a stand (Lafond et al., 2012;Lafond et al., 2014).

SIBYLA
The SIBYLA model (Fabrika 2005) is based on the SILVA simulator (Pretzsch et al., 2002). It is an empirical, distance-dependent ecological niche-based model that simulates the growth of individual trees. The expected height increment is estimated from the potential height increment of the tree and a multiplier, which characterizes the effects of competition, soil and climatic conditions (see Pretzsch and Kahn 1998). SIBYLA was parameterized and validated using forest inventory data from Germany, Switzerland and Slovakia. The model is parameterized for five main European forest tree species -Norway spruce, silver fir, Scots pine, European beech and oak (Quercus sp.); other species can be simulated on the basis of their ecological and morphological similarity with the aforementioned species and using calibration functions. The model consists of sub-models for mortality, competition, growth, regeneration and thinning and a stand structure generator. Growth responses to environmental drivers (growing degree-days, annual temperature amplitude ( • C), mean air temperature ( • C) and precipitation in the growing season (mm), De Martonne (1925) index of aridity, soil moisture and site nutrient status) were formalized according to Kahn (1994). The model can simulate several cutting and thinning techniques typically applied in Central Europe (Fabrika and Ď urský 2005).

Simulation scenarios 2.5.1. Small tree initialization
For trees smaller than 10 cm DBH no information was available. However, most forest models consider trees of smaller sizes (in LandClim from DBH 5 cm, ForClim and SIBYLA from height 1.3 m and in PICUS from 10 cm seedling height). To allow a harmonized initialization of those models, two scenarios were defined for small tree initialization: (i) no small trees (A1, Table 1); (ii) small trees initialized using the threshold of 130 cm seedling height (A2). For the latter, the regeneration sub-model of SIBYLA (Fabrika 2005) was employed to estimate the initial number of trees in 1963 for diameter classes not covered by the historical inventory data (DBH 0-5 cm and 5-10 cm; see details of the process in the Supplementary material) for use in all models.

Browsing by ungulates
Over the observation period , there was a considerable influence on regeneration by ungulate browsing (Klopcic et al., 2010). Because consistent browsing data was not available for each measurement period, browsing intensities per species over the entire simulation period were estimated based on detailed browsing inventories carried out between 1992 and 2000, expert knowledge and historical deer census data. To consider the uncertainties in browsing parameters, a total of three browsing scenarios were defined (B1, B2 and B3; see Table 1). Further details are given in Table SM4.

Harvesting
The historical register of harvests separately accounted for sanitary and regular logging. This means that harvests included, at least partly, dead or "near dead" trees (i.e. sanitary fellings). According to internal reports from the Snĕžnik region, there is the tendency that in compartments with longer extraction distances less sanitary harvests are carried out due to economic reasons. This poses several problems for the specification of harvested trees in the simulations. To consider the potential effects of different modes of selecting trees for extraction, two harvesting scenarios were implemented: (i) only living trees harvested (C1); (ii) in a sanitary management scenario trees that had died in the simulations in the two years preceding a planned harvest were preferably selected for harvest (C2).

Analysis approach
We compared observed basal area in 1963 with the initial state of compartments in the model simulations (i.e., initial state), 1983 (except for compartment 17A, that was first remeasured in 1973; for simplicity we refer to the first remeasurement date in the results as "1983 ′′ only) and 2013 with simulated basal area using the root-mean-square error (RMSE) and Nash-Sutcliffe Efficiency (NSE).

RMSE
The root-mean-square error represents the square root of the quadratic mean of the differences between predicted (simulated) and observed basal areas for the 9 forest compartments (i) (Eq. (1)). The RMSE is always non-negative, where a value of 0 indicates a perfect fit to the data. RMSE can be related to the standard deviation of the measured values to provide a standardized index where values less than half the standard deviation may be considered low (Singh et al., 2004).
As an additional bias metric, we show the mean error (ME) (Eq. (2)).
The Nash-Sutcliffe Efficiency is a normalized statistic that determines the relative magnitude of the residual variance compared to the variance in the measured data (Nash and Sutcliffe 1970) (Eq. (3)). NSE takes on values from -∞ to 1. Values close to 1 correspond to a perfect match of data and simulations. NSE of 0 indicates that the model predictions are as accur99ate as the mean of the observed data and values below 0 indicate that the observed mean is a better predictor than the model. The denominator in Eq. (3) determines to some extent the calculated NSE values. Because in the current study forest compartments were managed with an uneven-aged continuous cover regime, the variation of the observed basal area values was small and consequently to achieve high NSE values difficult. The R-package "hydroGOF" (Zambrano-Bigiarini 2017) was applied to calculate NSE.
To compare observed and simulated DBH distributions, we used the Diameter Distribution Error (DDE), described as "total variation distance index" in Levin et al. (2009) (Eq. (4)), where j are 14 diameter classes of 5 cm width (starting at 10 cm), n is the stem number per ha per diameter class, and N is the total sum of trees per ha (see also Saad et al., 2015). The sum of the differences of the relative frequency in the diameter classes is multiplied by ½ to scale the error between 0 and 100%, with optimum DDE at 0%. DDE was chosen because it is a distribution-free measure that calculates the diameter distribution error independent of the total number of trees. The diameter classes were the same as those in the inventory records.
For each compartment we calculated tree volume for the three measurement years, for all trees that had died during the simulations as well as for the harvested trees in a standardized approach by using local height curves (Table SM2) and individual tree stem volume functions (Pollanschütz 1974) to compare periodic volume growth and mortality, while avoiding possible biases due to biomass and volume estimation methods used by individual forest models only (Thurnher et al., 2013).
The total periodic production of timber (Total Growth) is calculated with (Eq. (5)). For a specific period (p), timber stock (T Stock) corresponds to the standing tree volume, Harvest is the standing tree volume of trees harvested in period (p) and Mortality is the standing tree volume of trees that had died during period (p).

Simulation protocol
The modeling groups jointly defined the scenario settings (cf. Table 1) and received the same data consisting of (i) tree data to initialize simulation stands representing the forest compartments in 1963, (ii) site and soil attributes, (iii) daily weather data for the period 1963-2013, and (iv) historical harvest data for the period 1963-2013. Each group performed the simulation runs (see Table 1) and delivered the model output data in a harmonized format to the lead author, who aggregated and analyzed the data. Tables SM1 and SM2 show site attributes available to the modeling groups and aggregated information on tree and harvesting data.

Forest model initialization performance
Total basal area of trees with DBH above 10 cm in 1963, as initialized in PICUS, SAMSARA2 and SIBYLA, matched the observations very well (Fig. 1a). RMSE was 0.3 m 2 ha − 1 for PICUS, 0.2 m 2 ha − 1 for SAMSARA2, and 0.7 m 2 ha − 1 for SIBYLA (Table SM8). Initial basal area in LandClim showed the highest absolute deviations (RMSE 6.9 m 2 ha − 1 ), with lower basal area in all compartments compared with the inventory in 1963.
A comparison of initial DBH distributions in the models and the inventory in 1963 revealed a similar picture (Fig. 1b). The diameter distribution errors (DDE) for SAMSARA2 were below 1% and for PICUS and ForClim below 2%, with one department as exception for ForClim with DDE over 4%. DDE values for SIBYLA were between 1.5% and 3.5% indicating good agreement with the 1963 inventory. LandClim had moderate DDE values from 4 to 9%. Inspection of LandClim diameter distributions revealed that most of the mismatch in diameter structure occurred in the high DBH classes.

Simulation of compartment development
Starting with the initial compartment characteristics (see 3.1 above), the models simulated forest development according to the scenarios in Table 1. In Fig. 2 the simulated basal area development for each model is shown per compartment and compared to the inventory records. Most models slightly underestimated mean compartment basal area after 50 simulation years in 2013: ForClim at 27.0 m 3 ha − 1 , SAMSARA2 26.3 m 3 ha − 1 , PICUS 26.1 m 3 ha − 1 and SIBYLA at 25.5 m 3 ha − 1 versus the mean inventory value of 28.4 m 3 ha − 1 . On average LandClim slightly overestimated basal area at the end of the simulation period in 2013 (29.3 m 3 ha − 1 ). The mean error (ME) over all scenarios and compartments for the models was − 3.0 m 3 ha − 1 (ForClim), − 3.2 m 3 ha − 1 (PICUS), − 1.6 m 3 ha − 1 (SAMSARA2), − 2.5 m 3 ha − 1 (SIBYLA) and +3.2 m 3 ha − 1 for LandClim. More details on simulated basal area are shown in Fig. 3.
Overall, SAMSARA2, SIBYLA and PICUS showed good model performance (NSE > 0.5; mean RMSE < 5 m 2 ha − 1 ) with PICUS being the only model performing well in both observation periods. LandClim had negative NSE values in both periods, meaning that the mean of the observations was a better predictor than the model. ForClim had the worst of all observed NSE values in 2013 (ranging from -1.5 to − 1.8), but NSE was highly influenced by poor performance in compartment 17A. Interestingly, this was the only compartment dominated by beech, while in the other compartments silver fir was dominating. Also other models (SIBYLA and SAMSARA2) showed the largest deviations from the observed values in compartment 17A for the observations in 2013.
It is interesting to look at the sensitivity of the models to the scenario assumptions. Overall, the variation in model output, as triggered by scenario A (initialization) and B (browsing), was smallest for LandClim (2% of the mean result) and largest for SAMSARA2 (8%); relative range in output for ForClim, PICUS and SYBILA was approx. 4 to 6% (Fig. 4). Adding scenarios C (harvesting mode) increased the variation to 24% (SAMSARA2), 11% (PICUS) and 6% (SYBILA) of the mean output (see also Figure SM1 and Table SM8).
Comparing simulated and observed diameter distributions at different time points is a rigorous test of the ability of the models to project forest structure over the 50-year period. Both in 1983 and 2013, LandClim showed the largest mean DDE value (23.6%, 69.8%) and ForClim the lowest (13.7%, 29.8%), with SYBILA, SAMSARA2 and PICUS having similar DDE values as ForClim. For all models DDE increased from 1983 to 2013 (Fig. 5).
What was the scenario that produced the best outcome with each of the five models? To answer this question the mean NSE and RMSE related to basal area over the entire simulation period were calculated for each simulation scenario and for each model the scenario with the maximum NSE was selected as "best" scenario" (Table SM6). In a similar procedure, mean DDE statistics were calculated for the entire simulation period and the scenario with the lowest DDE per model was selected (Table SM6). Regarding the simulated diameter distributions, scenario A2, which included small trees in the initialization, resulted in more accurate simulated diameter distributions for all models. The scenario assumptions that produced the best overall match with regard to basal area differed among the five models. While for ForClim and LandClim "no browsing" (B1) combined with "sanitary management" (C2) produced better results, for PICUS, SAMSARA2 and SIBYLA the browsing pressure as observed in 2013 (B2) in combination with sanitary management (C2) resulted in more accurate results. For all models the two error statistics NSE and RMSE consistently ranked the same scenario at the top.

Realization of harvests
According to the harvest records, the compartments were treated regularly in an interval of 10 to 20 years. Moreover, in some compartments sanitation logging occurred almost every year (e.g., several operations in compartment 40C between 1985 and 1990 or in compartment 2C in the years 1980 to 1995) (see Fig. 2). When comparing the diameter structure of simulated harvests with the records, all models except LandClim performed well (not shown) with SYBILA showing the best match of simulated and observed diameter distribution of harvested trees. For all models, deviations occurred mainly in the smaller diameter classes below 20 cm DBH.

Total volume production
Utilizing the harvest records, the periodic harvest volumes (Harvest)   can be calculated for the historical period, while tree Mortality which has not been extracted and thus is not included in Harvest is not known. Consequently, no observed Total Growth could be calculated for the nine study compartments. However, for comparison we could retrieve the local total mean annual volume growth from forest management plans (about 9 m 3 ha − 1 yr − 1 ). This general estimate can be compared with Total Growth according to Eq. (4) for each model simulation. From Fig. 6 it is evident that there are differences among the models. Compared with the local estimate, ForClim, SAMSARA2 and SIBYLA simulated somewhat lower Total Growth (5.6, 7.0, 7.1 m 3 ha − 1 yr − 1 ), LandClim is clearly overestimating (15.7 m 3 ha − 1 yr − 1 ), and PICUS is closest to the reference value of 9 m 3 ha − 1 yr − 1 (8.4 m 3 ha -1 yr − 1 ). SAMSARA2 and SIBYLA reproduced the recorded harvest volumes accurately. PICUS and For-Clim show only minor underestimates, whereas LandClim clearly overestimated Harvests. Large differences between the models are visible regarding Mortality. Here, ForClim, SAMSARA2 and SIBYLA clearly simulated much lower tree mortality (ForClim 0.9 m 3 ha − 1 yr − 1 , SAMSARA2 1.3 m 3 ha − 1 yr − 1 and SIBLYA 2.2 m 3 ha − 1 yr − 1 ) than Land-Clim (5.1 m 3 ha − 1 yr − 1 ) and PICUS (3.9 m 3 ha − 1 yr − 1 ). The effect of the sanitary management scheme, causing a shift of volume from Mortality to Harvest, is small. Details for both inventory periods are shown in Table  SM7.

Assessment approach
One major challenge for increasing the acceptance of simulation tools for decision support is the objective validation of models, to build trust in the validity of model output (Muys et al., 2011). Multi-model evaluation exercises offer a great opportunity to compare detailed model outputs and point out differences, strengths and deficiencies. Recently studies have started to quantify parametric uncertainty (Hartig et al., 2012), and some results exist on the contributions of model sub-processes to overall parametric uncertainty (Augustynczik et al., 2017), but the quantification of structural uncertainty is less advanced.
A prerequisite for multi-model evaluation studies is that the participating model groups are provided with identical site, stand and climate information and follow a harmonized protocol that avoids specific fitting of model settings to local conditions. In a comparative analysis of 15 forest models with regard to model sensitivity to different tree mortality algorithms, Bugmann et al. (2019) refrained from having all participating models run under identical conditions. However, this approach adds another source of uncertainty and makes it even harder to track those model sub-processes that contribute most to the variation in model output and to respond to the question of why a specific model does better or worse in a specific situation.
The historical long-term compartment-based dataset, in combination with management records as used in this study, offered great potential to test models in species-rich forest ecosystems in realistic management contexts. Inventories carried out by forest enterprises usually do not fulfill the criteria for long term model evaluation exercises. Here the focus usually lies on cost efficiency and less on continuous long-term datasets. Methodologies usually switched from full calipering of trees over wide forest areas to statistically more advanced and less cost intensive sampling methods starting from the 1950s onwards. As a result, continuous datasets consisting of tree data with sufficient   sampling densities over a long time period are very scarce. The tree and harvest data from the presented case study cover a remarkably long period of 50 years. They represent entire compartments and are thus better representatives of real forests compared to small plots of largescale forest inventories (see Huber et al., 2013). However, to fully understand forest development in the nine compartments, information on tree mortality, stand structural information and tree regeneration would be essential. In an attempt to frame sources of uncertainty in the evaluation data, 12 scenarios were defined to specify explicit assumptions on the initial state of regeneration, browsing pressure by ungulates over the monitoring period and the inclusion of tree mortality in harvesting decisions (Table 1). What could be considered a weakness of the evaluation data was turned into a strength of the assessment approach because it added rigor to the comparison of model output and observations. It also facilitated analysis of the sensitivity of the models to assumptions that have to be made quite often, when data and model input from decades ago are used.

Model initialization
A first crucial step in practical model applications is to initialize stand structures from available information. To our knowledge, the ability of models to create realistic initial tree populations has not been considered before. All models involved in this study include special routines for initialization to avoid stand structures that are not compatible with internal processes dealing with competition for resources and space and may therefore result in undesired behavior of simulated tree populations, especially in the first simulation years (see Supplementary material). Generally, these routines are most important for complex forest stands with high tree density, vertical tree layers, diverse species mixtures, many size classes and patchy spatial structure. In the model SIBYLA, for example, the models internal consistency of the virtual tree population is evaluated by estimating the suitability of the tree positions according to the nearest and second nearest neighbor trees using a probabilistic approach (Pretzsch 1993). For rejected trees, new coordinates are generated until microstructural requirements are fulfilled. The PICUS model operates in a similar way within its 10 × 10 m resolution to avoid the initialization of non-viable tree neighborhoods. The gap model ForClim uses an algorithm to optimize the leaf area of virtual trees to avoid excessive shading. SAMSARA2 avoids initialization problems by using a uniform distribution for estimating coordinates, which may visually result in very regular stand structures. In LandClim, compartments are initialized spatially explicit at the level of the simulation cells (25 by 25 m). The effort that has been put into generating realistic initial forest states shows that, although so far not much attention has been paid to this issue, (i) creating realistic tree population structures may have crucial implications for simulated short-to midterm population behavior, and (ii) indicates the relevance of structural information and tree coordinates (see Table SM3). For instance, PICUS, SAMSARA2 and SIBYLA could have directly utilized tree coordinates for initializing forest stands. All models except LandClim were able to generate realistic forests regarding basal area density and diameter structure (compare Fig. 1 and Table SM8 for an overview). The landscape model LandClim, which had not been developed to simulate fine-grained tree populations at population level, initialized forests with substantially lower basal area compared to observations and also did not match well the initial diameter structure of the validation data set.

Forest simulation results
In simulating managed mixed mountain forests over five decades, beyond growth, regeneration and mortality of trees, also harvests and the response of tree populations to these human disturbances need to be mimicked realistically. The results showed that spatially explicit individual tree-based models performed better in simulating basal area development than the non-spatial gap model ForClim, and, not surprisingly, the landscape model LandClim. Overall, the best performing models in our study yielded RMSE values for basal area over a simulation period of 50 years, which are about 10 to 12% of the compartment basal area. This simulation result is comparable to the range of basal area mean error values from large-scale forest inventories (e.g., Naesset 2007). The mean error (ME) values over all scenarios indicated that all models on average underestimated the observed basal area values by about 10% of the mean observed values, except the LandClim model which yielded a slightly larger positive bias.
An even greater challenge for forest models is to capture the development of the size structure of the tree population. With the exception of LandClim, all models performed reasonably well. Furthermore, the study did not confirm two general beliefs: First, and interestingly, empirical models (SYBILA, SAMSARA2) did not perform significantly better regarding the accurate prediction of DBH structure than the process-based models ForClim and PICUS (compare Guisan and Zimmermann 2000;Fontes et al., 2011). A similar finding for the PICUS model had already been reported in Seidl et al. (2005). Second, none of the models had been developed for Dinaric mountain forests. As the model evaluation protocol avoided the calibration of the tested models to local tree growth data, the expectations towards accurate model results in simulating 50 years of tree growth were low. However, the stand-level models performed reasonable (compare Fig. 7). The spatially explicit models PICUS, SAMSARA2 and SYBILA performed overall better than the non-spatial gap model ForClim and showed very concordant results. On the other hand, ForClim showed the best performance regarding the diameter distributions.
The current study clearly showed that a model should perform well in a series of aspects beyond tree growth that drive forest development. Over the 50-year simulation period, the interacting processes tree growth, regeneration, mortality and tree selection and harvesting determine the standing stock as well as the DBH distribution at a specific point in time. Hülsmann et al. (2018) point out that mortality subroutines in forest models are typically rather fundamental and lack empirical basis. In a recent study Bugmann et al. (2019) concluded that the sensitivity of forest models to changes in mortality algorithms is Fig. 7. Relative model performance scaled to the respective maximum and minimum values. Maximum BA accuracy and DDE accuracy in origin (0/0). BA accuracy = sum of relative RMSE for initialization and Nash-Sutcliffe Efficiency (NSE) for simulation results (pooled 1983 and 2013). DDE accuracy = sum of relative diameter distribution error (DDE) for initialization and simulation results (mean 1983 and 2013). The circle diameters correspond to the relative variation resulting from scenario assumptions (results 2013, scenarios A (Initialization of small trees) and B (Browsing by ungulates). Data in Table SM8. larger than the sensitivity to climate change signals. Unfortunately, the testing data used for our study did not include tree mortality. The extent to which harvests included a share of natural tree mortality was also not known. However, the scenarios C1 (harvests include only living trees) and C2 (sanitary management, harvests also include dead trees) allowed to estimate the effect of the underlying assumptions. Our results showed that a seemingly simple process such as harvesting trees might produce quite different results among the models.
All models except LandClim could handle harvest prescriptions on an annual basis. However, harvest algorithms within the models differ in the way they handle tree removals in specific DBH-classes. Some algorithms search for specific trees (size and species) in the simulated forest. As differences between observed and simulated DBH-structures increase with increasing simulation length, more flexible algorithms (e.g., trees may be taken from neighboring DBH-classes) may be beneficial, especially for very detailed harvest operations such as single-tree or sanitary management records (Mina et al., 2017). Our results indicated that harvest algorithms incorporating multiple criteria, such as basal area, tree diameter and diameter distributions, can improve overall modeling accuracy (Lafond et al., 2012). The importance of consistently considering natural tree mortality in harvest algorithms will even increase with increasing relevance of disturbances in future climates (Seidl and Rammer 2017).
Framing the uncertainty in the evaluation data (initialization of small trees, browsing by ungulates, tree selection for harvesting) by means of 12 scenarios proved to be an adequate approach to (i) consider data uncertainty, and (ii) test the sensitivity of the models towards the scenario assumptions (see Figure SM1 and Tables SM5 and SM6). It was interesting to note that the browsing scenarios (B1, B2, B3) had a strong influence on the predictions, and were predominantly affecting diameter structures, whereas basal area was less affected. SAMSARA2 seemed to be particularly sensitive to the harvesting assumptions. What is a useful threshold for model sensitivity? This question cannot be answered easily. As a general rule, the sensitivity of a model to a driving gradient must allow to reproduce observable data along the gradient (e.g., Lexer and Hönninger 2004). Interestingly, the average variation, caused by the full set of scenarios (as simulated by SAMSARA2, PICUS and SIBYLA), is approximately in the same order of magnitude as the error in basal area or volume estimates of large-scale forest inventories.
LandClim, the only landscape model included in the comparison set, delivered satisfactory estimates of basal area development per compartment, though it operates on coarser resolutions regarding initialization, management and growth simulation, and was designed to simulate landscape level disturbances and processes that were not the focus of this study. The advantage of landscape models is that they can be used to complement stand-level models to explore forest development at multiple spatial scales from stand to landscape scale, which is gaining importance with growing confidence that disturbance regimes will intensify under a warming climate. Interacting disturbance agents operate at multiple spatial scales, which is beyond the reach of standlevel models (e.g., Elkin et al., 2013;Hlásny et al., 2019).

Conclusions
Several conclusions can be drawn from this study: First, a strong effort should be made by research institutes and forest owners, to access and store historical inventory datasets and harmonize contemporary inventories with historical data to be able to capitalize on the benefits of a long consistent time series. In particular, detailed harvest records are crucial to understand time series data of stocks, which are the usual monitoring focus. Knowing harvested trees by tree species and by status (live, dead) would be a huge leap forward for further development and testing of forest ecosystem models.
Second, there is no clear hierarchy between more empirical or more process-based models regarding the accuracy of stand development projections.
Third, individual-based models are more precise for stand-level predictions than the tested landscape model LandClim, but the latter performed reasonably well at the stand scale, qualifying it for a complementary application at the landscape scale. This serves as a great example for a finding from a model intercomparison study, where the ultimate goal is not that every contestant shows excellent performance regarding standard outputs. More important is a critical comparison between models to differentiate why certain models are better under specific conditions observed in the case study for specific outputs. Following a detailed description of underlying model assumptions and objective evaluation of outputs, recommendations can then be given regarding the specific applicability of models and their limitations. However, model performance demonstrated in a specific ecological and management setting should not be generalized carelessly to diverging conditions. Repeated successful model evaluation experiments in different settings will build trust in model performance. Fourth, in model evaluation studies a comprehensive set of model output variables should be assessed. There is a fast-growing demand by decision makers for reliable prediction of stand structural features beyond volume or basal area, because for forest management planning ecosystem service provisioning is based on indicators on species composition and structure of the tree population (e.g., Maroschek et al., 2015). The current study has shown that there are forest models available that qualify for forest management decision support.
Fifth, as a very relevant mission joint model intercomparison and evaluation studies compare currently available ecosystem models by challenging them with applications in novel case study settings and foster scientific collaboration among the modeling groups. This prepares the ground for multi-model applications in decision support, a prerequisite to quantify and explain uncertainty from model assumptions.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.