A standard protocol for describing the evaluation of ecological models

Numerical models of ecological systems are increasingly used to address complex environmental and resource management questions. One challenge for scientists, managers


Introduction
Scientists, managers, and stakeholders increasingly rely on numerical models of ecological systems. One challenge is to appraise the efficiency of these models to tackle complex environmental questions. Providing clear evaluations of model performance is one way to address this challenge. Models can be constructed, analysed, and used by different actors, from scientists to policymakers, and these actors have different understandings and expectations of models. Assessing how good a model is at addressing specific problems is difficult when ecological modellers use a variety of model types, have different modelling cultures and practices, and use different vocabularies. This can hinder communication, transparency, reproducibility, and the general development of good practices within the modelling community. It is therefore essential to provide tools to support a collective understanding of what can be expected from a model and how a model is to be evaluated (Cartwright et al., 2016; Eker et al., 2018; Heymans et al., 2020).
Transparency and reproducibility are at the core of the scientific method. However, the complexity of the tools used to observe and model ecological systems challenges reproducibility and transparency (Powers and Hampton 2019). The ongoing so-called reproducibility or replicability crisis reflects this difficulty. The crisis has primarily been identified in the fields of psychology (Pashler and Wagenmakers 2012), clinical studies (Begley and Ioannidis 2015) and economics (Camerer et al., 2016) and is much less discussed in ecological research (but see Ives 2018; Nichols et al., 2019, 2021). It may not be possible to strictly replicate ecological observations, but transparency in workflow and data analyses can facilitate reproducibility. It should be possible to reproduce ecological model simulations, given that the relevant information is provided for that purpose. In addition to replicating a model and the associated simulations, it is equally important to be able to understand, assess and replicate how model performance was evaluated. This step is critical, given that almost every newly published method claims to outperform existing ones, yet such methods are seldom re-evaluated (Boulesteix et al., 2020). Providing relevant and comprehensive information is a first step towards replicability, which needs to be complemented by appropriate communication and quality standards. How is the information communicated? Is it accessible? Is it unambiguous? Is it sufficient? A standardised protocol for reporting model evaluation procedures would address these questions and contribute to increased transparency and reproducibility of ecological models.
There have been considerable collective efforts in recent decades to develop standardised modelling practices, from model building to evaluation of model performance. A major advancement has been the development of standardised protocols such as the ODD (Overview, Design concepts, and Details, Grimm et al., 2006). The ODD protocol was originally developed to respond to the lack of a standard protocol for describing individual-based models (IBMs). The protocol has been reviewed and updated twice since its original publication (Grimm et al., 2010, 2020b) and it is now commonly used by ecological modellers, beyond the original IBM community, to describe their models in reports and publications. The ODD protocol has been inspirational to groups of modellers with diverse focus, such as on model optimisation (ODDO, Mahévas 2019), data-mapping (ODD+2D, Laatabi et al., 2018), and inclusion of human decisions (ODD+D, Müller et al., 2013). In each case these groups have borrowed from the original ODD protocol idea and extended it for their specific purpose, thereby contributing to the harmonisation and communication of modelling practices.
A major step in the development and application of ecological models is the evaluation phase. There exists a large body of literature on how to perform model evaluation for various classes of models (e.g. Stow et al., 2009; Allen and Somerfield 2009; Bennett et al., 2013; Conn et al., 2018; Hipsey et al., 2020), but much less work has been done to standardise the reporting of model evaluations. The TRACE (TRAnsparent and Comprehensive Ecological modelling) documentation (Grimm et al., 2014) is a notable exception which provides a framework for documenting the modelling process, including several aspects of model evaluation. Standardised protocols for reporting model evaluation can constitute useful tools for modellers and end-users to easily understand and compare evaluation procedures and appreciate the performance of models in relation to specific objectives. Making such tools available is therefore anticipated to benefit the scientific community and model end-users.
The issue of model validation and evaluation in environmental science has been the subject of extensive research and debate. Oreskes (1998) argued that quantitative models cannot be validated but only evaluated. In Oreskes' view, evaluation is described as "an assessment in which both positive and negative results are possible, and where the grounds on which a model is declared good enough are clearly articulated". This assessment implies an examination of model outputs against pre-specified performance criteria. In the literature, the term model validation has remained pervasive (Eker et al., 2019) although often overlapping with the concept of evaluation as originally presented by Oreskes. In their 10-step procedure for developing and evaluating environmental models, Jakeman et al. (2006) introduced a stepwise approach in which every stage is open to critical review and revision, in consort with end-users. The evaluation step is left to the end and is concerned with the model being fit for purpose, although the criteria for achieving this goal are not fully developed by these authors. More recently, Parker (2020) explored the meaning of a model being adequate for purpose for different classes of models, whether pedagogical, explanatory or predictive. In the works of Jakeman and Parker, model evaluation is primarily achieved by measuring the performance of a model against pre-specified objectives, thereby following the original argument of Oreskes. This excludes the idea of a general validity of a model and favours the principle of an evaluation of a model for a specific objective (or a set of objectives). This mirrors George Box's famous statement that "all models are wrong; the practical question is how wrong do they have to be to not be useful" (Box and Draper 1987), where useful implies use and therefore purpose. This is also in line with Augusiak's review of the literature on model evaluation and validation, which concludes that despite little agreement on terms and underlying notions in the literature, it has repeatedly been pointed out that the evaluation of a model should depend on its purpose (Augusiak et al., 2014).
Evaluating that an ecological model is fit for purpose implies that the same model can (and should) be evaluated each time it is used for a new purpose. This is a rather trivial implication of the fit-for-purpose evaluation; however, examples of re-evaluation of the performance of complex ecological models are scarce. Complex ecological models require extensive development efforts, and these materialise in the first publication of the model, together with a global evaluation (or validation) of the model (see e.g., Radach and Moll 2006; Link et al., 2010; Travers-Trolet et al., 2014; Pedersen et al., 2021). A fit-for-purpose approach would require that this first model evaluation be revised and reported for each new application of the model. One challenge in doing so is that the task of reporting model evaluation, which is already substantial when the model is first published, may seem daunting if it is to be repeated for every new model application. This can possibly be eased by reporting primarily on aspects of the model evaluation that are specific to each new application. Additional help can be provided by following a template in which a set of questions guides the modeller through the reporting process.
By taking inspiration from the success and utility of the ODD protocol and the following extensions, we here present a complementary protocol for the reporting of ecological model evaluation procedures: the OPE (Objectives, Patterns, Evaluation) protocol. We discuss the rationale for the different elements of this protocol and provide a list of questions that can guide modellers to report in OPE format. We summarise the protocol (Table 1) and provide an easy-to-use Word template to support documenting model evaluations (Supplementary material S1). Finally, we test the protocol on six case studies taken from a collection of marine ecosystem models with which the authors are familiar. These case studies are presented in detail in the supplementary material (S2). These modelling applications pre-existed the OPE protocol. The OPE has therefore not been used to guide the model evaluations presented here, but only to report how these evaluations were performed.

Elements of the OPE protocol
The elements of the OPE protocol are divided into three sections: Objectives, Patterns and Evaluation. Each section is then divided into subsections which contain one to six questions.

Context and motivations
In our experience, many ecological models are not developed with the sole purpose of answering a single, well circumscribed question. Rather, complex ecological models take time to develop, are often built to address multiple, and sometimes diffuse, purposes and are gradually applied to a range of questions (Fulton et al., 2011; Planque and Mullon 2020). For example, dynamic global vegetation models (DGVM) which were originally conceived to assess ecosystem-level responses to atmospheric CO2 concentration (Prentice et al., 2000) have later been applied to deliver projections based on future climate scenarios (Sitch et al., 2008) and are being continuously developed to address new questions, such as the impacts of different management practices on terrestrial ecosystems (Prentice et al., 2007). In a similar fashion, the ecosystem model Atlantis developed for the northwest Atlantic shelf (Link et al., 2010) was first used to explore the combined effects of climate and fishing (Nye et al., 2013), then to address the impact of ocean acidification (Fay et al., 2017) and more recently to quantify combined effects of acidification, fishing and marine protection (Olsen et al., 2018). Models of natural systems are inevitably embedded with multiple sources of uncertainty, and modellers make decisions during model construction (e.g., on which processes to include or simplify) which will affect the final outcome (Babel et al., 2019). There is a risk that assumptions which are reasonable for one particular model application are inadequate for another (Parker 2020; Saltelli et al., 2020). It is therefore essential that model suitability and performance are assessed and described for each application, and a crucial first step is describing the purpose of the specific application. In other words, to evaluate that a model is fit for purpose one must first specify the purpose. In this contribution, we refer to model as the generic description of the modelling tool (e.g. Ecopath with Ecosim, Polovina 1984a, 1984b; Christensen and Walters 2004), we use the terms goal, purpose and objective interchangeably to express the motivation driving the study, and we refer to a model application when the model is applied towards a pre-defined objective or set of objectives. Central to the OPE framework is our view that it is sensible to evaluate the same model against different patterns or data when applied for different purposes.
Describing the main objectives of the study and how modelling will contribute to reach these objectives is perhaps the most crucial step in evaluating model performance and suitability, and it should be a key reference point throughout the evaluation process. Without a clear understanding of the purpose, it becomes difficult to communicate credibility and generate trust in the modelling work. Furthermore, it may be sensible to evaluate the same model against quite different patterns or data when applied for different purposes. Defining the aims and objectives of the model application early in the research process can save time, for instance with the realisation that objectives may depend on key processes for which the model of choice lacks functionality.
The aims and objectives of a model application should be stated in simple, clear language. We suggest using active sentences (e.g., construct, produce, test, document) and avoiding vague wordings (e.g., explore, study, investigate). Beware that ambiguity in the description of the purpose of a model often leads to multiple (subjective) interpretations of whether an outcome was successful or not (Parker 2020). This hinders a reliable evaluation process. The following questions guide the reporting of objectives:
1. What are the objectives of the model application?
2. Why is the model suitable to address the objectives?
3. What would count as successful in achieving these objectives?

Specific model setup
Ideally, the ecological model has already been fully described following a standardised protocol such as the ODD. It is possible that the original description is adequate for a new application of the model, but specific applications may also require adjustments of the model structure, parameters, or assumptions. Assumptions are particularly important to report when the model is used to perform predictions at other points in time and space, which requires that the model has some degree of transferability (Wenger and Olden 2012; Yates et al., 2018). This is the case when the objective of the model is to produce forecasts or to predict ecosystem properties in one region based on a model developed in another. It is wise to explicitly state what lies behind the often-implicit assumption of ceteris paribus (everything else being equal). For example, are trophic interactions assumed to follow the same rules in different regions? Are spatial distributions or environmental conditions assumed to be unchanged in the future? When models are used for conditional forecasting, one should also report assumptions about expected changes that can affect the system studied. For example, how are possible future changes in water temperature, fishing effort, accidental oil spill or increase in noise due to shipping represented in the model? A model can be revised to better reproduce the ecological components or processes that are relevant to a new application. It is also possible that revised model structure, estimates of input parameters or new data on the forcing conditions of the model become available. All these updates should be reported in this section which describes any changes or additions which have been made since the original model description.
4. Are there any deviations from the original model description?
   a in the model assumptions,
   b in the model structure (e.g., addition of submodels, variables, components, modifications of spatial or temporal scales),
   c in the model details (e.g., changes in parameter values, functional relationships),
   d in the model forcing (e.g., initial conditions, boundary conditions, forcing time series and maps).

Selected patterns
A pattern may be defined as a characteristic and clearly identifiable structure in nature, or in data extracted from nature (e.g., population cycles, animal space use, species diversity etc.), that can be attributed to a generative process (Levin 1992; Grimm et al., 1996). Thus defined, a pattern is key to ecological understanding and prediction. Ecological patterns emerge from multiple ecological processes, which operate at multiple spatial and temporal scales and levels of organization (individual, population, community, and ecosystem). Understanding the causal mechanisms responsible for pattern formation is a primary goal of ecology (Levin 1992).
Modelling complex adaptive systems (see Levin 1992), such as marine ecosystems, is challenging, but pattern-oriented modelling (POM) may facilitate the task (Grimm et al., 1996, 2005; Grimm and Railsback 2012). POM "starts with identifying a set of patterns observed at multiple scales and levels that characterize a system with respect to the particular problem being modelled" (Grimm and Railsback 2012). In other words, the selection of patterns to be used in model evaluation depends on the objective(s) or hypothesis of the study.
Relevant ecological patterns may be related to numbers, biomass, production, or consumption of relevant ecological entities, to dynamic behaviour at equilibrium, or to the character of state transitions in perturbation studies or in systems undergoing change (e.g. Beisner et al., 2003). Other examples are spatial patterns such as spatial synchrony or travelling waves (e.g. Sherratt and Smith 2008). More complex emerging patterns (e.g., spatial structure described by a variogram, degree of spatial overlap between species) may also be candidate targets for model evaluation. The selection of specific patterns is motivated by the objectives of the modelling application and is generally driven by the hypotheses that can explain the emergence of these patterns. As pointed out by Cury et al. (2008), it might be relatively easy to reproduce a single ecological pattern with all kinds of alternative models, but simultaneously reproducing an entire set of patterns is much more demanding and requires that the model is structurally realistic. Rather than tying a model to a specific pattern via heavy calibration, it can be more useful to consider several weak patterns at the same time, because the risk that we force the model to look right, but for the wrong reasons, is then reduced. This is particularly true in the case of complex ecosystem models which include many processes and parameters that can be adjusted to tune the model to a few selected outputs. While some patterns may be used to inform the model construction (e.g., some empirical relationships between ecological variables), others are emergent properties of the model. Model evaluation based on these emergent patterns may be of greater interest since models that succeed in getting emergent patterns right may also have greater potential for transferability to other times, places or systems (Radchuk et al., 2019). It is therefore critical to report on the selection of patterns and on the justification for this selection.
5. Which ecological patterns are used for the model evaluation?
   a temporal patterns such as cycles, regime shift or trends, measures of temporal variability, and autocorrelation.
   b spatial patterns such as spatial synchrony, travelling waves, patchiness, and autocorrelation.
   c structural and functional patterns, such as taxonomic diversity, biomass ratios, integrated production, diet fractions, and trait distributions.
   d other relevant patterns.
6. Why are these patterns important/essential to address the objectives?
In the following part of the OPE one must describe the data used for evaluation purposes, which can include both data from the model output and data which are independent of the model. Information on data used for model building should be provided in the model description (typically, an ODD protocol) and data used for optimisation should be reported in the optimisation description (e.g. in an ODDO protocol, Mahévas 2019).

Independent data
Independent data, that is, data that exist independently of the model being built, are often derived from field observations, and procedures for collecting and processing these observations should briefly be summarised in this part of the OPE. Relevant information includes i) whether the data originate from a dedicated field survey, an open database, or another model, ii) the spatial, temporal, taxonomic, etc. extent and resolution of the data, iii) data representativeness, and iv) accuracy, precision, bias, or uncertainty. Data representativeness is the degree to which data can be used to represent the ecological patterns that are relevant for the objective of the study. For example, daily, weekly, or monthly time-series will have different representativeness if the ecological pattern of interest is related to phenology. Similarly, the representativeness of data collected at a single sampling station is also expected to vary with the spatial scale of the ecological question of concern, being more representative for small scale modelling studies centred around the sampling station than for larger scale investigations. Deriving ecological patterns (Section 2.2.1) from observations can involve extensive data processing, and this should be reported here. When the same type of data can be used for model optimisation and evaluation (as in cross-validation) this should be reported in this section. In some cases, although the data are collected independently of the model being built, the model and data may not be completely independent from each other (for example, knowledge from historical data is used to build the model, or input data in an Ecopath model are also expressed as an output of the model) and this should be reported. The following questions guide the collection of information about the independent data used to evaluate the model, given selected pattern(s).
7. Where do the independent data originate from? (e.g. field survey, open database, another model, …)
8. What are the extent and resolution of the independent data? (spatially, temporally, taxonomically, …)
9. How representative of the ecological process are the independent data?
10. Are there estimates of independent data accuracy, precision, bias, or uncertainty?
11. How are the independent data processed to represent the selected patterns? Are assumptions made to derive these patterns from the data?

Model outputs
Often, only parts of the model outputs are used in a specific application and the aim of this section is to describe which outputs have been used and evaluated. In some cases, the model outputs may be post-processed (e.g., aggregation of results by guild, geographical region, or integration in time). The purpose of post-processing can be to generate indicators of the relevant patterns (e.g., species spatial distribution, biomass ratios, index of seasonality, see Section 2.2.1) or to generate model outputs that are comparable with independent data (Section 2.2.2). The post-processing step can require new assumptions (e.g., assume that conversion rates such as C:Chla are constant in time/space/taxa). The aim of this section is to describe the selection of model outputs, the post-processing operations, and to report on quality, quantity, representativeness, uncertainties, or potential bias in the model outputs.
12. Which model outputs are used for the evaluation?
13. Have the outputs been post-processed, and how?
14. Are there estimates of model outputs accuracy, precision, bias, or uncertainty?
15. Are additional assumptions made when deriving patterns from model outputs?
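As an illustration of the post-processing described above, the following minimal Python sketch aggregates species-level biomass into guilds and derives a biomass ratio as a pattern indicator. The species names, guild assignments, biomass values and the function name `aggregate_by_guild` are all hypothetical and serve only to show the kind of operation that would be reported under questions 13 and 15.

```python
from collections import defaultdict


def aggregate_by_guild(biomass_by_species, guild_of):
    """Sum species-level model outputs into guild-level totals."""
    totals = defaultdict(float)
    for species, biomass in biomass_by_species.items():
        totals[guild_of[species]] += biomass
    return dict(totals)


# Hypothetical model outputs (biomass, arbitrary units) and guild assignments
biomass_by_species = {"cod": 4.0, "haddock": 2.0, "herring": 9.0, "capelin": 15.0}
guild_of = {
    "cod": "demersal",
    "haddock": "demersal",
    "herring": "pelagic",
    "capelin": "pelagic",
}

guild_biomass = aggregate_by_guild(biomass_by_species, guild_of)

# A simple pattern indicator: pelagic-to-demersal biomass ratio
pelagic_to_demersal = guild_biomass["pelagic"] / guild_biomass["demersal"]
```

The assignment of species to guilds is itself an assumption introduced at the post-processing stage, and is the kind of choice that question 15 asks modellers to report.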

Evaluation methodology
We refer here to the evaluation method applied in the context of a specific application of a model to address stated objectives (Section 2.1.1). Model verification (sensu Gräbner 2018), the act of testing whether the model does what it is supposed to do, i.e., that it is technically functional, should precede any application of the model and is not considered here. A first model evaluation step is often to conduct sanity checks. These are rapid explorations of the model outputs which ensure that, even though the model is technically functional, it is not behaving poorly. Sanity checks are often non-quantitative and based on domain knowledge rather than on quantitative comparisons of observations vs. model outputs. Though these are not often reported in model evaluation procedures, they inform about key conditions that the model must satisfy to be considered useful. Examples of sanity checks can include an inspection that population sizes or biomasses are within plausible ranges, that seasonal patterns are plausible or that emerging spatial patterns are visually credible. These can be done via Fermi estimations, often referred to as back-of-the-envelope calculations of plausible ranges. Sanity checks are often performed in an informal way and the intention of this section is to clarify and document this step. In cases when no sanity checks are performed, this should be justified.
16. Are sanity checks conducted? If so, what is the method used? If not, explain why.
   a Which data and patterns are used for this?
   b Does this apply to patterns that are not otherwise evaluated for this model application?
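A range-based sanity check of the kind mentioned above can be made explicit and reportable with a few lines of code. The sketch below is only one possible form: the output names, values and plausible ranges are hypothetical, and in practice the ranges would come from domain knowledge or back-of-the-envelope estimates.

```python
def sanity_check_ranges(outputs, plausible_ranges):
    """Return the names of outputs with at least one value outside its plausible range."""
    failures = []
    for name, values in outputs.items():
        low, high = plausible_ranges[name]
        if any(v < low or v > high for v in values):
            failures.append(name)
    return failures


# Hypothetical model outputs (biomass time series, arbitrary units)
outputs = {
    "cod_biomass": [1.2, 1.5, 1.1],
    "zooplankton_biomass": [20.0, 250.0, 18.0],
}
# Hypothetical plausible ranges (lower bound, upper bound) set by expert judgement
plausible_ranges = {
    "cod_biomass": (0.1, 5.0),
    "zooplankton_biomass": (5.0, 100.0),
}

flagged = sanity_check_ranges(outputs, plausible_ranges)
```

Reporting the ranges used, and where they come from, documents a step that would otherwise remain informal.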
The core of the evaluation process is the comparison of patterns emerging from model outputs against those obtained from independent observations. This first raises the issue of the comparability between independent observations and model outputs, i.e., whether model outputs and independent data are directly comparable and whether modelled patterns are directly comparable to observed patterns. For example, is modelled biomass integrated over a large continuous geographical domain comparable with biomass field observations from a limited number of sampling sites? The second issue is the methodology used to compare ecological patterns derived from observations to those derived from the model. There can be many methodological approaches, ranging from qualitative visual comparisons to fully quantitative estimates of the model performance at reproducing observed patterns (Allen and Somerfield 2009; Bennett et al., 2013). The latter can include univariate or multivariate approaches, and can be based on error-based measures, information theory measures, parametric tests, non-parametric tests, distance-based measures, and combined measures (Hora and Campos 2015). This stage of the evaluation is sometimes referred to as skill assessment.
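To make the notion of error-based measures concrete, the sketch below computes three commonly used univariate skill metrics (mean bias, root-mean-square error and Pearson correlation) for paired observed and modelled values. It is an illustrative example only, not the method of any study cited here; the function name and the input values are hypothetical, and the metrics assume that observations and model outputs have already been made comparable.

```python
import math


def skill_metrics(observed, modelled):
    """Compute mean bias, RMSE and Pearson correlation for paired values."""
    n = len(observed)
    if n != len(modelled) or n < 2:
        raise ValueError("need two paired series with at least two values")
    errors = [m - o for o, m in zip(observed, modelled)]
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_o = sum(observed) / n
    mean_m = sum(modelled) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modelled))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_m = sum((m - mean_m) ** 2 for m in modelled)
    if var_o == 0 or var_m == 0:
        raise ValueError("correlation is undefined for a constant series")
    correlation = cov / math.sqrt(var_o * var_m)
    return {"bias": bias, "rmse": rmse, "correlation": correlation}


# Hypothetical paired values: the model reproduces the trend but overestimates by 1
observed = [1.0, 2.0, 3.0, 4.0]
modelled = [2.0, 3.0, 4.0, 5.0]
metrics = skill_metrics(observed, modelled)
```

In this example the correlation is perfect while the bias is not zero, which shows why a single metric rarely suffices to describe how well a pattern is reproduced.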
The choice of methods and metrics used in model skill evaluation will depend on the relevant patterns. For example, when dealing with cycles, the degree of congruence between modelled and observed cycle amplitude and frequency should be reported. When modelling state transitions, agreement in the rate of change of a trend should be reported. With ecosystem models addressing ecological stability or temporal variability, the stability measure should be reported at multiple levels of organisation (e.g., species, functional group, community etc.). The quantitative criteria to evaluate the match between observed and simulated patterns must be reported. For example, if the mean of the simulations is within a certain range (e.g. 1 standard deviation) of the observed pattern, the model satisfactorily addresses the pattern (e.g. Kramer-Schadt et al., 2007). The selection (or lack of selection) of particular skill assessment methods can also be partially dictated by existing skills, available software or discipline culture and habits. Some evaluation methods may have been tried without success. In those cases, one should report on the attempted evaluation steps with some discussion on how and why these were deemed unsuccessful.
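The acceptance criterion given as an example above can be written down explicitly, which removes ambiguity when it is reported. The sketch below assumes that the tolerance is expressed in units of the sample standard deviation of the observations; this is one possible reading of the criterion, and the function name and values are hypothetical.

```python
import statistics


def within_tolerance(simulated, observed, n_sd=1.0):
    """True if the simulation mean lies within n_sd observed standard deviations
    of the observed mean (assumption: the tolerance uses the observed sample SD)."""
    sim_mean = statistics.mean(simulated)
    obs_mean = statistics.mean(observed)
    obs_sd = statistics.stdev(observed)
    return abs(sim_mean - obs_mean) <= n_sd * obs_sd


# Hypothetical observed pattern values and two sets of simulation results
observed = [10.0, 12.0, 14.0]        # mean 12, sample SD 2
close_runs = [13.0, 13.5, 12.5]      # mean 13, inside the 1 SD band
distant_runs = [15.0, 16.0, 17.0]    # mean 16, outside the 1 SD band

accepted = within_tolerance(close_runs, observed)
rejected = within_tolerance(distant_runs, observed)
```

Whatever form the criterion takes, stating the threshold (`n_sd` here) and the quantity it is applied to is what question 18 of the protocol asks for.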
Each methodology usually comes with associated assumptions that need fulfilling for the method to be valid, and these should also be reported here.
The core issue at the end of the evaluation process is whether the model outputs can be considered satisfying for the purpose of answering the modelling objective, i.e., that the grounds on which a model is declared good enough are clearly articulated (Oreskes 1998).
17. What is the methodology used to compare ecological patterns derived from independent data with patterns derived from the model?
   a What is the rationale for choosing this method?
   b How are observational and/or model output uncertainties handled?
   c Does the methodology rely on specific assumptions?
   d Were other methods tried? If they did not succeed, explain why.
18. Is there a threshold level (in the match between observed and modelled patterns) that can separate acceptable from unacceptable models?
19. How comparable are the patterns derived from the model and those derived from the independent data?
By answering the above questions, researchers should also discuss if there are patterns that cannot be well evaluated with the chosen method.

Sensitivities
We distinguish between two types of sensitivities to be reported. First, model sensitivity, which is the result of a sensitivity analysis (SA), usually performed on model structure and parameters. Second, evaluation method sensitivity, which refers to the sensitivity of the model evaluation to the choice of evaluation methodology and available observational data.
Sensitivity analysis scrutinizes how variations in model inputs influence variations in model outputs, a fundamental step in model evaluation and corroboration (EPA 2009). A sensitivity analysis informs about which input parameters the model is most sensitive to (and therefore which parameters should be obtained with greater precision and accuracy), and about the relative importance of processes in the model. A diverse array of SA approaches has been developed to help cope with the various needs dictated by differing model assumptions, computational complexity, and availability of relevant information (Saltelli et al., 2004; EPA 2009). Reviews and guidelines for best SA practice in the context of ecological and environmental modelling are an important aid to SA planning, implementation, and reporting (Saltelli et al., 2004; A. 2021; EPA 2009; Thiele et al., 2014; Pianosi et al., 2016).
Attributes of SA methods worth considering in reporting include: independence of model linearity and additivity assumptions, ability to address interaction effects amongst input factors, capacity to handle differences in scale and shape of input probability distribution functions, ability to deal with differences in input spatial and temporal dimensions, and capacity to evaluate the effect of an input while all other inputs are allowed to vary as well (Frey 2002;Saltelli et al., 2004).
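For readers unfamiliar with SA, the sketch below shows the simplest local approach, a one-at-a-time (OAT) perturbation of each input around a baseline, applied to a hypothetical toy model. OAT is used here only because it is short; it does not capture interactions between inputs and therefore lacks several of the attributes listed above, for which global methods are needed. The toy model, parameter names and values are assumptions made for illustration.

```python
def toy_equilibrium_biomass(params):
    """Hypothetical toy model: equilibrium biomass of logistic growth with
    constant mortality, B* = K * (1 - m / r)."""
    return params["K"] * (1.0 - params["m"] / params["r"])


def oat_sensitivity(model, baseline, rel_change=0.1):
    """Normalised one-at-a-time sensitivity: relative change in the output
    divided by the relative change applied to each input in turn."""
    base_out = model(baseline)
    sensitivities = {}
    for name, value in baseline.items():
        perturbed = dict(baseline)
        perturbed[name] = value * (1.0 + rel_change)
        relative_output_change = (model(perturbed) - base_out) / base_out
        sensitivities[name] = relative_output_change / rel_change
    return sensitivities


# Hypothetical baseline: carrying capacity K, growth rate r, mortality rate m
baseline = {"K": 100.0, "r": 0.5, "m": 0.1}
sens = oat_sensitivity(toy_equilibrium_biomass, baseline)
```

With these values the output responds proportionally to `K`, increases with `r`, and decreases with `m`; reporting the perturbation size and the baseline is part of documenting such an analysis.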
In this section, one should consider the sensitivity of the model outputs that are relevant to the objective of the study, i.e., the modelled patterns (Section 2.2.3). Priority should be given to reporting sensitivity analyses that were conducted specifically for the model application. Sensitivity analyses performed in earlier stages of model development can be reported if they are also relevant to the objective(s) of the study.

While there is no perfect model to address a specific ecological question, there is no perfect method to evaluate the performance of a model either (Makridakis et al., 2020). Typically, the choice of evaluation approach depends on the availability of observational data with which the model can be compared, on the computational requirements of certain types of model evaluation, on the availability of evaluation methodologies to the modellers, and on the modellers' skill sets. This section reports on the rationale and criteria for choosing a particular approach to evaluate the model performance. It highlights when the choices are dictated by the objectives of the study as opposed to computational constraints, lack of relevant information, or other considerations. For example, models with complex architecture and high computational costs, two common features of ecosystem models (Steenbeek et al., 2021), impose restrictions on the exploration of the parameter space. This in turn limits the scope for global SA and for simultaneous exploration of known sources of uncertainty, which are two desirable features of SA. This section also reports on how sensitive the evaluation method is to the data used for evaluation (section 2.4). Could the model evaluation give significantly different results if supported by other/new/more precise data or if other skill assessment methods had been used? It is also the place where one can report failed attempts to evaluate the model or discuss possible future developments in evaluation methodology. Alternative or complementary approaches to standard sensitivity analyses (e.g., robustness analysis, Thiele and Grimm 2015; Grimm and Berger 2016) can also be reported here. In summary, this section highlights the relevant attributes of the model evaluation, caveats, possible limitations, and possible developments, clarifying the performance of the model evaluation in relation to the objectives.
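The contrast drawn above between individual-parameter SA and global SA, in which every input varies at once, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy logistic model `final_biomass`, the baseline values in `BASE`, the 10% perturbation, the 30% sampling range, and the use of a Pearson correlation as a crude global sensitivity index. It is not a method prescribed by the OPE protocol, and dedicated SA tools would normally be preferred.

```python
import math
import random


def final_biomass(r, K, b0, years=10):
    """Toy model (hypothetical): discrete logistic growth, returns final biomass."""
    b = b0
    for _ in range(years):
        b = b + r * b * (1.0 - b / K)
    return b


# Hypothetical baseline inputs: growth rate, carrying capacity, initial biomass.
BASE = {"r": 0.5, "K": 100.0, "b0": 10.0}


def one_at_a_time(model, base, rel_step=0.1):
    """Local SA: relative output change when one input rises by rel_step, others fixed."""
    reference = model(**base)
    effects = {}
    for name, value in base.items():
        perturbed = dict(base, **{name: value * (1.0 + rel_step)})
        effects[name] = (model(**perturbed) - reference) / reference
    return effects


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def global_sampling(model, base, spread=0.3, n=2000, seed=1):
    """Crude global SA: draw all inputs simultaneously, correlate each with the output."""
    rng = random.Random(seed)
    draws = {name: [] for name in base}
    outputs = []
    for _ in range(n):
        sample = {k: v * rng.uniform(1.0 - spread, 1.0 + spread) for k, v in base.items()}
        for name, value in sample.items():
            draws[name].append(value)
        outputs.append(model(**sample))
    return {name: pearson(values, outputs) for name, values in draws.items()}


if __name__ == "__main__":
    print("one-at-a-time:", one_at_a_time(final_biomass, BASE))
    print("global sampling:", global_sampling(final_biomass, BASE))
```

The one-at-a-time loop needs only one extra model run per input, whereas the sampling approach needs thousands of runs. This is the kind of computational constraint on global SA that should be reported in this section.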

Table 1
The 25 questions of the OPE protocol, grouped into three headings: Objectives, Patterns and Evaluation. A brief comment accompanies each question to guide the reporting. A template form is provided in appendix S1, in which reporting can be directly entered.

OBJECTIVES

CONTEXT AND MOTIVATIONS

1. What are the objectives of the model application?
Comment: Describe here the motivation and context for using the model. What is the purpose of the study? Do not describe the model or its general objectives, but focus on study-specific objectives. Use active verbs (e.g., produce, test, quantify, reconstruct dynamics) and avoid vague wording (e.g., explore, study, investigate, understand).

2. Why is the model suitable to address the objectives?
Comment: Provide the main rationale for why this specific model approach is suited to address the objective(s) raised in question 1. For example, does the model represent a process that is central to addressing the objectives?

3. What would count as successful in achieving these objectives?
Comment: Explain here which criteria are used to determine whether the model can address the objective. For example, if the objective of the model is to quantify a variable/process, is success defined based on the uncertainty around these estimated quantities?

SPECIFIC MODEL SETUP

4. Are there any deviations from the original model description?
Comment: If this is the first time the model is presented, a full ODD description should be provided (Grimm et al., 2006, 2010, 2020b). If the model has already been presented elsewhere, only deviations from the original description should be provided here. Models are often adjusted to address a specific ecological question/objective. It is these adjustments that should be reported here.

5. Comment: The term "ecological pattern" refers to Pattern-Oriented Modelling (POM; Grimm et al., 1996, 2005; Grimm and Railsback 2012). Relevant ecological patterns can be observed at various scales and characterize the ecological system with respect to the particular problem being modelled.

9. Comment: This is a follow-up from question 8 to link data with key processes and patterns. For example, if a central process in the study is interannual variations in population numbers, and observational data of population numbers are available: do these data appropriately represent the annual abundance, or do they represent a snapshot in time or space? Do not report on uncertainty estimates here; this is addressed in question 10.

10. Are there estimates of independent data accuracy, precision, bias, or uncertainty?
Comment: Uncertainty estimates for the independent data should be reported here (uncertainty estimates for the model outputs are reported in question 14).

11. How are the independent data processed to represent the selected pattern? Are assumptions made to derive these patterns from the data?
Comment: Independent data, whether observational or modelled, may provide a representation of the patterns of interest (question 5) only after further processing. For example, survey data may be spatially interpolated to derive spatial distribution patterns. Another example: biomasses from several taxonomic units may be grouped to derive patterns of interannual changes in biomass for particular functional groups. Report these post-processing steps here.

MODEL OUTPUTS

12. Which model outputs are used for the evaluation?
Comment: This is a list of model outputs that have been selected based on the modelling objectives and related ecological patterns. The full set of raw outputs, which is often large, unprocessed, and not targeted towards the specific objectives of the modelling study, should not be reported here.

13. Have the outputs been post-processed, and how?
Comment: As for independent data, model outputs may provide a representation of the patterns of interest only after further processing (see question 11). Report here the post-processing steps that are used to go from raw model outputs to ecologically relevant patterns.

14. Are there estimates of model output accuracy, precision, bias, or uncertainty?
Comment: Uncertainty estimates for the model outputs should be reported here. Focus should be on model outputs that are used for the model evaluation.

15. Are additional assumptions made when deriving patterns from model outputs?
Comment: Report here any assumptions required to derive outputs at the appropriate scale or in the appropriate units. For example, a dry:wet-weight ratio may be assumed across species/seasons/areas to derive wet weight estimates (the relevant pattern) from dry weight (the model output).
23. How sensitive is the model evaluation to availability and uncertainty of the independent data?
24. How much is the model evaluation constrained by computational or theoretical limits?
25. How does the perceived performance of the model depend on the chosen evaluation methodology?

OPE template
As a practical tool, we provide in Table 1 a summary of the OPE protocol that highlights the main sections of the protocol, the 25 questions, and guidelines on how to answer them. We also provide, in the supplementary material (S1), a Word template that can be used to report the information relevant to a modelling study.

Applications
We provide in the supplementary material (S2) examples of the OPE protocol applied to six modelling applications: (1) an Individual Based Model (IBM) used to quantify uncertainties in the estimates of mean biomass of the copepod Calanus finmarchicus as a function of sampling design (Hjøllo et al., 2021), (2) a statistical food-web model used to quantify the association between capelin (Mallotus villosus) and its two main prey (krill and Calanus species) (Stige et al., 2018), (3) simulations from the Non-Deterministic Network Dynamics (NDND) model to assess the persistence of trophic controls in the Barents Sea (Sivel et al., 2021), (4) an Ecopath model to estimate trophic positions for ecological groups in the Barents Sea (Pedersen 2022), (5) the Nordic and Barents Seas Atlantis Model (NoBa) simulations to assess the cumulative impact of fisheries and climate in the Norwegian and Barents Seas (Hansen et al., 2019), and (6) the reconstructions and predictions of selected physical and biogeochemical properties using the NorCPM1 model in the Barents Sea (Bethke et al., 2021).
These case studies cover a range of modelling practices, modelling tools, and study objectives. Knowledge of the context within which a model is developed, and of the history of its development, is essential to understand the evaluation approach. We realise that the OPE case studies presented in this manuscript can be difficult to read without prior knowledge of each model's context and history. In standalone modelling studies, model descriptions would normally be provided in full, but this is not the case here. To compensate for this, we included introductory paragraphs that describe the models used in each case study and provide a brief history of the models, i.e., where they originate from and how they evolved to finally be used in the current case studies.

Discussion
The OPE protocol as we present it here complements other reporting protocols, in particular the ODD protocol and its extensions (e.g., ODDO, ODD+D), by focusing on the model evaluation. We argue that such a protocol can significantly contribute to improving model evaluation and can in general increase the transparency and reproducibility of published models. Following Oreskes (1998), Augusiak et al. (2014), Edmonds et al. (2019), Grimm et al. (2020a), Parker (2020), and others, we contend that model evaluation is purpose-dependent and that a clear description of the purpose of a modelling application must be an integral part of the evaluation process, whether the model goal is pedagogical, explanatory, or predictive.
Model evaluation is essential and should accompany all model studies. We have therefore developed the OPE protocol for model evaluation, which is generic enough to apply to a wide range of ecological modelling studies, from coupled physical-chemical-biological systems (NORWECOM.E2E, NorCPM1, Atlantis), to simpler models focussed on food-web interactions (NDND, Ecopath, Gompertz). In our experience, most modellers consider their model as somewhat special (i.e., not like other models) and therefore presume that it would be difficult to evaluate models using a standardised protocol like the OPE. Indeed, we found that answering the 25 questions of the OPE protocol was often demanding for modellers. Through the six case studies, we identified several challenges in documenting the OPE. Documenting model evaluation is not a standard step in most modelling studies. Lack of experience and training in doing so made it a time-consuming and demanding task that required several iterations and a substantial amount of thinking and discussion. At times, the OPE exercise was perceived as too time-consuming, unrewarding in the short term, and easy to postpone. It was often difficult to find the balance between providing informative answers and remaining concise. In several cases, it was not obvious how much contextual information was required to inform readers about the model. The amount of evidence to be presented in support of OPE statements was also debated. When sensitivity analysis had been performed in earlier studies, it could be unclear how much of it should be reported. At first sight, some questions appeared unclear or redundant, though these issues were usually resolved after some iterations. Some questions were also of little relevance for some of the model applications explored here. Nevertheless, it was possible to successfully apply the OPE protocol to each specific case study, despite the diverse collection of model types. We therefore anticipate that the protocol will be applicable to many ecological modelling studies.
The protocol can be used from the start of a modelling study, to guide model evaluation throughout the study. Though the primary motivation for this protocol was to construct a tool to help modellers report how they evaluated their models given specific objectives, we found that answering the protocol questions for the individual case studies led to additional discussions and reflections on model evaluation. In some instances, it was identified that additional evaluation steps could be taken or that some steps in the evaluation process could have been better specified. In the Gompertz case study, documenting the OPE revealed that posterior predictive checks could have been considered to improve the evaluation. In the NDND case study, it was only after the OPE was documented that the issue of determining a threshold between acceptable and unacceptable models became clear. In the NoBa case study, it became apparent that many aspects of model evaluation for a complex end-to-end model like Atlantis were still under-developed, and that the OPE could guide future work towards improved model evaluation methodology. In all case studies, the OPE helped to clarify existing evaluation procedures and identify possible improvements. Had the OPE been available at the start of these studies, the model evaluation would likely have been conducted more thoroughly. A lesson learned from the exercise is that documenting the OPE is more easily done if modellers take relevant notes about model evaluation while developing their model, rather than leaving the OPE questionnaire to the end. This highlights the potential utility of the OPE to stimulate higher standards of model evaluation, in addition to its original goal of merely reporting how evaluation was conducted.
It is important to note that the OPE protocol goes far beyond model skill assessment. Assessing the prediction skill of ecological models has been the focus of recent literature (see e.g., Stow et al., 2009; Olsen et al., 2016; Steenbeek et al., 2021 and references therein). Skill assessment is an integral part of model evaluation and is clearly identified in the first part of the Evaluation section of the OPE protocol (questions 17-19). The OPE protocol expands beyond skill assessment by addressing issues related to objectives, patterns, data, and sensitivity analyses, and puts balanced focus on these different elements. Documenting model evaluation is not yet standard practice. The 25 questions outlined in the OPE protocol are a guide to present an extensive, but not exhaustive, description of a model evaluation. A full description of the evaluation is often too long to be included in the core part of a published manuscript. We advocate that the OPE documentation be presented as a technical supplement. By documenting the details of the model evaluation procedure, the OPE provides essential information for the peer review of a modelling study and directly contributes to higher transparency. Even when not all OPE questions are answered, it makes sense to present an OPE. We encourage modellers to try the OPE protocol by using the Word template (S1) and to get help and inspiration from the answers provided in the six case studies (S2). We also encourage reviewers to use the OPE questions as a guide when evaluating modelling studies.
The current OPE template is qualitative. This provides high flexibility in reporting, but makes the evaluation report hard to appraise or to enter in automated systems that prefer numbers over free text. Possible future developments of the OPE may focus on adding standardised evaluation metrics or standardised evaluation vocabularies that could be automatically populated while performing evaluation exercises. This in turn would facilitate analyses and comparisons within and between models. Further development of the OPE might also include other aspects of model evaluation that were not explicitly addressed here, such as robustness analysis (Grimm and Berger 2016). The questionnaire structure could possibly be hierarchised to highlight questions that have the highest priority (e.g., questions 1, 2, 3 and 19), or it could eventually be formally linked to other existing tools like TRACE (Grimm et al., 2014; Ayllón et al., 2021).
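As a purely illustrative sketch of what an automatically populated OPE entry might look like, the snippet below computes two common skill metrics (bias and root-mean-square error) and stores them next to the free-text answer in a structure that serialises to JSON. The `OpeAnswer` schema, its field names, and the numbers are hypothetical. They are not part of the OPE template and are shown only to make the idea concrete.

```python
import json
import math
from dataclasses import asdict, dataclass, field


@dataclass
class OpeAnswer:
    """One OPE answer: free text plus optional numeric metrics (hypothetical schema)."""
    question: int
    text: str
    metrics: dict = field(default_factory=dict)


def skill_metrics(observed, modelled):
    """Bias and root-mean-square error between paired observed and modelled values."""
    if not observed or len(observed) != len(modelled):
        raise ValueError("observed and modelled must be non-empty and of equal length")
    diffs = [m - o for o, m in zip(observed, modelled)]
    bias = sum(diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return {"bias": bias, "rmse": rmse}


# Made-up values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
modelled = [1.5, 2.5, 2.5, 4.5]

answer = OpeAnswer(
    question=17,
    text="Modelled and observed annual biomass were compared using bias and RMSE.",
    metrics=skill_metrics(observed, modelled),
)

# Serialise the answer so that it could be collected and compared across studies.
record = json.dumps(asdict(answer), indent=2)

if __name__ == "__main__":
    print(record)
```

Keeping the free text alongside the numbers preserves the flexibility of the qualitative template while making the quantitative part comparable within and between models.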
As noted by Grimm et al. (2014), building a 'culture' of model reporting is about doing all these things as well as you can because you know that peers and model clients are expecting you to; there is no point any more in complaining about "additional effort" for these things. We recognise that we are not there yet. Promoting the OPE and similar documentation during the peer review process would help in getting this culture in place.
The current version of the OPE protocol is a work in progress. Model evaluation is complex and the development of tools for reporting how evaluation is conducted is not a simple problem. The case studies presented here all originate from high-latitude marine ecosystem modelling research, which reflects the expertise of the authors. Further applications of the OPE will show how much the experience gained from developing and applying the OPE protocol on these few examples can benefit other modelling approaches on other ecological system types. During the discussions that formed the basis for the current protocol, a central point was that modellers have various cultures, experiences, and practices when it comes to model evaluation. These points of view are not always easy to reconcile with each other. Further discussions based on the use of the protocol on a wider range of models are expected to lead to revisions of the OPE protocol in the future.

Conclusion
The OPE protocol is proposed as a tool to report the evaluation of ecological models. The reporting template is organised along 25 questions, which make it easier and faster for modellers to report model evaluation. The OPE structure further promotes comprehensive reporting of the evaluation process, from objectives to data, skill assessment, and sensitivity analyses. Our experience is that structured reporting of model evaluation helps modellers to think more deeply about the evaluation of their models. We therefore suggest that it would be highly beneficial for modellers to consider the OPE early in the modelling process, in addition to using it as a reporting tool (as we have done here) and as a reviewing tool.
Table 1 (continued)

4. (continued)
a In the model assumptions?
b In the model structure - submodels, variables, components, scales?
c In the model details - parameter values, functional relationships?
d In the model forcing - initial conditions, boundary conditions, observation forcing, maps?

PATTERNS

SELECTED PATTERNS

5. Which ecological patterns are used for the model evaluation?
a Temporal patterns - cycles, shifts, trends, variability, autocorrelation
b Spatial patterns - synchrony, travelling waves, patchiness, autocorrelation
c Structural, functional patterns - diversity, biomass ratio, integrated production, diet, traits
d Other relevant patterns
Comment: The patterns listed in a, b, and c are by no means required or exhaustive, but are provided as examples of possibly relevant patterns.

6. Why are these patterns important/essential to address the objectives?
Comment: Explain here how the selection of ecological patterns is justified in relation to the objectives of the modelling application. Are there hypotheses that can explain the emergence of these patterns? Do not discuss how these patterns can be derived from observations or model outputs; this is addressed in questions 11-15.

INDEPENDENT DATA

7. Where do the independent data originate from?
Comment: Independent data refers to data that exist independently from the current model being developed. These can be observational data or outputs from other models. Do not discuss outputs from the modelling study; these are addressed in questions 12-15.

8. What are the extent and resolution of the independent data?
Comment: Report here the spatial, temporal, and taxonomic extent and resolution of the independent data identified in question 7. For example, if a data series is presented, what are the starting and ending times and the time-frequency of data acquisition? If biodiversity data are provided, what is the taxonomic resolution and the method used to determine taxonomic units?

9. How [...]

16. [...] conducted? If so, what is the method used? If not, explain why.
a Which data and patterns are used for this?
b Does this apply to patterns that are not otherwise evaluated for this model application?
Comment: Sanity checks are informal steps that are taken throughout model development to ensure that the model is not behaving badly. They inform on key conditions that the model must satisfy to be considered useful. For example, checking that a population neither becomes extinct nor grows to unrealistic size.

17. What is the methodology used to compare ecological patterns derived from independent data with patterns from the model?
a What is the rationale for choosing this method?
b How are observational and/or model output uncertainties handled?
c Does the methodology rely on specific assumptions?
d Were other methods tried? If they did not succeed, explain why.
Comment: This section describes how model outputs are evaluated against independent data. This is sometimes referred to as model "skill assessment". This section should describe the methodology used as well as the rationale for the choice of methods, i.e., how the methods relate to data, model outputs, objectives of the study, and relevant ecological patterns.

18. Is there a threshold level (match between observed and modelled patterns) that can separate acceptable from unacceptable models?
Comment: When are the model outputs reliable enough to be used to answer the main question of the study? Answering this question is critical to evaluate when the model can address the main objective of the study. One should not discuss here the conclusions of the study, but only the skill level required to consider the model useful.

19. How [...]

20. Has a model sensitivity analysis been performed? If so, how? If not, explain why.
a on the model structure?
b on the model parametrization?
c on other aspects of the model?
Comment: Describe here the approach used to conduct model sensitivity analyses (SA), in a broad sense, from individual parameter SA to global SA. Various aspects of the methods used for SA can be reported here, including sensitivity to parameters, model structure, boundary/initial conditions, simulation design, and so on (see e.g., Pianosi et al., 2016).

21. Which elements are the modelled patterns most sensitive to?
a input parameters
b priors and assumptions
c structural elements
d processes
Comment: If applicable, report here the results of the SA on parameters, model structure, processes, and assumptions.

22. How sensitive are the modelled patterns to the choice of initial conditions, boundary conditions, spatial and temporal resolution?

23. How sensitive is the model evaluation to availability and uncertainty of the independent data?

24. How much is the model evaluation constrained by computational or theoretical limits?
Comment: Models that are structurally simple and computationally fast can generally be explored through in-depth SA. It is more demanding to run appropriate SA on models that are structurally complex or that use substantial CPU resources to run. For some models, complexity and run time make SA non-achievable in practice. These issues should be reported here.

25. How does the perceived performance of the model depend on the chosen evaluation methodology?