How to keep it adequate: A protocol for ensuring validity in agent-based simulation

uncertainty analysis and simulation. We present a twelve-step protocol to highlight the (often hidden) premises for methodological choices and their link to the modelling context. It is designed to aid modellers in understanding their context and in choosing and documenting context-adequate and mutually consistent methods throughout the modelling process. It is also intended to assist reviewers and the community as a whole in assessing and discussing context-adequacy.


Introduction
The increasing application of agent-based simulation models (ABM) for policy analysis in environmental and land system sciences, among other fields, has been accompanied by persistent calls to improve and formalise methods for their validation (Heppenstall et al., 2021; Elsawah et al., 2020; An et al., 2020; Niamir et al., 2020b; Brown et al., 2017; Filatova, 2015; Filatova et al., 2013; Heckbert et al., 2010; Marshall and Galea, 2015; Rand and Rust, 2011; Siebers et al., 2010; Midgley et al., 2007). These calls are motivated by the concern that an ABM must prove its ability to provide useful and reliable insight for solving real-world problems if it is intended to be more than a theoretically appealing academic thought-instrument.
In discussions of validation in simulation modelling in general, empirical validation, i.e. comparing model predictions to observations of the behaviour of a real-world system, was traditionally regarded as the ideal method for showing the relevance and reliability of a model (Oreskes et al., 1994). It entails reproducible protocols and quantitative, replicable and transparently communicable results. However, like any type of model inference from observed system behaviour (behaviour-based inference), it relies on (statistical) assumptions about the data and modelled system and can be severely misleading if these assumptions are not fulfilled in a specific research context (e.g. Oreskes et al., 1994; Polhill and Salt 2017). Structural validation, by contrast, aims to ensure correspondence of the structure, processes and mechanisms within the model with their real-world counterparts. It is often limited by incomplete structural system knowledge and is typically less formalised. When it is conducted as empirical validation of model component behaviour (structural behaviour validation, micro-validation), it is subject to similar statistical prerequisites as empirical behaviour validation at the macro level.
Recognising that neither empirical nor structural validation can ultimately prove absolute correspondence of a model to reality and that models are by definition abstractions from reality (Oreskes et al., 1994; Quine, 1951) has led the scientific community to change the condition for model validity from 'corresponds to the real system' to 'is adequate for its intended purpose' (e.g. Forrester and Senge, 1980; Gass, 1983; McCarl and Apland, 1986; Oreskes et al., 1994; Barlas, 1996; Kydland and Prescott, 1996; Rykiel, 1996; Beck et al., 1997; Jakeman et al., 2006; Deichsel and Pyka, 2009; Augusiak et al., 2014; Edmonds et al., 2019). This means that the conditions for a valid, i.e. adequate, model and simulation analysis are context-dependent. They depend not only on the characteristics of the system to be modelled, but also on the availability of data describing the system and its behaviour and on the research question to be answered.
Discussing the validation of ABM against this background, one first notices that ABM are used for a large variety of purposes and contexts (Edmonds et al., 2019; Lippe et al., 2019; Schulze et al., 2017; Ligmann-Zielinska et al., 2020). As inherently structure-rich models, they are often (but not always) used in contexts where data-driven modelling approaches are not applicable and, as a consequence, many prerequisites for empirical validation are not fulfilled (Berger and Troost 2014). The importance of structural validation, uncertainty and sensitivity analysis for ABM used in these contexts has been widely recognised (Moss and Edmonds 2005; Brenner and Werker, 2007; Augusiak et al., 2014; Troost and Berger 2015a; Marshall and Galea, 2015; Polhill and Salt 2017) and has even led some to dismiss empirical inference and validation of ABM altogether (Verhoog et al., 2016). Nevertheless, methods for behaviour-based inference (incl. empirical validation) of ABM have been developed for specific disciplinary contexts, for example: indirect inference of ABM in financial economics (Chen et al., 2012); pattern-oriented modelling as the de facto standard in ecological modelling (Grimm et al., 2005; Thiele et al., 2014); Approximate Bayesian Computation for inference of individual-based models (van der Vaart et al., 2015); micro-validation in energy economics (Niamir et al., 2020a); automated calibration for innovation diffusion models (Jensen and Chappin, 2016) and real estate market interactions (Filatova 2015; Magliocca et al., 2016; de Koning and Filatova, 2020); and robust inference of parameter distributions in agricultural economics (Arnold et al., 2015; Troost and Berger 2015a; Berger et al., 2017). Hence, agent-based modelling appears as a very diverse field, in which a multitude of methods for model construction, model inference, validation, evaluation and sensitivity analysis are being used and advocated.
Unfortunately, the contexts in which specific methods are applicable are typically not explicitly discussed in general terms.
The ABM community has successfully addressed communication challenges caused by the diversity of modelling structures by adopting the ODD protocol for formal model documentation. The TRACE format (Schmolke et al., 2010; Grimm et al., 2014) was suggested for also documenting hidden steps of the modelling process. However, a consensus or formal protocol that transcends disciplines regarding which modelling methods to choose for a specific ABM application context has not yet been established, not even within the more confined field of ABM in environmental and land system sciences (Polhill and Salt 2017; Filatova 2015).
This article aims to fill this gap by formalising a framework for validation, i.e. a concept and guideline for ensuring and documenting the adequacy of an ABM and the simulation analysis for which it is used. In the following section, we conceptualise validation as "challenging and substantiating the premises on which the conclusions from simulation analysis are built". We revisit premises of inference and validation typically used in simulation analysis in general and discuss to what extent they are tested, and to what extent they are actually presupposed by empirical and structural validation, behaviour-based model inference, uncertainty analysis and result interpretation.
It becomes clear that, given the diversity of contexts in which ABM are applied, it is not useful to prescribe one statistical or structural validation procedure for all ABM. What is more, under a paradigm of adequacy and given the constraints on empirical validation, validity cannot be tested solely by examining the behaviour or structure of the model once it has been constructed. Validation cannot consist solely of one confined, isolated step of the modelling process (typically located after calibration and before predictive simulations), as it is still commonly understood. Instead, validation, if understood as systematically examining the adequacy of a model for its purpose, requires careful justification of context-adequate and mutually consistent choices at all stages of the simulation analysis, including the choice of model components and the choice of methods for model inference (inverse modelling, calibration, estimation, empirical validation), and a consistent tracing, documentation and interpretation of uncertainties through the modelling process to finally ensure the validity of the conclusions drawn from the analysis.
On this basis, in the third section, we develop a step-by-step protocol of guiding questions to help agent-based modellers "keep it adequate" (KIA) by (i) defining the modelling context, (ii) adequately selecting models and methods for model inference and uncertainty documentation, and (iii) adequately deriving and interpreting simulation results and their uncertainty.
The fourth section discusses how the KIA protocol can help the ABM community and concludes. The protocol is intended to (a) guide modellers during the research process, (b) provide a template structure for transparently documenting the rationale for modelling choices, (c) serve as a checklist for reviewers and stakeholders (addressees of simulation results) when assessing the validity of a documented study and its conclusions, (d) foster efficient communication between authors and reviewers, and (e) help in structuring the scientific discussion on the merits of choices regarding model selection, inference and evaluation made during the modelling process.

Arguments for model validity and their premises
If there is one cross-disciplinary consensus in the scientific literature on model validation, it is that model validity cannot be established in general, but only with respect to a specific purpose for which the model is intended to be used. Model validity is the adequacy of a model for its intended purpose (e.g. Forrester and Senge, 1980; Gass, 1983; McCarl and Apland, 1986; Oreskes et al., 1994; Barlas, 1996; Kydland and Prescott, 1996; Rykiel, 1996; Beck et al., 1997; Jakeman et al., 2006; Deichsel and Pyka, 2009; Augusiak et al., 2014; Edmonds et al., 2019). The purpose of any scientific simulation analysis is to answer a research question. Scientific answers result as conclusions from scientific argumentation and are accepted if the conclusions can be validly derived from accepted premises (McCloskey, 1983; Hands, 2001). Scientific objectivity is ensured by transparently subjecting all premises and deductions to critical scrutiny and peer review (Klappholz and Agassi, 1959; Caldwell 1991; Longino, 1992).
In their most generic form, scientific arguments that employ simulation modelling conform to the following logical proposition:
Major premise A: "If a simulation s fulfils conditions U and results in y for inputs x, we can conclude Z.": ∃s: U(s) ∧ R(s, x, y) ⇒ Z.
Minor premise B: "Our simulation t results in y for inputs x and fulfils conditions U.": R(t, x, y) ∧ U(t).
Conclusion: "We conclude Z.": ∴ Z by A ∧ B and modus ponens.
Premise B is a conjunction of two premises. The first premise "R(t, x, y): Our model results in y for inputs x" is supported by result analysis. Showing that the second premise ("U(t): Our simulation analysis fulfils conditions U") holds is what is typically understood as validation.
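The logical validity of this argument form can be checked mechanically. The following Lean 4 sketch is our own formalisation and not part of the original argument; the type names Sim, Input and Output are placeholders introduced purely for illustration. It shows only that Z follows once premises A and B are granted. Whether A and B actually hold is the subject of validation and result analysis.

```lean
-- Sim, Input, Output are hypothetical placeholder types (assumptions for illustration).
theorem conclusion_Z
    {Sim Input Output : Type}
    (U : Sim → Prop)                    -- conditions U on a simulation
    (R : Sim → Input → Output → Prop)   -- "simulation s results in y for inputs x"
    (x : Input) (y : Output) (Z : Prop)
    (A : (∃ s, U s ∧ R s x y) → Z)      -- major premise A
    (t : Sim) (B : R t x y ∧ U t)       -- minor premise B
    : Z :=
  -- t witnesses the existential; the two halves of B are reordered to match A
  A ⟨t, B.right, B.left⟩
```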
A typical example: We conclude (Z) "Climate change will increase poverty among farming households" if R(t, x, y): "Simulated farm agent income is lower in climate change scenarios than in the baseline". The necessary condition U(s) is very often formulated as: "The model employed in our simulation analysis provides sufficiently reliable predictions of y(x) in the real-world system." Empirical output validation and structural validation test whether a simulation t fulfils this (or a very similar) formulation of U(s) but they, in turn, rely on further necessary premises.
These premises will be discussed in the following two subsections. The third subsection emphasises the role of uncertainty analysis for sound and robust conclusions (showing sufficient reliability). In the fourth subsection, we highlight that simulation analysis may also rely on differently formulated conditions U(s) that allow for more useful conclusions in some contexts.

Premises of behaviour-based inference including empirical validation
The key underlying premise of any form of inference from the comparison of model and observed system behaviour is: "Predictive performance of a model in observed situations can be generalised to the target situations (i.e., the system situations relevant for the research question)". This premise is trivially fulfilled if the target situation is part of the observed situations (in-sample setting). However, very often the simulation purpose is to anticipate 3 system behaviour for target situations (life after the climate has changed, in our example) that have not been (fully) observed or, in other cases, to find a generalisable model that explains mechanisms governing system behaviour in many target situations (explanation) (Edmonds et al., 2019).
Direct generalisation of behaviour (i.e., observed x-y relationships between system input and output including the strength of this relationship [ = predictive performance]) from observed to unobserved situations relies on the two premises that the observed sample is redundant enough to control for sampling error and the target situations are part of a statistical population for which the observed sample is representative (representative sample setting). These basic statistical preconditions of representativity and control for sampling error apply to any form of model inference from observed behaviour (behaviour-based inference, inverse modelling), whether parameter values (estimation, calibration) or model structures (data-driven model selection) are selected, or predictive accuracy is estimated and compared to some (implicit) benchmark (goodness-of-fit evaluation) or between training and test samples (cross-validation) 4 : In all cases, ignoring sampling error and non-representativity (bias) leads to the generalisation of spurious, unsystematic, confounded or unstable relationships (overfitting) that causes inaccurate and misleading out-of-sample predictions and makes the inference invalid (Browne 2000;Forster 2000;Hansen and Heckman 1996).
Sampling error is the unavoidable, unsystematic error caused by using a sample and not the full population. It can potentially be reduced by increased sampling rates (Williams et al., 2022). Non-representativity occurs due to a biased sample, which can have different, sometimes subtle causes, including attrition, self-selection, survivorship or failure bias, observer bias, and unobserved heterogeneity (Vandecasteele and Debels 2007; Gangl 2010; Gormley and Matsa 2014; Jager et al., 2020; Smith 2020). While some minor biases may be corrected by statistical means, structural breaks, non-stationarity or regime shifts, such as climate change, substantially alter statistical x-y relationships, causing extreme sample bias: Observed and target situations are so fundamentally different that they must be considered different statistical (sub)populations (non-representative sample setting) and direct generalisation is not possible (Perron 2006; Andersen et al., 2009; Leamer 2010; Filatova et al., 2016; Verstegen et al., 2016).
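The consequence of a non-representative sample setting can be illustrated with a minimal Python sketch. The response function, the location of the structural break at x = 10 and all numbers are hypothetical assumptions chosen for illustration and are not taken from any study cited here. A straight line is fitted to observations from the old regime; it predicts held-out observations from the same regime without error, yet this accuracy tells us nothing about the target situations beyond the break.

```python
from statistics import mean

def true_system(x):
    """Hypothetical real-world response: linear up to x = 10, then a structural break."""
    return 2.0 * x if x <= 10 else 20.0 - 3.0 * (x - 10)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def mean_abs_error(model, xs):
    """Mean absolute prediction error of the fitted line over situations xs."""
    a, b = model
    return mean(abs(a + b * x - true_system(x)) for x in xs)

observed = [0, 2, 4, 6, 8, 10]   # calibration sample (old regime)
held_out = [1, 3, 5, 7, 9]       # test sample drawn from the same regime
target = [12, 14, 16]            # target situations beyond the break

model = fit_line(observed, [true_system(x) for x in observed])
in_regime_error = mean_abs_error(model, held_out)
target_error = mean_abs_error(model, target)
print(f"held-out error in observed regime: {in_regime_error:.2f}")
print(f"error in target situations:        {target_error:.2f}")
```

The example is deliberately noise-free, so sampling error plays no role; the whole prediction failure stems from the target situations belonging to a different (sub)population than the sample.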
In non-representative sample settings, anticipation of system behaviour for unobserved situations has to rely on structural knowledge about internal system processes (see next section). Nevertheless, a sample can still be useful for indirect generalisation: Structural knowledge often admits alternative model formulations or parameter values (candidates). Even if a sample is not representative of the target situations, it can help discriminate between candidates if it is representative and sufficiently redundant for selected situations in which the candidates imply clearly distinguishable behaviour. Generalisation to a target situation then relies exclusively on structural knowledge embodied in the chosen candidate, whereas observed behavioural data only contributes indirectly by selecting this candidate. 5 Importantly, the predictive accuracy measured in the sample cannot be straightforwardly generalised to the target situation in these cases as even systematic differences in prediction errors between sample and target situations cannot be ruled out.
Preconditions for reliably discriminating between candidates are structural and practical identifiability (Bellman and Åström, 1970; Cobelli and DiStefano, 1980; Stigter et al., 2017; Guillaume et al., 2019): Structural identifiability means that different candidates are not observationally equivalent, i.e. do not imply the same system behaviour in the observed situations. Even a fully representative and redundant sample is not able to distinguish between models that predict the same output for the same input. 6 Practical identifiability means that the variation in the observational data in connection with statistical assumptions (e.g. on representativity and the form of model errors) is sufficient to unambiguously attribute effects to the individual parameters of a given model structure. Sampling error, confounded input variation (correlated variables, multicollinearity), unobserved heterogeneity, and omitted variable bias are key obstacles for unambiguous model selection and parameter estimation. More complex models require more data or more restrictive prior assumptions on parameters to be practically identifiable (Browne, 2000; Burnham and Anderson 2004; Polhill and Salt 2017). Two candidates that cannot be discriminated by given data are termed 'equifinal' (Beven and Freer 2001).
3 This purpose can be called prediction, projection, scenario analysis, counterfactual simulation, forecast or just simulation depending on context (a more detailed discussion follows in section 3.1).
4 The latter two are most often associated with the term "empirical validation". Both are behaviour-based inference methods because they are used to select/accept a model by comparison to other models [sample averages in the simplest case, see sections 3.2 and 3.3.1]. If not satisfied, the search typically continues until a better model is found. In terms such as 'calibration & validation', the second word typically refers to the second stage in a simple two-sample cross-validation. Within that cross-validation process the calibration and validation stage each have their separate roles, but together they constitute a method for model selection. This narrow meaning of validation is not to be confused with the comprehensive idea of validation as evaluating model adequacy for purpose advocated in this article (which involves the adequacy of a model selection/inference method).
5 Similarly, indirect generalisation occurs if the output variable of interest has not been observed itself and a model is indirectly tested using another related output variable. Generalisation of the variable of interest then relies on the premise that the structural knowledge embodied in the model correctly relates the two variables.
6 Structural identifiability in our understanding also subsumes the problems of endogeneity often encountered in econometrics.
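Observational equivalence of candidates can be made concrete with a short Python sketch. The two candidate models and the situation values are hypothetical and were constructed for illustration only: a linear response and a saturating response coincide in every observed situation, so no amount of data from those situations can discriminate between them, although they diverge clearly in the target situations.

```python
def candidate_linear(x):
    """Hypothetical candidate 1: the response keeps growing linearly."""
    return 2.0 * x

def candidate_saturating(x):
    """Hypothetical candidate 2: the same response, but capped at 20."""
    return min(2.0 * x, 20.0)

def max_difference(model_a, model_b, situations):
    """Largest absolute difference in predicted output over the given situations."""
    return max(abs(model_a(x) - model_b(x)) for x in situations)

observed_situations = [0, 2, 4, 6, 8, 10]   # the cap is never exceeded here
target_situations = [12, 14, 16]            # the cap matters here

gap_observed = max_difference(candidate_linear, candidate_saturating, observed_situations)
gap_target = max_difference(candidate_linear, candidate_saturating, target_situations)
print(f"largest gap in observed situations: {gap_observed}")
print(f"largest gap in target situations:   {gap_target}")
```

In such a case only structural knowledge, or observations from situations in which the cap becomes relevant, can narrow down the candidate set.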

Premises of structure-based model choice and structural validation
Structure-based simulation is essential to anticipate behaviour for target situations for which direct generalisation from observed data is not possible and to derive structural explanations of system behaviour. Structure-based simulation deduces system reaction from existing knowledge about system components and their interactions. It is sometimes argued that such a deductive process does not create new information. However, as Frisch (1933) argued, the key contribution of quantitative modelling is to analyse the interplay of processes and compare the magnitudes and directions of their individual effects in relation to each other in order to deduce the behaviour of the whole system. This anticipated or emergent behaviour is new information that was not obvious from looking at existing knowledge on individual processes in isolation.
The key premise of structure-based modelling and structural validation is: "A model that contains a sufficiently complete and accurate representation of the internal structure and processes of a system is expected to predict system behaviour well." Structurally assessing the premise of sufficient completeness is often complicated by incomplete knowledge of the system and its potential reconfigurations. In addition, modellers are typically forced to strike a balance between completeness and efficiency, striving to include all relevant processes while omitting unimportant ones that complicate the model construction (Forrester and Senge, 1980).
Assessing the premise of sufficient accuracy in the representation of individual processes is the subject of micro-validation (Moss and Edmonds 2005;Windrum et al., 2007;Midgley et al., 2007;Arnold et al., 2015;Ghaffarian et al., 2021). Some structural processes and their parameters may be directly observable and measurable. Others, however, may have been generalised from observed subsystem behaviour by behaviour-based inference, in which case the preconditions discussed in section 2.1 (sample representativity, identifiability and control of sampling error) apply: The inclusion of estimated model components into a composite model requires ensuring that the observations from which they have been generalised are representative for all contexts for which they are applied in the composite system.

Uncertainty analysis: the premises for robust conclusions
Given the statistical nature of model inference and the typically incomplete nature of structural knowledge discussed in the previous subsections, simulation analysis is practically always subject to uncertainty. Just showing that one particular model results in a specific output for a particular input is hence not convincing: It invites the immediate criticism that plausible alternative models might show different results. Rather, it must be shown that the final conclusions towards the research question are robust and not affected by uncertainty and bias (van Asselt, 2000;Walker et al., 2003;Saltelli et al., 2013;Fischhoff and Davis 2014;Berger and Troost 2014;Troost and Berger 2015a;Marchau et al., 2019).
This implies, firstly, that the type and degree of uncertainty and bias that are compatible with conclusion Z must be carefully specified in the major premise. Secondly, it is a necessary subpremise of U(s) that implications of uncertainty in structural knowledge and uncertainty in model inference from data (and, in predictive analysis, uncertainty in the anticipated input for target situations) and their effects on results have been carefully assessed.
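A minimal Python sketch of such a robustness check follows. The function simulated_income is a hypothetical stand-in for a full ABM run, and the candidate parameter sets and warming values are illustrative assumptions. Conclusion Z ("income is lower under climate change than in the baseline") is only treated as robust if it holds for every candidate that remains plausible after model selection.

```python
def simulated_income(params, scenario):
    """Hypothetical stand-in for an ABM run returning mean farm income."""
    base_income, climate_sensitivity = params
    warming = {"baseline": 0.0, "climate_change": 2.0}[scenario]
    return base_income - climate_sensitivity * warming

def conclusion_holds(params):
    """Conclusion Z: simulated income is lower under climate change than in the baseline."""
    return simulated_income(params, "climate_change") < simulated_income(params, "baseline")

def is_robust(candidate_set):
    """Z is robust only if it holds for every remaining plausible candidate."""
    return all(conclusion_holds(p) for p in candidate_set)

# Candidate parameter sets representing remaining model uncertainty (illustrative values)
candidates = [(100.0, 5.0), (120.0, 1.0), (90.0, 12.0)]
print("Z robust across candidates:", is_robust(candidates))

# Adding one plausible candidate with the opposite sign of effect breaks robustness
print("Z robust with extra candidate:", is_robust(candidates + [(100.0, -1.0)]))
```

Requiring agreement across all candidates is one possible robustness criterion; which degree of disagreement is tolerable has to be specified in the major premise of the argument.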

Alternative basic premises
Not every scientific argument using simulation analysis is based on the premise that the model provides reliable predictions of y(x) in the real-world system. Edmonds et al. (2019) have noted that some types of analysis (e.g. theoretical exposition) do not require any immediate claims about the relation of the model to reality at all or put more emphasis on representing stakeholders' views of the system.
A subtler relation is discussed by Troost and Berger (2020, p. 6f.), who use the following hypothetical ABM application: "Economic policy analysis often works in a normative context: Policy makers need to justify actions with respect to established societal values, norms or ideologies. For example, they might work in a political setting, in which the state is supposed to safeguard minimum living incomes but only to interfere in economic processes if market participants are not at all able to help themselves.
"Assume that in this context analysts build their ABM to simulate the adaptation of farmers to climatic change and model each farm agent decision as a rational optimisation problem with perfect anticipation of (projected) climatic impacts on production and market conditions. In addition, farm agents are embedded into a social network of mutual solidarity, in which agents less affected by climatic extreme events indiscriminately help the severely affected ones. Analysing their simulations, the analysts find that their optimising farm agents become food insecure under projected impacts. They conclude that if perfectly-foresighted, optimising agents in a perfectly functioning social solidarity network do not fare well, real-world farmers are even more unlikely to do so and should receive government help." As Troost and Berger (2020) observe, the model would likely not pass conventional structural and empirical validation: Key modelled processes do not correspond to our best knowledge of their real-world counterparts (in reality, farmers do not behave as fully rational optimisers with perfect foresight, and networks typically discriminate by family ties, ethnicity, etc.). The model will almost surely overestimate observed farm incomes in the past. Nevertheless, the conclusions would withstand such criticism, because accurately predicted farmer or network behaviour is not a relevant premise of the argument here.
In this case, the premise that would need to be challenged in validation is that the model calculates the best possible reaction in economic terms. Empirically this could be done, for example, by searching for observed cases for which the model predicts worse-than-observed outcomes. One might also identify other unexpected deviations, e.g. larger farm holdings having higher per-area incomes than smaller ones, which might be observed in the data but not in the model (or vice versa) and which are not expected to be caused by imperfect optimisation of real-world farmers alone. Nevertheless, even if the intention is not to show accurate prediction, premises on representativity, sampling error and identifiability also apply here. Structural validation could, for example, assess whether assumed constraints are overly pessimistic or whether alternative production, safety or income options that might become available with climate change have been omitted. Troost and Berger (2020) further observe that if, by contrast, the analysts find that their computational agents fare well, it would be a logical fallacy to conclude that real-world agents will fare well based on the same premises. Such an argument would require different premises that are much more difficult to support using a model with a clear upward bias. Both cases use the same model in the same empirical context towards the same motivating research question. This illustrates that to judge a model's adequacy we require a very precise definition of its empirical context and the exact argumentative premise it is supposed to support.
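The asymmetry between the two cases can be stated as a simple inequality. The notation is ours and not that of the cited authors: y^real denotes real-world farm income, y^model the simulated income of the optimising agents, and τ a hypothetical threshold below which households count as food insecure. Under the upper-bound premise, a simulated outcome below the threshold carries over to the real system, whereas a simulated outcome above it does not.

```latex
\begin{align*}
\text{Premise (upper bound):}\quad & y^{\mathrm{real}} \le y^{\mathrm{model}} \\
\text{Valid inference:}\quad & y^{\mathrm{model}} < \tau \;\Rightarrow\; y^{\mathrm{real}} \le y^{\mathrm{model}} < \tau \\
\text{Invalid inference:}\quad & y^{\mathrm{model}} \ge \tau \;\not\Rightarrow\; y^{\mathrm{real}} \ge \tau
\end{align*}
```

Supporting the second kind of conclusion would require a lower-bound premise on y^real, which a model with a clear upward bias cannot supply.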

A protocol for ensuring validity in agent-based simulation
Summarising section 2, validation means ensuring the adequacy of simulation analysis for answering a specific well-defined research question and such adequacy requires: (I) laying out a logically valid argumentative structure on which potential conclusions from simulation towards the research question can be built; (II) choosing model components and methods of inference and evaluation that (i) fit the requirements implied by this argumentative structure and (ii) rely only on preconditions regarding observation data, system properties and structural knowledge that are fulfilled in the given context; (III) carefully assessing whether the simulation results and specifically their uncertainty and bias are consistent with the requirements of the argument.
Points (I)-(III) imply that adequacy is relative to a modelling context, which consists of the purpose (research question) and the available knowledge and data about the modelled system. Validity cannot be ensured by examining model structure and behaviour ex post only; ensuring it requires assessing the adequacy and mutual consistency of choices at all stages of the modelling process. Given the diversity of contexts in which ABM are applied, it will be impossible to identify one-fits-all model structures or statistical methods for all ABM, and assessments of adequacy need to be able to cover a broad set of possible contexts.
Taking this into account, in the following sections, we propose a protocol (Fig. 1) of 12 steps covering and linking all stages of simulation analysis. The protocol helps characterise the modelling context (Part I, section 3.1), guides the choice of context-adequate methods based on this characterisation (Part II, section 3.2), and emphasises the documentation and consistent propagation of uncertainty through the modelling process, so that finally the robustness of conclusions can be comprehensively assessed (Part III, section 3.3). The protocol itself is provided in Tables 1-4 and 6, while the sections in the main text explain the rationale for each step. Where available, we list formal methods of analysis with useful references and highlight the premises for their applicability.
The concept of the protocol involves eleven dimensions (marked by letters a-k in Fig. 1) that characterise modelling contexts and determine adequate choices of models and methods. Six of these represent requirements of the research question (Fig. 1 a-f) that can be determined already at the beginning of the modelling process, while the other five (Fig. 1 g-k) require a more in-depth analysis of the relationship between research question and system knowledge and data during the modelling process. For a better overview, the numbering in Fig. 1 links the main stages of the modelling process (blue boxes) and context dimensions (grey boxes) to the associated steps of the protocol. The classification and propagation of uncertainty is indicated in red.

Part I: defining the modelling context
The first step is to characterise the modelling context: the precise research question and the knowledge and data that are available about the system being modelled (Table 1).

Step 1: Precisely define the research question
A research question typically arises from a larger debate, discourse, or decision problem: for example, a public, political or scientific debate, a participatory planning problem or an economic decision problem. A research question to be addressed by simulation analysis is supposed to contribute to this debate, even if answering it may not necessarily resolve the whole debate. Useful contributions can comprise very different questions (Edmonds et al., 2019;Epstein 2008): e.g. detailed, precise forecasts of future states of the world, statistical testing of explanatory models, but also exploring and stress-testing possible consequences of decision options (Berger and Troost 2014;Lempert 2019) or purely theoretical questions concerning hypothetical models themselves (theoretical exposition in the sense of Edmonds et al., 2019). It is paramount to be clear about what precise question the simulation analysis is supposed to answer and what precise argument it could contribute to the debate.

Step 2: Characterise requirements implied by research question
While typologies of model purposes exist (e.g. Edmonds et al., 2019; Epstein 2008), the understanding of commonly employed terms such as prediction, forecast, projection and exploration differs between scientific disciplines. Often, they are used inconsistently (Bray and von Storch, 2009), and all lack the necessary precision on some aspects relevant for methodological choices. Instead, Table 1a defines six dimensions to precisely describe the requirements imposed by a research question: The most basic consideration is the focus of interest: Does this lie in anticipating system behaviour in specific situations 7 (output-focus) or in describing or understanding system structure 8 (structure-focus)? Carefully defining the target situations is a necessary precondition for judging the degree of generalisation in the next step. Required resolution, required transparency as well as computational resource constraints impose limits on a priori model selection. Judging the robustness of conclusions requires understanding the required precision and accuracy (tolerable uncertainty) in simulation outcomes. At this point, it is often not yet possible to formulate this quantitatively (e.g., 2% deviation is acceptable); it should then be formulated in terms of consequences for conclusions (e.g., uncertainty should not affect the ranking of policy alternatives by evaluation criteria). Together, these dimensions define requirements that the simulation analysis aims to fulfil. Whether this is actually possible can only be judged at the end of the modelling process (see section 3.3 and Table 7).
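For documentation purposes, the six requirement dimensions named in this step could be recorded in a small structured template. The following Python sketch is one possible layout; the class and field names are our own choices, the entries are illustrative, and the exact wording of the dimensions in Table 1a may differ.

```python
from dataclasses import dataclass, fields
from enum import Enum

class Focus(Enum):
    """Focus of interest as distinguished in Step 2."""
    OUTPUT = "output-focus"
    STRUCTURE = "structure-focus"

@dataclass(frozen=True)
class ResearchQuestionRequirements:
    """Hypothetical template for the six dimensions; field names are assumptions."""
    focus: Focus
    target_situations: str
    required_resolution: str
    required_transparency: str
    computational_constraints: str
    tolerable_uncertainty: str

# Illustrative entries loosely based on the climate change example in section 2
requirements = ResearchQuestionRequirements(
    focus=Focus.OUTPUT,
    target_situations="farm household incomes under projected climate change",
    required_resolution="individual farm households",
    required_transparency="decision rules traceable by reviewers",
    computational_constraints="full scenario set must run on available hardware",
    tolerable_uncertainty="must not change the ranking of policy alternatives",
)

for f in fields(requirements):
    print(f"{f.name}: {getattr(requirements, f.name)}")
```

Note that the tolerable uncertainty is expressed here in terms of consequences for conclusions rather than as a numerical threshold, as recommended for this early stage.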

Step 3: Identify knowledge and data about structure and behaviour of the modelled system
In addition to the research question, the modelling context is defined by the available information about the simulated system in the form of structural and process knowledge, available observations of system behaviour (input-output trace data) as well as, in the case of an output-focus, the anticipated system input data for target situations. The next step is to identify which data, information and knowledge are available, can be obtained with reasonable effort or will remain unattainable for the analysis (e.g. input-output observations of far future system states) (Table 1b).

Part II: Context-adequate model and parameter selection and uncertainty documentation
Appropriate simulation models can be selected in two steps: In a first structural step, a set of candidate models and candidate parameter sets is constructed or identified whose theoretical characteristics comply with structural system knowledge and the requirements implied by the modelling context (Steps 4-6; Tables 2 and 3). A set of multiple candidates fulfilling the requirements represents the prior model uncertainty 9 (Steps 7-8; Table 4). In a potential second step, behaviour-based inference can possibly be used to ascribe empirical likelihood to the candidates, rank them and narrow down the candidate set, reducing prior to posterior model uncertainty (Beck et al., 1997) (Step 9; Table 4). Ideally, the two steps complement each other: The first step is key to ensure that only adequate candidates are considered in behaviour-based inference. Omitting this theory-based preselection can only be adequate if the simulation analysis is output-focused and the modelling context allows for the direct generalisation of statistical relationships (namely the expected predictive accuracy) to the target situations (representative and sufficiently redundant data) (Step 6i, Table 3). Only in this specific case, expected out-of-sample predictive accuracy and practical identifiability can be derived solely from the data and are sufficient criteria for model selection (Polhill and Salt 2017). Nevertheless, even for these direct generalisation cases, incorporating structural knowledge in chosen candidate models becomes more essential the scarcer the data: a defensible structure-based error model specification and pre-selection of candidate models increases practical identifiability (see e.g. Troost et al., 2022).
7 Such as in prediction, scenario analysis, counterfactual simulation, projection, or forecasts.
8 Such as in explanation, causal identification, or description.
For the second step, it is key to ensure the adequacy of the inference process itself (Steps 7-9, Table 4). Do the necessary preconditions discussed in section 2.1 hold in the given modelling context? Is the specific method chosen appropriate for the context? Is uncertainty properly considered and documented? If not, behaviour-based model inference is clearly not adequate.

Step 4: Representativity of data and degree of generalisation
The first step in model selection is to contrast the observed or observable data with the target situation of the research question to determine the degree of generalisation and extrapolation implied following the considerations on representativity and constancy (regime shifts, structural breaks, stationarity) discussed in section 2.1. This analysis requires a basic system conceptualisation (not yet a full conceptual model) that allows judging the system's degree of openness, internal stability, complexity, and stochasticity (Step 4, Table 2).

Step 5: ABM as composite models: structuring component context
While our protocol addresses modellers that are inclined to use an ABM, one question to ask in structural model choice is, of course, whether an ABM indeed suits the given modelling context or a different modelling approach is more promising. ABM are typically composite models (model systems), which are composed of lower-hierarchy models (components) that mirror relevant subsystems and processes. For example, they typically contain a model of individual agent behaviour based on the internal state of and external influence on the agent. This submodel for agent behaviour in turn may itself be a composite of lower-hierarchy components, e.g. for learning, demographics and economic decisions (Schlüter et al., 2017). ABM also typically contain models of agent interactions, e.g. communication, markets, auction, collective action or network models (Schreinemachers and Berger 2011). In addition, many ABM in natural resource management link to biophysical components that model responses of natural systems (e.g. a crop field or watershed) to agent intervention (Arnold et al., 2015). 10 System behaviour in an ABM emerges not only from the interactions between agents, but conceptually also from the interactions of individual model components. In general, such structure-rich composite models are typically used for structure-focused analysis or for output-focused analysis when direct generalisation from observed data is not possible (Nolan et al., 2009;Voinov and Shugart, 2013). In direct generalisation contexts, prediction is often achieved more efficiently with statistical or machine learning models (Polhill and Salt 2017). 11 The adequacy of a composite model relies on (i) an assembly of components that together fulfil the relevant premises for the overall research question to be answered, (ii) a careful assessment of the adequacy of each lower hierarchy component for its intended role in the composite, and (iii) a consistent consideration of the uncertainty in each component at the composite level (Arnold et al., 2015). It is important to realise that each component has its specific own question to answer and has its own specific modelling context, which may differ considerably from the modelling context of the composite as a whole or that of other components. Even if the ABM is used in an overall modelling context that is not apt for direct statistical inference, this does not rule out that within-model contexts of lower hierarchy components exist in which representative samples allow for direct generalisation and e.g. the use of machine-learning methods for these components. For example, we may not yet have observed how a specific group of farmers behaves and fares in a warmer climate, so we cannot empirically measure the predictive performance of a composite model that simulates potential future farmer behaviour and welfare. We may, however, be able to include a plant growth component into this composite model that can be tested based on observations and experiments in a range of warmer and colder regions if we consider this range representative for potential future growth conditions.

Figure: Tracing the influence of the modelling context on adequate decisions in and conclusions from simulation analysis. Conceptual basis and structural overview of the protocol. (Note: Numbers refer to steps in the protocol. Letters refer to the 11 characterizing dimensions of the modelling context. Blue boxes refer to stages of the modelling process. Uncertainty classification and propagation is printed in red. Arrows link context dimensions to the modelling stages in which they influence decisions. Colours of arrows help to visually trace crossing connections, but have no deeper significance.)

9 While we use terminology (prior, posterior uncertainty) borrowed from Bayesian statistics here, this does not mean that this uncertainty can necessarily be cast into a formal prior probability distribution. More often than not, it cannot and it may well only be qualitative descriptions of uncertainty (cf. also Beck et al. 1997 for this general use).

Table 1 KIA Protocol, Part I: Guiding questions and categories for analysing and describing the modelling context and their relevance for the analysis.
Columns: Dimension; Questions; Categories; Relevance

Step 1: Define the research question

Step 2: Analyse the research question and derive requirements for modelling

Dimension: Focus of interest
Questions: Are we interested in anticipating system behaviour for specific situations (output-focus)? Or are we interested in learning about inner system structure, processes, and relationships (structure-focus)?
Categories: (i) output-focused (for example, prediction, projection, scenario exploration) (ii) structure-focused (for example, explanation, description, or causal inference)

Questions: How open and stable is the system studied? How complex and how stochastic is its response?
Relevance: To determine representativity of data and degree of generalisation (step 4) and select a model (step 6)

Questions: How well are system structure and processes known? How strong is the prior evidence for a specific system structure and process model?
Relevance: For structure-based model selection (step 6) and specification of prior uncertainty (step 7)

Dimension: Obtainable system input-output observations
Questions: Which observational data is available that traces system behaviour by relating system input and system output? Which could be obtained within the allocated time and resource frame? Which domain does it cover?
Categories: An overview and characterisation of the potentially available data
Relevance: Used for assessing degree of generalisation (step 4), structural identifiability (step 8) and behaviour-based inference/practical identifiability (step 9)

Dimension: Input data for target situation (for output-focused RQ)
Questions: How well can boundary conditions and initial system state (model input data, X) be anticipated for target situations?
Relevance: Used to select an appropriate method for predictive analysis in step 11

Table 2 KIA Protocol, Part IIA: Guiding questions for basic considerations before structure-based model selection.
Columns: Item; Guiding questions and actions; Outcome

Step 4: Determine representativity observed/observable system behaviour and degree of generalisation
Item: Representativity of observed system behaviour for target situations; Potential type of generalisation
Guiding questions and actions: Given system understanding and available data from step 3: Does the RQ imply extrapolation/generalisation from observed system behaviour? Do we have to expect (potentially) structural breaks, regime shifts, non-stationarities between observed and target situations? Can the data be considered representative of all target situations implied by the research question given the characteristics of the system? Has the external influence on the system been observed in all relevant dimensions and across the relevant domain? Have low probability events (likely) been observed in the sample (Filatova et al., 2016)? Does the sample suffer from bias or confounding with unobserved heterogeneity? Can it be corrected by weighting, error clustering, etc.? (Vandecasteele and Debels 2007;Gangl 2010;Gormley and Matsa 2014;Jager et al., 2020;Smith 2020) Considering the above, is direct generalisation of statistical relationships from observations to target situations potentially possible?
Outcome: A well-argued decision for either of
- No generalisation implied (if target situations fully contained in observed data)
- Direct generalisation potentially possible (if data sufficiently representative for target situations, possible biases correctable, no structural breaks/regime shifts etc.)
- Direct generalisation not possible (if data not representative e.g. due to structural breaks, regime-shifts, etc.)
(used in step 5 and 6 for structure-based model selection and step 8 and 9 to decide on behaviour-based inference)

Step 5. Structure the modelling tasks into model components
Item: Component definition
Guiding questions and actions: Structure the modelling task into components (and functional links between them) (e.g. agent behaviour, interactions, environmental response, market mechanisms). Identify and characterise the specific modelling context of each component by running through steps 1-4 for each component. Possibly: Start with a tentative structuring and iteratively run through steps 1-6 or even steps 1-10 until finding a satisfactory structuring.
Outcome: A structuring of the simulation task into components and the characterisation of the modelling context for each component (basis for structure-based selection of component models in step 6 and possibly behaviour-based inference of these components in steps 8 and 9)
An important step hence is to structure the overall modelling task into subcomponents and then recursively revisit the steps of the protocol also for each component individually (Step 5, Table 2). This step may often not directly result in the final structure, but may involve various iterations through steps 4-10 until an adequate composite structure for the overall modelling context has been established, which may or may not involve an ABM.

Step 6: Choosing structurally adequate candidate models and prior parameter ranges for each component
The guiding questions in Table 3 (step 6, items i-ix) help to check potential model (component) candidates for context-adequacy from a structural point of view. The table also lists selected literature sources that provide formal tests or more in-depth discussions of each question.
For adequate structure-based model selection, it is useful to first sketch a comprehensive conceptual system model (Argent et al., 2016), even if not all system processes can or finally have to be included in the simulation model. This conceptual sketch can serve as a benchmark to check a candidate's match of the domain of applicability and sufficient completeness of processes for the target situations (Parker et al., 2008). It must be ensured that model structure and parameters fixed in the candidate are also expected to be constant (no change over time) and invariant (unaffected by policy, treatment, change to target situation) in the real-world system (Lucas 1976;Engle and Hendry 1993;Hendry 1996). Relevant changes between situations must be captured as exogenous input or result from internal feedback in the model. It is not always possible to explicitly simulate all potential real-world feedback in the model itself, but it should then at least be possible to capture potential feedback as changing boundary conditions that may then later be assessed in uncertainty analysis (Troost and Berger 2015b;Troost et al., 2022) (Table 3, ii-iv).
Expected deviations, i.e. the part of the system behaviour that is not explained or predicted by the model from a theoretical point of view, should be consistent with the precision and accuracy required by the research question (Table 3, v). Research questions requiring accuracy with respect to an absolute reference necessitate not only a high degree of model completeness with respect to all systematic processes, but also with respect to probability distributions for unsystematic effects as well as reliable system input data for target situations. Research questions requiring accuracy only with respect to the relationships between simulated target situations demand model completeness only with respect to systematic differences. Simplifying assumptions (such as optimising agents in our example introduced in section 2.4) may lead to systematic over- or underestimation (bias). This is not problematic as long as major conclusions drawn from the simulation analysis will not depend on such simplification (robustness to the relaxation of simplifying assumptions, no model artefacts). 12,13 Logical consistency, correct technical implementation, and fit to the required resolution, transparency and resource constraints are obvious preconditions that must be assessed even if the component context allows for direct generalisation (Table 3, vi-ix).

Steps 7 and 8: Documenting prior and input data uncertainty and assessing structural identifiability
Structure-based model selection typically results in a number of plausible model structures and parameter values. This prior uncertainty should be documented (even if not all plausible alternatives can be implemented and tested) (Step 7, Table 4). The first step in determining whether behaviour-based inference can reduce this prior uncertainty then is to assess the structural identifiability of candidates in the observed range of data, i.e. check whether the behaviour of candidate models differs in the domain for which the data is representative (Step 8, Table 4). A variety of analytical and numerical approaches to assess structural identifiability are available (Guillaume et al., 2019;Chis et al., 2011) including numerical parameter screening methods from sensitivity analysis (Campolongo et al., 2007;Troost and Berger 2015a).
Not only uncertain parameters and structure in the model itself, but also uncertain auxiliary parameters or assumptions (e.g. error distributions for expected deviations and measurement error in input data, imputation to deal with incompleteness in the data, 14 alternative choices in data curation, preparation or aggregation) must be documented and considered when assessing identifiability. Structural identifiability may differ between parameters of the same model: Some parameters can be structurally identifiable in the available data (see Appendix A.1), while others are not and their uncertainty cannot be reduced by behaviour-based inference (e.g. Troost and Berger 2015a). Structural non-identifiability cannot be resolved by more of the same data, but requires either widening the range of situations observed or considering more dimensions of the data.
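The point that structural non-identifiability is a property of the observed domain, not of the sample size, can be illustrated numerically. The sketch below is an illustration under stated assumptions: the two candidate response functions, the tolerance and the helper names are hypothetical, and a simple maximum output gap stands in for the formal identifiability and screening methods cited above.

```python
from itertools import combinations


def max_output_gap(model_a, model_b, inputs):
    """Largest absolute difference between two candidates over the given inputs."""
    return max(abs(model_a(x) - model_b(x)) for x in inputs)


def distinguishable_pairs(candidates, inputs, tolerance):
    """For each pair of candidates: do their outputs differ by more than `tolerance` anywhere in `inputs`?"""
    return {
        (name_a, name_b): max_output_gap(f_a, f_b, inputs) > tolerance
        for (name_a, f_a), (name_b, f_b) in combinations(candidates.items(), 2)
    }


# Hypothetical candidates: identical below x = 5, different above.
candidates = {
    "linear": lambda x: 2.0 * x,
    "saturating": lambda x: 2.0 * x if x < 5.0 else 10.0,
}

observed_domain = [0.5 * i for i in range(9)]   # inputs 0.0 .. 4.0
wider_domain = [0.5 * i for i in range(21)]     # inputs 0.0 .. 10.0

print(distinguishable_pairs(candidates, observed_domain, tolerance=0.1))
print(distinguishable_pairs(candidates, wider_domain, tolerance=0.1))
```

Sampling the observed domain more densely would not change the first result, because both candidates coincide everywhere below the saturation point. Only the wider range of situations separates them.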

Step 9: Choosing adequate methods for behaviour-based inference and measurement of predictive accuracy
If structural identifiability is given or direct generalisation is possible, one can choose an adequate method for behaviour-based inference (Step 9, Table 4). If not, it is sometimes still useful to measure sample predictive accuracy of candidates and compare it against a null model to ensure the models do not completely go astray.
Behaviour-based inference requires choosing a loss function (a metric to weight deviations between observed and simulated behaviour) and an algorithm to characterise the distribution of the loss function over candidates (exploration/estimation of posterior parameter distribution) or find the candidate with the optimal loss function value (optimisation, calibration).

12 The "Lucas critique" (Lucas 1976) is a famous example in economics for a challenge to modelling practice based on these grounds.
13 Conclusions that are based on comparing model results to asymmetrical, one-sided thresholds even get stronger if the methodological approach is biased against them. Conversely, they are weakened by biases in their favour, especially if these cannot be precisely quantified and corrected. This principle mirrors the conservative rationale in statistical hypothesis testing: Type II errors, false-negatives, are preferred over type I errors, false-positives.
14 A frequently encountered example in agricultural ABM would be a parameter used in imputing cash reserves of farm agents (which are typically unobserved or undisclosed) at simulation start from observed characteristics such as farm size, location, land use or livestock ownership.

Table 3 KIA Protocol, Part IIB: Checklist and formal methods for structure-based model selection and structural validation. The third column indicates selected literature sources for further reading that expand on the relevant theory or suggest formal tests for the assessment of the questions.
Step 6. Identify structurally adequate candidate models and (prior) parameter ranges for components by accepting or rejecting possible candidates based on the following checklist (using suitable formal methods listed in the third column if available)
Columns: Item; Guiding questions and actions; Formal methods

Item: (i) Data or theory-driven approach?
Guiding questions and actions: Is the analysis output-focused (step 2) and no generalisation is implied or direct generalisation is possible (from step 4)? If yes, a data-driven model structure selection approach (or machine learning approach) can be chosen as long as it can also fulfil the transparency requirements (step 2) and sufficient data for practical identification is available. In this case, the checklist items marked with * can be skipped as reliance on statistical model structure selection methods such as cross-validation, AIC (see step 9) is sufficient. (One may still opt to go for a structure-based approach.)

Item: (v)
Guiding questions and actions: Are the expected deviations (residuals, bias) of the candidate model a priori (from a theoretical perspective) consistent with the precision and accuracy (certainty, relativity, symmetry) required by the modelling context (as identified in step 2)?

Item: (vi) Match of effective resolution
Guiding questions and actions: What is the effective a (temporal, spatial, thematic) resolution of the candidate? Does the effective resolution of the candidate model match the required resolution of the modelling context (as identified in step 2)?
Formal methods: Aumann (2007)

(Pielke 1991;Laprise, 1992;Klaver et al., 2020). In this case, the effective resolution is the size of this neighbourhood. As an extreme example, consider the case when the spatial allocation in a nominally 1 ha grid model is purely based on land classes and all cells of the same class show the same behaviour (or just differ randomly following class-specific probabilities) without any further location or neighbourhood effects. The effective resolution is then 'land class polygons' and not '1 ha grid cells'. Similar considerations apply for temporal, thematic and 'social' resolution (e.g. individual, household, village, district).

Table 4 KIA Protocol, Part IIC: Guiding questions for model inference from observed system behaviour and documenting model uncertainty.
Columns: Item; Guiding questions and actions; Outcome

Step 7. Describe prior uncertainty comprehensively: List all candidate models, candidate parameters, error parameters and data uncertainty
Item: Documenting prior uncertainty
Guiding questions and actions: Which candidates for model structure and parameter values were identified in structure-based model selection (step 6)? Which parts of the candidates have to be considered uncertain and in principle adaptable/estimable using the data? Can this uncertainty be quantified as a prior probability distribution? Which additional uncertainty has to be considered and reflected as (potentially unstable) parameters during estimation (e.g. uncertainty in observations, imputation of data, alternative choices in data preparation, classification and aggregation, expected deviations)? Which potential candidates are ignored in the analysis (unmodelled uncertainty)?
Outcome:
- List of model structures and parameter ranges used to represent model uncertainty in further analysis (and potentially estimated by behaviour-based inference) (→ used in step 8, 11)
- List of auxiliary parameters used to represent data and data preparation uncertainty (→ used in step 8, 11)
- Ranges or, if available, prior probabilities for these models and parameters (→ step 8, 9, 11)
- List of alternative models and parameter ranges theoretically suitable, but not explored in the analysis (→ used in step 12)
- List of critical assumptions for which no alternative assumptions will be tested during the further analysis (→ used in step 12)

Step 8. Assess structural identifiability of candidate models in the population/domain represented by the observed sample (possibly omit if data-driven model selection has been chosen in step 6 and appropriate methods for statistical model selection are used in step 9)
Item: Structural identifiability
Guiding questions and actions: Is the difference between predictions of two candidates in the observed domain sufficient to distinguish them at a relevant order of magnitude? Are outcomes unique to a candidate or do different candidates produce the same outcome? If not (not identifiable): Can we employ additional relevant dimensions (variables) of the observed data? Can we subdivide the model into components/parameter groups that are identifiable? Can we reparameterise the model by aggregating unidentifiable ones to identifiable ones without violating structural knowledge on parameter stability? If yes, do and reassess identifiability.
Outcome: List of parameters or model structures that cause detectable differences within the domain of the benchmark data available for model inference and are hence structurally identifiable (→ step 9)
a) identified from a theoretical perspective (e.g. Guillaume et al., 2019;Chis et al., 2011)
b) identified using specific sensitivity analysis to identify parameters that have an effect on those outcomes that can be compared with observations (e.g. Campolongo et al., 2007;Troost and Berger 2015a)

Step 9

Item: (ii) Benchmarking
Guiding questions and actions: Choose a proper benchmark/null model that reflects the best simple alternative model (e.g. sample average, random allocation, trend extrapolation) (Schaeffli and Gupta, 2007;Pontius and Millones 2011). Include it in the analysis either by explicit inclusion in the set of candidate models (Grimm and Railsback 2012) or implicitly by using it to calculate an absolute goodness-of-fit measure (model efficiency) from the loss function.

Item: (iii) Practical identifiability (a priori)
Guiding questions and actions: Can we at all expect the available data to be able to discriminate between the candidate model structures and parameter ranges? Are there enough degrees of freedom for the complexity of the model and assumed error terms? Does the data contain sufficient independent, unconfounded variation of input variables (absence of multicollinearity) so that main and interaction effects of input variables implied by candidate models can be disentangled (e.g. assess using variance inflation factors)? Is the whole domain well represented in the data or are we likely to have a strong influence of outliers?
Outcome: A first quick assessment whether practical identifiability can at all be expected and it is worth to try model inference from the data.

Adequate choice of loss function or likelihood. Loss functions (Step 9i, Table 4) are used to weight deviations between simulations and observations by severity. From a decision-theoretic point of view, loss functions should more strongly penalise those errors that would lead to stronger changes in conclusions. In direct generalisation cases and when sampling error has been controlled for (e.g. by cross-validation, see below), the measured loss can be directly generalised to target situations. Hence, in this case, one can choose a loss function that is limited to output variables of interest and whose weighting directly reflects the precision, accuracy, relativity and symmetry required by the research question (see Step 2) penalising misclassifications based on their practical implications (e.g. prefer models with stronger deviations overall, but high reliability in critical areas) (Manderscheid 1965;Berger 1980;McCloskey 1985;Farahmand et al., 2017;Manski 2019). 15 In indirect generalisation cases and structure-focused analysis, loss functions must reflect the impact of model errors on our confidence that the candidate reflects underlying system processes. In this case, loss functions should reflect the expected deviations of the model including sampling error, model bias and error correlation (Schoups and Vrugt 2010) regarding all observed output variables linked to the modelled mechanisms 16 : Theoretically anticipated deviations of candidate models are considered less severe than deviations unlikely to occur if the model predicts according to its theoretically expected precision (Hansen and Heckman 1996;Blavatskyy and Pogrebna, 2010). For example, if a model is designed to predict an upper bound, underestimation of observations should be penalised, overestimation not. 17 If the model is expected to be well-specified and implies a well-defined tractable stochastic error distribution, a parametric likelihood function can be formulated. Using parametric likelihoods in cases where their underlying assumptions are not fulfilled or in doubt leads to biased model selection and overconfident conclusions (Beven et al., 2008;Stedinger et al., 2008).
Robust loss functions allow for occasional outliers potentially generated by processes not captured in the model (Willmott and Matsuura 2005;Hyndman and Koehler 2006). If the model is expected to capture the essential systematic relationship, but the exact error distribution is unknown or intractable, summary statistics that capture relevant systematic relationships can be estimated on both observations and model output. A loss function can then be applied to the difference in the summary statistics rather than the individual observations (Classical and Bayesian indirect inference: Chen et al., 2012;Beaumont 2010;Drovandi et al., 2015). Pattern-Oriented Modelling generalises this principle to incorporate more qualitatively described strong and weak statistical patterns (Grimm and Railsback 2012). In other cases, qualitative criteria are used to define binary-valued acceptance functions (Spear and Hornberger 1980;Troost and Berger 2015a).
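How the choice between a squared and an absolute loss changes model ranking in the presence of an outlier can be seen in a toy comparison. This is a minimal sketch with made-up numbers; the two candidate predictions are hypothetical, and mean absolute error is used here only as one simple example of a more outlier-tolerant loss.

```python
def mean_squared_error(observed, predicted):
    """Average squared deviation; large errors dominate."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


def mean_absolute_error(observed, predicted):
    """Average absolute deviation; less sensitive to single large errors."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical observations: four regular values and one outlier that the
# modelled process is not meant to explain.
observed = [1.0, 1.0, 1.0, 1.0, 11.0]

candidates = {
    "fits_bulk": [1.0] * 5,          # exact on regular values, misses the outlier
    "pulled_by_outlier": [3.0] * 5,  # compromises on every value
}

best_by_mse = min(candidates, key=lambda n: mean_squared_error(observed, candidates[n]))
best_by_mae = min(candidates, key=lambda n: mean_absolute_error(observed, candidates[n]))
print(best_by_mse, best_by_mae)
```

Under the squared loss the candidate that is wrong everywhere is preferred (16 against 20), while the absolute loss prefers the candidate that reproduces the regular observations (2 against 3.2). Which ranking is adequate depends on whether the outlier belongs to the process the model is supposed to capture.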
Often, absolute goodness-of-fit measures (e.g. model efficiencies) are used instead of pure loss functions or likelihoods (Step 9ii, Table 4). The latter provide a relative ranking between candidate models, but their absolute values are specific to the sample used. Absolute goodness-of-fit measures do not change the relative ranking, but take the sample variance into account in order to allow comparison between models estimated from different samples (Bennett et al., 2013;Hauduc et al., 2015). Implicitly, efficiency criteria compare the evaluated model with a benchmark or null model that employs only basic information of the data. R 2 and Model Efficiency, for example, contain the sample average as a null model. This null benchmark should be carefully chosen. The sample average is only one possible choice. Trend extrapolation, random allocation, or seasonal or group-specific averages are often more adequate benchmarks (Schaeffli and Gupta, 2007;Pontius and Millones 2011). As an alternative, Grimm and Railsback (2012) suggest always explicitly including a benchmark null model among the candidates.
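The dependence of an efficiency measure on the chosen benchmark can be sketched as follows. The data, the model predictions and the trend benchmark are invented for illustration, and the function `efficiency` is a generic form (one minus the ratio of model loss to benchmark loss) assumed here, not a formula taken from the cited sources. With the sample average as benchmark and squared error as loss it coincides with the familiar model efficiency.

```python
def sum_squared_error(observed, predicted):
    """Total squared deviation between observations and predictions."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))


def efficiency(observed, predicted, benchmark):
    """1 - loss(model) / loss(benchmark): 1 is perfect, 0 equals the benchmark,
    negative is worse than the benchmark. Undefined if the benchmark is perfect."""
    return 1.0 - sum_squared_error(observed, predicted) / sum_squared_error(observed, benchmark)


# Hypothetical observations with a strong upward trend.
observed = [2.0, 4.1, 5.9, 8.2, 9.8]
model_prediction = [2.5, 3.5, 6.5, 7.5, 10.5]

# Benchmark 1: sample average (the implicit null model of R2-type measures).
sample_mean = sum(observed) / len(observed)
mean_benchmark = [sample_mean] * len(observed)

# Benchmark 2: trend extrapolation, assumed to be known from earlier data.
trend_benchmark = [2.0 * (t + 1) for t in range(len(observed))]

efficiency_vs_mean = efficiency(observed, model_prediction, mean_benchmark)
efficiency_vs_trend = efficiency(observed, model_prediction, trend_benchmark)
print(efficiency_vs_mean, efficiency_vs_trend)
```

Against the sample average the model looks excellent (efficiency close to 0.95), but against simple trend extrapolation it is clearly worse than the benchmark (negative efficiency). The same model and data thus support opposite impressions depending on the null model.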

Adequate assessment of practical identifiability and posterior uncertainty.
It is paramount to document uncertainty in measured predictive accuracy and model rankings and to assess how reliable the data could discriminate between candidates (practical identifiability) (Step 9 iii, iv, Table 4). Methods for behaviour-based inference considerably differ in the extent to which uncertainty in the selection process is characterised and to which prior uncertainty is considered and it is important to select a (combination of) method(s) whose premises fit the application case (Table 5). For example, classical minimum-loss or maximum likelihood-based parameter estimation presuppose that both the likelihood and the model structure are certain and correctly specified and all considered candidate parameterisations are a priori equally likely (Stigler 2007). They identify one best fitting model and limit quantification of posterior uncertainty to confidence intervals for parameters. While large confidence intervals point to low practical identifiability, they cannot conceptually be interpreted as posterior probabilities for parameters. Bayesian frameworks (Hobbs and Hilborn 2006) can overcome the latter limitations if prior probabilities are specifiable.

Table 4 (continued)
Columns: Item; Guiding questions and actions; Outcome

Item: (iv) Choice and application of an algorithm for behaviour-based inference
Guiding questions and actions: Does the chain of methods/algorithms chosen …
a) … consider all (operational) alternative model formulations and parameter sets (from step 6)?
b) … adequately consider prior evidence/probability of model structures and parameter values (if available from step 7)?
c) … consider and deal with biases in a priori identifiability of models in a sample, e.g. using information criteria (AIC, BIC), k-fold cross-validation?
d) … quantify the effect of sampling error and the uncertainty in the inverse modelling process (e.g. in the form of confidence intervals, credible intervals, joint posterior parameter distributions, bootstrapping, cross-validation, by diagnostic tools such as VIF, Cook's distance, etc.)?
e) … not rely on assumptions (e.g. certainty of model structure, well-specified likelihoods, practical identifiability) that are not fulfilled in the given context (cf. Table 5)?
Outcome:
- Potentially: A strategy for the evaluation of posterior model uncertainty (potentially the identification of a best model), potentially combining various algorithms and diagnostic tools.
- Potentially: The result of applying this strategy to the candidate models and parameter values using the available system I/O observations
- Alternatively: the decision to not pursue behaviour-based inference and continue without being able to reduce prior uncertainty
- Potentially: The expected predictive accuracy of the candidate models in predicting situations for which the available I/O data is representative (possibly put in relation to the expected predictive accuracy of a simple benchmark).

15 In the direct generalisation case: If we are interested in predicting deforestation, for example, then we can focus on the ability of the model to predict changes from forest to some other land use, without caring whether it also correctly predicts the new land use or changes among non-forest land use classes. (We thank Judith Verstegen for this example.)
16 In the indirect generalisation case: Even if we are only interested in predicting deforestation, but the mechanisms that we have to trust to anticipate developments in unseen situations are supposed to also determine changes in other land uses accurately, then deviations in predictions of these other variables also undermine our trust in predicting deforestation. Since we cannot assume that predictive accuracy on deforestation observed in the sample is the same in the future, this holds even if prediction of deforestation in the sample is accurate.
17 Bayes estimators allow combining a loss function for relevant errors in model application with a likelihood for the posterior probability of the model (Bassett and Deride 2019).
K-fold cross-validation 18 is the essential non-parametric method to quantify sampling error in estimated expected loss or predictive accuracy for unseen situations from a sample (Browne 2000;Arlot and Celisse 2010;Bennett et al., 2013;Vehtari et al., 2017). It should be combined with any of the basic inference methods and also avoids the complexity bias when model structures are uncertain: Selecting model structures purely based on predictive accuracy measured in one sample is biased towards models with a higher number of freely adaptable parameters, which increases the danger of overfitting. Adequate model inference requires correcting this bias, e.g. by k-fold cross-validation. Only when parametric likelihoods are applicable (see above), information criteria (AIC, BIC, DIC, WAIC) or formal Bayesian frameworks with appropriately specified prior likelihoods (Burnham and Anderson 2004;Ward 2008;Vehtari et al., 2017) provide an alternative.
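The complexity bias described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the synthetic data, the family of candidate models (piecewise-constant bin means with an increasing number of free parameters), and the function names `fit_bin_means`, `mse` and `kfold_mse` are hypothetical and not taken from the protocol.

```python
import math
import random
import statistics


def fit_bin_means(xs, ys, n_bins):
    """Piecewise-constant model on [0, 1): one free parameter (a mean) per bin."""
    overall = statistics.fmean(ys)
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins without training data fall back to the overall training mean.
    return [s / c if c else overall for s, c in zip(sums, counts)]


def mse(params, xs, ys):
    """Mean squared prediction error of a fitted bin-mean model."""
    n_bins = len(params)
    return statistics.fmean(
        (params[min(int(x * n_bins), n_bins - 1)] - y) ** 2 for x, y in zip(xs, ys)
    )


def kfold_mse(xs, ys, n_bins, k=5, seed=1):
    """Return mean and standard deviation of the held-out MSE over k folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)  # same folds for every candidate model
    errors = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        params = fit_bin_means([xs[i] for i in train], [ys[i] for i in train], n_bins)
        errors.append(mse(params, [xs[i] for i in fold], [ys[i] for i in fold]))
    return statistics.fmean(errors), statistics.stdev(errors)


# Synthetic "observations": a smooth signal plus noise (purely illustrative).
rng = random.Random(42)
xs = [rng.random() for _ in range(60)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3) for x in xs]

bin_counts = [1, 2, 4, 8, 16, 32]  # nested candidates of increasing complexity
in_sample = [mse(fit_bin_means(xs, ys, b), xs, ys) for b in bin_counts]
cross_val = [kfold_mse(xs, ys, b) for b in bin_counts]

for b, fit, (cv_mean, cv_sd) in zip(bin_counts, in_sample, cross_val):
    print(f"bins={b:2d}  in-sample MSE={fit:.3f}  5-fold CV MSE={cv_mean:.3f} (sd {cv_sd:.3f})")
```

Because the candidates are nested, the in-sample error can only fall as parameters are added, so ranking by in-sample fit always favours the most complex candidate. The held-out error does not share this property. The spread across folds gives only a rough indication of sampling error, not a formal standard error.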
Statistical diagnostics for influential observations (e.g. Cook's distance) and multicollinearity in the data (e.g. variance inflation factors) common in econometric analysis should complement the analysis of posterior uncertainty.
Fig. 1 illustrates how an adequate modelling process structures, quantifies and potentially reduces uncertainty: The definition of a research question divides uncertainty regarding the research question from uncertainty about wider implications in the debate. Theory-based model selection structures the uncertainty about the research question into prior model uncertainty (represented by different candidate model structures and parameter ranges), input uncertainty (uncertainty in boundary and initial conditions), expected deviation (error terms, bias, aleatory uncertainty) and unmodelled uncertainty (alternative models not included in the analysis, 19 processes that have been ignored, potential exogenous events not considered, non-formalised error terms, unforeseeable events, critical assumptions for which no alternatives are tested, etc.). If applicable and successful, behaviour-based inference potentially reduces prior model uncertainty to posterior model uncertainty. If discrimination of candidate models by data is not possible, the posterior uncertainty remains the same as the prior uncertainty.

Part III: Adequate derivation and interpretation of simulation results and uncertainty
In structure-focused analysis (description, explanation), the resulting posterior model uncertainty is already the final uncertainty to be interpreted for conclusions. In output-focused analysis (prediction, scenario analysis, exploration), posterior uncertainty and input uncertainty still need to be translated into predictive uncertainty for target situations (e.g. future or policy scenarios) by simulation experiments that include uncertainty analysis.
18 The traditional separation of data into one training and one validation dataset is the most basic form of cross-validation, but is subject to sampling error itself. K-fold cross-validation is the more robust extension.
In an adequate modelling process, in which uncertainty is properly analysed and propagated, the final posterior/predictive uncertainty and the unmodelled uncertainty describe the actual state of knowledge regarding the research question that can be defensibly extracted from the available data and structural system knowledge. This final model uncertainty can then be compared with the precision required by the research question for interpretation and derivation of conclusions.

Table 6 KIA Protocol, Part III: Guiding questions for the derivation of predictive uncertainty and the interpretation of results.

Step 10: Interpret posterior uncertainty and expected predictive accuracy (if applicable)

Item: (i) Interpreting expected predictive accuracy (if measured)
Guiding questions:
-What is the effect of sampling error on predictive accuracy (measured e.g. via cross-validation, bootstrapping, post-regression diagnostics) and how does it influence interpretation (considering step 2)?
-Is there a bias in predictions that points to systematic model error (do disaggregate analysis of residuals!)?
-How do model predictions compare with the benchmarks?
-Have the limits to generalisability been respected (e.g. statements only relative to models included in the analysis and within the bounds of representativity of the sample used)?
Outcome:
-An indication to what extent the models capture the observed variation in the sample of system behaviour and whether it shows systematic biases.
-An estimate of the possible effect of sampling variance on measured predictive accuracy.
-Possibly: A qualitative judgment on the predictive accuracy (high, low, sufficient, etc.) based on an explicit and well-justified benchmark scale (e.g. restricted to comparison to a null model, long-term experience with similar models in similar situations) and the required precision derived from the research question (from step 2).
(all to be used in step 11 and 12)

Item: (ii) Interpreting posterior uncertainty and the results of model inference (if applicable)
Guiding questions: Considering identifiability, posterior uncertainty (from step 9) and unmodelled uncertainty (from step 7):
-Does the posterior uncertainty, if measured in step 9, provide complete information about the effect of sampling error and practical identifiability of candidates (considering choice of method in step 9)?
-Was it possible to reduce prior uncertainty through inverse modelling?
-Can candidates (model structures, parameter values) be eliminated because we can clearly rule them out as implausible or highly unlikely a posteriori?
-Were parameters identifiable?
-Which alternative model formulations must be considered plausible enough to include into further analysis?
Outcome:
-In structure-focused analysis: An interpretation of the evidence about system structure, cause-effect chains or influential system input that could be gained through the analysis which properly reflects the associated posterior uncertainty and plausible alternative model formulations. (→ step 12)
-In output-focused analysis: A set of models/parameter distributions for use in subsequent predictive simulation that reflects posterior uncertainty and does not neglect plausible alternative models and parameter estimates. (→ step 11)

Step 11: Choose a simulation design for and run predictive simulations and analyse predictive uncertainty (if the analysis is output-focused)

Item: Design of predictive simulation experiments
Guiding questions:
-Does the chosen design globally and representatively consider the full posterior model uncertainty as well as (scenario) input uncertainty and assess its effect on predictive outcomes?
-Is a form of prediction resp. method of sensitivity or explorative analysis chosen that is consistent with the level of uncertainty in the model and scenario input (see Table 5)?
-Does the assessment of predictive uncertainty focus on the simulated quantities relevant to the research question?
-Does it focus on the degree of accuracy, precision, conditionality, relativity and symmetry relevant to the research question (step 2)? (For example, in policy analysis does it focus on the robustness of the policy effect rather than the uncertainty in unconditional prediction?)
Outcome: A design for and the outcomes of simulation experiments that …
… focuses on quantities and accuracy relevant for the research question (from step 2)
… controls for the effect of aleatory uncertainty (e.g. by common random numbers schemes, e.g. Troost and Berger 2016, convergence over a large number of repetitions, assessments of case-wise or stochastic dominance)
… and …
… covers the uncertainty space globally and representatively (Saltelli and Annoni, 2010) at a sampling rate adequate for the computational resources. (Consider efficient designs such as Sobol' sequences or LHS, see Tarantola et al., 2012)
… or alternatively a comprehensive search for non-robust outcomes or strong deviations over the global uncertainty space (e.g. destructive verification, Midgley et al., 2007; stress testing and red-teaming, Lempert 2019).
If uncertainty is non-negligible, conduct global sensitivity analysis to detect which uncertain input factors have highest influence on output uncertainty (Helton et al., 2006; Campolongo et al., 2007; Saltelli et al., 2008, 2019; Borgonovo and Plischke 2016; Ligmann-Zielinska et al., 2020; Puy et al., 2021).

Step 12: Final interpretation, derivation of conclusions and documentation

Item: Conclusions
Guiding questions:
-Is the communication of simulation outputs consistent with the level of uncertainty in model and scenario input (see Table 5)?
-Comparing the final predictive resp. posterior uncertainty (step 11, resp. 10) and the unmodelled uncertainty (step 7) with the precision and accuracy required by the research question (step 2): Which conclusions are possible?
-Are all the premises underlying the final conclusions clearly laid out (including assumptions on system complexity, alternative models, identifiability, representativity, error models etc.) and substantiated using the criteria set out in the previous steps?
-Is the posterior/predictive uncertainty fully documented and discussed?
-Which of these premises are critical to maintain the conclusions?
-Does any theoretical or measured bias weaken or strengthen conclusions?
-Is there a clear delineation between what has been modelled with respect to the targeted question and the analysed target situations and what is further speculation in the context of the wider debate but not solely based on the discussed simulation analysis?
Outcome: A summary of the results of running through the protocol explaining …
… the purpose of the analysis and model (e.g. for the introduction of an article and the purpose section of the ODD protocol)
… a summary justification of model and method choice following the steps, criteria and premises set out in the previous steps of this protocol (e.g. for the Methods & Results sections of an article, or for the Appendix)
… the conclusions building on the comparison of model results and final uncertainty to research question requirements (e.g. for the Discussions and Conclusions sections of an article)
… a documentation of prior (step 7), posterior (step 10) and predictive uncertainty (step 11), sensitivity to inputs (step 11) and specifically unmodelled uncertainty, i.e. critical and potentially value-laden assumptions for which plausible alternative assumptions could not be comprehensively tested in the analysis (step 7), e.g. following the schemes of NUSAP (van der Sluijs, 2017; Kloprogge et al., 2011), sensitivity auditing (Saltelli et al., 2013)

3.3.1. Step 10: Interpretation of predictive accuracy and posterior uncertainty
If sampling error has been properly controlled for (e.g. by cross-validation), expected predictive accuracy indicates how well the model predicts or explains the variation in the population of situations for which the sample is representative (subject to the importance weighting embodied in likelihood or loss function). This is valuable information in its own right. However, whenever using this information to draw further conclusions (Step 10i, Table 6), e.g. about the model being "sufficiently good" or the "correct" or "best explanation", care has to be taken (Oreskes et al., 1994). Even though absolute goodness-of-fit measures such as model efficiencies project predictive error onto an absolute scale between null model and perfect fit, defining any threshold to indicate 'sufficient fit' on this scale remains subjective or based on convention, similar to significance levels in statistical analysis, unless this threshold can be convincingly derived from the research question (Pontius and Millones 2011). The same holds for thresholds defined on posterior densities or relative differences in information criteria (Stephens et al., 2005).
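As a minimal sketch of what such a model efficiency looks like, the snippet below uses one common form (one minus the ratio of the model's squared error to that of a benchmark, as in the Nash-Sutcliffe efficiency). The function name `model_efficiency` and all numbers are invented for illustration. The example also shows that the resulting value depends on which benchmark defines the zero point of the scale.

```python
import statistics


def model_efficiency(observed, predicted, benchmark=None):
    """1 - SSE(model) / SSE(benchmark); the default benchmark is the
    mean of the observations (null model)."""
    if benchmark is None:
        mean_obs = statistics.fmean(observed)
        benchmark = [mean_obs] * len(observed)
    sse_model = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sse_bench = sum((o - b) ** 2 for o, b in zip(observed, benchmark))
    return 1.0 - sse_model / sse_bench


observed = [2.0, 4.0, 6.0, 8.0]    # invented observations
predicted = [2.5, 3.5, 6.5, 7.5]   # invented model output
naive = [3.0, 3.0, 7.0, 7.0]       # invented alternative benchmark

eff_null = model_efficiency(observed, predicted)
eff_naive = model_efficiency(observed, predicted, benchmark=naive)
print(f"efficiency vs. null model:      {eff_null:.2f}")
print(f"efficiency vs. naive benchmark: {eff_naive:.2f}")
```

The same predictions score differently against the two benchmarks, which is one reason why a fixed threshold for 'sufficient fit' on this scale needs a justification from the research question.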
The well-known problems of induction, under-determination and theory-ladenness imply that proving by comparison to observation that a model is the 'true' model is ultimately impossible (Oreskes et al., 1994; Quine, 1951). Expected predictive accuracy provides a relative ranking and allows identification of the "best" among the candidate models for the given sample. The more comprehensive the list of candidate models and parameterisations that has been tested and the more representative the sample, the higher the confidence in having identified a generalisable best model or parameterisation can be. Like all other statistical relationships, measured expected predictive accuracy cannot be generalised to target situations across structural breaks.
Uncertainty in inference can be quantified as a posterior probability for the candidates if a formal Bayesian framework with proper prior probabilities and appropriate likelihood has been used in inverse modelling. However, also in those cases where posterior probabilities or credible intervals cannot be derived, it is important to consider posterior uncertainty (Step 10ii, Table 6) and recognise that the "best" model does not necessarily have or even approach a posterior probability of one (Troost and Berger 2015a). The potential explanatory and predictive power of alternatives should not be neglected in interpretation. If the analysis is structure-focused and interested in which model provides the better explanation, it remains inconclusive whenever two alternative models cannot be robustly discriminated by data or needs to employ additional theoretical considerations, e.g. parsimony as an epistemological principle 20 or correspondence to established theory, to justify a decision for one or the other model. In output-focused analysis, subsequent predictive simulation should use the full posterior distribution, consider confidence or credible intervals or at least a representative ensemble of all candidates that show nonnegligible explanatory power (ensemble modelling, model averaging).
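The point that the "best" candidate need not approach a posterior probability of one can be sketched for a finite candidate set. The sketch assumes that proper prior probabilities and well-specified (marginal) likelihoods are available, which the text above names as a precondition. The candidate names, log-likelihood values, predictions and the function `posterior_model_probabilities` are invented for illustration.

```python
import math


def posterior_model_probabilities(log_likelihoods, priors):
    """Bayes' rule over a finite candidate set, computed stably in log space."""
    log_post = [ll + math.log(p) for ll, p in zip(log_likelihoods, priors)]
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


names = ["A", "B", "C"]
log_lik = [-102.3, -103.1, -110.0]   # invented log marginal likelihoods
priors = [1 / 3, 1 / 3, 1 / 3]       # assumed: no prior preference
predictions = [12.0, 15.0, 30.0]     # invented predictions for a target situation

posterior = posterior_model_probabilities(log_lik, priors)
averaged = sum(p * y for p, y in zip(posterior, predictions))

for name, p, y in zip(names, posterior, predictions):
    print(f"model {name}: posterior probability {p:.3f}, prediction {y:.1f}")
print(f"model-averaged prediction: {averaged:.2f}")
```

In this invented case the best-fitting candidate A is clearly ahead, yet candidate B retains substantial posterior weight, so a prediction based on A alone would understate the uncertainty that an ensemble or averaged prediction carries forward.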

Step 11: Analysis and interpretation of predictive uncertainty
Only in rare cases, it will be permissible to directly generalise expected predictive uncertainty from behaviour-based inference to the target situation (preconditions: representative sample, negligible input uncertainty, one clearly best model). Generally, predictive uncertainty for a target situation is a function of the uncertainty about the systematic effect of system input on behaviour that is captured in the set of models and parameterisations (posterior model uncertainty), the model error (bias and unsystematic aleatory uncertainty) and the uncertainty in system inputs (e.g. scenarios, boundary conditions) for target situations. Building on the considerations by Marchau et al. (2019) and Walker et al. (2003), Table 7 lists which forms of predictive simulation outputs are adequate depending on the level of uncertainty in each of these dimensions. Unconditional predictions require low uncertainty in all "locations" of uncertainty. For all higher levels of uncertainty, comprehensive uncertainty analysis is necessary (Step 11, Table 6). Depending on model complexity and available computational resources, one can choose from a considerable number of approaches for efficient uncertainty and sensitivity analysis 21 (Helton et al., 2006;Saltelli et al., 2008;Gramacy and Lee 2009;Troost et al., 2022). Clear conditions for appropriate choices have been formulated: Uncertainty analysis must be global, i.e. cover the full range of potential input values including interactions and correlation between input factors (Saltelli and Annoni, 2010). Probabilistic predictions require probability information in all locations. 
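A global design that varies all uncertain inputs jointly can be sketched with a hand-written Latin hypercube sample (LHS). This is a minimal illustration using only the Python standard library. The function `latin_hypercube`, the stand-in `toy_model` and the input ranges are hypothetical, and the sketch assumes independent, uniformly distributed inputs, so it does not address correlated input factors.

```python
import random
import statistics


def latin_hypercube(n_samples, bounds, rng):
    """Draw n_samples points; each input range is split into n_samples strata
    and every stratum is sampled exactly once (independent uniform inputs assumed)."""
    columns = []
    for low, high in bounds:
        points = [low + (high - low) * (i + rng.random()) / n_samples
                  for i in range(n_samples)]
        rng.shuffle(points)  # decouple strata across input factors
        columns.append(points)
    return list(zip(*columns))


def toy_model(price, yield_factor):
    """Hypothetical stand-in for one simulation run; note the interaction term."""
    return price * yield_factor - 50.0


rng = random.Random(7)
bounds = [(80.0, 120.0), (0.4, 1.0)]  # assumed ranges for price and yield factor
design = latin_hypercube(200, bounds, rng)
outputs = [toy_model(p, y) for p, y in design]

cuts = statistics.quantiles(outputs, n=20)  # 19 cut points: 5 %, 10 %, ..., 95 %
print(f"median={statistics.median(outputs):.1f}  5%={cuts[0]:.1f}  95%={cuts[-1]:.1f}")
```

Because both inputs vary in every run, the output spread includes the effect of their interaction, which a one-factor-at-a-time design around a baseline would miss.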
Table 7 Adequacy of different types of predictive analysis depending on systematic and unsystematic model uncertainty and uncertainty in system input for target situations (scenario uncertainty), adapted and extended from Marchau et al. (2019).
20 Parsimony as an epistemological principle (simpler models are always to be preferred) differs from a pragmatic argument for parsimony in estimating models for prediction (simpler models are less prone to overfitting).
21 Following the definition of Helton et al. (2006), uncertainty analysis is concerned with quantifying the uncertainty (variance) in simulation outputs, while sensitivity analysis is concerned with linking this uncertainty to uncertainty in model inputs, i.e. determining which uncertain input factors are responsible to which degree for the uncertainty in outputs.
It is key that exploration of predictive uncertainty focuses on the output quantity, precision, and resolution relevant to answering the targeted research question. When we compare two target situations, we can distinguish the apparent (or observable) difference, i.e. the difference between two predictions that includes unsystematic, stochastic effects, and the systematic difference, i.e. the difference between two predictions controlled for unsystematic effects. In many decision support situations, the future may not be precisely predictable, but for a good decision it is enough if the systematic differences caused by decision options can be pointed out using pairwise comparison at each tested combination of input factor values (Berger and Troost 2014). For stochastic models, this requires Common Random Numbers schemes (Stout and Goldie, 2008; Troost and Berger 2016). The alternative is running sufficient repetitions and applying statistical comparison tests (e.g. Verstegen et al., 2019). 22 Especially when uncertainty is high in all locations, rather than trying to merely describe all possible outcomes, strategies to detect decision options that are robust under many different scenarios and assumptions should be emphasised (assumptions-based planning, stress testing, red teaming; Lempert 2019; Marchau et al., 2019).
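The difference between apparent and systematic differences can be sketched with a toy stochastic model. The model, the subsidy rule and the function `simulate` are hypothetical. The sketch works only because both policy options consume exactly the same sequence of random draws for a given seed, which is the property that can be hard to guarantee in a full ABM.

```python
import random
import statistics


def simulate(subsidy, seed, n_agents=200):
    """Hypothetical stochastic model: mean agent income when a subsidy is
    paid only to agents hit by a negative weather shock."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_agents):
        shock = rng.gauss(0.0, 10.0)
        payout = subsidy if shock < 0.0 else 0.0
        total += 100.0 + shock + payout
    return total / n_agents


seeds = range(30)
# Common random numbers: both options see the same shocks in each pair.
crn = [simulate(10.0, s) - simulate(0.0, s) for s in seeds]
# Independent streams: differences also contain unsystematic noise.
independent = [simulate(10.0, 1000 + s) - simulate(0.0, s) for s in seeds]

print(f"CRN:         mean diff {statistics.fmean(crn):.2f}, sd {statistics.stdev(crn):.2f}")
print(f"independent: mean diff {statistics.fmean(independent):.2f}, sd {statistics.stdev(independent):.2f}")
```

With common random numbers, the pairwise differences vary only because the policy effect itself depends on the shocks drawn. With independent streams, the same number of runs yields a noisier estimate of the policy effect, so more repetitions would be needed to reach the same precision.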

Step 12: Interpretation and conclusions
The interpretation of results should compare the final uncertainty (Step 10 or 11) to the required precision and accuracy of the research question (Step 2). If the required certainty is reached, conclusions that are consistent with the simulated output can be considered valid and sound. If uncertainty is too high, we have to conclude that the knowledge employed in the process is insufficient for the desired type of conclusions (e.g. Carauta et al., 2021). It should not be necessary to emphasise that this is an equally valuable and relevant result (Leamer 2010).
The structure of the argument and the premises that are critical to support the conclusions must be clearly laid out (Step 12, Table 6). This involves the premises that are supported by simulation results, but also the auxiliary and hidden premises (prior model evidence, representativity of data, identifiability, posterior uncertainty).
Both unstructured uncertainty about wider implications (Step 1) and unmodelled uncertainty (Step 7) remain qualitative and unquantified in the modelling process. Nevertheless, they must be an important part of the interpretation: Conclusions must be qualified with respect to the information omitted from the modelling process. Hypotheses on how omitted processes or alternative system conceptualisations could affect conclusions must be discussed (Forrester and Senge, 1980). Banerjee et al. (2016) argue for an explicit and structured section for 'Speculation' about external validity (generalisability) of results obtained from case studies. Especially when using models to inform decision-makers in the face of deep uncertainty, transparent documentation of critical and potentially value-laden fundamental assumptions (see protocols in Kloprogge et al., 2011; Saltelli et al., 2013; Fischhoff and Davis, 2014; van der Sluijs, 2017) and additional effort to assess the robustness of decision option outcomes to these assumptions are essential (Lempert 2019; Marchau et al., 2019).

Discussion and conclusions
The purpose of validation is to ensure the adequacy of simulation analysis for answering a specific well-defined research question. This requires a careful analysis of the logical argumentative structure and assessment of the critical premises that conclusions from simulation analysis build upon. Such premises rest on simulation outcomes, but are also implicit in the choice of models and methods of inference from data. Especially the latter is not always obvious to modellers, reviewers, and addressees of simulation results. For example, empirical validation and model inference presuppose representativity of data, identifiability, and control of sampling error. Moreover, specific methods such as maximum likelihood estimation rely on even more restrictive, not always obvious premises (see Tables 5 and 7). Validation needs to ensure that models and methods chosen fit the modelling context, which comprises the research question and available system knowledge and data on system behaviour. And it needs to assess whether the final uncertainty in simulation results fits the requirements on precision and accuracy implied by the research question.
In most cases this is more complex and subtler than a single-step matching of context to a method. Rather, it is a hierarchical process, i.e. outcomes of earlier steps affect choices in later steps (e.g. behaviour-based inference should not be pursued without first ensuring representative data and structural identifiability). It is recursive, i.e. in composite models such as ABM the context of each component must be assessed, and iterative, i.e. outcomes of subsequent steps may encourage going back a number of steps and reconsidering choices. For example, if the evaluation of structural identifiability, practical identifiability or predictive uncertainty leads to unsatisfactory results, it may be useful to go back to structure-based model selection or even to a redefinition of the research question: it may be possible to answer a more restricted question that is already useful where the context does not allow the original question to be answered reliably.
The KIA protocol that we have proposed in this article is intended to guide modellers in making adequate choices during the process of simulation analysis and justify them with adequate argumentation. It provides a guideline to reviewers who can use it by starting from the final conclusions and their premises, and working backward to evaluate whether the steps taken during the modelling process adequately support the premises in the given context. Moreover, it is intended to structure documentation: (i) as a checklist to ensure modelling context and justification for all relevant modelling decisions have been discussed in the main body of an article and (ii) as a template for well-structured tabular documentation in an appendix.
The protocol mirrors and is compatible with established recommendations for a structured modelling process (e.g. Jakeman et al., 2006), but it emphasises the linkages and propagation of uncertainty between modelling stages and highlights general criteria for the choice of adequate methods at each stage. It operationalises the principle "as empirical as possible, as general as necessary" coined for ABM by Brenner and Werker (2007). It incorporates the different levels of uncertainty of Walker et al. (2003) and Marchau et al. (2019), but also explains how this uncertainty comes about in the modelling process. Similar to Polhill and Salt (2017), it highlights the importance of structural model choice compared with purely data-driven model inference. While we have not extensively discussed stakeholder participation, the protocol is meant to be open to valuable stakeholder input and feedback at any step of the process: e.g. in shaping the encompassing debate, defining the targeted research questions, providing information in model selection and inference and shared interpretation (Barreteau et al., 2010).
The exhaustive discussion of many of the guiding questions listed in the tables of the protocol would warrant their own articles. Our intention here has been to comprehensively list them and highlight their interlinkages. We have linked many of the guiding questions to literature with more detailed explanation or formal assessment methods. This list of methods does not claim to be complete and it will certainly have to be extended over time as new approaches for model testing, selection or estimation are developed to deal with the formulated questions. We actually hope that this protocol sparks interest in developing new methods and then assists in clearly communicating the conditions for which they are suitable.
In defining eleven dimensions for the characterization of modelling contexts, we have moved beyond discrete typologies of model purpose (e.g. Edmonds et al., 2019; Epstein 2008). Typologies, such as Edmonds et al. (2019), and especially terms such as prediction, forecast, projection or exploration, whose understanding and usage differ between and sometimes even within disciplines (Bray and von Storch, 2009), can be mapped onto these dimensions to allow for more precise communication (see Appendix A.2). The dimensions are intended to improve communication on methodology by helping to identify which ABM applications share a similar modelling context and might learn from each other and which not. For example, Troost and Berger (2015a) and Carrella et al. (2020) both deal with unknown or intractable likelihoods for model inference. However, the former face both low structural and practical identifiability, while the latter assume few parameters and a large number of identifying summary statistics, i.e. high practical identifiability. As both are explicit about the assumed modelling context, this can be read from their articles, but may still be easily overlooked. Our protocol is intended to highlight these differences and in this way avoid common pitfalls in discussions between modellers and reviewers about adequate and valid model use and inference, e.g.:
-avoid discussions about an appropriate loss function when structural identifiability is the more important issue;
-avoid overemphasis on separation of training and validation data when validation data is not representative for target situations;
-avoid discussions about unreliability of unconditional predictions when these are neither possible nor necessary;
-avoid suggesting model simplification to increase practical identifiability when model complexity is required for structural reasons and direct generalisation is not adequate, etc.
22 Common random number schemes are more efficient in terms of required model runs, but sometimes quite difficult to implement (see example in Troost and Berger 2016).
Given the breadth of application contexts for ABM and their potential components, we strove to be general in drafting the protocol. We believe that the principles discussed here are applicable to any modelling endeavour and that most established disciplinary standards form special cases that are in principle covered by the protocol. In this sense, we expect that it can be useful for many different types of simulation, not only for ABM.
At this point, the KIA protocol itself is a theory-based hypothesis that requires practical testing. We propose it to the community of agent-based modellers for adoption in model construction, documentation, and review. Its use in practice will tell if it proves useful as guidance for model development and a communication device in documentation and review. Based on practical experience, it should then be reviewed and improved.

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
No data was used for the research described in the article.

Acknowledgments
We thank all participants of Workshop W9 "Best Practice in Agent-Based Model Parameterisation and Validation" at the 10th International Congress on Environmental Modelling & Software 2020 for their valuable and constructive input and comments. CT and TB acknowledge funding by the Federal Ministry of Education and Research of Germany (BMBF) for the project SimLearn (01IS19073C). LN acknowledges funding from the Energy Demand changes Induced by Technological and Social innovations (EDITS) project provided by Ministry of Economy, Trade, and Industry (METI), Japan. GP acknowledges funding by the Scottish Government Strategic Research Programme 2022-27, Project ID JHI-C5-1. QBL acknowledges support by the CGIAR Research Program on Grain Legumes and Dry Cereals (CRP-GLDC) and Initiative on Sustainable Intensification of Mixed Farming Systems. TF acknowledges the support of the European Research Council (ERC) under the European Union's Horizon 2020 Research and Innovation Program (grant agreement number 758014). Author contributions: CT conceptualized and wrote the manuscript. RH, AB, HvD, TF, QBL, ML, LN, JGP, ZS, and TB contributed ideas, comments, and corrections at all stages of manuscript writing. We thank Judith Verstegen and a further anonymous referee for highly constructive and valuable feedback during the review process.

A.1 Notes on differences in structural identifiability of parameters
Structural identifiability in the data can differ considerably between different groups of parameters or model components. For example, parameters that relate short-term agent behaviour to static characteristics can be estimated from sufficiently heterogeneous cross-sectional data. By contrast, parameters that affect dynamic behaviour or accumulative development over several periods require panel data (Troost and Berger, 2015a, 2020). Parameters that affect the probability of low-probability events can only be identified if enough low-probability events have been observed (Filatova et al., 2016). Structural non-identifiability cannot be resolved by more of the same data, but requires either widening the range of situations observed or more dimensions of the data. Under certain conditions, unidentifiable parameters may be temporarily fixed to allow identification of other components. However, fixing has to be reversed for later predictive simulation in order not to obscure model uncertainty (non-influence in the observed domain does not necessarily mean non-influence in the target situation, see example in Troost and Berger 2015a).
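A deliberately simple sketch can make the last two points concrete. The response function `model`, its parameters `a` and `b`, and the `shift` that represents a target situation outside the observed domain are all hypothetical. In the observed domain only the product of the two parameters affects the output, so no amount of additional data of the same kind separates the two candidates, yet they diverge once the target situation is simulated.

```python
def model(x, a, b, shift=0.0):
    """Hypothetical response: only the product a * b matters while shift == 0."""
    return a * (b + shift) * x


def sse(params, xs, ys):
    """Sum of squared errors in the observed domain (shift == 0)."""
    a, b = params
    return sum((model(x, a, b) - y) ** 2 for x, y in zip(xs, ys))


# Invented observations with a deterministic alternating "error" term.
xs = [float(i) for i in range(1, 1001)]
ys = [6.0 * x + ((-1) ** i) * 0.5 for i, x in enumerate(xs)]

candidate_1, candidate_2 = (2.0, 3.0), (6.0, 1.0)  # same product a * b = 6
for n in (10, 100, 1000):
    print(n, sse(candidate_1, xs[:n], ys[:n]), sse(candidate_2, xs[:n], ys[:n]))

# Target situation outside the observed domain: shift = 1 (e.g. a new policy).
target_1 = model(1.0, *candidate_1, shift=1.0)
target_2 = model(1.0, *candidate_2, shift=1.0)
print(target_1, target_2)
```

Fixing `b` at either value would allow `a` to be estimated, but carrying only that fixed value into the target simulation would hide the fact that the data never discriminated between the two candidates.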

A.2 Mapping purposes to modelling contexts
We believe that terms like prediction, forecast or projection, which are often ambiguous or defined differently between disciplines, as well as typologies such as that of Edmonds et al. (2019), can be communicated more precisely using the suggested dimensions of the modelling context.
For example, the seven modelling purposes of Edmonds et al. (2019) could be coarsely mapped onto our characterisations of modelling context as follows: In 'theoretical exposition' and 'illustration' the system under study is the model itself, with the former being output-focused (moving from an insufficient-sample situation to an in-sample situation by exhaustive simulation) and the latter putting emphasis on transparency and interpretability. 'Analogy' does relate to a real system and is structure-focused, with a low demand on precision and comprehensiveness but high demands on transparency and interpretability. In these three cases, conclusions about the relationship of the model to the real world are left aside for the moment or discussed as unmodelled uncertainty. 'Social learning' and education can happen in all contexts and can be about the model, the opinions of participants or the real system, and about output or structure, but they require transparency and interpretability. 'Description' corresponds to structure-focused, in-sample analysis. (Output-focused in-sample analysis, not mentioned by Edmonds et al., could be termed 'compression': storing and reproducing observations in a more resource-efficient way than explicitly listing them.) 'Explanation' is structure-focused, out-of-sample generalization. 'Prediction' is any output-focused analysis in out-of-sample or non-representative sample settings. This wide scope of prediction still leaves a lot of room for misunderstanding, and clearer definitions of the modelling context using the dimensions of required precision and accuracy, transparency, etc. can help to link it to appropriate forms of simulation analysis (e.g. Marchau et al., 2019).
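For the purposes that refer to a real system, the coarse mapping above can be written down as a small lookup table over two of the context dimensions (focus and sample situation). The following is a minimal sketch of that reading. The dictionary keys, the function name and the restriction to two dimensions are our own illustrative assumptions, and the remaining dimensions (precision, transparency, interpretability) are deliberately left out.

```python
from typing import Optional

# (focus, sample situation) -> purpose label, as read from the paragraph above.
# 'compression' is the label suggested in the text, not one of the seven
# purposes of Edmonds et al. (2019).
PURPOSE_BY_CONTEXT = {
    ("structure", "in-sample"): "description",
    ("output", "in-sample"): "compression",
    ("structure", "out-of-sample"): "explanation",
    ("output", "out-of-sample"): "prediction",
    ("output", "non-representative sample"): "prediction",
}

def purpose_for(focus: str, sample: str) -> Optional[str]:
    """Return the purpose label for a context, or None if the text names none."""
    return PURPOSE_BY_CONTEXT.get((focus, sample))

labels = [purpose_for("structure", "out-of-sample"),
          purpose_for("output", "non-representative sample"),
          purpose_for("structure", "non-representative sample")]
```

Combinations for which the text gives no label return None, which makes the gaps in the coarse mapping explicit instead of silently assigning a purpose.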