On (in)validating environmental models. 1. Principles for formulating a Turing‐like Test for determining when a model is fit‐for purpose

Model invalidation is a good thing. It means that we are forced to reconsider either model structures or the available data more closely, that is to challenge our fundamental understanding of the problem at hand. It is not easy, however, to decide when a model should be invalidated, when we expect that the sources of uncertainty in environmental modelling will often be epistemic rather than simply aleatory in nature. In particular, epistemic errors in model inputs may well exert a very strong control over how accurate we might expect model predictions to be when compared against evaluation data that might also be subject to epistemic uncertainties. We suggest that both modellers and referees should treat model validation as a form of Turing‐like Test, whilst being more explicit about how the uncertainties in observed data and their impacts are assessed. Eight principles in formulating such tests are presented. Being explicit about the decisions made in framing an analysis is one important way to facilitate communication with users of model outputs, especially when it is intended to use a model simulator as a ‘model of everywhere’ or ‘digital twin’ of a catchment system. An example application of the concepts is provided in Part 2.

KEYWORDS: epistemic uncertainties, hydraulic models, hydrologic model, hypothesis testing, limits of acceptability

You know the famous line that [philosopher] Isaiah Berlin borrowed from a Greek poet, "The fox knows many things, but the hedgehog knows one big thing"? The better forecasters were like Berlin's foxes: self-critical, eclectic thinkers who were willing to update their beliefs when faced with contrary evidence, were doubtful of grand schemes and were rather modest about their predictive ability. The less successful forecasters were like hedgehogs: They tended to have one big, beautiful idea that they loved to stretch, sometimes to the breaking point. They tended to be articulate and very persuasive as to why their idea explained everything. The media often love hedgehogs.
Philip E. Tetlock, 2006

1 | BACKGROUND: ON MODEL VALIDATION
In modelling environmental systems such as river catchments, we know only too well that we can reproduce the complexity of catchment response with only limited accuracy (see, e.g. Beven, 2001, 2002, 2012, 2019a, 2019b). There are very good reasons for this, particularly a lack of full knowledge about the inputs and outputs for the part of the system being represented, even the spatial limits of the system itself (e.g. Khan et al., 2014; Kauffeldt et al., 2013; Beven & Smith, 2015), and a lack of full knowledge about the representation of the (usually nonlinear and interacting) processes that control the responses (e.g. Beven & Chappell, 2021; Wagener et al., 2021). These knowledge (epistemic or deep) uncertainties may well be different in the future that we are trying to predict than they were in the past where we might have some observed data that can be used for model evaluation, a problem of inference that also limits the application of purely data-based methods (Beven, 2020; Wagener et al., 2022). They should be distinguished from the aleatory uncertainties that can be treated as random variability and to which the full power of statistical theory can be applied.
It is the epistemic uncertainties that make model validation, in the opinion of some, impossible (see, e.g. Stephenson & Freeze, 1974; Oreskes et al., 1994; Oreskes, 1997; Beven, 2013, 2015; Rougier & Beven, 2013). But, following Box (1979), we might still like to know if some models might be useful or fit-for-purpose in some sense, even when we expect them to be wrong in some way perhaps not yet known. This is particularly important when a model simulator is to be used as a 'model of everywhere' or 'digital twin' of a catchment system to make predictions of how that catchment might behave under possible future conditions. In such cases, we wish to try to get the 'right results for the right reasons' (Kirchner, 2006) and to avoid using models that should not be considered fit-for-purpose (e.g. Hrachowitz et al., 2014). Thus, model invalidation is an important part of the modelling as a learning process that underlies the 'models of everywhere' concept (e.g. Beven, 2007; Blair et al., 2019).
Our aim is to show how model validation should be really a process of model invalidation, through an extended and pro-active form of a necessarily subjective Turing-like Test. We develop our argument through our primary expertise in hydrological and hydraulic modelling, but we suggest that the discussion has wider relevance to a range of environmental models. In what follows we provide an overview of concepts of model validation; discuss the importance of observed data in model hypothesis testing and invalidation; and discuss the concept and conditionality of fitness-for-purpose. The idea of a Turing-like Test for fitness-for-purpose is then introduced as a way of framing our expectations about model performance, with suggestions for some principles underlying such a concept in application to environmental models. In Part 2 of this study, we show how these concepts might be implemented in an illustrative case study and discuss how we might learn from model invalidation to make advances in knowledge and understanding of hydrological processes.

2 | MODEL VALIDATION: AN OVERVIEW
The concept of model validation has been a concern in hydrological modelling (and other areas of environmental modelling) for a long time (e.g. Stephenson & Freeze, 1974; Konikow & Bredehoeft, 1992; Oreskes et al., 1994; and the papers in Bates, 2001, and Beisbart & Saam, 2019). Klemeš (1986) proposed a hierarchical approach to model validation that would test both the applicability of a model at a site and the transferability of a model to other sites or other climatic conditions. The former involved a split-sample test, the latter differential split-sample and proxy-basin tests.
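The difference between the split-sample and differential split-sample tests can be sketched in a few lines. This is a minimal illustration only: the function names, the rainfall totals and the choice to rank years by annual rainfall are assumptions made here, not part of the Klemeš (1986) scheme itself.

```python
def split_sample(records):
    """Split a time-ordered record into calibration and evaluation halves."""
    half = len(records) // 2
    return records[:half], records[half:]


def differential_split_sample(annual_rain):
    """Split years into a drier and a wetter group by annual rainfall.

    annual_rain maps year -> annual rainfall total. Calibrating on one group
    and evaluating on the other tests transferability to changed conditions.
    """
    ranked = sorted(annual_rain, key=annual_rain.get)
    half = len(ranked) // 2
    return sorted(ranked[:half]), sorted(ranked[half:])


# Hypothetical annual rainfall totals (mm), invented for illustration only.
annual_rain = {2001: 820, 2002: 1150, 2003: 640, 2004: 990, 2005: 1310, 2006: 700}

# Simple split-sample test: first half of the record versus second half.
calibration, evaluation = split_sample(sorted(annual_rain))

# Differential split-sample test: dry years versus wet years.
dry_years, wet_years = differential_split_sample(annual_rain)
```

In the simple test the two periods may share much the same climate, so a model can pass without showing that it transfers to different conditions; the differential split deliberately separates them.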
There have been few reported studies that have gone beyond the simple split-sample test. Perhaps the best known in hydrology is that of Refsgaard (1997; Refsgaard & Knudsen, 1996) who showed that model calibration could not entirely compensate for differences between sites and sample characteristics. Ewen and Parkin (1996) also proposed a 'blind' validation test for hydrological models when treating catchments as if ungauged so that no calibration is possible. In their study, a variety of tests were set prior to making model runs using a version of the SHE model. Even allowing for uncertainty in model parameters, not all tests were passed (Bathurst et al., 2004; Parkin et al., 1996). This has not, however, prevented the SHE model from continuing to be widely used (e.g. Refsgaard et al., 2010), and raises the question as to what is being falsified: is it only the particular conditions under which the model was being applied, since the model and its framework as a whole have continued to be used in later applications? Another widely used model, the Soil and Water Assessment Tool (SWAT, Arnold et al., 1998) has, in various forms, been applied to hundreds of catchments worldwide (Gassman et al., 2007). SWAT is provided with a database of parameter values such that it can be applied in ungauged basins, but it can also be calibrated against historic data. being purpose specific (see also Van Griensven et al., 2008, as an example of using split-sample tests for fitness-for-purpose of the SWAT model). A recent application of SWAT, however, showed that the model could not provide sufficiently accurate simulations of both hydrograph and phosphorus outputs from a catchment even after conditioning on a calibration period and allowing for uncertainties in the evaluation data. The model did not appear to be fit-for-purpose in this case. Similar model failures have been reported for the INCA-P quality model by Dean et al. (2009), for the WEPP erosion model by Brazier et al. (2001) and for TOPMODEL in predicting saturated areas (Beven & Kirkby, 1979; Güntner et al., 1999), flood frequency (Blazkova & Beven, 2009), streamflow in all flow conditions in a small catchment (Choi & Beven, 2007), and the storm to storm variability in stream chloride concentrations (Page et al., 2007). These failures raise again the question as to whether they are due to an incorrect model structure, an incorrect parameterisation or the suitability of the boundary conditions and auxiliary relations involved in the application.
This might be particularly the case when conditions are changing in either the forcing or catchment characteristics. Fowler et al. (2016) and Wagener et al. (2022) show how split sample testing represents a challenge for climate change impact studies because the validity of predictions depends on the validity of both the model and its parameterisation, something that may not hold if tested under changed conditions. It is then necessary to be careful not to confound rejection of a parameterization with rejection of the model structure.
It has also been suggested that some failures might be the result of inadequate sampling of the model space (e.g. Vrugt & Beven, 2018). This might be more likely when the models considered have many parameter values to be estimated by calibration and where we rely, to a greater or lesser extent, on specific assumptions or the theory of a model to interpret data in ways that can be used in calibration or testing. This is referred to as the theory-ladenness of data and reflects the point that when we compare data with a model we are not comparing a theoretically based model of the world directly with reality, but rather with a data-based model of the world (Odoni & Lane, 2010; Oreskes, 1997): both data and numerical models are representations of the world (see also the statistical theory of reification in Goldstein & Rougier, 2009, and the critique of reification in Briggs, 2016). Young et al. (1996) proposed and illustrated an alternative modelling approach that is based upon using data-based mechanistic models to identify the dominant modes of behaviour in a system based upon analysis of observations; it is these dominant modes that a simulation model needs to be able to reproduce (see also Young, 2013; and the hydrological signatures approach in Hrachowitz et al., 2014). This is nicely illustrated by Nearing, Mocko, et al. (2016) who have proposed a methodology for testing models relative to a purely data-based approach by considering measures of information in explaining the test data. This allows a model to be assessed in terms of the entropy of the observations of interest relative to the information that can be extracted using only data-based models. Where the data-based methods can be shown to explain more information than a theory-based model then that might be a reason to reject the theoretical model and explore the reasons why the data-based model performs better. Their work has shown that many models fail such a test (though this could be because data-based models can better compensate for physical inconsistencies in the observations, such as those shown in Beven & Smith, 2015, and Beven, 2019a).
It is also the case that the supposed validation of a model in one test does not mean that it is applicable generally, or that it will be valid for all possible model applications (see the discussion of SWAT applications above). This is a form of Hume's Problem of Induction (e.g. Beven & Lane, 2019) in the sense that a model that performs well at a site for one set of conditions (in time and space) cannot be expected to perform well for all possible future boundary conditions. The past is not necessarily a guide to future performance, especially when there are epistemic uncertainties about future initial and boundary conditions, a point first made in hydrological modelling by Stephenson and Freeze (1974) (see also Konikow & Bredehoeft, 1992, in respect of groundwater models and Lane et al., 2005, for hydraulic models of channel and floodplain flows). In the case of models with large numbers of parameters, such as SWAT and WEPP, it is likely that even if successful models could be found they might be overfitted to the calibration data, with a danger of poor performance in prediction when the data uncertainties may be quite different.
There has been significant discussion of validation concepts in other domains of environmental science. A philosophical discussion is provided by Oreskes et al. (1994), following on the papers by Konikow and Bredehoeft (1992) and Anderson and Woessner (1992) in groundwater modelling. Oreskes et al. suggest that validation (implying strength of belief from its Latin root) is preferable to verification (also from the Latin, implying truth) since no model can ever be considered as a true representation of reality; it can only be considered an approximation (although the French title of Klemeš, 1986 paper uses the noun verification to translate operational testing). Verification should only be used in the sense of some proof (preferably using formal mathematical methods that are not generally used in environmental modelling) that a computer code is correct in its implementation. In ecological modelling , Rykiel Jr (1996)  Thus, verification is a necessary (and perhaps often over-looked) precursor of validation but a verified model in this formal mathematical sense does not necessarily mean that it is valid as fit-for-purpose.
Verification requires us to show that our model predictions are internally consistent and reproducible (Hutton et al., 2016, see also Imbert, 2019) even before they are confronted by and shown to be in some sense consistent with the observations (or not). This type of verification may occur at a number of different levels, the most basic being that decisions regarding the computational solution taken by a modeller (spatial discretization, time steps, convergence criteria, relaxation coefficients and so on) are not inadvertently impacting model predictions (see, e.g. Kavetski & Clark, 2010; Metcalfe et al., 2015; Smith et al., 2021). This may extend to reproducibility between modellers using the same model; or even between models of the same system produced by the same modeller or different modellers. None of these evaluations need make redress to observations, such that a model may be verified but not yet validated and Hutton et al. (2016) suggest that this type of verification needs a community level shift in how we make our code transparent and usable by others.
In environmental hydraulics, there has been much emphasis upon the importance of verification of computer codes as a necessary precursor to validation. Lane and Richards (2001) drew attention to the existence of very different interpretations of the status of a numerical model: according to the American Society of Mechanical Engineers (ASME), the predictions of a model should be taken as verified or correct if its application has followed a series of controls on the numerical accuracy of the associated model solution, that is, it is verified in the sense of Oreskes et al. (1994). For the ASME, testing against observational data should not be a substitute for verification nor should such testing be a necessary requirement for labelling a model as acceptable and usable. Lane et al. (2005) argue against this position in relation to channel and floodplain flows, noting that these criteria fail to capture the conditionality of model applications that follows from the dependence on boundary conditions, geometry and the need for auxiliary relations to make models solvable (e.g. turbulence closure; wall treatments) that themselves may have a restricted range of applicability. The determination of model acceptability cannot be reduced to simply verification that a computer code is a proper solution of the underlying nonlinear mathematical equations (as already recognized by Stephenson & Freeze, 1974). Similar arguments will apply to models based on the approximate solution of dynamic nonlinear equations in other domains, such as atmospheric and ocean circulation models and subsurface flow models with their own requirements of boundary and auxiliary conditions.
It follows that a model, which is apparently acceptable in one situation, is not necessarily acceptable in another, even after allowing for model calibration or modification of the other auxiliary conditions required to make a model run (Morton, 1993). Key here is recognition that some realizations of a model may be more or less acceptable than others, depending on the ways in which uncertainties in input data/parameters propagate through to model predictions. The aim then might be to set some plausibility criteria or limits of acceptability that help to identify those simulations that are 'acceptable' or 'behavioural'. This is the basis of inferential approaches to parameter estimation, such as with the set-theoretic approaches of Van Straten and Keesman (1991), the generalized likelihood uncertainty estimation (GLUE) methodology (e.g. Beven, 2006, 2009, 2012, 2016; Beven & Binley, 2014; Vrugt & Beven, 2018) or the analogous concepts in Approximate Bayesian Computation (Nott et al., 2012; Sadegh & Vrugt, 2014; Vrugt & Sadegh, 2013). Here, limits of acceptability are used to define plausible model simulations and then the uncertainty that remains is quantified and presented as part of the primary model outputs.
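The selection of behavioural simulations using limits of acceptability might be sketched as follows. This is a toy illustration under stated assumptions, not the GLUE implementation of the studies cited: the one-parameter runoff model, the rainfall values, the limits and all function names are hypothetical.

```python
import random


def simulate(runoff_coeff, rain):
    """Toy model (assumed for illustration): flow is a fixed fraction of rain."""
    return [runoff_coeff * r for r in rain]


def is_behavioural(sim, limits):
    """A run is behavioural only if every value lies inside its limits."""
    return all(lo <= s <= hi for s, (lo, hi) in zip(sim, limits))


rain = [10.0, 20.0, 5.0, 40.0]
# Hypothetical limits of acceptability (lower, upper) around uncertain observations.
limits = [(3.0, 6.0), (7.0, 12.0), (1.0, 4.0), (15.0, 24.0)]

# Sample the parameter space uniformly (fixed seed for reproducibility).
rng = random.Random(42)
samples = [rng.uniform(0.0, 1.0) for _ in range(1000)]
behavioural = [a for a in samples if is_behavioural(simulate(a, rain), limits)]

# Prediction bounds at each time step, taken from the behavioural set alone.
runs = [simulate(a, rain) for a in behavioural]
bounds = [(min(col), max(col)) for col in zip(*runs)]
```

With these invented limits, only runoff coefficients between 0.375 and 0.6 survive. An empty `behavioural` list would correspond to the case where every model tried is invalidated.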
To make progress we suggest that it is necessary to replace the notion of model validation, and all the debates around it, with the complementary notion of model invalidation (see also Beven, 2018; Beven & Lane, 2019). It is an important consideration as to when a model should be considered as not fit-for-purpose, but this will depend on both the requirements of the purpose, and the quality of observed data available for model evaluation. There are a growing number of studies of the uncertainties associated with hydrological data that could form the starting point for model invalidation (e.g. Beven & Smith, 2015; Coxon et al., 2015; Ehlers et al., 2019; Harmel et al., 2009, 2014; Hollaway, Beven, et al., 2018; Khan et al., 2014; Kiang et al., 2018; Krueger et al., 2010; McMillan et al., 2012, 2022; Westerberg et al., 2011). A critical measure that might then be used is whether there is any overlap at all between the distribution of uncertain observations, and the distribution of model predictions. If there is no overlap, then this might be considered as a reason for invalidating that model and finding something better. Even then, however, any particular observation might be considered as an outlier or not very important to the purpose of the application (Harmel et al., 2014). It has been suggested that the use of a limited set of more extreme events might be of greatest value in model evaluation and testing (e.g. Singh & Bárdossy, 2012) since they are more likely to reveal model deficiencies. This then raises the question, however, as to which events are truly informative, which might be disinformative (Beven & Smith, 2015) and how many such outliers should be allowed before a model is invalidated.
This latter will be necessarily a subjective decision since it might be difficult to construct robust significance tests for epistemic errors rather than the random errors of statistical theory (see e.g. Frigg et al., 2014). In particular, making an analogy with statistical theory, we could perhaps allow failure on no more than 5% of the observations, but this might not be appropriate when the observational data that are of most interest for an application might make up that 5%, such as when a hydrological model consistently under predicts the largest peaks when used in assessing flood risk but does well in predicting the other 95% of the observational series (see Part 2 of this article and also Colquhoun, 2014, Briggs, 2016, for critiques of this approach in statistical hypothesis testing).
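The difficulty with a 5% allowance can be made concrete with a small numerical sketch. All values below are invented for illustration, and the ±20% observation uncertainty and the peak threshold are assumptions, not recommendations.

```python
def failure_mask(sim, limits):
    """True where a simulated value falls outside its limits of acceptability."""
    return [not (lo <= s <= hi) for s, (lo, hi) in zip(sim, limits)]


# Hypothetical 40-step series: 38 low flows and 2 flood peaks.
obs = [1.0] * 38 + [50.0, 60.0]
sim = [1.05] * 38 + [30.0, 35.0]            # peaks consistently under-predicted
limits = [(0.8 * q, 1.2 * q) for q in obs]  # assumed +/-20% observation uncertainty

fails = failure_mask(sim, limits)
overall_rate = sum(fails) / len(fails)

peak_threshold = 10.0  # assumed definition of a 'peak' for a flood-risk purpose
peak_fails = [f for f, q in zip(fails, obs) if q >= peak_threshold]
peak_rate = sum(peak_fails) / len(peak_fails)

passes_5_percent_rule = overall_rate <= 0.05
```

Here the model fails on exactly 5% of the time steps and so passes the overall rule, yet it fails on every peak, which is the part of the record that matters for flood risk. Reporting the failure rate separately for the observations relevant to the purpose is one simple way of making the decision explicit.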
This then suggests that there should be an expectation that it will not be possible to be entirely objective about model validation when faced with epistemic sources of uncertainty and error. However, good practice should entail being transparent about how uncertainties in model and observations are assessed, what quality measures are used, and how a model invalidation or rejection is to be defined. It has been proposed before that the subjective assumptions that underlie an analysis should be recorded in a condition tree or audit trail to facilitate communication with users of any model predictions (e.g. Beven & Alcock, 2012; Beven, Lamp, et al., 2014). Here, we suggest extending that concept to include the conditions for model (in)validation. Invalidation implies that it is necessary to do better in some way: either to find a better model structure that is fit-for-purpose, or to better represent the environmental (i.e. boundary) conditions to which the model is being applied, or to improve the evaluation data (Beven & Lane, 2019). This is then a way of progressing understanding rather than continuing to rely on model predictions that have not been evaluated as fit-for-purpose. If we follow philosopher of science Isabelle Stengers (2005), finding that our model predictions are not fit-for-purpose may be just one way of arriving at statements that do not 'say what is, or what ought to be, but [to] provoke thought, a proposal that requires no other verification than the way in which it is able to "slow down" reasoning and create an opportunity to arouse a slightly different awareness of the problems and situations mobilizing us' (Stengers, 2005, p. 994).
That is, showing that something is not fit-for-purpose has the potential to advance science through forcing us to search for other approaches, model structures, data and so on, rather than to simply accept the model and observed data that we have if it is still associated with large and non-stationary prediction errors even after some form of calibration or conditioning on the observations (see also Thompson & Smith, 2019).
We stress that how fitness-for-purpose is evaluated will depend on the purpose. This will be different for models that might be used for a limited purpose (e.g. flood forecasting, where empirical adequacy might be sufficient), and models that aim to demonstrate a scientific understanding of how a catchment responds to rainfall (where conflicts with qualitative perceptual process understanding might be significant, see, e.g. Beven & Chappell, 2021; Wagener et al., 2021, 2022). Defining invalidation criteria might be quite different for different cases. In particular, we should not confuse empirical adequacy with fitness-for-purpose. We should aim to get the 'right results for the right reasons' (Beven & Chappell, 2021; Kirchner, 2006; Lane, 2012; Lane et al., 2011), but this will depend on the purpose. In flood forecasting, for example, a model that predicts water levels rather than discharges and which therefore does not maintain mass balance but which uses adaptive updating to compensate for lack of knowledge of catchment inputs and flood discharges during flood events will generally be more fit-for-purpose than a model constrained by mass balance in predicting peak levels and timing. A model aimed at understanding, however, will require quite different criteria for evaluation (such as getting the patterns and amounts of overland flow correct, or the 'young water fraction' correct where tracer data are available).

3 | MODEL (IN)VALIDATION AND HYPOTHESIS TESTING
Validation has long been considered as a form of hypothesis testing (Baker, 2017; Beven, 2018; Clark et al., 2011; Overton, 1977; Pfister & Kirchner, 2017; Rykiel Jr, 1996; Sornette et al., 2007). Holling (1978) takes a strong Popperian falsification position on this, suggesting that models can never be validated, they can only be invalidated (see also Beven & Lane, 2019). That is in line with the discussion of this article, but we note that if multiple models satisfy some basic limits of acceptability (i.e. survive invalidation) they might still be associated with differing strengths of validation, analogous with Popper's varying degrees of verisimilitude in theory testing. Methods for hypothesis testing are well developed in statistical theory, based on treating errors as if they were fundamentally aleatory (after allowing for possible structure in the error series such as bias, heteroscedasticity and co-variation). In fact, statistical theory does not ever reject a model as a hypothesis, it will only give it a diminishingly small likelihood. It does, however, provide tools for deciding whether one model has a significantly higher likelihood than another (such as the use of Bayes Ratios and various information criteria). Thus, although a model is not necessarily rejected it might be superseded by another that could be considered as more valid in the sense of higher likelihood or belief.
The question then is how to assess the likelihood when the model uncertainties are primarily epistemic rather than aleatory in nature. Sornette et al. (2007) provide an iterative methodology for updating the degree of belief in a model as more data become available. The approach uses a statistical likelihood but also a subjective weighting parameter to weight the contribution of any new error information in a way that might depend on the framing of a particular application and expectations about the uncertainties associated with particular types of observations. This will be most applicable when the number of observations is small. As the number increases (as with discharge time series in hydrology) the assumption of a statistical error model tends to stretch the likelihood surface unreasonably (see discussion in Beven, 2016). In such cases, a different approach will be necessary. This will especially be the case when even the model with the highest likelihood or belief might be associated with significant error.
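The stretching of the likelihood surface can be seen in a short worked derivation. It rests on a simplifying assumption made here for illustration only: independent Gaussian errors with a known, common variance. The symbols $N$ (number of observations), $\sigma^{2}$ (error variance), $e_{k,t}$ (residual of model $k$ at time $t$) and $\mathrm{MSE}_k$ (mean squared error of model $k$) are introduced for this sketch and are not taken from the studies cited.

```latex
\begin{align}
\ln L_k &= -\frac{N}{2}\ln\!\left(2\pi\sigma^{2}\right)
           - \frac{1}{2\sigma^{2}}\sum_{t=1}^{N} e_{k,t}^{2}, \qquad k = 1, 2, \\
\ln\frac{L_1}{L_2} &= -\frac{1}{2\sigma^{2}}
           \left(\sum_{t=1}^{N} e_{1,t}^{2} - \sum_{t=1}^{N} e_{2,t}^{2}\right)
         = -\frac{N}{2\sigma^{2}}\left(\mathrm{MSE}_1 - \mathrm{MSE}_2\right).
\end{align}
```

For a fixed difference in mean squared error, the log-likelihood ratio therefore scales linearly with $N$. With, say, $\mathrm{MSE}_1 - \mathrm{MSE}_2 = 0.01\,\sigma^{2}$ and $N = 10\,000$ time steps, $\ln(L_1/L_2) = -50$, a likelihood ratio of about $2 \times 10^{-22}$, so that a model that is only marginally worse in fit is given effectively zero likelihood.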
Assessing the structure and parameters of an error model is, in fact, an important part of statistical hypothesis testing in that it will inform the type of likelihood function to be used. Such a framework does not itself, however, allow the user to decide whether any model is good enough or fit-for-purpose. It is often (even if not always) the case in hydrological modelling (and undoubtedly in other environmental domains) that models that appear to do well in calibration, do not do so well when applied to another period of data or another data set (e.g. Blazkova & Beven, 2009;Choi & Beven, 2007;Coron et al., 2012;Hrachowitz et al., 2014), even when adding a statistical error model. This is a strong indication that either the model structure or the data are subject to epistemic errors, such that the structure of the errors is not stationary and therefore not well represented by a stationary statistical error model. This raises another issue in model testing and validation. As in statistical hypothesis testing it will be possible to make both Type I and Type II errors, either accepting a model that will provide poor predictions or rejecting a model that would have provided good predictions just because of errors in the calibration data. Any assessment of fitness-for-purpose is therefore necessarily conditional on the decisions and assumptions made in the evaluation. This reflects the conditional nature of any validation exercise, but when carried out in the context of invalidation allows for the interesting case of all models tried being invalidated. The question then, of course, is why?

4 | FITNESS-FOR-PURPOSE AS A TURING-LIKE TEST
The Turing Test is a well-known concept from Artificial Intelligence and here we propose it as a means of addressing the challenge of deciding when a model should be deemed as fit-for-purpose. Turing (1950) proposed that a suitable test for machine intelligence was whether a human interrogator could tell the difference between the responses of another human or a computer program. As Turing posed the question: 'Are there imaginable digital computers which would do well in the imitation game?'. The concept has generated significant discussion in the field of artificial intelligence (e.g. French, 2000; Oppy & Dowe, 2016). A similar challenge has been proposed as a Turing-like Test for simulation models used in the environmental sciences in the form: 'Can a group of experts tell the difference between a sequence of observations in space and/or time and a model simulation?'. If not, then it might be concluded that a model should be considered as fit-for-purpose. Of course, if we consider all journal referees to be experts of this type, then we would conclude that all published model outputs should be considered as fit-for-purpose (although in some cases the comparison of model and observations can be obfuscated by, e.g. already calibrating or bias-correcting model predictions using a set of past observations).
The concept of a Turing-like Test for environmental models raises some interesting issues. The first recognizes that any expert is partially or fully bound by their prior experience and the disciplines within which that experience has developed, a problem frequently identified in relation to environmental policy-making (e.g. Brock & Carpenter, 2007;Klaey et al., 2015). In such an instance, viewing a model as fit-for-purpose is clouded by a particular idea of what makes a model fit-for-purpose that is not simply informed by the model that is under consideration. This is the sense in which reviewers are often chosen because of their knowledge of a particular frame of reference, that is they are the right kind of hedgehog. Past experience suggests that they can become rather prickly when their modelling concepts are questioned. Such a hedgehog will have a strong tendency to assess the model within its own frame of reference, that is the extent to which the model conforms to the paradigm within which it has been built. Many declared model successes are of this type. But, if we return to the definitions of verification and validation, this is not based on any invalidation test but is rather analogous with code to code verification; that is a check of conformity with the basic established principles that guide the modelling strategy, not the extent to which the paradigm itself is right, and the real world is being adequately imitated. Hedgehogs can often be rather short-sighted.
The second is whether a group of experts would be able to consider whether a model is sound or not, without access to at least some observations from the system under consideration (a catchment and its pertinent characteristics in the hydrological case). Exercises in simulating the response of catchments treated as ungauged, with access to only soil, geology, topography and land use maps have not proven very successful in the past because of the difficulty of translating such information into values of model parameters (e.g. Refsgaard et al., 1997). Declaring success, of course, will also depend on just what measures we define to evaluate the model as being fit-for-purpose for a particular context: as, for example, in the Ewen and Parkin framework discussed earlier (see Parkin et al., 1996;Bathurst et al., 2004, for applications).
Classically the Turing Test is a qualitative subjective decision (e.g. Palmer, 2016, in the context of climate models), but should ideally have the status of being auditable.
The subjectivity of such decisions will nearly always be the case for the refereeing of scientific papers that present modelling results in the academic literature, but also in the normal processes of internal and external refereeing in consultancy projects. Referees will sometimes point out when a model or period of data should be rejected, perhaps even after publication (see, e.g. Beven, 2009). The basis for such decisions is not, however, always auditable given the information provided.
There are also examples of model inter-comparison projects, where the performance of multiple competing models is assessed e.g. 1D (Environment Agency, 2005) and 2D (Environment Agency, 2010) hydraulic models; the PILPS intercomparison of land surface parameterisations (e.g. Henderson-Sellers et al., 1996); the MOPEX comparison of hydrological models of Duan et al. (2006); the distributed model intercomparison project (DMIP, Smith et al., 2013); or the benchmarking of land surface parameterisations of Nearing, Mocko, et al. (2016). However, many of these models are often set up to test cases that they ought to be capable of reproducing and not cases that Other model intercomparison exercises (such as DMIP and PILPS), however, have allowed for validation on an observed data set that was not available to the different modelling teams with and without prior model calibration. The results of both these exercises were instructive. In the DMIP project (Smith et al., 2013), a collection of distributed hydrological models was compared with a lumped conceptual model as a benchmark. Performance of the distributed models based on prior estimates of the parameters was variable relative to the benchmark but was improved in all cases by calibration against observed discharge data (without assessment of data uncertainty).
Some of the models performed poorly in terms of long-term bias, reproduction of snow water equivalents and simulation of flood hydrographs. The conclusion, however, was that the models satisfied the National Weather Service criterion of success (less than 5% bias in predicted discharges on average), low cumulative runoff errors and high values of modelling efficiency. None of the models were explicitly rejected. This represents a Turing-like Test based on the expertise of 32 authors as hydrological modelling experts and, presumably, some additional referees.
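The summary measures mentioned above (bias in predicted discharge and modelling efficiency) can be made concrete with a minimal sketch. It assumes that 'modelling efficiency' refers to the Nash-Sutcliffe efficiency and that bias is expressed as a percentage of the total observed discharge; the function names and the default threshold argument are hypothetical illustrations and are not taken from the DMIP study.

```python
from typing import Sequence


def _check_lengths(observed: Sequence[float], simulated: Sequence[float]) -> None:
    """Raise if the two series cannot be compared element by element."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("series must be non-empty and of equal length")


def percent_bias(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Bias of the simulated total as a percentage of the observed total.

    Assumption: bias is defined on totals over the evaluation period.
    """
    _check_lengths(observed, simulated)
    total_obs = sum(observed)
    if total_obs == 0:
        raise ValueError("observed total is zero; percent bias is undefined")
    return 100.0 * (sum(simulated) - total_obs) / total_obs


def nash_sutcliffe(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    _check_lengths(observed, simulated)
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    variance = sum((o - mean_obs) ** 2 for o in observed)
    if variance == 0:
        raise ValueError("observed series is constant; efficiency is undefined")
    return 1.0 - residual / variance


def passes_bias_criterion(
    observed: Sequence[float],
    simulated: Sequence[float],
    limit_percent: float = 5.0,  # hypothetical default, echoing the 5% figure in the text
) -> bool:
    """True if the absolute percent bias is below the chosen limit."""
    return abs(percent_bias(observed, simulated)) < limit_percent
```

Note that neither measure takes any account of uncertainty in the observed discharges, which is the limitation of this kind of criterion noted in the text.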
To take one example from the series of PILPS inter-comparison experiments with land surface parameterisations, Nijssen et al. (2003) compare 21 different model formulations in an application to a large-scale catchment in Northern Scandinavia. The models gave highly variable results, albeit capturing the 'broad patterns of snowmelt and runoff'. Some models showed improved performance after calibration on smaller catchment data. One was rejected after failing a 'consistency test' (a form of verification based on an internal water balance error of more than 3 mm per year). The greatest differences occurred during the snowmelt period, but the authors noted the difficulty of interpreting the differences because of the complexity of the schemes and dependence on the chosen parameter sets. In this case, 26 authors chose not to reject any of the remaining 20 models. This, perhaps, demonstrates a greater allegiance to an epistemic community than to getting the right results for the right reasons. A more recent study in the same region, however, showed that multiple land surface models failed to capture the information content of the observations (as captured in an entropy measure) to the same extent as a purely data-based method (Nearing, Mocko, et al., 2016). The conclusion of that study was that the physical basis of these models added no information towards explaining the data.
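A consistency test of the kind described above is a check on the model's internal bookkeeping rather than on its agreement with observations. The sketch below is a hypothetical illustration only: it assumes an annual water balance of the form precipitation minus evaporation minus runoff minus change in storage, which may differ from the exact formulation used in the PILPS experiment, and all names are invented for the example.

```python
def water_balance_error(
    precipitation_mm: float,
    evaporation_mm: float,
    runoff_mm: float,
    storage_change_mm: float,
) -> float:
    """Annual closure error (mm per year) of a simulated water balance.

    Assumed form: P - E - Q - dS, all taken from the model's own output.
    A value of zero means that the model conserves mass over the year.
    """
    return precipitation_mm - evaporation_mm - runoff_mm - storage_change_mm


def passes_consistency_test(
    precipitation_mm: float,
    evaporation_mm: float,
    runoff_mm: float,
    storage_change_mm: float,
    tolerance_mm_per_year: float = 3.0,  # echoes the 3 mm per year figure in the text
) -> bool:
    """True if the absolute closure error does not exceed the tolerance."""
    error = water_balance_error(
        precipitation_mm, evaporation_mm, runoff_mm, storage_change_mm
    )
    return abs(error) <= tolerance_mm_per_year
```

Passing such a check says only that the model is internally consistent (verification); it says nothing about whether the model is getting the right results for the right reasons.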
The above discussion implies that application of a Turing-like Test will need to evolve in a much broader and pro-active sense, beyond traditional model-data comparison. Identification of fitness-for-purpose implies a wide spectrum of influences on what is both fitness and purpose. That is, the expert judgement needs to happen 'upstream', and itself be subject to a Turing-like Test (e.g. as part of defining the tender documents in commissioning research), before any kind of application of such a test to a model study. It may also be worth considering whether the idea of the Turing-like Test could be made more quantitative, even for cases involving significant epistemic uncertainties, based on a formal expert elicitation of what might be expected in terms of model capabilities for a given application. It may also require some reflection upon more than just the end point of the modelling process (when a modeller thinks that they have got the model as good as they can get). Turing (1950) did not deny that computers had to be made to imitate, at least until they were able to learn how to imitate themselves. A focus on imitation as an end point, then, overlooks the performative nature of modelling in hydrology (see Lane, 2012) where performance is not only the end point but also the all-too-rarely documented steps that a modeller goes through to develop trust that their model is providing a correct imitation.

| SOME PRINCIPLES FOR A TURING-LIKE TEST FOR MODEL PLAUSIBILITY/(IN)VALIDATION AND FITNESS-FOR-PURPOSE
2. Models should not be expected to perform better than the observed data on which runs are based and evaluated. A critical evaluation of the data for consistency and uncertainties, independent of the model being studied, should therefore be a prerequisite for model evaluation.
3. Models should not contradict secure evidence on the nature of system response and still be considered fit-for-purpose.
4. Evaluation should have the aim of getting the right results for the right reasons and not focus only on the need to make a decision.
5. Evaluation should allow for the possibility that all models might be rejected (invalidated) using criteria that allow for input and other observational uncertainties.
6. Past performance provides the only information about future performance, but the results of a Turing-like Test will always be conditional and the problem of induction and possibility of future surprises remain.
7. Achieving objective evaluation of models in the face of epistemic uncertainties can be a challenge: evaluations and evaluators should themselves be evaluated in terms of their fitness-for-purpose.
8. The basis for the definition of 'fitness' should be recorded in an audit trail that will allow later review of the process, including the expected sources of uncertainty. This audit trail should include an account of the activities the modeller has used to gain trust in the model they think is fit for purpose.

| SO WHAT SHOULD A TURING-LIKE TEST FOR MODELS LOOK LIKE?
These principles do not, however, provide a sufficient basis for an evaluation methodology. In particular, a model might be considered useful even if it explicitly omits some evidence about the nature of system response, to simplify model structure and implementation, while still providing an acceptable match to key observations. Clearly, some types of evidence are more important than others in informally applying a Turing-like Test for a particular application, so that what constitutes acceptable performance will be context dependent. In particular, when the epistemic uncertainty of input data is likely to be significant, it will be very difficult to construct realistic realizations of the input uncertainties, and consequently to form any expectations of performance in reproducing observational data after the inputs are processed through a nonlinear model structure. This is an important issue in many domains of environmental modelling, but one that is often ignored. Thus, we need to think carefully about setting limits of acceptability in such cases. This requirement for thoughtfulness is the most important aspect of the pro-active Turing-like Test methodology being proposed.
To pass a Turing-like Test, a model must provide outputs that convince some set of relevant 'experts' that it is an adequate, acceptable or a behavioural representation of the response of interest for a particular purpose. This judgement should allow for uncertainties in the available data, and for the potential biases of the experts themselves and, in particular, their relative cognitive behaviour (that is their tendency to be fox-leaning or hedgehog-leaning after Tetlock, 2006).
It is relative because it depends on the stance of the expert to the wider modelling framework and approach, as well as its detail, that is being assessed. This suggests a way for defining an appropriate Turing-like Test based on methods that have been developed for expert elicitation (see e.g. Aspinall & Blong, 2015; Aspinall & Cooke, 2013; Cooke, 2014; Krueger et al., 2012; O'Hagan et al., 2006). The Classical Model Structured Expert Judgement method (Cooke, 1991), for example, weights the judgement of experts according to their performance on a preliminary set of questions in the relevant domain of expertise before they give advice on a particular application. This approach has been generally found to give better results than equal weighting. In the Bayesian approach of O'Hagan et al. (2006), distributions associated with the required information can be updated as more information is obtained from experts and model evaluations. Given the potential for epistemic uncertainties in observed data, models, and expert knowledge, however, fuzzy approaches to expert elicitation might also have value (e.g. Krueger et al., 2012). Such approaches can provide a structured framework for setting limits of acceptability in model evaluation.
In recent applications of the GLUE methodology, model evaluations have been based on setting sensible limits of acceptability before viewing the model outputs. The application of constraints in this way has much in common with the blind validation approach of Ewen and Parkin (1996) but can take more explicit consideration of the potential for epistemic uncertainties in the input and evaluation data. Tests might include both data specific to the application catchment, and hydrological signatures for the expected behaviour in different climates, geologies and land uses, an approach that has been used within the prediction of ungauged basins framework (e.g. Gupta et al., 2008; Yilmaz et al., 2008; Wagener & Montanari, 2011; Kelleher et al., 2017) and in assessing climate impact models (Wagener et al., 2022). Such an exercise will focus attention on both the potential sources of uncertainty and what we might realistically expect of a model given the data limitations in any modelling project (see Part 2 of this article). It is also consistent with principle 5 above in that it does not preclude model invalidation (e.g. Dean et al., 2009; Graeff et al., 2009; Page et al., 2007; Parkin et al., 1996).
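The idea of setting limits of acceptability before viewing model outputs can be illustrated with a minimal sketch. It is not the GLUE implementation: the `Limit` class, the function names and the `min_fraction` parameter are hypothetical, and the limits are assumed to have been derived beforehand from an independent assessment of the uncertainty in each evaluation observation.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Limit:
    """Lower and upper limit of acceptability for one evaluation observation."""

    lower: float
    upper: float

    def contains(self, value: float) -> bool:
        return self.lower <= value <= self.upper


def fraction_within(limits: Sequence[Limit], simulated: Sequence[float]) -> float:
    """Fraction of simulated values that fall inside their limits."""
    if len(limits) != len(simulated) or len(limits) == 0:
        raise ValueError("limits and simulated values must be non-empty and aligned")
    hits = sum(1 for limit, value in zip(limits, simulated) if limit.contains(value))
    return hits / len(limits)


def behavioural_models(
    limits: Sequence[Limit],
    runs: Dict[str, Sequence[float]],
    min_fraction: float = 1.0,  # assumption: every observation must be bracketed
) -> List[str]:
    """Names of the model runs that satisfy the limits of acceptability.

    The returned list may be empty: the evaluation allows for the
    possibility that all models are rejected.
    """
    return [
        name
        for name, simulated in runs.items()
        if fraction_within(limits, simulated) >= min_fraction
    ]
```

Because the limits are fixed before any model is run, the outcome cannot be adjusted after the fact to retain a favoured model, and an empty result is a legitimate outcome in the sense of principle 5. Recording the limits and the reasoning behind them would form part of the audit trail called for in principle 8.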

| IMPLEMENTATION OF THE TURING-LIKE TEST CONCEPT
There is an interesting logical conflict here. A scientific model is only ever conditionally valid, subject to further testing, but needs to provide reliable evidence if the model outputs are to be used in inferences or decision- and policy-making (e.g. Frigg et al., 2014; Roussos et al., 2021). Reliable evidence implies that a simulation model should be right for the right reasons or fit-for-purpose, rather than just demonstrating some success in reproducing how the system has worked in the past (a purely data-based model can usually do that, sometimes better; Young, 2013; Nearing et al., 2021; Nearing, Mocko, et al., 2016; Nearing, Tian, et al., 2016). In Part 2 of this study the implementation of these concepts using limits of acceptability will be discussed and an illustrative example application will be developed in the context of hydrological and hydraulic modelling.

ACKNOWLEDGEMENTS
The origins of the ideas in this article were developed whilst KB was a Herbette Scholar at the University of Lausanne. Further work on the papers has been carried out under the NERC funded Q-NFM project NE/R004722/1, led by Nick Chappell. Thanks are due to the original referees on these papers, Thorsten Wagener and Erwin Zehe, for comments that led to improvements.

DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.