A system of metrics for the assessment and improvement of aquatic ecosystem models

https://doi.org/10.1016/j.envsoft.2020.104697 Received 8 October 2019; Received in revised form 2 March 2020; Accepted 10 March 2020 Environmental Modelling and Software 128 (2020) 104697 2 advances in scientific understanding (Flynn, 2005; Anderson and Mitra, 2010; Oliver et al., 2012; Hellweger, 2017), are not interdisciplinary (Mooij et al., 2010; Ward et al., 2019), fail to assess or reign in uncertainty (Arhonditsis et al., 2006, 2008a; Dietzal and Reichert, 2012), and “fail to fail” when they should fail if they were true tests of conceptual understanding (Franks, 2009). Despite continuing advances in process understanding of aquatic biogeochemistry and ecology, and the emergence of a plethora of model approaches and platforms in the literature, it could be argued that the level of predictability in many practical applications of AEMs has not significantly improved over the past two decades (Arhonditsis et al., 2014). These challenges motivate us to find new ways to assess our models so that we can transparently compare different models and model approaches, whether they are simple or complex, and understand the level of predictability they provide. Oreskes et al. (1994) argued the use of models is heuristic, which is consistent with a common view that decisions about defining when a model is “validated” are largely ad hoc. The chosen level of validation often depends on available pre-existing data sets, the background and experience of the individual modeller and a general desire to report favourably on the performance of the model. A common framework and established standards for model assessment and documentation would create opportunities for synthesis between diverse model studies and transferability of knowledge between applications. This would facilitate us to benchmark AEMs, that is, to compare which models are better under which circumstances, and what level of complexity tends to achieve a given level of accuracy and performance for a given application context. Here, we review current approaches that can be used to assess the performance of AEMs, and formalize a general strategy to improve confidence in model predictions. The shortcomings of current model assessment and the need for a systematic approach are detailed further in Section 2. A framework is outlined in Section 3 for the hierarchical assessment of models to encourage modellers to assess not only state variable predictions, but also process behaviour and system-scale dynamics. A range of metrics and characteristic signatures relevant to aquatic ecosystem condition are exemplified for a range of physical, chemical and ecological contexts in Section 4, spanning the diversity of aquatic system models from ponds and lakes to the global ocean. Our goal is to demonstrate the utility of hierarchical assessment to more comprehensively assess when a model is “fit for purpose”. In doing so, our aim is to create standards, a common vocabulary, and encourage a more rigorous, multi-level approach to model assessment that will facilitate comparisons among different AEMs, their applications, and thereby increase their predictive value and usefulness. 2. Approaches to model assessment and need for a standard framework In a review of model assessment frameworks, Pohjola et al. (2013) highlighted four kinds of model assessment: (i) quality assurance (best practise procedures); (ii) uncertainty analysis; (iii) technical assessment; and (iv) assessment of effectiveness in achieving social, environmental, or policy outcomes. In this paper, we focus on the technical assessment of model performance. We note here the extensive history of literature describing appropriate measures of fit for objectively evaluating model performance (Mayer and Butler, 1993; Power, 1993; Alewell and Manderscheid, 1998; Stow et al., 2003, 2009; Bennett et al., 2013; de Mora et al., 2013; Kubicek et al., 2015), providing guidance on specific mathematical measures of model accuracy. In general, these include varied methods for error calculation, correlation and model efficiency measures (Table 1). We do not focus this analysis on the specific suitability of these measures, but rather on providing a framework within which they can be used. For a review of the strengths and weaknesses of different measures of model-data comparison see the summary by Fig. 1. Examples of two different aquatic ecosystem model conceptualisations, focused on (a) resolution of physical and biogeochemical processes and interactions, and (b) resolving trophic complexity and interactions. Both examples have been used in both fresh and marine studies. M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 3 Bennett et al. (2013) and previous discussions by others (Elliott et al., 2000; Allen et al., 2007). To-date, no well-established guidelines have been developed on how to approach aquatic ecosystem simulation and how to decide when models are fit for purpose. The reason can be found in the large diversity of model approaches, which is compounded by the fact that AEMs have been applied over wide environmental (lake, river, ocean) and application contexts (e.g., forecasting, system understanding, scenario comparison etc.) (Janssen et al., 2015). For any given application context, typical approaches in the literature adopt a wide range of variables, spatial dimensionalities, spatial scales and simulation time-frames. Whilst the diversity of these applications is a good thing overall, the consequence has been that it is difficult to compare model performance between studies and approaches in a way that allows a clear definition of the limits of their predictions. In the past decades there has been limited true improvement in the level of model predictability. Given the rapid uptake of AEMs (e.g. Trolle et al., 2012; Janssen et al., 2015), it would be expected that over time, higher quality datasets, more refined model process descriptions and increasing computer power would lead to improved predictions. This is of course true in some cases, however, there is evidence that in general terms model approaches and their ability to accurately capture trends in observed data have not considerably improved compared to some of the pioneering studies (e.g., Thomann and Fitzpatrick, 1982). This trend was first noted by Arhonditsis and Brett (2004) and subsequent analyses have pointed to a similar conclusion (e.g., Arhonditsis et al., 2006; Robson, 2014b; Paraska et al., 2014; Arhonditsis et al., 2014). During routine applications of models, we anecdotally hear that use of R2 is a “waste of time” for coupled hydrodynamic-biogeochemical model variables, and that modellers often resort to “chi-by-eye”. That is, the modeller analyses the model output in the context of the quality and noise in the data and uses a subjective visual comparison as the ultimate determinant of suitability. Kubicek et al. (2015) found that 92% of studies reported in the journal, Ecological Modelling, reported visual inspections as the only or main method of model evaluation. This approach is not always due to the lack of willingness by the modeller to validate more deeply, but rather due to confusion as to what features need to be tested within the specific application. The question therefore remains: are there ways in which we can further refine our efforts so that they may lead to more robust model predictions? A large challenge remains to improve parameterisation of AEMs. For most applications, information exists about the values and variability of the state variables, but little is known about the values of the model parameters. Modellers then adjust parameters to find the best agreement between modelled and observed data, either by adjusting the model parameter vector by trial and error (manual calibration) or through optimisation algorithms. The advantage of optimisation methods is that they are objective and repeatable methodologies that are more likely to result in an optimal parameter set. Any significant lack of fit is then due to the inadequacy of the model structure and not due to poor parameter choice (Chapra and Canale, 2010). The conventional practice of seeking a single “optimal” parameter set reflects two major assumptions: (i) that Table 1 Summary of quantitative techniques used for assessment of aquatic ecosystem models. Refer to Bennett et al. (2013) for more detailed overview and categorisations of available assessment approaches. Abbreviation Technique Description V Visual inspection Visual inspection of time-series, TS, is often undertaken (“chi-by-eye”), but weak relative to the quantitative metrics listed below. Visual inspection of more complex model outputs may be warranted, and is frequently undertaken, for example: TZ: Time vs Depth for vertical profile assessment XZ: Distance vs Depth for cross section (“curtain”) assessment XY: Plan view spatial (“sheet”) comparison TX: Time vs Distance contour comparison BIAS, MAE, NMAE Bias Mean Absolute Error Normalised Mean Absolute Error BIAS 1⁄4 1 n Xn i1⁄41 ðPi OiÞ; MAE 1⁄4 1 n Xn i1⁄41 ðjPi OijÞ; NMAE 1⁄4 MAE O ; Can be applied to demonstrate localisation of error in spatial models, denoted as MAE(ΔXZ), for example. RMSE Root Mean Square Error RMSE 1⁄4 ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi PN i1⁄41ðPi OiÞ 2 N s MEF, NSE, B Model Efficiency, Nash-Sutcliffe Efficiency, Bardsley coefficient MEF 1⁄4 NSE 1⁄4 1 PN i1⁄41ðPi OiÞ 2 PN i1⁄41ðOi OÞ 2 Nash and Sutcliffe (1970) & Murphy and Epstein (1989); Bardsley (2013) presents a variant that better accounts for bias: B 1⁄4 R2 2 MEF d2 Index of Agreement Model Skill Score Willmott index d2 1⁄4 MSM 1⁄4 PN i1⁄41 jPi Oij 2 PN i1⁄41ð � �Pi O � �þ � �Oi O � �Þ 2 Willmott (1981) introduces the above skill score to consider both correlation and variance, allowing a choice regarding how to weight error at extremes versus around the mean. R, SR Correlation coefficient Spearman Rank Correlation R 1⁄4 PN i1⁄41ðPi PÞðOi OÞ 1⁄2 PN i1⁄41ðPi PÞ 2PN i1⁄41ðOi OÞ 2 � 1 = 2 SR 1⁄4 1 6 P nτ nðn2 1Þ where τ is the difference in rank between the predicted and observed TD Taylor/Target Diagram Visualisations of patterns of statistics to assess and quantify model performance; see Taylor (2001) and Jolliff et al. (2009). DF Distribution Functions Plots showing variance of data set, including box-plots, violin plots and cumulative distributions FFT, WT, WC Fast Fourier Transform Wavelet Transform Wavelet Coherence Approaches to demonstrate spectral power localisation at distinct frequencies; Wavelet transform demonstrates this localisation as a function of time (or space); Wavelet coherency computes the correlation in wavelet power between variables. CCF, ACF Cross-correlation Function Autocorrelation Function Measure of the similarity of a series to itself (autocorrelation) or another series (cross-correlation) as a function of displacement in time. M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 4 there exists a single calibration vector for faithfully reproducing a wide range of ecosystem dynamics; and (ii) that our empirical knowledge (monitoring data, experimental work) adequately depicts the patterns of the "real world", and thus offers an objective standard for testing our models. The credibility of both statements has been extensively debated in the literature and there are sound arguments to cast doubt on the legitimacy of such a deterministic approach to mathematical modelling. A recent attempt to address knowledge gaps and improve practice in setting parameter values has been made by Robson et al. (2018), who provide a tool to facilitate modellers’ exploration of evidence for process rates and traits relevant to biogeochemical model parameters. Nonetheless, model practitioners often encounter the problem that several distinct choices of model inputs result in equivalent model outputs, i.e. many sets of parameters fit the data equally well. The nonuniqueness of the model solutions, known as equifinality (Arhonditsis et al., 2008a), is a consequence of insufficient data or in the case when internal process pathways are of substantially higher order than what can be externally observed (Beck, 1987). As a result, our ability to set quantitative (or even qualitative) constraints on model ecological structure is significantly reduced, and thus we are often faced with a situation whereby our models give “good results for the wrong reasons” (Arhonditsis et al., 2007). Aside from parameter uncertainty, models contain errors that arise from its structure or its inputs (Omlin and Reichert, 1999). Model structural error is associated with (i) errors in the selection of appropriate state variables or processes to reproduce ecosystem behaviour, (ii) errors and necessary simplifications in selection of mathematical formulations for describing the processes, and (iii) the fact that our models are based on equations derived from controlled laboratory environments that may not yield an accurate picture of the real world variability in biological systems and complicated interactions between forcing factors (Hellweger, 2017). Essentially, models are simplifications of reality, and all parameters are effectively applied as spatially and temporally averaged values that in reality are unlikely to be represented by fixed constants. In addition, it should be recognised that observational data are also uncertain approximations, and are in fact also models of reality. An important and increasing area of model application is to capture shifts in system function. Shifts occur in response to a varied range of external drivers, such as climate change, the cumulative loading of nutrients and pollutants and/or management measures (e.g., Trolle et al., 2008; Skerratt et al., 2013). In this case, modelled ecosystems are non-stationary; they are being pushed out of their typical state-space range upon which they were trained and predictability in the past is no guarantee a model can capture future trajectories. From the broader ecological literature, we know that ecosystems are vulnerable to deterioration when key system functions are pushed over thresholds, resulting in the loss of resilience and the emergence of a regime shift (Scheffer et al., 2009). Often these dynamics are at the core of what we need models to help us understand, yet we have limited confidence that the models are capturing these shifts and non-linear ecosystem dynamics (Hipsey et al., 2015). The lack of universally accepted performance criteria impedes our capacity to impartially determine what an acceptable model is. Thus, an emerging imperative in the field of aquatic systems modelling is the development of a predetermined standard, considering model complexity, the spatiotemporal domain or even the question being asked. Given the complexities highlighted above and the diversity of simulation contexts, this is unlikely to take the form of a simple cut-off value for an acceptable goodness-of-fit metric. In some cases, it is sufficient to be able to confidently predict that one management option will have a better outcome than another (e.g. shorter algal bloom duration), while in other cases, the decision will hinge on being able to predict how much better a certain scenario may be. Hence, we need to be able to evaluate not only how uncertain our prediction may be, but also what our models can confidently predict (e.g. that a bloom will occur) despite prediction uncertainty. In light of these conceptual and technical challenges, there are several areas where the AEM community would benefit from improved tests and reporting of model performance. These include: � greater emphasis in model publications to highlight the assessment approach and the variables validated (or tested in sensitivity analysis); � adoption of assessment standards to facilitate inter-comparison of diverse model approaches; � improved assessment of process pathways in models as a means to help resolve concerns around equifinality, including assessment of spatio-temporal variability in process rates; � exploration of the degree to which different scales of variability are captured by models, considering not just state variables, but also flux pathways; � assessment of model performance in reproducing theoretically relevant, system-scale responses – that is, even if models capture trends at a sampling point, they must also demonstrate ability to capture emergent behaviours; and � employing a wider range of validation approaches able to accommodate new monitoring technologies and high-frequency data streams to support model-data fusion efforts. Addressing the criteria above will improve credibility and transparency in aquatic ecosystem modelling. For example, capturing emergent properties is particularly relevant when models are used to explore non-stationarity (systems undergoing change), where a model may be out of its calibrated range, or if the goal is to explore uncertain future conditions like climate change. We used the list to formulate the core principles of a comprehensive model assessment framework that can be used to understand, discuss and evaluate underlying principles in AEMs, and to broaden their applicability and transferability. 3. Overview of the multi-level assessment framework The CSPS framework introduces a hierarchical assessment of a range of metrics and theoretically-relevant signatures relevant to aquatic ecosystem structure and function, depicted schematically in Fig. 2. The approach includes four levels of assessment, with an apriori or preapplication assessment of the model (indicated as Level 0), while the other three levels are post-simulation assessments of the model. The four levels are summarised as: 0. Concept: Conceptual validation to ensure that sub-models are consistent with ecological theory and valid over the range of conditions for which the model will be applied; 1. State: Comparison of simulated state variables with observed properties; 2. Process: Comparison of simulated energy and mass fluxes with measured process rates; and 3. System: Comparison of system-scale emergent properties, patterns and relationships with observed and theorised phenomena. These levels are further described generically below given our desire for consistency across both inland and marine waters. Specific examples for different application contexts and specialisations relevant to the modelling community are expanded upon in Section 4. 3.1. Level zero: Conceptual validation A process for model assessment that includes conceptual validation of model structure and sub-model algorithms is a precursor to empirical validation of the model as a whole (Bert et al., 2014). At this level, we ask, “does the conceptual basis of the model accord with current scientific understanding of how the system functions?” This level of evaluation is usually undertaken explicitly when developing a new model, M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 5 Fig. 2. Conceptual diagram illustrating the CSPS model and assessment workflow highlighting the shifting focus from assessment of state, to process, and ultimately to ecosystem function. M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 6 but is often overlooked when applying an existing model in a new context or when it is adapted to include additional functionality. Questions to consider include: � Does the model structure adequately reflect our conceptual understanding of the system and its key drivers, having regard for processes that may change within the range of scenarios being considered? An example of a situation in which a previously successful model may fail this conceptual validation is when a model developed for a deep-sea application is to be applied to a coastal ecosystem, where a more detailed representation of benthic processes is needed. Another example is where a model previously applied within a narrow range of temperature conditions is to be applied to climate change scenarios and the conceptual basis of temperature response functions may need to be reconsidered. � Does the mathematical representation of ecosystem processes produce realistic system dynamics? For example, the widely-used Michaelis-Menten kinetics are best suited to steady-state conditions and may be considered dysfunctional in dynamic simulations where nutrient concentrations vary considerably (Flynn, 2005; Frassl et al., 2014; Hellweger, 2017). � Does the model structure reflect recent advances in ecosystem understanding that may be important in this application? For example, many ecosystem models in common use have not been updated to include anammox or dissimilatory nitrate reduction to ammonium (Robson, 2014a,b). � Does the model reflect the system understanding of relevant local disciplinary experts and stakeholders? If not, this may be a hurdle to acceptance of decisions based on model results, as well as not making good use of local knowledge. � Is the model mathematically valid and dimensionally consistent? � Is their evidence that the implementation of the model as software correctly reflects its conceptual and mathematical basis? For example, does it maintain conservation of mass? � Where two or more possible model structures have been identified, what process has been followed to compare the options? In some cases, it may be appropriate to develop an ensemble of models to test the range of possible results and the trade-offs in speed and accuracy. This conceptual validation phase should be revisited after state, process and system validation to consider whether the results suggest the need for re-evaluation of the model structure or implementation. 3.2. Level one: State validation The comparison of time-series of physical, chemical and biological state variables is the main form of model validation. However, it is not the only means by which the accuracy of model state can be assessed, and when used in isolation may not give a complete picture of model performance. In most cases, the frequency of observations is significantly less than the time-step of aquatic models, particularly in the case of coupled hydrodynamic-biogeochemical models. For variables that exhibit rapid changes in time, such as algal biomass during a bloom, standard metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) can be fundamentally weak (Elliott et al., 2000). Similarly, in terms of a spatial distribution, an in situ observation is usually the mean of a small sample volume, being compared to model output that may represent the mean value of a much larger quantity of water. Various alternatives for assessing state can be considered, including assessment of scales of variability, and derived metrics, whereby observations and simulated data are subject to some form of transformation. 3.2.1. 1a – Direct comparison Classically, model simulation results are compared with measured data at specific points in time and space where the measurement was taken. This type of assessment is undisputedly where most effort in model validation has been concentrated and will no doubt remain the focus of most model assessments. The variables selected for assessment are inherently linked to the nature of the investigation and associated choice of model approach and structural complexity. Nonetheless, there has been a tendency for modellers to validate against an arguably small subset of simulated water chemistry variables, a product of both a desire to emphasise aspects of model output relevant to the application, but also commonly due to lack of observations for remaining state variables. Measures of model fit (as in Table 1) can be computed for one or more sampling stations and are generally based on calculation of residuals (i.e., the difference between model outputs and the corresponding observations). A potentially useful exercise that adds to the common error calculation, is to calculate skewness or kurtosis in predictions. Spatial assessments of multi-dimensional models against data from remote sensing platforms are increasingly being introduced. Other methods do not necessarily involve error calculation but rather assess patterns in data series. For example, Spearman Rank correlation (SR) may be more useful to identify the degree to which the order of predictions and observations from small to large magnitude are captured, for example, where assessment of the ability of a model to reproduce seasonal and inter-annual variability is required without focusing on exact values. The use of a cross correlation function (CCF), can allow a modeller to look for correlations in how the simulated and observed data vary with a delay across time and may be used for looking at lags in timeseries, or alternatively may be useful in spatially resolved models to determine measures such as patch length. 3.2.2. 1b – Derived metrics describing model state This category refers to metrics that do not involve a direct assessment of state variable time-series or spatial data, but are derived from the simulated variables. The focus of this class of metrics is to test the model against theoretically relevant indicators of ecosystem state, such as relationships between variables. They can provide additional evidence for demonstrating that a model is fit for purpose even if direct value comparisons (e.g., R2) are not possible due to data limitations or if direct validation indicated a weak level of predictability. Examples of derived metrics could include simple stoichiometric indicators (e.g., TN:TP), or other ratios of simulated variables (e.g., DOC:TOC; Chl-a:TSS). Other derived quantities include assessment of relevant dimensionless numbers, for example, the Richardson number as a measure of stratification intensity, or in the case of algal bloom dynamics, metrics derived from analysis of the raw state variable time-series, such as the average duration of a bloom or time of bloom onset. 3.2.3. 1c – Metrics describing multi-scale variability in model state Metrics at this level describe how well various scales of temporal or spatial variability are reproduced in our models. Depending on the model structure, its spatial dimension and time-step, any given simulation will have limits on the scales that it can predict. The inherent difficulty in defining what model approaches are most appropriate for a given scale (e.g., at what temporal scale would a 1D and 3D model converge?) remains a large challenge in the AEM community. Simulations may be assessed from a probabilistic point of view, for example, through comparison of exceedance probabilities (CDFs) in order to ascertain whether a model is able to capture the proportion of time (or spatial domain) over which a certain concentration is experienced. However, assessment of probability distributions alone may not be suited under non-stationary conditions, and they do not inform us if expected modes of variability are adequately represented. For example, a coupled hydrodynamic-biogeochemical model might be designed to reproduce phytoplankton biomass in response to diurnal changes in productivity, but also over intermediate scales due to dynamic hydrological and meteorological conditions, over seasonal scales due to changes in temperature and nutrients, and potentially up to decadal scales if the fundamental drivers of phytoplankton biomass are shifting. M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 7 Approaches to assess the performance of models over these scales are described in Bennett et al. (2013) as “data transformation methods” and relevant techniques include Fourier transformation (FFT) or wavelet transforms (WT). To date there has been limited application of these methods for assessing AEMs, partly due to the fact that there are few cases of modelled applications that have observational data that span such a large range of time-scales (reviewed by Kara et al., 2012). The routine application of real-time sensors in aquatic systems over the past decade, however, is providing diverse datasets that span from time-scales of minutes to decades, and offer new opportunities to assess models through this approach (e.g., Hamilton et al., 2015). Where spatially rich observational data are available, these approaches can also be used to examine model performance across multiple spatial scales. Spatial maps can be quantitatively compared with observational maps, particularly from remote sensing observations, using a variety of methods (Stow et al. (2009). 3.3. Level two: Process validation Process validation refers to assessment of model performance against the underlying rates of transformation that drive changes in model state variables (i.e., the arrows connecting “stocks” or “pools”); process validation is therefore specific to process-based models. Indeed, the most commonly cited advantage of process models is their ability to resolve the interaction of the different mechanisms that shape ecosystem state, yet rarely do we rigorously assess whether they are correctly captured. This is particularly relevant if we consider that most model applications adopt process parameterisations reported broadly in the literature, potentially from sites that may be inherently different, or from laboratory or mesocosm studies conducted under controlled conditions. Similarly, the associated parameter estimates for these algorithms may also be chosen from within large ranges reported from diverse model applications, or from laboratory assessments such as phytoplankton or sediment incubations. Therefore, it is not obvious that models with complex interactions accurately represent spatial and temporal variability in process pathways correctly, even if several model state variables are seemingly reproduced well at Level 1. Consequently, validating models with regard to the individual flux pathways that connect individual state variables is a way of reducing equifinality, helping modellers to get “good results for the right reasons”. Comparing modelled with measured flux rates also provides a way to pinpoint sources of structural and conceptual error in the model. In practice, this approach remains rare in aquatic ecosystem modelling, as measuring variability in process rates through time and/or space is resource-intensive and difficult, and in some cases may not yet be directly possible in situ. However, given that it has the potential to greatly improve confidence in the underlying function of models, we believe it should be actively promoted, and several examples are outlined in the following sections that may be more routinely adopted. These are classified next as either being from direct measurements or indirect rate estimates. 3.3.1. 2a – Comparison with raw process measurements When developing a model, it is necessary for us to distinguish between process measurements that are required to assist model parameterisation and parameter assignment, and process measurements that can be used for validation. A range of in situ process measurements are relevant to assess physical, chemical and biological model attributes and their temporal and spatial variation. These include rates of mixing, fluxes across the air-water or sediment-water interfaces, and kinetic transformations (e.g. nutrient uptake, rates of primary production, or grazing). Specific examples highlighted in Section 4 relevant to a range of different model applications are reviewed. 3.3.2. 2b – Process metrics interpreted from raw data Process information can also be extracted from raw data series, either derived from changes in the observed data record via inverse modelling, or potentially through more sophisticated data-driven models designed to estimate bulk process rates. As a simple example, the trend of oxygen depletion in the hypolimnion of a lake may be used to estimate the net sediment oxygen demand if other sinks are minor. The use of environmental tracers and isotopic data also has potential to support validation of the flow of oxygen, carbon or nitrogen, for example, if assumptions are made about the relative fractionation and transformation rates that occur for individual process pathways. These approaches are not reported widely in the literature, and usually depend on the validity of several simplifying assumptions that are made when interpreting the data. However, it is highlighted here as an area of increasing interest for the growing area of model-data fusion, as a means to compare modelled process rates against those estimated from empirical means (Robson, 2014a; Hipsey et al., 2015). 3.4. Level three: System validation The complex non-linear interactions and feedbacks that govern the response of aquatic systems to changes in internal or external conditions can lead to ecosystem-scale emergent patterns, relationships and dynamics. These emergent properties are not necessarily predictable directly from the underlying model formulation, and are outcomes from the model that are “not a direct extrapolation of the choices made in model design” (Allen, 2010). A classic example of an emergent property is the behaviour of a flock of birds in flight, which emerges in a way that is not obvious from the behaviours of individual birds. Although not widespread, there are several examples where aquatic models have been assessed in terms of their ability to produce emergent dynamics, though generally not with the direct purpose to refine model accuracy or to justify whether it is fit for purpose. These provide valuable insights to ecosystem behaviour, and we advocate for more effort to be placed in assessing if our models are able to capture higher order behaviour or patterns. As highlighted by Anderson et al. (2010), a different choice of model structure may lead to different emergent dynamics. Where two models perform equally well at Level 1 but predict different emergent system dynamics, these can be treated as competing hypotheses regarding system dynamics, and measurement programmes can be devised to invalidate one or both hypotheses. Examples of system properties that might be captured by models can include simple metrics such as scaling relationships (e.g. nutrient loading vs. chlorophyll-a response), or more complex spatial or temporal patterns in nutrient cycles and community dynamics. In many applications, these patterns may be expected based on empirical experience, such as the succession of different plankton functional groups, or spatial niches in a habitat. Multivariate comparison methods are available to explore performance of models in capturing inter-relationships between variables (e.g., a Taylor Diagram, TD), and these may be particularly useful for identifying cases where models resolve emergent dynamics that are not known a priori. Self Organising Mapping (SOM) is one example, among others, of a machine learning method that may be suited to identification of emergent patterns in both model output and observational data that are rich in two or more dimensions (e.g., Williams et al., 2014). Where observational data are both spatially and temporally rich, Empirical Orthogonal Function (EOF) decomposition can be used to analyse major modes of variance and patterns of variation, which can be compared with the results of the same analysis applied to model output (e.g. Rocha et al., 2019). Another area being explored is the response of ecosystem state-space to perturbations, considering threshold effects, hysteresis and alternative stable states. This level of validation is especially important if the model’s purpose is to define the stability or resilience of key ecosystem attributes to climate change, fishing and/or eutrophication. At this stage only a qualitative or semi-quantitative comparison may be possible. M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 8 4. What metrics should I use when ... ? Examples for different application contexts In this section, we consider the literature through the lens of the framework outlined in Section 3, by combining the framework proposed with specific assessment techniques summarised in Table 1. Given the predominance of Level 1 validation in the literature, we did not target a comprehensive identification of all published model studies that consider validation. Rather, we aimed to collate a range of examples that have been applied across a broad range of application contexts. Through this review, we also aimed to identify gaps and potential areas for further development of new Level 2 and 3 validation approaches. 4.1. Hydrodynamic applications for ecosystem assessment Relative to water quality and ecosystem modelling applications, approaches for characterizing the performance of physical models of aquatic environments are well established, most notably from the engineering and geophysical sciences literature. They share some common metrics across lacustrine, riverine and oceanographic model applications, depending on the underlying hydrology, model dimension and time-frame of the simulation. A range of general metrics have been categorised related to prediction of a) the water balance, waves and water circulation, b) the heat and salt balance, c) stratification, and d) bottom morphometry and sediment transport (Table 2). In general, the validation of hydrodynamic models is achieved by conducting multiple levels of assessment. First, researchers can conduct time-series assessment of water level, temperature and salinity at fixed points, and/or horizontal or vertical variation derived from profile cross sections (Level 1). Simulation of surface ice dynamics can also be undertaken by time-series assessments of ice thickness, complemented with derived metrics such as ice-on and ice-off dates (Level 1, e.g., Yao et al., 2014). For models that simulate surface or internal waves, spectral plots are commonly reported to demonstrate power across waves of different frequency. Derived indices such as the Richardson number as a measure of stratification intensity, or Schmidt stability (e.g., Bruce et al., 2018), are also useful as Level 1 metrics. Considering mixed layer depth (Acreman and Jeffery, 2007; Bayer et al., 2013), or thermocline/pycnocline thickness and changes to surface layer thickness as a function of various driving factors, can prove useful in diagnosing problems with model parameterisation. For cases where periodicity varies over time, wavelet plots can highlight how power is localised in frequency space over different seasons or time-frames. Isotherm/isopycnal displacement power spectra are a useful metric to demonstrate the time-scale over which models are able to reproduce the internal wave field (e.g. Hodges et al., 2000). Level 2 validation of hydrodynamic models can include assessment of evaporative mass fluxes, estimation of albedo, measurement of velocities (at a fixed location or from drifter tracks, e.g., Dissanayake et al., 2019), and shear stresses, for example, at the sediment-water interface or within macrophyte beds. The use of direct measurements or observation-based estimates of turbulent mixing could also be considered, and a range of novel tracers have been adopted for dilution experiments to characterise contaminant dispersion (e.g., caffeine and pharmaceuticals in waters impacted by wastewater effluent, Cantwell et al., 2016). Level 3 assessments in hydrodynamic models consider the formation of residual currents and eddy structures. For example, different, but otherwise similar, models reveal the emergence of different eddy structures in the North Atlantic Ocean (Holt et al., 2014). Hetland and DiMarco (2012) undertook an assessment of a 3D hydrodynamic model of the Texas–Louisiana continental shelf using data from moorings by presenting maps of model skill; in this example, surface and bottom variance ellipses are used to demonstrate the model captures the point-scale and residual field. In lakes and coastal environments, currents created by differential heating and cooling may be assessed by indirectly comparing against profile cross sections (Woodward et al., 2017). Spatial variability in water currents creates patterns of water age distributions that emerge as a system-scale property, but which are not easily able to be validated. The potential for using methods such as assessing against empirical estimates from conservative tracers, for example from radium isotope measurements (Tomasky-Holmes et al., 2013), may be able to be applied in the future. The increasing applications of models for complex aquatic domains (e.g. estuaries, inter-tidal wetlands, river floodplains, reef structures, or island archipelagos) requires efforts to validate the relative pattern of connectivity across modelled sub-domains. At smaller spatial scales, Level 3 assessment of other patterns that may emerge in physical models can be conducted. Examples include travelling waves in spatial patterns of plant-wrack in intertidal zones (Sun et al., 2010), wave attenuation in seagrass beds (Chen et al., 2007), and the canopy structure such as the dynamics of canopy deflection properties (Dijkstra and Uittenbogaard, 2010). Models with morphodynamic ability can also be assessed by examining spatial patterns in temporal changes in bottom morphometry, to highlight active areas of erosion and deposition. 4.2. Water quality and biogeochemistry A common goal for AEMs is to understand the controls and dynamics of chemical and biological variables relevant to water quality. What constitutes a ‘water quality variable’ can vary depending on the application and context, however, for the purpose of this analysis, the literature is categorised according to several areas of focus pertaining to prediction of a) oxygen and the extent of hypoxia/anoxia, b) the cycling of inorganic nutrients and organic matter, c) geochemistry, d) water colour and clarity e) chlorophyll-a, and f) other chemical and biological contaminant dynamics (Table 3). Across these categories most variables being assessed are dissolved or particulate concentrations that are routinely sampled via traditional monitoring programs and subsequent time-series assessments (Level 1), though, a more detailed exploration reveals a broader range of examples that can support a diversity of potential approaches for capturing water column and sediment biogeochemistry. Oxygen has often been a focus of water quality models as it plays a pivotal role in nutrient cycling and sediment processes. Given the increasing availability of sensor data for measuring oxygen concentrations, it also provides an interesting case study for how a system of metrics can assist in model assessment. For this context, a range of more rigorous Level 1 metrics has emerged such as wavelet analysis of highfrequency in situ data series from a stratified lake (Kara et al., 2012), longitudinal analysis of surface and bottom oxygen in estuaries where strong lateral gradients exist (Xu and Hood, 2006), and the spatiotemporal extent of anoxia in a riverine estuary (Bruce et al., 2014) and Lake Erie (Bocaniov et al., 2016). Direct in situ Level 2 sediment flux measurements from benthic chambers or eddy-correlation instruments have provided useful validation of sediment oxygen demand (e.g., Sohma et al., 2008). In another example, Hetland and DiMarco (2008) adopted apparent oxygen utilisation (AOU), the difference between the oxygen concentration and the saturation value, as an indirect process validation metric to demonstrate the model was capturing the combination of oxygen consumption mechanisms. Of increasing interest in lacustrine and marine environments is the application of high-frequency oxygen sensors to estimate free water metabolism (Hanson et al., 2008), where an inverse modelling technique is used to extract hourly to daily estimates of primary productivity, community respiration and atmospheric exchange from diel changes in oxygen concentration (e.g. Lovato et al., 2013; Webster et al., 2005; Wikner et al., 2013; Winslow et al., 2016). While this method provides relatively coarse estimates due to confounding factors of advection and mixing (Villamizar et al., 2014), repeated estimates over a range of environmental conditions provide an in situ view of water column net productivity and respiration that can be M.R. Hipsey et al. Environmental Modelling and Software 128 (2020) 104697 9 Ta bl e 2 Su m m ar y of v al id at io n m et ri cs fo r ph ys ic al m od el s of a qu at ic s ys te m s (r ef er to T ab le 1 fo r as se ss m en t t ec hn iq ue a bb re vi at io ns ). Pr op er ty b ei ng as se ss ed D es cr ip tio n Va lid at io n le ve l Ty pi ca l r an ge o f d at a ob se rv at io n fr eq ue nc y Sp at ia l s ca le A ss es sm en t te ch ni qu e (s ee Ta bl e 1) a Co m m en ts ( e. g. p ro ce ss in g re qu ir em en ts ) Ex am pl e re fe re nc es b W at er b al an ce , w av es & c ir cu la ti on W at er le ve l Ti m ese ri es c om pa ri so n 1a m in ut es -m on th ly po in t; m ul tip le p oi nt s E, R , V (T S) , V (X Y) D ir ec t o bs er va tio n or c al cu la tio n fr om lo gg ed p re ss ur e ga ug e se ns or s, or fr om r em ot e se ns in g ap pr oa ch es ( e. g. , s at el lit e al tim et ry , r ad ar ) M is sa gh i a nd H on dz o (2 01 0) Ti da l p ro pa ga tio n 2b m in ut es -h ou rl y ho ri zo nt al tr an se ct V( TX ) M ag ni tu de o f a tt en ua tio n or a m pl ifi ca tio n of ti da l r an ge w ith in a n es tu ar y or c oa st al e m ba ym en t, pl ot te d as a fu nc tio n of d is ta nc e Su rf ac e w av es Si gn ifi ca nt w av e he ig ht 1a se co nd sm in ut es po in t V( TS ), E, R Fo r m od el s si m ul at in g su rf ac e w av es th e co m pa ri so n of w av e pr op er tie s ca n be u nd er ta ke n Ji ( 20 17 ) W av e le ng th a nd p er io d 1b se co nd sm in ut es po in t V( TS ) Fr eq ue nc y sp ec tr a 1c se co nd sm in ut es po in t FF T, W T Ev ap or at io n Ti m ese ri es c om pa ri so n 2a m in ut es -d ai ly po in t V( TS ), E, R Ev ap or at iv e m as s fl ux d at a ca n be c ol le ct ed fr om a n ev ap or at io n pa n, or fl ux a ne m om et er Ri m m er e t a l. (2 00 9) Ti m ese ri es c om pa ri so n 2b m in ut es -h ou rl y po in t V( TS ), E, R Co m pa ri so n ag ai ns t l at en t h ea t fl ux es d er iv ed fr om e ne rg y ba la nc e fit tin g to s ur fa ce m et er ol og ic al d at a N us sb oi m e t a l. (2 01 7) H 2 O is ot op es 2b ad h oc po in t V( ot he r) Fi tt in g is ot op ic d at a ca n he lp s ou rc e id en tifi ca tio n an d co m pu te ev ap or at io n ra te s ba se d on d ev ia tio n of m et eo ri c w at er li ne St ad ny k et a l. (2 01 3) Ve lo ci ty Ti m ese ri es c om pa ri so n 1a ho ur ly -w ee kl y po in t; ho ri zo nt al tr an se ct V( TS ), E, R V (X Z) , M A E( Δ XZ ), U se o f p oi nt A D CP m ea su re m en ts fo r po in t s ca le o r cr os s se ct io n Va ri an ce e lli ps e 1c ho ur ly -w ee kl y po in t V( ot he r) Su m m ar y of m ag ni tu de a nd d ir ec tio n of c ur re nt fi el d th at c an b e co m pa re d w ith p oi nt d at a H et la nd a nd D iM ar co ( 20 12 ) Re si du al c ur re nt s 3 w ee kl yse as on al su rf ac e la ye r; ho ri zo nt al tr an se ct V( XY ), V( XZ ) Pa rt ic le tr aj ec to ri es fr om m od el s im ul at io ns c an b e co m pa re d w ith tr ac ks fr om d ro gu es a nd /o r dr ift er s re le as ed in th e fie ld . D is sa na ya ke e t a l. (2 01 9) M ix in g M ix in g in te ns ity 2a ad h oc ve rt ic al p ro fil e V( ot he r) Tu rb ul en t d iff us iv iti es d er iv ed fr om S CA M P da ta c an g ui de tu rb ul en ce p ar am et er is at io n Ru ed a an d M ac In ty re ( 20 10 ) Tr ac er d ilu tio n 2b ad h oc su rf ac e la ye r; ho ri zo nt al tr an se ct V( TS ) Ca pt ur in g th e ho ri zo nt al a nd v er tic al d is pe rs io n of a c on se rv at iv e tr ac er ( e. g. , r ho da m in e or c hl or id e) c an e ns ur e di ffu si on is b ei ng ac cu ra te ly c ap tu re d Re te nt io n ch ar ac te ri st ic s W at er a ge v ar ia tio n 3 ad h oc m ul tip le s ite s E, R Th e us e of r ad io is ot op es c ou ld b e us ed to c or re la te s im ul at ed w at er ag e w ith o bs er ve d es tim at es fr om g eo ch em ic al tr ac er s W at er s ou rc e ap po rt io nm en t 3 ad h oc m ul tip le s ite s V( ot he r) U se o f c on se rv at iv e tr ac er s in di ca tin g w at er s ou rc e fr om s pe ci fic su rf ac e or g ro un dw at er in pu ts o r ra in fa ll, e .g ., ca ffe in e, r ad on , e tc . H ea t & s al t ba la nc e Te m pe ra tu re o r sa lin ity Ti m ese ri es c om pa ri so n 1a m in ut es -m on th ly po in t V( TS ), E, R D at a m ea su re d fr om a n in si tu th er m is to r o r s al in ity se ns or , o r a d ho c m ea su re m en t M os t p ap er s pr es en t t hi s Fr eq ue nc y sp ec tr a 1c m in ut es -h ou rl y po in t FF T, W T D at a m ea su re d fr om a th er m is to r or s al in ity s en so r lo gg in g at h ig h fr eq ue nc y Ka ra e t a l. (2 01 2) Sp at ia l c om pa ri so n 1a da ily -m on th ly su rf ac e la ye r V( XY ), M A E( Δ XY ), d 2 Sa te lli te a cq ui re d te m p da ta ( e. g. , L A N D SA T, M O D IS e tc ) co m pa re d pi xe l f or p ix el w ith s im ul at io n. M od el o r d at a m ay re qu ir e av er ag in g to e ns ur e sp at ia l r es ol ut io ns m at ch Sp ill m an e t a l. (2 00 7) M � en es gu en e t a l. (2 00 7) Sp at ia l v ar ia bi lit y 1c da ily -m on th ly su rf ac e la ye r D F Co m pa re s di st ri bu tio n an d ra ng e of T o r S va ri at io n w ith in th e si m ul at ed d om ai n w ith ou t c on du ct in g pi xe l b y pi xe l c om pa ri so n Sp at ia l p at ch in es s 1c da ily -m on th ly su rf ac e la ye r CC F Ca n as se ss s im ila ri ty in s pa tia l c oh er en ce o f T o r S Ed dy s tr uc tu re 3 ho ur ly -m on th ly w at er c ol um n V( XY ) Vi su al c om pa ri so n of th e em er ge nc e of c om pl ex e dd y st ru ct ur es a nd gy re fo rm at io n in T o r S fie ld s H ol t e t a l. (2 01 4) Te m pe ra tu re A lb ed o 2a ad h oc su rf ac e la ye r V( TS ), E, R M od el s si m ul at in g sp at io te m po ra l v ar ia bi lit y in a lb ed o ca n va lid at e ag ai ns t e st im at es c om pu te d vi a up w el lin g an d do w nw el lin g


Introduction
Models of catchments, lakes, wetlands, rivers, estuaries and marine systems are now in widespread use to simulate water quality responses to anthropogenic change and to unravel nutrient and pollutant pathways (Hipsey et al., 2015;Janssen et al., 2015). Perhaps more than any other field of environmental modelling, aquatic ecosystem modelling spans a large diversity of environments, scales and disciplines; ranging from small wetlands and lakes to the global ocean. Numerous authors have conducted reviews of the diversity of modelling approaches used to simulate lakes Janssen et al., 2015), wetlands (Coletti et al., 2017), rivers (Rode et al., 2010), and marine systems (e.g., Gentleman, 2002;Glibert et al., 2010;Rose et al., 2010;Steele et al., 2013;Robson, 2014b). Looking across the geographic diversity of (process-based) aquatic ecosystem models (AEMs), it is notable that whether the focus is freshwater or marine applications, two main thematic areas of commonly-used model approaches have dominated the literature: (i) coupled physical-biogeochemical models with high spatial resolution and a focus on the biophysical environment and lower trophic levels (typically nutrients, phytoplankton and zooplankton), and (ii) models that are lumped in space and focus on resolving high trophic complexity (see Mooij et al., 2010 for fresh water systems, Fulton, 2010, for marine systems; Fig. 1). This distinction reflects the different backgrounds and research questions being asked by aquatic ecosystem modellers, and also the trade-offs between conceptual complexity, spatial resolution, data requirements and computational demands.
But how good are these models? Over the past decade or so there has been a significant expansion in the scope and capability, however, several authors have argued that AEMs have failed to keep up with advances in scientific understanding (Flynn, 2005;Anderson and Mitra, 2010;Oliver et al., 2012;Hellweger, 2017), are not interdisciplinary Ward et al., 2019), fail to assess or reign in uncertainty (Arhonditsis et al., 2006(Arhonditsis et al., , 2008aDietzal and Reichert, 2012), and "fail to fail" when they should fail if they were true tests of conceptual understanding (Franks, 2009). Despite continuing advances in process understanding of aquatic biogeochemistry and ecology, and the emergence of a plethora of model approaches and platforms in the literature, it could be argued that the level of predictability in many practical applications of AEMs has not significantly improved over the past two decades .
These challenges motivate us to find new ways to assess our models so that we can transparently compare different models and model approaches, whether they are simple or complex, and understand the level of predictability they provide. Oreskes et al. (1994) argued the use of models is heuristic, which is consistent with a common view that decisions about defining when a model is "validated" are largely ad hoc. The chosen level of validation often depends on available pre-existing data sets, the background and experience of the individual modeller and a general desire to report favourably on the performance of the model. A common framework and established standards for model assessment and documentation would create opportunities for synthesis between diverse model studies and transferability of knowledge between applications. This would facilitate us to benchmark AEMs, that is, to compare which models are better under which circumstances, and what level of complexity tends to achieve a given level of accuracy and performance for a given application context.
Here, we review current approaches that can be used to assess the performance of AEMs, and formalize a general strategy to improve confidence in model predictions. The shortcomings of current model assessment and the need for a systematic approach are detailed further in Section 2. A framework is outlined in Section 3 for the hierarchical assessment of models to encourage modellers to assess not only state variable predictions, but also process behaviour and system-scale dynamics. A range of metrics and characteristic signatures relevant to aquatic ecosystem condition are exemplified for a range of physical, chemical and ecological contexts in Section 4, spanning the diversity of aquatic system models -from ponds and lakes to the global ocean. Our goal is to demonstrate the utility of hierarchical assessment to more comprehensively assess when a model is "fit for purpose". In doing so, our aim is to create standards, a common vocabulary, and encourage a more rigorous, multi-level approach to model assessment that will facilitate comparisons among different AEMs, their applications, and thereby increase their predictive value and usefulness.

Approaches to model assessment and need for a standard framework
In a review of model assessment frameworks, Pohjola et al. (2013) highlighted four kinds of model assessment: (i) quality assurance (best practise procedures); (ii) uncertainty analysis; (iii) technical assessment; and (iv) assessment of effectiveness in achieving social, environmental, or policy outcomes. In this paper, we focus on the technical assessment of model performance. We note here the extensive history of literature describing appropriate measures of fit for objectively evaluating model performance (Mayer and Butler, 1993;Power, 1993;Alewell and Manderscheid, 1998;Stow et al., 2003Stow et al., , 2009Bennett et al., 2013;de Mora et al., 2013;Kubicek et al., 2015), providing guidance on specific mathematical measures of model accuracy. In general, these include varied methods for error calculation, correlation and model efficiency measures (Table 1). We do not focus this analysis on the specific suitability of these measures, but rather on providing a framework within which they can be used. For a review of the strengths and weaknesses of different measures of model-data comparison see the summary by  Bennett et al. (2013) and previous discussions by others (Elliott et al., 2000;Allen et al., 2007).
To-date, no well-established guidelines have been developed on how to approach aquatic ecosystem simulation and how to decide when models are fit for purpose. The reason can be found in the large diversity of model approaches, which is compounded by the fact that AEMs have been applied over wide environmental (lake, river, ocean) and application contexts (e.g., forecasting, system understanding, scenario comparison etc.) (Janssen et al., 2015). For any given application context, typical approaches in the literature adopt a wide range of variables, spatial dimensionalities, spatial scales and simulation time-frames. Whilst the diversity of these applications is a good thing overall, the consequence has been that it is difficult to compare model performance between studies and approaches in a way that allows a clear definition of the limits of their predictions.
In the past decades there has been limited true improvement in the level of model predictability. Given the rapid uptake of AEMs (e.g. Trolle et al., 2012;Janssen et al., 2015), it would be expected that over time, higher quality datasets, more refined model process descriptions and increasing computer power would lead to improved predictions. This is of course true in some cases, however, there is evidence that in general terms model approaches and their ability to accurately capture trends in observed data have not considerably improved compared to some of the pioneering studies (e.g., Thomann and Fitzpatrick, 1982). This trend was first noted by Arhonditsis and Brett (2004) and subsequent analyses have pointed to a similar conclusion (e.g., Arhonditsis et al., 2006;Robson, 2014b;Paraska et al., 2014;Arhonditsis et al., 2014). During routine applications of models, we anecdotally hear that use of R 2 is a "waste of time" for coupled hydrodynamic-biogeochemical model variables, and that modellers often resort to "chi-by-eye". That is, the modeller analyses the model output in the context of the quality and noise in the data and uses a subjective visual comparison as the ultimate determinant of suitability. Kubicek et al. (2015) found that 92% of studies reported in the journal, Ecological Modelling, reported visual inspections as the only or main method of model evaluation. This approach is not always due to the lack of willingness by the modeller to validate more deeply, but rather due to confusion as to what features need to be tested within the specific application. The question therefore remains: are there ways in which we can further refine our efforts so that they may lead to more robust model predictions?
A large challenge remains to improve parameterisation of AEMs. For most applications, information exists about the values and variability of the state variables, but little is known about the values of the model parameters. Modellers then adjust parameters to find the best agreement between modelled and observed data, either by adjusting the model parameter vector by trial and error (manual calibration) or through optimisation algorithms. The advantage of optimisation methods is that they are objective and repeatable methodologies that are more likely to result in an optimal parameter set. Any significant lack of fit is then due to the inadequacy of the model structure and not due to poor parameter choice (Chapra and Canale, 2010). The conventional practice of seeking a single "optimal" parameter set reflects two major assumptions: (i) that Table 1 Summary of quantitative techniques used for assessment of aquatic ecosystem models. Refer to Bennett et al. (2013) for more detailed overview and categorisations of available assessment approaches.

Abbreviation
Technique Description

V Visual inspection Visual inspection of time-series, TS, is often undertaken ("chi-by-eye"), but weak relative to the quantitative metrics listed below.
Visual inspection of more complex model outputs may be warranted, and is frequently undertaken, for example:  Murphy and Epstein (1989); Bardsley (2013) presents a variant that better accounts for bias: Willmott (1981) introduces the above skill score to consider both correlation and variance, allowing a choice regarding how to weight error at extremes versus around the mean.

R, SR Correlation coefficient Spearman Rank Correlation
where τ is the difference in rank between the predicted and observed TD Taylor/Target Diagram Visualisations of patterns of statistics to assess and quantify model performance; see Taylor (2001) and Jolliff et al. (2009).

DF Distribution Functions Plots showing variance of data set, including box-plots, violin plots and cumulative distributions FFT, WT, WC Fast Fourier Transform Wavelet Transform Wavelet Coherence
Approaches to demonstrate spectral power localisation at distinct frequencies; Wavelet transform demonstrates this localisation as a function of time (or space); Wavelet coherency computes the correlation in wavelet power between variables.

CCF, ACF Cross-correlation Function Autocorrelation Function
Measure of the similarity of a series to itself (autocorrelation) or another series (cross-correlation) as a function of displacement in time.
there exists a single calibration vector for faithfully reproducing a wide range of ecosystem dynamics; and (ii) that our empirical knowledge (monitoring data, experimental work) adequately depicts the patterns of the "real world", and thus offers an objective standard for testing our models. The credibility of both statements has been extensively debated in the literature and there are sound arguments to cast doubt on the legitimacy of such a deterministic approach to mathematical modelling. A recent attempt to address knowledge gaps and improve practice in setting parameter values has been made by Robson et al. (2018), who provide a tool to facilitate modellers' exploration of evidence for process rates and traits relevant to biogeochemical model parameters. Nonetheless, model practitioners often encounter the problem that several distinct choices of model inputs result in equivalent model outputs, i.e. many sets of parameters fit the data equally well. The nonuniqueness of the model solutions, known as equifinality (Arhonditsis et al., 2008a), is a consequence of insufficient data or in the case when internal process pathways are of substantially higher order than what can be externally observed (Beck, 1987). As a result, our ability to set quantitative (or even qualitative) constraints on model ecological structure is significantly reduced, and thus we are often faced with a situation whereby our models give "good results for the wrong reasons" .
Aside from parameter uncertainty, models contain errors that arise from its structure or its inputs (Omlin and Reichert, 1999). Model structural error is associated with (i) errors in the selection of appropriate state variables or processes to reproduce ecosystem behaviour, (ii) errors and necessary simplifications in selection of mathematical formulations for describing the processes, and (iii) the fact that our models are based on equations derived from controlled laboratory environments that may not yield an accurate picture of the real world variability in biological systems and complicated interactions between forcing factors (Hellweger, 2017). Essentially, models are simplifications of reality, and all parameters are effectively applied as spatially and temporally averaged values that in reality are unlikely to be represented by fixed constants. In addition, it should be recognised that observational data are also uncertain approximations, and are in fact also models of reality.
An important and increasing area of model application is to capture shifts in system function. Shifts occur in response to a varied range of external drivers, such as climate change, the cumulative loading of nutrients and pollutants and/or management measures (e.g., Trolle et al., 2008;Skerratt et al., 2013). In this case, modelled ecosystems are non-stationary; they are being pushed out of their typical state-space range upon which they were trained and predictability in the past is no guarantee a model can capture future trajectories. From the broader ecological literature, we know that ecosystems are vulnerable to deterioration when key system functions are pushed over thresholds, resulting in the loss of resilience and the emergence of a regime shift (Scheffer et al., 2009). Often these dynamics are at the core of what we need models to help us understand, yet we have limited confidence that the models are capturing these shifts and non-linear ecosystem dynamics (Hipsey et al., 2015).
The lack of universally accepted performance criteria impedes our capacity to impartially determine what an acceptable model is. Thus, an emerging imperative in the field of aquatic systems modelling is the development of a predetermined standard, considering model complexity, the spatiotemporal domain or even the question being asked. Given the complexities highlighted above and the diversity of simulation contexts, this is unlikely to take the form of a simple cut-off value for an acceptable goodness-of-fit metric. In some cases, it is sufficient to be able to confidently predict that one management option will have a better outcome than another (e.g. shorter algal bloom duration), while in other cases, the decision will hinge on being able to predict how much better a certain scenario may be. Hence, we need to be able to evaluate not only how uncertain our prediction may be, but also what our models can confidently predict (e.g. that a bloom will occur) despite prediction uncertainty.
In light of these conceptual and technical challenges, there are several areas where the AEM community would benefit from improved tests and reporting of model performance. These include: � greater emphasis in model publications to highlight the assessment approach and the variables validated (or tested in sensitivity analysis); � adoption of assessment standards to facilitate inter-comparison of diverse model approaches; � improved assessment of process pathways in models as a means to help resolve concerns around equifinality, including assessment of spatio-temporal variability in process rates; � exploration of the degree to which different scales of variability are captured by models, considering not just state variables, but also flux pathways; � assessment of model performance in reproducing theoretically relevant, system-scale responsesthat is, even if models capture trends at a sampling point, they must also demonstrate ability to capture emergent behaviours; and � employing a wider range of validation approaches able to accommodate new monitoring technologies and high-frequency data streams to support model-data fusion efforts.
Addressing the criteria above will improve credibility and transparency in aquatic ecosystem modelling. For example, capturing emergent properties is particularly relevant when models are used to explore non-stationarity (systems undergoing change), where a model may be out of its calibrated range, or if the goal is to explore uncertain future conditions like climate change. We used the list to formulate the core principles of a comprehensive model assessment framework that can be used to understand, discuss and evaluate underlying principles in AEMs, and to broaden their applicability and transferability.

Overview of the multi-level assessment framework
The CSPS framework introduces a hierarchical assessment of a range of metrics and theoretically-relevant signatures relevant to aquatic ecosystem structure and function, depicted schematically in Fig. 2. The approach includes four levels of assessment, with an a-priori or preapplication assessment of the model (indicated as Level 0), while the other three levels are post-simulation assessments of the model. The four levels are summarised as: 0. Concept: Conceptual validation to ensure that sub-models are consistent with ecological theory and valid over the range of conditions for which the model will be applied; 1. State: Comparison of simulated state variables with observed properties; 2. Process: Comparison of simulated energy and mass fluxes with measured process rates; and 3. System: Comparison of system-scale emergent properties, patterns and relationships with observed and theorised phenomena.
These levels are further described generically below given our desire for consistency across both inland and marine waters. Specific examples for different application contexts and specialisations relevant to the modelling community are expanded upon in Section 4.

Level zero: Conceptual validation
A process for model assessment that includes conceptual validation of model structure and sub-model algorithms is a precursor to empirical validation of the model as a whole (Bert et al., 2014). At this level, we ask, "does the conceptual basis of the model accord with current scientific understanding of how the system functions?" This level of evaluation is usually undertaken explicitly when developing a new model,

Fig. 2.
Conceptual diagram illustrating the CSPS model and assessment workflow highlighting the shifting focus from assessment of state, to process, and ultimately to ecosystem function. but is often overlooked when applying an existing model in a new context or when it is adapted to include additional functionality. Questions to consider include: � Does the model structure adequately reflect our conceptual understanding of the system and its key drivers, having regard for processes that may change within the range of scenarios being considered? An example of a situation in which a previously successful model may fail this conceptual validation is when a model developed for a deep-sea application is to be applied to a coastal ecosystem, where a more detailed representation of benthic processes is needed. Another example is where a model previously applied within a narrow range of temperature conditions is to be applied to climate change scenarios and the conceptual basis of temperature response functions may need to be reconsidered. � Does the mathematical representation of ecosystem processes produce realistic system dynamics? For example, the widely-used Michaelis-Menten kinetics are best suited to steady-state conditions and may be considered dysfunctional in dynamic simulations where nutrient concentrations vary considerably (Flynn, 2005;Frassl et al., 2014;Hellweger, 2017). � Does the model structure reflect recent advances in ecosystem understanding that may be important in this application? For example, many ecosystem models in common use have not been updated to include anammox or dissimilatory nitrate reduction to ammonium (Robson, 2014a,b). � Does the model reflect the system understanding of relevant local disciplinary experts and stakeholders? If not, this may be a hurdle to acceptance of decisions based on model results, as well as not making good use of local knowledge. � Is the model mathematically valid and dimensionally consistent? � Is their evidence that the implementation of the model as software correctly reflects its conceptual and mathematical basis? For example, does it maintain conservation of mass? � Where two or more possible model structures have been identified, what process has been followed to compare the options? In some cases, it may be appropriate to develop an ensemble of models to test the range of possible results and the trade-offs in speed and accuracy.
This conceptual validation phase should be revisited after state, process and system validation to consider whether the results suggest the need for re-evaluation of the model structure or implementation.

Level one: State validation
The comparison of time-series of physical, chemical and biological state variables is the main form of model validation. However, it is not the only means by which the accuracy of model state can be assessed, and when used in isolation may not give a complete picture of model performance. In most cases, the frequency of observations is significantly less than the time-step of aquatic models, particularly in the case of coupled hydrodynamic-biogeochemical models. For variables that exhibit rapid changes in time, such as algal biomass during a bloom, standard metrics such as Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) can be fundamentally weak (Elliott et al., 2000). Similarly, in terms of a spatial distribution, an in situ observation is usually the mean of a small sample volume, being compared to model output that may represent the mean value of a much larger quantity of water. Various alternatives for assessing state can be considered, including assessment of scales of variability, and derived metrics, whereby observations and simulated data are subject to some form of transformation.

1a -Direct comparison
Classically, model simulation results are compared with measured data at specific points in time and space where the measurement was taken. This type of assessment is undisputedly where most effort in model validation has been concentrated and will no doubt remain the focus of most model assessments. The variables selected for assessment are inherently linked to the nature of the investigation and associated choice of model approach and structural complexity. Nonetheless, there has been a tendency for modellers to validate against an arguably small subset of simulated water chemistry variables, a product of both a desire to emphasise aspects of model output relevant to the application, but also commonly due to lack of observations for remaining state variables.
Measures of model fit (as in Table 1) can be computed for one or more sampling stations and are generally based on calculation of residuals (i.e., the difference between model outputs and the corresponding observations). A potentially useful exercise that adds to the common error calculation, is to calculate skewness or kurtosis in predictions. Spatial assessments of multi-dimensional models against data from remote sensing platforms are increasingly being introduced. Other methods do not necessarily involve error calculation but rather assess patterns in data series. For example, Spearman Rank correlation (SR) may be more useful to identify the degree to which the order of predictions and observations from small to large magnitude are captured, for example, where assessment of the ability of a model to reproduce seasonal and inter-annual variability is required without focusing on exact values. The use of a cross correlation function (CCF), can allow a modeller to look for correlations in how the simulated and observed data vary with a delay across time and may be used for looking at lags in timeseries, or alternatively may be useful in spatially resolved models to determine measures such as patch length.

1b -Derived metrics describing model state
This category refers to metrics that do not involve a direct assessment of state variable time-series or spatial data, but are derived from the simulated variables. The focus of this class of metrics is to test the model against theoretically relevant indicators of ecosystem state, such as relationships between variables. They can provide additional evidence for demonstrating that a model is fit for purpose even if direct value comparisons (e.g., R 2 ) are not possible due to data limitations or if direct validation indicated a weak level of predictability. Examples of derived metrics could include simple stoichiometric indicators (e.g., TN:TP), or other ratios of simulated variables (e.g., DOC:TOC; Chl-a:TSS). Other derived quantities include assessment of relevant dimensionless numbers, for example, the Richardson number as a measure of stratification intensity, or in the case of algal bloom dynamics, metrics derived from analysis of the raw state variable time-series, such as the average duration of a bloom or time of bloom onset.

1c -Metrics describing multi-scale variability in model state
Metrics at this level describe how well various scales of temporal or spatial variability are reproduced in our models. Depending on the model structure, its spatial dimension and time-step, any given simulation will have limits on the scales that it can predict. The inherent difficulty in defining what model approaches are most appropriate for a given scale (e.g., at what temporal scale would a 1D and 3D model converge?) remains a large challenge in the AEM community.
Simulations may be assessed from a probabilistic point of view, for example, through comparison of exceedance probabilities (CDFs) in order to ascertain whether a model is able to capture the proportion of time (or spatial domain) over which a certain concentration is experienced. However, assessment of probability distributions alone may not be suited under non-stationary conditions, and they do not inform us if expected modes of variability are adequately represented. For example, a coupled hydrodynamic-biogeochemical model might be designed to reproduce phytoplankton biomass in response to diurnal changes in productivity, but also over intermediate scales due to dynamic hydrological and meteorological conditions, over seasonal scales due to changes in temperature and nutrients, and potentially up to decadal scales if the fundamental drivers of phytoplankton biomass are shifting.
Approaches to assess the performance of models over these scales are described in Bennett et al. (2013) as "data transformation methods" and relevant techniques include Fourier transformation (FFT) or wavelet transforms (WT). To date there has been limited application of these methods for assessing AEMs, partly due to the fact that there are few cases of modelled applications that have observational data that span such a large range of time-scales (reviewed by Kara et al., 2012). The routine application of real-time sensors in aquatic systems over the past decade, however, is providing diverse datasets that span from time-scales of minutes to decades, and offer new opportunities to assess models through this approach (e.g., Hamilton et al., 2015). Where spatially rich observational data are available, these approaches can also be used to examine model performance across multiple spatial scales. Spatial maps can be quantitatively compared with observational maps, particularly from remote sensing observations, using a variety of methods (Stow et al. (2009).

Level two: Process validation
Process validation refers to assessment of model performance against the underlying rates of transformation that drive changes in model state variables (i.e., the arrows connecting "stocks" or "pools"); process validation is therefore specific to process-based models. Indeed, the most commonly cited advantage of process models is their ability to resolve the interaction of the different mechanisms that shape ecosystem state, yet rarely do we rigorously assess whether they are correctly captured. This is particularly relevant if we consider that most model applications adopt process parameterisations reported broadly in the literature, potentially from sites that may be inherently different, or from laboratory or mesocosm studies conducted under controlled conditions. Similarly, the associated parameter estimates for these algorithms may also be chosen from within large ranges reported from diverse model applications, or from laboratory assessments such as phytoplankton or sediment incubations. Therefore, it is not obvious that models with complex interactions accurately represent spatial and temporal variability in process pathways correctly, even if several model state variables are seemingly reproduced well at Level 1. Consequently, validating models with regard to the individual flux pathways that connect individual state variables is a way of reducing equifinality, helping modellers to get "good results for the right reasons". Comparing modelled with measured flux rates also provides a way to pinpoint sources of structural and conceptual error in the model.
In practice, this approach remains rare in aquatic ecosystem modelling, as measuring variability in process rates through time and/or space is resource-intensive and difficult, and in some cases may not yet be directly possible in situ. However, given that it has the potential to greatly improve confidence in the underlying function of models, we believe it should be actively promoted, and several examples are outlined in the following sections that may be more routinely adopted. These are classified next as either being from direct measurements or indirect rate estimates.

2a -Comparison with raw process measurements
When developing a model, it is necessary for us to distinguish between process measurements that are required to assist model parameterisation and parameter assignment, and process measurements that can be used for validation. A range of in situ process measurements are relevant to assess physical, chemical and biological model attributes and their temporal and spatial variation. These include rates of mixing, fluxes across the air-water or sediment-water interfaces, and kinetic transformations (e.g. nutrient uptake, rates of primary production, or grazing). Specific examples highlighted in Section 4 relevant to a range of different model applications are reviewed.

2b -Process metrics interpreted from raw data
Process information can also be extracted from raw data series, either derived from changes in the observed data record via inverse modelling, or potentially through more sophisticated data-driven models designed to estimate bulk process rates. As a simple example, the trend of oxygen depletion in the hypolimnion of a lake may be used to estimate the net sediment oxygen demand if other sinks are minor. The use of environmental tracers and isotopic data also has potential to support validation of the flow of oxygen, carbon or nitrogen, for example, if assumptions are made about the relative fractionation and transformation rates that occur for individual process pathways. These approaches are not reported widely in the literature, and usually depend on the validity of several simplifying assumptions that are made when interpreting the data. However, it is highlighted here as an area of increasing interest for the growing area of model-data fusion, as a means to compare modelled process rates against those estimated from empirical means (Robson, 2014a;Hipsey et al., 2015).

Level three: System validation
The complex non-linear interactions and feedbacks that govern the response of aquatic systems to changes in internal or external conditions can lead to ecosystem-scale emergent patterns, relationships and dynamics. These emergent properties are not necessarily predictable directly from the underlying model formulation, and are outcomes from the model that are "not a direct extrapolation of the choices made in model design" (Allen, 2010). A classic example of an emergent property is the behaviour of a flock of birds in flight, which emerges in a way that is not obvious from the behaviours of individual birds.
Although not widespread, there are several examples where aquatic models have been assessed in terms of their ability to produce emergent dynamics, though generally not with the direct purpose to refine model accuracy or to justify whether it is fit for purpose. These provide valuable insights to ecosystem behaviour, and we advocate for more effort to be placed in assessing if our models are able to capture higher order behaviour or patterns. As highlighted by , a different choice of model structure may lead to different emergent dynamics. Where two models perform equally well at Level 1 but predict different emergent system dynamics, these can be treated as competing hypotheses regarding system dynamics, and measurement programmes can be devised to invalidate one or both hypotheses.
Examples of system properties that might be captured by models can include simple metrics such as scaling relationships (e.g. nutrient loading vs. chlorophyll-a response), or more complex spatial or temporal patterns in nutrient cycles and community dynamics. In many applications, these patterns may be expected based on empirical experience, such as the succession of different plankton functional groups, or spatial niches in a habitat. Multivariate comparison methods are available to explore performance of models in capturing inter-relationships between variables (e.g., a Taylor Diagram, TD), and these may be particularly useful for identifying cases where models resolve emergent dynamics that are not known a priori. Self Organising Mapping (SOM) is one example, among others, of a machine learning method that may be suited to identification of emergent patterns in both model output and observational data that are rich in two or more dimensions (e.g., Williams et al., 2014). Where observational data are both spatially and temporally rich, Empirical Orthogonal Function (EOF) decomposition can be used to analyse major modes of variance and patterns of variation, which can be compared with the results of the same analysis applied to model output (e.g. Rocha et al., 2019).
Another area being explored is the response of ecosystem state-space to perturbations, considering threshold effects, hysteresis and alternative stable states. This level of validation is especially important if the model's purpose is to define the stability or resilience of key ecosystem attributes to climate change, fishing and/or eutrophication. At this stage only a qualitative or semi-quantitative comparison may be possible.

What metrics should I use when … ? Examples for different application contexts
In this section, we consider the literature through the lens of the framework outlined in Section 3, by combining the framework proposed with specific assessment techniques summarised in Table 1. Given the predominance of Level 1 validation in the literature, we did not target a comprehensive identification of all published model studies that consider validation. Rather, we aimed to collate a range of examples that have been applied across a broad range of application contexts. Through this review, we also aimed to identify gaps and potential areas for further development of new Level 2 and 3 validation approaches.

Hydrodynamic applications for ecosystem assessment
Relative to water quality and ecosystem modelling applications, approaches for characterizing the performance of physical models of aquatic environments are well established, most notably from the engineering and geophysical sciences literature. They share some common metrics across lacustrine, riverine and oceanographic model applications, depending on the underlying hydrology, model dimension and time-frame of the simulation. A range of general metrics have been categorised related to prediction of a) the water balance, waves and water circulation, b) the heat and salt balance, c) stratification, and d) bottom morphometry and sediment transport (Table 2).
In general, the validation of hydrodynamic models is achieved by conducting multiple levels of assessment. First, researchers can conduct time-series assessment of water level, temperature and salinity at fixed points, and/or horizontal or vertical variation derived from profile cross sections (Level 1). Simulation of surface ice dynamics can also be undertaken by time-series assessments of ice thickness, complemented with derived metrics such as ice-on and ice-off dates (Level 1, e.g., Yao et al., 2014). For models that simulate surface or internal waves, spectral plots are commonly reported to demonstrate power across waves of different frequency. Derived indices such as the Richardson number as a measure of stratification intensity, or Schmidt stability (e.g., Bruce et al., 2018), are also useful as Level 1 metrics. Considering mixed layer depth (Acreman and Jeffery, 2007;Bayer et al., 2013), or thermocline/pycnocline thickness and changes to surface layer thickness as a function of various driving factors, can prove useful in diagnosing problems with model parameterisation. For cases where periodicity varies over time, wavelet plots can highlight how power is localised in frequency space over different seasons or time-frames. Isotherm/isopycnal displacement power spectra are a useful metric to demonstrate the time-scale over which models are able to reproduce the internal wave field (e.g. Hodges et al., 2000).
Level 2 validation of hydrodynamic models can include assessment of evaporative mass fluxes, estimation of albedo, measurement of velocities (at a fixed location or from drifter tracks, e.g., Dissanayake et al., 2019), and shear stresses, for example, at the sediment-water interface or within macrophyte beds. The use of direct measurements or observation-based estimates of turbulent mixing could also be considered, and a range of novel tracers have been adopted for dilution experiments to characterise contaminant dispersion (e.g., caffeine and pharmaceuticals in waters impacted by wastewater effluent, Cantwell et al., 2016).
Level 3 assessments in hydrodynamic models consider the formation of residual currents and eddy structures. For example, different, but otherwise similar, models reveal the emergence of different eddy structures in the North Atlantic Ocean (Holt et al., 2014). Hetland and DiMarco (2012) undertook an assessment of a 3D hydrodynamic model of the Texas-Louisiana continental shelf using data from moorings by presenting maps of model skill; in this example, surface and bottom variance ellipses are used to demonstrate the model captures the point-scale and residual field. In lakes and coastal environments, currents created by differential heating and cooling may be assessed by indirectly comparing against profile cross sections (Woodward et al., 2017). Spatial variability in water currents creates patterns of water age distributions that emerge as a system-scale property, but which are not easily able to be validated. The potential for using methods such as assessing against empirical estimates from conservative tracers, for example from radium isotope measurements (Tomasky-Holmes et al., 2013), may be able to be applied in the future. The increasing applications of models for complex aquatic domains (e.g. estuaries, inter-tidal wetlands, river floodplains, reef structures, or island archipelagos) requires efforts to validate the relative pattern of connectivity across modelled sub-domains.
At smaller spatial scales, Level 3 assessment of other patterns that may emerge in physical models can be conducted. Examples include travelling waves in spatial patterns of plant-wrack in intertidal zones (Sun et al., 2010), wave attenuation in seagrass beds (Chen et al., 2007), and the canopy structure such as the dynamics of canopy deflection properties (Dijkstra and Uittenbogaard, 2010). Models with morphodynamic ability can also be assessed by examining spatial patterns in temporal changes in bottom morphometry, to highlight active areas of erosion and deposition.

Water quality and biogeochemistry
A common goal for AEMs is to understand the controls and dynamics of chemical and biological variables relevant to water quality. What constitutes a 'water quality variable' can vary depending on the application and context, however, for the purpose of this analysis, the literature is categorised according to several areas of focus pertaining to prediction of a) oxygen and the extent of hypoxia/anoxia, b) the cycling of inorganic nutrients and organic matter, c) geochemistry, d) water colour and clarity e) chlorophyll-a, and f) other chemical and biological contaminant dynamics (Table 3). Across these categories most variables being assessed are dissolved or particulate concentrations that are routinely sampled via traditional monitoring programs and subsequent time-series assessments (Level 1), though, a more detailed exploration reveals a broader range of examples that can support a diversity of potential approaches for capturing water column and sediment biogeochemistry.
Oxygen has often been a focus of water quality models as it plays a pivotal role in nutrient cycling and sediment processes. Given the increasing availability of sensor data for measuring oxygen concentrations, it also provides an interesting case study for how a system of metrics can assist in model assessment. For this context, a range of more rigorous Level 1 metrics has emerged such as wavelet analysis of highfrequency in situ data series from a stratified lake (Kara et al., 2012), longitudinal analysis of surface and bottom oxygen in estuaries where strong lateral gradients exist (Xu and Hood, 2006), and the spatiotemporal extent of anoxia in a riverine estuary  and Lake Erie (Bocaniov et al., 2016). Direct in situ Level 2 sediment flux measurements from benthic chambers or eddy-correlation instruments have provided useful validation of sediment oxygen demand (e.g., Sohma et al., 2008). In another example, Hetland and DiMarco (2008) adopted apparent oxygen utilisation (AOU), the difference between the oxygen concentration and the saturation value, as an indirect process validation metric to demonstrate the model was capturing the combination of oxygen consumption mechanisms. Of increasing interest in lacustrine and marine environments is the application of high-frequency oxygen sensors to estimate free water metabolism (Hanson et al., 2008), where an inverse modelling technique is used to extract hourly to daily estimates of primary productivity, community respiration and atmospheric exchange from diel changes in oxygen concentration (e.g. Lovato et al., 2013;Webster et al., 2005;Wikner et al., 2013;Winslow et al., 2016). While this method provides relatively coarse estimates due to confounding factors of advection and mixing (Villamizar et al., 2014), repeated estimates over a range of environmental conditions provide an in situ view of water column net productivity and respiration that can be Table 2 Summary of validation metrics for physical models of aquatic systems (refer to    Table 1) a Comments (e.g. processing requirements) Example references b Variation in particle size distribution Spatial differences in particle size composition of bottom sediment can be used to validate areas of differential deposition rates between particle size classes. Where possible references provided indicate an example of the metric being applied, otherwise references relevant to give context to use of the metric are listed.
c Water clarity and light metrics are covered in Table 3.

Table 3
Summary of validation metrics relevant to models simulating water quality and biogeochemistry of aquatic systems (refer to   Nutrients & organic matter       A similar approach for estimating basin-scale average sediment oxygen demand has been shown to be useful in stratified lakes, allowing quantification of the rates of hypolimnetic oxygen drawdown where other consumption mechanisms may be assumed to be relatively minor (Snortheim et al., 2017). Following these examples, Fig. 3 illustrates the utility of combining relevant metrics across the assessment levels for an estuary experiencing frequent hypoxia.

Concentrations of
Many aquatic modelling studies have had their roots in predicting the impacts of eutrophication and have demonstrated the Level 1 performance of their models against data on dissolved and total nutrients, with a focus on P in freshwaters and N in marine waters. An increasing trend towards simulating the complete N, P, Si and C cycles has provided opportunity to validate models using other Level 1 indicators that capture more nuanced aspects of nutrient cycling, such as partitioning between organic and inorganic phases and stoichiometric variability (Li et al., 2013). Other potential Level 1 metrics relate to organic matter composition, such as POC:DOC, OC:ON, or potentially the labile:refractory ratio of the simulated organic pool. To date this has rarely been the subject of model assessment, but it may be possible by comparing with increasingly reported data from Excitation-Emission Mass Spectroscopy (EEMS) studies.
Relevant Level 2 process validation efforts can be applied in the form of comparison against in situ estimates of nitrification/denitrification, organic matter mineralisation, and community respiration or BOD data. In estuarine environments, dilution curves have been applied to estimate sources and sinks of materials by comparing concentrations relative to salinity. For example, a sink of NO 3 along the length of a system can be used as an indirect measure of denitrification intensity (e.g. Eyre and Balls, 1999). Fig. 4 demonstrates this approach for an estuarine carbon cycle investigation, showing validation of the along-stream predictions of DOC, DIC and 13 C-DIC and 13 C-DOC, relative to a conservative tracer. In this case, the use of isotopes in the calibration helped to reduce equifinality, since models able to correctly capture patterns in stable isotope cycling are more likely to be resolving flux pathways that are imprinting distinct signatures during isotope fractionation processes (Sugimoto et al., 2010;van Engeland et al., 2012;Adiyanti et al., 2016). Where sediment-water interaction is an important driver of nutrient cycling, the flux of dissolved constituents as estimated from in situ benthic chambers or from eddy correlation can be used as a powerful Level 2 approach to reduce uncertainty. Two examples comparing modelled against measured NH 4 release are applications in Chesapeake Bay (Brady et al., 2013) and Tokyo Bay (Sohma et al., 2008).
The net rates of carbon and/or organic matter sedimentation are important drivers of water and sediment condition, and are important fluxes to test models against, since they can vary substantially through time and between sites. In oceanic systems, the rate of organic carbon export shows a logarithmic decline with depth, and can be plotted as a Martin curve (Martin et al., 1987), which was tested as a validation metric of the European Regional Seas Ecosystem Model (ERSEM) (Butensch€ on et al., 2012). Studies employing depth-resolved sediment models themselves require a significant validation effort (e.g. Paraska et al., 2014), depending on availability of pore-water and solid phase concentrations, and can benefit from Level 2 validation metrics, such as denitrification efficiency, oxygen penetration depth and oxygen exposure time, which are known to be important determinants of carbon cycling and burial. Capturing vertical gradients in oxygen and other constituents in sediment, for example using in situ microprofile data, may mean models implicitly capture these variables, but further assessment against in situ flux rate determinations (e.g., denitrification) can help test sediment model function. The significance of bioturbation was indirectly validated as an important process by Zhu et al. (2016), by ensuring the DO vs. PO 4 flux rate matched in situ observations. Additional Level 2 metrics for assessing sediment biogeochemical predictions may include summaries of the O 2 :CO 2 sediment-water flux ratio, reflecting the models ability to correctly capture the balance of aerobic and anaerobic respiration that is occurring.
Particularly for spatially resolved models, ensuring models capture empirically established scaling relationships between oxygen exposure time and carbon burial efficiency or organic matter reactivity and age (as depicted by the Middleburg curve) may prove to be a particularly useful test. However, to date this has yet to be reported (Paraska et al., 2014).
Simulating other aspects of aquatic geochemistry is increasingly being undertaken to capture acidity and risks associated with heavy metals. Examples include model applications in an acidic environment such as mining impacted landscapes (Salmon et al., 2017) or coastal sites impacted by acid sulfate soils . Other applications include reservoir management whereby seasonal anoxia and the accumulation of metals in the hypolimnion requires models to capture the redox sensitivity of Mn and Fe.
Despite its importance in driving productivity and shaping biogeochemistry, the light climate has not always featured during model validation. Light profile data and light quality (specific bandwidth attenuation), including the relative shift in extinction coefficient in response to suspended sediment concentrations, can be a useful exercise. When included with Chl-a model predictions, it is useful to show that predictions of Chl-a on average scale correctly with the suspended solids concentration (Chao et al., 2007(Chao et al., , 2010, which may be considered a system-level property. The light climate is also influenced by dissolved substances, though even where dissolved organic matter (DOM) is simulated, few models separate coloured dissolved organic matter (CDOM) from other DOM or specifically validate the contribution of CDOM contribution to light attenuation. Validation of wavelength-specific light attenuation profiles concurrently with CDOM and suspended sediment concentrations offers the potential for us to better resolve the light climate. This is likely to become more important as models aim to resolve ultra-violet (UV), photosynthetically active (PAR) and near infra-red (NIR), due to their different effects on water thermal structure and also organism growth and mortality. Recently, the validation of a coastal ocean model against true colour from satellite imagery also demonstrated the utility of capturing the specific contributors of light reflectance at multiple wavelengths to capture the overall light climate (Baird et al., 2016a).
Total chlorophyll a (Chl-a) is the most common biological variable simulated in aquatic models ranging from small ponds to the global ocean. Elliott et al. (2000) identified the relative merits of a range of error calculation methods for algae time-series. Many of these metrics suffered when the magnitude of the data values is large and they are unforgiving of temporal misalignments between modelled and observed data. For example, if a simulated bloom is of the correct magnitude but one week earlier/later than the observed bloom, is the model still fit for purpose? Further Level 1 comparisons, such as bloom magnitude (e.g. for the spring and/or summer) may be compared separately from the bloom's timing allowing discrepancy in the latter to be isolated. Thus, if capturing bloom size over several years was of more value than predicting its exact timing, a typical error metric for assessing bloom size alone would be of greater use to the study. When assessing modelled Chl-a time-series data, traditional Level 1 assessment metrics can be supplemented by derived metrics such as bloom peak magnitude, duration and time of onset. Another example is the application of wavelet analysis of a high-frequency Chl-a time-series (Kara et al., 2012) to test model performance at predicting scales of variability from days to seasons (Fig. 5). This approach can give an improved view over simple time-series comparisons about the scales of variability that a model can capably reproduce. Spatial comparisons of Chl-a from remote sensing are becoming more common, particularly in coastal and marine models, and wavelet-based comparison may be an effective way to evaluate model Chl-a against satellite data (Saux-Picart et al., 2012).
At Level 2, rates of algal productivityi.e., carbon fixationdetermined directly from in situ experiments using isotopically labelled carbon, or estimates from instruments measuring photosynthetic Fig. 3. An example of multiple assessments of a 3D estuary model of the Swan River Estuary demonstrating performance of oxygen and hypoxia/anoxia prediction; the assessment level is indicated in the top-left corner of each panel. In this example, the level of predictability based on time-series analysis alone is modest (MAE ¼ 36 mmol m À 3 , R 2 ¼ 0.64), but assessment of the temporal and spatial variability in predictions, comparison with available sediment flux rates, and assessment of system-scale anoxia extent together allow a more complete picture of model suitability, leading to a more positive conclusion about the overall model behaviour and its suitability for scenario assessment. Refer to Huang et al. (2018) for model and data details. activity based on fluorescence, can provide an indication of gross primary production (GPP) that can be compared with modelled rates of photosynthesis; though we found this was surprisingly absent from the reviewed literature. For Level 3, deep Chl-a maxima (DCMs) are an emergent feature of complex model dynamics that manifest in response to physical, chemical and biological interactions within stratified systems and can focus model validation (e.g., Carraro et al., 2012;Ayata et al., 2013). Where model application spans a wide range of trophic conditions, general scaling relationships relating system average Chl-a concentration to the degree of external loading may be useful. For example, relatively simple assessments of models of coastal domains can compare performance against the Monbet relationship, or in the case of freshwater lakes, against the Vollenweider model (Vollenweider and Kerekes, 1982;Reckhow and Chapra, 1983). These comparisons may not be relevant for short-term simulations (e.g., of an individual phytoplankton bloom), but may be useful to ensure that complex water quality models are able to meet expectations of cross-system empirical data (see also Chang et al., 2019).
The simulation of biological and chemical contaminants has also been identified as an important area of aquatic ecosystem simulation for inland and coastal waters, particularly for the purposes of informing public health risk assessments. Models of microbial pollutants have been coupled with hydrodynamic models to simulate pathogens and faecal indicator organisms and assess risk in waters used for drinking and recreation (e.g. Hipsey et al., 2008;Sokolova et al., 2013). These studies have tended to validate models by assessing organism counts from available monitoring data (Hipsey et al., 2004;Sokolova et al., 2013). However, this approach is complicated since viable but non-culturable cells (VBNC) make determining viable and inactivated fractions of organism populations uncertain. The increasing adoption of molecular approaches for organism enumeration, in conjunction with adaptive agent-based models has been applied to resolve strains of different pathogenicity (Bucci et al., 2012). Prioritisation of validation at key The thick black contour line shows the 5% significance level of the power spectrum in comparison to a null red noise spectrum (wavelet methods described by Carey et al., 2016). The simulated data were output from a General Lake Model-Aquatic Eco-Dynamics (GLM-AED) model calibrated for Lake Mendota (see Snortheim et al., 2017). Given the differences in the temporal resolution of the simulated and observed data, the wavelet transform provides a useful approach for comparing the two datasets' dominant temporal scales of variability (Level 1c). As evident from the dark red colour at the 12-month frequency, the annual scale emerges as the most important scale of variability throughout the time series for both the simulated and observed data. (For interpretation of the references to colour in this figure legend, the reader is referred to the Web version of this article.) exposure points (e.g., drinking water offtakes or beaches) using exceedance probability plots can also be used to demonstrate a model is suited to risk assessments. Simulating organic chemical contamination is still scarce, likely because of the limited data available for model validation and process identification. A recent study used data derived from sediment cores to validate a 60-year model simulation of polycyclic aromatic hydrocarbons (PAHs) in a large shallow lake (Kong et al., 2017), though further work in this area is warranted.

Community dynamics and ecosystem function
Aquatic ecologists have a long history of developing metrics to describe the structure and function of aquatic populations and communities. Models that aim to resolve these more complex ecological dynamics build on those listed in the previous sub-section, but increasingly are including a wider range of theoretically relevant indicators of ecosystem interactions and function. For these studies we broadly categorised the focus of various AEM applications, and identified 6 broad areas of a) microbial lower trophic level communities, b) benthic populations and habitats, c) pelagic populations and habitats, d) community structure, e) community function, and e) ecosystem response to disturbance (Table 4). Within each category we initially considered Level 1 metrics describing temporal variability of species and their interactions, and the development of spatial niches. We then focused more specifically on literature where models have considered measures of trophic structure and complexity, inter-relationships between simulated ecological groups, and system level responses to ecosystem perturbation. Within this context, the diversity of metrics did appear to be less consistent and differ more notably across the freshwater and marine community.
In modelling the phytoplankton, zooplankton, and bacterial dynamics, the challenges of capturing changes at the level of species or even genus are considerable (e.g. see Lignell et al., 2013;Li et al., 2014;Andersen et al., 2015). Of the published AEMs that simulate Chl-a, only a few aim to simulate succession at the species or genus level (e.g. Elliott et al., 2006;Mieleitner and Reichert, 2008;Gal et al., 2009), with most designed to simulate at the level of taxonomical (e.g. Elliott et al., 2005;Trolle et al., 2011;Chan et al., 2002) or functional groups (PFT's; e.g. Elliott et al., 2000;Segura et al., 2012;Quere et al., 2005;Baird et al., 2016a). For these applications, specific use of even Level 1 validation metrics relevant to species/genus/group partitioning and their spatiotemporal dynamics has been limited, with a reliance on validation against Chl-a observations (discussed above). A reason for this discrepancy is thought to be simply a lack of corresponding observational data (where phytoplankton taxonomic data are available at all, they are usually at a much lower frequency and/or spatial resolution than total chlorophyll estimates) and/or difficulty in converting species counts into the unit used by the model (e.g. biovolume, intracellular carbon, nitrogen or Chl-a concentrations). If the observed species data can be converted into comparable units, the metrics used for total chlorophyll validation can also suffice at the phytoplankton community level (e.g. Elliott et al., 2000).
Planktonic community structure, which may be a Level 1 or Level 3 metric depending on the model structure, can also be examined in terms of relative contributions of various species to the total community biomass. Gal et al. (2009), for example, validated the relative contribution of cyanobacterial species to the total phytoplankton biomass as a function of nutrient loading into the lake. This comparison provided a multi-tier assessment of the model as a good fit that indicated not only successful simulation of phytoplankton biomass and the relative contribution (and hence succession) of the various species, but also an accurate reproduction of the interactions between forcing conditions (nutrient loading) and the food web. The latter metric ensures not only the accurate simulation of phytoplankton in the model but also the assembly of dynamics and processes occurring in the ecosystem.
At larger oceanic scales, Holt et al. (2014) similarly used the emergence of correct biomass partitioning between functional groups to support validation by plotting diatoms as a fraction of total Chl-a, and comparing with the observed scaling between the diatom fraction and total Chl-a as estimated from satellite observations. The model data were also shown as a two-dimensional density histogram, as a running average and least squares regression fit to the continuum function used in Brewin et al. (2012). At the global scale, de Mora et al. (2016) used the known scaling relationships between abundance of phytoplankton functional types and total chlorophyll concentrations to demonstrate the community structure was correctly reproduced (Fig. 6). They further supported model validation with tests of N, P, Si and Fe stoichiometric ratios, and carbon:Chl-a partitioning to demonstrate the correct emergence of expected patterns from the simulated nutrient flux pathways. Further assessment can be undertaken to characterise model performance capturing ecological species succession (Level 3). True ecological succession, whereby the presence of one functional group creates the conditions required for the emergence of the next group (e.g. Hearn and Robson, 2000), should not be confused in this context with the pattern of replacement of one functional group by another due to unrelated changes in environmental conditions, sometimes also referred to as succession (e.g. Chan et al., 2002). The former can be considered a genuine emergent property of the system, while the latter allows only a more detailed state validation of different simulated size or functional groups. Graphical examination of simulated phytoplankton succession based on visual comparison of a time series has been used in a number of studies (see Rigosi et al., 2010, for a partial list). The use of quantitative metrics, however, has been limited and when applied, it typically focuses on goodness of fit and correlation metrics. Successful modelling of phytoplankton succession requires accurate simulation of numerous processes and food-web interactions, which even a tool like a Taylor Diagram may not be able to confidently describe.
The use of tracers to evaluate food-web dynamics is another efficient Level 3 option for assessing a model's ability to capture the food-web dynamics, and in particular the trophic interactions. The use of stable isotopes has become increasingly popular in empirical food-web studies, but also more recently as a means for assessing models. The signature of stable isotopes in organisms integrates over time and provides validation of the mapping of the predator-prey interactions resulting from prey preference and availability in the system. Thus, the successful match between observed stable isotopes and model trophic levels improves confidence in the strength of the simulated trophic pathways occurring within the food web (e.g., Nilsen et al., 2008;Dame and Christian, 2008). Adopting a similar approach, Carrer et al. (2000) used bioaccumulation of dioxins in the food web to evaluate model performance of trophic linkages.
A further high-level series of indicators is based on food quality, both in terms of stoichiometry and fatty acid concentrations. The use of dynamic intracellular stoichiometric ratios in a model allows the user to evaluate model performance not only in the form of changes in the stoichiometric ratio of a certain species over time but also the change in the ratios of organisms of higher trophic levels affected by the lower level organisms. Gaedke et al. (2002), for example, examined C:P ratios in egested vs ingested food in the model as part of mass-balance models of the food-web. Li et al. (2013) compared a probability density function of five phytoplankton group internal N:P ratios with sporadic observations of minimum, maximum and mean internal nutrient concentrations. Thingstad et al. (2007) used model validation and the mismatch between simulated and observed bacterial production in a mesocosm to identify the weakness in model simulation of varying stoichiometric ratios. Mitra and Flynn (2006) incorporated stoichiometry into a multi-species predator-prey model with varying elemental composition and selectivity. Their model successfully simulated the switch in predator diet from predation on the prey to cannibalism when the prey suffered from nutrient limitation and decreased in food quality. Mulder and Bowden (2007) examined whether zooplankton can alter their internal stoichiometry under nutrient poor conditions and whether Table 4 Summary of validation metrics relevant to models simulating ecological function of aquatic systems (refer to  Table 1) a Comments (e.g., data collection and processing requirements)

Example references b
Primary producers and lower trophic levels (bacteria, flagellates, ciliates, phytoplankton) Species or functional group biomass (including bacteria, micro zooplankton and phytoplankton) Time-series comparison   Table 1) a Comments (e.g., data collection and processing requirements) Primary production and respiration PP: Refer to Table 3 Chl-a metrics CR: Refer to Table 4 Community Function metrics

Benthic populations and habitats (macrophytes, macroalgae, coral, invertebrates)
Benthic population biomass/ density   Table 1) a Comments (e.g., data collection and processing requirements) in the water column can demonstrate behaviours such as diel migration for feeding, predator avoidance or reaction to light and/or temperature Gal et al. (2004) Population structuring (population cohort) Size range and distribution 1c ad hoc   include biomasses, biomass ratios, vital rates, vital rate ratios, total production, and total removals (and slopes thereof) across the taxa and trophic levels" Link (  Most biodiversity and diversity indices are only calculable for complex multispecies models. In such cases, rather than directly comparing index values, relationships between indices and environmental variables can be compared, e.g. over multiple sites Washington (1984) Simpson and Norris (2000) Kempton's species diversity    Table 1) a Comments (e.g., data collection and processing requirements) Example references b Response of biodiversity/ richness indices to hydrodynamics and nutrient To validate their models, they compared growth rates under varying nutrient use efficiency across reported food quantity/quality gradients. Simulating growth rates with varying production efficiency under varying food quantity and quality is only possible when correct simulation of the various processes that link food quantity, quality and growth rates is achieved. Additional unique markers such as fatty acid concentrations have been used for high level model assessment. Perhar et al. (2012) used biochemical control on zooplankton growth by N, P and Highly Unsaturated Fatty Acids (HUFA) as a means for testing accuracy of model output. As with stable isotopes and stoichiometry, correct simulation of HUFA concentrations require accurate model description of multiple processes within the food-web.
As model structures and the number of species interactions becomes more complex, the validation challenge becomes considerably more difficult. Sauterey et al. (2015) simulated planktonic diversity (as indicated by the Shannon index, which is a measure of the distribution of biomass among species) and evolving cell sizes of dominant plankton species in a global ocean model. They used the Canberra distance to compare the distance between cell-volume distributions of different model outcomes. Although Sauterey et al. did not directly compare their model results with real-world Shannon and Canberra indices, their work outlined the possibility for how the emergent feature of planktonic community diversity could be used as an additional performance metric for ecological models. Goebel et al. (2010) were able to demonstrate how the expected patterns of plankton distribution in a coastal environment emerged in the model from the interaction of a highly diverse population.
Much of the above discussion about pelagic plankton communities similarly applies to benthic communities, bearing in mind that generally their position is fixed for most of their life history. For individual macrophyte, macroalgal, or invertebrate species simulations, biomass or organism density in coastal and lake models can be assessed across space and time to ensure habitats are accurately represented (Savina and M� enesguen, 2008). In situ rates of detritus production (assessed by traps in flowing environments) and rates of decay of benthic plant detritus (assessed with incubation chambers) can be used as a Level 2 metric. Benthic plant productivity can be assessed by measuring changes in plant biomass when grazers are experimentally excluded, and grazing rates can be assessed through food-web studies supported by stable isotope measurements. Similarly, filtration and clearance rates of filter feeders can be used for species such as mussels. The 3D model study by Bocaniov et al. (2014) simulated zebra and quagga mussels and whilst there was limited in situ data for direct mussel filtration rate validation, the effect of the mussels on improving the Chl-a prediction in the overlying water column was used as a proxy indicator to help justify the filtration rate predictions. In a study by Renton et al. (2011), the development of a restored seagrass bed was simulated using a functional-structural plant model. Results were assessed against total rhizome length, length of the longest rhizome axis and total number of live buds (apices/axes) and internodes, based on a snapshot of data taken two years after the restoration began. At the community level, benthic plant succession can be assessed. Often it is the variability in benthic communities and biomass along a gradient of light/depth and their relationship with patterns of benthic substrates that is important. Li et al. (2010), tested a multi-agent systems model of two macrophyte species in Lake Veluwe and illustrated performance using a map of occurrence indicating regions of model over and under-prediction.
Evaluating model simulations of populations of fish and 'higher' biotic populations is somewhat more complex than variables described above due to lower sampling resolution, species mobility and different indices used to characterise fish populations. Unless regular stock assessment data for fisheries (e.g., Savina et al., 2013) or other organism counts are available, it is difficult to assess model performance using traditional Level 1 indicators (Lehuta et al., 2013), and population models should be assessed using an array of alternative measures. Models of populations often adopt individual-based approaches and the spatial context of predictions needs to be considered in light of the sampling regime used to collect observations. For fish, observed data are often based on catch data. Metrics such as catch per unit effort must therefore be translated to match simulation variables or used to qualitatively assess model spatio-temporal patterns of fish abundance (e.g., Holt et al., 2014;Savina et al., 2013). Higher level indicators often used to characterise fish populations include length-weight (L-W) relationships, length/weight at age and size distribution histograms (Makler--Pick et al., 2011;Megrey et al., 2007;Rose et al., 2007). All three indicators emerge dependent on a large number of individual and population-specific interactions with environmental conditions, thus accurate predictions serve as an indication of model robustness. However, the predictions usually also integrate over large spatial and/or temporal scales, and thus are not truly assessing population response at finer scales. Breckling et al. (2005) and H€ olker and  discuss Level 3 emergent properties relevant to fish population modelling: self-sorting age groups, trophic bottlenecks, size-dependent winter mortality, spatial organisation measures, and the influence of lake morphology on phenotype. Assessing changing population size distributions in response to fishing pressure (e.g. Makler-Pick et al., 2011), can also be a means to qualitatively assess ecosystem response to external forcing.
Assessment of food webs with a high degree of variable interaction may take the form of a multivariate assessment tool (e.g., TD), or more ad hoc tests of theoretically relevant patterns, trends and relationships. An example of the latter is the approach by Fulton et al. (2004), where a range of semi-quantitative tests and qualitative comparisons of population and ecosystem level metrics for coastal embayment models were used. Predictions of the trophic structure were put in context of the Sheldon spectra, to demonstrate mass partitioning between trophic levels was appropriate. Sailley et al. (2013) also compared trophic efficiency metrics that could be used to assess how complex food webs emerge, including comparison of bulk community heterotroph to autotroph ratios, and variation of zooplankton predator: micozooplankton, and predator:prey scaling relationships. These metrics were used in the context of comparing model structures, however, they may also serve as a way for modellers to compare food-web predictions with data. Dynamic relationships that emerge within food-webs such as intraguild predation, and competition between species, may also be used to assess models, however, specific Level 3 metrics quantifying these measures of system-organisation are difficult to define. Reynolds and Elliott (2012) explore the predictability of several emergent properties of freshwater ecosystems including carrying capacity, exergy accumulation, carbon processing capacity and habitat templates (i.e. functional zones). They conclude, that "while species composition may remain quite unpredictable, except on the basis of probabilities and hindsight, the characteristic traits of the successful contestants can be anticipated with considerable certainty" (p. 87). For more complex food web analyses, Deehr et al. (2014), demonstrated a novel approach to validation of a complex EcoPath food-web through integration with N-isotope data. They demonstrate the strong relationship between effective trophic level (ETL) from the model and the δ 15 N signature from observed organism data, in the context of a marine ecosystem subject to trawling pressures.
Ultimately, a large number of modellers are seeking to elucidate fundamental ecological relationships and forecast potentially complex response pathways of ecosystems to changes in external and internal drivers. Metrics described in the above sections specifically support assessing sub-model components (e.g., hydrodynamics, nutrients, phytoplankton etc.), but there are range of more specific metrics that can be used to holistically assess model predictions. An area of increasing interest is demonstrating models are able to capture systemscale Level 3 metrics such as resilience to perturbation, thresholds and stable state transitions, and hysteresis effects (Hipsey et al., 2015;Müller et al., 2016). As yet there remain limited examples where models have been confronted with empirical data that display these trends. Challenges exist in terms of computing compound indices that can be used as indicators of ecosystem "state", though indices of water quality or ecosystem diversity are increasingly being used. In a shallow lake example, Janse et al. (2008Janse et al. ( , 2010, were able to demonstrate the ability of their model to capture the threshold shift from macrophyte to algal dominance, validated by an assessment across multiple lakes. In doing so they were able to identify the threshold P loading level required for "turbidification", and subsequent "restoration" including demonstration of hysteresis effects, which agreed well with empirical work (Fig. 7). In a stability analysis based on the same model, Kuiper et al. (2015) showed that the food web and system stability gradually decreases with the distance from the critical loading in the bistability range, for both directions, thereby highlighting the potential for correctly formulated models to inform users on phenomena such as critical slowing down, ecosystem flickering, and system resilience. These studies hold great promise for informing assessment of the next generation of AEMs that can be more confidently applied to predict ecosystem collapse, recovery and restoration strategies.

Discussion
The four levels of model validation may be challenging, bearing in mind in the past often inadequate data have been limiting the extent to which assessment could be advanced. However, we are seeing an ever increasing range of data streams from new monitoring technologies such as optical nutrient loggers (Rode et al., 2016;Claustre et al., 2019), improved processing of existing sources such as satellite observations (Jouini et al., 2013), citizen science initiatives (Dickinson et al., 2010), open-access and long-term monitoring initiatives, and real-time data portals (Reed et al., 2010). All these data hold the potential to improve the way we run and assess environmental models. Indeed, aquatic ecosystems modellers are beginning to take up these data streams Johnson and Needoba, 2008;Turuncoglu et al., 2013) and it is timely to reconsider the ways we can use this data to improve our model formulations and to describe model uncertainty.

Improving models through improved assessment
Several recent commentaries have discussed the challenges and issues in application of complex environmental models in general (Nordstrom, 2012) and AEMs in particular (Robson, 2014a;Trolle et al., 2012;Arhonditsis et al., 2014;Frassl et al., 2019). The emergence of community models and flexible modelling frameworks to reduce duplication of effort (e.g. Bruggeman and Bolding, 2014;Mooij et al., 2014;Hipsey et al., 2019), and the application of advanced techniques for assessment of model error and sensitivity, go some way towards addressing these challenges. The CSPS framework and examples we provide is an attempt to help modellers move from relying on Level 1 metrics, to more robust and insightful Level 2 and 3 metrics in order to more thoroughly challenge our models and assess their capabilities and limitations. As these metrics become more widely used and reported, they will facilitate an increased depth of analysis in comparative studies (e.g., Salihoglu et al., 2013;Kim et al., 2014;Kwiatkowski et al., 2014;Tittensor et al., 2018) and help us to assess how different model structures, parameterisations, and algorithms perform across a diverse range of applications and simulation contexts.
For the foreseeable future, there may be insufficient data to complete assessment at all four levels in every case. Nonetheless, being aware of the diversity and suitability of metrics at multiple levels and being explicit about the level (0-3) at which assessment has been conducted will help modellers to communicate the type of assessment that is being performed and the implications for model uncertainty. This awareness may also facilitate prioritisation of observational studies and monitoring programs that consider not only state variables, but also fluxes and emergent properties (in other words, ecological "states, rates, and traits"). Undertaking assessment using this structured approach can help modellers to further pinpoint where models are fundamentally weak and communicate to stakeholders where further investment in data collection and monitoring will support prediction. Conversely, modelling that leads to discovery of new or interesting phenomena can motivate new experiments or monitoring to support post-hoc validation efforts.

Model purpose and selection of appropriate metrics
Technical assessment of models is varied and requires the development of workflows that bring together several methods, tailored to the specific application (Bennett et al., 2013) and taking into account the intended purpose of the model (e.g. Harmel et al., 2014). The examples in Tables 2-4 are intended to provide an expandable library that can serve as the basis of a common reference for assessment of aquatic system models. Some examples may serve to cross-fertilise ideas across sub-disciplinary divides. For example, Target or Taylor diagrams are widely used in oceanography, but not routinely used for freshwater models.
Note that not all are relevant in any given case; for example, a specific relationship may be used to parameterise the model directly, in which case it is less suitable as a validation tool. Others may only be applicable to specific spatial or temporal model resolutions. For example, a metric quantifying stratification and mixing requires a minimal vertical disaggregation of the water column. The selected metrics will also depend on the model purpose. Models used for operational forecasting may prefer different assessment metrics than those designed to explore algal seasonal dynamics or long-term nutrient load assessments. In either case, a combination of Level 1 metrics may demonstrate the model's potential, and support with Level 2 metrics would help to ensure the model was not over-fitted and introduce a higher level of credibility into the assessment. However, if the intention is to then apply the model outside the bounds of historical conditions to explore the impact of major system changes on ecosystem dynamics, then including Level 3 validation becomes an important step. Wellvalidated modelling studies in different disciplinary areas (e.g., ponds, lakes, rivers and marine systems) demonstrating use of metrics across multiple levels are encouraged to serve as benchmarks that can guide practitioners.

Using multiple metrics to enrich the calibration and uncertainty assessment process
Calibration of AEMs has historically been a relatively manual 'trial and error' process, though formal calibration methods are beginning to be used in both marine (e.g. Parslow et al., 2013) and freshwater ecosystem modelling (e.g. Ramin and Arhonditsis, 2013;Dietzal and Reichert, 2012). The adoption of calibration-validation pairing (e.g. Trolle et al., 2008), where validation relies on a data set independent from that used in calibration, remains the exception rather than the rule in aquatic ecosystem modelling, though it is common in other fields and widely considered best practise (Robson, 2014b). Adopting a calibration and validation period can avoid overfitting model parameters, and identify the predictive capability of a model, particularly where patterns during the validation period differ from those in the calibration period. Widening the range of metrics models are assessed against may assist in achieving a broader application of calibration-validation pairing.
In recognition of model uncertainty and equifinality problems, there has been a shift in some areas of the AEM community to change model calibration practice from seeking a single "optimal" value for each model parameter, to seeking a range of parameter sets that all meet a prespecified standard of agreement with the data (Aldenberg et al., 1995;Stow et al., 2007;Arhonditsis et al., 2007;Chiu and Gould, 2010;Janse et al., 2010). Running an ensemble of simulations using values from amongst these acceptable parameter sets provides a basis for estimating the uncertainty associated with model predictions. This practice, termed "physical-statistical modelling" (Kuhnert, 2014), relies on Bayesian probability to combine existing (prior) information with observations to project the (posterior) likelihood of ecosystem response. The effective characterisation of model uncertainty using Bayesian approaches depends upon two critical steps: i) selection of a sampling scheme to generate input vectors (e.g., Latin hypercube, Markov Chain Monte Carlo etc.), and ii) selection of a likelihood measure to quantify model misfit. In complex models, the choice of likelihood measures for assessment leads to conceptual dilemmas for modellers such as the selection of likelihood functions that can meaningfully change the inference drawn (Beven and Freer, 2001;Hong et al., 2005;Arhonditsis et al., 2008b). By further tailoring the adopted likelihood measures to consider the assessment ideas introduced in this paper the model calibration and uncertainty process can be enriched to focus on more diagnostically powerful likelihood measures.
Finally, recognising that there is no true model of an ecological system, but rather several adequate descriptions of different conceptual basis and structure, ensemble modelling is a means to obtain better predictions and a better understanding of uncertainty by combining the results of 'competing' models . Several methods exist to synthesise predictions across ensembles, including sequential data assimilation approaches (such as the ensemble Kalman filter and ensemble particle filters; Moradkhani et al., 2006;Vrugt and Robinson, 2007), and post-hoc ensemble integration strategies such as the Bayesian Model Averaging (BMA) (Ramin et al., 2012). Including models of differing complexity and with varied structures in the ensemble allows structural uncertainty to be addressed alongside uncertainty from input data and parameter selection. When combined with the use of more specific and nuanced assessment metrics, modellers can better decide which model structures perform best, without an overt reliance on diagnostically weak Level 1 error metrics, thus motivating reconsideration of their Level 0 validation.

Conclusions
In evaluating the performance of a model, we want to know the answers to several questions: Is the model capable of reproducing observations? If so, is it getting it right for the right reasons (or conversely, is it over-fitted or does it have one error cancelling another)? Can we trust the model to make predictions? If so, in what range of Fig. 7. Analysis using the model PCLake to (a) demonstrate the threshold loading rate of P to transition from a clear macrophyte-dominated state to a turbid, phytoplankton dominated state, and hysteresis effect (based on sequential simulations undertaken increasing load (series 1, black circles) and decreasing load (2, hollow squares)). Validation of the model states against a multi-lake data set of (b) TP, (c) Chl-a and (d) submerged vegetation, spanning the loading range, confirms the model captures ecosystem organisation over a wide range of conditions. Images reprinted from Janse et al. (2010), with permission from Elsevier. circumstances can we trust it? To answer these questions in the case of mechanistic AEMs, we need to go beyond simply comparing simulated and observed concentrations of state variables. Here, we present a way forward; the hierarchical CSPS framework to encourage evaluation of models at four levels: conceptual accuracy (Level 0), state accuracy (Level 1), process accuracy (Level 2), and accuracy in capturing system behaviour (Level 3). Assessment at Level 2 can improve confidence in the biogeochemical basis of model formulations, while assessment at Level 3 allows modellers to critically assess model predictions against spatial and temporal scales of change, stoichiometric indices, and a range of trophic relationships, all of which are based on theoreticallyinformed indicators of ecosystem function. Arguably, only applications that perform well at highest level of assessment justify the implementation of complex, process-based models. Short-to mid-term forecasting predictions, after all, can often be less expensively and more accurately produced through simpler approaches such as regression modelling (Robson and Dourdet, 2015) or evolutionary algorithms (Recknagel et al., 2014), while if the aim is to shed light on system function, a model that fails to perform at Level 3 may be producing the "right answer for the wrong reason". Though it may not always be possible to rigorously assess a model at all levels, modellers should strive to ensure the level of assessment is suited to the purpose of the model and the severity of consequences of an incorrect prediction. A model that is successfully validated for a specific system at all four levels is one that can be applied with confidence to forecast future trajectories of the system. A model that has been successfully validated across several systems at all four levels is one that can more generally be applied with confidence. Over time, it is envisioned that the community-driven adoption of these metrics will accelerate advances in model structure and function and provide an improved foundation for model assessment on which developments in model-data fusion can be built.

Software and data availability
The system of metrics reported in the present analysis is an evolving collection and can be accessed online at: https://aquaticecodynamics. github.io/aem-metrics/.

Author contributions
MRH, GG and BJR designed the framework outline and prepared the main manuscript body with GA; All authors contributed to metric identification, literature review, preparation of the tables and figures, and manuscript development and revision. Funding: MRH received support from the Australian Research Council [DP130104078, DP170104832, LP130100756, LP150100451]. CCC received support from the National Science Foundation [NSF1753639].

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.