Recent progress in performance evaluations and near real-time assessment of operational ocean products

Operational ocean forecast systems provide routine marine products to an ever-widening community of users and stakeholders. The majority of users need information about the quality and reliability of the products to exploit them fully. Hence, forecast centres have been developing improved methods for evaluating and communicating the quality of their products. Global Ocean Data Assimilation Experiment (GODAE) OceanView, along with the Copernicus European Marine Core Service and other national and international programmes, has facilitated the development of coordinated validation activities among these centres. New metrics, assessing a wider range of ocean parameters, have been defined and implemented in real-time. An overview of recent progress and emerging international standards is presented here.


Introduction
Operational ocean forecast systems (OOFSs) now provide a wide range of analyses and forecasts of the marine environment that can be exploited by many users. The value of the products to any particular user depends not only on the quality and skill of the products but also on the user's knowledge (and understanding) of the quality, skill and reliability of the products for his or her particular application. Since the initial implementation of OOFSs during the late 1990s, continuous efforts have been made to evaluate hindcast and forecast accuracy and skill (Hernandez 2011; Martin 2011). Accuracy and skill are defined here, respectively, as the degree of closeness of the OOFS products to the 'ocean truth' (Hernandez 2011) and the usefulness of the OOFS for a given application. An overview of skill assessment using observations and other reference datasets representing this truth is given by Stow et al. (2009).
The calibration, validation, verification and quality control of OOFS products are core activities in ocean operational centres (OOCs) (Oke et al. 2013; Blockley et al. 2014). Usually, calibration refers to a task in which model parameters are optimized. Here, the calibration phase refers to the last comprehensive scientific assessment of a new OOFS version before it enters operation. The calibration phase is also often used to demonstrate that the new system performs better than the existing one. Validation refers to the assessment of OOFS performance while in operation. Verification is defined here as the quantification of OOFS skill based on independent data, i.e. data not used to generate the products.
Methods for assessing OOFS reliability (Crosnier & Le Provost 2007) were defined in the early days of the Global Ocean Data Assimilation Experiment (GODAE) (Bell et al. 2009), based on (1) consistency, (2) quality (or accuracy) and (3) added value, as proposed and defined by weather forecast skill verification approaches (Murphy 1993; Murphy & Winkler 1987). The first two types of assessment are undertaken routinely by OOCs as 'internal metrics'. The third is considered user-oriented, and requires the use of 'external metrics' measuring the fitness for purpose (provision of dependable, reliable and repeatable information), or the value of ocean forecast services. This is also addressed by some OOCs in parallel with verification tasks performed by users.
Experts in OOCs across the world who are assessing the skill of OOFSs face similar issues with the observational datasets available for validating their products. In general, in situ and satellite measurements are collected by dedicated data assembly centres (DACs) that pre-process the data and make them available to OOFSs. Hence, for similar components of operational systems, methods and tools for assessing the representation of ocean processes can be shared. Owing to the nature and quality of the observations, validation experts also face comparable issues, such as the validation of mesoscale chlorophyll or primary products using ocean colour satellite data, or the use of Lagrangian approaches and drifters to verify the realism of eddies in regional models for oil-spill forecast skill. As a result, there is great potential for collaboration within the scientific community in this area.
Naturally, working groups were set up to tackle these validation issues as a community. This started earlier within the ocean observation community, which formed expert groups to develop guidelines and standards for providing state-of-the-art ocean observation products. Some examples of these are the Ocean Surface Topography Science Team for sea surface height (SSH) and satellite altimetry (www.aviso.altimetry.fr), the Group for High Resolution Sea Surface Temperature (GHRSST) for sea surface temperature (SST) from various satellite and in situ sensors (www.ghrsst.org/), the Argo team for in situ vertical profiles of primarily temperature (T) and salinity (S) (www.argo.ucsd.edu), the Global Ocean Surface Underway Data group for sea surface salinity (SSS) (www.gosud.org) and the International Ocean Color Coordinating Group (www.ioccg.org).
This paper aims to highlight recent progress in near-real-time monitoring of OOFS performance, and to describe different validation strategies and their limitations. For the sake of completeness, we also present DAC validation procedures for advanced observation-based products. Some recent examples are presented and discussed in the first section. The next section illustrates progress by OOCs in integrating validation in their systems. More specifically, since GODAE (Bell et al. 2009), the operational community has maintained a partnership to share and standardize validation methodologies. This community has gained mutual benefit from inter-comparing their ocean products and inferring the relative strengths and weaknesses of the operational systems (Oke et al. 2012). These issues are addressed in the framework of the ongoing GODAE OceanView (GOV) programme (Schiller et al. 2015) (www.godae-oceanview.org) by the Intercomparison and Validation Task Team (IV-TT). Three initiatives have started and are ongoing: an Ocean Reanalysis Intercomparison (Balmaseda et al. 2015); the organization of a multi-model ensemble forecast approach for ocean surface parameters; and the organization of the near-real-time operational product 'Class 4 metrics' inter-comparison against observations, described later in this paper and detailed in two companion papers (Ryan et al. 2015; Divakaran et al. 2015).
Recent improvements in near-real time and operational assessment
New metrics
Presently, real-time OOFS skill assessment focuses on various aspects of the dynamics of physical and biogeochemical processes of the ocean, at different time-scales, over different areas, and with different purposes and uses. Evaluation metrics have evolved in order to synthesize different aspects of system performance. For example, Taylor (2001) diagrams and target diagrams consider root-mean-square error (RMSE) or root-mean-square differences (RMSD) together with anomaly correlations versus observations. Similarly, cost functions and model efficiency values can provide a synthesis of model performance indicators.
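The summary statistics that such diagrams combine can be illustrated with a short sketch. The function name `taylor_statistics` and the dictionary keys are hypothetical, and population (divide-by-n) statistics are assumed; the code only relies on two standard identities, namely that the squared RMSD equals the squared bias plus the squared centred RMSD, and that the squared centred RMSD equals the sum of the two variances minus twice the covariance.

```python
import math


def taylor_statistics(model, obs):
    """Summary statistics commonly combined in Taylor and target diagrams.

    Population (divide-by-n) statistics are assumed here.
    """
    n = len(model)
    if n != len(obs) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_m = sum(model) / n
    mean_o = sum(obs) / n
    anom_m = [m - mean_m for m in model]
    anom_o = [o - mean_o for o in obs]
    std_m = math.sqrt(sum(a * a for a in anom_m) / n)
    std_o = math.sqrt(sum(a * a for a in anom_o) / n)
    if std_m == 0.0 or std_o == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    cov = sum(a * b for a, b in zip(anom_m, anom_o)) / n
    # Centred RMSD: the difference that remains once the mean bias is removed
    crmsd = math.sqrt(sum((a - b) ** 2 for a, b in zip(anom_m, anom_o)) / n)
    rmsd = math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / n)
    return {
        "bias": mean_m - mean_o,
        "std_model": std_m,
        "std_obs": std_o,
        "correlation": cov / (std_m * std_o),
        "centred_rmsd": crmsd,
        "rmsd": rmsd,
    }
```

A Taylor diagram would plot the two standard deviations, the correlation and the centred RMSD, whereas a target diagram would plot the bias against the centred RMSD.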
Furthermore, new metrics have been designed to characterize other properties. For example, in the case of search and rescue, ensemble predictions and clouds of dispersion (Melsom et al. 2012) have been used to evaluate the contribution of uncertainty in ocean currents to drift projections. Dispersion is also assessed using multi-model approaches (such as Fukushima Cesium 137 concentration estimates; Masumoto et al. 2012). New metrics have also been defined for sea-ice, such as contingency tables and distribution statistics used over ensemble coupled model seasonal forecast experiments (Benestad et al. 2011). Skill assessment of ocean biogeochemical models has also been addressed recently. In particular, Lynch et al. (2009) point out the failure of a model to accurately represent the 'ocean truth', but also the failure to correctly assess the effective skill of the model using appropriate metrics.

Assimilation performance assessment
In parallel, the monitoring of the performance of analysis systems has been continuously improved. Statistics derived from innovations (observation minus background) and residuals (observation minus analysis) are used to assess the consistency of the assimilation framework (including the model background and observation error covariances). In the case of ensemble analysis systems, these statistics can be used to verify the adequacy of forecast spread (Balmaseda et al. 2013;Desroziers et al. 2005;Desroziers & Ivanov 2001). Verifying and reducing ocean model biases is also an important issue, as many assimilation methods are based on the assumption that models have no bias, which can reduce the efficiency of analysis methods and even lead to unphysical increments if biases are present and not handled correctly.
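As a minimal sketch of the statistics described above, the following code computes innovations and residuals in observation space for scalar observations, together with the cross-product consistency diagnostics of the type proposed by Desroziers et al. (2005). The function name and returned keys are hypothetical, the background and analysis are assumed to be already interpolated to the observation locations, and biases are assumed negligible (no mean is removed).

```python
import math


def assimilation_diagnostics(obs, background, analysis):
    """Innovation/residual statistics in observation space (scalar obs).

    `background` and `analysis` are the model equivalents at the
    observation locations. Assumes unbiased errors (illustrative only).
    """
    n = len(obs)
    if not (n == len(background) == len(analysis)) or n == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    d_ob = [y - b for y, b in zip(obs, background)]       # innovations
    d_oa = [y - a for y, a in zip(obs, analysis)]         # residuals
    d_ab = [a - b for a, b in zip(analysis, background)]  # increments
    return {
        "mean_innovation": sum(d_ob) / n,
        "rms_innovation": math.sqrt(sum(d * d for d in d_ob) / n),
        "rms_residual": math.sqrt(sum(d * d for d in d_oa) / n),
        # Cross-products: E[d_oa * d_ob] estimates the observation error
        # variance, E[d_ab * d_ob] the background error variance.
        "obs_error_variance": sum(a * b for a, b in zip(d_oa, d_ob)) / n,
        "bkg_error_variance": sum(a * b for a, b in zip(d_ab, d_ob)) / n,
    }
```

Comparing the two estimated variances with the values prescribed in the assimilation scheme gives an indication of whether the assumed error covariances are consistent with the data; by construction, the two estimates sum to the mean squared innovation.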
Rigorous skill assessment in the assimilation framework is a difficult task: most available observations are used to adjust models and reduce analysis errors. Thus, independent assessment is only possible by: (1) withholding part of the dataset for statistical quantification of errors (a trade-off between a sufficient population size to estimate a statistic while not significantly impacting the quality of the system performance being measured); or by (2) using other sources of data that have not been assimilated (Gregg et al. 2009). The latter is generally employed with data not available in near-real time, which is useful for reanalysis (or hindcast) evaluation, but not for operational routine verification.

Longer-term forecast assessment
Most of the OOCs provide short-term forecasts (from a few days to 1-2 weeks), but some have begun providing longer, monthly forecasts, such as the Japanese Meteorological Agency MOVE/MRI.COM-WNP OOFS. It covers a large part of the Northwest Pacific (15°N-65°N, 117°E-160°W), with a specific zoom (1/10° resolution) over 15°N-50°N, 117°E-160°E (Usui et al. 2006), and uses a multivariate Three-Dimensional Variational (3DVAR) data-assimilation scheme (Fujii & Kamachi 2003). Persistence and 1- to 30-day forecasts are compared against analyses to provide RMSE statistics. A forecast skill metric is used, whereby the ratio of the forecast RMSE to the persistence RMSE is calculated for a given forecast lead-time. With this skill score, the forecast is more skilful than persistence if the ratio is below 1. Results from the MOVE OOFS are shown in Figure 1 for the velocity field at 50 m depth over the period February 2006 to January 2008. Even for 30-day forecasts, the system performs better than persistence in most areas around Japan, and fails only in the vicinity of the Kuroshio Extension area [Figure 1(a-c)]. Moreover, Kuroshio dynamics and predictability are assessed using a specific metric based on the Kuroshio main axis. The Kuroshio axis error is defined as the distance between the forecast and the 'true' axis position over the 133-139°E, 30-35°N region, where the axis position is determined using the position of the 15°C isotherm at 200 m depth from the analysis. Figure 1(d) indicates that the dynamical forecasting system is consistently better than persistence at all lead times. This type of metric is also useful for conveying forecast skill to users, as the error is expressed in tangible terms (here a distance in kilometres) rather than as an abstract unit-less skill score or RMSE value.
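The RMSE ratio described above can be sketched as follows. The function names are hypothetical; the persistence forecast is taken to be the initial analysis held constant up to the verification time, and all fields are assumed to be flattened to equal-length lists on the same grid.

```python
import math


def rmse(field, reference):
    """Root-mean-square error between two equal-length flattened fields."""
    if len(field) != len(reference) or not field:
        raise ValueError("fields must be non-empty and of equal length")
    return math.sqrt(sum((f - r) ** 2 for f, r in zip(field, reference)) / len(field))


def persistence_skill_ratio(forecast, persistence, verifying_analysis):
    """Forecast RMSE divided by persistence RMSE for one lead time.

    A value below 1 means the forecast is closer to the verifying
    analysis than persistence.
    """
    rmse_persistence = rmse(persistence, verifying_analysis)
    if rmse_persistence == 0.0:
        raise ValueError("persistence error is zero; ratio is undefined")
    return rmse(forecast, verifying_analysis) / rmse_persistence
```

Evaluating this ratio separately for each lead time (1 to 30 days in the example above) and for each grid cell or sub-region yields maps or curves of the kind summarized in Figure 1.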
Specific approaches for regional operational systems
Recent improvements have also been made in terms of evaluating specific regional and mesoscale dynamics. For example, the Gulf of Mexico Pilot Prediction Project (GOMEX-PPP, http://abcmgr.tamu.edu/gomexppp/) is investigating OOFS performance in predicting the evolution of the Loop Current in the Gulf of Mexico. The Long Range Ensemble Forecasting System (GOM-LERFS), developed at the Naval Research Laboratory (Stennis, USA), has been providing 2-month forecasts since January 2013, with the intention of supporting end users impacted by strong currents associated with the Loop Current and its eddies, and of providing boundary conditions for coastal ocean models. This 3-km resolution OOFS performs weekly 32-member 60-day forecasts. It is initialized using an analysis provided by the Navy Coupled Ocean Data Assimilation (NCODA) scheme, which uses a 5-day assimilation window to ingest satellite altimetry, SST and in situ data obtained from the global telecommunication system. A verification of SSH is performed in which forecasts are compared in real-time with along-track SSH data following the Class 4 metrics approach (i.e. in observation space; discussed below). Then, in model space, the ensemble is used to assess the probabilistic forecast skill. In Figure 2, statistics of weekly comparisons against analyses for the period January-September 2013 show that forecasts remain skilful for approximately twice as long as persistence. The SSH anomaly variance agrees closely between the forecast and the verifying analysis, but the ensemble standard deviation does not appear to predict the forecast error, suggesting that the ensemble spread does not fully capture the forecast error patterns. Adequately sampling the uncertainty in initial conditions, model physics and forcing is an important aspect of ocean ensemble prediction that requires further study.
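The comparison between ensemble spread and ensemble-mean error mentioned above can be illustrated with a small sketch. This is not the GOM-LERFS procedure itself: the function names are hypothetical, members are assumed to be flattened fields on a common grid, and the spread is taken as the square root of the domain-averaged sample variance across members.

```python
import math


def ensemble_mean_and_variance(members):
    """Point-wise ensemble mean and sample variance (divide by n_members - 1)."""
    n_mem = len(members)
    if n_mem < 2:
        raise ValueError("at least two ensemble members are required")
    n_pts = len(members[0])
    mean = [sum(m[i] for m in members) / n_mem for i in range(n_pts)]
    var = [sum((m[i] - mean[i]) ** 2 for m in members) / (n_mem - 1)
           for i in range(n_pts)]
    return mean, var


def spread_and_error(members, verifying_analysis):
    """Return (RMSE of the ensemble mean, domain-averaged ensemble spread)."""
    mean, var = ensemble_mean_and_variance(members)
    n_pts = len(mean)
    error = math.sqrt(sum((m - v) ** 2 for m, v in zip(mean, verifying_analysis)) / n_pts)
    spread = math.sqrt(sum(var) / n_pts)
    return error, spread
```

If the spread is systematically smaller than the error of the ensemble mean, the ensemble is under-dispersive and does not sample the full forecast uncertainty.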
Another example of forecast skill assessment is the dynamical feature-based validation approach used for the Experimental System for Predicting Shelf/Slope Optics (ESPreSSO), operated in real-time by Rutgers University over the New Jersey coast Mid-Atlantic Bight (Wilkin & Hunter 2013). This OOFS is based on the 7-km horizontal resolution Regional Ocean Modeling System (ROMS) using boundary conditions from the HYCOM-NCODA global OOFS. The system is initialized using daily analyses from a Four-Dimensional Variational (4DVAR) analysis system with a 3-day analysis window (Moore et al. 2011), which assimilates a large set of data [including glider T/S profiles and CODAR HF-Radar measurements (http://www.myroms.org/espresso/)].
The 4DVAR approach allows a better quantification of model errors by assessing the impact of the assimilated data, thereby permitting the correction of large-scale biases. Slope currents and water masses used for real-time applications are evaluated using all available data. A specific off-line verification is performed using independent surface drifters, moored data and SSH. Moreover, a dedicated multi-model real-time assessment has been performed, comparing estimates from ESPreSSO with three other regional OOFSs and three global OOFSs (including HYCOM-NCODA) in order to evaluate the OOFSs' prediction skill for subtidal currents and shelf water mass changes. This assessment is comprehensively discussed in Wilkin and Hunter (2013), including a presentation of performance improvements through downscaling strategies. In their figures 4 and 5, improvements to Taylor and target diagrams are proposed, to represent individual vs average performance, and seasonal model biases, respectively.

Validation of biogeochemical products
Figure 2. Real-time SSH assessment of the GOM-LERFS Gulf of Mexico OOFS against satellite altimetry data. Comparison with analysis is restricted to water deeper than 200 m in the subdomain 82W to 89W and 22N to 28N. (a) Correlation of SSH anomalies for persistence (red) and forecast (black). (b) RMSE of the ensemble mean forecast (red) and ensemble spread standard deviation (red dashed) for SSH. Also shown is the spatial variability, in terms of standard deviation (STD), of the analysis (black) and the 4-week ensemble mean forecast (blue).

In the field of ecosystem modelling and marine-resources management, in situ data for adequate validation of operational products are sparse. Hence, satellite ocean colour (OC) products remain the main source of information for estimates of phytoplankton pigment concentration distribution [i.e. chlorophyll, CHL; Figure 3(a)]. The OC Thematic Assembly Centre (TAC), within the European MyOcean project (www.myocean.eu), has developed specific processing chains to operationally distribute state-of-the-art, quality-checked daily OC observations over both global and regional domains. The need for regional processing comes from the demonstrated inadequacy, at regional scales, of global algorithms to generate reliable products of sufficient accuracy (Volpe et al. 2012). For example, the oligotrophic waters of the Mediterranean Sea were shown to be significantly less blue and more green than the global ocean (Volpe et al. 2007). The OC TAC provides value-added products (generally not distributed by space agencies) such as: (1) daily merged fields from different sensors; (2) Level 4 products without data voids owing to clouds, generated using both Optimal Interpolation and Empirical Orthogonal Function approaches; and (3) products that account for the two broad classes of bio-optical regimes (open ocean and coastal waters). Level 4 and Level 3 (L4 and L3) mentioned here refer to product levels as defined by the Committee on Earth Observation Satellites (www.ceos.org). Typically, L4 products are regular maps of a given parameter, obtained by merging and processing similar measurements from different sources, and using specific estimation methods (optimal interpolation, kriging, etc.). For these observation-based products, both offline and online quality assessments are performed by the OC TAC. The former refers to the comparison of space-time co-located in situ and satellite-derived products for quantities such as spectral remote sensing reflectance, total suspended matter, coloured dissolved organic matter and chlorophyll concentration. In real-time, such data are not sufficiently robust, and validation is limited to a consistency assessment (Hernandez 2011), where the daily climatology for each region is used as a reference to make pixel-based comparisons (online validation). A Quality Index, based on the normalized departures from climatology, is computed from the SeaWiFS sensor [no longer operational; Figure 3(d)]. In parallel, the monitoring of input data (number, quality, etc.) has increased the reliability of the products.
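A pixel-based consistency check of the kind described above can be sketched as a standardized anomaly with respect to the daily climatology. The exact formulation of the OC TAC Quality Index is not given in the text, so the definition below (departure from the climatological mean divided by the climatological standard deviation), the function names and the threshold of 3 are all assumptions made for illustration. Cloud-covered pixels are represented by `None`.

```python
def quality_index(daily, clim_mean, clim_std):
    """Standardized departure from climatology for each pixel.

    Returns None where the pixel is missing (e.g. cloud) or where the
    climatological standard deviation is not positive.
    """
    index = []
    for value, mean, std in zip(daily, clim_mean, clim_std):
        if value is None or std is None or std <= 0.0:
            index.append(None)
        else:
            index.append((value - mean) / std)
    return index


def fraction_flagged(index, threshold=3.0):
    """Fraction of valid pixels whose |index| exceeds an assumed threshold."""
    valid = [q for q in index if q is not None]
    if not valid:
        return None
    return sum(1 for q in valid if abs(q) > threshold) / len(valid)
```

Monitoring the flagged fraction from day to day gives a simple indicator of whether a daily product departs unusually far from its expected seasonal state.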
Many initiatives have led to progress on skill assessment of biogeochemical modelling (Stow et al. 2009). For example, as part of the Ocean Carbon Model Intercomparison Project, univariate metrics were proposed to quantify both physical and biogeochemical parameters of the coupled simulations. Multivariate metrics (i.e. quantifying the reliability of both the parameters and their relation to observed processes) (Allen & Somerfield 2009), or map-based validation, are also emerging in this field. However, for real-time assessment of ecosystem-biogeochemical forecasts, most of the OOFSs can only rely on references given by OC satellite products. Moreover, the dynamics of biogeochemical systems are strongly characterized by the patchiness of their properties generated by the oceanic mesoscale, which causes heterogeneity in concentration fields (Levy & Martin 2013). Consequently, most forecast verifications mimic OC product assessment by analysing in a similar way, at the pixel level, the model equivalent to CHL and optical satellite data (Lazzari et al. 2012), as described in the previous paragraph.

Validation of sea-ice products
Growing interest in polar regions has driven the need for improved sea-ice verification metrics to demonstrate the capacity and quality of sea-ice forecast skill to potential users. This effort has been hindered by the limited reliability and availability of observational datasets, together with a lack of knowledge of how to adequately account for nonlinearities in the verification metrics. Contingency table-based metrics, introduced in the early twentieth century (Pearson 1904), have been re-popularized, as has the root-mean-square distance of the ice edge. However, these metrics may not be relevant for regional or process-dependent verification. In particular, errors in ice edge location assessment are sensitive to the definition of 'ice edge', as multiple ice edges may be present and the total error will be sensitive to the length of the ice edge. Hence, the metrics defined for the Arctic might not be suitable in the Baltic Sea, and definitions should consider subregional scaling (e.g. size of a gulf) (Lagemaa 2013). However, even if the ice metric is not properly defined, it still gives valuable user information for regions of dense marine traffic such as the Baltic Sea.
An example of a contingency table-based metric from the Canadian Meteorological Centre Global Ice-Ocean Prediction System (GIOPSv1.0) is shown in Figure 4. GIOPSv1.0 uses a 3DVAR ice concentration analysis for correcting the Los Alamos sea-ice model by assimilating satellite data together with daily ice charts from the Canadian Ice Service. The reference dataset is given by the Interactive Multisensor Snow and Ice Mapping System (IMS) analyses from the National Oceanic and Atmospheric Administration (NOAA) National Ice Centre, which provide binary fields of ice/open water on a 4 km grid. Sea-ice analyses suffer from an incomplete coverage of observations, with data-reliability issues and the misrepresentation of leads. A particular issue is the high sensitivity of passive microwave retrievals to surface melt, often resulting in erroneous values of open water in summer. Contingency table statistics produced using IMS analyses (applying a threshold of 0.4 to determine binary ice/water values from the GIOPSv1.0 ice concentration forecasts) are used in order to evaluate the proportion of correct ice, or correct water. These contingency scores are computed separately for forecast and persistence fields. Then, differences between the scores for forecasts and persistence are computed. These metrics are mapped for 2011 in Figure 4, showing skilful 7-day forecasts along most of the ice edge. In other words, 7-day forecasts beat persistence in terms of the proportion of correctly predicted sea ice and open water.
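The contingency-table comparison described above can be sketched as follows. The 0.4 concentration threshold comes from the text; the function names, the use of "greater than or equal to" at the threshold, and the combination of correct ice and correct water into a single proportion-correct score are assumptions made for illustration.

```python
def contingency_table(concentration, reference_is_ice, threshold=0.4):
    """Count hits, false alarms, misses and correct rejections for ice.

    `concentration` holds ice concentrations in [0, 1]; `reference_is_ice`
    holds the binary reference field (True = ice, False = open water).
    """
    table = {"correct_ice": 0, "false_ice": 0, "missed_ice": 0, "correct_water": 0}
    for value, is_ice in zip(concentration, reference_is_ice):
        predicted_ice = value >= threshold
        if predicted_ice and is_ice:
            table["correct_ice"] += 1
        elif predicted_ice and not is_ice:
            table["false_ice"] += 1
        elif is_ice:
            table["missed_ice"] += 1
        else:
            table["correct_water"] += 1
    return table


def proportion_correct(concentration, reference_is_ice, threshold=0.4):
    """Fraction of points classified correctly as ice or as open water."""
    table = contingency_table(concentration, reference_is_ice, threshold)
    return (table["correct_ice"] + table["correct_water"]) / len(reference_is_ice)


def skill_versus_persistence(forecast, persistence, reference_is_ice, threshold=0.4):
    """Positive values mean the forecast beats persistence."""
    return (proportion_correct(forecast, reference_is_ice, threshold)
            - proportion_correct(persistence, reference_is_ice, threshold))
```

Accumulating the score difference per grid cell over a year would give a map of the kind shown in Figure 4.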

Reliability assessment of input information
Another recent aspect of OOFS validation strategy is the systematic feedback of errors and anomalies to providers of input data. For instance, validation of atmospheric forcing fields is now carried out for some wave-prediction systems (Feng et al. 2006). Moreover, inputs to ocean assimilation systems, such as in situ data collected by TACs or DACs, can suffer in real-time from incomplete levels of quality control. While automatic procedures are applied for the rapid distribution of the observations in real-time, more detailed visual analysis is often left for delayed-time datasets. Other analyses usually depend on the level of expertise of the provider. In near-real time, erroneous in situ profiles can drastically impact the quality of ocean analyses. As a result, systematic quality control has been implemented in many OOFSs to prevent this. At Mercator Océan, two techniques are applied for in situ T/S profiles. First, innovations (guess/forecast minus observation) are tested against a threshold envelope. This envelope is defined using statistics of innovations from an ocean reanalysis and is used to detect anomalous observations (e.g. blue dots between 500 and 1000 m depth in Figure 5). Second, dynamic heights are computed from the T/S increments, and then probability density functions are constructed for consistent dynamical areas, in order to detect points lying outside the normal distribution. In some cases, feedback to producers is organized through blacklisting (Cabanes et al. 2013). Similarly, the MyOcean IBI (Irish Biscay Iberian shelves) OOFS team (Puertos del Estado, Spain, and Mercator Océan, France) has developed a comprehensive tool called NARVAL (Numeric Assessment for Regional VALidation) to check its operational performance, in terms of consistency, accuracy and reliability.
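The first of the two profile checks described above can be illustrated with a simple envelope test. The text does not specify how the envelope is built, so the form used here (reanalysis innovation mean plus or minus `k` standard deviations at each depth level, with `k = 3`) and the function name are assumptions. The innovation follows the sign convention quoted in the text (forecast minus observation).

```python
def flag_profile_levels(observed, forecast, env_mean, env_std, k=3.0):
    """Flag the levels of a profile whose innovation falls outside an envelope.

    All inputs are per-depth-level lists of equal length. `env_mean` and
    `env_std` are innovation statistics taken from a reanalysis
    (assumed envelope form: mean +/- k * std).
    """
    flags = []
    for obs, fc, mean, std in zip(observed, forecast, env_mean, env_std):
        innovation = fc - obs  # convention quoted in the text
        flags.append(abs(innovation - mean) > k * std)
    return flags


def profile_is_suspect(flags, max_flagged_fraction=0.0):
    """Assumed rule: a profile is suspect if too many levels are flagged."""
    return sum(flags) / len(flags) > max_flagged_fraction
```

For example, a temperature profile whose observed value at one level departs from the forecast by several degrees while the envelope allows only about one degree would have that level flagged and could be withheld from the analysis or reported to the data provider.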
NARVAL uses available observations, such as satellite-derived Sea Level Anomalies (SLA), SST and SSS (from both L3 and L4 products), in situ T/S profiles, HF-radar surface currents and tide gauge sea level. This tool builds on the MyOcean project structure such that the input data are quality checked by the TACs. NARVAL is modular and extendible to any new data source used as a reference (measurements, climatologies or model estimates). All validation information produced is archived for further evaluation. Additionally, the 'On-line Mode Validation' provides an automated quality and consistency assessment, and is routinely performed for each forecast bulletin (from the previous day's hindcast up to 5-day forecasts). It generates Class 1-4 metrics (Hernandez et al. 2009) that provide daily statistics and an evolution of the skill score for each parameter over the past two weeks [Figure 6(a)]. Furthermore, a 'Delayed Mode Validation' provides an overall review of the IBI product quality over longer time periods (i.e. monthly, seasonal and annual). Real-time statistics are accumulated to provide a synthesis assessment over longer periods, while dedicated metrics using off-line datasets can focus on particular ocean phenomena or parameters. Metrics are computed over the whole domain (26°N-56°N, 19°W-5°E), but also over specific sub-regions of interest, both for users and for verification teams, for example: Strait of Gibraltar, English Channel, Western Mediterranean Sea, Gulf of Biscay, Western and Northern Iberian shelves, the Canary Islands area and the Irish Sea.

Development of integrated operational verification systems
Using NARVAL, the performance can be monitored for specific areas, dynamics and OOFSs, as illustrated for SST using a Taylor diagram (Figure 7). This type of figure is complex, but it is used by validation teams to monitor, at a glance, the SST scores of several systems over various areas. NARVAL has been designed to allow automatic inter-comparison between IBI and adjacent regional OOFSs within the MyOcean framework for the Mediterranean Sea and the North-West-Shelf [Figure 6(b)]. Comparison with adjacent OOFSs aims primarily to maintain consistency in products and user delivery. Comparison with the global OOFS (within which the IBI OOFS is nested) quantifies the added value of the regional shelf system (Figure 7), which represents tides and high-frequency upper ocean dynamics. There are also comparisons against coastal systems over key areas, such as the SAMPA (Sistema Autónomo de Medición, Predicción y Alerta) system around the Gibraltar Strait (Lorente et al. 2014).
In the MyOcean framework, a similar methodology is implemented for the Baltic Sea by Danish, Estonian, Finnish, German and Swedish OOCs, with a comprehensive validation toolbox designed to cover all available data with various metrics. It provides detailed outputs for expert users and model developers. However, for less experienced users and decision makers, the system provides a more general reliability output. Routines are adapted for mapped, on-track and time-series reference data covering the sea level, ice thickness and concentration, T, S, transports, CHL, oxygen, nitrate and phosphate metrics. Moreover, the five contributing Baltic OOCs have organized a multi-model verification and comparison process, together with a multi-model ensemble estimate assessment. For some parameters, results are regularly posted to the Baltic Operational Oceanographic System server (www.boos.org). Beyond the extended information on forecast scores, the intercomparison of different OOFSs adds value to the near-real-time validation routines, in addition to the usual evaluation against observations and climatology. The multi-model standard deviations from different forecast products (Figure 8) provide valuable information about their uncertainties, which are difficult to assess using regular model-reference approaches owing to the sparse coverage of observations. These figures are available daily at www.boos.org. Interestingly, nine OOFSs are assembled to show the reliability of the surface current forecast, using vector diagrams [Figure 8(c)]. This strategy has also been adopted in MyOcean by the North West European Shelf Operational Oceanographic System, covering the North Sea and English Channel regions, presenting a multi-model assessment from Belgian, Danish, German, Norwegian, Swedish and UK OOCs (www.noos.cc).

Reducing uncertainties by ensemble approach: ocean surface parameters multi-model estimation
Major incidents, such as the AF447 Air France Rio-Paris airplane crash in June 2009, the DeepWater Horizon oil  (Masumoto et al. 2012;Kawamura et al. 2011). For all these events, national authorities have made requests to their respective OOCs in order to provide some assistance in near-real time or offline, to carry out dedicated studies to complement risk assessment.
Ocean studies performed in support of the effort to find the Air France plane wreckage relied on several new aspects: (1) an international effort to collect forecasts from different OOCs and to provide different ocean datasets to assist rescue activities in real-time; (2) the use of multi-model datasets and ensemble approaches to reduce errors of ocean surface dynamics in hindcasts and forecasts, with the implementation of dedicated high resolution model simulations in the area, nested into global OOFSs; (3) a retrospective statistical analysis of the accuracy of ocean currents and, in particular, the reliability of mixing and transport properties; (4) the formation of an international task team, with contributions from many ocean experts from both the in situ and modelling communities (Scott et al. 2012;Drévillon et al. 2013). Performance gains were also made during the search for MH370 through the use of ensemble mean products that improved the representation of buoy trajectories.
At the Australian Bureau of Meteorology, deterministic forecast errors of the OceanMAPS OOFS are assessed and reduced by implementing a time-lagged ensemble forecast, also called a multicycle ensemble (Brassington 2013).
Over four successive days, forecasts are performed each day, starting from background fields independent from each other. Weighted ensemble averages are then computed, and forecast errors are assessed using spectral methods that quantify the impact of ensemble averaging as a function of wavenumber. For instance, for SST, Figure 9 demonstrates the increase in power for random information (Brassington 2013). By comparing the power spectrum at different forecast periods, the growth in random error relative to wavelength is also captured.
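The spectral comparison described above can be sketched with a simple periodogram of a zonal section, computed for a single forecast and for the ensemble mean. The function names and the normalization of the periodogram (squared modulus of the discrete Fourier coefficient divided by the number of points) are assumptions; a direct discrete Fourier transform is used to keep the example free of external dependencies.

```python
import cmath
import math


def periodogram(section):
    """Power at wavenumbers 1..n/2 of the anomalies of a regularly sampled section."""
    n = len(section)
    if n < 2:
        raise ValueError("section must contain at least two points")
    mean = sum(section) / n
    anomalies = [x - mean for x in section]
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(a * cmath.exp(-2j * math.pi * k * i / n)
                    for i, a in enumerate(anomalies))
        power.append(abs(coeff) ** 2 / n)  # assumed normalization
    return power


def power_difference(single_forecast, ensemble_mean):
    """Power of a single forecast minus power of the ensemble mean, per wavenumber.

    Positive values indicate variance removed by the ensemble averaging.
    """
    p_single = periodogram(single_forecast)
    p_mean = periodogram(ensemble_mean)
    return [a - b for a, b in zip(p_single, p_mean)]
```

Plotting the difference against wavenumber shows at which spatial scales the averaging removes variance, which is the type of information summarized in Figure 9(b).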
For marine pollution in the Northern Aegean Sea, studies based on a 48-h oil-spill dispersion forecast have been performed recently. The system is based on atmospheric, wave and ocean circulation models coupled with the operational systems using the Aegean-Levantine Eddy Resolving Model (nested in the MyOcean OOFS) and SKIRON of the University of Athens and oil-spill dispersion models (http://diavlos.oc.phys.uoa.gr). A Lagrangian-based verification east of Limnos Island (Northeastern Aegean Sea) was conducted during October 2012, in which the tracks of 25 drifting buoys and special oil-spill drifting instruments were compared with drift predictions. The area was characterized by a very strong front, and in many cases a small error in the prediction of the frontal line resulted in very large errors in the oil-spill prediction (Figure 10, left). This experiment shows that forecasts beat persistence over the first 20 h (Figure 10, right). Moreover, in these areas of varying dynamical features (fronts, eddies), forecast errors grow significantly, emphasizing the need for more advanced prediction systems such as ensemble forecasts. Ensemble approaches are now being considered at regional scale, as in the Ligurian Sea, where a multi-model strategy is tested against an ensemble prediction system, showing the respective merits of each approach (Mourre & Chiggiato 2014). Additionally, ensemble approaches proposed by the operational SST community at global scales have been shown to provide promising results, where the ensemble is usually more reliable than individual estimates (Dash et al. 2012; Xie et al. 2008).
Figure 9. (a) Power spectrum for SST anomaly in the Tasman Sea for zonal sections (38-32S) and temporally averaged from 1 March to 31 August 2012 from the Australian OceanMAPS OOFS. The black (red) lines represent the 0-lag latest forecast and the ensemble mean, respectively. The periodograms are shown for the forecast hours −096 (4 days before, solid), −048 (2 days before, dashed), 000 (dash-dot) and 048 (2-day forecast, dotted). (b) Difference in power between the 0-lag latest forecast and the weighted ensemble mean for the forecast −096 (solid), −048 (dashed), 000 (dash-dot) and 048 (dotted).
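The Lagrangian verification described above, in which the forecast error is the distance between a drifter and the centre of mass of the predicted oil spill and is compared with persistence, can be sketched as below. All positions, particle masses and the number of time steps are hypothetical values chosen for illustration; the great-circle (haversine) distance is an assumption, as the source does not state how distances were computed.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Hypothetical drifter positions at three successive times
drifter_lon = np.array([25.50, 25.55, 25.62])
drifter_lat = np.array([39.90, 39.93, 39.97])

# Hypothetical simulated oil particles: rows are times, columns are particles
particle_lon = np.array([[25.49, 25.51], [25.52, 25.56], [25.57, 25.61]])
particle_lat = np.array([[39.89, 39.91], [39.91, 39.93], [39.94, 39.96]])
masses = np.array([1.0, 1.0])

# Mass-weighted centre of the predicted spill at each time
centre_lon = np.average(particle_lon, axis=1, weights=masses)
centre_lat = np.average(particle_lat, axis=1, weights=masses)

# Forecast error: drifter versus predicted centre of mass
forecast_error = haversine_km(centre_lon, centre_lat, drifter_lon, drifter_lat)
# Persistence error: drifter versus the initial centre of mass held fixed
persistence_error = haversine_km(centre_lon[0], centre_lat[0],
                                 drifter_lon, drifter_lat)

beats_persistence = forecast_error < persistence_error
print(forecast_error, persistence_error, beats_persistence)
```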
Several OOFSs involved in GOV activities contributed to the rescue actions carried out for the dramatic events mentioned above. The IV-TT proposed to strengthen this multi-model approach by organizing the real-time provision of operational hindcasts and forecasts among several GOV OOFSs. Recent experiences have shown that (1) surface ocean parameters were the most needed products, and (2) higher resolution improved the estimation (e.g. for drift, dispersion, mixing, sinking, etc.). As a result, since 2013, four global OOCs (US Navy with HYCOM-NCODA, UK Met Office with the Forecast Ocean Assimilation Model (FOAM), NOAA/NCEP with the Real-Time Ocean Forecast System (RTOFS) and the Climate Forecast System (CFS), and Mercator Océan with PSY3) have been providing a daily rolling archive of model native grid fields of best estimates and forecasts of T, S and currents at the surface. From these nowcasts/forecasts, a first multi-model/ensemble assessment has been made, focusing on SST, together with two observation-only datasets chosen as reference: the NCEP Real-Time Global (RTG) (Thiébaux et al. 2003) and the GHRSST NAVO K10 level-4 dataset. Three ensemble computations are defined using daily OOFS outputs: (1) simple arithmetic mean; (2) weighted average, based on root-mean-square (RMS) daily differences of each member with respect to the reference SST field; and (3) clustered average, based on a k-means algorithm (Hartigan & Wong 1979). A first hindcast comparison against NCEP RTG has now been performed for July 2013 (Figure 11). In this evaluation, Mercator_PSY3, FOAM and CFS perform better than the two other OOFSs. Note also that FOAM biases are slightly different using the NAVO K10 product (not shown). This highlights the sensitivity to uncertainty in the observational dataset used as 'truth'.
Above all, Figure 11 shows that the use of an ensemble results in an improvement over each of the members, with the k-means clustered average performing the best of the three ensemble methods. Similar ensemble scores are obtained against the NAVO K10 SST (not shown). RMS scores seem dependent on the number of clusters: preliminary tests (not shown) from one to 10 clusters indicate significant improvements. This assessment is ongoing, with further analysis of the ensemble mean computation for forecasts and for other ocean parameters. One of the key aspects of this community effort is the real-time provision of these OOFS outputs.
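The first two ensemble computations listed above (simple mean and RMS-weighted average) can be illustrated with the sketch below. The source does not give the exact weighting formula, so the inverse-mean-square-difference weights used here are an assumption, as are the synthetic reference field and the member biases and noise levels. The k-means clustered average is not reproduced.

```python
import numpy as np

def rmse(field, reference):
    """Root-mean-square difference between two fields on the same grid."""
    return float(np.sqrt(np.mean((field - reference) ** 2)))

rng = np.random.default_rng(1)

# Synthetic reference SST field (degrees C) on a small grid
reference_sst = 15.0 + 5.0 * rng.random((20, 30))

# Four synthetic members with different biases and random error levels
biases = [0.5, -0.3, 0.1, -0.6]
noise_levels = [0.4, 0.6, 0.3, 0.8]
members = np.array([
    reference_sst + b + s * rng.standard_normal(reference_sst.shape)
    for b, s in zip(biases, noise_levels)
])

member_rmse = np.array([rmse(m, reference_sst) for m in members])

# (1) Simple arithmetic mean of the members
simple_mean = members.mean(axis=0)

# (2) Weighted average; weights proportional to 1 / RMS^2 (an assumption)
weights = 1.0 / member_rmse ** 2
weights /= weights.sum()
weighted_mean = np.tensordot(weights, members, axes=1)

print("member RMSE  :", member_rmse)
print("simple mean  :", rmse(simple_mean, reference_sst))
print("weighted mean:", rmse(weighted_mean, reference_sst))
```

Note that the weights here are computed from, and scored against, the same reference field. An operational setting would more plausibly derive them from earlier days, which this sketch does not attempt.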

Forecast skill: intercomparison of ocean parameters against observations: Class 4 metrics assessment
The Class 4 metrics approach, developed during the EU MERSEA Strand1 project (Crosnier & Le Provost 2007) and improved during the EU MERSEA-Integrated Project, was adopted at the international level by the GODAE community (Hernandez et al. 2009). This approach is based on comparison with reference measurements, from space or in situ, to assess the OOFS forecasting skill. Reference data, providing ocean 'truth', are used in a similar way, to infer the accuracy of both the best estimate/analyses and the forecasts at different lead times. Additionally, to evaluate the added value provided by the model and the OOFSs' short-term prediction efficiency, tests of the skill with respect to climatological fields and persistence are made. Class 4 metrics performed in near-real-time are limited by the availability and quality of observations, with several important consequences. First, owing to the scarcity of ocean measurements, the real-time assessment relies on observations that are also used by the assimilation system. These observations can nevertheless be considered approximately independent for forecast assessment (neglecting the autocorrelation of observation error in time), in particular when considering short time-scale ocean transient features. A second consequence is the larger error budget for real-time observations, which are not fully cross-calibrated, verified and corrected as in the delayed mode (Cabanes et al. 2013; Le Borgne et al. 2012). A third consequence relates to the overall quality of reference products in near-real-time. If biases exist between these products and information of the same kind that is assimilated, then validation scores can be degraded or misleading.
Figure 10. Left: experimental area, east of the Limnos Island; surface velocities given by the model (blue arrows). Simulated oil spills (cyan and black), centre of mass of the oil spills (red x) and drifter tracks (green x) are plotted. Right: time evolution of the oil-spill forecasting error (in kilometres, blue bars), derived as the distance between the drifter location and the centre of mass of the predicted oil-spill, compared with the persistence of the centre of mass (also as the difference with drifters in km, red bars).
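The tests of skill with respect to climatology and persistence mentioned above are commonly expressed as a mean-square-error skill score, 1 − MSE(forecast)/MSE(reference). The source does not state which formulation is used, so this choice is an assumption, and the observation and model values below are invented for illustration.

```python
import numpy as np

def mse(estimate, observations):
    """Mean square error of an estimate against co-located observations."""
    diff = np.asarray(estimate) - np.asarray(observations)
    return float(np.mean(diff ** 2))

def skill_score(forecast, reference, observations):
    """1 is a perfect forecast, 0 is no better than the reference,
    negative values mean the forecast is worse than the reference."""
    return 1.0 - mse(forecast, observations) / mse(reference, observations)

# Hypothetical co-located values (e.g. SST in degrees C)
observations = np.array([18.2, 17.9, 16.5, 15.8])
forecast = np.array([18.0, 17.6, 16.9, 15.5])
persistence = np.array([18.9, 18.4, 17.3, 16.6])   # last analysis held fixed
climatology = np.array([17.0, 17.0, 16.0, 16.0])

print("skill vs persistence:", skill_score(forecast, persistence, observations))
print("skill vs climatology:", skill_score(forecast, climatology, observations))
```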
Assessment using Class 4 metrics can be distinguished from assimilation diagnostics in several ways. By assimilation diagnostics, we refer to statistics in observation space derived from the 'background minus observation' (i.e. the so-called misfits); from the 'increments' (i.e. the correction applied to the background to obtain the analysis); or from the 'analysis residual' (i.e. analysis minus observations). For most assimilation systems, there is a pre-processing of Global DAC (GDAC) data through editing, filtering or thinning, in order to limit the number of assimilated observations. This is done (1) because of a limit on the maximum number of observations that can be used by the assimilation scheme (computational requirements), (2) because some observations are considered a priori redundant (i.e. thinning provides a means to avoid having to consider correlated observation error) or (3) because the observations include features, dynamical processes and scales not represented by the forecasting system. 'Super-obbing' can also be used in this assimilation pre-processing. In any case, the associated assimilation statistics and metrics are often based on a reduced number of observations. This is not the purpose of the Class 4 metrics: ideally, all 'good' data from the GDAC can be compared with the OOFS fields, and can thereby measure the accuracy, the forecasting skill and the scales not represented by the OOFS. Thus, this approach is not OOFS dependent (i.e. it does not depend on the way observations are assimilated, the ocean model gridding, etc.). The same observation can be used in the evaluation of several OOFSs. That is, provided the reference data are independent, exact inter-comparison is possible, considering a specific ocean process or parameter.
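The assimilation diagnostics defined above, and the reduction in observation count caused by pre-processing, can be illustrated with the toy example below. It assumes a simple form of super-obbing (averaging all observations that fall in the same model grid cell); the grid, the observation values and the background and analysis values are hypothetical.

```python
import numpy as np

# Hypothetical observations along one dimension (e.g. longitude in degrees)
obs_lon = np.array([0.1, 0.4, 0.6, 1.2, 1.3, 2.7])
obs_value = np.array([10.0, 10.4, 10.2, 11.0, 11.4, 12.0])
cell_edges = np.array([0.0, 1.0, 2.0, 3.0])  # three model grid cells
n_cells = cell_edges.size - 1

# Super-obbing: one averaged observation per grid cell.
# Every cell holds at least one observation here; empty cells would need masking.
cell_index = np.digitize(obs_lon, cell_edges) - 1
counts = np.bincount(cell_index, minlength=n_cells)
super_obs = np.bincount(cell_index, weights=obs_value, minlength=n_cells) / counts

# Hypothetical model values interpolated to the super-observation locations
background = np.array([10.5, 11.0, 11.6])
analysis = np.array([10.3, 11.1, 11.9])

misfit = background - super_obs     # 'background minus observation'
increment = analysis - background   # correction applied to the background
residual = analysis - super_obs     # 'analysis minus observation'

print("observations used:", super_obs.size, "of", obs_value.size)
print("misfit   :", misfit)
print("increment:", increment)
print("residual :", residual)
```

The three diagnostics are linked by residual = misfit + increment. A Class 4 comparison would instead retain all six original observations.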
The 'Class 4' strategy is now in place in several OOFSs, for global (Blockley et al. 2012) or regional assessment (Maraldi et al. 2013). In the framework of the GOV IV-TT, the Class 4 metrics project aims to stimulate the inter-comparison of OOFSs by verifying different aspects of ocean processes captured by the available observations in real time. A near-real-time intercomparison activity has been ongoing since January 2013. Five OOCs are involved, and six OOFSs are compared, looking at SST, T/S at depth and SSH, with sea ice concentration in preparation. A companion paper (Ryan et al. 2015) presents the global inter-comparison performed over basin-scale areas. This exercise also allows each of the partners to assess more carefully the forecast capability of every OOFS in its region of interest and to measure the efficiency of each system. The Australian group has performed this multi-system assessment regionally, as presented in a second companion paper (Divakaran et al. 2015).
Based on the statistics of the comparison with the same observations, this Class 4 assessment allows the following questions to be addressed:
• What is the relative reliability of each system for a given parameter in near-real-time?
• What is the performance of each system in forecast mode (5 days ahead)?
• What is the added value of the system compared with climatology or persistence?
• What benefits could be obtained through an ensemble approach, compared with each individual system?
As part of the Class 4 intercomparison, new metrics and new graphical representations have been proposed to better synthesize the information. For instance, radar charts provide the score of each system at different forecast lead times, for all parameters evaluated (Figure 12). Note that the term 'hindcast' is used here to cover analysis, nowcast, hindcast or 'best estimate'. Depending on the details of its real-time operational assimilation scheme, each centre provides what it considers to be its 'best field' in near-real-time with minimum delay. Scores are defined as the RMSE of the differences between observed and model values for each parameter, normalized by the largest RMSE. Reference observations are fully described in the companion paper (Ryan et al. 2015). Using this approach, one can characterize the relative score of each OOFS for each parameter. Missing parameters are not problematic: SLA is not evaluated for RTOFS (Ryan et al. 2015), and the radar chart can still be used for plotting scores from the other four systems. Moreover, this approach allows us to assess specific features, such as resolution, by comparing the global eddy-permitting (PSY3, ¼°) and eddy-resolving (PSY4, 1/12°) Mercator Océan OOFSs. The radar charts indicate that PSY3 skill scores are always better than PSY4 scores, even for SLA. It is worth noting that PSY3 and PSY4 are run in parallel every day at Mercator Océan, using the same forcing fields and assimilating the same set of observations. One may therefore ask whether the observations used to derive the Class 4 metrics are capable of assessing the eddy-resolving capability of the PSY4 system, and whether this Class 4 metric is able to infer the mesoscale predictive capabilities of these global high-resolution systems. In this case, SLA assessment could be performed using along-track satellite observations filtered differently, in order to capture more mesoscale features.
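The normalization behind the radar-chart scores can be sketched as follows. The system names and RMSE values are hypothetical, and the sketch assumes that 'the largest RMSE' means the largest value among the systems for a given parameter; a missing parameter is represented by NaN.

```python
import numpy as np

systems = ["system_A", "system_B", "system_C"]   # hypothetical names
parameters = ["SST", "SLA", "T", "S"]

# Hypothetical RMSE values; rows are systems, columns are parameters.
# NaN marks a parameter that is not evaluated for a given system.
rmse = np.array([
    [0.60, 0.08, 0.90, 0.20],
    [0.75, np.nan, 1.10, 0.25],
    [0.50, 0.10, 0.80, 0.22],
])

# Normalize each parameter by the largest RMSE among the systems,
# ignoring missing entries so the remaining systems can still be plotted
largest = np.nanmax(rmse, axis=0)
normalized = rmse / largest

for name, row in zip(systems, normalized):
    print(name, dict(zip(parameters, np.round(row, 2))))
```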
Similarly, model SST could be compared with the highest resolution and most reliable SST products provided in near-real-time by DACs.
Divakaran et al. 2015), or Taylor diagrams. Note that for the Australian regional seas assessment (Divakaran et al. 2015), this diagram (figure 8 from Divakaran et al. 2015) also contains shaded values of the skill score, as defined by Taylor (2001). This score merges the accuracy (RMSE) and the pattern (correlation) evaluation of the parameter variability. Note also that this assessment of vertical parameters (here T, S) is presented separately for biases (consistency assessment), RMSE (quality, or accuracy assessment) and anomaly correlation (pattern of the variability). This allows OOCs to measure at which depth, and for which water masses, the OOFS is reliable. At this stage, Class 4 metrics are univariate, but alternatively, these metrics can be used in more 'ocean oriented' figures such as T-S diagrams (Figure 13). For this T-S diagram only hindcasts (i.e. not forecasts, persistence or climatology) are plotted together with the observed values, for the sake of clarity. Figure 13 shows that both PSY3 and PSY4 systems present inaccurate dense waters at depth, while the rest of the water column is qualitatively well represented for this 3-month period.
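A skill score of the kind defined by Taylor (2001) can be sketched as below, using the form S = 4(1 + R) / [(σ + 1/σ)² (1 + R0)], where R is the correlation, σ is the ratio of model to observed standard deviation and R0 is the maximum attainable correlation. Whether Divakaran et al. (2015) use this form or a variant is not stated here, so it should be read as an assumption; the sample values are invented.

```python
import numpy as np

def bias(model, obs):
    """Mean model-minus-observation difference."""
    return float(np.mean(model - obs))

def rmse(model, obs):
    """Root-mean-square model-minus-observation difference."""
    return float(np.sqrt(np.mean((model - obs) ** 2)))

def taylor_skill(model, obs, r0=1.0):
    """Skill score combining correlation and normalized standard deviation.
    r0 is the maximum attainable correlation (assumed to be 1 here)."""
    r = np.corrcoef(model, obs)[0, 1]
    sigma_hat = np.std(model) / np.std(obs)
    return float(4.0 * (1.0 + r) / ((sigma_hat + 1.0 / sigma_hat) ** 2 * (1.0 + r0)))

# Hypothetical co-located temperatures (degrees C)
obs = np.array([10.0, 12.0, 14.0, 16.0])
biased_model = obs + 0.5          # right pattern and amplitude, constant offset
amplified_model = 2.0 * obs       # right pattern, doubled variability

print("biased    : bias", bias(biased_model, obs),
      "skill", taylor_skill(biased_model, obs))
print("amplified : rmse", rmse(amplified_model, obs),
      "skill", taylor_skill(amplified_model, obs))
```

A constant offset leaves this score at 1, since it depends only on correlation and variance. This is one reason to report biases separately from pattern-based scores, as is done in the assessment described above.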
Inter-comparison of several OOFSs using Class 4 metrics also allows OOCs to address the added value that using an ensemble approach might bring. Ryan et al. (2015) show that the ensemble mean outperforms individual OOFS scores in most cases. Interestingly, in their Figure 6, they propose a synthetic global view of the most reliable OOFS for the four parameters tested. Other new approaches mentioned above (IBI OOFS) involve inter-comparing the same diagnostic for several OOFSs in order to show the added value of regional versus global, free model simulation versus assimilation or old versus new system estimates (Figure 7).

Summary
Significant progress has been made in ocean-model skill assessment during the last 5-10 years. Under the constraints of real-time operation, many forecasting centres have implemented more mature validation and performance-assessment procedures. The most advanced examples are operationally integrated, modular and able to use any available reference dataset. Based on a large number of metrics, they permit a diverse validation strategy: (1) comparing old and new systems to measure potential improvements and degradations; (2) comparing coarse resolution 'father' and nested high-resolution 'son' systems to quantify the added value of downscaling; (3) comparing adjacent or overlapping systems to verify the consistency of adjacent forecasts; (4) multi-model comparison to better characterize model error growth using different systems running in parallel; and (5) ensemble approaches to assess the benefit of ensemble versus individual system estimates.
Real-time assessments suffer from limitations imposed by observation availability and quality, as many high-quality reference datasets can only be used off-line, meaning that routine skill monitoring is less effective. To avoid spurious effects from erroneous real-time data (for assimilation or validation), quality checking and control of input information (observations, forcing fields) is performed by most OOFSs. Moreover, the systematic feedback of quality control information and observation 'blacklists' to providers is starting to be integrated into OOFSs. More complex metrics that are better suited to assessing physical, ecosystem and biogeochemical forecast processes are being progressively adopted in operational centres. Multivariate metrics now complement univariate techniques in order to extend parameter-oriented assessments to full ocean process evaluations. In parallel, graphical tools including Taylor diagrams, target diagrams and radar charts are used to provide a more synthetic quantification of model skill. Additional user-oriented metrics are also being developed, complementing the basic assessment of OOFSs with more detailed information about skill for specific applications.
Operational ocean forecasting systems are evolving toward higher horizontal resolution and eddy-resolving capability, and offer finer mesoscale representation. By comparison, AVISO SSH or Reynolds SST L4 mapped products, for instance, offer only 50-100-km resolution. Hence, these products are no longer suitable for evaluating 5-km-resolution global eddy-permitting OOFSs. For regional and coastal OOFSs providing sub-mesoscale description, this issue is even more crucial. Their evaluation using the existing observing system raises new questions: are the metrics currently used reliable, and do they provide pertinent information?
The L4 observation-based products provided by operational DACs, and their evaluation, also have to be considered carefully. First, these products can be used directly by the scientific community or other users instead of model-based products. Second, many OOFS validation procedures rely on these products and can be deficient if the products are erroneous.
Finally, multi-model inter-comparison and ensemble approaches offer several potential benefits. For example, forecast spread can be used for forecast error evaluation and is particularly efficient if individual model errors are not correlated (e.g. for models using different forcing). In many studies, ensemble estimates are seen to benefit from the qualities of each individual OOFS and to reduce errors. With the initiatives carried out by the GOV IV-TT, operational oceanography is following a strategic path similar to that of the weather-forecast community 30 years ago, the goal being to routinely exchange information among OOFSs in a multi-model framework, and to enhance both system predictability and skill assessments, for the eventual benefit of OOFS users.
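The remark that forecast spread is most informative when member errors are uncorrelated can be illustrated with a synthetic ensemble. In the sketch below, the ensemble size, the error amplitudes and the 'shared' error term (standing in for, e.g., a common forcing error) are all assumptions made for the example.

```python
import numpy as np

def spread_and_rmse(members, truth):
    """Mean ensemble spread and RMSE of the ensemble mean against truth."""
    spread = float(members.std(axis=0, ddof=1).mean())
    error = float(np.sqrt(np.mean((members.mean(axis=0) - truth) ** 2)))
    return spread, error

rng = np.random.default_rng(2)
n_members, n_points = 10, 500
truth = np.zeros(n_points)

# Independent errors differ from member to member; the shared error is
# common to all members (for instance, systems driven by the same forcing)
independent = rng.standard_normal((n_members, n_points))
shared = rng.standard_normal(n_points)

spread_ind, rmse_ind = spread_and_rmse(truth + independent, truth)
spread_cor, rmse_cor = spread_and_rmse(truth + independent + shared, truth)

print("uncorrelated errors: spread", spread_ind, "ensemble-mean RMSE", rmse_ind)
print("correlated errors  : spread", spread_cor, "ensemble-mean RMSE", rmse_cor)
```

The shared error cancels in the spread but not in the ensemble-mean error, so in the correlated case the spread no longer reflects the actual error.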