Northern North Atlantic Sea Level in CMIP5 Climate Models: Evaluation of Mean State, Variability, and Trends against Altimetric Observations

The northern North Atlantic comprises a dynamically complex area with distinct topographic features, making it challenging to model oceanic features with global climate models. As climate models form the basis for assessment reports of future regional sea level rise, model evaluation is important. In this study, the representation of regional sea level in this area is evaluated in 18 climate models that contributed to phase 5 of the Coupled Model Intercomparison Project. Modeled regional dynamic height is compared to observations from an altimetry-based record over the period 1993–2012 in terms of mean dynamic topography, interannual variability, and linear trend patterns. As models are expected to reproduce the location and magnitude but not the timing of internal variability, the observations are compared to the full 150-yr historical simulations using 20-yr time slices. This approach allows one to examine modeled natural variability versus observed changes and to assess whether a forced signal is detectable over the 20-yr record or whether the observed changes can be explained by internal variability. The models perform well with respect to mean dynamic topography. However, model performances degrade when interannual variability and linear trend patterns are considered. The modeled regionwide average steric and dynamic sea level rise is larger than estimated from observations, and the marked observed increase in the subpolar gyre is not consistent with a forced response but rather a result of internal variability. Using a simple weighting scheme, it is shown that the results can be used to reduce uncertainties in sea level projections.


Introduction
Global sea level is rising and will continue to do so in the future. Regional sea level changes have a direct impact on infrastructure, population, and coastal ecosystems. Yet, regional sea level changes deviate largely from the global mean, and their simulation requires complex and computationally expensive atmosphere-ocean global climate models (AOGCMs) that form the basis for the projection of regional sea level changes (Church et al. 2013). Quantifying future regional sea level change and its impacts is pivotal for coastal communities in order to adapt to rising sea levels. Because of the presence of unforced internal variability, climate model projections become increasingly uncertain when going to smaller spatial and temporal scales as the dominating processes become more complex and the need for simplifying parameterizations due to computational constraints impedes the accurate representation of local processes. It is therefore important to evaluate the performance of models on a regional scale with available observations.
Supplemental information related to this paper is available at the Journals Online website: https://doi.org/10.1175/JCLI-D-17-0310.s1.
Corresponding author: Kristin Richter, kristin.richter@uibk.ac.at
This article is licensed under a Creative Commons Attribution 4.0 license (http://creativecommons.org/licenses/by/4.0/).
For the dynamic sea level contribution to sea level changes, Landerer et al. (2014) evaluated the performance of climate model simulations for the global ocean with a focus on the Southern Ocean and equatorial regions where biases of the mean state were largest. In particular, because of large observed rates of sea level rise, the tropical Pacific Ocean has received considerable attention (e.g., Meyssignac et al. 2012; Palanisamy et al. 2015) compared to other regions such as the northern North Atlantic. We direct our attention to the latter region for several reasons: 1) the sea level in the Atlantic subpolar gyre has been rising over the altimetric record (e.g., Häkkinen et al. 2013), which is associated with a weakening of the gyre circulation; 2) the Nordic seas and subpolar North Atlantic are surrounded by glaciated landmasses and, through gravitational adjustment, regional sea level trends are expected to be affected by changes in land-based ice masses in a way that counteracts the global mean effect; 3) the additional freshwater input to the ocean is suspected to already have an impact on ocean dynamics (Rahmstorf et al. 2015), although recent research shows that the impact of meltwater runoff is still small (Böning et al. 2016); and 4) freshwater input is not yet routinely implemented in AOGCMs used for climate projections. Given these potential weaknesses, and the fact that the region is surrounded by densely populated coasts where the models form the basis for numerous national assessment reports (e.g., Simpson et al. 2015; Grinsted et al. 2015), it is particularly important to evaluate the models here. For the Norwegian coast, AOGCMs have been evaluated earlier (Simpson et al. 2014); however, for the abovementioned reasons, we wish to focus on the whole northern North Atlantic region.
Recent studies attempted to quantify the magnitude of internal variability in climate model simulations in order to assess the time of emergence of an externally forced signal in regional sea level (Lyu et al. 2014; Richter and Marzeion 2014; Bilbao et al. 2015). Richter and Marzeion (2014) found that it takes at least 30 years in the area of interest for an externally forced signal to emerge from the noise when considering a period starting in 1990 (the approximate advent of satellite altimetry). Furthermore, from an observational point of view, Dangendorf et al. (2014b) found that sea level data from tide gauges exhibit significant decadal and multidecadal correlations independent of any systematic rise. It is therefore not clear whether a forced signal in sea level is already detectable in the northern North Atlantic and Nordic seas or whether observed changes over the altimetric period are caused by internal variability.
Global observational coverage is available for the past few decades (1993-2012 is used in this study). As models are expected to reproduce the locations and magnitudes but not the timing of internal variability, the observations have to be compared to the full historical simulations (1850-2012) of sea level. In this way, it can be assessed whether a forced signal is already detectable (i.e., the observed changes are reproduced by all models in the observational period), whether the observed changes show similarity with modeled internal variability (i.e., the observed changes are reproduced in random 20-yr periods), and whether the models are at all able to simulate variability similar to the observed.
We will focus on the northern North Atlantic region and address the following questions: Is the mean dynamic topography correctly simulated? Are the magnitude and location of observed sea level variability reproduced? Are observed sea level trends consistent with modeled trends over the same period (and therefore forced) or more likely due to internal variability? In section 2 we describe the model data as well as the observations and explain how we will compare the two datasets in order to evaluate the simulated sea level. The results are presented in section 3 for the three variables that are being investigated, namely, mean dynamic topography, regional interannual sea level variability, and regional linear sea level trends. The results are discussed in section 4, followed by the conclusions in section 5.

Data and methods

a. CMIP5 model output
We use output from 18 climate models (Table S1 in the supplemental material) participating in phase 5 of the Coupled Model Intercomparison Project (CMIP5; Taylor et al. 2012) to compare modeled sea level with observations with respect to the mean state, interannual variability, and linear trend patterns. The first realization of each model (r1i1p1) is used. We combine the sea level above the geoid ("zos" in CMIP5 terminology) with the global mean (thermo)steric sea level change ("zostoga" or "zossga"). Most but not all of the models conserve volume rather than mass; that is, a net increase in seawater temperature does not necessarily lead to a global mean sea level rise in zos in these models. This effect is instead computed from the density fields and represented by zossga or zostoga. For consistency, we therefore remove the global mean from the spatial fields of zos at every time step prior to adding the global mean steric change. The resulting variable comprises the regional dynamic and steric sea level change as well as the global mean steric change.
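The combination of zos and the global mean steric change described above can be sketched as follows. This is a minimal illustration and not the authors' processing code: the function name, the array layout (time, lat, lon), the use of NaN to mark land, and the cell-area weighting of the global mean are all assumptions.

```python
import numpy as np

def combine_zos_and_steric(zos, zostoga, area):
    """Remove the global mean from zos and add the global mean steric change.

    zos     : (time, lat, lon) sea level above the geoid, NaN over land (assumed layout)
    zostoga : (time,) global mean (thermo)steric sea level change
    area    : (lat, lon) grid-cell areas used as weights (assumed weighting)
    """
    ocean = np.isfinite(zos)
    weights = np.where(ocean, area, 0.0)        # zero weight over land
    zos_filled = np.where(ocean, zos, 0.0)
    # Area-weighted global mean of zos at every time step
    global_mean = (zos_filled * weights).sum(axis=(1, 2)) / weights.sum(axis=(1, 2))
    # Land cells stay NaN because zos itself is NaN there
    return zos - global_mean[:, None, None] + zostoga[:, None, None]
```

By construction, the area-weighted global mean of the returned field equals zostoga at each time step, so the result carries the regional dynamic and steric pattern plus the global mean steric change.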
The historical simulations used include all known climate forcings (Slangen et al. 2015). With some exceptions, they cover the period 1850-2005. They are extended up to 2012 using the representative concentration pathway 4.5 (RCP4.5) scenario (Van Vuuren et al. 2011). However, the choice of scenario is not critical over this short period as the scenarios only start to diverge in the second half of the twenty-first century. The climate drift (long-term trends in the absence of an external forcing; Sen Gupta et al. 2013) is accounted for by removing the linear trend found in the control simulations for all variables. Since we are eventually interested in multidecadal variability, annual data are computed.
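A drift correction of this kind can be sketched as below. It is only a sketch under stated assumptions: the function name and arguments are hypothetical, the control-run trend is estimated by ordinary least squares at each grid point, and the historical run is assumed to inherit that same linear drift from its first year onward.

```python
import numpy as np

def remove_drift(hist, control, years_hist, years_ctrl):
    """Subtract the linear trend found in the control run from the historical run.

    hist, control : arrays with time as the first axis and identical trailing shape
    years_hist, years_ctrl : 1D arrays of years for each run
    """
    # Least-squares slope of the control run at every grid point
    t_c = years_ctrl - years_ctrl.mean()
    flat = control.reshape(control.shape[0], -1)
    slope = (t_c[:, None] * (flat - flat.mean(axis=0))).sum(axis=0) / (t_c ** 2).sum()
    slope = slope.reshape(control.shape[1:])
    # Elapsed time since the start of the historical run, broadcast over space
    dt = (years_hist - years_hist[0]).reshape((-1,) + (1,) * (hist.ndim - 1))
    return hist - slope * dt
```

Only the slope is removed here; any constant offset between runs is left untouched, which does not affect variability or trend statistics.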
The grid size of the models varies widely in the area (Table S1), from the comparatively dense grid of 0.9° × 0.5° (longitude × latitude) or finer for the MPI-ESM-LR, CCSM4, and NorESM models to the coarser grid of 2.5° × 1.2° for the CMCC-CMS model. (Expansions of acronyms are available online at http://www.ametsoc.org/PubsAcronymList.) For model-observation intercomparison, the processed data are regridded to a regular grid of 1° × 1°.

b. Sea level anomalies and mean dynamic topography from observations
Sea surface height (SSH) is defined as the height of the ocean surface above the reference ellipsoid while sea level anomalies are the deviations of the instantaneous SSH from a reference mean sea surface (MSS). In this study, we use sea level anomalies to investigate variability and trends (and refer to them as sea level). Monthly fields of sea level for the period 1993-2012 are obtained from the European Space Agency (ESA) Climate Change Initiative (CCI) project (ESA-CCI; Ablain et al. 2015, 2017). Annual means for the 20-yr time period are then calculated from the monthly data. Through geostrophy the ocean surface circulation relates to the ocean's mean dynamic topography (MDT), which yields the long-term average strength of the ocean currents (e.g., Knudsen et al. 2011). The precise knowledge of the geoid height, together with MSS, known within centimeter accuracy (Schaeffer et al. 2012), enables us to compute the MDT of the ocean using (e.g., Raj 2017):
$$\mathrm{MDT} = \mathrm{MSS} - \mathrm{geoid}. \tag{1}$$
The geoid is the equipotential surface of Earth's gravity field. More accurately, it is the sea surface in the absence of winds, tides, and currents, only influenced by gravity (e.g., Raj 2017). The Gravity Field and Steady-State Ocean Circulation Explorer (GOCE) High-Level Processing Facility (HPF) models Earth's geopotential as a truncated spherical harmonic expansion in the spectral domain (Bruinsma et al. 2014). Using the gravity field coefficients from the level-2 global gravity models provided by HPF, the geoid height can be determined as detailed in previous studies (e.g., Johannessen et al. 2003; Jin et al. 2014; Raj 2017). In our study, we use the GOCE user toolbox (GUT; Benveniste et al. 2007) to estimate the geoid height from the gravity models and the National Space Institute of the Technical University of Denmark (DTU Space) 2013 (DTU13) MSS model (Andersen et al. 2015) to compute MDT.
Even though conceptually simple, the computation of MDT needs to satisfy important a priori conditions (Benveniste et al. 2007): both the geoid height and the MSS used should be referenced to the same reference ellipsoid and estimated on the same tide system. The reference ellipsoid, a sphere flattened at its poles, is an arbitrary reference surface that is a raw approximation of Earth's shape. Here, similar to the DTU13 MSS, the geoid height is estimated in the mean tide system relative to the TOPEX ellipsoid. The TOPEX ellipsoid is the first-order definition of the nonspherical shape of the Earth as an ellipsoid of revolution with an equatorial radius of 6378.1363 km and a flattening coefficient of 1/298.257 (Tapley et al. 1994). In the mean tide system, the effects of the permanent tides are included in the definition of the geoid, in contrast to the zero tide system where they are excluded. Further, the noise in the MDT due to inconsistencies in the resolution of MSS and geoid is removed by a Gaussian filter (90 km).
Both maps of sea level and mean dynamic topography, initially on a 0.25° × 0.25° grid, have been remapped to a 1° × 1° grid for comparison with the model data (section 2a). Since MDT is the sea surface relative to the geoid, and the climate models have no geoid, the observed MDT is comparable to the modeled mean sea surface.
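As an illustration of Eq. (1) and of the remapping step, the sketch below computes MDT from gridded MSS and geoid fields and coarsens a 0.25° field to 1° by averaging 4 × 4 blocks. The remapping method is not stated in the text, so the block average is an assumption, as are the function names. The 90-km Gaussian filter is not included.

```python
import numpy as np

def mean_dynamic_topography(mss, geoid):
    """Eq. (1): MDT = MSS - geoid, both on the same grid, ellipsoid, and tide system."""
    return mss - geoid

def coarsen(field, factor=4):
    """Block-average a 2D (lat, lon) field by `factor` in each direction, ignoring NaNs.

    Assumes both dimensions are divisible by `factor` (e.g., 0.25 deg -> 1 deg).
    Blocks that contain only NaN (land) yield NaN.
    """
    nlat, nlon = field.shape
    blocks = field.reshape(nlat // factor, factor, nlon // factor, factor)
    return np.nanmean(blocks, axis=(1, 3))
```

A plain block mean ignores the slight change of cell area with latitude inside each 1° box, which is small at this resolution but would matter for coarser targets.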

c. Contribution from land-based ice and GIA
The research area (Fig. 1) is surrounded by (partly) ice-covered land and therefore located in the near-field of potential land-ice changes. The mass exchange between land and ocean affects the gravitational field, that is, the geoid, and thus the shape of the sea surface. The effect on regional sea level gradients is largest close to changing land ice (e.g., Mitrovica et al. 2001). In addition, the altimetry-based sea level observations include the change of the geoid due to the ongoing response of the viscoelastic Earth to the last deglaciation [glacial isostatic adjustment (GIA); Tamisiea 2011]. The CMIP5 models are neither coupled to land-ice models nor do they have geoids that could change, so sea level changes originating from past as well as present changes in land ice have to be removed from the observations prior to comparing observed and modeled trends.
The glacier mass change over the observational period is obtained by forcing the glacier model of Marzeion et al. (2012) with temperature and precipitation from gridded climate observations [Climatic Research Unit (CRU) time series; New et al. 2002]. Reconstructions of the model over the twentieth century were shown to be consistent with other reconstructive methods based on observations of glacier length change, mass balance observations, and remotely sensed estimates (Marzeion et al. 2015). The contributions from the Greenland and Antarctic ice sheets are taken from Shepherd et al. (2012). The data include both the contribution from surface mass balance changes as well as dynamical changes of the ice.
To translate the land-ice changes into absolute regional sea level changes, the corresponding fingerprints have been computed by assuming the melt of a uniform ice layer over the glaciated area (e.g., Riva et al. 2010; Slangen et al. 2012, 2014; Perrette et al. 2013) and by computing the induced sea level change after solving the sea level equation (Farrell and Clark 1976) using a pseudospectral approach (Mitrovica and Peltier 1991), including the rotational feedback (Milne and Mitrovica 1998), on a compressible elastic Earth (Dziewonski and Anderson 1981).
To remove the geoid-related GIA signal from the sea level observations, we use the correction from the ICE-5G GIA model as provided by Peltier (2004). As we are interested in absolute sea level changes, only the rate of change of the geoid is used. Figure 2a shows the observed 1993-2012 sea level trends in the region. The average trend over the region is 2.35 mm yr⁻¹, less than the global average of 3.16 ± 0.5 mm yr⁻¹ (Ablain et al. 2015). Sea level change is positive except east and southwest of Svalbard and in a very localized area south of the Iceland Basin. The rise is largest in the subpolar gyre and, to a lesser degree, along the European coast. The contribution from changing glaciers and ice sheets to absolute sea level over the same period (Fig. 2b) is characterized by a sea level decrease around Svalbard and Greenland in accordance with the loss of land ice in these regions. However, compared to observed sea level, changing land ice contributes little to the spatial variability of observed trends whereas the average sea level change over the area due to changes in land ice is significant, contributing 0.49 (glaciers) and 0.43 mm yr⁻¹ (ice sheets), respectively, to the observed regionwide average trend (Fig. S2 in the supplemental material). The same is true for the geoid-related GIA contribution: it contains little spatial variability compared to the observed trend pattern while the corresponding regionwide average presents a significant, notably negative, contribution of −0.31 mm yr⁻¹ (Fig. 2c).
To assess the impact of land-ice melting on regional sea level trends, we removed the contribution from glaciers and ice sheets as well as from GIA from the observed sea level prior to computing linear trends. The result is a reduced regionwide average trend of 1.75 mm yr⁻¹, with few changes in the regional trend pattern (cf. Figs. 2a,d). The pattern is still dominated by the stronger sea level rise in the subpolar gyre around the southern tip of Greenland. The negative trend southwest of Svalbard is slightly enhanced while the sea level rise on the Norwegian shelf, and therefore the across-shelf gradient, is reduced. To be consistent with the models, we use these trends (presented in Fig. 2d) in our trend analysis (section 3c).
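The regionwide budget can be checked with a few lines of arithmetic using only the rounded values quoted in the text (all in mm per year). The variable names are illustrative.

```python
# Regionwide average trends in mm/yr, as quoted in the text
observed = 2.35      # altimetry, 1993-2012
glaciers = 0.49      # contribution from glaciers
ice_sheets = 0.43    # contribution from the Greenland and Antarctic ice sheets
gia_geoid = -0.31    # geoid-related GIA contribution (negative)

# Removing a negative contribution raises the corrected trend
corrected = observed - glaciers - ice_sheets - gia_geoid
print(round(corrected, 2))
```

The rounded inputs give 1.74 mm/yr, which agrees with the reported 1.75 mm/yr to within what one would expect from rounding of the individual terms.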

d. Methods
The sea level observations from satellite altimetry used in this study span the period 1993-2012. From this 20-yr record of annual means, we calculate maps of sea level variability (in terms of temporal standard deviation) and linear sea level trends. These maps, together with maps of MDT, are used for the model-observation analysis. The model simulations cover the period 1850-2012. To account for the random phases of internal variability in the models, we slide a 20-yr window over the annual data and compute maps of MDT, sea level variability, and linear trends for each 20-yr period, thus giving us a total of 143 maps per model per variable, available for comparison with the observations. Linear trends are removed from every 20-yr period prior to calculating variability.
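The sliding-window computation can be sketched as follows. This is a minimal sketch and not the authors' code: the function name and array layout are assumptions, years are assumed to be evenly spaced, trends are fitted by ordinary least squares, and the sample standard deviation (ddof=1) is an arbitrary choice.

```python
import numpy as np

def sliding_window_maps(sl, years, window=20):
    """Mean, detrended standard deviation, and linear trend for every window.

    sl    : (time, lat, lon) annual-mean sea level
    years : (time,) evenly spaced years
    Returns three arrays of shape (time - window + 1, lat, lon).
    """
    n_windows = sl.shape[0] - window + 1
    # Centered time axis; identical for every window when years are evenly spaced
    t = years[:window] - years[:window].mean()
    means, stds, trends = [], [], []
    for i in range(n_windows):
        seg = sl[i:i + window]
        mean = seg.mean(axis=0)
        # Least-squares slope at every grid point
        slope = np.tensordot(t, seg - mean, axes=(0, 0)) / (t ** 2).sum()
        # Variability is computed after removing the linear trend
        resid = seg - mean - slope * t[:, None, None]
        means.append(mean)
        trends.append(slope)
        stds.append(resid.std(axis=0, ddof=1))
    return np.array(means), np.array(stds), np.array(trends)
```

Each returned map can then be compared with the corresponding observed map, so that every model contributes one score per 20-yr period.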
To assess the model performance, we use basic statistical measures: modeled and observed fields are compared by computing the area-weighted pattern correlation coefficients (PCCs) and root-mean-square errors (RMSEs) between modeled and observed maps.
In this case, the PCC is the equivalent of the correlation coefficient between two time series, with the only difference being that the correlation is computed between corresponding locations instead of corresponding points in time (Thomson and Emery 2001). Prior to computing PCCs and RMSEs, the regionwide average is subtracted from all maps. That is, we compute centered statistics to assess the similarity between observed and modeled regional anomalies (e.g., Santer et al. 1995). Note that PCCs are not sensitive to errors in the amplitudes of the modeled patterns. These errors are reflected in the RMSEs. Centered statistics are insensitive to the regionwide average. This is of particular importance for the comparison of observed and modeled trends. In line with the global mean, we expect simulated regionwide average trends to be small in the beginning of the historical record and to grow increasingly positive throughout the modeled period considered here. To take this fact into account, we will also report the total RMSE (RMSE without removing the regionwide average) when presenting results of the trend analysis.
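The centered and total statistics can be sketched as below. The function names are hypothetical, and the use of cos(latitude) as the area weight on a regular grid is an assumption, since the exact weighting is not specified in the text.

```python
import numpy as np

def area_weights(lat_deg, nlon):
    """cos(latitude) weights for a regular (lat, lon) grid (assumed weighting)."""
    return np.cos(np.deg2rad(lat_deg))[:, None] * np.ones((1, nlon))

def weighted_mean(x, w):
    return (x * w).sum() / w.sum()

def centered_stats(model, obs, w):
    """Area-weighted PCC, centered RMSE, and total RMSE between two maps."""
    valid = np.isfinite(model) & np.isfinite(obs)   # skip land / missing cells
    m, o, w = model[valid], obs[valid], w[valid]
    # Centered statistics: remove the regionwide average from both maps
    m_anom = m - weighted_mean(m, w)
    o_anom = o - weighted_mean(o, w)
    pcc = weighted_mean(m_anom * o_anom, w) / np.sqrt(
        weighted_mean(m_anom ** 2, w) * weighted_mean(o_anom ** 2, w))
    rmse_centered = np.sqrt(weighted_mean((m_anom - o_anom) ** 2, w))
    # Total RMSE keeps the bias in the regionwide average
    rmse_total = np.sqrt(weighted_mean((m - o) ** 2, w))
    return pcc, rmse_centered, rmse_total
```

With these definitions the squared total RMSE equals the squared centered RMSE plus the squared difference of the regionwide averages, which is why a model with a perfect pattern but a biased mean trend scores well on the centered measures only.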
As mentioned, we account for the different phases of internal variability in models and observations by comparing the observations with modeled fields from all 20-yr windows. Since, on a time scale of only two decades, internal variability is expected to be strong in both models and observations, we anticipate some variability in the PCCs and RMSEs as we slide the window across the model record. The approach presents us with several possibilities to compute multimodel means: (i) by taking the mean over the observational period (1993-2012); (ii) by selecting, for each model, the period with maximum PCC with the observed field and taking the mean over these model maps (max corr); or (iii) as in (ii), but for periods of minimum RMSE (min RMSE). In the following, we will present results for ensemble means computed following methods (i) and (ii), as results for ensemble (iii) are similar to those of ensemble (ii). Note that multimodel means are formed by first computing maps of MDT, variability, and linear trends for each model, and subsequently averaging over these fields. In the same manner as for individual maps (by computing PCCs and RMSEs), we compare the three different ensemble means with observations. This enables us to assess whether the observed regional trend patterns are due to internal variability or externally forced. In the latter case, we expect a good agreement not only between models and observations but also between the individual models over the observational period (1993-2012).
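Ensemble means (i) and (ii) can be sketched as follows, given per-model maps for all 20-yr windows and the PCC of each window with the observed map. The function name and the dictionary layout are hypothetical, and the observational period is assumed to be the last window of each model.

```python
import numpy as np

def ensemble_means(maps_by_model, pcc_by_model, obs_index=-1):
    """Multimodel means over (i) the observational period and (ii) max-PCC periods.

    maps_by_model : dict, model name -> (n_windows, lat, lon) maps
    pcc_by_model  : dict, model name -> (n_windows,) PCC with the observed map
    obs_index     : index of the window matching the observational period
                    (assumed to be the last window)
    """
    # (i) one map per model, all taken from the same calendar period
    obs_period = [maps[obs_index] for maps in maps_by_model.values()]
    # (ii) one map per model, each from that model's best-correlated period
    max_corr = [maps_by_model[name][int(np.argmax(pcc_by_model[name]))]
                for name in maps_by_model]
    return np.mean(obs_period, axis=0), np.mean(max_corr, axis=0)
```

Ensemble (iii) follows by replacing the argmax over PCC with an argmin over RMSE. Note that ensemble (ii) is selected to resemble the observations, so its agreement with them indicates what internal variability can produce rather than an independent validation.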

Results

a. Mean dynamic topography
The mean dynamic topography determines the surface circulation of the oceans and is thus crucial not only for heat and salt transport, but also directly for sea level changes, particularly along the margins of the oceans. Observation-based MDT is shown in Fig. 3a with the regionwide average removed for easier comparison with modeled MDT. The cyclonic circulation in the Nordic seas is visible by local minima in the basins of the Nordic seas and elevated MDT along the surrounding coasts of Greenland and northern Europe. The even more pronounced and larger minimum south of Greenland represents the subpolar gyre. The North Atlantic Current, the inflow to the Nordic seas, and the Norwegian Atlantic Current are represented by the northwest-southeastward sea level gradient in the southeastern half of the map, in particular along the continental slope (see Fig. 1).
Figure 4 shows the RMSE of the MDT fields for each model and the PCC for the observational period together with the corresponding ranges that represent the spread of RMSE and PCC over all 20-yr periods. The PCCs are large (around 0.9) for seven models with corresponding low RMSEs (~0.1 m). These models capture the western extent of the subpolar gyre correctly (Fig. S2). Models with PCC < 0.6 (GFDL-ESM2G, MRI-CGCM3, and BCC_CSM1.1) and a mean RMSE [...] strength and location vary. The spatial variability of the multimodel mean over the observational period (Fig. 3b) is slightly smaller than the observed variability. The ensemble mean difference between modeled and observed MDT over the observational period is shown in Fig. 3c. The most pronounced feature is a systematic underestimation of the strength of the subpolar gyre in the Labrador Sea. There is also a significant lack of regional details in the North Atlantic Current in the Norwegian Sea.
FIG. 3. [...] (c) Ensemble mean of differences between models and observations for the observational period. Black, gray, and white contours in (b) and (c) represent signal-to-noise ratio of 1, 1.5, and 2, respectively, defined as the ratio of ensemble mean and ensemble standard deviation of the regional anomalies. The numbers above each map represent the regionwide average and root-mean-square, respectively.
Generally, the lower the PCC is, the higher the RMSE is (Fig. 4). The RMSE is of the same order of magnitude as found in Landerer et al. (2014) for the entire world ocean over the period 1993-2002. The spread of RMSEs and PCCs derived from all available 20-yr periods is small for most models, indicating that internal variability has little effect on modeled MDT patterns. The exception is GFDL-ESM2G, where the RMSE increases gradually and the PCC decreases gradually throughout the historical simulation (not shown). PCCs are highest (>0.9) with corresponding low RMSEs for the multimodel mean. This applies to any 20-yr period over which MDT is calculated: the observational period, the period of minimum RMSE, and the period of maximum pattern correlation (Fig. 4).

b. Variability
The observed interannual sea level variability (Fig. 5a) exhibits local maxima in the Lofoten Basin, as well as in the Iceland Basin and Irminger Sea at the eastern rim of the subpolar gyre. In contrast, observed interannual variability is weak in the western branch of the subpolar gyre and on the shallow European shelf from the British Isles along the Norwegian coast up to the entrance to the Barents Sea. In the coastal areas of the North Sea, however, variability is stronger.
Models are designed to reproduce the main climate modes in terms of amplitude and location. However, the phase and timing of simulated variability are not necessarily the same as those of the observed variability. Therefore, we do not expect the models to reproduce the observed variability over the observational period, and the RMSEs and the PCCs are expected to vary when comparing all available modeled 20-yr periods with the observations. This is indeed the case (Fig. 6). The range of PCCs is much larger than the equivalent range for the MDT (Fig. 4). RMSEs are on the order of 1-2 cm, with some models showing a spread as large as 1 cm. The PCCs are low (between −0.1 and 0.5), and some models display both negative and positive correlation coefficients. The PCCs for the observational period are not systematically in the higher end of the range, indicating that the observed variability is not in phase with the variability in the forced simulations and thus indeed unforced. In the same way as for MDT, the ensemble means of variability for the observational period and periods of minimum RMSE and maximum PCC, respectively, show smaller RMSEs and larger PCCs than any of the individual models. Figure 5b shows the modeled variability as the multimodel mean over the periods of maximum pattern correlation of the individual models. These periods are different for each model as the internal variability is random and not in phase between the individual models. The mean difference averaged over all models is 0.21 cm (not included in Fig. 5c), indicating that models overestimate the variability in this region. The multimodel mean variability shows enhanced variability away from the coasts in accordance with the observations (Figs. 5a,b). However, the pattern is very smooth and does not reproduce the finer spatial features. This is partly due to the ensemble averaging: when looking at individual models (Fig. S3 in the supplemental material), local details like the enhanced variability in the central Nordic seas are evident in most models (except IPSL-CM5A-MR). The strength and location of these details, however, differ from model to model. The relatively good statistical performance of the ensemble mean as shown in Fig. 6 is therefore due to smoothing of local features, and the performance comes from good reproduction of large-scale features. However, a systematic underestimation of variability in the Lofoten Basin is identified, as well as an overestimation of the strength of the variability in the southern Iceland Basin (Fig. 5c, at lower right). In the remaining areas, no systematic differences are found.
FIG. 5. Sea level variability from (a) observations and (b) multimodel mean over the periods of maximum pattern correlation. The PCC between those fields is also shown. (c) Difference of modeled and observed variability averaged over all models for the period of maximum pattern correlation. The mean difference has been removed and only anomalies are shown. Black, gray, and white contours represent signal-to-noise ratio of 1, 1.5, and 2, respectively, defined as the ratio of ensemble mean and ensemble standard deviation of the regional anomalies. The numbers above each map represent the regionwide average and root-mean-square, respectively.
FIG. 6. As in Fig. 4, but for results of model-observation comparison of sea level variability.
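The signal-to-noise ratio used in the figure contours, defined as the ratio of ensemble mean to ensemble standard deviation of the regional anomalies, can be sketched as below. The function name is hypothetical; taking the absolute value of the mean and using the sample standard deviation (ddof=1) are assumptions, since the text does not specify either.

```python
import numpy as np

def signal_to_noise(model_anomalies):
    """Ratio of ensemble mean to ensemble standard deviation at every grid point.

    model_anomalies : (n_models, lat, lon) maps with the regionwide average removed
    """
    ens_mean = model_anomalies.mean(axis=0)
    ens_std = model_anomalies.std(axis=0, ddof=1)
    # Magnitude of the common signal relative to the intermodel spread;
    # grid points where all models agree exactly (zero spread) give inf
    return np.abs(ens_mean) / ens_std
```

A ratio above 1 marks locations where the models agree on the sign and rough size of the anomaly, which is how a systematic feature is distinguished from one that averages out across models.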

c. Linear trends
As mentioned, we use the altimetric sea level corrected for contributions from land ice (Figs. 2d and 7a) in the following trend analysis. Linear trends in sea level over the observational period may be unforced, that is, due to internal variability only; externally forced (natural or anthropogenic); or a combination of both. To account for the presence of unforced trends, we apply the same procedure as for the investigation of variability and compare the observed trends with trends from all 20-yr periods available in the models.
Statistical parameters from the observation-model comparison of trend anomalies are shown in Fig. 8. The PCCs for trends span an even wider range (from −0.45 to 0.45) than for variability (cf. Fig. 6), as is to be expected. For most models, PCCs are symmetric around zero. Notable exceptions are GFDL-ESM2G and IPSL-CM5A-MR, with maximum positive PCCs much larger than absolute negative PCCs. The RMSEs are on the order of the observed mean sea level rise itself or higher. For the ensemble means, RMSEs are smaller than for the individual models. For the ensemble mean over the periods of minimum RMSE and maximum pattern correlation, the PCCs exceed 0.6 (0.18 for observational period), indicating that the observed trend pattern is at least partly simulated at some point in time by some of the models. It is also noteworthy that PCCs are not largest and RMSEs not smallest over the observational period (red and black dots in Fig. 8, respectively), hinting toward the importance of internal variability. Note that the RMSE is not the total RMSE but the RMSE of the fields with the regionwide average trend removed. That is, we compare observed and modeled trend anomalies. This way, we account only for spatial variability and not for biases in the regionwide average trend. This is important as the modeled trends, and therefore the total RMSE, are strongly affected by the global average sea level rise. As we expect the observed trends to be at least partly affected by internal variability and therefore correspond to any modeled time period with potentially lower global sea level rise, it is important to take this into account.
FIG. 7. Linear sea level trends over the observational period from (a) observations (as in Fig. 2d) and (b) multimodel mean of modeled trends. The PCC between those fields is also shown. (c) Multimodel mean of the differences between modeled and observed trend anomalies for the observational period. The regionwide average difference that has been removed (as we look at anomalies) is 0.82 mm yr⁻¹. Black, gray, and white contours represent signal-to-noise ratio of 1, 1.5, and 2, respectively, defined as the ratio of ensemble mean and ensemble standard deviation of the regional maps. The numbers above each map represent the regionwide average and root-mean-square, respectively.
The modeled average sea level trends for the observational period range from 0.12 (MRI-CGCM3) to 4.97 mm yr⁻¹ (BCC_CSM1.1; Fig. S4 in the supplemental material) with a multimodel mean of 2.26 ± 1.23 mm yr⁻¹ (one standard deviation), which is larger than the observed average trend of 1.75 mm yr⁻¹. The ensemble mean of modeled regional trends over the observational period is shown in Fig. 7b. Compared to the observations (Fig. 7a), the simulated spatial variability is small (0.84 versus 1.48 mm yr⁻¹). The marked sea level rise in the subpolar gyre that dominates the observed trends is not simulated at all. This indicates that it is a manifestation of internal variability, the response to an external forcing that is not included in the models, or processes that are not properly represented or resolved. Instead, a negative trend in the central North Atlantic is simulated that is also present in the observations but has a signal-to-noise ratio of less than 1. The sea level rise along the shallow shelf areas and west of Greenland is systematic across the models (high signal-to-noise ratio), indicating that it may be a forced signal. However, it should be noted that the relatively high signal-to-noise ratio shown in Fig. 7b is due to the regionwide average sea level trend. Once this trend is removed and only anomalies are considered, the signal-to-noise ratio is less than 1 in the entire area (not shown).
To assess whether the models systematically over- or underestimate observed trends, we compute the multimodel mean of the differences between modeled and observed trend anomalies for the observational period (Fig. 7c). The regionwide average difference is 0.51 mm yr⁻¹, reflecting the overestimation of the sea level rise in the region by the models. The differences shown in Fig. 7c resemble the spatial pattern of the observed trend anomalies (Fig. 7a), and indeed the PCC between the two fields is −0.86.
Compared to the observed trends, the modeled trend field is rather smooth. While a large grid size contributes to a smoother result, this also indicates that the modeled trend anomalies during this period are offsetting each other partially in the ensemble mean and are therefore induced by internal variability rather than forced. This is confirmed when looking at individual models (Fig. S4). While all models simulate a regionwide average sea level rise, the spatial patterns are very different. From this, we conclude that the observed regional sea level trend pattern results from internal variability while the regionwide average positive sea level change is externally forced.
To investigate whether internal variability as simulated by the models reproduces the observed spatial trend pattern, we turn again to the pattern correlation analysis and select, for each model, the 20-yr period with the maximum pattern correlation. The results are shown in Fig. S5 of the supplemental material for each model, and the multimodel mean is presented in Fig. 9b. Indeed, the multimodel mean now displays positive trends in the region of the subpolar gyre with a relatively high signal-to-noise ratio. The PCC of the ensemble mean with the observed anomalies is 0.66. The regionwide average of the multimodel mean (subtracted in Fig. 9b) is 1.32 mm yr⁻¹, indicating that the periods of maximum pattern correlation occur at times of lower regionwide sea level rise in the majority of the models. The regionwide average sea level trend for individual models ranges from −0.68 mm yr⁻¹ for HadGEM2-ES (period of maximum pattern correlation: 1954-73) to 4.45 mm yr⁻¹ for GFDL-ESM2M (1981-2000). For the period of maximum PCC, the multimodel mean difference (Fig. 9c) between modeled and observed anomalies is less systematic than the differences for the observational period (Fig. 7c) with a lower corresponding PCC of −0.62. It appears that a part of the observed signal can be identified as an internal signal.
FIG. 9. Linear sea level trend anomalies (relative to regionwide average) from (a) observations and (b) ensemble mean over periods of maximum pattern correlation. The PCC between those fields is also shown. (c) The multimodel mean of the differences between models and observations. Black, gray, and white contours represent signal-to-noise ratio of 1, 1.5, and 2, respectively, defined as the ratio of ensemble mean and ensemble standard deviation of the regional maps. The numbers above each map represent the regionwide average and root-mean-square, respectively.
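The selection of the 20-yr period with maximum pattern correlation can be sketched as a sliding-window search. This is a minimal illustration under stated assumptions: annual series on a flattened grid, least-squares linear trends, equal grid-point weights, and windows advanced one year at a time. The function names and the synthetic three-point "model" are hypothetical.

```python
import math

def slope(y):
    """Least-squares linear trend of an annual series (units per year)."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def pcc(a, b):
    """Centered pattern correlation with equal grid-point weights."""
    a_mean, b_mean = sum(a) / len(a), sum(b) / len(b)
    da = [x - a_mean for x in a]
    db = [x - b_mean for x in b]
    num = sum(x * y for x, y in zip(da, db))
    return num / math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))

def best_window(series_by_point, obs_trend, window=20):
    """Return (start index, PCC) of the window best matching obs_trend.

    series_by_point: one annual series per grid point, all of equal length.
    """
    n_years = len(series_by_point[0])
    best = None
    for start in range(n_years - window + 1):
        trend_map = [slope(s[start:start + window]) for s in series_by_point]
        r = pcc(trend_map, obs_trend)
        if best is None or r > best[1]:
            best = (start, r)
    return best

# Synthetic 40-yr "model": the middle point rises only during the first
# 20 years, so only the first window reproduces the target pattern.
series_by_point = [
    [float(t) for t in range(40)],
    [0.5 * min(t, 19) for t in range(40)],
    [-float(t) for t in range(40)],
]
obs_trend = [1.0, 0.5, -1.0]

start, r = best_window(series_by_point, obs_trend)
```

In a real application the returned start index would be mapped back to calendar years, and the search would be repeated for each model.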
In the next step, we combine the forced signal that appears to consist of a regionwide sea level rise (Fig. 7b) with the signal that originates from internal variability (Fig. 9b) to obtain the full signal (Fig. 10b). The PCC with the observations is 0.61. The approach implicitly assumes the perfect cancellation of internal variability during the observational period, which cannot be expected to be entirely true due to the different representation of internal variability in the models in terms of strength, location, and frequency. The anomalously high sea level rise in the subpolar gyre is well captured, as is the sea level rise along the east coast of Greenland. The regional pattern of the differences between modeled and observed trend anomalies has a very low signal-to-noise ratio (PCC of −0.42 with observations), except in the Davis Strait, where the models appear to systematically overestimate the change in sea level (Fig. 10c).

Discussion
We have assessed the representation of simulated sea level in terms of its mean state (MDT), interannual variability, and linear trends in 18 CMIP5 climate models by comparing the model output with observations from satellite altimetry. As the altimetric record covers only 20 years, a time span on which internal variability dominates over externally forced signals (Richter and Marzeion 2014), the observations were compared to the entire model record (1850-2012) using 20-yr time slices.
A summary of the statistical results with respect to multimodel means is shown in Table 1 and for individual models in Fig. 11. While the multimodel means represent the observed MDT rather well (PCC > 0.9), the multimodel mean performance degrades with respect to regional interannual variability and trends. This is also true for individual models: the PCCs for MDT can exceed 0.9 for some models, while the maximum PCC is below 0.6 for linear trends and variability.
With the exception of GFDL-ESM2G, the model performance with respect to MDT depends only a little on the chosen period (small spread in RMSE and correlations in Fig. 4), indicating that internal variability has little influence on MDT in a 20-yr period and is of the same order of magnitude as intermodel differences. The most striking intermodel differences are the simulation of the location, strength, and shape of the sea level minimum associated with the subpolar gyre and to some degree the Nordic seas minimum (Fig. S2). Shortcomings in the representation of open ocean circulation can be expected to have consequences for the ability to simulate relevant regional coastal sea level, since it is the transport of steric anomalies toward the shelves and the dynamics of the slope currents that constitute the regional steric/dynamic contribution to coastal sea level change and variability. The larger range in PCCs for variability (Fig. 5) as compared to MDT reflects the importance of internal variability in the area. The finer details (e.g., in the Lofoten Basin and the areas east and west of Reykjanes Ridge) are not captured by the models. When looking at individual models, variability is strikingly overestimated by some models in large parts of the region.
A more detailed analysis is necessary to reveal the reasons for the under- or overestimation of interannual variability (i.e., steric vs dynamic changes).
FIG. 10. Linear trends in sea level from (a) observations (as in Fig. 2d) and (b) combined modeled signal from the observational period (Fig. 7b) and periods of maximum pattern correlation (Fig. 9b). The PCC between those fields is also shown. (c) The multimodel mean differences with observations. Black, gray, and white contours represent signal-to-noise ratio of 1, 1.5, and 2, respectively, defined as the ratio of ensemble mean and ensemble standard deviation of regional anomalies. The numbers above each map represent the regionwide average and root-mean-square, respectively.
Unsurprisingly, we found that linear 20-yr trends are heavily impacted by internal variability and no forced pattern of regional sea level trend could be detected over the observational period, supporting previous studies (Richter et al. 2017). According to our results, the observed increase in sea level in the subpolar gyre appears to be induced by internal variability. This is in line with the results of Häkkinen et al. (2013), who found that the increase is part of Atlantic multidecadal variability mainly forced by wind stress changes. Climate modes involving atmospheric circulation (such as the North Atlantic meridional dipole) have been shown to govern variability in both on-shelf sea level (Richter et al. 2012a; Calafat et al. 2012; Dangendorf et al. 2014a; Chafik et al. 2017) and ocean circulation (e.g., Nilsen et al. 2003; Sandø et al. 2012; Richter et al. 2012b) in the North Atlantic, on time scales from months to decades. Although considered part of natural variability, spatial shifts and changes in these modes (Ulbrich and Christoph 1999) may lead to significant sea level changes, especially along the coasts (Chafik et al. 2017).
A recent study by Becker et al. (2016) compared long-term correlation in sea level as observed by century-long tide gauge records with simulated nearby sea level from climate models and found that the models overestimated long-term trends at the North Atlantic coasts as induced by internal variability. Here, we found no evidence of an overestimation of 20-yr trends, but our approach is not perfectly suited to detect such a misfit. Note also that we are focusing on larger spatial scales and the open oceans to shed light on the model performance in the regions from where long-term sea level changes are likely to originate. For a more thorough assessment of both variability and trends, and to disentangle the steric and dynamic contribution, we would need to produce a steric estimate from both models and observations. This is, however, out of the scope of this study.
FIG. 11. Taylor diagrams to summarize the performance of individual models with respect to (a) MDT (observational period), (b) variability (maximum PCC), and (c) linear trends (maximum PCC). The dashed arcs represent the RMSE. Standard deviation and RMSE are standardized with the respective observed spatial standard deviation. The respective multimodel mean and the observations as well as the multimodel mean based on selection (w1) and weighting (w2) are also shown.
The performance of the individual models is summarized in Fig. 11. In the Taylor diagrams, the distance between the points representing the simulated fields and the point representing the observed field is inversely proportional to the overall similarity of the two fields. For variability and trends (Figs. 11b,c), models are grouped along a line as we display their performance during periods of maximum PCC, which is similar for most models (Figs. 6 and 8). From Fig. 11 outliers can be identified visually, that is, MRI-CGCM3 in Fig. 11b and HadGEM2-CC in Fig. 11c. It is also clear that the ensemble mean delivers the best performance. However, because of the ensemble averaging, finer details are lost and the spatial standard deviation of the ensemble mean is always smaller than the observed one (Fig. 11), except when weighting is applied (see below).
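For reference, the geometry of a Taylor diagram rests on the standard law-of-cosines relation between the centered RMSE, the two spatial standard deviations, and the pattern correlation. The symbols below are notation assumed here, not taken from the study: E' is the centered RMSE, σ_m and σ_o are the modeled and observed spatial standard deviations, R is the PCC, and hats denote quantities standardized by σ_o (as in Fig. 11).

```latex
\[
E'^2 = \sigma_m^2 + \sigma_o^2 - 2\,\sigma_m \sigma_o R
\quad\Longrightarrow\quad
\hat{E}'^2 = \hat{\sigma}_m^2 + 1 - 2\,\hat{\sigma}_m R,
\qquad
\hat{\sigma}_m = \frac{\sigma_m}{\sigma_o},\quad
\hat{E}' = \frac{E'}{\sigma_o}.
\]
```

With R mapped to the azimuthal angle and the standardized standard deviation to the radius, models with similar PCC fall along a common radial line, and their distance to the observation point (at unit radius on the axis where R = 1) equals the standardized centered RMSE.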
From Fig. 11 it can be seen that there is no clear relationship between the model performances regarding MDT and their performances regarding variability and linear trends. Also, there is no consistent relationship between model grid spacing and their performance. While models with comparatively small grid spacing like CCSM4 and NorESM are consistently in the upper half of the ranking, so is CMCC-CMS, a model with relatively large grid spacing in the area. In contrast, MPI-ESM-LR does perform well with respect to MDT, but performance degrades considerably regarding variability and trends. A good performance with respect to the mean state is therefore no guarantee for an equally good performance with respect to decadal trends and interannual variability.
One might ask whether the results of our analysis can be used to improve projections and/or reduce uncertainties of regional sea level and risk assessments that are based on the climate simulations we investigated here. Weighting model output has been used widely with respect to, for example, mean surface temperature (Räisänen et al. 2010), the meridional overturning circulation (Schneider et al. 2007), and sea ice extent (Knutti et al. 2017). Various approaches are conceivable, such as selecting the model that performs best, excluding models that perform inadequately, or assigning different weights to models based on their performance.
Prior to computing weights, it needs to be assessed whether our diagnostics (mean state, interannual variability, or short-term trends) are relevant for projections on long time scales. The regional trend patterns are mostly due to internal variability. It is therefore questionable to draw conclusions about sea level projections (i.e., forced changes) based on the models' performance with respect to linear trends over the observational period. The accurate representation of the mean state is more relevant to projections (and has also been shown to be time independent), as is the simulation of the regionwide sea level rise that has been shown to be at least partly forced. If weights are to be computed, these two diagnostics should be used.
We emphasize that our study focuses on evaluating the large-scale features of CMIP5 models in contrast to details on shelves and near coasts, where we do not expect these coarse grid models to perform well (e.g., Griffies et al. 2014). The overestimation of near-coast sea level changes pointed out by Becker et al. (2016) may indicate that CMIP5 models lack an adequate representation of relevant processes. For example, the decoupling between on-shelf and off-shelf sea level changes on short-interannual time scales (e.g., Hughes and Williams 2010; Bingham and Hughes 2012) may play a role on longer time scales, but it is not necessarily modeled correctly, nor are transfer mechanisms involving coastally trapped waves (Calafat et al. 2012; Dangendorf et al. 2014a; Frederikse et al. 2016). The ability to model exchanges with the Mediterranean Sea has also been proposed to be important for the representation of sea level changes on the European shelves (Hughes et al. 2015). Such small-scale processes should be taken into consideration for improvement and evaluation of future models but are outside the scope of this study. Our motivation is that the circulation and variability in the open oceans need to be realistic to model the off-shelf, long-term steric and dynamic changes that will, through the abovementioned or other mechanisms, affect long-term sea level changes on shelves and coasts.
Here, we provide a test of our hypothesis that there is potential benefit in basing model selection or weighting on the models' performance with respect to the large-scale mean state. Comparisons were done between the full ensemble (equal weights), an ensemble of models with MDT RMSE below median (zero weights for deselected models), and an ensemble weighted by the inverse of the MDT RMSE. The corresponding weights are presented in Table S2, and the skill of the modified multimodel means is included in Fig. 11. The performance of the ensemble mean MDT improves only marginally with the selection and weighting (Fig. 11a and Fig. S6 in the supplemental material), as the full ensemble already performed reasonably. However, the strength and shape of the subpolar gyre become more realistic with selection, but not with weighting. As expected, the performance with respect to trends and variability does not improve with weighting or selecting, but it does not deteriorate either (see also Figs. S7 and S8 in the supplemental material).
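The three ensemble variants can be sketched as follows. This is a minimal illustration, not the procedure behind Table S2: the normalization of weights to unit sum is an assumption, and the model names, RMSE values, and one-point "fields" are hypothetical.

```python
import statistics

def ensemble_weights(rmse_by_model, scheme):
    """Return weights normalized to sum to 1 (normalization is an assumption).

    scheme: 'equal'   -> full ensemble, equal weights
            'select'  -> equal weights for models with RMSE below the median,
                         zero weight for the others
            'inverse' -> weights proportional to 1 / RMSE
    """
    if scheme == "equal":
        raw = {m: 1.0 for m in rmse_by_model}
    elif scheme == "select":
        median = statistics.median(rmse_by_model.values())
        raw = {m: 1.0 if e < median else 0.0 for m, e in rmse_by_model.items()}
    elif scheme == "inverse":
        raw = {m: 1.0 / e for m, e in rmse_by_model.items()}
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    total = sum(raw.values())
    return {m: w / total for m, w in raw.items()}

def weighted_mean(fields, weights):
    """Weighted multimodel mean of fields given as lists on a common grid."""
    n_points = len(next(iter(fields.values())))
    return [sum(weights[m] * fields[m][i] for m in fields) for i in range(n_points)]

# Hypothetical MDT RMSEs (m) and one-point projected changes per model.
mdt_rmse = {"model_a": 0.05, "model_b": 0.10, "model_c": 0.20, "model_d": 0.40}
fields = {"model_a": [1.0], "model_b": [3.0], "model_c": [5.0], "model_d": [7.0]}

w_equal = ensemble_weights(mdt_rmse, "equal")
w_select = ensemble_weights(mdt_rmse, "select")
w_inverse = ensemble_weights(mdt_rmse, "inverse")

mean_equal = weighted_mean(fields, w_equal)
mean_select = weighted_mean(fields, w_select)
```

Selection is a hard cut that removes models entirely, whereas inverse-RMSE weighting keeps every model with a reduced contribution, so the two schemes can affect ensemble spread differently.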
More importantly, compared to using the full ensemble, selecting models leads to a reduction in spread of regional sea level ensemble projections along the coastlines. A similar spread reduction is not apparent in our weighted results (Figs. 12d-f). These results indicate that climate model performance with respect to simulating coastal sea level change should be assessed by their oceanwide circulation and that selection should be considered a viable option. This also underlines that climate models need to be investigated further regarding their steric and dynamic features and performance, preferably on longer time scales as well.

Conclusions
We decomposed annually averaged sea level in the northern North Atlantic and Nordic seas over a 20-yr period into a mean state, linear trends, and residual interannual variability to evaluate the performance of climate models that are routinely used to project the steric/dynamic component of future sea level changes against altimetric observations.
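The decomposition at a single grid point can be sketched as below. It assumes a least-squares linear fit and takes the standard deviation of the detrended residuals as the measure of interannual variability; both choices, the function name, and the synthetic series are assumptions made for illustration.

```python
import math

def decompose(series):
    """Split an annual series into mean, linear trend, and residual.

    Returns (mean, slope per year, residuals, residual standard deviation).
    """
    n = len(series)
    mean = sum(series) / n
    t_mean = (n - 1) / 2
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (v - mean) for t, v in enumerate(series)) / den
    residual = [v - mean - slope * (t - t_mean) for t, v in enumerate(series)]
    variability = math.sqrt(sum(r * r for r in residual) / n)
    return mean, slope, residual, variability

# Synthetic 20-yr series: offset 10, trend 2 per year, and a +/-1 wiggle
# constructed to be orthogonal to both the mean and the linear trend.
wiggle = [1.0, -1.0, -1.0, 1.0]
series = [10.0 + 2.0 * t + wiggle[t % 4] for t in range(20)]

mean, slope, residual, variability = decompose(series)
```

For this series the three components separate cleanly: a mean of 29, a trend of 2 per year, and a residual standard deviation of 1.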
We demonstrated that models are in general capable of capturing the main features of observed sea level changes if internal variability is taken into account. Our study shows that the observed linear trend pattern over the period 1993-2012 is likely dominated by internal variability, whereas the mean state pattern is rather insensitive to which 20-yr period is chosen. While no forced signal could be detected in the regional trend patterns, the observed regionwide average sea level rise during the observational period (1993-2012) appears to be partly forced.
According to our results, the ensemble mean outperforms every single model, in line with what has been shown in other studies (e.g., Yin et al. 2010; Simpson et al. 2014). Risk assessments, however, require not only the most likely projection but also the uncertainty range around it. Our results also highlight that the multimodel projection spread of regional coastal sea level change can be reduced by selecting models based on their ability to simulate the mean dynamic topography over the entire ocean region in question. This points to a realistic mean state representation as the most important feature for ocean models to be suitable for sea level projections.
Our selection-weighting approach is rather ad hoc and simple and, as such, is subject to certain caveats (e.g., we are not considering intermodel dependencies). Sea level changes as investigated here are the result of several processes (air-sea heat exchange, redistribution of heat and freshwater in the ocean, and dynamic changes) and, in line with Knutti et al. (2017), several variables (e.g., ocean temperature and salinity) and diagnostics (e.g., stratification) should be analyzed in conjunction in order to assign sensible weights to each model.