Estimation of global coastal sea level extremes using neural networks

Accurately predicting total sea-level including tides and storm surges is key to protecting and managing our coastal environment. However, dynamically forecasting sea level extremes is computationally expensive. Here a novel alternative based on ensembles of artificial neural networks independently trained at over 600 tide gauges around the world, is used to predict the total sea-level based on tidal harmonics and atmospheric conditions at each site. The results show globally-consistent high skill of the neural networks (NNs) to capture the sea variability at gauges around the globe. While the main atmosphere-driven dynamics can be captured with multivariate linear regressions, atmospheric-driven intensification, tide-surge and tide-tide non-linearities in complex coastal environments are only predicted with the NNs. In addition, the non-linear NN approach provides a simple and consistent framework to assess the uncertainty through a probabilistic forecast. These new and cheap methods are relatively easy to setup and could be a valuable tool combined with more expensive dynamical model in order to improve local resilience.


Introduction
Predicting accurately the sea water level variability from short to large time scales is of great importance for coastal communities. The range of impacts and challenges is broad, ranging from harbour management (where minimum water level is required to allow ships to enter the harbour) to life-threatening natural disasters or long-term sea level rise leading to loss of land availability and fertility for agriculture. Coastal flooding due to storm surges is considered as one of the biggest sources of casualties during tropical cyclones; storm surges have large social, economic and environmental impacts [1][2][3][4]. Therefore, timely and accurate prediction of sea-level variability and extremes is crucial for global coastal resilience.
Deterministic numerical models have proven to be powerful tools for predicting sea variability. In particular they are effective for simulating storm surge propagation and impacts, and facilitate understanding of the complex physical processes associated with the storms [5][6][7][8][9][10][11]. However, they are relatively expensive and complex to set up and run operationally, with associated additional computation costs if ensemble forecasts are required for analysis of risk or variability.
More generally, machine learning approaches and particularly deep learning have shown huge potential in pattern recognition for a wide range of applications. Recently, these techniques have emerged in climate, meteorological and oceanographic fields with convincing results. For example, convolutional neural networks have been trained to predict variations in the El Ninõ/Southern Oscillation (ENSO) with skill superior to state-of-the-art dynamical forecast systems [12]. Machine learning algorithms have also been used to aggregate 'best-estimate' forecasts from an ensemble for the predictions of ocean waves [13]. Neural networks have also successfully been used to bias-correct measurements leading to more homogeneous climate data records [14].
In sea level and tide processes, regressions have been used to infer meteorological impacts on sea water level and storm surges [15,16]. Regression models have also been successfully driven by offshore gauge data in New York [17] and statistical models have been applied to estimate extreme storm surges and associated return periods [18], or as bias-correction to water level predictions along US East Coast [19,20]. The latter have shown similar performance compared to deterministic hydrodynamic models in capturing extremes in some cases. More recently, storm surges hindcasts in estuarine ports of the UK have been possible using artificial neural networks leading to accurate forecasting coastal flooding [21]. Neural networks have also been used for tide predictions at Mangalore, India [22], and along the Swedish coast to analyse long sea level records [23] where higher performance was obtained when local sea level forcing was prescribed (compared to linear models).
The aim of this study is to describe a general non-tuned machine learning framework, based on neural networks, and apply this around the globe with demonstrable skill in predicting non-tidal sea level residuals and extremes. The manuscript is structured as follows. Section 2 presents the GESLA tide-gauge data and associated pre-processing, the neural network ensemble, and the split between training and test sets as well as the scoring probabilistic measure. The first part of section 3 shows the key results of the study based on performance statistics for over 600 gauges around the world while the second part focuses on two particular regions with contrasting behaviours. Finally, section 4 discusses the results, the benefits and limitations of the approach, and the future steps.

Global extreme sea level analysis dataset-GESLA
The Global Extreme Sea Level Analysis database (GESLA version 2 [24]) provides unified highfrequency (15 min to 1 hour temporal resolution) quasi-global coastal sea level water information. Only public data (around 1070 gauges) are used in the present study. While data have been standardised, a simple but strict methodology was applied to pre-process each gauge in a systematic and reproducible manner. The key aspect of this stage was the elimination or reduction of potential issues arising from spurious data (e.g. temporal or reference height shifts) as well as removing long-term trends. An example of the preprocessing stage is illustrated in Supplementary figure 1 (stacks.iop.org/ERL/15/074030/mmedia). The following steps were sequentially applied to each gauge:  Fig S1b). The procedure is as follows: an average complex value is calculated from N yearly values (red square). The average separation from this mean is calculated (ε) over N years. The complex difference (ϵ year i ) for each year from the mean is independently assessed and the gauge year is rejected if (e.g. green diamonds). This procedure ensures the rejection is based on the relative size of the separation from the N-year mean whilst preventing rejection for very small amplitudes. • The total water signal is re-interpolated over a constant 1 h time vector based on the original temporal resolution excluding rejected periods of data; • Finally, only gauges with over 3 (not necessarily contiguous) years of data are kept, with at least 2 years for training and one year for testing the model.
At the end of this process, 621 gauges remain and are used in this study. They provide an extensive coverage of the coastlines worldwide. The non-tidal residual is computed as the difference between the observations and the harmonic tide prediction (computed from all remaining sections). The aim was to implement a reasonably simple, robust and consistent pre-processing methodology to objectively deal with the large amount of data available. However, one could define different thresholds or apply different type of pre-processing to the gauge; exploratory analysis suggests that this would not impact the key results of this study.

High resolution atmospheric and ocean wave reanalysis -ERA5
To assess the impact of atmospheric and ocean wave processes on the non-tidal residual, an ensemble of hourly physical predictors are extracted from the high-resolution atmospheric reanalysis ERA5 of ECMWF [26]. These are pre-processed over three length scales: Local −10 m wind components and mean sea level pressure at 0.25 • resolution as well as significant wave heights (including wind waves and swell) and peak periods (at 0.5 • resolution) at the closest grid point from the gauge; Neighbourhood -spatially accumulated precipitation in a 3.5 • box centred on the gauge; Regional -maximum and minimum wind speed components, maximum wave heights and minimum mean sea level pressure in a 5 • box centred on the gauge.
In addition, proceeding 3 h gradients of all the atmospheric predictors are computed to capture late intensification / de-intensification (for example a low pressure system developing rapidly) as well as for the harmonic tidal level.

Machine learning
Each gauge is modelled independently using artificial Neural Networks (NNs). Each NN is composed of 3 hidden layers of 48 neurones. The input layer has 33 nodes (one for each environmental predictor described in previous sections combined with 7 hourly time steps of harmonic tide), and the outer layer has a single node providing the non-tidal residual target. While a sigmoid activation function is used for the last layer, the hidden layers consists of Leaky ReLU activation functions [27] combined with batch normalization layer to normalise the activations [28]. The NN had just under 7000 trainable parameters and its schematic view is provided in Supplementary figure 2. Finally, an Adam solver [29] is used to minimise the root mean square error between non-tidal residual predictions and observations; the NN is fitted for 150 iterations or less if the errors is not reduced within 10 consecutive iterations. Due to the large number of gauges available, this configuration has been lightly tuned on three random gauges (namely, a few combinations of the number of neurones, number of hidden layers and type of activation) and then applied to the full set without further adjustment.
For each gauge, the test set consists of the most recent year of recorded data (8784 time steps) while the rest is part of the training set. Therefore depending on the gauge, the training set extends from 2 years to 32 years permitting an analysis of the impact of the training size on the performance. Figure 1(a) shows the number of gauges as a function of the length of the training data.
An ensemble of 20 Neural Networks (NNs) is trained at each gauge location to generate a probabilistic forecast. Each NN is fitted using 50% of the training set, randomly sampled. While a larger ensemble would have improved our probabilistic forecast, 20 members were chosen as a pragmatic balance between computational cost (over 12 000 NNs have been fitted in this study) and variability in the predictions.
All data (features and targets) have been standardised and normalised. The Neural Networks (NNs) are built with the Keras Python module [30] interfacing with Tensorflow [31] while processing was mainly done with the Scikit-Learn packages [32]. The neural networks have the traditional structure, where each node is connected to every node of the next layer. The temporal evolution of sea water level and nontidal residual is continuous. Recurrent layers (such as Long Short-Term Memory, LSTM layers [33]) can be used to capture the dynamics of temporal processes. An LSTM neural network structure was implemented and tested for a few gauges but it did not lead to significant improvements of the predictions, and therefore a more simple and traditional structure was kept in this study. Finally, as a baseline, an ensemble of multi-variate linear regressions are fitted and used for predicting sea water level in the same manner as the neural networks for comparison; again for the linear regression no time series model was used.
Note that the neural network described above did not converge for 11 randomly-located gauges. Given the global coverage and the large number of gauges, these 11 gauges have been removed and no further investigation were carried out on these particular gauges.

Continuous ranked probability score
To assess the skills of the probabilistic predictions, a Continuous Ranked Probability Score (CRPS) is computed, with units cm. In weather forecasting, this is a common qualitative measure of performance for probabilistic forecasts comparing a distribution with observations [34][35][36].
The CRPS is defined as a quadratic measure of the difference between predicted, H p (η r ; t), and observed, H o (η r ; t), cumulative density functions (CDF). The quadratic measure is integrated over all possible residuals, z, and then averaged over time t to give a CRPS for each gauge where for each gauge H p (z; t) denotes the probability of an anomaly less or equal to η r being predicted at time t; and H o (η r ; t) is step-function, denoting the probability of an anomaly less or equal to η r being observed at time t. Intuitively, an ensemble producing a wide range of outcomes or an ensemble with a mean significantly different from the observed values would be heavily penalised while a narrow ensemble centred on the observations would lead to a better score. The CRPS is computed for the one year test period as well as for 95th percentile extreme values (surge). While a 20-member ensemble is not extensive, using a CRPS metric is a better validation approach compared to using the mean or median where the information contained in the ensemble is mainly lost. The CRPS are computed using the properscoring Python library.

Global skills of the NN
The CRPS is computed for the observed non-tidal residual to provide a baseline metric for the signal not captured by the astronomical harmonic analysis. The harmonic analysis does not aim (and has not been designed) to capture this kind of variability; the nontidal residual simply provides a first-order baseline for comparison based on a 37 constituents harmonic analysis and it is expected than any method should capture parts of the non-tidal signal. The boxplot summarising this baseline skills per number of training years as well as their global distribution for the extreme values (over the 95th percentile anomaly) are presented in figure 1(c) (yellow box) and figure 2(a), respectively. The length of the time series has a weak impact on the CRPS, which ranges from 15 to 25 cm on average. Figure 2(a) illustrates the spatial variability of the CRPS with larger value in mid latitudes due to consistent winter storms and larger tides compared to tropical regions.
The NNs consistently capture the non-tidal residual due to the effect of atmospheric forcing as well as tide-tide interactions and tide-surge interactions with a mean CRPS of around 10 cm (figure 1(c)blue box). The CRPS for outlier gauges with large non-tidal residual can be improved from over 50 cm to around 25 cm. Figure 1(b) shows the percentage of non-tidal residual (baseline) captured in the NN predictions ranging from 30 to 60% on average. While longer training period improves the skills, it appears that after 6 years of training data, the performance remains fairly stable. While for any gauge, the NN captures the non-tidal residual (figure 2(b)), the skill varies spatially ( figure 3(a)). It is mainly due to the ease of improving a bad skill compared to reducing already good skills (lower than 10 cm). While the NN approach leads to high skill in reconstructing extremes of non-tidal residual, it is worth considering how a multivariate linear regression would perform in comparison. Figure 3(b) shows the percentage improvement between the two methods. While tropical regions show the lowest improvements using a NN (10-20%), the skills at higher latitude improves by up to 50% with clear regions of the globe emerging as Europe, West coast of North America, Alaska, Chinese Coast, North Australia and the Northern coastline of Japan (facing the Sea of Japan). Except one point in the Canary Islands, the NN outperformed the regression anywhere else; this might be due to a fitting issue at this particular site (not investigated).
Supplementary figures 3, 4 and 5 highlight similar results for the whole 1 year test time series. The skill improvements is not as high as for the extremes but is still significant and systematic. The Baltic sea regions can be pointed here as a region of lower skill improvement from the regression to the NN. This is potentially due to the long time scale sea-level variability that is not included in the predictors used, due to seasonally integrated winds and salinity changes [37]. Similar performance are also obtained for the lowest levels (5th percentile, lowest level being of importance for harbour management) and the 99th percentile (not shown) of the non-tidal residual.
Predicting the full range of non-tidal residuals is key for a broad range of applications. However assessing the skill of the models in stressed conditions is also of relevant importance. While the usual extreme statistics cannot be applied to this study (the test sets being only one year at each gauge), looking at the most extreme skew surges within over 600 gauges highlights the capability of the models. Figure  4 shows the 20 largest skew surges in the test set. Predictions of these large skew surges are almost always under-estimated compared to observations, but the neural network ensemble shows some skill in capturing them (over 2/3 of the signal) and systematically outperforms the multivariate linear regression. Note that the present neural network and training set have been designed to predict the complete time series and not only the extreme storm surges; therefore the training set is highly unbalanced such that extremes are seen as outliers which penalises the model predictions (more details on the impact of the the training set are provided in discussion).

Time series at two particular locations
The previous section focus on time-averaged skills in capturing the non-tidal residual. However, it is difficult to assess the highly-complex time variability of this residual. Therefore, two gauges have been selected for a more detailed investigation for their very different characteristics: • Anchorage (149.89 W / 61.24 N -around 14 years of training data -Supp. Figure 6(a), Alaska, USA, located at the end of the Cook inlet and protected from the open ocean. Due to its location, Anchorage is not exposed to extreme surges (less than 1 m in the test year) but the time series exhibits significant tide-related variability not captured by the harmonic analysis with this constituent set ( figure 5(a)), • Dunkirk (2.37E / 51.05 N -also around 14 years of training data -Supp. Figure 6(b). This gauge was used in the light tuning, mentioned in the method section), North France, located in the English channel, on the North Sea side. For Dunkirk gauge, the test year includes the winter 2013-2014 when severe winterstorm Xaver (Dec. 2013) crossed Northern part of the North Sea and led to significant surges all along the North Sea coast [38]. This was also the highest sea water level anomaly in our 15-year period at Dunkirk (around 2.5 m while the highest peak in the 14 training years was 2.2 m). Finally, the storm occurred far away from Dunkirk where pressure and wind speed did not show any exceptional values but the surge wave travelled around the North Sea, making an interesting and challenging case for the NN ( figure 5(b)). Figure 5 shows a few weeks of non-tidal residual at each selected gauge. The multivariate regression captures fairly well the long-term smoothed variability at Anchorage (figure 5(a)) but cannot capture the hightemporal variability induced by complex tides in the Cook inlet that were not computed in the tidal harmonic analysis; the ensemble variability is also almost non-existent. On the other hand, the NN ensemble captures efficiently the variability (with some spread) leading to a good CRPS (9 cm versus 21 cm for the regression over the one year test window). A Fourier transform is applied to the one-year signal (figure 6(a)), highlighting the compelling skill of the NN to capture the energy of the system at all time scales while the regression underestimates by an order of magnitude the energy for time scales lower than a day. This shows the capacity of a non-linear NN to predict tide-tide interactions or tide components not included in the harmonic analysis.
Similar conclusions are obtained at Dunkirk. While the extreme storm surges induced by storm Xaver (around 6th December 2013) are underestimated (and so is the previous peak in late November), the prediction is more accurate than the one predicted with a regression. For comparison, the Met Office CS3x deterministic forecast [39] also underestimates the peak by around 75 cm ( figure 5(b)). Over the test year, the NN CRP scores 8 cm and 18 cm for the mean and 95th percentile when the regression gets 13 cm and 32 cm. As for the Anchorage, the energy is well captured by the NN at this gauge except the two smaller peaks for periods of around 3 h 40 min and 4 h 50 min. For periods longer than 1 day, the energy is slightly under-estimated by both the regression and the NN. This highlights skills into predicting sea water anomaly and particularly extreme events using a simple NN forced by a small range of atmospheric and wave data.

Discussion and conclusion
An ensemble of NNs have been built for over 600 tide gauges spread around the world in order to predict the non-tidal residual (total sea water level minus an harmonic analysis based on 37 constituents), in term of general behaviours as well as extremes events. The results presented in this study have highlighted the global skill of NNs in capturing non-tidal residual variability and extremes, systematically outperforming predictions based on multivariate linear regressions (in term of CRPS but also in term of correlations). Due to the large amount of available data, the same simple pre-processing and neural network structure were applied to each gauge. A higher level of data quality control or gauge-by-gauge NN tuning could have been applied, and better performances would then be expected. However, analysis and preprocessing requiring localised intervention was not the aim of the study.
While it was expected that the non-linearity of the NN would play a key role in predicting extreme events through environmental forcing, the results have shown an even better performance of the NNs in their ability to represent tide-tide non-harmonic interactions, treat noise, and express uncertainty. Similar advantages are also reported in the application of Bayesian approaches to the study of tidal currents [40]. Traditional harmonic and response methods [41,42] have successfully been used for decades to predict tidal amplitudes across the world; however the advent of easily accessible meteorological data combined with novel applications of methods (for example neural networks, as in this study), could offer a new avenue for improving predictions by capturing non-linear processes.
The model has also shown significant skill in reconstructing extreme surges but still lacks accuracy in the strongest events, in capturing the peak elevation (figures 4 and 5(b) for example). This is partially due to the training data. Extremes can be seen as outliers and are only a fraction of the training set. The machine learning technique minimises a cost function (here, root mean square errors) which generalises common behaviours, and is not well designed for outliers. This leads to bias in the performance toward the average dynamics and not towards the extreme anomalies (positive or negative). Therefore the capability at predicting extremes could be improved by using a differently balanced training set [43]. As a simplistic example, one can draw a similar amount of training data in regular bins covering the range of outcomes (using sampling with replacement technique for bins with a very small amount of data); this leads to a more balanced training set. Supplementary figure 6(a) illustrates the impact of the training set on the model skill at Dunkirk (during storm Xaver in 2013). The NN now captures the amplitude of the peak on the 5th December as well as the deterministic CS3x model, and the peak on the 6th December almost perfectly. The mean of the NN ensemble with a balanced training set is 50 cm higher than the unbalanced result. As seen in Supplementary figure 6(b), in term of energy, the balanced training set is in much better agreement with data for periods longer than 12 hours but it penalises the weaker period where the energy in-between peaks is over-estimated. In terms of CRPS, the mean score decreases by less than 1 cm while the extremes (95th percentile) score improves by 6 cm. This type of model can be a great tool alongside a deterministic numerical model to improve coastal resilience and potentially set-up warnings in the future as they can also be used to solve classification problems instead of regression ones (as done in the present work) enabling an outcome such as low risk, high risk and extreme risk for example. It was shown here that only a couple of years of training data were enough to get reasonable skills, and there is not significant skill improvement in 30 years training data compared to 6-7 years. In addition, though not shown, even old data collected in the past could be used for present forecasts as long as reference levels have been corrected.
So far the present work has not be extended to locations with no data and the next step would be to built a globally connected tool to predict non-tidal residuals spatially. In addition a better representation of the regional / global atmospheric forcing might help to improve skill. This could be achieved via dimensionality reduction of environmental information based on unsupervised learning such as principal component analysis or auto-encoder. Finally, investigating more in depth the impact of using a more complex neural network structure adapted to time series (Long-Short Term memory for example) could also be of interest in the future. Setting up highresolution full physics numerical models in complex inshore regimes is time and computationally expensive and requires physical expertise. These new types of machine learning approaches are appealing for informing stakeholders where there is no capacity for implementing such deterministic weather -surge forecasting systems.