B IAS C ORRECTION OF O PERATIONAL S TORM S URGE F ORECASTS USING N EURAL N ETWORKS

predictors used for NN inference are already available operationally, iii) prediction by the NNs is very fast, typically a few seconds per station, and iv) the NN correction can be provided to a human expert who may inspect it, compare it with the model output, and see how much correction is brought by the NN, allowing to capitalize on human expertise as a quality validation of the NN output. While no changes to the hydrodynamic model are necessary to calibrate the neural networks, they are specific to a given model and must be recalibrated when the numerical models are updated


Introduction
The Sea Surface Height (SSH) oscillates around the mean sea level following a tidal component and a non-tidal component, i.e. the meteorological component, also called storm surge [1].Tides are regular and periodic sea-level arXiv:2301.00892v4[physics.ao-ph]18 Dec 2023 Illustration that shows water level differences for storm surge and a normal (predicted) high tide as compared to mean sea level.Z it the observed total water level, T is the predicted tide with harmonic analysis, and R is the storm surge predicted by the physical model.variations with high predictability for the gravitationally driven portion of the water level variability.For the most part, they are directly related to periodical geophysical forcings, such as a combination of the gravitational forces exerted primarily by the Sun and the Moon, and the effect of Earth's rotation and the associated Coriolis force.Tides are commonly the largest source of short-term SSH fluctuations with a fortnightly neap-spring cycle due to the relative positions of the Sun and the Moon.Their amplitude also varies with the time of the year -with the strongest tides appearing around the equinoxes due to the alignment of the Sun and the Moon.On the other hand, the principal irregular factors that affect the SSH are atmospheric pressure and winds acting on the oceans, as well as the associated large-scale waves and currents these trigger.Thus, the relative importance of these two components, tidal and meteorological, depends on the time of the year, weather conditions, and the local bathymetry.For instance, the meteorological component at high latitudes around Norway is greatest during the stormy winter months, particularly over shallow seas.
Storm surges are formally defined as the height of water above the normal predicted tide, i.e., the meteorological component, see Fig. 1.However, the term storm surge is usually reserved for events that give rise to unusually high SSH, rather than minor deviations from the predicted tide.Storm surges can lead to hazardous situations if combined with high tides.For example, when a cyclone makes landfall during a high tide, even of moderate amplitude, it can create an exceptionally high water level rise [1].Similarly, severe storms acting on shallow waters that coincide with spring tides can lead to critical coastal floodings [e.g., 2], and potentially damage infrastructure.If the surrounding land is low-lying and densely populated, surges can pose a great threat to life and property [e.g., 3,1,4].Furthermore, the soil may become infertile for several years after an inundation because of saline deposits [1].Overall, flood events in populated coastal areas can affect residents' health, food security, and access to clean water.
From a climate change perspective, impacts and risks are becoming more complex and difficult to manage because of the simultaneous occurrence of multiple hazards [4].A combination of, for instance, increased sea level, storm surge, and heavy rain can lead to an extreme event, even if each of these events is not extreme.Above 1.5 • C global warming, there is high confidence that a combination of these events will increase the compound flood risks [4].Given that extreme surge events are projected to increase in the 21st century, the prediction of extreme surge events is a primary concern for designing warning systems and coastal defenses [4].For all these reasons, improving the numerical models used to predict storm surges is presently a research area of large interest.
Coastal floods due to storm surges are also a threat to countries around the North Sea, including Norway.Lives were lost in the extreme event of 1960 [5], and, although more recent events have not caused deaths, they have damaged properties and caused significant economic losses.For example, the storm surge event that affected Norway in October 1987 and, more recently, the events "Dagmar" (December 2011) and "Elsa" (February 2020).To mitigate potential damages, it is crucial to have a warning system for extreme water levels in place.The Norwegian Meteorological Institute (MET Norway) has developed a system that predicts and issues warnings in case of extreme levels at 23 permanent Norwegian stations [5].Observations at these locations are transferred in Near Real-Time (NRT) from the Norwegian Mapping Authorities and used for post-processing the forecasts.Then, the corrected values are transmitted from the Research and Development Department to the Forecasting Center at MET Norway, and published on the official website for water level forecasts operated by the Norwegian Mapping Authorities.Furthermore, a decision support system dashboard for automatically detecting and visualizing the exceedance of certain water level thresholds is internally available to the forecasters at MET Norway, who will communicate any warnings to key users and the general public [5].The core of MET Norway's complex warning system is the numerical model, the Regional Ocean Modeling System (ROMS), which predicts the meteorological component.The astronomical tides computed with harmonic analysis are then added to the storm surge signal to obtain the total water level.In addition to the numerical simulations, the flow of NRT observations and the dissemination of the forecasts to the authorities and the general public are essential components of the warning system [5].In the following, we will refer to the current storm surge model that runs at MET Norway, which already includes a simple weighted difference correction, as "Nordic4-SS".This correction is applied operationally, by removing the sliding average for the error between the observed and predicted storm surge over the last few hours in the past, as explained further down.
Given that the numerical model is the core of the warning system, improving the forecasts will directly impact the ability to reduce the consequences of extreme storm surge events.While analytical models can be used to analyze the main driving mechanisms behind storm surges, numerical models are required to capture the complexity of weather patterns, bathymetry, and the coupling between the ocean and the atmosphere that impact storm surges [6,7,1].Statistical methods, e.g., Neural Network (NN)s, are an alternative to the numerical models frequently used to produce flood warnings to alert the population living in risk areas [8,9].The two approaches have been compared in the literature [10,11].Numerical models, although capable of describing the physical processes involved, take a long time to develop, set up, and run, require high-quality bathymetric data, are computationally demanding compared to the statistical algorithms, and still have biases and errors owing to the complexity of geophysical systems.These physics-based models have, however, high reliability (even if they are not perfect).Data-driven models, on the contrary, use Machine Learning (ML) or other statistical methods to determine the relationship between a set of predictors and the target variables.As a consequence, the complexity of these algorithms is limited by the quantity and quality of the historical and operational data available.They are, however, computationally more efficient than pure numerical models.Moreover, as we show in this article, it is possible to combine the best aspects of the two approaches by correcting the numerical model with a data-driven method [11], obtaining a third approach.
There has recently been a renewed interest in the field of ML [12], while efforts have been made to model storm surges operationally with NNs as an alternative to hydrodynamic models [e.g., 13,14,15,16,17,18,19].At a global scale, complex ML models for predicting the meteorological component of SSH at hourly intervals using NNs have been developed [20,9], achieving results comparable to those from hydrodynamic models.Together, these studies indicate that NNs are capable of predicting SSH, although they do not necessarily improve the performance compared to state of the art physics-based numerical models.On the other hand, this third approach has successfully been applied to correct short-term water level predictions with harmonic analysis in the Galveston Bay [21], and to post-process the meteorological component predicted with a numerical model using data-driven methods in the Adriatic Sea [22].
In the present work, we show that even a state-of-the-art operational model like Nordic4-SS has significant biases that can be corrected using a residual learning approach that combines numerical modeling with NNs.Furthermore, we do not limit our work to lead times of about a day, as most previous studies do [e.g., 13,14,15,16,17,18,20,9], but we extend the lead time range to sixty hours, showing the impact of our methods when applied to medium-range forecast.In the first part of the paper, we show that the residuals in Nordic4-SS depend on local wind speed and direction, and present clear bias patterns.The bivariate dependency of the average error on wind variables is visualized in polar plots, a technique that has previously been used in this field before [21] and in air quality applications [e.g.23,24,25].Although removing the average error in the polar plots does not significantly improve the prediction of extreme surge events, they illustrate and quantify statistical relationships between the variables.These dependencies strongly agree with the intuition and experience of the meteorologists on duty.In the second part of the paper, the Root Mean Square Error (RMSE) in Nordic4-SS is reduced at several stations by applying a residual NN method, i.e., by subtracting the residuals estimated with NNs from Nordic4-SS.
The paper proceeds as follows: First, we describe the methods and the data used; then, we present the results for three selected Norwegian harbors, Oscarsborg (OSC), Bergen (BGO), and Andenes (ANX); and, finally, we provide a summary of our findings, together with a discussion of the results.We also provide further details in the appendices.All material needed to reproduce these results is available in a Github repository (see Appendix A).A table with the coordinates of the stations is provided in Appendix B. Basic storm surge theory is explained in Appendix C and basic Machine Learning concepts used in the study are explained in Appendix D. Lastly, we show two examples of polar plots as a function of wave characteristics in Appendix E. These figures are not included in the main discussion because DECEMBER 19, 2023 the wave parameters do not improve the ML models, but they are shown to illustrate that error patterns are also visible in other quantities.

Data
In this section, we describe the datasets used to to validate and correct the numerical storm surge model.We evaluate ROMS in hindcast and forecast modes.However, the usefulness of an improved hindcast in an operational context is limited, so we show only the results of correcting the Nordic4-SS forecasts.Moreover, although the methods have been applied to all the permanent Norwegian stations, we show the results for three of them located in different regions: OSC, BGO, and ANX.

Total water level observations
We use SSH data from 22 stations in mainland Norway operated by the Norwegian Mapping Authority.These data are transferred in real-time to MET Norway and used to post-process the forecasts, estimate the residuals in the numerical model and validate it, and as input to the NNs used to reduce the errors in Nordic4-SS.The geographical location of these stations is shown in Fig. 2 (see also Table 1 in Appendix A for details).We have grouped the stations based on our physical understanding of the Norwegian climate, how waves propagate in the ocean, and the geography.The three groups consist of the stations located in Skagerrak (blue), the West Coast of Norway (orange), and Northern Norway (green).Furthermore, we exclude the station Ny Ålesund, in Svalbard, as it is subjected to completely different weather dynamics.
Figure 2: Locations of the permanent water level stations in mainland Norway.The stations have been divided into three groups according to their geographical locations and the characteristics of tides and weather systems that affect the area.Blue dots represent stations in Skagerrak, orange dots represent stations along the West Coast, and green dots represent stations in Northern Norway.Contours of mean sea level pressure in hPa are computed with data from ERA5 from the period 1959-2021.

Tides
We used tide data to compute the meteorological component from the observed total water level height, and as predictors in the data-driven models.The tide data used in this work is obtained with harmonic analysis and come from two different sources: a) data retrieved from the Norwegian Mapping Authority's API [26], and b) estimates made using the pangeo-pytide Python package [27].There are minor differences between the two datasets (in the order of millimeters) that we hypothesize are due to small differences in the number of constituents and the underlying optimization algorithms used by each package.Data from the Norwegian Mapping Authority are preferred, because it is the official source of tide data for Norway, but also because the RMSE for the different models is slightly lower when we train our models on this dataset.Provided that the differences between both datasets are much smaller than the uncertainty in the observations, for convenience, we use a combination of the two datasets depending on what is stored in our systems.

Forecasts
Our ultimate goal is to improve the operational storm surge model, Nordic4-SS (01/2018-03/2021).For this, we train NNs with data that are available at the analysis time, including storm surge and weather forecasts.

Nordic4-SS surge forecasts
The Norwegian storm surge model, Nordic4-SS, is based on ROMS [28,29], a state-of-the-art model system that has been in operational use, for instance, within the US National Oceanic and Atmospheric Administration for more than a decade, to forecast the water level and currents [28].Nordic4-SS is a free-surface model that uses a terrain-following coordinate system for solving the primitive equations [29] with a horizontal resolution of 4 km.The model domain is shown in Fig. 3, and covers the North Sea, the Nordic Sea and the Barents Sea.Nordic4-SS consists of the output from ROMS corrected with a weighted difference method (explained in the Section 3).These predictions are available for the 23 permanent Norwegian stations through MET Norway's Weather Application Programming Interface (API) [30].
MET Norway runs ROMS in barotropic mode (2D) every 12 hours, and every forecast has a length of 120 hours (five days) [31,5] starting from the analysis time.However, for stability reasons, the model run starts 24 hours before the analysis time, meaning that each run is 24 + 120 hours long (see Fig. 4).The length of the barotropic time step is set to ten seconds, and the time-stepping scheme followed is the LF-AM3 with Forward Backward (FB) [29].The bathymetry is inherited from a legacy model [32].Nordic4-SS is forced at the surface boundary with mean sea level pressure and momentum fluxes computed by applying the Charnock relation to 10 meter winds.This forcing data is retrieved from ECMWF forecasts (note that the first 24 hours of the data are a ramp up time corresponding to 24 hours before the analysis time, as visible in Fig. 4).The forcing data consists of analysis of pressure and winds from the ECMWF forecasting system.Moreover, a quadratic bottom friction is chosen, with a drag coefficient of 2.5 × 10 −3 .The model runs with open boundary conditions given by the Chapman condition for two-dimensional momentum [33] and the Flather condition [34] for free surface.The values for two-dimensional momentum and sea surface deviation are all set to zero, although the inverted barometer effect is added to the analytic surface deviation at the boundaries.DECEMBER 19, 2023 Figure 4: Diagram of Nordic4-SS runs showing the full length of a single run, from 24 hours before the analysis time, t 0 , to 120 hours after.Each run can be decomposed into a ramp up period before t 0 , where the model runs with analyzed forcing, and the forecast that starts at t 0 .
Nordic4-SS runs in ensemble and deterministic mode.Here, we use only the deterministic model because it is forced by an atmospheric model with higher resolution than the Ensemble Prediction System (EPS) members, which leads to an appreciable reduction in the residuals in Nordic4-SS compared with the other members of the ensemble.Furthermore, the length of the dataset used to train the data-driven methods in this study is limited by the archived forecasts of Nordic4-SS, the most recent implementation of the storm surge model, which is available from 2018.
A fraction of the error in Nordic4-SS is associated with the input to the model.This could be the forcing chosen to run ROMS, the description of the bottom topography, or the bottom friction coefficient [35,29].Nevertheless, another part of the error is intrinsic to ROMS and can be affected by limitations and errors in subscale parameterizations, missing or excluded physics, finite resolution in space and time, discretization and truncation errors [35,29], etc.Note that MET Norway currently runs ROMS for storm surge forecasting without tides.However, the storm surge and tidal components are non-linearly dependent and cannot be completely separated for large surges/tides with the magnitude of the impact being a topic of ongoing research [e.g., 36,5].Some efforts have been made to estimate the intrinsic error in ROMS by running it with analyzed atmospheric forcing.The calculations show that, even though the instrumental error at the water level stations is estimated to be less than 1 cm [5], the uncertainties in our ground truth, computed as the difference between total water level and tides, are estimated to be around 3 cm [5].This means that additional errors are introduced when, for example, the storm surge signal is derived from the difference between the tides and the total water level.Other potential sources of error in Nordic4-SS are related to local wind effects (as we show in the polar plots in the results section), lack of wave-ocean coupling, or resonances in the basins.

MEPS meteorological predictions
The Meteorological Co-operation on Operational NWP (MetCoOp) is a Nordic cooperation on Numerical Weather Prediction (NWP) between the Finnish Meteorological Institute (FMI), MET Norway, the Swedish Meteorological and Hydrological Institute (SMHI), and the Estonian Environment Agency (ESTEA).The NNs developed in this study to improve Nordic4-SS take as input forecasts from MetCoOp-Ensemble Prediction System (MEPS) [37,38,39], a forecast ensemble with a convection-permitting atmospheric model covering Scandinavia and the Nordic Seas produced by MetCoOp.MEPS has a horizontal resolution of 2.5 km and 65 vertical levels.The boundary conditions are taken from ECMWF, and initial perturbations are based on the SLAF method [40].Here, we use MEPS forecasts from runs in a 6-hours cycle (00, 06, 12, 18 UTC) with lead times up to 66 hours.Considering that the goal is to design a framework for improving Nordic4-SS forecasts that can be deployed operationally, and that Nordic4-SS runs at 00 and 12, we take DECEMBER 19, 2023 the MEPS forecasts from the 06 and 18 runs, from lead time 6 to 66. Thus, we design our operational-based setup so as to make sure that the predictions from MEPS are available at the analysis time of Nordic4-SS, making it directly transferable to operational applications.This dataset is also used to study the dependence of the residuals in Nordic4-SS in terms of wind speed and direction.

Hindcasts
The storm surge hindcast dataset, NORA3-SS, is used to validate the numerical model because the longer time series it provides give a more detailed insight into the storm surge statistics than the short time series from Nordic4-SS.In this validation process, we also use the atmospheric hindcast data ERA5.

NORA-SS surge hindcast
NORA3 is a high-resolution numerical mesoscale weather simulation made by MET Norway that covers the North Sea, the Norwegian Sea, and the Barents Sea.It is available from 1974 to 2021 (and will be extended in the future).With a resolution of 3 km, NORA3 downscales the ERA5 reanalysis providing an improved wind field, especially in mountainous areas and along the coastline [41,42], and performs much better than ERA5 with regards to the observed maximum wind.The downscaling is based on the HARMONIE-AROME model [43,44,37] (Cycle 40h1.2), a nonhydrostatic numerical weather prediction model that explicitly resolves deep convection [41,42,45].While the operational storm surge model (Nordic4-SS) is forced with weather forecasts from ECMWF, the storm surge hindcast (NORA-SS), is forced with reanalysis data from NORA3.Except for the forcing, NORA-SS runs ROMS with the same setup as Nordic4-SS.

ERA5 winds
In order to study the dependence of the error in NORA-SS on wind conditions, we use gridded reanalysis data from ERA5.Climate reanalyses combine past observations with models to generate time series of climate variables.In this study, the ERA5 reanalysis [46] was chosen to represent observed historical meteorological and wave conditions, spanning the period 1980-2020.The regridded data cover the Earth and are available on 37 pressure levels and single levels, on a regular latitude-longitude grid with a 0.25 • × 0.25 • horizontal resolution and 137 vertical levels.It is also dynamically consistent with the forcing, since NORA3 uses ERA5 as its host model [41,47].
Wind data from the gridded datasets are selected from the nearest grid box for each station.Wind speed and direction are calculated from the eastward (u) and northward (v) components of the wind at ten meters at an hourly frequency.Experiments were also conducted using pressure and wave data, but we focus on the wind dependency because, out of these three, the wind parameters are experimentally found to be the most important NN predictors for correcting Nordic4-SS.

Methods
In this section, we introduce the methodology used to reduce the residuals in MET Norway's storm surge model, Nordic4-SS.Although the methodology has been developed for improving the predictions made with Nordic4-SS at stations located along the Norwegian coast, it can easily be generalized to other models and regions as long as in-situ observations and numerical model data are available.
In the first part of the work, we validate the numerical model using traditional statistical methods (this will be referred to as "traditional" in the following).Polar plots are used to display the dependency of the systematic bias, either in Nordic4-SS or NORA-SS, on two variables simultaneously; for instance, local wind speed and wind direction.These plots, and their corresponding tables, can not only be used for validation, but also to correct the average error.In the second part of this work we apply NNs.Finally, the models are evaluated and compared against each other using the RMSE, and we select the model with the best performance.The sequence of processes followed to validate and correct Nordic4-SS is represented in the workflow diagram in Fig. 5.The first step consists of collecting and processing all the data needed from the different sources described above.Then, we compute the polar plots and train the NNs on the residuals.This workflow diagram is explained in detail in the following.The following notation is used in the next sections: Z refers to the observed SSH, T stands for tide, and R is the meteorological component predicted with ROMS (as illustrated in Fig. 1).

DECEMBER 19, 2023
Figure 5: Workflow diagram for validation and correction with the polar plots method (PP) in blue and the correction with the Machine learning method (ML) in yellow.Common for both methods (in red) is the data collection and preparation and the application of the weighted differences correction on the storm surge data.The ML part has several components.We start by selecting a subset of stations from which we extract the predictors and prepare labels and feature arrays.Then, we develop the NN models, design the architecture, select the features and tune the hyperparameters.Finally, we select the model with the best RMSE computed with test data.

Post-processing the storm surge predictions
The outputs from Nordic4-SS and NORA-SS are adjusted with the weighted differences correction method [5].The method relies on the observations of the previous five days and is applied before computing the residuals needed for validating and correcting the numerical model.At each location, the elements in the offset vector O represent the error from t − 120 to t − 1 hours: Then, the bias is computed as the sum of the weighted offsets, where the last observations have larger weights: As we show in the following, even after removing the bias computed with this weighted differences correction method, bias WD , there is a systematic error in the ROMS output.To further compensate for it, we learn these errors with two different data-driven methods: the traditional method and the NNs.

Residual framework
The residual errors are defined as the differences between what we expect and what is predicted, in our case, the observed and predicted storm surge.We define the residuals at the location s, corresponding to a node on the discrete numerical grid, and time t as: where R s,t is the output from ROMS, run either in forecast mode (Nordic4-SS) or hindcast mode (NORA-SS), at a given time and location, corrected with the weighted differences method by subtracting bias WD ; Z s,t is the total observed height; and T s,t is the tide estimated with harmonic analysis.All values are measured with respect to the official chart datum.Note that the residuals form a time series themselves.
In an ideal model, the residuals should be random fluctuations.Any structure in the residuals suggests that the original model is not perfect and could be improved.When the signal in the residuals is complex, it might be convenient to model the structure directly, i.e., model how the forecasting model will fail, to later remove the predicted residuals from the numerical model and improve its performance.To this end, we use NNs as a post-processing tool, training NNs on the signal in the residuals, which is a less complex task than predicting directly the full storm surge dynamics.It is also particularly convenient to model a less complicated signal in the light of short training samples and the limited in situ ground truth data available, which puts a limitation on the complexity of the NN model that can be used before overfitting.
Autoregressive models are traditionally used to model autocorrelated residuals, where the lagged errors are combined in a classical regression model.However, the fact that NNs are inherently nonlinear makes them better candidates for modeling complex data patterns than traditional methods.We use the time lag concept from the autoregressive models, but we combine it with a more flexible learning algorithm, NNs.That is, the input nodes of the NN consist of time-lagged variables.Although the individual values in the lagged variables are duplicated, the NN training process can assign unique weights to the vectors with lagged data for each variable while learning how past values influence future values.The forecasting performance of our algorithms is affected by the time lag selection, in addition to the model selection and setup.Therefore, it is essential to select the time lags carefully.We have verified with our data sample that, if the time window is too short, the model will not have enough information to learn the correlations in time.Contrarily, we see that a too large time-lag value results in irrelevant inputs and reduce the performance.In our experiment, the number of inputs, hence, the number of time lags, is limited by the length of the records.Using too many time lags results in a large model size compared with the size of the training dataset available, which leads to overfitting.A sensitivity analysis concluded that 24 hours of lagged values for each variable is optimal given the size of our dataset.By using this range of lagged variables we expect the NNs to model the highest autocorrelated part of the signals in the residuals and the semi-diurnal periodicity in the tide-surge interaction missing in the system.

Training, validation and test
When using data-driven algorithms to make predictions, it is common to split the data into a training and a test dataset.We fit the parameters to the training dataset.When finding the model with the best performance and adjusting the

Statistical bias correction
In the first part of this work, we show that the residuals in the storm surge model (Eq.5) depend on local wind conditions and suggest a traditional 2D polar plot method to analyze the joint dependence on wind speed and direction.In these plots, the wind direction is represented by the angular coordinate, whereas the wind speed is represented by the radial coordinate and increases with the radius (see Fig. 7).Then, we calculate each bin's average error, standard deviation, and number of observations.We divide the data into 2D bins instead of interpolating and plotting a continuous field, and keep only the bins with at least five observations.The optimal bin sizes for the datasets used in this work is 1 m/s × 10 degrees for the ERA5 wind data.Although polar plots can be used to correct the statistical bias in the storm surge model by computing the average residuals for each bin and then subtracting this bias from Nordic4-SS, they are most useful as a visual representation of the error in polar coordinates in a validation process.

Machine learning
The traditional statistical method involving polar plots can potentially be used to remove a part of the systematic error in the storm surge forecast.Nonetheless, it has two major drawbacks: 1) the performance is poor in the case of extreme events, due to the scarcity of rare events in the training dataset, and 2) because variables are correlated, it can only be used to correct the bias associated with two predictor variables, such as wind speed and direction, for one location at a time.One way of mitigating this problem is to use NNs.These are more flexible, nonlinear models with the ability to model complex relationships between inputs and outputs.
In order to reduce the bias in Nordic4-SS, we apply the residual method with NNs, i.e., we predict the residuals in the numerical model and then subtract them from the storm surge predictions.We apply this method to each station independently.The models have been implemented with the Keras library [48], for hourly lead times ranging from one to 60 hours.

Station selection
The Norwegian climate and storm surge conditions are affected by the country's geography.A long, intricate coast line, deep and narrow fjords, high mountains, and steep valleys are important factors to consider in prediction systems.Therefore, it is natural to use predictors from different stations depending on where we want to predict the residuals.However, the choice of stations is not trivial.In theory, for each of the 22 stations where we want to improve the forecasts, we could test the performance of the ML models using all possible combinations of predictor variables and stations, but the number of possible combinations means that a direct testing approach would require enormous efforts and computational resources.A more practical solution involves grouping the 22 stations and selecting a set of predictor stations for each group.The groups have been determined by performing the k-means algorithm for k = 3 on the storm surge data, and coincide with the physical-based groups shown in Fig. 2. The k-means method has been tested for different numbers of k.For a partition based on k = 4, the West Coast is divided into two groups of stations, north and south of Bergen, leaving a too small number of stations in one of the groups.In addition, for k = 5, Northern Norway is divided into two groups.On the contrary, when running the algorithm for k = 2, the stations are grouped according to their latitude, meaning that most of the stations on the West Coast are grouped with the stations in Skagerrak, which does not agree with the geography and the physics of the basins.In summary, the lowest number of clusters coherent with the geography and the physics is k = 3, and a higher number of clusters could be considered if we had more in situ observations.Moreover, when predicting the residuals at a given location, it is useful to provide the NNs with data from remote locations that contain information about the weather systems at a previous state.Therefore, to select the predictor stations for each of the three groups of stations, we consider how wave and weather systems propagate.We take, for instance, into account that tides are mostly generated in the Atlantic, before a part of the wave propagates along Britain and follows the coast in Northern Europe, reaching Skagerrak, and another part propagates to the Norwegian Sea and continues northwards.Weather systems typically move in a north-easterly direction, but in Southern Norway they can also move along a more zonal path.In addition, we see that stations located in the inner sections of the fjords have particularly complicated dynamics and, as such, the data from these stations are not good candidates as input to the NNs.Also, due to autocorrelations in the residuals, past observations and forecasts at the station where we want to improve the forecast are key predictors.Furthermore, for robustness, we conduct experiments to test the performance of the ML algorithms on a subset of possible combinations of stations.
In summary, each group shown in Fig. 2 has its corresponding set of predictor stations.All NN models include data from the station we are predicting at and data from 4 other stations, depending on which group they belong to.We use data from AES, BGO, VIK, and TRG for the stations in Skagerrak (if the station for which we predict the residuals is in this list, we add OSC to complete the set of 5 stations), AES, OSC, MAY, and SVG for the stations along the West DECEMBER 19,2023 Coast (if the station is among the predictors, we add BGO), and BOO, HFT, KAB and KSU for the stations in Northern Norway (if the station is among the predictors, we add ANX).

Data preparation
The NNs perform better on normalized data.Before training the models, we standardized all the data using the sample mean and the standard deviation computed with the training data.The opposite transformation is then applied to test the models by comparing the RMSE.We tested alternative normalization methods, but they were found to either have no impact on, or degrade, the accuracy of the NN.The gaps in the data are filled with the training sample's most frequent value, with scikit-learn's SimpleImputer class [49].At OSC, BGO, and ANX, this corresponds to 1.8 % of the total prediction dataset, after removing missing labels (8.5 %).If no repeated values are found, the algorithm selects the minimum of the dataset.

Model development
Once we have identified the model architecture we want to use, multiple hyperparameters must be specified before beginning the training process.There is no analytical way to determine the optimal values of these parameters.Instead, we rely on systematic experimentation and testing.However, the optimal hyperparameters will depend on the features selected and the architecture.
The residuals are estimated for one station at a time using input data from several locations, including the one where we want to predict.The number of predictor variables is limited by the length of the Nordic4-SS forecasts.Therefore, we have to carefully select the best candidates, discarding predictors that carry less value.The set of predictor variables consists of observations, Nordic4-SS predictions (initialized at t 0 and t 0 − 24) corrected with the weighted differences correction method (Eqn.4), tides, and 10 m wind forecast variables from MEPS (initialized at t 0 ).The maximum range of hours is from t − 24 to t + 60.Note that observations are only provided for the past to make the method forecast compatible.The diagram in Fig. 8 shows the period spanned by each variable relative to the analysis time.
We use a direct multi-step forecasting strategy that involves training a separate model for each forecast time.The architecture is that of a sequential model, consisting of a dense layer of 32 nodes, followed by a batch normalization layer, a dropout layer with rate 0.3, a dense layer of 16 nodes, another batch normalization layer, and a dropout layer with rate 0.4 (see Fig. 9).Thus, the number of nodes decreases with the layer number, from the first to the last one.The weights in all the dense layers have been initialized with the Glorot uniform initializer [50], and the nodes are activated according to the Rectified Linear Unit (ReLU) function.
The architecture of the NN models used to predict the residuals at each lead time is illustrated in Fig. 9.Each node in the graph represents a specific layer function, while the arrows represent the flow.This way, the NN graphs show the order of the layers, starting with the input layer on top and ending with the last dense layer (output) at the bottom.The nodes also include the input and output shapes of each layer.The number of predictors is variable, but if we train the models with observations, tide, storm surge, and wind from five different stations, from t − 24 to t + 60, after removing predictors with missing data the number of predictors for OSC is 1550 (as shown for the input layer in Fig. 9).Note that the number of observations, or samples, used to train the networks on each operation is variable.Because the NN operates on a batch of the input, the question marks in the graph are a placeholder for the number of samples.In this work, we have used a batch size of 32.
The Adaptive Moment Estimation (Adam) optimizer [51] is used for performing the training, with a learning rate of 0.001.The advantage of using this method is that it converges rapidly, and no manual tuning of the learning rate is needed.Moreover, we have chosen the Mean Square Error (MSE) as a loss function.In order to evaluate the model, we use the Mean Absolute Error (MAE).This metric is used to judge the model's performance, not in the training process.
When the metric has stopped improving for 20 epochs, the learning rate is reduced by a factor of 2. The lower bound of the learning rate is set to 0.0001.The maximum number of iterations is 500, but the training terminates when the loss does not improve over 50 epochs.

Results
Herein, we analyze the residuals in the numerical model at the Norwegian stations with simple statistical methods, and use NNs to learn these residuals in order to improve Nordic4-SS.We start by validating the numerical model and searching for a signal in the residuals.We find that the residuals are correlated in time and depend on the wind conditions.As will be shown, when learning these errors, ML algorithms outperform traditional methods.However, the more complex the algorithm is, the more difficult it is to interpret the results.We therefore believe that the polar plots are a convenient complement to the NN technique as they expose the physical relationship between the systematic Figure 8: Diagram of predictors used to train the NNs in forecast mode for lead times up to 60 hours.We use storm surge forecasts from Nordic4-SS generated at the analysis time, t 0 , but also forecasts from 24 hours before t 0 .We also use weather forecasts from MEPS generated six hours before the t 0 from lead time six to 66 hours, observations from the last 24 hours, and tide estimates for the last 24 hours up to t 0 + 60 hours.
error in the numerical model and the wind variables.Hence, they allow us to better understand the flaws of the storm surge model, and where it fails.Moreover, analyzing the statistics of the error in the numerical model through polar plots is a way to systematize the knowledge obtained through years of experience of the forecasters.Even though the methodology described above is independent of the location, there is an evident spatial variability in our results.We focus on the results obtained for three stations, each of them located in one of the three different Norwegian regions defined above (see Fig. the periods of spring and neap tides, that occur every two weeks (alternating), and the M2 tide which is on a 28 day cycle, providing evidence that the tides have an important influence on the residuals.

Representation of the residuals in polar coordinates: polar plots
In addition to this signal in time, we observe a dependence on local winds.Polar plots illustrate the variation of a magnitude in polar coordinates.

Polar plots in hindcast mode
The average and standard deviation of the residuals in the hindcast NORA-SS, as well as the total number of observations, conditioned on wind speed and direction from ERA5, are illustrated in Fig. 11.If we focus first on the radial coordinate, we can observe some similarities among the three stations, even though they are affected by different dynamics and geographical conditions.For instance, the predictions at all three locations are more accurate and exact when the wind is weak.The systematic error and the uncertainty tend to increase with wind speed in all of them, while, unfortunately, the number of observations decreases, leading to more random noise in the plots.Still, it is important to highlight that the magnitude of the wind speeds registered has large variations across the stations, and so has the bias.For instance, the bias and the standard deviation at 8 m/s are much greater at OSC than at BGO or ANX.OSC also has a lower density of observations at 8 m/s than the other two stations, corresponding with the local climate.
The dependence of the bias on the wind direction is a local characteristic with strong spatial variability.For example, the well-defined pattern at the station OSC indicates that when the wind blows approximately from the north to the southeast, the model overestimates the surges; contrarily, when the wind has an eastward component, the model underestimates the SSH.The pattern of the systematic error at OSC differs significantly from that at BGO and ANX.When comparing these stations, in addition to considering their geographical conditions, it is important to remember that they have different climates.Bergen is located on the West Coast of Norway, and is affected by a number of cyclones each season that push the water against the coast while the pressure is low.Oscarsborg, on the other hand, has a continental climate, and experiences calm to moderate wind conditions due to its sheltered location.Andenes is usually affected by higher tides and low-pressure systems generated near Iceland.These different conditions are well represented in our data.
Wind speeds are greatest at ANX, where winds of 24 m/s have been observed, followed by BGO, where the maximum records are about 15 m/s.In contrast, at OSC, all wind records are below 9 m/s.In summary, although the patterns of the mean error at the three stations are very clear, the dependence on the wind direction is completely different.This information has been compiled in lookup tables to serve as a guide for the meteorologist on duty.
We have already mentioned that NORA-SS contains many more samples than Nordic4-SS (66536 vs 2372), so it provides more robust statistics.At the same time, the hindcast is forced with reanalysis data (NORA3) instead of the atmospheric forecasts (MEPS) used to force Nordic4-SS.Hence, the hindcast can overestimate the real forecast skill and underestimate its variability.In this sense, the results obtained for the hindcast are idealized, as it represents conditions that can not be reproduced operationally.Still, the figures shown above, generated with the NORA-SS, strongly suggesting of a pronounced local systematic error in the ROMS model setup, even under idealized conditions.The conditioned statistics of the residuals are also computed for the forecast data (Nordic4-SS) for the period January 2018 to March 2021.We show the average and standard deviation of the residuals, and the number of observations at OSC in Figures 12a, 12b, and 12c, respectively.Given that the forecasts are available for a much shorter period than the hindcast and that we have 12-hourly data instead of hourly data for 0-lead-time conditions, we have aggregated the results into larger bins.For the forecast data, the bins have a size of 2 m/s × 30 deg.Furthermore, we plot the values when there are at least three observations in the bin.Even though the resolution is coarser compared to the hindcast polar plots, the overall patterns agree, confirming that the detailed structure in the hindcast is not misleading despite not having the same error distributions, which corresponds to our idea that the imperfections of the ROMS setup play a systematic role in the structure of the residuals.Moreover, these plots have a practical use in the context of the decision support system, as the coarser resolution facilitates the forecaster's decision-making.An interesting difference between the hindcast and the forecast polar is that the forecast shows much greater wind speeds.A possible explanation is that ERA5 has a coarser resolution that MEPS, and shows average values over grid cells.Also, MEPS has been developed for Scandinavia and the Nordic Seas, while ERA5 is a global dataset.A comparison of ERA5 and MEPS is beyond the scope of this study, although it would help to clarify this difference.

Polar plots in forecast mode
Similar polar plots have been generated for significant wave height in the radial coordinate and mean wave direction in the angular coordinate (examples are provided in Appendix E).We see that the dependence of the residuals with these wave parameters is in line with the results obtained for wind speed and direction.The residuals also depend on the local pressure.By binning the error at intervals of 5 hPa, we see that the numerical model overestimates the meteorological component when the pressure is low and underestimates it when the pressure is high (not shown here).This is a general result valid for all the stations.Nevertheless, it must be interpreted with caution, as the domain regions with the highest and lowest pressure values observed also have fewer observations and therefore the highest uncertainties.

Correction of the numerical storm surge model
From the results shown above, it is clear that the residuals in the numerical storm surge model are correlated in time and depend on the meteorological conditions.We have tested two residual techniques: one based on polar plots and one based on NNs.At the first attempt, we learned the residuals in the hindcast and tried to use this bias to correct the forecast data.Unfortunately, we then discovered that the error distribution in the hindcast and forecast data are different enough to make such transfer learning ineffective when using NNs, leading to higher RMSE.Therefore, we focus only on correcting the operational model currently used by MET Norway (Nordic4-SS), using forecast-compatible datasets that will allow us to operationalize the correction process in the future.

Bias correction with polar plots
The polar plots in Figs.11 and 12 show an evident structure in the systematic error, i.e., a dependence on both wind speed and wind direction at all locations.This is the error that we want to correct using data-driven models.Nonetheless, directly removing the bias observed in the polar plots from the Nordic4-SS forecasts does not lead to a meaningful enhancement of the model; only a few millimeters of improvement are obtained, which is less than the estimated wave gauge measurement error.We see, however, that the model's performance is sensitive to the periods chosen and the size of the bins, which we interpret as a consequence of using short forecast time series.
Experiments conducted in hindcast mode, using longer time series from NORA-SS, show that the polar plots, in fact, have the ability to reduce the RMSE at some locations.For instance, at TRG, the relative improvement after removing the bias in the polar plots is of 4%.Meanwhile, at OSL and VIK the improvement is of 3%.In spite of that, the method is inefficient when extreme weather occurs and uncertainties are high.This is because the method consists of subtracting the mean bias computed for each bin and, even in the hindcast, there are not enough situations represented in the bins that correspond to severe storms.

Residual correction with Neural Networks
We model the nonlinearities in Nordic4-SS residuals with NNs, training one model (same architecture but different predictors) for each station in mainland Norway, and we run it for every lead time from t + 1 to t + 60.Then, we subtract the learned residuals from Nordic4SS.Although the residuals also depend on wave and pressure conditions, experiments indicate that wind speed and direction are the most important predictors and that providing wave and pressure data in addition to wind data to the NNs does not improve the results.For this reason, the ML algorithms presented here are trained with wind data in addition to storm surge predictions, tide data, and past SSH observations.Fig. 13 exposes the spatial variability in the RMSE, bias, and standard deviation of the residuals as a function of lead time from t + 1 to t + 60.The figure shows Nordic4-SS forecasts (dashed lines) and the NN-corrected forecast (solid lines) at three locations: OSC (blue), BGO (orange), and ANX (green).The RMSE in the operational storm surge data typically increases with lead time at all stations, but the steepness of the curve depends on the station considered (Figs.13a, 13b, and 13c).The RMSE at OSC is reduced for all lead times by approximately 1 cm.If we decompose the RMSE into the bias and standard deviation, we see that the bias is the smallest component, and it is mostly positive after correcting with NNs.It is unclear why the bias oscillates, but we can see that after correcting with NNs these oscillations are out of phase with the bias in Nordic4-SS.If we look at the results from BGO, we see that the performance of the NNs is poor and almost does not provide any improvement, at least for the first 24 hours.Remember that BGO is one of the stations located further west in Norway, and that cyclones often move to the east or northeast.For this reason, we interpret this result as a lack of metocean information from remote western locations, affecting the NNs capacity to learn and anticipate the atmospheric conditions at BGO.The bias at BGO is also increased after applying the post-processing method.Turning now to the results at ANX, we observe that the NNs improve the results and that this improvement increases with lead time, from 0.5 cm at t + 10 to 1 cm at t + 60.The bias is also reduced for almost all the lead times.Moreover, 12-hours oscillations are observed in the curves in Fig. 13.These oscillations can be attributed to a contamination of the tidal component [5] .

Spatial distribution of the residuals in the numerical model and relative improvement
The NN residual method can be applied at any location where both predictions from Nordic4-SS and water level measurements are available.The first row of Fig. 14 shows the RMSE of the Nordic4-SS forecasts at all the 22 stations for t + 1, t + 24, and t + 48.We see that the RMSE is lowest on the West Coast, and highest in Skagerrak and in the inner parts of the fjords.Furthermore, it increases with lead time.The second row of Fig. 14 shows the percentage improvement in RMSE after correcting Nordic4-SS with NN for lead times t + 1, t + 24, t + 48.At t + 1, the stations in Skagerrak show most improvement (36% at OSC).As lead time increases, the NNs show a better performance in Northern Norway.Note that HVG and VAW have different characteristics than the other stations in Northern Norway, which is reflected in the results.The poorest performance of the NNs is observed in Western Norway, where the storm surge values and the error in Nordic4-SS are lowest.

Application of the ML correction method to a storm surge event
We have selected two storm surge events to illustrate the improvements in the surge predictions in forecast mode after applying the ML correction for a lead time of one hour.From a historical perspective, these events might not be the most extreme, but they were the only two events in OSC that registered storm surges above 60 cm in the short test period (April 2020 to March 2021).Still, they provide an indication of the performance of the models when SSH values are anomalous high.
The first event was registered on November 21-23, and was associated with a low pressure system that originated south of Iceland and traveled to Northern Norway, causing strong westerly winds in Southern Norway.Due to this event, MET Norway had to issue warnings of high water levels (with an uncertainty of 5-10 cm) and strong winds.The consequences associated with the sea-level warning were local floods and risk of small damage on infrastructure and buildings along the coastline.The event is illustrated in Fig. 15a.This figure was constructed by taking the values of lead time t + 1 associated with seven forecast runs.Note that each Nordic4-SS forecast and, subsequently, each ML prediction is generated every 12 hours.We see that the predicted storm surge was above 80 cm, overestimating the actual event that resulted in a storm surge of about 70 cm.At the peak of the event, the error in Nordic4-SS is -12 cm (Fig. 15b), slightly outside the uncertainty range, but after correcting with NNs the error is reduced to 4 cm.Moreover, the wind speed and direction predicted with MEPS for the peak of the event were of 4 m/s and 260  Figs.15c and 15d show the storm surge and residuals, respectively, of the event caused by storm Bella on December 26-27, 2020.A large area in the North Atlantic was affected by a low-pressure system that originated when an intrusion of cold air from Canada met relatively warm air over the Western Atlantic.The Jet Stream helped develop and deepen the low-pressure system before it continued its track to the North Sea.While interacting with the Azores high, it generated strong winds and heavy rains.The storm had an enormous impact across the Norwegian territory, forcing MET Norway to issue 62 warnings because of strong winds, floods, and land-and snow slides.We can see that the contribution of the storm surge to the rise in SSH was of 60 cm on December 27 at 00 hours at OSC. Nordic4-SS predicted the rise in SSH associated with the storm, but the fall was predicted a couple of hours before it actually occurred.Consequently, the numerical model underestimated the maximum levels by more than 20 cm.On the other hand, we see that the residual learning method can detect the delay, reducing the error by more than 10 cm.Unfortunately, for longer lead times, the NNs do not perform equally well for this extreme event.This shows that our correction technique can be used for short term updates to the predictions issued at the reference stations.In addition, this is an event where the forecasters could have used the polar plots to adjust the forecasts.The wind speed and direction forecasted with MEPS for the peak of the event at OSC were 12 m/s and 284 • respectively.For these values, the polar plots indicate that Nordic4-SS typically strongly underestimates storm surges, which was the case.

Summary and Discussion
The aim of this study is to improve Nordic4-SS, the numerical storm surge model that runs at MET Norway.First, we show that the model has systematic errors that depend on atmospheric conditions and the geographical location of the station considered.As part of the validation process, we represent the dependence of the error in the model relative to the local wind speed and direction in polar coordinates.We also find that the residuals are significantly autocorrelated.Put together, these results indicate that the residuals in the numerical model are not random and that there is a structure that could be learned using data-driven models.
Given that the numerical model already has a good performance, that decades of development have been dedicated to understanding the physical processes, that it is generally very reliable, and that the meteorologists on duty are trained to use this model, we consider it most adequate to reduce the errors in this model using a post-processing approach, rather than training a new data-driven model to compete with Nordic4-SS.From a practical point of view, the methods proposed are particularly convenient because they run on top of MET Norway's model, meaning that they can be readily used in practice and that it is possible to compare the current predictions to the corrected values.Another advantage of this post-processing method is that the forecasters need no knowledge of ML to apply the result.Furthermore, although correcting a physical model with a coarse resolution can reduce the computational cost of running a numerical model [11], our experience from this study indicates that it is also computationally possible to run a physical storm surge model with the required resolution for operational purposes and make the ML correction on top of it to improve its quality.The additional computational time spent on correcting Nordic4-SS with the NNs is in the order of seconds per station.
When computing the polar plots, we observe that the error and the uncertainty tend to increase with wind speed at all locations.The fact that the error in the storm surge model is big when winds are strong, has naturally strong implications for the prediction of extreme surge events, which are the ones we are most concerned about because they can cause the most damage.The dependence of the bias on wind direction, on the other hand, is a characteristic of each station.Overall, the patterns observed in the polar plots are in line with the experience-based intuition of the forecasters and help build confidence in our correction methods.We explored the possibility of using polar plots to correct the operational numerical model by removing the bias conditioned on local wind speed and direction from the storm surge forecasts.However, we were not able to obtain a meaningful enhancement of the forecasts, and only about 3 − 4% improvement at some locations for the hindcast.The strengths of this method rely on the fact that it is possible to gain physical understanding of the local bias.A drawback is that we cannot simultaneously apply the method to correlated variables, which means that we cannot subtract the bias computed for different variables because we would end up removing the correlated part of the bias twice.The poor performance of the method, and the fact that we cannot correct the error associated with multiple variables, along with the nonlinear nature of the problem, motivates the use of NNs for bias correction of the numerical storm surge model.
Thus, the second method developed for improving the numerical storm surge model consists of learning the residuals at each Norwegian water level measurement station with NN models.With this method, we managed to improve the skill in Nordic4-SS for lead times up to 60 hours at several of the permanent Norwegian stations, in particular those located in Skagerrak and Northern Norway.The models have been validated by comparing the RMSE of the current forecasts based on ROMS against the RMSE of the corrected values.For instance, the percentage improvement at OSC is of 36% for t + 1 and 9% for t + 24.Although the NNs outperform the polar plot method, it must be taken into account that NNs are complex nonlinear methods that involve numerous computations and, as such, are harder to interpret.This might cause hesitation about applying the methods operationally.However, we believe that using this as a post-processing method and allowing the forecasters to compare the residuals estimated with NNs to the polar plots, will make the method more likely to be adopted operationally, at least as an aid to the forecasters in producing their analysis.
To illustrate the applicability of the NN correction method, we analyzed the forecast of the storm surge events above 60 cm in the test dataset.We found two events, one where Nordic4-SS overestimates and one where it underestimates the water levels.A comparison of the Nordic4-SS forecast and the corrected values for the event caused by storm Bella showed a reduction in the error of about 25%, or 10 cm.Although the improvement was computed for t + 1, it demonstrates the benefits of applying the residual NN method for predicting storm surge events.It is important to remember that the model was trained in forecast mode with only two years of data.
Despite the ability of the NN to correct the storm surge predictions, there were some unexpected constraints in the correction process.First, we learned the relevance of using datasets that are continuous in time and rely on the exact same model setup for the whole period.For instance, we could not transfer learning using NNs from the hindcast to the operational model, as the original intention was, because of slight differences in the datasets.This was a problem even though both datasets were generated with the same modeling system, ROMS, with the same setup and bathymetry and parameterizations.A possible explanation for why we did not succeed in transferring learning is that the output from ROMS relies on the atmospheric forcing and how the model is initialized.The hindcast is initialized once a year, and uses all the data available a posteriori over the whole year to be optimized.In contrast, the operational model is initialized every 12 hours and, naturally, uses only data that are already available by the time the forecast is run.Since we were not able to apply transfer learning from the NORA-SS to Nordic4-SS, we had to split an already short forecast sample to train and test the NNs.In other words, we had much less data to correct Nordic4-SS than we initially thought.It must also be mentioned that enough data is essential not only for the training process but for testing the results, because the RMSE is sensitive to the number of events in the test sample, which varies from year to year.
All the experiments conducted were based on the residual learning framework, which consists of learning the deviation of the predicted storm surge values from the observed values, instead of estimating the actual SSHs as previous studies do [e.g., 20,9,22,21].As such, the targets are the residuals, where the ROMS output has been corrected using a weighted differences correction method prior to the computation of the error.The residual learning technique has been applied for short-term predictions before [e.g.21].However, comparing the results with previous findings in the literature is not straightforward because 1) not all studies use the RMSE to evaluate the prediction skill of their models, DECEMBER 19, 2023 2) even when the same metric is used, we have seen that the RMSE and the relative improvement for data corrected with the same method have a strong spatial variability, 3) when data are scarce, the RMSE is also sensitive to the period chosen to train and test the models, and 4) a reduced number of studies model storm surges for lead times longer than a day.For this reason, we assessed the performance of the ML models by comparing the corrected data to Nordic4-SS predictions without ML correction.
We want to emphasize that data-driven methods are highly sensitive to the number of samples provided.Hence, we expect an improvement of the results when training the algorithms with longer time series.Further improvements could involve fine-tuning the parameters in the NNs.For example, the polar plots illustrate how the bias depends on the location of the station.This implies that we need to train one NN for each station.In this work, however, the predictor variables were chosen based on experiments conducted to improve the residuals at OSC at t + 1, and the stations these predictors were selected from are determined for each of the three Norwegian regions, not individual stations.We believe that a natural progression of this work would be to explore the advantages of selecting a different set of stations and predictor variables to correct the residuals at each location and lead time.Furthermore, we could optimize the architecture of the NNs for each location and lead time instead of running the same models.If we had more samples, we could also explore adding more variables, for instance, weather forecasts generated in the past, or more lags, without the risk of overfitting.We could also experiment with using inputs from a different weather model, with longer lead times than MEPS, which could allow us to correct the complete storm surge forecasts until t + 120.Finally, it would be interesting to study extreme events and nonlinear interactions of the residuals with tides when winds are strong, and when tides are high.
To summarize, this study has shown that applying a data-driven methodology for reducing the residuals in Nordic4-SS will positively impact the efficiency of warning systems and the response to storm surge events at many Norwegian stations.The methods developed are particularly convenient because they can, with minimal cost and effort, be adopted as a post-processing tool of the operational storm surge model without changing the current procedures or setup of ROMS.Given that the ML models are trained on data that are available when making the predictions, the computational demand is manageable, and as the parameters are already optimized, they can efficiently run on top of MET Norway's predictions.Moreover, this study sets a precedent for successful bias correction with NN residual learning of an operative storm surge model and describes a simple methodology that can be applied to any numerical model, also at global scale.This data-driven method, which can be seen as a post-processing method, does not interfere with the process of improvement of the numerical model itself, but has the advantage that it is faster to develop.Therefore, it can be seen as a flexible solution for correcting errors related to the numerical models' missing physics, setup, forcing, etc.The performance of the NN method is limited by the amount of observations, but as long as the numerical model is not perfect, the present study indicates that it would be beneficial to apply this correction technique.
and sea level have an inverse relationship, denominated Inverse Barometer Effect (IBE).The IBE consists of a rise of the water level in the presence of low air pressure or vice versa.However, the sea level does not change instantaneously owing to the need to move water masses and the inertia in the whole ocean system, but responds to the average change in pressure over a larger area.As a general rule, if the air pressure drops by one hectopascal (hPa), the water level rises by one centimeter, accordingly to what hydrostatics would predict.More formally, atmospheric pressure can be transformed into an equivalent inverse barometer SSH, η ib , as expressed by the following equation: where g is the local gravity, ρ water is the water's density, p atm is the atmospheric pressure, and p ref is the reference atmospheric pressure usually set to 1013 hPa.
Air-sea interaction is not limited to the IBE.Consider the situation in which the wind blows over the ocean.The air, which moves more rapidly than the water, produces a shear stress parallel to the sea surface, transferring energy and momentum.How the wind stress affects deeper layers depends on for how long the wind blows, the strength of the turbulent coupling between the ocean and the atmosphere, the Coriolis effect, and the stratification of the water column.
The effect of winds on SSH is inversely proportional to the water depth.It is, therefore, more important when the wind blows over an extended shallow region.The magnitude of the turbulent wind stress, τ , is often parameterized as a function of the wind speed at a certain level, U h , and the air density, ρ, in the following way: where C D is a dimensionless wind drag coefficient.In addition, in the presence of a boundary in a rotating system, the wind stress parallel to the shore will cause an Ekman flow perpendicular to the coast.If the Ekman flow is directed towards the coast, water will be piled up [3].Earth's rotation causes winds to move toward the right in the Northern Hemisphere, such that the largest surge will be in the right forward part of the storm in this hemisphere, due to the Coriolis effect.Meanwhile, the balance between frictional forces due to wind stress and the Coriolis force will drive surface currents at an angle to the right of the wind direction in the northern hemisphere (the exact angle will vary with the turbulent properties of the fluid but will typically range from 15 − 30 • ), the well-known Ekman current.The Ekman transport (the integral over the vertical dimension) will point 90 • to the right of the wind stress vector [3].Consequently, when the Earth's rotation bends the currents into a more perpendicular direction with respect to the shore, it can amplify the surge.When the surge arrives on the Norwegian coast, it is primarily winds from the south and west that create an excess of water along the coast [53].
The conservation of momentum also involves the process of momentum transfer from wind-generated waves.Applying a linear approximation for a steady state and a sea surface slope in equilibrium with a constant wind field, the surge height ζ due to wind-driven forces can be understood with the following relation [e.g., 54, 1]: where τ s is the wind stress at the sea-air interface, g is the gravitational acceleration, h is the depth of the water, and W is the shelf width.In this approximation, we have not considered that the depth is variable.
Furthermore, in rotating fluids, Kelvin waves are a solution to the hydrodynamic equations with vanishing meridional velocity normal to a lateral boundary [3].These waves can, in theory, exist at all frequencies, also at the time scales of storm surges.Furthermore, Kelvin waves can only move along a coast in one direction, with the coast on the right in the Northern Hemisphere.A particular characteristic of Kelvin waves is that they can be generated by surface wind forcing close to the shore, tidal forces, or reflection of other waves incident on a coast.This means that strong winds in a remote location can generate long waves that travel along the coast, causing the water to rise even though the local winds are calm.For example, in the North Sea, winds can cause a Kelvin wave that propagates anti-clockwise from eastern Britain and eventually lead to high water levels on the Norwegian coast.The propagation of Kelvin waves is slower in shallow seas.In the North Sea, which has an average depth of 50 m, a Kelvin wave that originated in England can reach Norway in about a day [55].
Despite the fact that the effects of the atmosphere on the SSH are mainly due to wind stress and atmospheric pressure gradients, SSH departures associated with storm surges depend on a wide range of parameters.Some important factors are the storm intensity, forward speed, size, angle of approach to the coast, central pressure, as well as the shape and characteristics of coastal features [56].For instance, a storm surge will be greater in regions with a shallow slope of the continental shelf than in regions with a steep slope.Furthermore, a narrow shelf with relatively deep waters is associated with lower surges but higher and more powerful waves.The opposite is true for narrow shelves with shallow waters.Bays are particularly vulnerable because storm surge water is funneled in.In addition to the storm's characteristics and the topography, the SSH might be affected by nonlinear effects, such as the bottom friction that removes energy from the motion, finite water depth, flow curvature, and tide-surge interactions [1].Furthermore, the rain effect can also contribute considerably to rising sea levels in estuaries.Heavy rains can cause surface runoff which can quickly flood streams and rivers, increasing the water level in estuaries.These effects are challenging to capture accurately in numerical models and introduce systematic biases in model outputs.
As mentioned before, surges are mainly generated by wind stress and low-pressure systems.The analytical expressions in Eqns.8 and 9 are helpful for the physical interpretation of the processes involved.Even so, the actual response of the sea to the weather, in the presence of irregular boundaries and variable depths, is more complex and cannot be fully described by these equations.This limitation motivates the use of advanced numerical models and NNs, which are nonlinear models, to model the meteorological component of the SSH variations.Moreover, minor variations in weather patterns might result in very different responses, in particular in water bodies with tendencies for resonances and oscillations [1].In this work, we use data from stations located along the Norwegian coast to develop a post-processing correction method of Nordic4-SS.
The numerical modeling system used in this paper, ROMS, runs in barotropic mode.Thus, the governing equations are the shallow water equations [28].If the total height of the fluid column is h = H + ζ, where the H is the equilibrium depth and ζ is the sea surface deviation, we can integrate the velocity between H and ζ to obtain the volume flux through a fluid column U.Then, the shallow water equations in flux form are: and where f is the Coriolis parameter, ρ 0 is the sea water density, τ s and τ b are the wind and bottom stress, respectively, and X is the internal mixing.

D Neural Networks fundamentals
Machine Learning (ML) models exploit computers' capabilities of learning from past experiences (the input data) in order to make predictions.In this paper, we perform regression tasks, i.e., we predict the value of a continuous variable.These algorithms fall into the category of supervised learning because the models are trained with both features (input data) and labels (output data).For this, we use Neural Network (NN), more specifically Deep Neural Network (DNN), a subclass of ML algorithms that use multiple layers to iteratively extract information from the training dataset.In each iteration, the signal travels from the input to the output layer, passing through the intermediate hidden layers.DNN have the capability of modeling complex nonlinear relationships and are therefore suited for the problem we want to solve in this paper.A NN has several components; all layers are conformed by nodes where the computation happens.
The input data to each layer is combined with coefficients, also called weights, that either amplify or dampen the input.An activation function will then transform the weighted sum of the inputs to the neuron and pass on this information to the next layer or provide the predicted residual for the last layer.This way, different layers are able to apply different transformations to their inputs.
When we train a NN, we adjust the weights to minimize the loss function, which traduces to reducing the MAE or the MSE: The training starts from random parameters updated for each learning iteration (epoch) to minimize the cost.The learning rate is one of the key parameters we have to tune; it defines how quickly we move toward a local minimum.

DECEMBER 19, 2023
A too large learning rate can overshoot, while a too small learning rate will require more iterations.Other important training parameters to consider in the design of the models' architecture are the number of layers and nodes per layer.However, it may only be feasible to test some combinations of parameters due to computational cost.
A popular metric for measuring the model performance is the Root Mean Square Error (RMSE): The MSE can in turn be decompose into a bias and a variance component as follows: We aim to fit the input data by adjusting the learnable parameters in a model.Ideally, we want the models to capture the regularities in the training data and generalize well to unseen data.A model with high variance is characterized by high complexity and tends to overfit.In general, these models work well on the training data but fail to generalize on unseen data and therefore have a high test error.On the other hand, a model with high bias is too simple and unable to learn complex features.As a result, it underfits the data, fails to learn how to train the data, and has high training errors.The issue of overfitting can be addressed by: a) reducing the number of features (input data), b) using regularization, and c) early stopping.In this paper, we have used the three methods mentioned above.We have carefully selected the number of stations, variables, and range of hours used to train the model instead of using all possible variables.We have also used the dropout regularization method, which consists of randomly dropping out nodes during training.In addition, if no improvement is shown in the performance metric, the training will stop after 50 iterations.

D.1 Number of predictors vs. number of samples
In the field of ML, the datasets are usually structured as tabular data, either in the form of arrays or dataframes.Most columns represent the predictor variables (also called input variables or features), while one column often represents the output (or labels).On the other hand, the rows are often referred to as samples (also called observations, records, or instances).Most ML algorithms assume that the number of predictors (p) is much smaller than the number of samples (n): p ≪ n.Therefore, as the dimension, p, increases, we need more samples to successfully represent the domain.
Moreover, nonlinear models with higher flexibility and variance, like NNs, depend more on the samples provided.For this reason, they require more data in the training process.Another factor that determines the amount of data needed is the complexity of the underlying function to learn.This is, we need enough data to capture the relationship between the predictors and between the predictors and the labels.
Our study aims to predict the residuals in the numerical storm surge model, which is a nonlinear problem.We see that NNs perform better than polar plots, but, as discussed, NN are complex models that need more data than traditional statistical methods.Unfortunately, the number of samples in Nordic4-SS is limited, as the most recent version only has been run since 2018.The number of predictors is also limited, but it can quickly grow due to the number of stations, variables, and lagged hours.Despite the use of domain knowledge to reason about the data that may be needed to capture the complexity of the problem, feature selection is not a trivial task.In the following, we provide some numbers to illustrate the challenges associated with working with a small dataset.
Given that the labels are the residuals in Nordic4-SS, the number of predictions available from Nordic4-SS limits the number of samples in our problem.Nordic4-SS runs twice a day, meaning that, if no data is missing, from January 2018 to March 2021, the number of samples is (365 × 2 + 366 + 90) × 2 = 2372.Remember that we must split the data and leave some samples for testing.Due to seasonal variations, we have set apart an entire year for testing, resulting in a maximum of 1642 samples for training and 730 samples for test.When it comes to the predictors, we have explored the possibility of using observed total water level, tide estimates, storm surge predictions (also predictions generated in the past), pressure, wind, and wave data.If we limit the hour of past data to 24 hours before the analysis time and a lead time of 60 hours, we could potentially add data from 8 variables from the 23 permanent locations for 24 hours (23 × 8 × 24 = 4416 predictors), and for seven variables for +60 hours (23 × 7 × 24 = 3864 predictors).In addition, we can add forecasts from Nordic4-SS and MEPS generated 12 hours and 24 hours before the analysis time.As we see, the number of possible predictors is much greater than the number of samples and, because most of them a priori contain relevant information for the prediction of the residuals, it is not trivial which ones should be included in the ML models.To limit the number of predictors, we reduced the number of stations from which we extract the parameters we will use to train the models, from 23 possible stations to 5, as explained in section 3. We also decided not to use past weather forecasts, and after running different experiments with winds, pressure, and waves, we discarded the pressure and wave data, as these do not improve results.The optimization learning curves for t + 1 at OSC are illustrated in Fig. 16 and show how the algorithms learn incrementally from the data, as the error (MAE) is reduced with the number of iterations.The training curve provides an idea of how well the model is learning, while the validation curve indicates how well the model generalizes to previously unseen data.Therefore, we expect the training error to be lower than the validation error, as in Fig. 16.
Even though both the training and validation error decrease to a stability point, and the validation error is expected to be larger than the train error, a too wide gap between the lines could indicate unrepresentativeness due to a small training dataset.We will, however, show that the NNs can improve Nordic4-SS.Moreover, when lead time increases (not shown), the gap between the curves increases and the convergence occurs for a smaller number of epochs.As expected, different stations have naturally different learning curves.

E Polar plots conditioned on wave parameters
Here, we show two examples of polar plots of the average residuals constructed for significant wave height in the radial coordinate and mean wave direction in the angular coordinate from ERA5 (Fig. 17).Fig. 17a shows that Nordic4-SS underestimates storm surges at BGO for waves with mean direction 160 • − 280 • , and overestimates for directions between 290 • − 360 • .On the contrary, at ANX, Nordic4-SS overestimate surges with mean direction between 120 • − 300 • , and underestimates for the other directions, as seen in Fig. 17b.We can also observe that significant wave heights are higher at ANX compared with BGO.While the patterns in these figures appear clearly, we do not see any improvement by including both wind and wave predictors in the ML model, presumably because waves are strongly correlated to winds as they are the result of the combination of wind and fetch.Since the wave data from ERA5 is not available at all locations, we decided to use the wind parameters as predictors.

Figure 1 :
Figure1: Illustration that shows water level differences for storm surge and a normal (predicted) high tide as compared to mean sea level.Z it the observed total water level, T is the predicted tide with harmonic analysis, and R is the storm surge predicted by the physical model.

Figure 3 :
Figure 3: The model domain of Nordic4-SS.The domain has a polar stereographic projection covering the area from Bretagne to Novaya Zemlya, and between Norway and Greenland.Note that the Baltic Sea is not included (masked out) in the model domain.The horizontal model resolution is 4 km.The colors indicate bottom depth in meters, and the contour lines are plotted at 100, 500, 1000, 2000 and 3000 meters.Figure reproduced from "A forecasting and warning system of storm surge events along the Norwegian coast" [5].

Figure 6 :
Figure 6: Train-test split of the original dataset for the polar plot (PP) method and the ML method.In the case of the PP method, the training dataset consists of the hindcast data spanning the period January 2018 -December 2017, whereas the test dataset consists of two years of data, from January 2019 -December 2020.The ML corrects the Nordic4-SS predictions.In this case, the training dataset consists of 70% of the data in the Nordic4-SS data in the period January 2018 -March 2020, whereas the remaining 30% is held out for validation.The test dataset covers the period April 2020 -March 2021.

Figure 7 :
Figure 7: Schematic polar plot representation, where the ρ represents the radial coordinate (e.g., wind speed), θ represents the angular coordinate (e.g., wind direction), and the yellow square is the binned data (e.g., average residuals).

Figure 9 :
Fig.10shows time series, histograms, and autocorrelation plots of the residuals in Nordic4-SS at each of the three chosen locations.Each color represents a station.Although the magnitude of the residuals is different at the three stations, they all show errors larger than 10 cm and a pronounced seasonal cycle.Furthermore, predominantly negative residuals indicate that Nordic4-SS overall overestimates the meteorological component.In the autocorrelation plots, the shaded areas are delimited by the confidence intervals, and values outside these bands are considered significantly autocorrelated.Note that the lags are of 12 hours because the residual time series are 12-hourly.The autocorrelation plots indicate significant non-randomness in the residuals at all three locations and, thus, a signal that has the potential to be corrected.Although the three stations have different autocorrelation patterns, they all show significant autocorrelation the first 10 days, with two longer-period peaks around two weeks and just before one month.These peaks coincide with

Figure 10 :
Figure 10: Statistics of the residuals in MET Norway's operational storm surge model (Nordic4-SS).The residuals have been computed as the difference between the observed and predicted meteorological component, where the predicted meteorological component is the output from Nordic4-SS corrected with a weighted differences correction method.The panels show the time series of the residuals at a) Oscarsborg (OSC); b) Bergen (BGO); and c) Andenes (ANX); histograms of the residuals at d) OSC; e) BGO; and f) ANX; the autocorrelations in the residuals up to 60 days at g) OSC; h) BGO; and i) ANX.Note that the figures were constructed with 12-hourly data from the period January 2018-March 2021.

Figure 11 :
Figure 11: Polar plots of the statistics in MET Norway's storm surge hindcast (NORA-SS) conditioned on wind speed and direction.The residuals have been computed as the difference between the observed and predicted meteorological component, where the predicted meteorological component is the output from NORA-SS corrected with a weighted differences correction method.The panels show the average residuals at a) Oscarsborg (OSC); b) Bergen (BGO); and c) Andenes (ANX) (red colors indicate an underestimation and blue colors indicate an overestimation by the hindcast); the standard deviation of the residuals at d) OSC; e) BGO; and f) ANX; the number of observations at g) Oscarsborg (OSC); h) BGO; and i) ANX.The bins are defined as boxes of size 1 m/s × 10 deg.In figures a) to f), only bins with at least five observations are colored.The figures were constructed with hourly hindcast data from the period 2000-2019.Note that the colorbar saturates for the highest values in panel a.

Figure 12 :
Figure 12: Polar plots of the statistics in MET Norway's operational storm surge model (Nordic4-SS) conditioned on wind speed and direction.The residuals have been computed as the difference between the observed and predicted meteorological component, where the predicted meteorological component is the output from NORA-SS corrected with the weighted differences correction method.The panels show a) the average residuals at Oscarsborg (OSC), where red colors indicate an underestimation, and blue colors indicate an overestimation by the hindcast; b) the standard deviation of the residuals at OSC; and c) the number of observations at OSC.The bins are defined as boxes of size 1 m/s × 10 deg.In figures a) to f), only bins with at least three observations are colored.Note that the figures were constructed with 12-hourly 0-lead-time data from the period January 2018-March 2021.Note that the colorbars saturate for the highest values.

Figure 13 :
Figure13: Statistics of the residuals in the operational storm surge model (Nordic4-SS), before (dashed line) and after (solid line) the ML correction.The residuals have been computed as the difference between the observed and predicted meteorological component, where the predicted meteorological component is the output from Nordic4-SS corrected with a weighted differences correction method.The panels show the RMSE of the residuals at a) Oscarsborg (OSC); b) Bergen (BGO); and c) Andenes (ANX); the bias in the residuals at d) OSC; e) BGO; and f) ANX; the standard deviation of the residuals at g) OSC; h) BGO; and i) ANX.Note that the figures were constructed with 12-hourly data from January 2018-March 2021.

Figure 14 :
Figure 14: The first row of panels shows RMSE of the residuals in MET Norway's operational storm surge model, Nordic4-SS, for lead time a) 1 hour, b) 24 hours, and c) 48 hours at all the 22 permanent harbors in mainland Norway.The second row of panels shows improvement in RMSE after applying the residual NN correction at a lead time of d) 1 hours, e) 24 hours, and f) 48 hours.The RMSE has been computed using only one year of data, from April 2020-March 2021.Note that this is a very short period consisting of only 631 records and that the RMSE is sensitive to the period chosen.

Figure 15 :
Figure 15: Storm surge and residuals of two events at the Norwegian harbor Oscarsborg (OSC).Panel a) shows the observed storm surge with a dashed-dotted red line, the predicted storm surge from Nordic4-SS with a dashed black line, and the storm surge from Nordic4-SS corrected with NNs with a solid blue line, for the event on November 21-23, 2020.Panel b) shows the residuals in Nordic4-SS predicted with neural networks with a dashed-dotted purple line, the residuals in Nordic4-SS with NNs with a dashed black line, and the residuals from Nordic4-SS after correcting with NNs with a solid blue line, for the event on November 21-23, 2020.Panels c) and d) are equivalent to a) and b), but for the event on December 27-28, 2020.

DECEMBER 19, 2023 Figure 16 :
Figure 16: Mean Absolute Error (MAE)[m] as a function of epoch.The orange line shows the optimization learning curve for the training and the blue line the validation curve.The curves are obtained for a NN that learns the residuals in MET Norway's storm surge model, Nordic4-SS, at the station Oscarsborg (OSC) for lead time t + 1.The training data corresponds to the period January 2018-March 2020 and test data to the period April 2020-March 2021.

Figure 17 :
Figure 17: Polar plots of average error in MET Norway's storm surge hindcast (NORA-SS) significant wave height and mean wave direction.The residuals have been computed as the difference between the observed and predicted meteorological component, where the predicted meteorological component is the output from NORA-SS corrected with a weighted differences correction method.The panels show the average residuals at a) Bergen (BGO); and b) Andenes (ANX).The bins are defined as boxes of size 1 m × 10 deg.Only bins with at least five observations are colored.Red colors indicate an underestimation and blue colors indicate an overestimation by the hindcast.The figures were constructed with hourly hindcast data from the period 2000-2019.Note that the colorbars saturate for the highest values.