Are Feature Agreement Statistics Alone Sufficient to Validate Modelled Flood Extent Quality? A Study on Three Swedish Rivers Using Different Digital Elevation Model Resolutions

Hydraulic modelling is increasingly used all over the world to provide flood risk maps for spatial planning, flood insurance, etc. This puts heavy pressure on modellers and analysts not only to produce the maps but also to provide information on their accuracy and uncertainty. A common means of delivering this is through performance measures or feature statistics. These look at the global agreement between the modelled flood area and the reference flood that is used. Previous studies have shown that the feature agreement statistics do not differ much between models that have been based on digital elevation models (DEMs) of different resolutions, which is somewhat surprising since most researchers agree that high-resolution DEMs are to be preferred over poor resolution DEMs. Hence, the aim of this study was to look into how and under which conditions the different feature agreement statistics differ, in order to see when the full potential of high-resolution DEMs can be utilised. The results show that although poor resolution DEMs might produce high feature agreement scores (around F > 0.80), they may fail to provide good flood extent estimations locally, particularly when the terrain is flat. Therefore, when high-resolution DEMs (1 to 5 m) are used, it is important to carefully calibrate the models by the use of the roughness parameter. Furthermore, to get better estimates of the accuracy of the models, other performance measures such as distance disparities should be considered.


Introduction
Flood risk maps have become standard documents in local, regional, national, and sometimes even international spatial planning. They are widely used and considered valuable sources of information, mainly in planning and development projects. Detailed positioning of houses, buildings, and infrastructure is decided based on these maps. However, the reliability of these flood maps has been questioned for years. Deterministic flood maps, in particular, which show defined flooded areas associated with a specific flood-return year (e.g., the 100-year flood), are criticised for misleading map users into thinking that they show absolute boundaries of flood extent. They can never be exact, however, owing to the effects of different factors and modelling assumptions (e.g., input data, model, and model parameters) in the production of these maps. Different data, models, or parameters can change the position of the flood borders.
To overcome this limitation of flood model results, they are calibrated (e.g., using different roughness values) against a historic (reference) flood event with the same discharge. Each flood extent from the calibration is then validated for its accuracy through a performance (goodness-of-fit) measure that quantifies the model quality. Numeric scores provided by the performance measures are used to justify the choice of an optimal model that can be used as a flood hazard map. In uncertainty and probabilistic mapping implemented through the Generalised Likelihood Uncertainty Estimation (GLUE) methodology [1], these performance estimators play an important role, where they are used for deriving the likelihood weights assigned to the results of the ensemble, prior to the creation of the map [2]. Moreover, performance scores are widely used to analyse the sensitivity of the modelled flood to the effects of the input topographic data [3][4][5][6], roughness value [7][8][9][10][11], and hydraulic models, or their numerical solutions [3,7,8].
In flood extent validation studies, performance measures used include various types of feature agreement (F) statistics [9], as well as disparity measures [2]. F-statistics are the most commonly applied in assessing flood model predictions. They give a general view of the model's performance by considering the total sizes of over- and underestimation and overlap between the model and the reference data [12], but the outcome may be affected by what is prioritised in the equation. The disparity measures (mean and median disparities), on the contrary, account for the distance between the modelled and reference data at each point sample location [2,12]. Nevertheless, the scores will be dependent on where the samples are taken or the sampling strategy employed. According to Beven [13], all these equations have their assumptions, which can have implications on the choice of optimal models. Hence, decisions of the optimal model can be relative to the performance equation used.
Lim and Brandt [12] also showed that the combination of the digital elevation model (DEM) and roughness (Manning's n) can influence performance measures such as the F-statistics scores. However, it remains uncertain how the results of these various statistical estimators differ and when they differ. Hence, the aims of this study are to evaluate the following: (1) how the different performance measures, such as the different feature agreement statistics, vary in their results, in terms of the quantified values they produce; (2) how the generated flood extents of models considered as "best" coincide with the validation data and the computed performance scores; (3) how the sizes of overlap and under- and overestimation vary with the given DEM or DEM and roughness pairs; and (4) how the proportion of areas correctly and incorrectly modelled agree with the performance measures. The effects of model performance in the assessment of simulation results based on the influence of the DEM alone, or both the DEM and roughness values used, are investigated in relation to the aims. The topographic data remains an important hydraulic numerical model input that provides the geometry of the study area (through, e.g., cells, triangular meshes, or cross sections) and is the main source of the elevation values [2]. The grid size or the resolution mainly affects the details in the modelling and the accuracy (both vertical and horizontal) of model results [4,5]. Nonetheless, as shown in some studies [3,6,12], even lower resolution DEMs (≥25 m) can produce better quantified performance.
Thus, basing decisions on optimal models from a given performance measure alone may discard the benefits of using fine resolution DEMs. In addition, as Manning's n is the parameter that is adjusted during the calibration, it is important to also see how these values, especially when paired with different DEMs, can affect the performance scores.

Verification Measures for Spatial Predictions of Floods
One way to judge the goodness of predictions from a simulation or model is through their quality, i.e., their agreement with the observed data (truth) [14]. This quality is assessed through verification methods that help describe or quantify performance. A model is usually compared with observational data, based on binary conditions (i.e., yes/presence or no/absence), through a 2 × 2 contingency table [15]. In Table 1, hits (A) represent events that were predicted to happen and did take place. False alarms (B) are events modelled to occur that did not occur, whereas misses (C) are events that actually happened but were not modelled to happen. Events that are correctly predicted not to happen fall under correct rejection (X). A number of performance estimators based on this table are described in Mason [15].
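As an illustration of the 2 × 2 contingency table, the following minimal sketch counts hits (A), false alarms (B), misses (C), and correct rejections (X) cell by cell for two binary flood grids. The function name `contingency_counts` and the small grids are hypothetical and serve only to show the bookkeeping.

```python
def contingency_counts(modelled, observed):
    """Count hits (A), false alarms (B), misses (C) and correct rejections (X).

    Both inputs are equally sized 2D lists holding 1 (flooded) or 0 (dry).
    """
    a = b = c = x = 0
    for model_row, obs_row in zip(modelled, observed):
        for m, d in zip(model_row, obs_row):
            if m == 1 and d == 1:
                a += 1  # modelled wet, observed wet
            elif m == 1 and d == 0:
                b += 1  # modelled wet, observed dry
            elif m == 0 and d == 1:
                c += 1  # modelled dry, observed wet
            else:
                x += 1  # modelled dry, observed dry
    return a, b, c, x


# Hypothetical 3 x 4 grids, for illustration only
modelled = [[1, 1, 0, 0],
            [1, 1, 1, 0],
            [0, 0, 0, 0]]
observed = [[1, 1, 1, 0],
            [1, 1, 0, 0],
            [0, 0, 0, 0]]

print(contingency_counts(modelled, observed))  # (4, 1, 1, 6)
```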
In flood inundation studies, the goodness of model prediction can be evaluated by assessing the overall, so-called global, spatial accuracy of the simulated flood result produced from model calibrations or ensemble modelling, against a reference dataset, which is often a historical flood extent having the same magnitude as the modelled one. In such applications, where the quality of spatial predictions of modelled events is verified, the performance measures that can be applied include accuracy, the critical success index (CSI), the hit rate or probability of detection (POD), the false alarm rate or probability of false detection (POFD), and the false alarm ratio (FAR) [9]. The degree of correspondence between the model and the observation data can be estimated by computing the proportion correctly predicted by the model (i.e., accuracy) [9,15]:

$$\text{Accuracy} = \frac{A + X}{A + B + C + X} \quad (1)$$

According to Schaefer [16], this equation can be affected by the size or number of correct rejections, especially if it is large, which can lead to a high accuracy score. In flood modelling, where dry areas are usually larger, a high accuracy score can be achieved despite a low number of hits in the model [10]. Thus, in most performance measures used, correct rejection (X) is disregarded. The removal of the correct rejection from the original accuracy equation leads to another performance measure widely known as the critical success index or the threat score:

$$\text{CSI} = \frac{A}{A + B + C} \quad (2)$$

In flood model verification, the application of CSI makes it possible to focus on the correct prediction made by the model (hits, A) and the two main causes of flood modelling errors, i.e., false alarms (B) and misses (C). There are several equations based on the CSI, and they are known as the feature agreement (F) statistics.
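The effect of a large correct rejection count on accuracy can be shown with a small worked example. The sketch below assumes the usual contingency-table forms, accuracy = (A + X)/(A + B + C + X) and CSI = A/(A + B + C); the counts are hypothetical and chosen so that the dry area dominates.

```python
def accuracy(a, b, c, x):
    """Proportion of all cells (or area) that the model classified correctly."""
    return (a + x) / (a + b + c + x)


def csi(a, b, c):
    """Critical success index: accuracy with correct rejections removed."""
    return a / (a + b + c)


# Hypothetical counts: few hits, many dry cells correctly left dry
a, b, c, x = 10, 20, 20, 950

acc = accuracy(a, b, c, x)  # 960 / 1000
score = csi(a, b, c)        # 10 / 50

print(f"accuracy = {acc:.2f}, CSI = {score:.2f}")  # accuracy = 0.96, CSI = 0.20
```

With these numbers the model captures only a third of the observed flood (10 of 30 cells), yet the accuracy stays close to 1 because X is large, which is the behaviour that motivates dropping X.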
Generally, they are implemented by either considering (1) the binary flooding conditions (i.e., flooded = 1 or dry = 0) of each cell (in raster data) in the observed (D) and modelled (M) result [9,17] or (2) by comparing the areal size and location of the modelled flood against the reference data [18] (in vector data). All F-statistics equations rely on three important factors in the contingency table that help assess how well the model forecasts the flooding. These are (1) the total number of cells (P) correctly modelled to be flooded (M1D1) or the size of overlap (A, hits) between the modelled and the reference data, (2) the total number of cells (M1D0) or size (B) overestimated (false alarms) by the model, and (3) the total number of cells (M0D1) or size underpredicted (C, misses). The most commonly used F-statistic is F1, which is based on the original CSI equation. It is obtained from the intersection (hits) and the union (hits + false alarms + misses) of the modelled and observed event [3] (equation (3)). The values generated by the equation range from 0 (no fit) to 1 (perfect model fit):

$$F1 = \frac{A}{A + B + C} \quad (3)$$

However, this feature agreement statistic is also known to be biased [9], especially when the prediction gives a large overlap size (hits) [12]. If the flooded area is large, as is the case when rivers flood wide flat floodplains, the modelled flood area will overlap the reference flood area to a relatively large extent, compared with the areas that are either over- or underestimated.
This means that a rather poor model can achieve strong statistics. As stated by Lim and Brandt [12], as long as the model has hits greater than the combined sizes of misses and false alarms, the equation will lead to an F-value > 0.50. Because of this limitation, several variants of the equation have been introduced (Table 2) to improve the quantified scores. These versions of the equation modify the numerator to penalise the model for the overprediction (F2 O), underprediction (F2 U), or both under- and overprediction (F3) it produces [9]. Among these variants, F2 O is a widely used performance estimator in flood extent validation studies and in creating probabilistic maps. According to Hunter [9] and Hunter et al. [8], it constrains the overlap produced by the model in F1 by penalising it with the overprediction it produces. F2 U and F3 are the least applied among the four equations. In F2 U, the underestimation made by the model is subtracted from the overlap, whereas in F3, both under- and overestimation are subtracted from the overlap. The maximum value of these performance measures is also 1, indicating a perfect model fit with the reference data. Their minimum can be −1.
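The four feature agreement statistics can be sketched as below. The formulas follow the verbal description above (a common denominator A + B + C, with B, C, or both subtracted from A in the numerator) and should be read as an assumption about the form of the equations, not as a reproduction of Table 2. The function names and all input values are hypothetical. The second part illustrates the floodplain-width effect: the same 20 m overshoot on both banks gives a much higher F1 on a wide flood than on a narrow one.

```python
def f_statistics(a, b, c):
    """Return F1, F2_O, F2_U and F3 from overlap (a), over- (b) and underestimation (c)."""
    total = a + b + c
    if total <= 0:
        raise ValueError("a + b + c must be positive")
    return {
        "F1": a / total,
        "F2_O": (a - b) / total,     # penalises overprediction
        "F2_U": (a - c) / total,     # penalises underprediction
        "F3": (a - b - c) / total,   # penalises both
    }


# Hypothetical areas (any consistent unit, e.g. km2)
example = f_statistics(6.0, 1.5, 2.5)

# Same boundary error, different flood widths (hypothetical reach, no misses)
reach_m, overshoot_m = 1000.0, 20.0


def f1_for_width(width_m):
    hits = width_m * reach_m
    false_alarms = 2 * overshoot_m * reach_m  # overshoot on both banks
    return f_statistics(hits, false_alarms, 0.0)["F1"]


narrow = f1_for_width(100.0)   # 100 / 140
wide = f1_for_width(1000.0)    # 1000 / 1040

print(example)
print(f"F1 narrow = {narrow:.3f}, F1 wide = {wide:.3f}")
```

A model with no overlap at all (a = 0) and only false alarms gives F2_O = F3 = −1, which is the lower bound mentioned above.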
Other spatial performance measures that can be used in flood modelling are the hit rate or probability of detection, the probability of false detection, and the false alarm ratio. The POD is a standard statistic that indicates the ratio of correct predictions by the model to the total observed data [15,16]. In flood model verification, it gives the total number of pixels or areal size correctly predicted by the model to be flooded, in relation to the reference data, by considering both the overlap and the underestimation made by the model. The maximum POD score is 1, for models with perfect overlap with the reference data. According to Mason [15] and Hunter [9], the POD measure alone cannot be used as a performance estimator because the equation is affected by the size of the hits (A). Large hits (and low underestimation) can generate a high performance score, even if the model has largely overestimated the flooding. Thus, the POD equation should be supplemented with information that considers errors in overestimation [9], through the probability of false detection (also known as the false alarm rate) and the false alarm ratio. POFD accounts for the proportion of overprediction made by the model in relation to the correct rejections (X), while the FAR score considers the proportion of overestimation in relation to the hits. The best possible score for both performance measures is 0, i.e., no overestimation made by the model. However, these two can also pose similar problems as the POD. As the POFD is related only to correct rejections, how well the inundation is reproduced in terms of the overlap, or how well the underprediction is minimised, is not considered in the equation. On the contrary, although FAR accounts for the effect of the size of overlap made by the model, the size underpredicted is not accounted for in the equation.
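The limitation of POD on its own can be seen in a short sketch. It assumes the standard contingency-table definitions, POD = A/(A + C), POFD = B/(B + X), and FAR = B/(A + B); the counts are hypothetical and describe a model that covers the whole observed flood but floods an equally large dry area as well.

```python
def pod(a, c):
    """Probability of detection (hit rate): hits over observed flooded area."""
    return a / (a + c)


def pofd(b, x):
    """Probability of false detection (false alarm rate)."""
    return b / (b + x)


def far(a, b):
    """False alarm ratio: false alarms over modelled flooded area."""
    return b / (a + b)


# Hypothetical counts: no misses, but as many false alarms as hits
a, b, c, x = 8, 8, 0, 84

print(f"POD  = {pod(a, c):.2f}")   # 1.00, looks perfect
print(f"FAR  = {far(a, b):.2f}")   # 0.50, half of the modelled flood is wrong
print(f"POFD = {pofd(b, x):.3f}")  # 0.087, small because X is large
```

Here POFD stays small only because the dry area is large, which is the same sensitivity to X that affects the accuracy score.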
A further measure that can be applied to spatial model verification is the BIAS. This measure quantifies the consistency between the model result and the observed flood event [9] but does not indicate the model skill or performance [15]. Therefore, the bias score is considered more of a descriptive statistic that specifies the ratio of the predicted data to the observation [15]. In flooding applications, this measure indicates whether the model result overall underestimates (BIAS < 1) or overestimates (BIAS > 1) the flood. A value of 1 shows no bias. The farther the score is from 1, the larger the bias towards either under- or overestimation.
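A minimal sketch of the bias score follows, assuming the common definition BIAS = (A + B)/(A + C), i.e., modelled flooded area over observed flooded area. The counts are hypothetical. The last case shows why BIAS is descriptive and not a skill measure: a model with no overlap at all can still score exactly 1.

```python
def bias(a, b, c):
    """Ratio of modelled flooded area (a + b) to observed flooded area (a + c)."""
    return (a + b) / (a + c)


under = bias(6, 1, 3)     # 7 / 9  -> below 1, flood underestimated overall
over = bias(6, 3, 1)      # 9 / 7  -> above 1, flood overestimated overall
no_skill = bias(0, 5, 5)  # 5 / 5  -> exactly 1 although nothing overlaps

print(f"{under:.3f} {over:.3f} {no_skill:.3f}")
```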

Materials and Methods
Model performance measures are evaluated for model results based on different DEM resolutions in combination with different roughness values, or on the DEM resolution alone. To meet the aims, three study areas at different geographical locations in Sweden were used (Figure 1): Voxna, Testebo, and Eskilstuna (two different parts). The Voxna and Testebo rivers were used to determine the influence of the DEM together with Manning's n on the performance scores. In analysing the effect of the DEM resolution (including its errors), the Eskilstuna river was used as a case. Simulation results from the earlier studies of Lim and Brandt [12] (Voxna and Testebo), Klang and Klang [19], and Brandt [20,21] (Eskilstuna) were used in the current study. Specific data processing and modelling procedures are described in detail in those articles.
Voxna River.
Its western part is flat, consisting mainly of agricultural lands and forests. In the east, the river flows through the town of Edsbyn, where mostly built-up areas are present. This part of the river is steeper. The Voxna river has a mean annual flow of 40 m³/s at the outlet.
Topographic data used for creating the digital elevation models for Voxna came from a combination of bathymetric and LiDAR data. The DEMs were produced from the topographic data through TIN generation and conversion to raster, which is the main input required by the hydraulic model used [12]. The different DEMs used for this river had resolutions of 3, 4, 5, 10, 15, 20, 25, and 50 m. The highest resolution DEMs could not be used for the study area because of the technical limitations of the 2D hydraulic model in terms of the size of the site. For each DEM, 10 different Manning's roughness values were used, ranging from 0.010 to 0.100 in 0.01 increments, in accordance with Horritt and Bates [7]. This was mainly to see how these different roughness coefficients affect the model results from different DEM resolution models. The simulated flow was 360 m³/s, which corresponds to the major flood event of 1985. A corresponding reference flood inundation map was provided by Ovanåker municipality for this particular event. The flow was modelled as steady state 2D flow in the CAESAR-LISFLOOD software (Coulthard et al. [22]). Eighty simulations from the different combinations of DEM and Manning's n were carried out for the river.

Testebo River.
The second area is around the Testebo river at Forsby, north of Gävle. It has a mean annual flow of 12 m³/s. The areas surrounding the river, particularly in its central portion, are mainly flat arable land, with some mixed forests. Most residential areas are situated at the edges of these lands. Steeper sections of the river are located in the eastern and southern parts of the study site.
DEMs produced for the Testebo river were processed in the same way as for the Voxna river. Similar DEM resolutions (except for the 1 and 2 m resolutions, which were also included) and the same range of roughness values as for the Voxna river were used for the modelling. Here, the simulated flow was 160 m³/s, which coincides with the 100-year flood. Gävle municipality provided a flood inundation map of the 1977 flood, which is equivalent to this flow. A steady state 2D flow was also simulated for the study area using the CAESAR-LISFLOOD software [12]. A total of 100 simulations were conducted for the combinations of resolution and roughness values.
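The size of the two ensembles follows directly from pairing every DEM resolution with every roughness value. The sketch below enumerates the pairs; the Testebo resolution list is an assumption (the Voxna resolutions plus 1 and 2 m), chosen because it is consistent with the stated total of 100 runs. Variable names are illustrative only.

```python
from itertools import product

voxna_resolutions_m = [3, 4, 5, 10, 15, 20, 25, 50]
# Assumption: Testebo uses the Voxna resolutions plus the 1 m and 2 m DEMs
testebo_resolutions_m = [1, 2] + voxna_resolutions_m

# Manning's n from 0.010 to 0.100 in steps of 0.01
manning_n = [round(0.01 * i, 2) for i in range(1, 11)]

voxna_runs = list(product(voxna_resolutions_m, manning_n))
testebo_runs = list(product(testebo_resolutions_m, manning_n))

print(len(voxna_runs), len(testebo_runs))  # 80 100
```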

Eskilstuna River.
The third location consists of two parts of the Eskilstuna river (with a mean annual flow of 24 m³/s) in Eskilstuna. The northern part of the study area is located in the city centre, where most residential, commercial, and industrial establishments are concentrated. This site is characterised by relatively steep banks. The other location is situated just south of the city, where the floodplain is flatter. Residences are visible in the western part of this study site, along the river. Forests also line the river. At the edges of these forested areas are arable lands.
DEM resolutions used for the modelling were 1.04, 2.09, 3.13, 10, 25, and 50 m. Note, however, that these particular DEMs have not been derived in the same way as those for the Voxna and Testebo rivers, where linear interpolation of point cloud data was used to construct the DEMs of the desired resolution. Here, the original point cloud data, which had a density of 1.64 and 1.36 points/m², corresponding to cell sizes of 0.78 and 0.86 m for the flat and steep subareas, respectively, were used to produce the reference elevation model.
This DEM was then resampled and degraded by introducing random errors to simulate the behaviour of the results from a Leica ALS50 LiDAR instrument for different flight heights. Details of the DEM degeneration can be found in Klang and Klang [19] and Brandt [20,21].
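The cell sizes quoted for the two point cloud densities are consistent with assigning each point an equal square footprint, so that the equivalent cell size is the inverse square root of the density. This relation is an assumption used here only to check the arithmetic; the function name is hypothetical.

```python
import math


def equivalent_cell_size(points_per_m2):
    """Side length (m) of the square area that holds one point on average."""
    return 1.0 / math.sqrt(points_per_m2)


flat = equivalent_cell_size(1.64)   # flat subarea
steep = equivalent_cell_size(1.36)  # steep subarea

print(round(flat, 2), round(steep, 2))  # 0.78 0.86
```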
A single roughness value (0.033) was used for all model runs. This Manning's n is the most commonly suggested friction coefficient for rivers in Sweden. Because a single Manning's n was used, the uncertainties of the modelled flood extents can be isolated so that they depend only on the characteristics of the DEM and not on the DEM resolution in combination with roughness [20,21]. A flow equivalent to the maximum probable flow of 198 m³/s was used for the simulations. A steady state 1D flow was simulated using the HEC-RAS software (Hydrologic Engineering Center [23]).

Model Performance Evaluation.
Different feature agreement statistics, which are applied in flood extent validation studies, were used to assess the overall performance of the simulation results. Each model result was compared to the observed flood data (i.e., the historical flood maps for the Voxna and Testebo rivers and the simulation result from the original point cloud data in the case of Eskilstuna). Since the reference data were all in vector format, equations accounting for areal size (cf. Table 2) were used.
This was implemented by overlaying the reference and the given modelled flood map resulting from simulating a given DEM, or DEM and Manning's n combination. Then, the sizes of overlap and under- and overestimation were taken (Figure 2), prior to the application of the specific F-statistics equation.
Aside from the F-statistics, each simulation and the best model results from the different resolution DEMs were also evaluated using two other spatial performance measures that can help explain the performance scores provided by the F-statistics, namely the probability of detection and the false alarm ratio. Together, these statistics show how well the model reproduces the inundation in relation to the hits and the underestimation it produces (POD) [9], and how much it overpredicts (FAR). The probability of false detection (false alarm rate) was not used in this study because its equation includes the correct rejection (X), which is difficult to account for in spatial model results such as flood extents because of the large dry areas being modelled [10]. To further understand whether the model for a given performance measure, DEM, or DEM and Manning's n combination tends to over- or underestimate the flood predictions, BIAS scores were also computed.

Results
The effects of performance in evaluating the quality of flood models' results, as influenced by both the DEM and Manning's n used (Testebo and Voxna) and by the DEM alone (Eskilstuna, North and South), were analysed through the following: (1) the overall performance scores of the simulation results, (2) the most optimal models' performance scores and their corresponding spatial extents and size of inundation areas, (3) changes in the sizes of overlap and over- and underestimation produced by the model for the given DEM and DEM and roughness combinations, and (4) the POD, FAR, and BIAS scores of the models.

Combined Effects of DEM and Roughness
In Figure 3, it can be seen that the DEMs produced varying performance values in each measure, depending on the Manning's n paired with them. Moreover, of the two study sites, the Testebo river produced more variation in the performance score (for each goodness-of-fit measure) when going from high- to low-resolution DEMs, compared with the Voxna river. Overall, the Voxna river resulted in higher performances from the different combinations of DEMs and roughnesses, with only 1 to 4 model results (out of 80) having F < 0.50 in the four F-statistics. It can also be observed that, regardless of the goodness-of-fit measure used for both study areas, higher performances were more concentrated in medium to lower resolution DEMs (from 10 to 50 m), paired with a wider range of Manning's n values, whereas better results from higher resolution DEMs (1-5 m) came from a smaller range of roughness values. Among the different feature agreement statistics, F2 O had fewer simulation results that gave higher performance. Lower roughness values also led to better results for this measure, especially when paired with coarser resolution DEMs. With F2 U, more simulation results produced higher performances. Nevertheless, it can also be seen that higher resolution DEMs performed poorly with lower Manning's n values (particularly for the Testebo river, producing more negative performances), while coarser resolution DEMs performed well with most Manning's n values, especially for the Voxna river. The performance patterns for F1 and F3 were the same, but the latter produced lower and even negative values, especially for the higher resolution DEMs paired with lower roughness.

Most Optimal Performance Results.
The four performance measures considered different pairs of DEM and Manning's n to be the most optimal from the ensemble of 80 (Voxna) and 100 (Testebo) simulations (Figure 4). In general, F2 O produced the best results from each DEM with much lower roughness values, while F2 U produced the best results with higher friction. F1 and F3, which had the same optimal model results, almost followed the roughness pattern of F2 U for each resolution, but using a lower friction coefficient. It is also visible from the diagram that the best models from the three performance estimators (F1, F2 U, and F3) that used higher resolution DEMs (1-5 m) worked well with higher roughness, and that the most optimal Manning's n used decreased as the resolution became coarser.
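How different statistics can select different "best" roughness values for the same DEM can be sketched with a toy ensemble. The areas below are hypothetical (not taken from the study) and are built so that the flood grows with Manning's n while the observed area (overlap plus underestimation) stays fixed. The F-statistic forms are the ones described in the text and are an assumption here.

```python
def f_statistics(a, b, c):
    """F-statistics from overlap (a), overestimation (b) and underestimation (c)."""
    total = a + b + c
    return {
        "F1": a / total,
        "F2_O": (a - b) / total,
        "F2_U": (a - c) / total,
        "F3": (a - b - c) / total,
    }


# Hypothetical ensemble for one 5 m DEM: (resolution, n) -> (overlap, over, under)
runs = {
    (5, 0.03): (70.0, 2.0, 28.0),   # small flood, mostly underestimation
    (5, 0.06): (85.0, 12.0, 13.0),  # balanced errors
    (5, 0.09): (95.0, 30.0, 3.0),   # large flood, mostly overestimation
}

scores = {pair: f_statistics(*areas) for pair, areas in runs.items()}

# For each statistic, pick the (resolution, n) pair with the highest score
best = {
    name: max(scores, key=lambda pair: scores[pair][name])
    for name in ("F1", "F2_O", "F2_U", "F3")
}

for name, pair in best.items():
    print(name, pair, round(scores[pair][name], 3))
```

In this toy case F2_O selects the lowest roughness, F2_U the highest, and F1 and F3 agree on the middle value, which mirrors the qualitative pattern reported above.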
When considering the maximum performance scores for the different DEMs, the lowest resolution DEMs (25 and 50 m) produced the most optimal performance in all four F-statistics for the Testebo river ( ... flooding. F2 U produced larger extents in all DEM resolutions for Testebo (especially in the south and southeast) than F1 | F3 (except for the 50 m DEM) because of the higher roughness values used. In Voxna, the difference in extents between F2 U and F1 | F3 was less noticeable in the map, but if the sizes of the inundation areas are compared, the former had a larger flooded area as a result of the higher roughness value used (i.e., n = 0.030) (Table 4). The size of the flooded areas produced by the best models for each resolution and performance measure, and the percent difference in areal size in relation to the reference data, are further shown in Tables 3 and 4. A percentage equal to 0 means that the size of the inundation area from the simulation is the same as that of the reference data. Note that this value does not signify an exact match in extents (overlap) between the model and the reference. A negative value (<0) means that the modelled flood area is smaller than the reference data, while a positive value (>0) means that the modelled size is larger.
The farther the value is from 0, the larger the under- or overestimation of the areal size. Here, the values clearly show that the models with the highest F2 U scores have a bigger modelled areal size than both the reference data and the best models of the other three performance measures, in both study sites (except for the 50 m DEM for the Testebo river). Inundation areas of best models derived from this performance measure for the Testebo river have size differences from the reference data ranging from 27 to 37%. For Voxna, the differences in sizes were smaller (5-12%). In the Testebo river, the 50 m DEM that was paired with n = 0.010 produced a size equal to the reference data (0% difference), but it can be seen in Figure 3 that they did not exactly match each other's extents. Results with the highest F2 O generally have smaller inundation areas, especially for the Testebo river. But again, for this river, although the areal size of inundation produced from the 50 m DEM was smaller than the actual flood data, the percent difference was very small (−0.17%) compared with the other DEMs. In addition, the most optimal results based on the highest resolution DEMs for this performance measure have smaller inundated areas than the reference (−25 to −49% difference for the 1, 2, 4, and 5 m for Testebo and −9 to −13% for Voxna). With F1 and F3, all optimal models produced larger flood extents than the reference data in Voxna, whereas in Testebo, the 10, 25, and 50 m DEMs produced smaller flood areas than the rest of the DEMs.
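The point that a 0% difference in areal size does not imply matching extents can be checked numerically. The sketch assumes the percent difference is computed as 100 × (modelled − reference)/reference, which matches the sign convention described above; the areas are hypothetical.

```python
def percent_area_difference(modelled_area, reference_area):
    """Signed difference in flooded area, in percent of the reference area."""
    return 100.0 * (modelled_area - reference_area) / reference_area


# Hypothetical areas: overlap, overestimation and underestimation
a, b, c = 80.0, 20.0, 20.0

modelled_area = a + b    # everything the model floods
reference_area = a + c   # everything that was observed flooded

diff = percent_area_difference(modelled_area, reference_area)
f1 = a / (a + b + c)

print(f"area difference = {diff:.1f}%, F1 = {f1:.3f}")  # 0.0%, 0.667
```

Because over- and underestimation cancel in the total area, the sizes are identical even though a fifth of the observed flood is missed.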

Resolution's Effect on Quantified Performance (Eskilstuna River).
For the Eskilstuna river, a constant roughness (n = 0.033) was used for all DEMs to produce the results. Irrespective of the topographic characteristics of the study area, i.e., steep (north) or flat (south), the highest resolution data (1 m) received the highest performance in all four measures (Figure 6 and Table 5). A distinct trend is also shown that the performance decreased as the resolution became coarser. Among the goodness-of-fit measures, F1 showed a more gradual decrease in performance from finer to coarser resolution, while F3 manifested a more significant decrease. From the two study areas, it can be noticed that the shift in performance between the highest resolution and the 10 m DEM was more prominent for the northern part of the river (i.e., from 0.93-0.96 down to 0.74-0.87), particularly when F2 U and F3 were used.
When looking at the inundation areas produced by the different resolution DEMs (Table 5), the 50 m DEM produced the largest flood areas in both sites, although the difference was larger in the north (10.55%). Also noticeable is that the 3 m data produced the next largest difference after 50 m. The lowest percent difference in areal size was produced by the 1 m DEM, both in the northern and southern parts of the river. The 25 m DEM also generated a smaller percent difference in flooded area than some of the higher resolution DEMs.
When the inundation extents are examined (Figure 7), there was a higher degree of overlap (with the reference) produced by the higher resolution (1-10 m) DEMs. But beyond these resolutions, the mismatches at the edges became more recognisable, especially when 50 m was used. Over- and underestimations in flooding in some parts of the study areas also became obvious with this resolution.

Changes in Sizes of Overlap and Over- and Underestimation for the Different Resolution DEMs.
All performance measures' equations are based on accounting for the overlap, overestimation, and underestimation sizes produced by the simulation results. If these three are examined for each of the results from the different study cases (Figure 8), and for each DEM resolution, it can be seen that the general pattern for the Voxna and Testebo rivers was similar; i.e., more overlap was attained when using the lower resolution data and high roughness. Nonetheless, these resolutions also generated more overpredictions (but minimal underpredictions), especially when they were paired with higher roughness values. On the contrary, the sizes of overlap produced by higher resolution DEMs were generally lower than those of the coarser resolution DEMs. Furthermore, higher friction coefficients paired with them generated better overlap size, as well as overestimations, while lower frictions produced underestimations. The big shifts in the size of overlap and over- and underestimation from the highest (1 m) to the lowest (50 m) resolution DEMs were most obvious for the Testebo river.
In both the southern and northern parts of the Eskilstuna river, the overlap sizes did not vary much among the results from the different digital elevation models, but the 1 m DEM had the highest inundation match with the reference data. In addition, the overlap size decreased as the resolution became coarser. The increase in the size of overestimation from the finest to the coarsest DEM was similar to that for the Testebo and Voxna rivers. Nevertheless, the pattern for the underestimation was opposite to that of the two other study areas; i.e., the coarser resolution DEMs generated more underpredictions for the Eskilstuna river.
... (Figure 9). It can also be seen in the graphs that there was more spread in the POD values of results that used finer resolution, compared with coarser resolution DEMs (due to the effect of roughness), especially for the Testebo river. Higher overall POD scores were computed for the Voxna river. The models that used 10 to 50 m data for this study site received higher POD values (POD ≥ 0.886). Eskilstuna showed an opposite trend in the probability of detection scores. Unlike the two other rivers, the maximum POD value (0.98) was attained in both sites by the 1 m resolution data. It is also shown in the figure that the POD decreased as the resolution became coarser.
If the POD scores of the most optimal results according to the different F-statistics (as marked in Figure 9) are considered, it can be seen that they varied with the different goodness-of-fit measures. The most optimal results from F2 O had a lower hit rate compared with the results from F2 U and F1 | F3. In the case of Testebo, the best models quantified using F2 U received the highest POD scores. For Voxna, the hit rates of the most optimal results based on F1 | F3 and F2 U were very close to each other.
False alarm ratio (FAR) values have an inverse performance relationship to PODs; the closer the FAR value is to 0 (maximum), the better. The pattern in the false alarm ratio appeared to agree in all three rivers; i.e., the values decreased as the resolution became coarser. Simulations that came from the 50 m DEM also received the lowest FAR values, specifically for the Testebo river (FAR = 0.45). The highest FAR scores attained were as follows: Testebo

Discussion
If models are calibrated using a combination of DEM and roughness values, the best performing models vary among the feature agreement statistics used, as exemplified by the Testebo and Voxna cases. This is because the different equations handle the sizes of overlap, overestimation, and underestimation differently in the numerator. Also, as roughness values have an impact on the size of the flooding produced (i.e., flood area increases with increased Manning's n), the friction coefficient paired with a given DEM can affect the scores in a given performance measure, especially with F2 O and F2 U. The most optimal models derived for the different resolution DEMs using F2 O were paired with lower roughness values (mostly from 0.01 to 0.05) (Figure 4), as they produced smaller inundations.
Since the equation penalises the overlap with the overprediction by the model, a smaller overestimation gives higher performance scores. In contrast, F2 U tries to suppress the underestimation produced by the model. Thus, models with the lowest underestimations, i.e., those produced by higher roughness values, received maximum performance using this measure. F3, on the other hand, produced the lowest overall performance of all feature agreement statistics, although the combinations of best performing models it identified were the same as for F1. The equation tries to limit the effect of the overlap size on the quantified performance value, which is the problem with F1 that was mentioned by Hunter [9]. In the constraint that F3 applies in the numerator, it subtracts both the over- and underestimation sizes from the overlap, balancing the effects of F2 O and F2 U. This prevents the model from attaining higher scores, which can help identify better performing models. It also means that if the F3 value is high, the model can be considered to be very good. However, F2 O and F2 U may be good to use if the modeller seeks to find the minimum or maximum, respectively, probable flood inundation areas.
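The different treatment of the numerator can be illustrated with a small sketch. It assumes the commonly used forms of the four statistics, with A the overlap, B the overestimation, and C the underestimation, all divided by A + B + C: F1 uses A, F2 O uses A − B, F2 U uses A − C, and F3 uses A − B − C. The two candidate models and their areas are invented for illustration and do not come from the study.

```python
# Assumed forms of the feature agreement statistics (A = overlap,
# B = overestimation, C = underestimation); labelled as an assumption because
# the paper's equations are not reproduced in this section.
def f_statistics(overlap, over, under):
    total = overlap + over + under
    return {
        "F1": overlap / total,
        "F2_O": (overlap - over) / total,    # penalises overestimation
        "F2_U": (overlap - under) / total,   # penalises underestimation
        "F3": (overlap - over - under) / total,
    }


# Two hypothetical models (areas in arbitrary units):
# a low Manning's n gives a small extent, a high n gives a large extent.
models = {
    "low n": (80.0, 5.0, 20.0),
    "high n": (95.0, 30.0, 5.0),
}
scores = {name: f_statistics(*areas) for name, areas in models.items()}

# Best model according to each statistic.
best = {
    stat: max(scores, key=lambda name: scores[name][stat])
    for stat in ("F1", "F2_O", "F2_U", "F3")
}
print(best)
# {'F1': 'low n', 'F2_O': 'low n', 'F2_U': 'high n', 'F3': 'low n'}
```

Under these assumed forms, F3 = 2·F1 − 1, so F3 always ranks models in the same order as F1 while returning lower values, which is consistent with the behaviour described above.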
Furthermore, the roughness value used together with the DEM resolution was important for achieving higher performance in the case of the Testebo and Voxna rivers. As generally seen in the results of the performance scores (Figure 3), higher roughness values worked better with higher resolution DEMs, while lower resolution DEMs performed better with lower Manning's n. It is also clear that a particular DEM paired with the right Manning's n can attain high performance values, which was especially true for the coarser resolution DEMs. As shown in Tables 3 and 4, the maximum performance scores received by the lowest resolution DEMs (25 and 50 m) in at least three goodness-of-fit measures (F1, F2 U, and F3) were higher than those of the finest resolution DEMs. The overlap size with the reference data produced by them was also the highest (Figure 8). Even the POD scores (Figure 9) were high and yielded the least varied values compared with the other resolution data. But, despite the high quantified performance that they received, the extents generated by them were not better than those of the higher resolution DEMs in terms of details and accuracy at the edges (Figures 5 and 7). These resolutions (specifically 50 m) also produced more overestimation of the flooding, lower FAR scores, and higher BIAS in overestimation.
Peculiarly, the high performance scores of the low-resolution DEMs for the Testebo and Voxna rivers were opposite to those of the Eskilstuna river, where the highest resolution DEMs received the highest performance scores. In addition, a more significant decrease in performance is noticeable for the Eskilstuna test sites as the resolution was decreased. Even when comparing the performance scores for the Testebo and Voxna rivers using different DEMs paired with a similar Manning's n of 0.03, the 1-5 m DEMs received lower performance than the coarsest resolution DEMs. One possible explanation is the reference data against which the results were compared to derive the different performance values. For both the Testebo and Voxna rivers, historical flood observations were used for extent validation, while for Eskilstuna, the reference data were derived from simulations using DEMs of 0.78 m (North) and 0.86 m (South) resolution based on the original laser scanned point cloud data, due to the absence of a flood extent from a historical flood event. The 1 m DEM gave the highest performance in all F-statistics (F = 0.93-0.974 in Table 5) and better details and match in the borders generated (Figure 7) because it was most similar to these two resolutions. The 50 m DEM, in contrast, was the most dissimilar in extent and performance because of the resolution difference from the reference model. The tendency to get higher performance for the finest DEM resolution when benchmarking against better topographic data was similar to the results of Cook and Merwade [5] and Peña and Nardi [6]. As also seen in Table 5, the percent difference in the inundated area between the reference model and the modelled floods became larger (due to the modelled areas getting smaller) as the DEM resolution changed from 1 to 3 m. Why the modelled areas became smaller for the 3 m than for the 1 m resolution is unclear to us, but it may have to do with local conditions and where the elevation data points have been sampled.
However, with further change in resolution, the percent difference first decreased between the 10 and 25 m DEMs, followed by an increase for the 50 m resolution. The modelled inundated areas now become progressively bigger. The effect is most pronounced for the Northern area, which has a narrower floodplain. The reason for this behaviour can be attributed to raised bed levels when the DEM cell size approaches or even exceeds the width of the channel. Small depressions that were present in the high-resolution DEMs were removed and replaced by higher lying areas, with elevated water levels as a result. With very big cell sizes, the cells then represent the surrounding terrain rather than the river channel. This effect was shown in Lim and Brandt [12] for the Testebo and Voxna rivers. Cook and Merwade [5] and Saksena and Merwade [4] also show in their results that inundation areas become larger when using coarser grid resolution. Hence, from a flood risk perspective, the use of coarse-resolution models may actually increase the chances of being on the "safe" side when the flood map is used for planning purposes.
Maximum scores in all performance measures (from the entire ensemble of models) for both the Testebo and Voxna rivers were also lower than for the Eskilstuna river. This is not surprising, as the deviations in the results for the Eskilstuna river depended only on the quality of the DEMs. However, besides the addition of roughness as an influencing parameter, these low scores could also have been influenced by the historical data used for the validation. According to Lim and Brandt [12], the data can have accuracy issues in terms of how they were generated (e.g., how precisely the flood borders were digitised or produced) and their timeliness (i.e., the time difference between the event and the generation of topographic data), which can hinder better inundation matches.
Thus, an accurate estimation of performance scores can also depend on the observation data, as stated in Merwade et al. [24]. Moreover, a perfect or very high match between the modelled and real flood extents can be difficult to achieve because of other factors that can affect the actual flood extents but are not taken into account by the model used or in the modelling performed.
The single Manning's n roughness of 0.033 also worked well for the Eskilstuna and Voxna rivers in the four performance estimators, although it did not produce the most optimal results for the latter. This was regardless of the hydraulic models used (HEC-RAS 1D for Eskilstuna and 2D CAESAR-LisFlood for Voxna). However, for the Voxna river, the performance scores among the different resolution DEMs using this friction value had lower variability in the three feature agreement statistics (except for F2 U) than for the Eskilstuna river. But even when considering the performance results from the ensemble, the quantified performances for the Voxna river were higher and less varied than for the two other study areas. For the Testebo river, this roughness led to lower performance for the higher resolution DEMs, especially when using F2 U and F3.
This was because higher roughness values worked well with these resolutions for the Testebo river, while lower resolution DEMs were better with lower roughness values.
The result that the Eskilstuna river's flat terrain area received better performance measures than the steeper area might at first seem surprising, as the modelled flood borders are clearly at a greater distance from the reference model in the flat area. Nonetheless, Cook and Merwade [5] achieved similar results in their study of the flat Brazos river (better performance) and the steeper Strouds creek (lower performance). This can be attributed to the flatter sites' relatively larger inundated areas in comparison to the fringe that will be either over- or underestimated. But it is still important to notice that the disparity distances between the modelled and the reference flood edges are always bigger in the flatter areas. This makes it questionable to rely only on the feature statistics treated in this paper. Instead, the distance disparities described in Brandt [21] and Lim and Brandt [12] should also be considered. These measures will provide better estimates of local flood extent uncertainties, especially where the terrain is flat.
Finally, it is clear from the results that when high-resolution DEMs are used, calibration of the roughness coefficient is of utmost significance. It is important to note here that using calibrated results from earlier models, or standard Manning's n values, is not advisable. Calibration should be done according to the current DEM, study area, model, and model conditions. As shown in Figure 4, although the general trends for roughness values and DEM for each performance measure have similarities, the most optimal roughness still differed. These differences in the friction values reflect sampling variation, which is typical for any model parameter.
In the case of Testebo and Voxna, the varying roughness values can be ascribed to the differences in their physical characteristics, which affect the flow characteristics in the river. As different study areas are spatially heterogeneous, the friction values will vary from one site to another. In addition, because the DEM provides the topography of the area for the model, its quality in terms of resolution will affect the flow characteristics, which in turn can impact the friction to be used and the flood extent produced. The use of a 1D or 2D model can also have an effect on the optimal roughness values derived, due to their assumptions in solving the energy loss equation [25] (cf. Horritt and Bates [7]). For the 2D model results (Voxna and Testebo), if the basis of optimality is F1 | F3 for the highest resolution DEMs (1-5 m), the most optimal Manning's n is 0.08, but it varied as the resolution decreased. For the Eskilstuna river, the most optimal roughness is difficult to know, as only a single roughness was used. Likewise, the discharge used can also produce varying inundation extents and patterns, as well as lead to different optimal roughnesses [7]. Maximum discharges were used for the different study sites. Although they were roughly of the same magnitude, the flow differed somewhat (360 m³/s for Voxna, 160 m³/s for Testebo, and 198 m³/s for Eskilstuna). This can be another reason for the varying friction values. All these factors can affect the modelling results, as also shown in earlier studies, and as a result of the equifinality problem [1]. However, since our study focused on the variability of performance measures as an effect of the DEM resolution and roughness, our results are centred on the effects of topography on the flood extents and performance measures, rather than on the model or the flow used.
From a feature statistics point of view, it may seem unnecessary to use high-resolution DEMs when average resolution DEMs provide similar F-statistics scores, as in the case of the Voxna and Testebo rivers. However, as the results from the Eskilstuna river show (where only the effect of the DEM is considered), feature statistics clearly get better when the resolution is increased. The positive effect of resolution on the flood model results when considering both the effects of DEM and roughness became more evident when the disparity distance measures were used, especially for the Voxna river. In Lim and Brandt [12], the mean and median disparities for the 3 to 5 m DEMs were 59.71-63.16 m and 10.75-15.18 m, respectively. These were clearly better than for the 50 m DEM (mean D = 117.7 m and median D = 41.17 m). Nonetheless, the results of the disparity measures are less clear for the Testebo river.
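To make the idea of a distance-based measure concrete, the following sketch computes the mean and median distance from points on a modelled flood edge to the nearest point on a reference edge. This nearest-point formulation is a simplifying assumption and is not necessarily the procedure of Brandt [21] or Lim and Brandt [12]; the function name and all coordinates are hypothetical.

```python
import math
from statistics import mean, median


# Simplified stand-in for a disparity measure: for every point sampled along
# the modelled flood edge, take the distance to the closest reference point.
def disparity_distances(modelled_edge, reference_edge):
    return [
        min(math.dist(p, q) for q in reference_edge)
        for p in modelled_edge
    ]


# Invented edge points in metres: the reference edge runs along x = 0,
# and the modelled edge deviates strongly at one location only.
reference_edge = [(0.0, float(y)) for y in range(0, 50, 10)]
modelled_edge = [(3.0, 0.0), (4.0, 10.0), (5.0, 20.0), (40.0, 30.0), (6.0, 40.0)]

d = disparity_distances(modelled_edge, reference_edge)
print(d)  # [3.0, 4.0, 5.0, 40.0, 6.0]
print(round(mean(d), 2), median(d))  # 11.6 5.0
```

In this toy case, a single local deviation of 40 m raises the mean well above the median, which illustrates why reporting both values can reveal local mismatches that an area-based score averages out.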

Conclusions
This study has highlighted the implications of using feature statistics in assessing modelled flood inundation areas. Some of the most frequently used feature statistics have been tested on three different rivers, one of which comprised two different topographies, with flat and steep side slopes, respectively. From the results, it is clear how the performance values using the feature statistics varied when going from the most commonly used F1, which divides the overlap area of the modelled and reference flood by the total combined area of the modelled and reference flood, over F2 O and F2 U, which penalise over- and underpredicted areas, respectively, to F3, which penalises both over- and underpredicted areas. This suggests that a high F3 value, i.e., close to 1, more or less guarantees a high match between the modelled and reference flood areas. However, even when feature statistics show high values, caution should be taken, especially for local or flat areas. Consequently, alternative performance measures, such as disparity distances, may be better substitutes.
The direct relationship between the roughness value and the size of flooding was also shown to affect quantified performance, especially for the F2 O and F2 U equations. As the flooded area increases with higher Manning's n, the tendency to overestimate the flood becomes higher. Because F2 O penalises the model for the overestimation produced, lower roughness values that generate smaller extents and minimal overprediction performed better with this measure. In contrast, F2 U worked well with higher friction because the equation tries to minimise the underestimation made by the model. The results also showed that DEMs of poor resolution can receive relatively high performance scores over a wide range of roughnesses. But despite the high scores, there are more spatial inconsistencies in the flood extents produced from low-resolution data.
Therefore, assumptions on the goodness of results based on poorer resolution DEMs must be made carefully. The use of higher resolution DEMs remains advantageous for modelling, as they represent the topography of the study area better. As the results from the Eskilstuna river showed, which depend only on the quality of the DEM, better resolution has the potential to always bring better results. However, high-resolution DEMs need to be carefully calibrated through the roughness value to fully utilise their modelling advantage. If they are not calibrated properly, the feature agreement statistics will directly show decreased scores.
Performance scores also varied among the study areas. Regardless of the measure used, the Eskilstuna and Voxna rivers received higher overall performance than the Testebo river. These variations in the scores can be caused by several factors, such as the modelling assumptions of the model utilised (1D vs. 2D) or the spatial heterogeneities of the study areas, both of which can influence the flow being modelled and the inundation extents produced. Moreover, the reference data and its quality can also affect the quantified performance.

Data Availability
The data used to support the findings of this study have not been made available because they are the property of Lantmäteriet (the Swedish mapping, cadastral, and land registration authority) and the municipalities of the studied areas.

Disclosure
This research has been carried out as part of the employment at University of Gävle.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.