A GPU-Accelerated Shallow-Water Scheme for Surface Runoff Simulations

The capability of a GPU-parallelized numerical scheme to perform accurate and fast simulations of surface runoff in watersheds, exploiting high-resolution digital elevation models (DEMs), was investigated. The numerical computations were carried out by using an explicit finite volume numerical scheme and adopting a recent type of grid called Block-Uniform Quadtree (BUQ), capable of exploiting the computational power of GPUs with negligible overhead. Moreover, stability and zero mass error were ensured, even in the presence of very shallow water depth, by introducing a proper reconstruction of conserved variables at cell interfaces, a specific formulation of the slope source term and an explicit discretization of the friction source term. The 2D shallow water model was tested against two different literature tests and a real event that recently occurred in Italy for which field data is available. The influence of the spatial resolution adopted in different portions of the domain was also investigated for the last test. The achieved low ratio of simulation to physical times, in some cases less than 1:20, opens new perspectives for flood management strategies. Based on the result of such models, emergency plans can be designed in order to achieve a significant reduction in the economic losses generated by flood events.


Introduction
The numerical distributed simulation of overland and open-channel flows has been studied since the early Seventies ( [1][2][3]), and is being increasingly used to better model and understand the transient flow dynamics of flash floods [4][5][6][7][8][9][10][11][12][13]. These flash floods are generally characterized by rapidly varying overland flows as a consequence of fast and complex watershed response to intense rainfall. In a perspective of continuous increase in flash flood frequencies and damage, also owing to climate change, flooding events in Europe rank highly among natural hazards in terms of people involved and casualties, with a huge economic impact of losses characterized by a mean value of about 5.8 billion euros every year in the period 2013-2017 [14]. In addition, a growing number of people are facing high levels of flood risk due to increased urbanization in flood-prone areas. The development and application of tools capable of providing accurate and fast predictions for flood management are hence crucial to obtain a reduction in the damage caused by floods. In the past, lumped hydrological models or simplified hydrodynamic models were often adopted to predict these events at the catchment scale [15][16][17][18]. Unfortunately, simplified approaches are generally not capable of describing the rapid watershed responses and complex surface flow processes, and accurately predicting local depth and velocities. Moreover, the reduced representation of the physics of the phenomena can lead to significant dependence on model parameters. Hydrologic models for floods were in the past Parallel numerical schemes based on message passing interface (MPI) or graphics processing units (GPU) techniques have been recently developed, thereby allowing otherwise prohibitive simulations thanks to the high ratios between physical and computational times [48]. However, most of the GPU-accelerated models are based on uniform Cartesian grids, characterized by two main limitations: (i) a unique resolution must be adopted in the whole domain, and (ii) the domain must be rectangular. The adoption of non-uniform grids in GPU-accelerated models has recently gained attention in the literature [43,[49][50][51][52][53][54][55][56], since it allows the use of high resolution only in the portions of the domain where it is needed.
The objective of this paper is to present Parflood Rain, the evolution of a GPU-parallel numerical model for the solution of the 2D SWEs [48,51,57], for the simulation of rainfall-runoff processes in a watershed. The new version of the code required the introduction of a new source term related to rainfall. The model is characterized by a finite volume explicit discretization technique, which is well balanced, second-order accurate, and based on positive depth reconstruction. Thanks to the structure of the algorithm, the scheme is also numerically stable for overland flow. Many issues related to stability had to be addressed due to the thin water layer flowing across the domain. Moreover, the model is based on a recently developed type of grid called Block-Uniform Quadtree (BUQ), designed to exploit the computational power of a GPU and allow the enforcement of the desired resolution in different regions of the computational domain [36]. Since this is the first application of the BUQ grid at the catchment-area scale, we have conducted an in-depth analysis aimed at identifying the appropriate resolution to be adopted at the watershed sides in order to preserve an accurate description of the flow dynamics. A remarkable reduction in computational times and allocated memory was achieved compared to Cartesian high-resolution grids.
The paper is organized as follows: the explicit finite volume numerical scheme for the SWEs is described in Section 2, and the main features of the Computed Unified Data Architecture (CUDA®) implementation and of BUQ grids are illustrated. In Section 3, the accuracy of the numerical model is verified against different challenging synthetic test cases. In Section 4, the capability of the numerical scheme to efficiently simulate real flooding events is assessed by reproducing a real test case. Appendix A contains details of the numerical implementation of source terms and rainfall excess.

Numerical Scheme
The numerical model, explained in detail in [48,51], solves the 2D-SWE through an explicit finite volume scheme that in integral form [58] can be written as: where A is the area of integration element, C the element boundary, n the outward unit vector normal to C, U the vector of the conserved variables, H = (F, G) the tensor of fluxes in the x and y directions, S b, S f and S r the source terms representing the bed slope, the frictional effect and the rainfall rate, respectively. A modified form [59] of SWEs is adopted here, and the vector terms are: −gh Water 2020, 12, 637 4 of 27 in which η is the free surface elevation, h the flow depth, u and v are the velocity components in the x and y directions. The terms uh and vh identify the unit width discharges in the two coordinates directions, respectively (also addressed as qx and qy for the sake of simplicity in notation), g is the gravitational acceleration, z is the bed elevation (η = z + h), r and f are the rainfall and infiltration rate, respectively. The conserved variables are updated at each time step by adopting the second order Runge-Kutta method, which ensures the second order of accuracy in time: In Equation (3), n represents the time level, i and j the cell positions in the x and y directions, and ∆t n is the time step calculated accordingly to the Courant-Friedrichs-Lewy condition [58]. The term U i,j n+1/2 is obtained as: where the operator D i (U i,j n ) is defined as: in which the numerical fluxes F and G at the intercells i ± 1 2 and j ± 1 2 in the x and y direction respectively, are evaluated by using the states at the cell faces as input in the Harten, Lax and van Leer approximate Riemann solver with the contact wave restored (HLLC). The HLLC scheme, which in many situations behaves like the exact Riemann solver, does not require an iterative scheme, is positive definite and generally computationally more efficient than other well-known solvers, is in fact accurate and robust, and preserves the entropy-satisfaction property of the HLL scheme [58,60]. To obtain the Riemann states, a surface reconstruction method (SRM [10]) with hydrostatic reconstruction [61] is used. The flux and the bed and friction slope source term computations are performed in the present approach by adopting the variables obtained from SRM and local bed modification treatment in order to overcome the numerical instabilities induced by the presence of very small depths related to runoff processes. Considering two adjacent cells in the x direction, i and i + 1, the SRM-reconstructed water surface elevations on the left-and right-hand sides of their common interface are given by: where δz is the difference between the bed elevation at the two sides of the cell interface, obtained through slope-limited interpolation (minmod Ψ) of the related cell center values: The states of bed elevation at the common interface can then be obtained as follows: Water 2020, 12, 637

of 27
To ensure the positivity of the water depths, a unique bed elevation is defined for each cell interface [61] and the water depths are corrected once again to ensure their non-negativity [62]: The Riemann states of the unit width discharges are then calculated: In Equation (6)-(10) the variables without superscript (η, h, u, v) are defined at the cell center. The details of the numerical implementation of bed and friction slope source terms are reported in Appendix A for the sake of space.

Evaluation of Rainfall Excess
In the numerical model presented here the rainfall source term S r represents the inflow due to the net rainfall (Equation (1)). This term, variable in both space and time, can be obtained as the difference between rainfall r and infiltration rate f (Equation (2)).
Several possibilities are available in the code to introduce the input due to the rainfall in order to perform a distributed rainfall simulation over the domain. It is possible to set a unique value of rainfall intensity r (mm·h −1 ) at each cell or rain time series for predetermined areas (Thiessen polygons). Another way is to exploit the observations at the rain gauges present in the domain by computing the precipitation at each cell (in time) through the inverse distance squared approximation: where r i,j is the rainfall intensity at cell i, j, r g is the rainfall intensity measured by the gth rainfall gauge, d g,i,j is the distance from cell i to the gth rainfall gauge and N is the total number of rainfall gauges. Furthermore, a series of maps (e.g., radar maps) containing the amount of precipitation in time (with constant or variable time spacing) can be provided as input to be transformed to the rainfall rate. In the present model, the rainfall excess is evaluated through the application of the SCS Method, which allows the evaluation of the depth (mm) of the excess precipitation, P e , through a simple empirical formulation [63]. More details of the numerical implementation of the SCS approach are reported in Appendix A.

Graphics Processing Unit (GPU) Implementation of Shallow-Water Equation (SWE)
The numerical scheme described in Section 2.1 is implemented through the CUDA ®in order to overcome the high computational cost of the explicit FV scheme by exploiting NVIDIA TM GPUs. The main aspects of GPU implementation are briefly recalled in this section, for details see [48,57].
In CUDA, the basic work unit is represented by a thread, while many threads are grouped into a block [64]. In the present model, each thread corresponds to a computational cell used to discretize the physical domain. The CUDA program controls two resources, namely the host (CPU) and the device (GPU), including the data exchange between them. The former performs pre-and post-processing and controls the simulation's time advancement, while the latter executes the pieces of code (kernels) with the instructions for the actual computations, which are performed in parallel over blocks. The update of the conserved variables from time-step n to n + 1 is executed by a sequence of 10 tasks (Figure 1), each corresponding to a distinct CUDA kernel. Compared to the previous model [48], the implementation of Parflood Rain features an additional kernel for processing rainfall maps, and modifies some of the original kernels to include the rainfall source term. Moreover the conserved variables are here reconstructed through the SRM approach.
Water 2020, 12, x FOR PEER REVIEW 6 of 26 Figure 1. Iteration flux diagram. Each task is handled by a graphics processing unit (GPU) kernel.
The first task is only performed if open boundary conditions are present. In this case, some conserved quantities must be imposed in boundary cells [48]. In the second task, the time-step is calculated according to the CFL condition, and the blocks are reorganized in order to exclude dry blocks from the computations and improve efficiency [48]. However, if at least one cell of a dry block falls in the rainfall area at the current time step, the "rain" block is activated and processed like a wet block. Then, during the third task, a dedicated kernel computes the rainfall intensity for each cell in the "rain" blocks. Tasks 4, 5, and 6 correspond to the first half step of the second order scheme, and include the bed elevation reconstruction, the flux and bed slope source term evaluation, and the computation of the friction slope source term. The updating process continues with the second half step (tasks 7, 8, and 9). Finally, during task 10, the solution is updated according to Equation (3), and the rainfall source term is also added, before a new iteration begins. Conversely, only tasks 4, 5 and 6 are performed for the first-order scheme: the rainfall excess computation and the updating of the conserved variables according to Equation (5) are performed during task 6; the iteration process then starts again immediately from task 1.

Block-Uniform Quadtree (BUQ) Grids
BUQ grids were first presented in [51] for flood modelling, and this is the first time BUQ grids have been applied at the watershed scale. This circumstance requires an in-depth analysis to determine the minimum resolution (largest cell size) adequate to preserve an accurate description of the flow dynamics. The main characteristics of this type of grid are briefly reported here; for a more detailed description the reader should refer to reference [51].
When the domain is discretized with a Cartesian grid, the cells are organized into regular blocks containing M × M cells (in this work, M = 16), and stored in the memory as a two-dimensional array characterized by a straightforward correspondence between physical localization of a cell and its position in the memory. In this case, retrieving information from cell neighbors is very simple. To overcome the limitation of a Cartesian grid, a new type of structured non-uniform grids (BUQ) was designed to allow a multi-resolution discretization of the domain while maintaining efficiency on GPU. In BUQ grids, each block still contains M × M cells, but the cell size can differ for different blocks. In particular, cell size can vary between the highest level of resolution (level 1, Δ1) and the coarsest one (level N, ΔN = 2 N−1 Δ1), i.e. each block contains cells with size equal to a power-of-two multiple of the minimum size in the domain. The most important criterion for grid generation is the fact that the variation in the resolution level between neighbor blocks cannot exceed one [51]. This idea is similar to quad-tree partitioning [65], the main difference being the fact that single cells are replaced by blocks of cells in the spatial organization of the quad-tree nodes. As an example, Figure  2 shows a detail of a BUQ grid. Starting from a seeding point where the maximum resolution is requested, four different resolutions are originated during pre-processing. The regular blocks (containing M × M cells each) are shown at the different levels of resolution. The first task is only performed if open boundary conditions are present. In this case, some conserved quantities must be imposed in boundary cells [48]. In the second task, the time-step is calculated according to the CFL condition, and the blocks are reorganized in order to exclude dry blocks from the computations and improve efficiency [48]. However, if at least one cell of a dry block falls in the rainfall area at the current time step, the "rain" block is activated and processed like a wet block. Then, during the third task, a dedicated kernel computes the rainfall intensity for each cell in the "rain" blocks. Tasks 4, 5, and 6 correspond to the first half step of the second order scheme, and include the bed elevation reconstruction, the flux and bed slope source term evaluation, and the computation of the friction slope source term. The updating process continues with the second half step (tasks 7, 8, and 9). Finally, during task 10, the solution is updated according to Equation (3), and the rainfall source term is also added, before a new iteration begins. Conversely, only tasks 4, 5 and 6 are performed for the first-order scheme: the rainfall excess computation and the updating of the conserved variables according to Equation (5) are performed during task 6; the iteration process then starts again immediately from task 1.

Block-Uniform Quadtree (BUQ) Grids
BUQ grids were first presented in [51] for flood modelling, and this is the first time BUQ grids have been applied at the watershed scale. This circumstance requires an in-depth analysis to determine the minimum resolution (largest cell size) adequate to preserve an accurate description of the flow dynamics. The main characteristics of this type of grid are briefly reported here; for a more detailed description the reader should refer to reference [51].
When the domain is discretized with a Cartesian grid, the cells are organized into regular blocks containing M × M cells (in this work, M = 16), and stored in the memory as a two-dimensional array characterized by a straightforward correspondence between physical localization of a cell and its position in the memory. In this case, retrieving information from cell neighbors is very simple. To overcome the limitation of a Cartesian grid, a new type of structured non-uniform grids (BUQ) was designed to allow a multi-resolution discretization of the domain while maintaining efficiency on GPU. In BUQ grids, each block still contains M × M cells, but the cell size can differ for different blocks. In particular, cell size can vary between the highest level of resolution (level 1, ∆ 1 ) and the coarsest one (level N, ∆ N = 2 N−1 ∆ 1 ), i.e. each block contains cells with size equal to a power-of-two multiple of the minimum size in the domain. The most important criterion for grid generation is the fact that the variation in the resolution level between neighbor blocks cannot exceed one [51]. This idea is similar to quad-tree partitioning [65], the main difference being the fact that single cells are replaced by blocks of cells in the spatial organization of the quad-tree nodes. As an example, Figure 2 shows a detail of a The correspondence between physical space and memory, typical of Cartesian grids, cannot be maintained for BUQ grids, since all blocks require the same memory size, but blocks with different resolution have different physical size. For this reason, blocks are stored in memory in a nonpreordained way, and additional data structures to efficiently access information on neighbor blocks are introduced. During computations, these arrays will only be used by threads corresponding to cells lying on the border of a block in order to retrieve the values of conserved variables in the neighbor cell that belongs to a different block. The natural neighbor interpolation procedure [66] is adopted to reconstruct the conserved variables in adjacent cells characterized by a different spatial resolution. The C-property in these cells is guaranteed thanks to proper flux computation and bed slope source term treatment, detailed in [51].
This kind of architecture was preferred with respect to other possible procedures for the partitioning of the physical domain (e.g., adaptive mesh refinement (AMR) methods). In the multiresolution approach followed by the writers, the location of the maximum resolution of the computational grid is in fact decided a priori, once and for all, on the basis of the domain's characteristics and of the presence of geometrical elements (such as embankments, built up areas, etc.) capable of sensibly affecting the flow dynamics. On the contrary, the implementation of AMR methods on GPUS poses several challenges mainly related to non-coalesced memory access that leads to considerable overhead.

Experimental and Synthetic Rainfall-Runoff Tests
Two synthetic rainfall-runoff experimental cases were selected to demonstrate the applicability, accuracy, and stability of the proposed scheme. In the absence of analytic reference solutions of the complete hydrodynamic model, the computed results were compared with the solutions of the kinematic wave equation or with experimental data if available.

Test Case 1: One-Dimensional Three Planes Cascade Rainfall-Runoff Case
In this laboratory experiment realized at Kyoto University in 1955 [67] and often adopted for validation purposes [12,25,30,[68][69][70][71], the flow phenomenon evolved on a three planes cascade of width equal to 19.6 cm having slope Sx1 = 0.02, Sx2 = 0.015 and Sx3 = 0.01 and length L = 8 m each ( Figure  3). The three planes were subject to different rainfall intensities equal to r1 = 3888 mm·h −1 , r2 = 2296.8 mm·h −1 and r3 = 2880 mm·h −1 . The Manning coefficient n = 0.009 m −1/3 s was suggested by the author [67], even though other researchers in the literature [68] considered a slightly different value of n = 0.01 m −1/3 s to achieve better results in the numerical simulations. This accounts for possible The correspondence between physical space and memory, typical of Cartesian grids, cannot be maintained for BUQ grids, since all blocks require the same memory size, but blocks with different resolution have different physical size. For this reason, blocks are stored in memory in a non-preordained way, and additional data structures to efficiently access information on neighbor blocks are introduced. During computations, these arrays will only be used by threads corresponding to cells lying on the border of a block in order to retrieve the values of conserved variables in the neighbor cell that belongs to a different block. The natural neighbor interpolation procedure [66] is adopted to reconstruct the conserved variables in adjacent cells characterized by a different spatial resolution. The C-property in these cells is guaranteed thanks to proper flux computation and bed slope source term treatment, detailed in [51].
This kind of architecture was preferred with respect to other possible procedures for the partitioning of the physical domain (e.g., adaptive mesh refinement (AMR) methods). In the multiresolution approach followed by the writers, the location of the maximum resolution of the computational grid is in fact decided a priori, once and for all, on the basis of the domain's characteristics and of the presence of geometrical elements (such as embankments, built up areas, etc.) capable of sensibly affecting the flow dynamics. On the contrary, the implementation of AMR methods on GPUS poses several challenges mainly related to non-coalesced memory access that leads to considerable overhead.

Experimental and Synthetic Rainfall-Runoff Tests
Two synthetic rainfall-runoff experimental cases were selected to demonstrate the applicability, accuracy, and stability of the proposed scheme. In the absence of analytic reference solutions of the complete hydrodynamic model, the computed results were compared with the solutions of the kinematic wave equation or with experimental data if available.

Test Case 1: One-Dimensional Three Planes Cascade Rainfall-Runoff Case
In this laboratory experiment realized at Kyoto University in 1955 [67] and often adopted for validation purposes [12,25,30,[68][69][70][71], the flow phenomenon evolved on a three planes cascade of width equal to 19.6 cm having slope S x1 = 0.02, S x2 = 0.015 and S x3 = 0.01 and length L = 8 m each ( Figure 3). The three planes were subject to different rainfall intensities equal to r 1 = 3888 mm·h −1 , r 2 = 2296.8 mm·h −1 and r 3 = 2880 mm·h −1 . The Manning coefficient n = 0.009 m −1/3 s was suggested by the author [67], even though other researchers in the literature [68] considered a slightly different value of n = 0.01 m −1/3 s to achieve better results in the numerical simulations. This accounts for possible uncertainties in the evaluation of the planes roughness coefficient at the time of the experiments. Both values of the Manning coefficient were considered here, and two different runs of the numerical model were performed with n = 0.009 m −1/3 s (Parflood Rain, test I) and n = 0.01 m −1/3 s (Parflood Rain, test II). In addition to the different values of the rainfall intensities above the three planes, different rainfall durations were also considered, namely t = 10 s (case A), t = 20 s (case B) and t = 30 s (case C). The experimental observations for the three test cases considered are compared in Figure 4 with the resulting hydrographs of unit discharge at the downstream boundary, obtained by the complete hydrodynamic model. values of the Manning coefficient were considered here, and two different runs of the numerical model were performed with n = 0.009 m −1/3 s (Parflood Rain, test I) and n = 0.01 m −1/3 s (Parflood Rain, test II). In addition to the different values of the rainfall intensities above the three planes, different rainfall durations were also considered, namely t = 10 s (case A), t = 20 s (case B) and t = 30 s (case C). The experimental observations for the three test cases considered are compared in Figure 4 with the resulting hydrographs of unit discharge at the downstream boundary, obtained by the complete hydrodynamic model.  The experimental observations for the three test cases considered are compared in Figure 4 with the resulting hydrographs of unit discharge at the downstream boundary, obtained by the complete hydrodynamic model.   As a whole, the comparison shows a good agreement between numerical results and observations, especially for test cases B and C. Some high experimental values can be observed in test condition A (Figure 4a), (unit discharge higher than 70 m 2 s −1 ). However, the investigator had ascribed this strange behavior to the fact that during the experimental runs the lateral inflow rate q achieved at each reach of the flume was not completely uniform. If the three anomalous values are ignored, the comparison between numerical results and observations for case A is quite good. The numerical results related to the adoption of the two different roughness coefficients proposed in the literature (Parflood Rain, test I and test II) show the sensitivity of the numerical computations to the Manning's parameter adopted. Better results were achieved for case A with n = 0.009 m −1/3 s and for case B with n = 0.01 m −1/3 s.
As shown in Figure 5, the V-shaped catchment is characterized by two hillsides with a 0.05 slope in the x-direction, flowing into a rectangular channel having a 0.02 slope in the y-direction. Each side is 1000 m long and 800 m wide, while the channel width is 20 m. The Manning roughness coefficients for the hillsides and the channel are 0.015 m −1/3 s and 0.15 m −1/3 s, respectively. The whole domain was subject to a uniform rainfall intensity equal to r = 10.8 mm·h −1 for a duration of 1.5 h. A free outflow condition was imposed at the channel outlet.
Water 2020, 12, x FOR PEER REVIEW 9 of 26 As a whole, the comparison shows a good agreement between numerical results and observations, especially for test cases B and C. Some high experimental values can be observed in test condition A (Figure 4a), (unit discharge higher than 70 m 2 s −1 ). However, the investigator had ascribed this strange behavior to the fact that during the experimental runs the lateral inflow rate q achieved at each reach of the flume was not completely uniform. If the three anomalous values are ignored, the comparison between numerical results and observations for case A is quite good. The numerical results related to the adoption of the two different roughness coefficients proposed in the literature (Parflood Rain, test I and test II) show the sensitivity of the numerical computations to the Manning's parameter adopted. Better results were achieved for case A with n = 0.009 m −1/3 s and for case B with n = 0.01 m −1/3 s.
As shown in Figure 5, the V-shaped catchment is characterized by two hillsides with a 0.05 slope in the x-direction, flowing into a rectangular channel having a 0.02 slope in the y-direction. Each side is 1000 m long and 800 m wide, while the channel width is 20 m. The Manning roughness coefficients for the hillsides and the channel are 0.015 m −1/3 s and 0.15 m −1/3 s, respectively. The whole domain was subject to a uniform rainfall intensity equal to r = 10.8 mm·h −1 for a duration of 1.5 h. A free outflow condition was imposed at the channel outlet. The numerical computations were performed with three different grids: the first two were Cartesian, characterized by square cells of size 1 m × 1 m and 10 m × 10 m, while the third one was a BUQ multiresolution grid characterized by square cells of size 1, 2, 4 and 8 m. The maximum resolution was adopted in this last grid for the perimeter of the domain and for the channel, while the planes were described according to the BUQ grid partitioning with increasing cell size in order to reduce the number of computing cells ( Figure 6). Indeed, if compared to the 1 m × 1 m Cartesian grid, the adoption of the multiresolution leads to a decrease in the total number of computational elements from over 1.6 × 10 6 to less than 0.3 × 10 6 (with a reduction of about 80%). As a result, the BUQ-based simulation performs 6.2 times faster than the Cartesian-based one ( Table 1).
The computational efficiency of the simulations performed with the BUQ grids were verified by means of the compression rate CR and speed-up SU defined as: where N and T are the number of allocated cells and the computational time, respectively, for the simulation with the BUQ grid considered. NC and TC are the same quantities referred to an identical simulation run using a Cartesian grid with size equal to the maximum resolution adopted in the BUQ grid considered. The numerical computations were performed with three different grids: the first two were Cartesian, characterized by square cells of size 1 m × 1 m and 10 m × 10 m, while the third one was a BUQ multiresolution grid characterized by square cells of size 1, 2, 4 and 8 m. The maximum resolution was adopted in this last grid for the perimeter of the domain and for the channel, while the planes were described according to the BUQ grid partitioning with increasing cell size in order to reduce the number of computing cells ( Figure 6). Indeed, if compared to the 1 m × 1 m Cartesian grid, the adoption of the multiresolution leads to a decrease in the total number of computational elements from over 1.6 × 10 6 to less than 0.3 × 10 6 (with a reduction of about 80%). As a result, the BUQ-based simulation performs 6.2 times faster than the Cartesian-based one (Table 1).   The numerical discharge hydrographs at the end of each hillside and at the channel outlet obtained by using the three different grids are presented in Figure 7 and compared with the analytical solutions of the kinematic wave problem ( [75][76][77][78][79][80]).    The computational efficiency of the simulations performed with the BUQ grids were verified by means of the compression rate C R and speed-up S U defined as: where N and T are the number of allocated cells and the computational time, respectively, for the simulation with the BUQ grid considered. N C and T C are the same quantities referred to an identical simulation run using a Cartesian grid with size equal to the maximum resolution adopted in the BUQ grid considered. The numerical discharge hydrographs at the end of each hillside and at the channel outlet obtained by using the three different grids are presented in Figure 7 and compared with the analytical solutions of the kinematic wave problem ( [75][76][77][78][79][80]).
The numerical hydrographs are in quite good agreement with the reference solution. In particular the simulation performed by adopting the BUQ grid provides results very close to those obtained by using the fine (1 m × 1 m) Cartesian grid. The simulation performed through the adoption of the 10 m × 10 m Cartesian grid is characterized by a very low number of elements and by a shorter computational time compared to the one achieved through the adoption of the BUQ grid. In this case, the grid size adopted is not adequate to describe the channel geometry in detail, and the resulting hydrograph at the channel outlet does not reproduce the reference solution accurately. The numerical discharge hydrographs at the end of each hillside and at the channel outlet obtained by using the three different grids are presented in Figure 7 and compared with the analytical solutions of the kinematic wave problem ( [75][76][77][78][79][80]).

Nure Watershed Field Case
Finally the model was applied to simulate a large flood event that occurred in the Western Emilia-Romagna Apennine region, striking the Trebbia and Nure watersheds especially (Figure 8 The numerical hydrographs are in quite good agreement with the reference solution. In particular the simulation performed by adopting the BUQ grid provides results very close to those obtained by using the fine (1 m × 1 m) Cartesian grid. The simulation performed through the adoption of the 10 m × 10 m Cartesian grid is characterized by a very low number of elements and by a shorter computational time compared to the one achieved through the adoption of the BUQ grid. In this case, the grid size adopted is not adequate to describe the channel geometry in detail, and the resulting hydrograph at the channel outlet does not reproduce the reference solution accurately.

Nure Watershed Field Case
Finally the model was applied to simulate a large flood event that occurred in the Western Emilia-Romagna Apennine region, striking the Trebbia and Nure watersheds especially (Figure 8

Event Description
The rainstorm event was particularly severe with rainfalls that in the most affected areas reached 280 mm in 6 h, with an hourly precipitation peak of 107.6 mm h −1 . The precipitation data were acquired every thirty minutes through weather radar and the reflectivity maps obtained were linked to the observations at the rainfall stations present in the drainage area (courtesy of the Regional

Event Description
The rainstorm event was particularly severe with rainfalls that in the most affected areas reached 280 mm in 6 h, with an hourly precipitation peak of 107.6 mm h −1 . The precipitation data were acquired every thirty minutes through weather radar and the reflectivity maps obtained were linked to the observations at the rainfall stations present in the drainage area (courtesy of the Regional Agency for Prevention, Environment and Energy of Emilia-Romagna, Italy). The storm center was located South-West of the Nure watershed, yet the total volume of rain on the river basin was over 45 × 10 6 m 3 (Figure 9). The rainfall event gave rise to such an exceptional flood that it devastated the whole Nure river valley, destroying many buildings located along the river, especially in the village of Farini ( Figure  10), and the valley road infrastructure, and damaging the Sassi Neri check dam (located about 500 m upstream of Farini) built to protect the river course from a landslide that tends to obstruct the valley (Figure 11).
(a) Figure 9. Isohyet map of total rainfall (mm) for the event of interest obtained through the radar reflectivity, linked to rain gauges on the Trebbia and Nure. In red the boundary of the Nure watershed closed at the Farini outlet.
The rainfall event gave rise to such an exceptional flood that it devastated the whole Nure river valley, destroying many buildings located along the river, especially in the village of Farini (Figure 10), and the valley road infrastructure, and damaging the Sassi Neri check dam (located about 500 m upstream of Farini) built to protect the river course from a landslide that tends to obstruct the valley ( Figure 11).
The rainfall event gave rise to such an exceptional flood that it devastated the whole Nure river valley, destroying many buildings located along the river, especially in the village of Farini ( Figure  10), and the valley road infrastructure, and damaging the Sassi Neri check dam (located about 500 m upstream of Farini) built to protect the river course from a landslide that tends to obstruct the valley (Figure 11).

Available Field Data
Several survey campaigns were carried out in the days following the storm. An unmanned aerial vehicle captured ground images at 10 cm per pixel resolution. A light detection and ranging (LiDAR) survey and a 1 m × 1 m DEM of a significant portion of the river valley were also produced, and a traditional ground survey was conducted to identify the maximum levels reached by the water in the village of Farini and upstream. Clear marks left by mud and debris, over 11 meters above the valley floor, were observed upstream of the narrow cross section of the Sassi Neri check dam. These signs allowed identifying the occurrence a posteriori of an ephemeral lake for a few hours only. The water level hydrograph was acquired during the event at the Farini gauging station, which is located downstream the bridge crossing the river Nure in the town center. Unfortunately, the hydrograph was affected by many inaccuracies.

Bathymetry and Roughness
The 1 m × 1 m DEM of the river valley (visible in the foreground along the main river reach in Figure 8.b) was inlaid through a geographic information system (GIS), in the DEM available of Emilia-Romagna region with a square mesh of size 20 m (in the background of Figure 8.b) to achieve a DEM of the entire watershed under investigation (Figure 12).

Available Field Data
Several survey campaigns were carried out in the days following the storm. An unmanned aerial vehicle captured ground images at 10 cm per pixel resolution. A light detection and ranging (LiDAR) survey and a 1 m × 1 m DEM of a significant portion of the river valley were also produced, and a traditional ground survey was conducted to identify the maximum levels reached by the water in the village of Farini and upstream. Clear marks left by mud and debris, over 11 meters above the valley floor, were observed upstream of the narrow cross section of the Sassi Neri check dam. These signs allowed identifying the occurrence a posteriori of an ephemeral lake for a few hours only. The water level hydrograph was acquired during the event at the Farini gauging station, which is located downstream the bridge crossing the river Nure in the town center. Unfortunately, the hydrograph was affected by many inaccuracies.

Bathymetry and Roughness
The 1 m × 1 m DEM of the river valley (visible in the foreground along the main river reach in Figure 8b) was inlaid through a geographic information system (GIS), in the DEM available of Emilia-Romagna region with a square mesh of size 20 m (in the background of Figure 8b) to achieve a DEM of the entire watershed under investigation (Figure 12).
village of Farini and upstream. Clear marks left by mud and debris, over 11 meters above the valley floor, were observed upstream of the narrow cross section of the Sassi Neri check dam. These signs allowed identifying the occurrence a posteriori of an ephemeral lake for a few hours only. The water level hydrograph was acquired during the event at the Farini gauging station, which is located downstream the bridge crossing the river Nure in the town center. Unfortunately, the hydrograph was affected by many inaccuracies.

Bathymetry and Roughness
The 1 m × 1 m DEM of the river valley (visible in the foreground along the main river reach in Figure 8.b) was inlaid through a geographic information system (GIS), in the DEM available of Emilia-Romagna region with a square mesh of size 20 m (in the background of Figure 8.b) to achieve a DEM of the entire watershed under investigation (Figure 12).  According to the Emilia-Romagna Region public database, the Nure watershed is characterized by many different soil types and related uses. The original soil polygon map was reworked by appropriately combining some similar subgroups, and seven macro areas were identified with related CN values (Table 2 and Figure 13). The Manning's roughness coefficients for the numerical simulations were then deduced from literature values on the basis of the soil use. simulations were then deduced from literature values on the basis of the soil use.

Results for the Nure Watershed Test Case
The BUQ grid is obviously to be preferred to reduce the computational burden and achieve short computational times. For this reason, five preliminary runs adopting Cartesian grids of increasing cell dimensions (5, 10, 20, 40 and 80 m) were performed with the aim of identifying the suitable maximum cell size to be adopted for the watershed sides which gives accurate results. A transmissive boundary condition was assumed in the numerical simulations at the Farini outlet, by imposing a channel slope equal to 0.01, suitable to represent the local river characteristics correctly. Seven crosssections were identified at the outlet of the main watershed sub-basins (Figure 12), and the corresponding discharge hydrographs were then evaluated and compared.
As Figure 14 shows, no significant differences are present for Cross Sections 1, 2, 5 and 6 (see Figure 12 to identify their locations) in the discharge hydrographs resulting from the simulations performed on Cartesian grids from 5 to 40 m cell size, in terms of neither peak value, nor flood peak timing. The discharge hydrographs show remarkable differences from the ones obtained from the other preliminary runs for the last run adopting a Cartesian grid of 80 m cell size. For this reason, the

Results for the Nure Watershed Test Case
The BUQ grid is obviously to be preferred to reduce the computational burden and achieve short computational times. For this reason, five preliminary runs adopting Cartesian grids of increasing cell dimensions (5, 10, 20, 40 and 80 m) were performed with the aim of identifying the suitable maximum cell size to be adopted for the watershed sides which gives accurate results. A transmissive boundary condition was assumed in the numerical simulations at the Farini outlet, by imposing a channel slope equal to 0.01, suitable to represent the local river characteristics correctly. Seven cross-sections were identified at the outlet of the main watershed sub-basins (Figure 12), and the corresponding discharge hydrographs were then evaluated and compared.
As Figure 14 shows, no significant differences are present for Cross Sections 1, 2, 5 and 6 (see Figure 12 to identify their locations) in the discharge hydrographs resulting from the simulations performed on Cartesian grids from 5 to 40 m cell size, in terms of neither peak value, nor flood peak timing. The discharge hydrographs show remarkable differences from the ones obtained from the other preliminary runs for the last run adopting a Cartesian grid of 80 m cell size. For this reason, the resolution of 40 m was confirmed to be suitable to describe the watershed sides with no loss of accuracy.
Three BUQ grids, characterized by a decreasing total number of computational cells, were then considered to perform different simulations of the event (Figure 15). In BUQ Grid 1, the minimum grid size of 5 m × 5 m was imposed in correspondence of low-order river branches and at the watershed boundary. Due to both the grid partitioning, and the huge number of cells imposed at the maximum resolution, the maximum cell size of 40 m × 40 m was not present. In BUQ Grid 2, instead, the maximum resolution was required only in higher order river branches and at the watershed boundary, while in BUQ Grid 3 no resolution constraints were imposed at the watershed boundary. As reported in Table 3, the total number of cells of the three BUQ grids decreased from 8.49 × 10 6 to 0.88 × 10 6 from BUQ Grid 1 to BUQ Grid 3. As a consequence, a significant speed-up was achieved in the simulations adopting BUQ Grid 3 compared to the reference Cartesian mesh of 5 m × 5 m. The duration of the simulated event was 40 h. Three BUQ grids, characterized by a decreasing total number of computational cells, were then considered to perform different simulations of the event (Figure 15). In BUQ Grid 1, the minimum grid size of 5 m × 5 m was imposed in correspondence of low-order river branches and at the watershed boundary. Due to both the grid partitioning, and the huge number of cells imposed at the maximum resolution, the maximum cell size of 40 m × 40 m was not present. In BUQ Grid 2, instead, the maximum resolution was required only in higher order river branches and at the watershed boundary, while in BUQ Grid 3 no resolution constraints were imposed at the watershed boundary. As reported in Table 3, the total number of cells of the three BUQ grids decreased from 8.49 × 10 6 to 0.88 × 10 6 from BUQ Grid 1 to BUQ Grid 3. As a consequence, a significant speed-up was achieved in the simulations adopting BUQ Grid 3 compared to the reference Cartesian mesh of 5 m × 5 m. The duration of the simulated event was 40 h. In the simulation adopting the Cartesian mesh of 5 m × 5 m, the number of cells N C was equal to 8.49 × 10 6 , while T C was 14.77 h with a simulation to physical time ratio of 0.37. Compression rates and speed-ups achieved with the different BUQ grids adopted are reported in Table 3. The maximum values of C R and S U , respectively equal to 9.65 and 7.86, were obtained in the simulation based on BUQ Grid 3, as expected. All simulations were performed using a P100 Tesla ® GPU.
Comparable results were achieved in the different simulations in the main river channel at the Farini level gauging station ( Figure 16). Malfunctioning of the depth sensor led to many uncertainties affecting the depth hydrograph acquired during the event, so only the maximum water level is reported. This was collected after the event from the marks left by mud and water on the bridge structure. All simulations predicted the maximum water level quite well.  In the simulation adopting the Cartesian mesh of 5 m × 5 m, the number of cells NC was equal to 8.49 × 10 6 , while TC was 14.77 h with a simulation to physical time ratio of 0.37. Compression rates and speed-ups achieved with the different BUQ grids adopted are reported in Table 3. The maximum values of CR and SU, respectively equal to 9.65 and 7.86, were obtained in the simulation based on BUQ Grid 3, as expected. All simulations were performed using a P100 Tesla ® GPU.
Comparable results were achieved in the different simulations in the main river channel at the Farini level gauging station ( Figure 16). Malfunctioning of the depth sensor led to many uncertainties affecting the depth hydrograph acquired during the event, so only the maximum water level is reported. This was collected after the event from the marks left by mud and water on the bridge structure. All simulations predicted the maximum water level quite well.   Table 3. Characteristics of simulations, compression rates C R and speed-ups S U for the different BUQ grids adopted.   In the simulation adopting the Cartesian mesh of 5 m × 5 m, the number of cells NC was equal to 8.49 × 10 6 , while TC was 14.77 h with a simulation to physical time ratio of 0.37. Compression rates and speed-ups achieved with the different BUQ grids adopted are reported in Table 3. The maximum values of CR and SU, respectively equal to 9.65 and 7.86, were obtained in the simulation based on BUQ Grid 3, as expected. All simulations were performed using a P100 Tesla ® GPU.

S U (-)
Comparable results were achieved in the different simulations in the main river channel at the Farini level gauging station ( Figure 16). Malfunctioning of the depth sensor led to many uncertainties affecting the depth hydrograph acquired during the event, so only the maximum water level is reported. This was collected after the event from the marks left by mud and water on the bridge structure. All simulations predicted the maximum water level quite well.       The computed maximum water levels are in good agreement with the observations just upstream of the narrowing of the river valley, where the Sassi Neri check dams are located. Upstream of the check dam the calculated water surface elevation (left frame of Figure 18) is almost horizontal and the velocity (right frame of Figure 18) very low. This clearly shows that an ephemeral lake formed here, storing an estimated water volume of about 0.9·10 6 m 3 . The numerical results confirm that the two Sassi Neri check dams were completely submerged and circumvented to the left, as confirmed     The computed maximum water levels are in good agreement with the observations just upstream of the narrowing of the river valley, where the Sassi Neri check dams are located. Upstream of the check dam the calculated water surface elevation (left frame of Figure 18) is almost horizontal and the velocity (right frame of Figure 18) very low. This clearly shows that an ephemeral lake formed here, storing an estimated water volume of about 0.9·10 6 m 3 . The numerical results confirm that the two Sassi Neri check dams were completely submerged and circumvented to the left, as confirmed The computed maximum water levels are in good agreement with the observations just upstream of the narrowing of the river valley, where the Sassi Neri check dams are located. Upstream of the check dam the calculated water surface elevation (left frame of Figure 18) is almost horizontal and the velocity (right frame of Figure 18) very low. This clearly shows that an ephemeral lake formed here, storing an estimated water volume of about 0.9·10 6 m 3 . The numerical results confirm that the two Sassi Neri check dams were completely submerged and circumvented to the left, as confirmed by the side erosion, and by the damage to the structure and to the wings observed after the event (Figure 11).
The flow accelerated and became supercritical with maximum velocities over 10 ms −1 downstream of the valley bottleneck originated by the presence of the landslide. Lower velocities of about 5 ms −1 were predicted downstream of the bridge in Farini (right frame of Figure 18).
The evolution of the flooding event can be observed at the watershed scale in Figure 19. The numerical model (BUQ Grid 3) provides an accurate reconstruction of the event, characterized by a very high runoff at 3 a.m. on 14 September. Later, the water tended to reach the network of surface streams in which the channel flow occurred with a very quick reaction to rainfall. The maximum flow at Farini occurred at about 3:45 a.m. on 14 September, and the flood event can be definitely referred to as flash flood, as often occurs for the small catchments of the Italian Apennine hydrographic district, characterized by fast hydrological response times.
Water 2020, 12, x FOR PEER REVIEW 18 of 26 by the side erosion, and by the damage to the structure and to the wings observed after the event ( Figure 11). The flow accelerated and became supercritical with maximum velocities over 10 ms −1 downstream of the valley bottleneck originated by the presence of the landslide. Lower velocities of about 5 ms −1 were predicted downstream of the bridge in Farini (right frame of Figure 18). The evolution of the flooding event can be observed at the watershed scale in Figure 19. The numerical model (BUQ Grid 3) provides an accurate reconstruction of the event, characterized by a very high runoff at 3 a.m. on 14 September. Later, the water tended to reach the network of surface streams in which the channel flow occurred with a very quick reaction to rainfall. The maximum flow at Farini occurred at about 3:45 a.m. on 14 September, and the flood event can be definitely referred to as flash flood, as often occurs for the small catchments of the Italian Apennine hydrographic district, characterized by fast hydrological response times.  The maximum computed water depths for the whole Nure watershed are shown in Figure 20. The water depths at the catchment sides are generally low (below 0.1 m).
Water 2020, 12, x FOR PEER REVIEW 19 of 26 The maximum computed water depths for the whole Nure watershed are shown in Figure 20. The water depths at the catchment sides are generally low (below 0.1 m).   Figure  21 (lower frames) shows also the velocity field with superimposition of the velocity vectors: as can be seen, the model is capable of providing very detailed information about the velocity field in the different regions of the computational domain.

Discussion
The simulation of rapidly varying real flooding scenarios at the watershed scale can nowadays be satisfactorily performed through high spatial resolution 2D models, thanks to the increasingly widespread and affordable LiDAR DEMs. The computational time has always represented a constraint for this kind of simulation. The new multiresolution models, such as that presented here, implemented in a CUDA/C++ code that exploits parallel computation offered by GPUs, allow simulating flooding on several hundreds of square kilometers while maintaining a high spatial resolution in the areas of major interest, with very satisfactory ratios of simulation to physical times.
In particular, the computational model developed and validated here, solves the complete 2D-SWEs, and allows simulating the process of flood propagation starting from the precipitation falling on a watershed. Moreover, due to the adoption of the multiresolution BUQ grids, the modeling tool can also accurately reproduce flood inundations in areas characterized by the presence of roads, railways or channel embankments, or in urban environments, where the flooding dynamics can be strongly influenced by streets and buildings and the geometrical description must be very detailed.   Figure 21 (lower frames) shows also the velocity field with superimposition of the velocity vectors: as can be seen, the model is capable of providing very detailed information about the velocity field in the different regions of the computational domain. for flood protection purposes, to evaluate the performance of flood detention structures, or to investigate the behavior of new flood protection hydraulic structures. Finally, the performance of this model could be further improved thanks to the extension of the code to multi-GPUs.

Discussion
The simulation of rapidly varying real flooding scenarios at the watershed scale can nowadays be satisfactorily performed through high spatial resolution 2D models, thanks to the increasingly widespread and affordable LiDAR DEMs. The computational time has always represented a constraint for this kind of simulation. The new multiresolution models, such as that presented here, implemented in a CUDA/C++ code that exploits parallel computation offered by GPUs, allow simulating flooding on several hundreds of square kilometers while maintaining a high spatial resolution in the areas of major interest, with very satisfactory ratios of simulation to physical times.
In particular, the computational model developed and validated here, solves the complete 2D-SWEs, and allows simulating the process of flood propagation starting from the precipitation falling on a watershed. Moreover, due to the adoption of the multiresolution BUQ grids, the modeling tool can also accurately reproduce flood inundations in areas characterized by the presence of roads, railways or channel embankments, or in urban environments, where the flooding dynamics can be strongly influenced by streets and buildings and the geometrical description must be very detailed. In the future, automatic procedures for analyzing terrain models could allow the creation of computational grids based on the identification of some salient features of the basin of interest, completely independent of the hydrodynamic phenomena that can take place in it. These tools, equipped with suitable criteria, could guarantee the generation of a grid consisting of the fewest number of calculation cells compatible with the solution accuracy required.
The present solution algorithm preserves all the terms of the momentum equations and is suitable to deal with the rapidly varying flows that can occur in a two-dimensional flow field during inundation phenomena with low ratios of computational to physical time. Flood management in populated territories crossed by rivers is a crucial issue. Hence, fast and accurate numerical models such as that presented here can be adopted to perform simulations of hypothetical flooding scenarios for flood protection purposes, to evaluate the performance of flood detention structures, or to investigate the behavior of new flood protection hydraulic structures. Finally, the performance of this model could be further improved thanks to the extension of the code to multi-GPUs.
Author Contributions: Conceptualization and methodology, F.A., R.V. and F.P.; software, F.P., R.V., A.F. and S.D.; validation, F.P. and F.A.; investigation, F.A. and F.P.; writing-original draft preparation, F.A.; writing-review and editing, F.A., R.V., F.P., A.F. and S.D.; visualization, F.A. and F.P.; supervision, F.A. and R.V. All authors have read and agreed to the published version of the manuscript. In Equations (A2)-(A6), z is obtained through the local bed modification scheme: in which ∆z is a function of δz (7): if δz ≥ 0 : where ε h represents the wet/dry threshold (here 10 −12 m) and the water surface elevations η i,j or η i+1,j are the original water surface elevations at cell center rather than the reconstructed value (which is used in Equations (A5) and (A6) instead).

A.2. Friction Source Term Treatment
After the update of the conserved variables at the half time step (4), the friction source term can be discretized in an effective explicit formulation through analytical manipulations. As described in [74] from the momentum components expanded into a scalar form, the unit width discharge in x the direction, if DEN = ∆t n gn 2 (h n ) − where: mx = qx n + ∆t n Ax , my = qy n + ∆t n Ay , in which Ax and Ay denote the scalar components of Equation (5) neglecting the terms S r and S f . Like Equation (A9), the explicit formulation can be similarly obtained in the y direction.

A.3. Soil Conservation Service SCS Method
The continuity principle establishes that the total precipitation P can be considered as the sum of the excess precipitation, P e , of the initial abstraction I a before ponding, for which no runoff occurs, and of the precipitation depth retained in the watershed, F a : P = P e + I a + F a .
(A11) Based on this principle, the SCS method hypothesizes the following equivalence between the ratios of the actual to the potential quantities: The initial abstraction I a can be evaluated through the empirical relation: in which the parameter α usually assumes the value 0.2. With this assumption, the excess precipitation P e becomes: where the potential maximum retention S ∞ (mm) is related to the curve number CN by the expression: For each computational cell, the cumulative infiltration (mm) can be found by substituting Equation (A14) in Equation (A12): with P and F a calculated and stored at every time step ∆t. The infiltration ∆F a for the current time interval can be obtained through a difference between the cumulative infiltration at times n and n-1.
The related infiltration rate f is then computed as follows: f n = F a n − F a n−1 ∆t . (A17)