Environmental Modelling and Software

This paper presents an integrated modeling framework aiming at accurate predictions of flood hazard from heavy rainfalls. The accuracy of such predictions generally depends on the complexity and resolution of the employed model components. We propose an integration of complementary models in one framework that facilitates GPUs to improve accuracy and simulation time. The spatially distributed runoff model integrates surface flow routing based on the full shallow water equations, infiltration based on the Green–Ampt equation, and interception. In urban areas, the runoff model is coupled with the Storm Water Management Model (SWMM). The integrated model is validated and tested on laboratory, rural and urban scenarios with regards to accuracy and computational efficiency. The GPU acceleration yields speedups of 1000 times compared to a CPU implementation and enables the coupled simulation of flash floods at 1 m resolution for an urban area of 200 km 2 in realtime.


Introduction
Floods are increasing in many parts of the world due to climate and land use change Blöschl et al., 2019) causing disproportionally high damage in urban regions (Jongman, 2018). To mitigate future flood damage, detailed models that assist in assessing the flood hazard spatially are crucial (Rosenzweig et al., 2021). In contrast to lumped models, spatially distributed models allow for an explicit representation of spatial variations and inhomogeneities in input data, such as topography, vegetation, soil characteristics, and urban features. Notwithstanding scale issues (Grayson and Blöschl, 2001), there is a lot of value in spatially distributed high-resolution modeling for management purposes. However, higher resolutions lead to slower simulations. In addition, ensemble simulations that quantify the uncertainty of the predictions and provide insights into the effects of parameter variation increase the computational burden even further. Thus, the challenge is to advance the capabilities of numerical modeling while balancing simulation performance and model accuracy.
To accurately represent the topography, the resolution of the simulation grid should be chosen accordingly. As a rule of thumb, terrain features should be covered by at least 3 cells to be represented explicitly (Gallegos et al., 2009;Fewtrell et al., 2011;Horváth et al., 2020). For urban areas, a resolution of 2 m or less is considered wave (sometimes also called zero-inertia) or kinematic wave approximations (Neal et al., 2012;Le et al., 2015;Fry and Maxwell, 2018;Yang et al., 2020). For urban regions, the full or dynamic SWEs in combination with shock capturing schemes are able to reproduce observed hydraulic behavior and velocities more accurately than simplified models (Kvočka et al., 2015;Costabile et al., 2020). Cozzolino et al. (2019) conclude that the preferred model for floodplain simulations should be the full 2D SWEs as simplified models often suffer from a poor representation of receding flows and bed discontinuities. Still, issues such as wetting and drying over complex terrain pose a numerical challenge and constitute an active area of research (Chen and Noelle, 2017;Xia et al., 2017;Buttinger-Kreuzhuber et al., 2019). If not treated properly, numerical instabilities occur and lead to slow simulations. The full SWE offer a model to simulate both complex open channel hydrodynamics and overland flow processes (Costabile et al., 2013;Fernández-Pato et al., 2016), in particular at high resolutions (Caviedes-Voullième et al., 2020). Recent studies (Costabile et al., 2017;Aricò and Nasello, 2018;Caviedes-Voullième et al., 2020) point out that solvers for the full SWEs might in fact require less computational time than their zeroinertia counterparts. Popular scheme choices for the full SWEs include discontinuous-Galerkin (DG) (Kesserwani et al., 2008;Vater et al., 2017;Ayog et al., 2021) and finite volume (FV) (Audusse et al., 2004;Horváth et al., 2015;Hou et al., 2014;Buttinger-Kreuzhuber et al., 2019;Dong and Li, 2021) methods. Second-or higher-order FV schemes are more accurate than their first-order counterparts, which are prone to numerical diffusion (Audusse and Bristeau, 2005;Noelle et al., 2006Noelle et al., , 2007Li et al., 2014;Navas-Montilla and Murillo, 2015;Hou et al., 2015). Most second-order schemes follow a monotonic upstreamcentered scheme for conservation laws (MUSCL) approach (Van Leer, 1979), resulting in a shock-capturing scheme with reduced numerical diffusion yet without unphysical oscillations. However, compared to first-order schemes, the higher accuracy of second-order schemes comes at the price of higher runtimes due to a reduced CFL constant and second-order time integration. Thus, in large-scale flood modeling first-order schemes are still commonly used (Xia et al., 2019;Echeverribar et al., 2019;Morales-Hernández et al., 2021). In complex flash flood scenarios, it is not immediately clear whether the application of a second-order scheme pays off in terms of accuracy and required computational work.
A promising way to achieve computational speedups is the execution in a massively parallel fashion on supercomputers (Noh et al., 2018;Kuffour et al., 2020) or on graphics processing units (GPUs) (Lastra et al., 2009;Brodtkorb et al., 2012;Vacondio et al., 2014;Lacasta et al., 2015;Le et al., 2015;Horváth et al., 2016;Xing et al., 2018;Xia et al., 2019;Morales-Hernández et al., 2021). To achieve a high computational performance on GPUs, regular grids are a convenient choice due to the structured arrangement of cores (Morales-Hernández et al., 2020). Unstructured meshes should be reordered for efficient memory access patterns allowing for coalescent transactions (Lacasta et al., 2015). Cutting-edge flash flood models are on the verge of handling resolutions of 5 m for large regions of up to 2500 km 2 , or, 100 million cells (Xia et al., 2019). Traditionally, cities are split into multiple smaller simulation regions that tend to underestimate inundation (Xing et al., 2018). Thus, high-resolution simulations at large scales, e.g. spanning entire cities, are needed.
A variety of infiltration models exist, including the empirical Soil Conservation Service (SCS) curve number method (Chow et al., 1988;Aureli et al., 2020), the empirical Horton model (Fernández-Pato et al., 2016;Fernández-Pato and García-Navarro, 2018), the semi-empirical Green-Ampt model (Fiedler and Ramirez, 2000;Simons et al., 2013;Delestre et al., 2017;Fernández-Pato et al., 2016), and more complex models such as Richards equation (Maxwell, 2013;Le et al., 2015;Kuffour et al., 2020) although capturing macropore flow (Zehe et al., 2007) remains a challenge. The model's ability to capture the local effects of green infrastructure (GI), such as green roofs, rain gardens, or bioswales, is important in urban flood resilience planning (Berland et al., 2017;Fry and Maxwell, 2018;Rosenzweig et al., 2021). For urban flood hazard modeling, the flow in sewer systems and its interaction with the overland flow may be relevant. The Storm Water Management Model (SWMM) is an established tool for routing stormwater in sewer systems. It is developed by the Environmental Protection Agency (EPA) as an open-source software package (Rossman, 2017). A widely used approach to bidirectionally couple urban drainage networks to overland flow are dual drainage models (Leandro and Martins, 2016;Yang et al., 2020;Li et al., 2020;Rosenzweig et al., 2021). The interaction terms are commonly based on the water level differences between the sewer nodes and the surface water (Djordjević et al., 2005;Chen et al., 2016;Rubinato et al., 2017;Fernández-Pato and García-Navarro, 2018). Previous coupled models featuring a dynamic sewer network simulation, e.g. Leandro and Martins (2016), Fernández-Pato and García-Navarro (2018), Noh et al. (2018), Yang et al. (2020), were run on central processing units (CPUs) and not exploiting recent leaps in model acceleration facilitated by GPUs. As the coupled models runtimes are typically governed by the surface simulation (Noh et al., 2018), fast runoff models are crucial.
In this paper, we present a coupled modeling framework for fast simulations in urban and rural settings. The framework includes several components considered relevant in rainfall-runoff modeling and flash flood hazard assessment, that is, spatially distributed interception and infiltration, an accurate representation of overland flow, and subsurface flow in sewer networks. We go beyond current modeling practice by using both a spatially distributed rainfall-runoff model and a fully bidirectional coupling of the sewer network accounting for drains and overflows at large scales and very high resolutions. We propose a novel, hybrid coupling approach of a GPU-accelerated runoff simulation with an established CPU sewer network simulation. We validate and test the framework in laboratory, rural and urban scenarios. We answer the question whether first-order or second-order schemes in the surface flow discretization of the full 2D SWEs should be favored in terms of the workload-accuracy tradeoff. Moreover, we highlight the influence of resolution and of the individual model components. Finally, we address the extent of computational acceleration on a modern GPU for high-resolution simulations of entire cities.

Surface flow model
The full shallow water equations (SWEs) are used to describe the surface flow and may be written in vector form as where = [ℎ, ℎ , ℎ ] is the vector of conserved variables, ℎ represents the water height, ℎ is the discharge along the -axis, and ℎ is the discharge along the -axis. and are the flux functions, The bed slope term , models the fluid's acceleration due to the gravitational forces. The friction term , A. Buttinger-Kreuzhuber et al. accounts for the bed friction. Here, and are the average flow velocities in and directions, respectively, is the gravitational constant, is the bed level (assumed to be time-independent), and is the Manning friction coefficient.
To integrate the interception and infiltration processes of the runoff model and the sewer model with the surface flow, coupling terms and , respectively, are added on the right hand side of Eq. (1). The coupling term for the sewers accounts for the specific sewer exchange discharge (m/s). The source terms and are given by The runoff rate (m/s) is the difference between the effective precipitation rate and the effective infiltration rate. Its precise application is defined in Section 2.4.

Spatio-temporal discretization of the surface flow
For the spatial discretization of the SWEs, the finite volume method (FVM) was chosen on a uniform Cartesian grid. Although all runoff components can also be discretized on unstructured grids, structured grids require less pre-processing steps and enable fast memory access patterns on the GPU (Morales-Hernández et al., 2020). Moreover there is no need to store or optimize mesh connectivity. The FVM discretizes the conserved variables as cell averages yielding a system of ordinary differential equations for the cell averages , ( ).
For the simulation of the overland flow, we employ either a firstorder accurate or a second-order accurate scheme. The first-order scheme of Chen and Noelle (2017), to which we refer as CN scheme, enables a better handling of flow states across bed discontinuities than the original hydrostatic reconstruction (HR) scheme proposed by Audusse et al. (2004). A second-order accurate extension of the firstorder CN scheme is presented in Buttinger-Kreuzhuber et al. (2019), to which we refer as BH scheme. The second-order accuracy in space is achieved through a minmod limiter. The minmod parameter is set to 1 in order to ensure robust and fast simulations (Horváth et al., 2020). At wet-dry boundaries only the velocities are set to zero below a cut-off water depth threshold. In the simulations this threshold is set to 0.1 mm. The surface flow is discretized in time by the explicit Euler's method for the first-order CN scheme with a Courant-Friedrichs-Lewy (CFL) constant of 0.5 to guarantee numerical stability and non-negativity of the water depths. The second-order BH scheme is integrated in time with Heun's method and the CFL constant is set to 0.25. The CFL condition restricts the time step , where is the uniform grid resolution and is the maximum of the absolute values of the numerical wave speeds. The numerical wave speeds at the cell interfaces are computed from the eigenvalues of the Jacobian of the flux functions and (Buttinger-Kreuzhuber et al., 2019). Both schemes are mass conserving and preserve lake-at-rest steady states. The friction source term is evaluated in a semi-implicit manner by splitting it into a coefficient-wise product of an implicitly evaluated state and an explicitly evaluated friction term̃ (Brodtkorb et al., 2012;Buttinger-Kreuzhuber et al., 2019).

Runoff model
The spatially distributed runoff simulation integrates the surface flow routing component with an interception and an infiltration component determining the effective surface runoff. First, part of the rain is stored in the canopy of vegetation through interception. Second, infiltration occurs as surface water percolates into permeable soils. The remaining water effectively materializing during a rain event runs off the surface as overland flow.
The rainfall intensity is given by a time-and space-dependent precipitation rate . The integrated interception component reduces the effective precipitation and accounts for micro-topographic depressions and losses due to vegetation. The interception storage capacity is typically roughly around 1 mm (Robinson and Ward, 2017), except for rural forests, where values as high as 5 mm were found (Xiao et al., 1998). The cumulative interception ( ) until time is modeled with a constant non-negative rate until a predefined storage capacity is reached. Thus, the spatially distributed effective precipitation rate is given by The infiltration process is modeled by the Green-Ampt equation. The cumulative infiltration up to time is where is the saturated hydraulic conductivity. The difference between the initial water content and the saturated water content of the soil is usually called effective porosity. The suction head represents the capillary attraction of the water towards the soil voids. Solving Eq. (8) for the infiltration rate , the time derivative of , we obtain The proposed dynamic infiltration model accounts for the surface water pressure via the surface water height ℎ. A shortcoming of the presented Green-Ampt model is the inability to account for multi-layered soils, limited storage capacity, soil water redistribution in dry phases and macropores. Extensions to overcome these limitations have been proposed (Corradini et al., 2000;Gowdish and Muñoz Carpena, 2009;Mohammadzadeh-Habili and Heidarpour, 2015;.

Temporal discretization of the runoff model
The Green-Ampt (GA) model is discretized in time with the implicit Euler method solving Eq. (9) at every cell for every time step. The infiltration depths at time step +1 is given by We note that, even though the infiltration rate is undefined for = 0, the implicit Euler method yields a well-defined infiltration depth close to zero, in contrast to the explicit Euler method. If the infiltration depth increment , defined by = +1 − , exceeds the available surface water depth, it is restricted to ensure a nonnegative surface water depth. The effective infiltration rate is then given by where is the CFL-limited timestep of the overland flow for timestep . For the runoff model, it is enough to perform a simple integration for both the effective precipitation and infiltration rate. We combine the precipitation and infiltration increment into a single effective runoff increment , The overland flow and runoff models are tightly coupled, every step in the surface flow simulation is synchronized with the computation and application of the runoff update.

GPU implementation of the runoff model
The spatial discretization of the surface flow with the FVM enables straightforward parallelization on structured grids, as only neighboring cells need to be considered when computing the next time step. The GPU implementation of the runoff model uses the CUDA platform of NVIDIA. In CUDA, parallel tasks are organized into thread blocks and typically each parallel task (each so-called thread) is associated with one cell. A block size of 16 by 16 cells ensures a high utilization of the GPU (Horváth et al., 2016). The number and positions of the neighboring cells required to compute the fluxes of a particular cell are called the computation stencil. For the first-order CN scheme, only the direct neighbors are accessed. For the second-order BH scheme, the computation of the minmod-limited gradient requires a neighborhood of two cells to be accessed in the four axis-aligned directions. Halo cells are required to exchange the data between adjacent blocks. Effectively, fluxes are computed only for the inner block of 14 by 14 cells for the first-order CN scheme or for 12 by 12 cells for the second-order BH scheme due to the different computational stencils and hence the different number of halo cells required. The GPU implementation handles data dependencies with on-chip memory for fast processing of the computational stencil, e.g. shuffles introduced with the Kepler microarchitecture. For implementation details, the reader is referred to Horváth et al. (2016).
GPU implementations of the SWEs operate either on single-precision or on double-precision floating-point variables. For single-precision state variables the memory burden is lower and floating-point operations are faster, as GPUs were originally optimized for singleprecision computations and therefore speed-ups of up to eight times over double-precision are reported (Morales-Hernández et al., 2020). To alleviate the computational burden in the proposed runoff model, both the surface state and the infiltration state are stored in singleprecision floating-point variables. In previous publications (Buttinger-Kreuzhuber et al., 2019;Horváth et al., 2020), we verified that the employed surface flow model relying on single-precision state variables captures lake-at-rest steady states, displays the expected convergence order and is capable of handling wet/dry transitions. Still, truncation errors might accumulate over time, which may have an effect on the mass balance (Liang and Smith, 2015;Dazzi et al., 2021). Therefore, we verify that the runoff model's mass balance is within machine precision in the presented test cases.
On a modern GPU with 24 GB of video memory the domain size in our computational model is limited to around 175 million active (wet) cells for the second-order BH scheme with dynamic runoff and singleprecision floating-point variables. For the first-order CN scheme, the domain size is limited to around 225 million active cells as the firstorder time integration does not require the storage of an intermediate state.

Sewer network model
In the Storm Water Management Model (SWMM), a sewer network is represented by a set of nodes connected by links (Rossman, 2017). Links transmit pipe discharges from one node to another. A so-called node assembly consists of the node and all connected half-links, these are the halves of the link that are connected to the node. At each node assembly, the change in the hydraulic head is modeled by the continuity equation. The pipe flow is governed by the transient 1D SWE, i.e.
where is the pipe flow velocity, is the cross-sectional flow area, is the hydraulic head, and is the friction slope in the pipes. In the SWMM, Eq. (13) is solved with a finite difference scheme. Thus, at each link, momentum and continuity are conserved, in contrast to the nodes, where only continuity is conserved. The continuous state variables in the time differences are approximated with their average values over the conduit length. SWMM 5.1 uses an implicit backwards Euler method for the time discretization, which is solved iteratively with Picard's method. We use the Preissmann slot model, implemented in the latest version of SWMM 5.1.013, which is integrated in our coupled model setup. Unfortunately, SWMM does not exactly preserve water volumes. Continuity errors are typically within a few percent (Rossman, 2006) and may be reduced by artificially introducing a finer spatial discretization (Pachaly et al., 2020). SWMM 5.1 is written in C/C++ and is easily incorporated into existing C++ software such as the proposed modeling framework.

Bidirectional surface-sewer coupling
The surface-sewer discharge exchange term depends on the water level at the surface , the hydraulic head at the manhole and the bed surface elevation . In the following, and are the manhole's area and diameter, respectively, is the distance between the surface and the invert level of the pipes entering the node. The invert level refers to the lowest elevation admitting water flow. Following Djordjević et al. (1) inflow into a non-pressurized node, (2) inflow into a pressurized node, where the surface flow depth is small when compared to the node width, (3) inflow into a pressurized node, where the surface flow depth is large when compared to the node width, (4) outflow onto the floodplain.
If the head in the pipe network is lower than the surface elevation, i.e.
< , the discharge exchange (m 3 /s) is given by the free weir equation If the head in the pipe network exceeds the surface elevation, i.e. > , the discharge exchange is either given by the submerged weir equation, as long as ℎ < ∕ . If ℎ ≥ ∕ the node is considered fully submerged, the submerged orifice equation is considered a more appropriate description. For example, for circular manholes, the orifice equation is applied if ℎ > ∕4. The discharge coefficients for the free weir, the submerged weir, and the orifice equations are set to , = 0.56, , = 0.11, and , = 0.2, respectively (Rubinato et al., 2017).
If the head in the pipe system exceeds the water level of the surface flow, an orifice equation is used (Djordjević et al., 2005). Assuming that the surface velocity is negligible, the discharge exchange is given by This equation also holds for dry surface cells, i.e. when ℎ = 0. With these four cases, all exchange flow conditions are properly handled. A negative exchange discharge indicates flow into the sewer network from the surface, a positive value indicates sewer overflow.
Sewer overflow also occurs if water flow from the roofs of the surrounding buildings exceeds the sewer inflow capacity. Consequently, water spills over at this node. In this case, the roof water is directly added to the sewer overflow as excess discharges and not reduced by the surface-sewer exchange equations, Eqs. (14)- (17).
The surface-sewer coupling takes place only at the cells where manholes and inlets are connected to the surface. To this end, the geometry of the manholes and inlets is rasterized on the simulation A. Buttinger-Kreuzhuber et al. The sewer and the surface simulator are executed in parallel, but they need to be synchronized for data exchange. These synchronization barriers (indicated with double red lines) are needed for data exchange between the two simulators (indicated with the dashed red arrows). With this less tightly coupled approach, each simulator is able to advance independently of the other. grid. Effectively, at each cell the perimeters and areas of all intersecting manholes and inlets are collected. Furthermore, we also collect the contributions of each cell to each sewer node. We are able to include cases where multiple manholes intersect the same cell, as well as cases where multiple cells contribute to the same node. For each rasterized cell, we keep track of the corresponding node head by averaging over all nodes connected to the specified cell. The sewer node exchange discharge Eqs. (14)-(17) are solved on a per-cell basis, where we account for the relative contributions, The specific sewer exchange term for the surface model is limited by the water availability in the case of sewer inflow, i.e., In time, the sewer network model is interleaved with the surface and runoff model with an a priori defined coupling timestep . At multiples of the coupling timestep, information between the runoff model and the sewer model is exchanged. The coupling time step size also represents an upper limit to the time step sizes of the individual simulators. If not stated otherwise, it is set to 1 s. As a rule of thumb, the coupling time step size should be roughly of the same order as the time step sizes of the individual solvers. A time step of the coupled model is illustrated in Fig. 1 and subdivided into the following steps: (1) Exchange sewer-surface coupling data, i.e. provide node heads and excess discharges to the surface flow simulation, and provide exchange discharges to the sewer model. Each simulator performs multiple routines at each of these steps (see Fig. 1). When coupling data need to be exchanged, the simulators are required to wait for the other simulator at synchronization barriers. In the proposed C++ framework, the synchronization barriers are implemented with the functionality offered by std::thread and boost::barrier. Synchronization barriers are set only at the beginning of a coupled simulation time step, but not in the individual simulator's advance methods. The loop for the time steps in the advance step is executed independently of the other simulator. To advance from to + , each simulator only needs the minimal amount of time steps required for its own numerical stability.
For the sewer simulation, the exchange discharges are obtained from the surface simulation and applied to the sewer network. Additional inflows from external sources, e.g. roofs, are applied and compared with the sewer network's inflow capacities. These excess discharges contribute to node overflow. SWMM routes the water flow in the sewer network. Then, the sewer network state (node depths and heads, link discharges and volumes, inflow volumes, and excess volumes) is updated. The excess volumes accumulated during one coupling time step in the sewer network simulation are provided as excess discharges to the surface runoff simulation in the next coupling time step. More precisely, we compute the excess discharges prescribed in the surface simulation at time step +1 from the excess volumes accumulated in the sewer network simulation from to +1 = + . As a consequence of the parallel execution, during the coupling time step the state of the surface variables remains fixed for the sewer simulation and vice versa.
For the simulation of the surface flow and runoff, node heads from the sewer simulation are necessary for the computation of the surface-sewer exchange discharges. Once acquired, the surface runoff simulation advances independently of the sewer simulation. The routines in the surface runoff advance step correspond to typical routines in FVMs for overland flow simulation. These routines are performed on the GPU and looped until the next coupling time step + is reached. The surface state variables (water depth and level, surface flow discharges) and the infiltration state variables are updated after timesteps according to Section 2.4. After the loop, the applied exchange discharges during the coupling time step are computed and are provided to the sewer simulation in the next coupling time step.
The coupled simulation time steps are executed until the simulation end time is reached. As the sewer simulation and the surface runoff simulation are executed in parallel, the execution time of the coupled surface-sewer model is determined by the execution time of the slower simulator and not by the sum of the execution times as it would be the case for a sequential coupled simulation. Usually, the one-dimensional sewer network simulation is faster than the two-dimensional surface flow simulation (Noh et al., 2018).

Results and discussion
We demonstrate the capabilities of the coupled model on laboratory and real-world test cases. The scenarios include a small-scale rainfall-runoff plot experiment, a rural catchment, and a full-scale urban test case including sewer coupling. More specifically, we simulate the Thiès plot experiment (Section 3.1), the HOAL catchment at Petzenkirchen, Austria (Section 3.3), and the city of Cologne, Germany (Section 3.4). Furthermore, we validate the sewer-surface coupling approach on a laboratory experiment presented in Rubinato et al. (2017) in Section 3.2.
The numerical simulations were performed on a desktop PC equipped with 10 Intel i9-9820X cores at 3.3 GHz and 128 GB RAM. The GPU utilized for the test cases was an NVIDIA Titan RTX in Tesla Compute Cluster (TCC) mode. It features 4608 CUDA cores and has 24 GB memory. In the following, the term runtime describes the cumulative execution time of the simulation measured via wall clock timing. The runtime neither includes the initialization process, such as reading input data, nor postprocessing steps, such as writing results to the disk. However, the GPU runtime includes data transfer between the GPU and the CPU during simulation.

Thiès plot experiment
We validate the model with measurements performed at in Thiès, Senegal, by Tatard et al. (2008). The experiment was carried out on a 10 × 4 m 2 plot. The plot has an average slope of 1%, and the resolution of the digital terrain model (DTM) is 0.1 m. Rainfall was simulated with a constant rate of 70 mm∕h for a duration of 1 h on the sandy soil. In the reference data set of Mügler et al. (2011), measurements of mean flow velocities are available at 62 locations across the plot (Fig. 2a). Following Simons et al. (2013), Manning's roughness coefficient was set to a constant value of 0.014 m 1∕3 ∕s throughout the entire plot. We compare the results from the first-order CN and second-order BH scheme for the steady state after 1 h. The simulated water depths show a slightly clearer depiction of the flow paths in the second-order scheme, compare Fig. 2b-c. The simulated velocities are shown as arrows in Fig. 2b-c for the CN and BH scheme.
In Fig. 3a, we compare the simulated velocities with the measured velocities. Second-order schemes are computationally more involved than first-order schemes, but are supposed to yield superior results due to the improved accuracy. The root mean square error (RMSE) of the velocities is defined by where is the total number of all observed velocities . The RMSEs of the velocities are consistently lower for the second-order scheme for all resolutions from 0.05 to 0.25 m, as is shown in Fig. 3b, and the achieved RMSE of 0.026 m∕s is in line to results in the literature (Tatard et al., 2008;Mügler et al., 2011;Simons et al., 2013;Caviedes-Voullième et al., 2020). However, we emphasize that a proper discretization of the source term is important for the simulation of runoff processes, even more so in the case of first-order accurate schemes. The superiority of the first-order CN scheme above the popular HR scheme (Audusse et al., 2004) is noticeable from the velocity errors in Fig. 3a and b. The simulated velocities of the HR scheme are consistently lower than for the CN scheme, as the HR scheme is not able to fully account for the bed slope in the case of shallow flow (Delestre et al., 2012). Switching to second-order accuracy in the HR scheme fixes this issue, albeit at the cost of a higher computational workload. The velocities of the second-order HR scheme are only slightly deviating from the BH scheme, therefore they were excluded from the plots in Fig. 3. In line with numerical theory, the RMSE decreases with grid refinement in general as shown in Fig. 3b. However, the RMSEs of the second-order scheme do not exhibit such a clear trend as the RMSEs of the first-order scheme. For the second-order scheme, convergence with regard to mesh refinement is limited for cell sizes below 0.1 m due to the resolution of the underlying DTM (0.1 m). Typically, for smooth A. Buttinger-Kreuzhuber et al. Fig. 3. Thiès. (a) Simulated velocities over measured velocities at steady-state conditions for a resolution of 0.1 m. The second-order scheme develops higher velocities than the first-order scheme. (b) Root mean square errors (RMSEs) over cell sizes. For the first-order scheme, the error decreases with resolution. For the second-order scheme the trend is not as pronounced. (c) RMSEs over GPU runtimes. For the same amount of computational time spent, the first-order CN scheme produces better results than the BH scheme. solution surfaces, e.g. in river floods, the second-order accurate scheme is expected to be more efficient when considering the accuracy versus runtime tradeoff (Horváth et al., 2020).
When the corresponding runtimes of the CN and the BH schemes are compared in Fig. 3c, the first-order CN scheme produces better results for the same amount of computational time spent. In terms of the tradeoff between computational workload and accuracy, this suggests the use of finer grids together with first-order schemes for the  surface flow in rainfall-runoff simulations. Summarizing, we conclude that fast first-order schemes, which properly resolve the source term, are sufficiently accurate for rainfall-runoff simulations.
Both the first-order CN and the second-order BH scheme are robust, as no unphysical high numerical speeds develop over time (Fig. 4). The time step is inversely proportional to the maximum numerical speed by the CFL condition, see (6). Thus, if the maximum numerical speeds are small, large time steps are possible. There is a small spike at the beginning of the wetting, but its magnitude is reasonable. In Fig. 5, we compare the effect of using single-precision against double-precision floating-point numbers for the first-order CN scheme on the CPU and the GPU. In terms of the velocity RMSEs, the differences are noticeable for a cell size of 20 cm, but they are still very small (<0.1 mm∕s). The mass balance error for the 10 cm runoff model is below 0.0001%. The rain volume is exactly 2.8 m 3 , 1.074432 m 3 are infiltrated, 1.697895 m 3 are flowing out at the open southern boundary during the simulation, and 0.027672 m 3 remain at the surface at the end of the simulation. In   terms of the runtimes, the parallel single-precision GPU implementation is more than 30 times faster than the sequential single-precision CPU implementation for a resolution of 5 cm (Fig. 5b). In this case, due to the low number of cells in the domain, that is 16000 cells, the GPU cannot fully exploit its parallel capabilities. The modest speedup in this smallscale experiment is therefore not representative of large real-world test cases.

Surface-sewer coupling
In this section, we present simulation results to verify the surfacesewer coupling against experimental data provided by Rubinato et al. (2017). The setup consists of a flume and a single manhole connected to a pipe. The surface bed is 4 m wide and 8 m long with a slope of 1 m per 1 km. At the upper end, a hydrograph with a constant discharge of 11 L∕s is specified. At the outlet, critical flow conditions are imposed. The manhole is located 2.5 m downstream of the inlet and has a diameter of 0.24 m. The invert level of the pipe is 0.478 m below the flume bed. Manning's roughness coefficient is set to 0.009 s∕m 1∕3 for both the pipe and the flume as both are PVC. In the experiment, the pipe pressure and the surface level was measured 0.34 m and 0.35 m away from the manhole (Fig. 6).
First, we tested steady state inflow from the surface into the sewer system for various prescribed surface discharges ranging from 5 L∕s to 11 L∕s. The simulation reaches a steady state after 300 s. The simulation is able to accurately reproduce the measured water depths and exchange flows (Fig. 7). With a mean absolute error (MAE) of 0.19 mm and relative differences ranging from 0.1% to 4.3%, the accuracy is excellent.
In the second test case, we simulated overflow from the sewer onto the wet flume. The surface inflow is fixed at 11 L∕s and the pipe inflow ranges from 2.2 L∕s to 7.6 L∕s. Again, the simulated water depths agree well with the measured water depths with a MAE of 0.72 mm. The relative differences range from 3% to 4.8% in this case. Unfortunately, SWMM cannot extract the pressure head at an arbitrary location along the pipe, so instead we extracted the pressure head directly at the manhole. The pressure head at the node is supposed to be lower than Fig. 8. Sewer to surface coupling, i.e. water is flowing from the sewer onto the surface. Simulated (red) and measured (violet) water depths (a) and pressure heads (b) over the exchange discharges. The water depths show good agreement. The simulated pressure head is extracted directly at the node and not 0.35 m away from the node as in the experiment thus resulting in lower simulated pipe pressure heads.
in the pipe, as energy is dissipated in the transition from the pipe into the manhole. This discrepancy is visible in Fig. 8b.
Overall, the coupled simulation correctly exchanges flows from the surface to the sewer system and vice versa as simulated and observed values agree. The small differences are comparable to results in the literature (Rubinato et al., 2017;Fernández-Pato and García-Navarro, 2018).

HOAL Petzenkirchen
This scenario analyzes a rainfall event in June 2013 in the Hydrological Open Air Laboratory (HOAL) catchment in Petzenkirchen, Lower Austria. The HOAL catchment is used to test hydrological hypotheses under natural conditions. The catchment is 0.66 km 2 in size and is mainly covered by arable land (87%) and grassland (10%) . A high resolution (0.5 m) DTM of 2012 was used. The topographic elevation of the catchment ranges from 255 to 325 m.a.s.l. Manning's roughness coefficient is set to 0.1 s∕m 1∕3 for the riparian forest, to 0.05 s∕m 1∕3 for grassland and arable land, and to 0.03 s∕m 1∕3 everywhere else.
The saturated conductivities range from 1 to 32 mm∕h and are set according to literature values (Rawls et al., 1983;Carsel and Parrish, 1988;Smith et al., 2002) and measured values (Picciafuoco et al., 2019). Streets and the river bed are assumed to be impermeable. We specify an interception storage capacity of 5 mm for the riparian forests around the outflow (colored in light green in Fig. 9a) and of 2 mm for the arable land and grassland.
The investigated heavy rain event starts on June 23 at 21:00 and is simulated for 1.5 days. There are two distinct blocks of intense rainfall. The first occurs after 1 h, and the second after 28 h. The main flow paths of the surface runoff for the second rain block are clearly visible (Fig. 9a).
A. Buttinger-Kreuzhuber et al. For the 1 m simulation, 0.75 million cells are wet at the peak of the second rain block, which corresponds to the simulated region of 0.75 km 2 (including a small buffer zone around the catchment). Regarding the runtimes of the sequential CPU and the parallel GPU implementation, speedups of three orders of magnitude are achieved ( Table 1). The speedup increases with higher resolutions and higher workloads as the GPU is not fully utilized for a low resolution. For a fully occupied GPU, doubling the resolution causes a theoretical increase of the amount of work by eight times as the number of cells quadruples and due to the CFL condition twice the number of time steps are required. For the 1 m simulation, 90% of the computation time on the GPU is spent in the following routines: the reconstruction and flux computation (32%), the time integration and time step reduction (30%), and the computation and integration of the runoff (28%). In the latter routines, it is not the amount of floating-point operations but the memory transfers that prevent the GPU from achieving faster runtimes. On the CPU, the distribution is slightly different with 60%, 11%, and 9%, respectively, due to faster memory access rates.
One drawback of the GPU implementation is the comparably longer development time. Parallel CPU implementations are possible (Neal et al., 2010;Noh et al., 2018;Morales-Hernández et al., 2021), but even if the implementation achieves full parallelization speedups, over thousand CPU cores would be needed to match the computational advantage of the GPU. From an economical and ecological perspective the GPU simulation still performs better with regards to power consumption than a parallel CPU simulation running on a supercomputer. The fast GPU simulation opens up new possibilities for this small catchment, such as calibration tasks within reasonable time spans.

Urban flooding in Cologne
We study two urban scenarios in the city of Cologne, Germany. First, we present a dual-drainage model at the central part of the city, at the eastern bank of the Rhine river, to which we refer as Cologne Center-East. Second, we present a city-scale simulation encompassing the entire city of Cologne with an area of 23.73 × 27.14 km 2 without sewer coupling.
In the first scenario, the region simulated with the coupled model lies at the eastern bank of the Rhine river and covers an area of 5.41 × 9.86 km 2 . The terrain model is obtained from light detection and ranging (LIDAR) data where solid urban features such as buildings or bridges were removed in a pre-processing step, so that the DTM represents a so-called bare earth DTM. The resolution of the DTM is 1 m with a typical vertical accuracy around a decimeter (Kraus, 2011;Dottori et al., 2013). The terrain is relatively flat, mostly ranging between 40 and 60 m.a.s.l. (Fig. 10a). The simulation domain exhibits modest average slopes of 0.3 m∕km along the Rhine river from south to north, and of around 1 m∕km from east to west.
Buildings and land use data are extracted from the official ALKIS data set of 2021 (Caffier et al., 2017). Buildings cover 13% of the area, they are impermeable for the surface flow and water from roofs is routed to sewer nodes in the coupled model, thus building cells remain dry during simulation. Roughness coefficients are mapped from the land use, a detailed overview of the spatial distribution is shown in Fig. 10b. The interception parameters are assumed to correlate with land use. Woods and gardens are assigned a storage capacity of 5 mm, for public recreational areas and residential areas it is set to 2 mm. Rivers, streets and parking lots are assumed to not retain rain, thus their storage capacity is set to zero. The infiltration parameters are derived from a soil map, compare the saturated hydraulic conductivity in Fig. 10c. Streets and squares as well as rivers and lakes are considered impermeable.
We preprocessed sewer data for SWMM for the eastern bank of the Rhine river, therefore, cells west of the river are excluded from the simulation. The active simulation region is thus restricted to 39 km 2 . The sewer network consists of 6392 junction nodes and 7206 conduits linking them with a total length of 245 km. Moreover there are 16 pumps, 16 outfalls and 18 weirs in the simulation domain, which are included in the model. The sewer network and the invalidated region are shown in Fig. 11a in yellow and pink, respectively. Exchange between the sewer network and the surface is assumed to occur at the nodes with the parameters specified in Section 2.7. Each node has a maximum inflow capacity of 0.1 m 3 ∕s. Rain that falls on buildings is directly routed to an assigned sewer junction node, if the roof water discharge exceeds the capacity, it spills over at the nodes. We simulate a hypothetical uniform one-hour rainfall of 53 mm∕h corresponding to approximately a 100-year event according to the KOSTRA 2010R data set (Junghänel et al., 2017). With a cell size of 1 × 1 m 2 , the grid has nearly 40 million cells valid for simulation. The water depths are aggregated in time during the first-order 1 m simulation resulting in a maximum water depth for each cell at the end of the simulation (Fig. 11b).
To illustrate the effects of the sewers, the resolution, and the order of accuracy of the surface flow scheme, we focus on the region marked with a red frame in Fig. 11b. The water depths of the coupled simulation for the specified region are aggregated in time resulting in maximum water depths (Fig. 12a). This heightfield serves as the reference for the difference heightfields, where positive values indicate higher maximum water depths in the reference simulation than in the corresponding alternative simulations. The differences between the simulation with sewer coupling and the one without are spatially restricted to the vicinity of the sewer network (Fig. 12b). For the   simulation without sewer coupling, we assume that water falling onto roofs can be drained by the sewer network. In regions with positive values (red), the maximum water depths of the coupled simulation  Fig. 12a) are higher than the maximum water depths of the simulation without sewer coupling. In terms of the MAE, the differences amount to 7.37 mm. The mean signed error (MSE), where the results of the runoff simulation without sewers are subtracted from those of the coupled simulation, amounts to −6.60 mm. This indicates that surface water levels do not rise as high in the coupled simulation due to sewer drainage. In fact, more water is drained from the streets than what is spilling onto the streets as excess roof water, which exceeds node inflow capacities. The sewer simulation also induces a water redistribution and causes minor floodings at a few streets due to sewer overflows. Overall, the sewer simulation drains around 250000 m 3 of surface water. In Fig. 12c, the difference field resulting from the subtraction of the maximum water depths computed by a 4 m simulation from the 1 m simulation is displayed. In regions with negative values (blue), the maximum water depths of the 4 m simulation are higher than the corresponding maximum water depths of the 1 m simulation. The difference between the maximum water depths of the 4 m and the 1 m grids are spatially concentrated at certain locations and appear mostly where the DTM shows strong variations at the scale of the employed cell size. For example, major differences occur in the vicinity of underpasses and garage entrances, or at the edge of elevated plateaus. At the edges of buildings, differences appear as boundary cells are rasterized as wall cells in one grid but not in the other. Overall, the MAE between the 4 m and the 1 m simulations is 8.12 mm. For the MAE computation the results of the 1 m grid are downsampled onto the 4 m grid.
In Fig. 12d, we compare the first-order accurate CN and the secondorder accurate BH scheme. In general, the differences in the maximum water depths between the two surface flow discretizations are marginal. Noticeable deviations are concentrated on a few spots and occur where relatively high velocities up to 1 m∕s develop. This happens for example at sloped entrances to inner courtyards. In the entire simulation domain, the MAE between the two schemes is 1.2 mm and the MSE amounts to −0.1 mm. As we consider differences between the maximum water depths occurring during the event, a non-nil MSE does not indicate a volume error but rather indicates that the surface water travels a greater distance. In fact, the negative MSE reveals that maximum water depths are slightly higher in the second-order BH scheme. In this comparison, the errors are considerably smaller than in the previous comparisons. The relatively small differences have to be considered in light of the mild terrain slopes.
To verify the validity of the proposed parallelized coupling approach, we perform a comparison of the proposed coupling method with a tightly coupled simulation as in Leandro and Martins (2016). In the tightly coupled simulation, the individual solvers execute one time step after each other, so that each solvers advances exactly one time step with a fixed time step size of 0.1 s. The continuity errors in the tightly coupled simulation amount to 1047.9 m 3 due to flow routing errors (942.3 m 3 ) in SWMM. The overall error can be considered acceptable when compared to the total sewer inflow volume (426802.8 m 3 ). Furthermore, we compare the maximum water depths for a cell size of 4 m. Across the entire simulation domain the MSE is 0.32 mm and the MAE is 0.38 mm. Differences between the maximum water depths of the proposed model and the tightly coupled model are predominantly located around overflowing sewer nodes, see Fig. 13b,   which shows the region marked with a red frame in Fig. 11b. Still, the MAE between the two coupling approaches is small. It is lower than the MAE between first-and second-order scheme, and considerably lower than the difference between the simulations on the 1 m and the 4 m grid.
We investigate the effect of the coupling time step size on the accuracy of the simulation results by considering continuity errors in the water volumes as well as differences in the maximum water depths with respect to the tightly coupled simulation, see Table 2. The overall inlet volume amounts to 1935514.8 m 3 . The water volume on the surface and in the sewers ranges from 698052.3 m 3 to 762353.6 m 3 and the outlet volumes from 1238814.4 m 3 to 1243885.8 m 3 for the different coupling time step sizes. The mass balance errors are dominated by the flow routing error of SWMM ( Table 2). Overall, the total mass balance errors is below 1 per mille for coupling time step sizes under 2 s, and around 1 per mille for 4 s and 8 s. The runoff simulation's time step varies between 0.18 s and 0.32 s for the 4 m grid. During the coupling time step (1 s) around 4 steps of the surface runoff simulation and around 10 sewer routing steps are performed in average. A coupling time step size of 1 s seems to offer a good compromise between computational performance and accuracy. To thoroughly assess the quality of the predictive capabilities of the coupled model, further validation is required. The validation of coupled models on large-scale scenarios is challenging as in most cases the collected data is sparse. Possible strategies to tackle this problem are the collection of crowd-sourced data (Yu et al., 2016;Wang et al., 2018;Xing et al., 2018), imagery from unmanned aerial vehicle sensing (Perks et al., 2016) or insurance claims (Zischg et al., 2018).
A comparison of runtimes of the coupled simulator for resolutions of 1 m, 2 m, and 4 m, in Table 3 shows that the GPU-accelerated 2D surface flow simulation is faster than the sequential 1D CPU sewer simulation for low resolutions. This emphasizes once more the massive gain in computing power for the surface flow simulation due to the GPU acceleration. Usually, solving the continuity and momentum equations for the sewer flow only requires around 0.1% of the total coupled simulation runtime for a sequential implementation (Noh et al., 2018). In the proposed implementation, as the simulations advance in parallel, the runtime per coupling time step is determined by the slower coupling component, which is either the GPU runoff simulation or the CPU sewer simulation. Thus, in order to improve the model performance further, effective parallelization strategies (Burger et al., 2014) of the sewer module are necessary.
For the derivation of pluvial flood hazard maps, we perform benchmark tests regarding large simulation domains with a size of 23.73 × 27.14 km 2 spanning the entire city of Cologne. We simulate uniform rainfall of 53 mm∕h that lasts for 1 h. In order to account for surface flow routing after the rainfall ends, the total simulated duration is extended to 2 h. In this city-scale scenario shown in Fig. 14a, infiltration and interception are considered and are set up analogously to the previously studied smaller domain. The 1.5 m simulation grid has a total number of 220 million cells valid for simulation, of which 17 million cells are rasterized buildings. In this scenario, we do not explicitly simulate the sewer network of the entire city, but we assume that water falling onto roofs can be drained by the sewer network, therefore building cells are excluded from simulation. All other input parameters remain unchanged. In Fig. 14b, we display the maximum water depths that occur during the simulated event. The computational speed is faster than physical time, with a total runtime of 1.62 h for a total simulated duration of 2 h. The simulation uses up to 23.4 GB of memory on the NVIDIA Titan RTX GPU, close to its limit of 24 GB.
In previous studies, urban regions of 40 km 2 were modeled with an efficient hybrid parallelization strategy on an adaptive grid with a minimum cell size of 1 m (Noh et al., 2018). Their model uses runoff coefficients depending on land use instead of a dynamic infiltration model as in this work, but dynamic sewer coupling is also integrated. In Xing et al. (2018), the city of Fuzhou, China, is modeled with a resolution of 2 m and a constant drainage loss at streets neglecting bidirectional sewer interaction and water routing in sewers. The simulation involves 66 million cells and runs almost in realtime on a rack of 8 NVIDIA Tesla K80 GPUs. The proposed computational model is efficient in terms of memory consumption and performance, it is able to simulate over 200 million cells on a single Titan RTX GPU with 24 GB memory, processing almost 10 million cells per GB of GPU memory. It is useful to extend the capabilities of single-GPU implementations to large regions as they typically run faster than comparable multi-GPU implementations requiring inter-GPU communication, which introduces an additional bottleneck (Morales-Hernández et al., 2021). In order to support larger simulation regions and higher spatial resolutions, an indispensable extension to multiple GPUs is possible.

Conclusions and perspectives
In this study, we present an integrated modeling framework for the simulation of rainfall-runoff processes and urban flash floods. The introduced modeling framework accounts for all major processes needed for an accurate description of flash floods while still accomplishing very fast runtimes. Infiltration is modeled with the Green-Ampt equations in a fully dynamic and spatially distributed way. An interception module accounts for initial rain abstractions due to vegetation. Instead of applying simple surface flow approximations, we discretize the full 2D shallow water equations (SWEs). In the context of rainfall-runoff simulations, the first-order accurate CN scheme (Chen and Noelle, 2017) was shown to be able to reproduce velocities and discharges accurately. In particular, using higher resolutions in the first-order scheme proved to be more beneficial than using the second-order accurate BH scheme (Buttinger-Kreuzhuber et al., 2019) when considering the tradeoff between accuracy and required computational work. We remark that an appropriate bed source term discretization of the surface flow is essential for providing correct velocity estimates.
For urban flash floods, the rainfall-runoff simulation is coupled with the sewer network simulation from the Storm Water Management Model (SWMM). An effective approach for the bidirectional coupling of the sewer simulation to the surface runoff simulation is developed, where the two simulators advance in parallel in each coupling timestep. The runoff simulation is validated in the Thiès irrigation experiment and the surface-sewer coupling is validated in a laboratory experiment, both showing good agreement between simulations and observations. Regarding the validation of the integrated model, we point out that more detailed spatial observations are needed in order to assess the model's predictive performance in a more exhaustive way.
We demonstrate the benefits of using graphics processing units (GPUs) as computational devices to speed up the rainfall-runoff simulations. The GPU-accelerated model enables high-resolution simulations for entire cities with simulation domains involving up to 225 million cells on a single GPU with 24 GB of memory. In other words, we enable simulation of regions up to 220 km 2 with a resolution of 1 m in realtime. This removes the need for multiple localized, small-scale simulations. Speed-ups of up to three orders of magnitude are achieved for simulated regions with around 10 million cells, if compared against a serial CPU implementation. The speedup increases with the number of cells, thus large simulation regions profit even more from GPU acceleration.
The efficiency of the approach opens up new possibilities regarding ensemble simulations and high-resolution environmental modeling. The coupling of the surface flow with the spatially distributed infiltration and interception component allows the direct inclusion of green infrastructure by varying parameters accordingly. Detailed results help raise public awareness for flash floods by providing straightforward impact analysis at the scale of individual buildings and enable the analysis of efforts to mitigate the effects of climate change in rural and urban settings.

Software availability
The presented modeling framework Visdom (Waser et al., 2011) is developed in C++. The GPU implementation is written in the CUDA language. The Visdom framework is closed source. Licensing information as well as information on cooperation in a joint research project with Visdom is available on request from VRVis (www.vrvis.at).

Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability
Data will be made available on request.