Simultaneous flood risk analysis and its future change among all the 109 class-A river basins in Japan using a large ensemble climate simulation database d4PDF

This study investigated simultaneous flood risk among all the 109 class-A river basins over Japan using the big data of (over 1000 years) annual maximum hourly flow simulated from a large ensemble climate simulation database for policy decision making for future climate change, and proposed a novel approach in its geospatial analysis by applying two informatics techniques: the association rule analysis and graph theory. Frequency analysis of the number of rivers with the annual maximum flow over the flow capacity in the same year (defined as simultaneous flooding here) indicated that simultaneous flood risk will increase in the future climate under 4-degree rise scenarios in Japan, whose increment is larger than the variation of sea surface temperature projections. As the result, the return period of simultaneous flooding in eight river basins (the same number as in a severe storm in western Japan, 2018, causing the second worst economic damage since 1962) is estimated at 400 years in the historical experiment, 25 years in the 4-degree rise experiment. The association rule and graph theory analyses for the big data of annual maximum flows in the future climate scenarios indicated that simultaneous flood occurrence is dominated by spatial distance at a national scale as well as by the spatial relation between mountainous ridges and typhoon courses at a regional scale. Large ensemble climate simulation data combined with the informatics technology is a powerful approach to simultaneous flood risk analysis.


Introduction
Flood is one of the most hazardous natural disasters around the world and its economic impact keeps increasing in the last decades (Swiss Re Institute 2018). In particular, Japan has intensively experienced severe floods almost every year over this decade: the worst and second worst flood damage were caused by Typhoon Hagibis in 2019 (1.86 trillion Japanese yen) and a heavy rainfall event over western Japan in 2018 (1.35 trillion Japanese yen), respectively (MLIT 2019a(MLIT , 2019b. These extreme flood damages were caused by river overflow and dike breach in multiple river basins (six and eight river basins among 109 class-A rivers, respectively). Such simultaneous flood occurrence results in devastating disaster through the supply chain and thus massive insurance payment as well as disaster recovery expenditure. In fact, the accumulation risk (Bull-Kamanga et al 2003) is of great attention in the insurance industry. Climate change impact on flood disaster risk has been intensively investigated in the literature at national to regional scales (e.g. Hirabayashi et al 2013, Arnell andGosling 2016); nevertheless, the impact on the coincidence of flood in multiple river basins has rarely been discussed in the literature. Rather, more attention is paid to compound risk of river flooding with other types of disasters such as storm surge (Wahl et al 2015, Saleh et al 2017 and/or wildfires/heatwaves (Zscheischler et al 2018, Moftakhari andAghaKouchak 2019).
Spatial dependence of flood risk has been evaluated in the context of hydrological frequency analysis (Keef et al 2009, Chen et al 2012, Neel et al 2013, Geertsema et al 2018. Keef et al (2009) analyzed spatial dependence of flood peaks and precipitation totals over the whole UK. Neal et al (2013) estimated the joint probability of flood peaks among the River Eden and its two tributaries. They successfully estimated joint probabilities of multi-site flood peaks and the resulting inundation depths at various return periods with the help of copula functions. Geertsema et al (2018) presented another approach, dynamic time warping as an efficient tool in time series analysis that detects analogy in two different flood waves, and clarified that rapid drainage in a lowland tributary reduces the likelihood of coincident discharge peak with the main stream. These statistical techniques are essential to discuss simultaneous flooding because the observation period is far shorter than the return periods of such events. However, it has been a challenge to apply these statistical approaches to inter-catchment and/or national-scale simultaneous flood analysis due to high dimension of their joint probability distribution.
Recently, some studies expanded flood spatial dependence to continental scales by accumulating all the available observation data over the countries. Quinn et al (2019) generated many synthetic flood events by constructing spatial flood dependence structures with conditional multivariate extreme value analysis. They were then converted into a flood risk curve through depth-damage functions. Diederen et al (2019) constructed a statistical model of spatially coherent discharge peaks from block maxima data and produced pan-European flood events. Using these state-of-the-art statistical models for increased number of gauges is a major way to do a large-scale flood risk assessment (Brunner et al 2020, Winter et al 2020. There will be, however, still challenges for extending this analysis to assess climate change impact on large-scale flooding, which would be addressed in the near future.
As an alternative approach, this study explores the use of large ensemble climate simulations with global climate models. In the last few years, to detect the climate change signals in extreme weather events such as heavy rainfall and/or storm surge, a large ensemble climate simulation database has been developed by the meteorological research community such as Half A degree additional warming, Prognosis and Projected Impacts (Mitchell et al 2017), database for(4) policy decision making for future climate change (d4PDF) , and the ClimEx project (Leduc et al 2019). They realized a tremendous number of climate simulations and as a result, produced thousands of climate simulation data for the present and future climate conditions. They are considered long enough to estimate low-frequency phenomena including river overflow at the return period of several hundred years. Recent studies have widely applied these data to impact assessments on river overflow, storm surge, and/or wind disasters (Doll et al 2018, Faye et al 2018, Lavender et al 2018, Yang et al 2018, Mori et al 2019. These large ensemble climate simulation datasets also have potential of giving coincident flood frequencies over a wide range in a more straightforward manner. This research aims to estimate future changes of simultaneous flood occurrence probability and its geospatial characteristics among all the 109 class-A river basins in Japan (see figure 1) using the annual maxima of hourly river discharge simulated with a rainfall-runoff model from d4PDF. The climate change signal of simultaneous flood risk is detected by comparing the probability distribution of the number of flooded river basins per year between the past and 4-degree rise experiments of d4PDF.
Furthermore, we also analyze spatial dependence in flood occurrence among all the basins. This can be challenging because d4PDF provides 3000-year and 5400-year annual maximum flow data in the 109 river basins, which makes a tremendous number of combinations to examine. We overcome this big data issue by employing association rule analysis (ARA) and a graph theory that efficiently identify a group of river basins which are likely to be flooded in the same year. Finally, two examples of regional community extraction are demonstrated in the Kyushu and Hokkaido regions to examine the geospatial characteristics of simultaneous flood occurrence at a regional scale. Note that the community extraction approach is presented only for the 4-degree rise experiment because enough simultaneous flood events did not occur in the past experiment.

Rainfall-runoff model 1K-DHM
We employ a rainfall-runoff model 1K-DHM (Tanaka and Tachikawa 2015), a distributed rainfall runoff model based on a kinematic wave flow approximation, constructed at all the 109 class-A rivers basins in Japan in Kobayashi et al (2020). 1K-DHM calculates the slope runoff and river routing at each cell at around 1 km (30 s) resolution over a two-dimensional domain. Slope runoff is simulated with the continuity and momentum equations of the kinematic wave theory, considering saturated and unsaturated subsurface flows and surface flow with the following momentum equation and discharge-storage relation (Tanaka and Tachikawa 2015) ∂h ∂t where, q is the slope runoff discharge per unit width; h is the water depth; d a and d c are the water depth corresponding to the maximum water content of saturated and unsaturated soil layers, respectively; v a and v c are the water velocity of saturated and unsaturated soil layers, respectively; α = n/ √ sinθ; n is the roughness coefficient; and θ is the slope gradient. River flow is simulated with a kinematic wave model with the following momentum equation and dischargestorage relation where, A is cross-sectional area of a river, Q is discharge. q L is lateral inflow per unit width and given by calculated slope runoff discharge per unit. α c is parameter and defined α c = (1/B c ) 2/3 √ i c /n c where i c river gradient, n c is roughness coefficient of river, B c is river width. Figure 1 shows the concept of 1k-DHM. A kinematic wave flow approximation with thin subsurface flow components well represents flash flood characteristics over steep mountainous terrain in a temperate climate, which are widely seen in Japan river catchments (Shakti 2017).
Dams whose catchment area is larger than 5% basin area are modeled at each basin. Dam operation is assumed to be constant release operation, by which outflow of dam is calculated with the following equation where, I (t), S (t), Q (t) is inflow, storage, outflow of a dam at time t. Initial storage Q (0) is assumed to be capacity for water utilization in flooding period, S F .
The model parameters are identified in each basin to maximize the Nash-Sutcliffe coefficient using the SCE-UA algorithm (Duan et al 1994) as an optimization method. This study examines simultaneous occurrence probability of extreme flood peaks at the standard gauging station of each river system (located near by the major city within its catchment); thus, the parameters are calibrated to the hydrograph of the station during the largest flood event between 1981 and 2017 (when enough observation data was available in most river systems). If the observed rainfall/discharge data has problems (most data are missing, runoff ratio exceeds 1.0, …etc), the second largest one was targeted. Calibrated parameters are summarized in supplementary table S1 (available online at stacks.iop.org/ERL/16/074059/mmedia). As the result, for the calibrated event, the model showed the Nash-Sutcliffe efficiency (NSE) coefficient ranging from 0.730 to 0.997 (0.923 on average) and peak discharge error within ±20% among all the basins. The calibrated model was validated for the second largest event (the third largest one if the second largest event was used for calibration) in each river basin. As the result, the NSE values ranged from 0.523 to 0.988 (0.816 on average), and peak error ratios were within ±30% among all the basins. Hydrographs of the calibration and validation events at one river system from each region, i.e. Hokkaido, Tohoku, Kanto, Koshinetsu, Shikoku and Kyushu (see figure 1), were shown in supplementary figures S2 and S3, respectively.

Large ensemble climate simulation data d4PDF
d4PDF is a huge ensemble climate simulation database produced using MRI-AGCM 3.2 s at 60 km resolution, which was downscaled to 20 km with the non-hydrostatic regional climate model (NHRCM) (Sasaki 2008) over Japan. This study used this 20 km regional experiment. The downscaled d4PDF consists of 50 ensembles of climate simulations under the observed boundary conditions with observable small perturbations in sea surface temperature (SST) and sea ice from 1951 to 2010 (historical experiment, hereinafter HPB), and six ensembles of different future SST projections corresponding to around 4-degree rise in global mean temperature under 15 boundary condition ensembles (900 ensembles in total), in which the NHRCM was run for 60 years (4-degree rise experiment, hereinafter HFB) . Therefore, HPB and HFB have 3000-and 5400-year data, respectively. This dynamically downscaled dataset resolved typhoons, which is the major driver of flooding in Japan, and well reproduced their genesis frequency with the spatial distribution (Yoshida et al 2017) and the central pressure to some degree (Nakajo et al 2020) while it still has bias from observation with similar to other climate models. The model bias in d4PDF was corrected by Kobayashi et al (2020) in terms of rainfall intensity (direct input to the rainfallrunoff model 1K-DHM) so that annual maximum basin rainfall reproduces observation in all the class-A river basins (the target rivers in this study) as follows. Kobayashi et al (2020) corrected the bias of rainfall intensity during a storm event causing the annual maximum basin rainfall in each year (hereinafter, the annual maximum storm) in all the class-A rivers using the radar AMeDAS rainfall (RAR) as reference. Bias correction ratio was calculated as the difference ratio of D-hour rainfall between HPB of d4PDF (D is the design rainfall duration of each river ranging between 6 and 72 h) and RAR at the same quantile (Piani et al 2010) and applied to rainfall intensity of the original 20 km grid rainfall during D-hour. The bias-corrected rainfall was converted into river discharge using the aforementioned rainfall-runoff model . This study utilizes the simulated discharge data for the following simultaneous flood risk analysis.

Definition of occurrences of flooding in each river basins and simultaneous flooding
Performing 3000-and 5400-year continuous simulation for both the past (HPB) and 4-degree rise (HFB) experiments consumes huge time and computer resources for its completion. To avoid this, we focus on annual maximum storms in each river basin, i.e. simulate 3000 and 5400 flood events for all the target river basins in total. Then, we define, in a river basin of interest, flooding as a phenomenon that the flood peak discharge of an annual maximum storm exceeds the one corresponding to the design return period (100, 150 and 200 years depending on each river basin, hereinafter, flow capacity). This procedure implies that we assume the flood occurrence at most once per year per one river basin. Simultaneous flooding is, accordingly, defined as a phenomenon that annual maximum river discharge exceeds the flow capacity in multiple river basins in the same year.
Regulating simultaneous flooding in the same year (between January and December) may miss simultaneous floods crossing the year. Although it is well known that Japanese river basins under the Asian Monsoon climate receive extreme precipitation from typhoons and frontal rain in the middle of the year, the frequency of annual maximum storms (years) for each month in d4PDF was examined. Among all the target river basins, around 60%-68% to 98% of events (years) occurred between June and October in HPB and HFB, respectively (see supplementary figure S4). It became 88%-100% when it comes to floods larger than the 100-year event (minimum level of the design return period) (see supplementary figure S5), justifying the treatment of every year as individual isolated time units both in the present and future climates.

Annual flood occurrence probability at a single river basin
The estimated annual probability of flood occurrence in HFB at all the 109 class-A river basins is shown in figure 2. The probability in HFB is much higher than the design level of ranging from 0.005 to 0.01 (corresponding to 100-to 200-year return periods) in the target river basins. Increase of the occurrence probability is higher in the northern (Hokkaido and Tohoku) regions where design flood is smaller corresponding to smaller flood discharge in historical records and  the southern (Kyushu and Shikoku) regions where severe typhoons approach frequently (see figure 1 for locations). The one-sided binominal distribution test showed that the change is statistically significant at 1% significance level.

Probability of simultaneous flood occurrence
The histograms of the number of inundated river basins in the same year in HPB and HFB are shown in figure 3. HFB shows much higher frequencies of simultaneous flood occurrence among more than three river basins. Although there are some variations in frequencies among the SST ensembles of HFB (between the two black dots), its length is far shorter than the future increase, indicating that future changes of simultaneous flood occurrence probability is far larger than the uncertainty of future SST projections. The tail of the distributions become long; consequently, there are some extreme events with around 30 basins inundated in the same year. In the top five years with the largest number of inundated basins for both the experiment, heavy rainfall is caused by a typhoon.
The exceedance probability of the number of flooded river basins among the 41 basins in western Japan is shown in figure 4. The black line in the  The exceedance probability of the number of flooded river basins among the Ara, Yodo, and Shonai River basins. As the probability is much smaller than all Japan and the western Japan cases, it was estimated using all the SST ensemble members (5400 years in total). figure indicates the number in the rainfall event in 2018 (eight rivers were inundated), whose exceedance probability in HPB is around 0.002 corresponding to about 500-year return period, whereas one in HFB is around 0.035 corresponding to 28-year return period. For the Tohoku, Kanto and Koshinetsu Regions where six river basins were inundated by Typhoon Hagibis in 2019, the similar analysis (without the SST ensemble range) was performed by Tanaka et al (2020) showing that the exceedance probability in HPB is around 0.0027 (370-year return period) while that in HFB is around 0.05 (20-year return period). Both results show that a large-scale flooding at hundreds of return period will occur around once in a few decades in the 4-degree rise climate.
The same plot for the Ara, Yodo, and Shonai River basins having the mega cities: Tokyo, Osaka, and Nagoya, respectively, is plotted in figure 5. Note that this probability was estimated using all the SST ensemble members (5400-year data) because simultaneous flood occurrence among the three river basins rarely occurs; thus, uncertainty range is not shown in this case. In the historical experiment HPB, simultaneous flood did not occur among them. HFB shows that two river basins are inundated with probability of 0.0011 (around 910-year return period) and all the basins are inundated in only one year, in which there are two major flood events: one caused flood in the Yodo and Shonai River basins; the other did in the Ara River basin. This is simply because the Yodo and Shonai Rivers are closer to each other than the Ara River. In Japan, simultaneous flooding in multiple mega cities is found to rarely occur in the present climate while it might happen more often than once in 1000 years under the 4-degree rise climate. Note, however, that this estimation is under high uncertainty due to the absence of SST ensemble range.
Return periods of simultaneous flood events have been rarely discussed due to the lack of observation data or climate simulation ensembles. The presented analysis demonstrated the usefulness of large ensemble climate simulation data for the frequency analysis of simultaneous flood occurrence, i.e. the accumulation risk of floods, which is different information from large-scale flood risk map in a literature (Alfieri et al 2014, Dottori et al 2016 4. Geospatial analysis of simultaneous flood occurrence 4.1. ARA As seen in a literature, simultaneous flood risk has been analyzed by multivariate analysis of flood peak data among multiple gauges (Keef et al 2009, Diederen et al 2019. As Quinn et al (2019) pointed out, statistical flood generators need to consider that dependence structure is different between regular and extreme floods. To focus on flood exceeding the design level, this study deals with the flood occurrence data, i.e. whether annual maximum peaks exceed the flow corresponding to the design return period or not. As such data is not a continuous but discrete variable (flooding occurs or not), typical correlation analysis cannot be applied (Changhai and Shenping 2019). Furthermore, it is unrealistic to examine dependence between all the combinations from 109 river basins (only a pair of two basins makes 5886 patterns) in each of which 5400 flood occurrence data are included in the 4-degree rise experiment. Dependence structure in such discrete, big data has been analyzed in the field of market analysis using the ARA. This study first applies the ARA to geospatial analysis of simultaneous flood occurrence.
The ARA is a method to identify interesting relations between two sets of variables from big data and first proposed and applied in the field of market basket analysis to find items bought together frequently by Agrawl et al (1993). In this analysis, problems are formulated as follows: let I = {i 1 , i 2 , . . . , i m } be a set of items and D be a set of transactions where each transaction T is a set of items (T ∈ I). Then, the combination of two transactions X and Y such that ∅ ̸ = X, Y ⊂ I and X ∩ Y ̸ = ∅ is defined as an association rule X ⇒ Y, where X is the antecedent; Y is the consequent. An association rule is featured as a higher value of indices: 'support' , 'confidence' , and 'lift value' . Support sup , which means the frequency of transaction(s) T = {X, Y} in all the transactions |D|. The confidence conf (X ⇒ Y) is defined as: conf (X ⇒ Y) = P (X ∩ Y) /P (X) = P (Y|X) , indicating the conditional probability of Y on X.
In this study, a set of items I consists of all the river basins {RB 1 , RB 2 , . . . , RB 109 } and {0} (no occurrence of flooding); a transaction T is a set of river basins flooded (T = {0} when no river is flooded) in a certain year. In this analysis, the 4-degree rise experiment data (5400-year simulations) was used as it showed larger number of flood occurrences; thus, 5400 transactions were obtained. This study deals with flood occurrences between two river basins in each transaction. For example, between river basins RB 2 and RB 5 (denote that X = {RB 2 } and Y = {RB 5 }, let RB 2 and RB 5 be flooded in 108 and 270 events (year) in 5400 years and both are flooded in 54 years. Then, sup (X ⇒ Y) = P (X ∩ Y) = 54/5400 = 1/100 and conf (X ⇒ Y) = P (X ∩ Y) /P (X) = 54/108 = 1/2. Conditional probability is used to avoid from extracting rules with week relations but high support due to the high frequency of transaction X. On the other hand, only the confidence may extract trivial rules such that the transaction X rarely occurs. In the above example, sup (X ⇒ Y) is 1/100, meaning that both RB 2 and RB 5 are flooded once in 100 years, and conf (X ⇒ Y) = 1/2, indicating that RB 5 is flooded once in two times when RB 2 is flooded. If both sup (X ⇒ Y) and conf (X ⇒ Y) are high, flooding in RB 2 and RB 5 is strongly related and not rare. ARA aims to extract interesting but non-trivial rules by selecting those with both sup (X ⇒ Y) and conf (X ⇒ Y) over the threshold (see the values in this study in section 4.3). This selection is efficiently implemented with the Apriori algorithm (Agrawal et al 1994) in R package 'arules' . Finally, an additional criterion called 'lift value' is used to cut out rules extracted due to frequent occurrence of Y. After all, interesting and meaningful rules are identified as the high support and confidence above the thresholds and the high lift value.

Community extraction by graph theory
A rule (transaction) extracted as support and confidence above threshold is expressed as a link and the connected two river basins are expressed as nodes, making undirected graph (see figure 6(a) as an example). A community of simultaneous flood occurrence, composed of elements strongly connected with each other (i.e. high support, confidence, and the lift value), is then extracted from the constructed graph based on the intermediation centrality C v defined as: where e is a link targeted; σ st is the total number of the shortest paths between nodes s and t; σ St (e) is the paths with the link e. High intermediation centrality indicates that a link between nodes s and t works as a hub connecting two individual communities. Such communities are identified by cutting a link with the largest C v (see figure 6(b) as an example). This study applies the GN algorithm (Girvan and Newman 2002) which repeats the above process until the network is split into strongly connected communities using the modularity as criterion. Modularity is a measure of connectivity in communities and high when links connecting nodes within a community are dense but those between communities are sparse, i.e. nodes (river basins in this study) are divided into strongly related groups (see Newman and Girvan 2004). In figure 7(b), the modularity is the highest when dividing into two communities {1, 2, 3, 4} and {5, 6, 7, 8}. All these algorithms are implemented in R package 'igraph' .

Geospatial characteristics of simultaneous flood occurrence
We applied the ARA to simultaneous flood occurrence among all the 109 river basins and extracted interesting rules by the Apriori algorithm. The threshold values depend on the problem. This study Figure 6. An example of community extraction: (a) the numbers indicate the intermediation centrality defined as equation (7); (b) two communities are extracted by cutting the whole network at a link with the largest intermediation centrality of 16. set the minimum threshold of the support and confidence as 0.001 and 0.1, respectively and by selecting the top 25 ones so that geospatial characteristics can be seen. To see the characteristics of association rules, the top ten rules fixing the precedent to the Tone River basin having the largest catchment area in Japan are shown in table 1. The location of each basin is displayed in figure 7. Obviously, the combinations of nearer river basins have higher lift value. Interestingly, the consequent of the most rules is located in the southeastern of the Tone River basin (as shown in red in figure 7) whereas its northern basins No. 34,No. 20,No. 27 and No. 26 are ranked at 8th, 10th and out of the top tenth, respectively (as shown in blue in figure 7). As typhoons approach from the south Pacific Ocean in Japan, when the Tone River is flooded, its southeastern basins are more likely to be flooded together than the northern basins. The ARA  clarified that such geographical feature still applies in extreme storm events that cause large-scale precipitation. This finding will be informative for large-scale flood risk management and/or insurance planning. Based on the extracted rules, the 109 river basins are classified into ten groups as shown in figure 8(a). River basins with the same color belong to the same community. With similar to the ARA, spatially closer basins are classified into the same group; however, its boundary does not always the same as mountain ridges ( figure 8(b)). This indicates that at a national scale, spatial distance is more dominant than topographic features in extreme storms. On the other hand, the Group 6 (light-blue region in figure 7) expands to the north of the Group 5, whose boundary exactly correspond to the ridges of the Japan Alps probably because of their notably high altitudes. More regional investigation is demonstrated in Hokkaido and Kyushu regions (see figure 1) by applying the community extraction within the region as shown in figures 9 and 10, respectively, which clearly shows that the communities are consistent with mountain ridges. In summary, the grouping of simultaneous flood occurrence at a national scale is affected mainly by spatial distance and partly by mountain ridges with high altitudes while at a regional scale, the latter dominates it.

Summary, concluding remarks and future work
This study estimated the present and future simultaneous flood occurrence probability and its geospatial characteristics using big data of simulated flood peak discharge of annual maximum rainfall events in a large ensemble climate simulation database d4PDF and informatics techniques: ARA and graph theory. Even in Japanese river basins suffering from severe typhoons every year, simultaneous flood occurrence is a low-frequency phenomenon at least in the last decades. Frequencies of such events are first quantified in this study owing to a large number of ensembles in d4PDF and a rainfall-runoff model calibrated in all the class-A river basins in Japan.
The probability of simultaneous flood occurrence in the 4-degree rise experiment HFB is significantly larger than that in the historical experiment HPB in all the three cases examined: all Japan, western Japan that experienced large-scale inundation in 2018, and mega cities. The significant increase is largely attributed to the increased magnitude and frequencies of extreme discharge in each river basin. The future change of their correlation might have some impact, which should be analyzed in future research. Statistical techniques for multivariate  probability distribution such as the copula approach might be helpful (Bevacqua et al 2017). The uncertainty analysis of the probability estimation is kept as the future work because it focused on first proposing a big data-based approach to multi-basin flood risk analysis. The Bayesian approach and/or the MCMC approach could support to quantify the uncertainty range of simultaneous flood occurrence probability (Zhou et al 2017).
The geospatial characteristics of simultaneous flood risk were first analyzed using the ARA and graph theory, which provided the degree of correlation with surrounding areas while being unable to detect various combinations of river basins at high simultaneous flood risk. This study, using the graph theory, identified the groups of river basins to which insurance design and/or urban planning should pay much attention.

Data availability statement
The data that support the findings of this study are available upon reasonable request from the authors.