1 Introduction

Floods are natural hazards that can cause significant damage to agriculture and infrastructure (Longman et al. 2019). A design flood is a flood discharge associated with an annual exceedance probability (AEP) and is widely used in flood risk assessment. At-site flood frequency analysis (FFA) is generally used to estimate design floods when recorded flood data of sufficient length and quality are available at the site of interest (Kuczera and Franks 2019). However, there are numerous ungauged catchments where FFA is not directly applicable. Regional flood frequency analysis (RFFA) is used for these ungauged catchments; it attempts to transfer flood characteristics from gauged to ungauged catchments on the basis of regional homogeneity (Şen 1980; Cunnane 1988; Shu and Ouarda 2008). RFFA techniques have evolved over the years from the simple rational method to more complex data-driven models as computing power has increased (Potter 1987; Kirby and Moss 1987; NRC 1988; Bobee et al. 1993; Jingyi and Hall 2004; Dawson et al. 2006; Archfield et al. 2013; Chebana et al. 2014; Msilini et al. 2020; Zalnezhad et al. 2022a, b; Esmaeili-Gisavandani et al. 2023).

Numerous linear RFFA techniques have been proposed over the years, such as the probabilistic rational method (Pilgrim and Cordery 1993; Rahman et al. 2011; Gilmore et al. 2014), the index flood method (Hosking and Wallis 1993; Bates et al. 1998; Rahman et al. 1998; Smith et al. 2015; Zalnezhad et al. 2023) and ordinary least squares and generalized least squares based quantile regression techniques (QRT) (Stedinger and Tasker 1985; Rahman 2005; Ouarda et al. 2008; Haddad and Rahman 2012; Zalnezhad et al. 2022a, b). According to Sivakumar and Singh (2012), hydrologic processes are often non-linear because many of the processes involved in the movement and distribution of runoff are non-linear. With the advent of computer technology, non-linear techniques such as artificial intelligence (AI) based models are increasingly being adopted in RFFA (Dawson et al. 2006; Aziz et al. 2013, 2014, 2017; Zorn and Shamseldin 2015; Ghaderi et al. 2019; Vafakhah and Khosrobeigi Bozchaloei 2020; Filipova et al. 2022; Zalnezhad et al. 2022a). In most of these studies, AI-based models have outperformed their linear RFFA counterparts.

Recently, AI-based techniques such as deep learning (DL) methods have received attention as they have a higher capability of identifying patterns and features in large datasets with greater accuracy. For example, Jiang et al. (2022) applied a DL method to predict relative humidity, compared it with support vector regression (SVR), decision tree (DT) regression and deep residual (DR) regression, and found that DL outperformed the other methods. The convolutional neural network (CNN) is a type of DL method that has demonstrated state-of-the-art performance in many computer vision tasks and has become a standard tool for image processing in fields such as medical imaging, autonomous driving and surveillance (Aurna et al. 2022; Yuan et al. 2023; Lee and Liu 2023; Patel and Elgazzar 2023). CNNs have also been used for text classification, sentiment analysis and question answering (Zhou 2022; Habbat et al. 2022; Manmadhan and Kovoor 2023). CNNs have been successfully applied to forecast floods from satellite images (Chen et al. 2021) and to predict the depth of urban flooding (Chen et al. 2023). In addition, CNNs have shown good performance in flood susceptibility mapping (Wang et al. 2020), flood forecasting (Kimura et al. 2019) and fluvial flood prediction (Kabir et al. 2020). However, the application of CNNs in RFFA is limited.

DT is a non-parametric supervised learning algorithm that has been widely adopted in fields as diverse as aquarium control systems, tourist behaviour and the optimal siting of solar power plants (González-Sánchez et al. 2022; Abdurohman et al. 2022; Shorabeh et al. 2022). It has also shown good performance in hydrological applications such as flood susceptibility assessment (Khosravi et al. 2018; Chen et al. 2020; Ghosh et al. 2022). According to Tehrany et al. (2013), its robust predictive capabilities make it well-suited for generating susceptibility maps; however, its utilization in RFFA remains relatively limited.

The support vector machine (SVM) is another popular algorithm based on statistical learning theory, proposed by Vapnik and Chervonenkis (1974) and Vapnik (1995). Following the successful application of SVM in other fields of hydrology (Wu et al. 2008; Pijush 2011), researchers were encouraged to apply it in RFFA. For example, Ghaderi et al. (2019) compared three data-driven RFFA models, the adaptive neuro-fuzzy inference system (ANFIS), SVM and genetic expression programming (GEP), and found that SVM outperformed the other methods. A similar result was found by Sharifi Garmdareha et al. (2018). Haddad and Rahman (2020) applied multidimensional scaling (MDS) in RFFA, which is capable of developing a visual representation of similar catchments in either the catchment characteristics or geographical data space. They found that the MDS-based SVR model with a radial basis function (RBF) kernel performed more consistently in RFFA. Vafakhah and Khosrobeigi Bozchaloei (2020) compared SVR, artificial neural network (ANN) and non-linear regression (NLR) models in RFFA and found that SVR outperformed the other methods. Similar results were observed by Allahbakhshian-Farsani et al. (2020), who compared several AI-based RFFA models, multivariate adaptive regression splines (MARS), boosted regression trees (BRT) and projection pursuit regression (PPR), with NLR and found that the SVR model with an RBF kernel performed better than the others.

Today, RFFA remains an active area of research, with ongoing efforts to refine and improve the methodology to enable more accurate flood risk assessment in ungauged catchments. While AI-based methods have shown superior performance compared to traditional approaches, and other fields within hydrology are exploring novel AI-based techniques, there has been limited investigation into the application of these new AI-based methods in RFFA. To fill this knowledge gap and build upon the successful application of CNNs in other domains, this study introduces a CNN-based RFFA methodology. The CNN-based approach is compared with well-established techniques such as DT and SVM. Additionally, given the importance of interpretability in practical applications (Warner and Misra 1996), multiple linear regression (MLR) is also included in the study. It is expected that the outcomes of this study will assist in recommending AI-based RFFA models for practical applications in Australia and other countries.

2 Study area and data

This study selects south-east Australia since this part of Australia has the best quality streamflow data. South-east Australia comprises the states of Victoria and New South Wales. Victoria is dominated by winter rainfall. The Great Dividing Range (GDR) divides the coastal part of south-east Australia from the inland regions. The GDR starts in the Queensland state and ends at the eastern edge of the Victoria state, and is approximately 3500 km long. This study treats both sides of the GDR (inland and coastal) as a single region based on previous studies by Ali and Rahman (2022) and Zalnezhad et al. (2022a, b).

For this study, 201 gauged catchments from south-east Australia are selected, with annual maximum flood (AMF) data series lengths ranging from 25 to 89 years. Figure 1 shows the locations of the selected catchments. The selected catchments are not affected by major land use change, which provides an opportunity to study their natural hydrological processes. To calculate at-site flood quantiles, the log-Pearson Type 3 (LP3) distribution with a Bayesian parameter estimation technique was adopted using the FLIKE software (Kuczera and Franks 2019). Six flood quantiles are used, corresponding to AEPs of 1 in 2 (Q2), 1 in 5 (Q5), 1 in 10 (Q10), 1 in 20 (Q20), 1 in 50 (Q50) and 1 in 100 (Q100). It should be noted that other flood frequency distributions could have been adopted, but the LP3 distribution generally performs better with Australian AMF data (Rahman et al. 2013) and hence it is adopted here.
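
FLIKE implements Bayesian parameter estimation, which is not reproduced here; as a rough illustration of LP3 quantile estimation, the sketch below fits a Pearson Type 3 distribution to the base-10 logarithms of a synthetic AMF series using SciPy's maximum-likelihood fit. The AMF series and all resulting values are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical annual maximum flood (AMF) series in m^3/s -- illustrative only.
rng = np.random.default_rng(42)
amf = rng.lognormal(mean=5.0, sigma=0.8, size=40)

# Log-Pearson Type 3: fit a Pearson Type 3 distribution to log10 of the AMF.
# (A simple maximum-likelihood stand-in for FLIKE's Bayesian estimation.)
log_q = np.log10(amf)
skew, loc, scale = stats.pearson3.fit(log_q)

# Flood quantiles for the six AEPs used in the study (1 in 2 ... 1 in 100).
aeps = [2, 5, 10, 20, 50, 100]
quantiles = {f"Q{T}": 10 ** stats.pearson3.ppf(1 - 1 / T, skew, loc, scale)
             for T in aeps}
```

The back-transformation `10 ** (...)` returns the quantiles from log space to discharge units.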

Fig. 1

Location of the selected 201 catchments in South-East Australia

Previous studies have demonstrated that acceptably homogeneous regions cannot be established in Australia. For example, Ahmed et al. (2024) reported heterogeneity (H1) statistics in the range of 5.11–26.27 for south-east Australia (H1 values of 1.00 or smaller are needed for an acceptably homogeneous region).

In this study, eight catchment characteristics are selected (Table 1) since these were found to be important in previous Australian RFFA studies (Haddad et al. 2012; Rahman et al. 2020; Zalnezhad et al. 2022a, b). These catchment characteristics are catchment area (AREA), rainfall intensity with 6 h duration and 1 in 2 AEP (I62), mean annual rainfall (MAR), shape factor (SF), mean annual evapotranspiration (MAE), stream density (SDEN), slope of the central 75% of the mainstream (S1085) and fraction of forested area (FOREST). A summary of the descriptive statistics of the selected catchment characteristics for the 201 study catchments is presented in Table 1. The boxplots of the selected catchment characteristics are presented in Fig. 2.

Table 1 Descriptive statistics of the selected catchment characteristics
Fig. 2

Boxplot of selected catchment characteristics

In Fig. 2, the Y-axis represents the measurement unit of each characteristic. It should be noted that the box (interquartile range) of AREA extends from 128 km2 to 487 km2. Table 1 shows that the smallest catchment is 3 km2 and the largest is 1010 km2, with a median of 261 km2. AREA is generally considered the main scaling factor in RFFA as it directly influences the flood volume from a given storm event and is directly related to the mean annual flood (Rahman 1997).

I62 is another useful climatic characteristic in RFFA. According to the rational method, a rainfall intensity with a duration equal to the time of concentration (tc) is the logical input. For the selected catchments, the mean tc is 6.45 h. Since the use of a rainfall intensity of fixed duration is preferable in RFFA studies, the selection of a six-hour duration is logical as it is close to the mean tc. In Fig. 2, the box of I62 is comparatively narrow, varying from 32.15 to 43.1 mm/h; however, there are a few high values plotted as outliers in the boxplot.

MAR does not directly impact the flood generation process, but it is an indication of catchment wetness. MAR is included in this study as a candidate predictor variable. As seen in Fig. 2, the box of MAR ranges from 725.67 to 1125.7 mm with a few outliers. MAE is selected in this study as it indicates catchment dryness. MAR, MAE and I62 data are obtained from the Australian Bureau of Meteorology website. In Fig. 2, MAE shows a narrow box ranging from 1024.5 to 1166.1 mm with a few outliers.

Catchment shape has a direct impact on flood generation. The shortest distance between the catchment centroid and the outlet is divided by the square root of the catchment area to obtain the shape factor (SF) (Rahman et al. 2015). The higher the SF, the smaller the flood peak. Table 1 shows that the minimum SF value is 0.258 and the maximum is 1.63, with a median of 0.78; the box of SF in Fig. 2 ranges from 0.6227 to 0.9246 with a few outliers. Stream density (SDEN) affects the flood generation process (a higher SDEN enhances the drainage efficiency of a catchment). In Fig. 2, it can be seen that the SDEN box ranges from 1.38 to 2.67 km−1 with a median of 1.69 km−1.

Slope is one of the key catchment characteristics affecting the flood generation process; a higher slope reduces the travel time of runoff by increasing flow velocity. Benson (1959) noted that S1085 gave the best prediction of the mean annual flood; hence S1085 is adopted in this study. To define S1085: if L is the mainstream length of the catchment, E1 is the elevation at the 0.1L position and E2 is the elevation at the 0.85L position along the mainstream from the catchment outlet, and E is the difference between E2 and E1, then S1085 is the ratio of E to L. Table 1 shows that S1085 has a range of 0.8–69.9 m/km, a wide variation indicating that some of the catchments are very flat while others are highly steep. In Fig. 2, the box of S1085 varies from 5.48 to 16.48 m/km; however, there are a few higher values.
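
The S1085 definition above reduces to a one-line calculation; the sketch below uses hypothetical elevations (m) and mainstream length (km), so the result is in m/km as in Table 1.

```python
def s1085(elev_10, elev_85, length_km):
    """Mainstream slope S1085 as defined above: elevation difference (m)
    between the 0.85L and 0.1L points along the mainstream, divided by
    the mainstream length L (km), giving m/km."""
    return (elev_85 - elev_10) / length_km

# Hypothetical catchment: 150 m drop over a 20 km mainstream.
slope = s1085(elev_10=100.0, elev_85=250.0, length_km=20.0)  # 7.5 m/km
```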

The flood generation process can be delayed by an increase in forest area (FOREST), as forest cover promotes infiltration and reduces flow velocity. Here, FOREST denotes the fraction of the catchment area that is forested. The box of FOREST in Fig. 2 varies from 0.22 to 0.89, and in Table 1 the minimum value is 0.0001 and the maximum is 1, with a median of 0.59. It can be observed from Fig. 2 and Table 1 that some of the selected catchments are highly forested, while some have little forest area. SF, SDEN, S1085 and FOREST data are obtained from 1:100,000 topographic maps of the selected catchments; this map scale was selected based on previous studies (Rahman et al. 2009; Rahman and Rahman 2020).

3 Methodology

Figure 3 presents the overall methodology adopted in this study. First, background knowledge of AI-based RFFA techniques was gained from the literature review. The next step was the selection of the study area and the collation of the streamflow and catchment characteristics datasets. Thereafter, CNN, DT, SVM and MLR models were developed for six flood quantiles (Q2, Q5, Q10, Q20, Q50 and Q100) and tested using split-sample validation, in which the dataset is partitioned randomly into training and testing subsets. Finally, the results of all the models are compared using the nine statistical measures presented in Sect. 3.6.

Fig. 3

Flowchart of the adopted overall methodology in this study

The Adam (adaptive moment estimation) optimizer was used to train the CNN models. It is a stochastic gradient descent method (Okewu et al. 2020) that adapts the learning rate of each parameter based on its historical gradients and momentum, and it is capable of adjusting the parameters of a neural network during training to improve accuracy and speed. For the other regression models (SVM, MLR and DT), Bayesian optimization was used. It applies Bayes' rule to obtain the posterior distribution of the objective function, combining prior information about the unknown function with the sample information (Wu et al. 2019).
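
As a minimal sketch of the Adam update rule described above (not the study's actual training code), the following minimises a simple quadratic; the learning rate and loss function are illustrative choices.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum plus a per-parameter adaptive learning rate."""
    m = b1 * m + (1 - b1) * grad           # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (uncentred variance)
    m_hat = m / (1 - b1 ** t)              # bias corrections for the warm-up
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2.0 * (theta - 3.0), m, v, t)
```

In a neural network the same update is applied element-wise to every weight, with the gradients supplied by backpropagation.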

3.1 Split sample validation

According to Muraina (2022), splitting the dataset is crucial: the training dataset should contain enough information for the model to learn an effective mapping from inputs to outputs, while the testing dataset should be as representative of the data as possible so that the model's performance can be assessed fairly. In an empirical study on avoiding overfitting, Gholamy et al. (2018) suggested that using 70–80% of the data for training and 20–30% for testing gives the best results.

In this study, a split-sample validation (80%/20%) technique is adopted to compare the performance of the CNN model with the MLR, SVM and DT models. As stated by Gholamy et al. (2018), using 70–80% of the data for training and 20–30% for testing is appropriate. Out of the 201 selected catchments, 161 (80%) are selected randomly for training, and the remaining 40 (20%) are used for model testing.
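
The random 161/40 split can be sketched as follows; the random seed is arbitrary and the catchment indices are placeholders for the actual station records.

```python
import numpy as np

rng = np.random.default_rng(0)
n_catchments = 201
idx = rng.permutation(n_catchments)        # shuffle catchment indices

n_train = int(round(0.8 * n_catchments))   # 161 catchments for training
train_idx = idx[:n_train]
test_idx = idx[n_train:]                   # remaining 40 for testing
```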

3.2 Convolutional Neural Network (CNN)

A CNN is a form of deep learning model that processes data in a grid pattern. A CNN architecture usually consists of several building blocks. For this study, a CNN regression model is utilized, which contains convolutional layers with rectified linear unit (ReLU) activations and dropout layers, followed by one fully connected layer and a regression layer. The convolutional layer is the fundamental layer of a CNN architecture: it performs the convolution operation that extracts features, which is a linear operation, after which the ReLU activation function applies a non-linear operation that sets all negative values to zero.

In this study, the values of the variables for each catchment are treated as an image input and converted to an array of numbers; since digital image pixel values are stored in a two-dimensional (2D) grid, a 2D-CNN is used. A small grid of weights, the neural network's filter known as the kernel (an optimizable feature extractor), slides across the input image by a stride, which represents how many steps the kernel moves at each step (Yamashita et al. 2018). To process the image more precisely, padding is added to the frame of the image to give the kernel more space to cover the image. This process continues until the kernel has moved across the whole image, and the output is then passed as input to the next layer.

The outputs of a convolutional layer are then passed through the ReLU activation function, after which a dropout layer is added. Dropout is a technique in which nodes, along with their connections, are dropped randomly from the network during training, which prevents the network from overfitting and yields significant improvements (Lim 2021).

After repeating the convolution–ReLU–dropout sequence a few times, the outputs are transformed into a one-dimensional (1D) array of numbers (a vector), and a fully connected layer with learnable weights is added. The size of the step by which the weights are updated during training is controlled by the "learning rate". During training of the convolutional network, the kernels and weights are evaluated through forward propagation and updated through backpropagation. Before training starts, the hyperparameters such as kernel size, number of kernels, padding and stride are set. After training, the test dataset is passed to the model, and the predicted outputs are compared with the observed outputs of the test dataset to evaluate model performance. Figure 4 shows an overview of the developed CNN architecture and the training process adopted in this study.
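
The convolution–ReLU–dropout pipeline described above can be sketched in NumPy as a single forward pass with a fixed random kernel. This is a minimal illustration only: the study's CNN learns its kernels and fully connected weights via backpropagation, and the input grid and sizes here are hypothetical.

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=1):
    """2D convolution: zero-pad the frame, then slide the kernel by `stride`."""
    img = np.pad(image, padding)
    kh, kw = kernel.shape
    oh = (img.shape[0] - kh) // stride + 1
    ow = (img.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = img[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)   # elementwise product, summed
    return out

def relu(x):
    """Non-linear activation: all negative values become zero."""
    return np.maximum(x, 0.0)

def dropout(x, rate, rng):
    """Randomly zero nodes during training; rescale to preserve expectation."""
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))   # predictor values arranged as a 2D grid
k = rng.standard_normal((3, 3))   # kernel (fixed here; learnable in a real CNN)
features = dropout(relu(conv2d(x, k, stride=1, padding=1)), rate=0.2, rng=rng)
flat = features.ravel()           # flattened vector fed to the dense layer
```

With a 3 x 3 kernel, stride 1 and padding 1, the 8 x 8 input yields an 8 x 8 feature map, which is flattened to a 64-element vector for the fully connected layer.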

Fig. 4

CNN architecture and the training process

3.3 Support vector machine (SVM)

SVM is a machine learning algorithm used for both classification and regression tasks. The basic idea behind SVM is to find the optimal hyperplane that separates the data points, or predicts the response variable, with maximum margin. The margin is the distance between the hyperplane and the closest data points, and the SVM algorithm aims to maximize this distance while minimizing the prediction error. In SVM regression, the kernel function plays an important role in transforming the input variables into a higher-dimensional space where the relationship between the predictors and the response variable may be more linear. One common kernel function used in SVM regression is the radial basis function (RBF) kernel, a Gaussian function that maps the input variables to an infinite-dimensional space. In addition to the RBF kernel, other kernel functions can be used in SVM regression, including the polynomial kernel, which can model non-linear relationships between the predictor variables and the response variable. When the degree of the polynomial kernel is set to 3, it is referred to as the cubic SVM. The cubic SVM is useful in situations where the relationship between the predictor variables and the response variable is highly non-linear and cannot be captured by a linear or RBF kernel. Bagasta et al. (2019) compared cubic SVM and Gaussian SVM for detecting ischemic stroke and found that the cubic SVM performed best for infarction classification. Hence, the cubic SVM is used in this study.

3.4 Decision tree (DT) regression

In DT regression, the leaf nodes of the tree represent the predicted values of the output variable, and the path from the root node to a leaf node represents the decision process that led to the prediction (Freund and Mason 1999). The tree is built by recursively partitioning the input data into smaller subsets based on the values of the input variables, finding at each node the split that minimizes the mean squared error (MSE) of the prediction. A fine tree (TF) with many small leaves is usually highly accurate on the training data; however, a very leafy tree tends to overfit, its validation accuracy is often far lower than its training accuracy, and it produces a highly flexible response function. In contrast, a coarse tree (TC) can be more robust and produces a coarse response function. In between, a medium tree (TM) produces a moderately flexible response function, with a minimum leaf size of 12 compared to the fine tree's minimum leaf size of 4 and the coarse tree's minimum leaf size of 36.

Several studies in other fields have compared the performance of different DT models. For example, Yaman et al. (2020) compared TF, TM and TC for estimating energy consumption and found that TF performed best. In another study, estimating wind speed, AKINCI and NOĞAY (2019) found that TC performed best among the three DT models (TF, TM, TC). To select the best-performing DT model in this study, the three DT models (TF, TM and TC) were tested, and the TM model was chosen based on its lowest RMSE value.
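
The leaf-size comparison can be sketched with scikit-learn's `DecisionTreeRegressor`, whose `min_samples_leaf` parameter corresponds to the minimum leaf sizes quoted above. The synthetic data stand in for the study's catchment records, which are not reproduced here.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for 161 training and 40 test catchments with 8 predictors.
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(161, 8))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.3, 161)
X_test = rng.uniform(0, 10, size=(40, 8))
y_test = 2.0 * X_test[:, 0] + np.sin(X_test[:, 1])

# Fine / medium / coarse trees differ only in their minimum leaf size.
rmse = {}
for name, leaf in [("TF", 4), ("TM", 12), ("TC", 36)]:
    model = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=0).fit(X, y)
    err = model.predict(X_test) - y_test
    rmse[name] = float(np.sqrt(np.mean(err ** 2)))

best = min(rmse, key=rmse.get)   # the tree with the lowest test RMSE
```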

3.5 Multiple linear regression (MLR)

MLR can be used to develop a prediction equation in RFFA. In this study, the ordinary least squares (OLS) method is used to estimate the coefficients of the regression equation. OLS yields the maximum likelihood estimates of the parameters, giving unbiased and minimum-variance estimates where the errors are independent, identically and normally distributed (Draper and Smith 1998; Pandey and Nguyen 1999; Haddad and Rahman 2012). The adopted form of MLR is expressed by Eq. 1:

$$ \begin{aligned} Q_{T} = \; & b_{0} + b_{1}(\mathrm{AREA}) + b_{2}(\mathrm{I}_{62}) + b_{3}(\mathrm{MAR}) + b_{4}(\mathrm{SF}) \\ & + b_{5}(\mathrm{MAE}) + b_{6}(\mathrm{SDEN}) + b_{7}(\mathrm{S1085}) + b_{8}(\mathrm{FOREST}) \end{aligned} $$
(1)

where QT is the flood quantile with an AEP of 1 in T, b0 is the intercept of the regression equation, and b1, b2, …, b8 are the regression coefficients.
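
An OLS fit of Eq. 1 can be sketched with NumPy's least-squares solver on synthetic data; the coefficients and predictor values below are illustrative, not the study's.

```python
import numpy as np

# Synthetic stand-in for 161 training catchments with 8 characteristics.
rng = np.random.default_rng(7)
n = 161
X = rng.uniform(1, 100, size=(n, 8))

# Hypothetical "true" coefficients b0..b8 used to generate the quantiles.
true_b = np.array([5.0, 2.0, -1.0, 0.5, 3.0, -0.5, 1.5, 0.0, 4.0])
A = np.column_stack([np.ones(n), X])       # design matrix with intercept column
q = A @ true_b + rng.normal(0, 0.1, n)     # Eq. 1 plus a small error term

# OLS estimates of b0..b8 via least squares.
coef, *_ = np.linalg.lstsq(A, q, rcond=None)
```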

3.6 Statistical indices

The following nine statistical indices (Eqs. 2–10) are adopted to compare the performances of the developed RFFA models:

Qpred/Qobs ratio (Qr):

$$Qr= \frac{{Q}_{pred}}{{Q}_{obs}}$$
(2)

Relative error (RE):

$$RE=\frac{{Q}_{pred}-{Q}_{obs}}{{Q}_{obs}}\times 100$$
(3)

Median absolute relative error (REr):

$$REr=median[abs(RE)]$$
(4)

Mean square error (MSE):

$$ MSE = mean\left[ {\left( Q_{pred} - Q_{obs} \right)^{2} } \right] $$
(5)

Root mean square error (RMSE):

$$RMSE = \sqrt{MSE}$$
(6)

Bias:

$$Bias=mean(Q_{\text{pred}}-Q_{\text{obs}})$$
(7)

Relative bias (RBias):

$$RBias = \left[mean\left(\frac{{Q}_{pred}-{Q}_{obs}}{{Q}_{obs}}\right)\right]\times 100$$
(8)

Relative root mean square error (RRMSE):

$$RRMSE = \frac{\sqrt{mean \left[{\left({Q}_{pred}-{Q}_{obs}\right)}^{2}\right]}}{mean\left({Q}_{obs}\right)}$$
(9)

Root mean square normalised error (RMSNE):

$$RMSNE= \sqrt{mean\left[{\left(\frac{{Q}_{pred}-{Q}_{obs}}{{Q}_{obs}}\right)}^{2}\right]}$$
(10)

where Qobs is the observed flood quantile from at-site flood frequency analysis by LP3 distribution at a given test catchment, and Qpred is the predicted flood quantile obtained from the developed RFFA models for the test catchment.
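
The nine indices can be computed together as a direct NumPy transcription of Eqs. 2–10 (here the per-catchment Qr and RE values of Eqs. 2–3 are summarised by their medians, as done in Tables 2 and 4):

```python
import numpy as np

def rffa_indices(q_obs, q_pred):
    """Evaluation statistics (Eqs. 2-10) over a set of test catchments."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_pred = np.asarray(q_pred, dtype=float)
    ratio = q_pred / q_obs                           # Qr, Eq. 2
    re = (q_pred - q_obs) / q_obs * 100.0            # RE, Eq. 3
    mse = np.mean((q_pred - q_obs) ** 2)             # MSE, Eq. 5
    return {
        "Qr_median": float(np.median(ratio)),
        "REr": float(np.median(np.abs(re))),         # Eq. 4
        "MSE": float(mse),
        "RMSE": float(np.sqrt(mse)),                 # Eq. 6
        "Bias": float(np.mean(q_pred - q_obs)),      # Eq. 7
        "RBias": float(np.mean((q_pred - q_obs) / q_obs) * 100.0),        # Eq. 8
        "RRMSE": float(np.sqrt(mse) / np.mean(q_obs)),                    # Eq. 9
        "RMSNE": float(np.sqrt(np.mean(((q_pred - q_obs) / q_obs) ** 2))),  # Eq. 10
    }

# Illustrative values only: two catchments, 10% over- and 10% under-prediction.
indices = rffa_indices([100.0, 200.0], [110.0, 180.0])
```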

4 Results

Each of the four developed RFFA models (CNN, SVM, TM and MLR) is tested on the test dataset consisting of 40 catchments. Several statistical measures (Eqs. 2–10) and plots are used in this evaluation, as presented below.

Table 2 provides seven statistical measures (based on the test dataset of 40 stations) for the four models (CNN, SVM, TM and MLR) across the six flood quantiles. Table 2 reveals that the CNN model consistently outperforms the other models in terms of several statistical measures across the quantiles. Specifically, the CNN model exhibits the lowest REr values for five of the six quantiles, the exception being Q2. It also has the lowest mean squared error (MSE) for four quantiles, the exceptions being Q2 and Q5. Furthermore, the CNN model achieves the lowest Bias values for Q20 and Q100, and the lowest RBias value for Q10. It also attains the lowest RMSE values for four quantiles (except Q2 and Q5) and the lowest RMSNE values for four quantiles (except Q5 and Q20). It should be noted that the error values in Table 2 are generally high, which is mainly due to the highly variable hydrology of Australia. The currently recommended RFFA technique (the ARR-RFFE Model) in Australian Rainfall and Runoff (ARR) shows similar or higher error statistics.

Table 2 Statistical evaluations of the four different models and six flood quantiles

It should be noted that the CNN model does not consistently achieve the lowest value across all the statistical measures; in some cases it is the second-best model, with exceptions such as Q5-RMSNE, Q5-RBias, Q20-RBias, Q50-RBias and Q100-RBias. Out of the 42 statistical measures examined (7 statistics × 6 quantiles in Table 2), the CNN model performs the best in 24 measures, followed by SVM (6 measures), TM (4 measures) and MLR (8 measures). Moreover, the CNN model ranks second best in 13 measures and third best in 5 measures. While the TM model demonstrates the highest performance for most measures at Q2, and the MLR model performs best for most measures at Q5, the CNN model surpasses the other models in terms of most statistical measures for the remaining quantiles (Q10, Q20, Q50 and Q100).

Based on the findings presented in Table 2, it can be concluded that the CNN model exhibits superior performance overall compared to the other three models, consistently achieving the lowest values for the majority of statistical measures and indicating strong predictive capability. To provide further insight into the performance of these models, Fig. 5 presents RE box plots, which offer a more detailed assessment of the CNN, SVM, TM and MLR models across the six flood quantiles.

Fig. 5

Boxplots of relative error (RE) values of the four methods for six different quantiles (the Y-axis presents RE values and the X-axis presents the six flood quantiles: Q2, Q5, Q10, Q20, Q50 and Q100)

From Fig. 5, it is found that for Q2 the smallest box width is associated with SVM, followed by CNN, TM and MLR (CNN, SVM and TM have similar box widths). For Q5, the smallest box width is exhibited by SVM, followed by MLR, CNN and TM (SVM and MLR have similar box widths, as do CNN and TM). For Q10, the smallest box width is seen for CNN, followed by SVM, MLR and TM (CNN and SVM have similar box widths). For Q20, CNN shows the smallest box width, followed by SVM, MLR and TM (CNN and SVM have similar box widths, as do TM and MLR). For Q50, SVM has the smallest box width, followed by CNN, TM and MLR (SVM, CNN and TM have similar box widths, and the box width of MLR is remarkably greater than those of the other three models). For Q100, the smallest box width is provided by CNN, followed by SVM, MLR and TM (the box width of TM is about double that of CNN and SVM). Considering all six quantiles, CNN and SVM have similar box widths, which are much smaller than those of TM and MLR, particularly for the higher return periods.

In Fig. 5, the median line of each model is represented by a thick line within the box. When the median line of a model is located below the 0:0 reference line, the model overall underestimates the observed flood quantiles; when it is located above the 0:0 line, the model overall overestimates; and when it coincides with the 0:0 line, the model is best in terms of bias. The best result in terms of bias is found for CNN (Q2 and Q100), followed by MLR (Q5, Q10 and Q100). Overall, in terms of bias (as seen in Fig. 5), CNN outperforms the other three models, and TM shows notable overestimation for all six flood quantiles.

The presence of outliers (an indication of gross overestimation or underestimation by a model) is of great importance as it influences model performance: a larger number of outliers contributes to greater variability in model performance, thereby diminishing the statistical power of the model. Table 3 shows the number of outliers produced by each model as per Fig. 5. Overall, CNN has the smallest number of outliers, followed by MLR, while SVM has the largest number of outliers.

Table 3 Number of outliers for six quantiles and four models

Figure 6 displays boxplots representing the performance of the four selected models across the six quantiles using the Qr (Qpred/Qobs) metric. The median line within each box, indicated by a thick line, serves as an indicator of overall model performance, with a median line closer to 1 suggesting better performance. Regarding the Qr box plots for the six flood quantiles in Fig. 6, it is evident that the CNN model exhibits the narrowest Qr boxes with fewer outliers compared to the other three models. On the other hand, the Q5-SVM model demonstrates the narrowest box for Qr, but it also has five outliers. Similarly, the Q5-MLR model shows a narrower box than Q5-CNN, with both models having only one outlier; consequently, Q5-MLR performs better than Q5-CNN in terms of Qr. Considering all the Qr box plots produced by the four models for the six quantiles in Fig. 6, it can be concluded that, overall, the CNN model performs best compared to the other three models.

Fig. 6

Boxplot of Qr (Qpred/Qobs) values of four methods for six different quantiles where Y-axis presents Qr (Qpred/Qobs) values and X-axis presents six quantiles (Q2, Q5, Q10, Q20, Q50 and Q100)

Table 4 presents the median Qr values for the four models. It is observed from Table 4 that the CNN model achieves median Qr values ranging from 0.82 to 1.14 across the six flood quantiles, whereas the other models display larger ranges of median Qr values. Based on the analysis of the boxplots and median values in Fig. 6 and Table 4, the CNN model demonstrates overall better performance than the other three models.

Table 4 Median value of Qr for six quantiles by four models

In summary, comparing the performance of the four models for the six quantiles using the selected statistical indices, the RE and Qr box plots, the number of outliers produced by each model, and the median Qr values, it is evident that, overall, the CNN model outperforms the other three models. However, a few catchments performed poorly, producing high Qr values in the CNN model, and these influence the median REr and the other statistical measures. In Figs. 7 and 8, these poorly performing catchments are denoted by their station names. Figure 7 shows the Qr values of these five outlier catchments for the different flood quantiles. The catchment characteristics of these five catchments are illustrated in Fig. 8, where the thick line shows the median value of each catchment characteristic based on the data of the 201 selected catchments. Typically, a Qr value closer to 1 indicates better model performance. Examining the results for each catchment (Figs. 7 and 8), the Murrindindi River at Murrindindi above Colwells catchment consistently exhibits high Qr values, ranging from 4.1 to 6.8 across the flood quantiles. Despite its smaller AREA, this catchment has a very small MAE compared with the majority of the catchments. The Pranjip Creek at Moorilim catchment has a larger AREA but very small MAR, S1085 and FOREST. The Big River d/s of Frenchman Creek Junction catchment has a very small SF and a higher FOREST. The Grampians Rd Br catchment is characterized by a very small AREA and a very high S1085, and the Avon River at Wimmera Highway catchment has very small MAR, S1085 and FOREST. These unusual characteristics might have contributed to the poor performance of the CNN model for these catchments.

Fig. 7
figure 7

Qr values of five poorly performing catchments for the CNN model

Fig. 8
figure 8

Poorly performing catchments and their characteristics based on Qr value (CNN model)

To investigate the CNN model performance in more depth, a few well performing catchments, with Qr values close to 1, were selected for analysis. Figure 9 shows Qr values for different flood quantiles of these five well performing catchments. The characteristics of these five catchments are shown in Fig. 10, where the thick black line represents the median value across all catchments selected in this study. Despite having the same AREA, the Murrindindi River at Murrindindi above Colwells catchment (Fig. 7) performed poorly while the Wanalta catchment (Fig. 9) performed well. From Figs. 8 and 10, it can be seen that the Murrindindi River at Murrindindi above Colwells catchment has almost double the MAR value, a very high S1085, and slightly higher SF, SDEN and I62 than the Wanalta catchment, meaning that it is steeper and receives more rainfall of higher intensity. However, with its slightly higher SDEN and FOREST values, the Murrindindi River at Murrindindi above Colwells catchment drains efficiently and produces smaller observed flood quantiles; the CNN model was unable to learn this behavioural pattern in this study.

Fig. 9
figure 9

Qr values of five well performing catchments for the CNN model

Fig. 10
figure 10

Well performing catchments and their characteristics based on Qr value (CNN model)

The Big River d/s of Frenchman Creek Junction catchment (Fig. 8) and the Devlins Br catchment (Fig. 10) also have very similar AREA values, yet the Devlins Br catchment (Fig. 9) shows better Qr values than the Big River d/s of Frenchman Creek Junction catchment (Fig. 7). Most of the predictor variables of the two catchments are almost the same except MAR and SF. With a higher MAR but a lower SF than the Devlins Br catchment, the Big River d/s of Frenchman Creek Junction catchment produces smaller observed quantile values than the Devlins Br catchment (Table 5). The CNN model could not capture the pattern of the Big River d/s of Frenchman Creek Junction catchment and predicted almost the same quantile values for both catchments.

Table 5 Observed flood quantiles (Q2, Q5, Q10, Q20, Q50 and Q100) of the 5 poorly performing and 5 well performing catchments during testing of the CNN models

The Pranjip Creek at Moorilim catchment (Fig. 8) and the Redesdale catchment (Fig. 10) are both large catchments, with areas of 787 km² and 629 km², respectively. However, with higher SDEN and FOREST but a steeper slope (Fig. 10), the Redesdale catchment produces higher quantile values than the Pranjip Creek at Moorilim catchment (Table 5). The CNN model failed to capture the hydrological pattern of the Pranjip Creek at Moorilim catchment and predicted almost the same quantile values for both catchments. The Gerrang Br and Flowerdale catchments show close values for nearly all predictor variables in Fig. 10, except for slightly different AREA and SF, and both catchments show good Qr values in Fig. 9.

To understand the learning phase of the CNN model, this study also investigated four catchments from the training data set: two that performed poorly and two that performed well. Table 6 presents these four catchments, where the Cudgee and Eungella catchments performed poorly and the Glencairn and Jacobs Ladder catchments performed well.

From Table 6, it can be seen that the Cudgee and Glencairn catchments have the same AREA. Being a flatter catchment with smaller values for all other predictors, the Cudgee catchment produces significantly smaller quantile values than the Glencairn catchment (Table 7), but the CNN model predicted similar quantile values for both catchments, resulting in high Qr values for the Cudgee catchment. The Jacobs Ladder catchment is much steeper than the Eungella catchment, but SF is almost the same for both. With lower I62, MAR, MAE and SDEN values, the Jacobs Ladder catchment produces smaller quantiles than the Eungella catchment, which the CNN model captured well during training; the Qr values for the Jacobs Ladder catchment are close to 1 for all quantiles (Table 6). The Eungella catchment produces high values for all quantiles despite its flatter slope, owing to its higher I62, MAR, MAE and SDEN. The CNN model failed to predict the Eungella catchment, instead predicting similar quantile values for both catchments (Jacobs Ladder and Eungella).

Table 6 Characteristics and Qr values of the poorly and well performing catchments during training
Table 7 Observed flood quantiles (Q2, Q5, Q10, Q20, Q50 and Q100) of the poorly and well performing catchments during training

5 Discussion

Like any other neural network, a CNN relies on a large training data set to learn the patterns in the data. A CNN also tends to overfit, but in this study a dropout layer was used to reduce overfitting. Dropout layers prevent the network from overfitting by randomly deactivating a fraction of neurons during training. In the CNN method, every convolution layer should be followed by an activation layer; in this study, the ReLU (Rectified Linear Unit) operation was used as the activation so that the network can account for non-linearity. Although the CNN is known to be a good pattern recognition model, in this study it had limited learning opportunity due to the relatively small data set. In this regard, a Monte Carlo cross validation technique can be adopted (Haddad et al. 2013), which randomly splits the data into training and validation sets hundreds of times to evaluate prediction error.
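The Monte Carlo cross validation idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a simple least-squares regression stands in for the CNN, the data are synthetic, and the median relative error (%) per split is used as the prediction-error metric.

```python
import numpy as np

def monte_carlo_cv(X, y, fit, predict, n_splits=100, test_frac=0.2, seed=0):
    """Repeated random train/validation splits (Monte Carlo cross validation).

    Returns one median relative error (%) per split.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    n_test = max(1, int(round(test_frac * n)))
    errors = []
    for _ in range(n_splits):
        idx = rng.permutation(n)           # fresh random split each repetition
        test, train = idx[:n_test], idx[n_test:]
        model = fit(X[train], y[train])
        pred = predict(model, X[test])
        re = np.abs(pred - y[test]) / y[test] * 100.0
        errors.append(np.median(re))
    return np.array(errors)

# Synthetic, noise-free data; ordinary least squares stands in for the CNN
rng = np.random.default_rng(42)
X = rng.random((50, 2))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 5.0

fit = lambda Xtr, ytr: np.linalg.lstsq(np.c_[Xtr, np.ones(len(Xtr))], ytr, rcond=None)[0]
predict = lambda w, Xte: np.c_[Xte, np.ones(len(Xte))] @ w

errs = monte_carlo_cv(X, y, fit, predict, n_splits=100)
print(errs.mean(), errs.max())
```

Averaging the per-split errors gives a more stable estimate of prediction error than a single split, which is the main attraction of the technique for small data sets such as the one used here.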

Also, in future studies it would be worthwhile to create subsets of catchments based on homogeneity (Msilini et al. 2020). Selecting important features would be beneficial as well, and a sensitivity analysis of the input variables would help to identify the best independent variables (Heidarpanah et al. 2023).

In this study, four different RFFA models (CNN, SVM, DT and MLR) were evaluated using data from 201 catchments in south-east Australia. Comparing the performances of these four models based on several statistical measures (Eqs. 2–10), it is found that the CNN regression model performs better than the other three models.

To further assess the performance of the CNN model developed in this study, several statistical measures (REr, RBias, RMSE and RRMSE) are compared with those of other RFFA studies. Our CNN models show REr values ranging from 29 to 44%, which are comparable to Ali and Rahman (2022), who reported REr values in the range of 28 to 39% for a kriging based RFFA model for the NSW and Victoria states of Australia. In another study, Noor et al. (2022) found REr values ranging from 16 to 41% for Victoria using a generalized additive model (GAM). Rahman and Rahman (2020) noted REr values between 22 and 37% for their index flood method (IFM) for NSW. Zalnezhad et al. (2022a) developed a quantile regression technique (QRT) for NSW and Victoria and found REr values ranging from 36 to 48%; for their ANN model the REr values were in the range of 33 to 54%. A recent study by Zalnezhad et al. (2023) using the IFM found REr values ranging from 32 to 46%. Aziz et al. (2015) found REr values ranging from 37 to 72% for south-east Australia for their GAANN model, and in another study, Aziz et al. (2017) found REr values ranging from 36 to 46% based on their ANN model for south-east Australia. The ARR RFFA model (Rahman et al. 2019) reported REr values ranging from 49 to 59% for eastern Australia; however, it should be noted that the ARR RFFA model used 558 catchments and a leave-one-out validation technique to evaluate model accuracy, which is more rigorous than the split-sample validation technique adopted in this study.

In relation to RBias, the CNN model developed in this study shows values in the range of 14 to 43, compared to 32 to 57 reported by Zalnezhad et al. (2022a) for their QRT model in south-east Australia. In another study, Shu and Ouarda (2008) found RBias ranging from −11 to −8 using an ANFIS model in Quebec, Canada. In terms of the lowest RMSE, this study found 14.78, whereas Allahbakhshian-Farsani et al. (2020) found a lowest RMSE of 50.7 using an SVM model and Zalnezhad et al. (2022a) found a lowest RMSE of 50.15 using an ANN method. This study found RRMSE values ranging from 0.61 to 0.85 for the CNN method, which closely align with the study by Ouarda and Shu (2009), who used an ANFIS model and found RRMSE values ranging from 0.57 to 0.64. Zalnezhad et al. (2022a) found RRMSE values in the range of 0.79 to 1.02 for their ANN based RFFA model, and Zalnezhad et al. (2022a, b) found RRMSE values in the range of 0.75 to 1.01 for the QRT method. Zalnezhad et al. (2023) used the IFM and found RRMSE values in the range of 0.74 to 1.12. Overall, the CNN model developed in this study performs better than most of the previously reported similar studies.
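For reference, the evaluation statistics compared above can be computed as in the sketch below. The formulas follow common usage in the RFFA literature (median relative error in percent, relative bias in percent, root mean square error, and relative RMSE) and are assumptions here; the paper's own Eqs. 2–10 should be taken as authoritative, and the numbers shown are hypothetical.

```python
import numpy as np

def rffa_stats(q_pred, q_obs):
    """Common RFFA evaluation statistics (assumed definitions)."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_obs = np.asarray(q_obs, dtype=float)
    rel = (q_pred - q_obs) / q_obs                           # relative errors
    return {
        "REr": float(np.median(np.abs(rel)) * 100.0),        # median relative error (%)
        "RBias": float(np.mean(rel) * 100.0),                # relative bias (%)
        "RMSE": float(np.sqrt(np.mean((q_pred - q_obs) ** 2))),
        "RRMSE": float(np.sqrt(np.mean(rel ** 2))),
    }

# Hypothetical predicted vs observed quantiles for three catchments
stats = rffa_stats([110.0, 180.0, 420.0], [100.0, 200.0, 400.0])
print(stats)
```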

6 Conclusion

This study focuses on RFFA in south-east Australia using data from 201 catchments. It compares a CNN based RFFA model with MLR, SVM and DT based RFFA models. The performances of these models are evaluated using a split-sample validation technique based on nine statistical measures for six different flood quantiles (Q2, Q5, Q10, Q20, Q50 and Q100). It is found that the CNN model performs best for AEPs in the range of 1 in 5 to 1 in 100, with median relative error values in the range of 29 to 44%. The DT model shows better performance for the 1 in 2 AEP, with a median relative error of 24%. The CNN model outperforms the RFFA model currently recommended in the Australian Rainfall and Runoff guideline, and the developed CNN based RFFA model performs better than similar previous studies.

However, the CNN models face challenges in accurately predicting flood quantiles for certain catchments with extreme characteristics. To enhance the performance of CNN models in future studies, it is recommended to create subsets of catchments based on homogeneity, to conduct feature selection, and to carry out sensitivity analysis of the input variables. By identifying the important features, the selection of independent variables can be optimized to improve model performance. Future studies should apply Monte Carlo and leave-one-out cross validation techniques and evaluate CNN based RFFA models using data from other Australian states, which will assist in recommending more accurate RFFA techniques for the Australian Rainfall and Runoff guideline.