Influence of global teleconnections on long-term variability in flood seasonality across peninsular India

Climate change and natural climate variability impact not only the frequency and magnitude of floods but also flood seasonality. However, limited to no study has investigated the seasonality in flood occurrence in peninsular Indian rivers. The Central Water Commission opening its long-term records of river stage and discharge gathered from many monitoring stations offers a unique opportunity to analyse flood seasonality. The primary aims of this study are to identify the time of the year when floods are most likely, investigate whether the occurrence of floods has changed over time due to the influence of climate change and natural climate variability, and determine the most significant large-scale and local climate drivers of flood seasonality. Stage and discharge data from 281 gauging stations across ten peninsular river basins are considered to identify the annual maximum gauge level for a gauging point while retaining the corresponding Julian day (also known as the date of occurrence or DO) for further analysis. Several attributes related to circular statistics are calculated from the DO series to find that 68% of the gauging stations experience floods during August, indicating the dominance of the monsoon system on DO. Preliminary analysis confirms that while most stations experience a non-stationarity in the DO series, a few stations exhibit a non-linear trend. Furthermore, our study develops a machine learning model with DO as predictand and 36 local and large-scale predictors to identify the dominant predictors of DO. The predictor importance metric shows that the Pacific Decadal Oscillation (PDO) and the El Niño and Southern Oscillation (ENSO) explain DO variability better than other drivers. Moreover, since ENSO and PDO are teleconnected with the onset of monsoon and annual maximum rainfall (Pradhan et al 2017 Sci. Rep. 7 14229; Choudhury et al 2021a Q. J. R. Meteorol. Soc. 147 3111–23), flood seasonality demonstrates a connection with both PDO and ENSO.


Introduction
The Indian subcontinent receives the predominant portion of its annual precipitation during the monsoon season, which spans from June to September, resulting in extreme floods.Apart from monsoonal rainfall, heavy rainfall with prevailing soil moisture (Wasko and Nathan 2019), deep depressions and cyclonic storms from the Bay of Bengal and the Arabian Sea, and mid-tropospheric cyclones (Ray et al 2019) primarily contribute to extreme floods in India.Although former studies have conducted detailed investigations into extreme flood magnitudes (Prosdocimi et al 2015, Sahany et al 2018), flood timing and its associated drivers have received comparatively less research attention.The significance of flood seasonality studies lies in identifying distinct flood generation mechanisms (Villarini 2016), assessing the relative importance of flood drivers (Berghuijs et al 2016), and understanding flood regimes (Beurton and Thieken 2009).Moreover, a comprehensive understanding of flood seasonality is crucial for flood risk assessment, climate change research, and the characterization of diverse flood triggers (Blöschl et al 2017, Ye et al 2017, Wasko et al 2020).Towards this, the current study analyses the temporal patterns of flood occurrences and their association with largescale oceanic-atmospheric circulations (called climate drivers) and local drivers.The study restricts its analysis to peninsular India, whose hydrometry datasets were recently made available in the public domain by the Central Water Commission (CWC), Government of India, thus offering a unique opportunity to study flood processes in Indian rivers.
While most of the floods in India occur during the southwest monsoon season, peninsular India experiences floods during the post-monsoon and winter months due to the northeast monsoon (Dhar and Nandargi 2003).In recent years, major floods have occurred in peninsular India (for example, the 2015 Chennai and 2018 Kerala floods) in concurrence with common flooding conditions in the eastern peninsular rivers.Ray et al (2019) noted that the response to floods in Indian rivers varies greatly, mainly based on the local geomorphology and population density in the flood plains.The frequency and magnitude of floods in Indian rivers will likely increase in response to anthropogenic climate change (Rani and Sreekesh 2019, Sinha et al 2020).Apart from climate change, natural climate variability (i.e.natural variation in land-atmospheric processes independent of anthropogenic influences) and antecedent hydrometeorological conditions may also influence Indian floods (Garg and Mishra 2019, Swain et al 2020, Rahman et al 2021, Nanditha and Mishra 2022, Sharma and Mujumdar 2024).Along with El Niño and Southern Oscillation (ENSO), climate variability modes such as the Pacific Decadal Oscillation (PDO), North Atlantic Oscillation (NAO), and Indian Ocean Dipole (IOD) are strongly associated with the Indian monsoon, meaning that they too can modulate floods (Gadgil et al 2004, Mondal and Mujumdar 2015, Shekhar et al 2022).However, to the best of our knowledge, limited to no study has investigated the non-stationarity in flood seasonality across Indian rivers resulting from either human-induced climate change or natural climate variability.Considering these, the current study addresses the following objectives.1.To investigate the spatio-temporal characteristics of flood seasonality pertaining to peninsular Indian rivers and find whether flood timings have changed.2. To identify the local and global drivers of flood seasonality.
Several former studies considered circular statistics to characterize the seasonality of flooding and study the underlying flood-generating mechanisms (Villarini 2016, Ye et al 2017, Berghuijs et al 2019, Basu et al 2023).However, to the best of our knowledge, a comprehensive analysis of the drivers of flood seasonality is yet to be conducted.The current study estimates the annual maximum gauge level (AMGL) for 281 monitoring points across peninsular India.Multiple statistical attributes related to circular statistics are estimated to address the first objective.The augmented Dickey-Fuller test (ADF), Mann-Kendall (MK) test, and Theil-Sen slope estimator are employed to investigate the non-stationarity in flood timing, i.e. the date of occurrence (DO) of annual maximum flow.Finally, a machine learning (ML) model is developed, with the DO as predictand and the local and global climate drivers as predictors.Subsequently, our study estimates the predictor importance metric to identify dominant drivers of DO.

Data
The current study obtains the daily river stage and discharge information from 384 monitoring stations maintained by the CWC, Government of India, across ten peninsular rivers.However, following a quality check to exclude erroneous and short time series, 281 stations were finally selected (details related to the quality check are discussed in appendix A.1 and SI figures 1 and 2).Importantly, stations with a minimum of 10 years of gauge data are considered for this study, and the length of the daily time series might vary from station to station (see SI figure 3).The location and spatial extent of the basins of these rivers are presented in figure 1(a).The number of stations considered for each river basin, along with the percentage of the total number of stations, are presented in figure 1(b), which shows that Godavari, Krishna and Mahanadi are the top three contributors.Since CWC daily gauge data is considered more accurate than the daily measured discharge, and several technical reasons can influence direct discharge monitoring, we estimate the AMGL as a proxy for the annual maximum flood (AMF).AMGL is estimated using the block maxima approach.Further, we obtain the daily gridded precipitation and maximum and minimum temperature data from the India Meteorological Department (Srivastava et al 2009, Pai et al 2014).Furthermore, we obtained the monthly climate variability modes -Oceanic Nino Index (ONI) representing ENSO, Dipole Mode Index (DMI) representing IOD, NAO and PDO indices from the National Oceanic and Atmospheric Administration (www.ncei.noaa.gov/access/monitoring/products/).The shapefiles of the basins are downloaded from HydroBASINS, while HydroBASINS has been extracted from the gridded HydroSHEDS core layers at 15 arc-second resolution (Lehner and Grill 2013).

Methods: seasonality analysis
Traditional linear statistical techniques must be improved when analyzing cyclic or periodic data.Hence, circular statistics, also known as directional statistics, have been used to study the seasonality of hydrological extreme events (Mardia and Jupp 2000, Jammaladak 2001).Our study constructs the DO of AMGL for all 281 stations across the ten river basins.Circular statistics are then applied to the DO values, obtained by converting the Julian days into angular measurements.The DO is then revised as the angle made by a point on a circle, typically ranging from 0 to 360 degrees (or 0 to 2π radians).In this context, the mean direction (θ) can be defined as the resultant direction of a set of 'n' AMGL events at a station, which determines flood seasonality.Furthermore, the mean resultant length (ρ) is calculated to assess the variability in flood occurrences, with a ρ value close to 1 indicating a dominant seasonal pattern and a value approaching 0 suggesting uniformity, particularly in circumstances where flood occurrences are evenly distributed around the circle.In other words, the mean resultant length is similar to a seasonality index (SI), as considered in several former studies (for example, Basu et al 2023).The circular standard deviation (CSD), which measures the variability in mean direction, is also calculated.Additional details related to computations of the circular statistics are provided in appendix A.2. Non-stationarity in the DO time series (in Julian days) is tested using an ADF test.Meanwhile, non-linear trends in the DO series for a station are investigated using the MK test and Sen's slope, popular techniques for estimating trends in hydroclimate data (Sen 1968, Kendall 1975).

Driver identification based on random forest regression
A random forest regression (RF) is deployed for each basin to identify the significant drivers of DO.RF is a supervised machine-learning technique that follows the logic of multiple trees creating a forest.RF is built based on a bagging technique that allows multiple decision trees to run simultaneously without learning from each other.The current study develops an RF model considering the DO as predictand and 36 predictor variables, including station location, teleconnections, and extreme meteorological indices as predictors (list of predictors is presented in SI table 2).We developed an Expert Team on Climate Change and Detection Indices (ETCCDI) based on basin-average daily precipitation and temperature time series (Karl et al 1999, Peterson et al 2001).We note that a few predictor variables, like cold and warm nights, may not directly correlate with extreme flows.Nevertheless, the ML model will only retain important predictors through the predictor importance analysis.We do not consider the impact of urbanization and local flow regulations as predictors of DO since Villarini (2016) found that while flow regulations have limited influence on the DO, urbanization rarely impacts the DO of AMGL.
Since ML models require a large set of observations for model training, the predictand and predictor sets of all stations in a basin are pooled to increase the sample space.The RF performance is evaluated based on the root mean square error (RMSE), the coefficient of determination (R 2 ), and the Nash-Sutcliffe coefficient (NSE).Furthermore, a three-fold crossvalidation is conducted to ensure no data overfitting.A diagnostic metric, the predictor importance, is estimated to identify the relative importance of the predictors (equation ( 1)).Predictor importance (PI) provides a comprehensive measure of the influence of each predictor in a machine-learning model (Breiman 2001), where ntree refers to the number of trees in the random forest, B(t) represents the out-of-bag sample for tree t, P Before X j (i ) is the predicted class for observation i before permuting the value of the variable X j and P After X j (i ) is the predicted class for observation i after permuting the value of the variable X j .

Results and discussion
The primary statistical attributes related to seasonality in the AMGL are presented in figure 2. The study considers the mean direction of the AMGL (figure 2  discussed the association between soil moisture and flood seasonality from a theoretical perspective.In summary, strong seasonality is evident in the peninsular rivers, with 40% (54%) of the stations exhibiting an SI above 0.9 (between 0.75 and 0.9).Moreover, SI values exceeding 0.9 are observed for the eastern part of Godavari and the western parts of Krishna, Narmada, Mahi, and Mahanadi.
As evident from figure 2(b), the eastern upstream gauge stations primarily show strong seasonality for Krishna, with 19 out of the 60 stations exhibiting an SI greater than 0.9.In contrast, for Godavari and Mahanadi, nearly half of the gauge stations, especially those located in the mid-reach and downstream areas of the basin, exhibit an SI greater than 0.9 (details presented in appendix A.2). Given that over 54% of the stations record an SI between 0.75 and 0.9, it is probable that the inter-annual variability in the AMGL may limit the strength of the observed seasonality.Furthermore, our study estimates the CSD to understand the variations in SI (figure 1(c)).A lower (higher) value of CSD represents a high (low) concentration of AMGL dates along with low (high) dispersion.In this study, most stations exhibit a CSD between 0.3 and 0.6, indicating moderate interannual variation in AMGL occurrences.Importantly, stations with lower CSD values, specifically between 0.2 and 0.3, are found to be typically associated with SI values greater than 0.9.Therefore, it is concluded that the CSD is a relative measure that should not be directly compared to the SI.Additionally, the presence of strong seasonality in peninsular Indian rivers is confirmed.
Following the seasonality analysis, we investigate the temporal trends in the DO of AMGL.In this context, it should be noted that this study considers the DO of AMGL at a specific gauge station in terms of Julian days for each year, resulting in the creation of a distinct time series, referred to as the DO time series, which is separate from the AMGL series.We found that autocorrelation is not statistically significant in the DO series (see SI figure 5).Results related to the ADF test confirm that most of the peninsular gauge stations experience non-stationarity in the DO of AMGL, indicating changes in DO over time (figure 3(a)).ADF test does not identify the source of non-stationarity (i.e.trend/jump/periodicity) but searches for a unit root in a time series (Haan 1995).Furthermore, it is noted that two gauge stations in the same basin located nearby may experience stationary and non-stationary DO, respectively, possibly due to variations in the anthropogenic influences (such as local flow regulations; we lack basin-level flow regulation information) on DO across stations.Results related to the MK test and the Sen's slope are presented in figures 3(b) and (c).Figure 3(b) shows the gauge stations exhibiting increasing or decreasing DO with a 95% confidence level.An increase (decrease) in DO indicates delayed (early) floods.We found that only 8% of the 281 stations experience a statistically significant trend in DO.Furthermore, only 5% of the stations experience delayed floods compared to 3% of the stations that encounter early floods.Results of Sen's slope showed that only 8% of the stations exhibit a non-linear trend that is statistically significant.Figure 3(c) presents the DO trend in terms of Julian days per year, indicating that floods may be delayed by seven days per year, depending on the length of the dataset.In summary, the DO of AMGL for the peninsular rivers experience substantial non-stationarity, primarily due to natural variability in the monsoon system rather than anthropogenic land use or the impact of climate change.
In the final analysis, our study adopts an ML model to identify the most significant drivers of DO.The model calibration and validation results are presented in table 1.For this analysis, 80% of the data are used for model calibration, while 20% are used for validation under a three-fold validation framework.However, despite our efforts, the calibration results indicate that the model overfits the data.We found that the model exhibits low RMSE with high R 2 and NSE during calibration, indicating that it could accurately yield variations in the predictand.Additionally, model performance remains similar across all ten basins.However, overfitting deteriorates performance during the validation period, exhibiting R 2 between 0.3 and 0.7.While Brahmani-Baitarani, Cauvery, Krishna, and Mahanadi record an NSE of around 0.7, Mahi exhibits a poorer performance than the other basins, registering an NSE and R 2 of around 0.36.We employ three additional ML models (Extreme Gradient Boosting-XGBoost, Multi Linear Regression, and Support Vector Regression) to search for a superior ML model (details related to these models are discussed in the appendix).Results indicate that the Random Forest model performs better than the other models, with lower RMSE and higher R 2 values (SI figure 6).
Furthermore, this study estimates the predictor importance following a leave-zero model fitting using the same sets of predictand and predictors (figure 4).Since predictor importance is primarily a relative measure of the importance of predictors, the results are illustrated as circular plots.Each circular plot in figure 4 indicates the eight most important predictors explaining the variations in the DO series for a basin.The relative length of the bars represents the magnitude of importance for a particular predictor.The circular plots show that PDO and ONI are DO's two most important drivers.In a few basins (such as Cauvery, Krishna, and Pennar), the DMI also emerges as one of the three most important predictors of DO.It should be noted that local extreme meteorological factors rarely appear to be important predictors explaining DO variability.Additionally, among the various statistical attributes of the climate variability modes, either the climate index for the month of DO (referred to as GPDO, GONI, etc) or the maximum climate index value for a year (referred to as MPDO, MONI, etc) present the highest predictor importance.In addition to AMGL, we conduct driver-identification analyses of seasonal maximum gauge level (SMGL) for pre-monsoon, monsoon, post-monsoon, and winter following the framework described earlier.Results indicate that climate variability modes play a consistent role as the most influential drivers for SMLG for four seasons.Overall, it is observed that climate variability modes have a higher impact on the DO of AMGL than local extreme meteorological conditions.

Summary and discussion
The major findings of our study are as follows.
1. Peninsular rivers experience strong seasonality, with floods primarily occurring during August.However, floods occur primarily between October and November in three specific southern rivers.This indicates that the southeast and northwest monsoons play critical roles in determining the seasonality of floods concerning the climate regime of a river.2. Although most monitoring stations considered in this study experience non-stationarity in the DO of floods, a few stations yielded a non-linear temporal trend, highlighting that the influence of natural climate variability is more significant than anthropogenic influence on the DO series.3. PDO and ENSO are the two most prominent drivers of DO variation.In contrast, local extreme meteorological conditions have little to no impact governing early or late floods.
Since most floods in peninsular India occur during August (during the southwest monsoon), the soil is expected to be fully saturated (Varikoden and Revadekar 2018), pointing to the limited role of antecedent soil moisture in generating floods.However, Godavari, Krishna, and Cauvery often experience floods from October to November-the second leg of the monsoon-possibly caused by the cyclones and deep depressions frequently observed in the postmonsoon months.It is worth noting that floods usually synchronize with several types of sectoral decision-making, such as reservoir and irrigation planning and integrated reservoir management.In this context, the results of our study confirm that peninsular Indian reservoirs should prepare themselves for floods before August since an almost full reservoir at the beginning of this month may restrict flood modulation, eventually leading to disasters similar to the 2018 Kerala floods (Central Water Commission 2018).Although the experimental set-up in this study does not investigate whether early or late floods can be expected during active PDO and ENSO phases, it confirms that water managers may expect a deviation from traditional DO seasonality during the active phases of global teleconnections.
We understand that the DO of AMGL is typically determined by the DO of Annual Maximum Rainfall since the time of concentration of floods across pen-insular Indian basins is less than a day (Mehta et al 2022).The magnitude of AMR follows a complex relation with climate variability modes with the warm phase of PDO and El Niño (the cold phases of PDO and La Niña) resulting in higher extreme rainfall across India (Krishnan andSugi 2003, Vinod andMahesha 2024).Former studies found that El Niño suppresses the convections in the Indian subcontinent while La Niña enhances the convection (Gadgil et al 2004).Further, several former studies noted that the onset of monsoon is influenced by climate variability modes, particularly ENSO, as a suppressed convection results in the late arrival of monsoon winds in the southern coasts of the country (Pradhan et al 2017, Choudhury et al 2021b).Furthermore, climate variability modes can modulate the wet spells and dry spells of monsoon, which is the sub-seasonal variability of monsoon, leading to increasing hydrometeorological controls on the day of occurrence of AMGL (Dwivedi et al 2015).We note that either a delay or early arrival of monsoon subsequently impacts the DO of AMR, which further determines the DO of AMF.Interestingly, ongoing research from the current group found that an AMR-AMF direct dynamics is not always evident, as AMR might not translate into AMF, mainly when regional hydrometeorological conditions and local flow regulations are active.
One major criticism of the current study may be that it does not account for the bimodality of floods, which is why multiple floods (typically, two floods) may occur within a year.Although bimodality in the southern basins is not uncommon, there is limited knowledge available on the changes observed in this bimodality over the years.Furthermore, the current experimental setup can be extended to investigate the second-highest peak flow in a year and its associated driver.However, modeling the bimodality in floods is a research topic beyond the scope of the current study.An additional concern related to the development of the proposed ML model is that the DO series from all available monitoring stations within a basin were pooled in this study, which could lead to overfitting during calibration.However, pooling the data from all stations within a basin involves spatio-temporal variability in the DO series, further modeled by random forest regression.In this context, Bayesian hierarchical models may also identify climate drivers as an alternative to ML models.
In conclusion, strong seasonality in peninsular India warrants additional investigations into the connection between seasonality and spatio-temporal rainfall patterns and its causal relation with global teleconnections.The current study confirms that the occurrence of floods is closely associated with the frequency and magnitude of hydrometeorological events and large-scale oceanic and atmospheric circulations.
(a) Multi-linear regression (MLR): MLR is a linear method that models the relationship between dependent and multiple independent variables (Uyanık and Güler 2013).It formulates this relationship using a linear equation, estimating coefficients for each independent variable during training.The primary goal of MLR is to minimize the sum of squared differences between actual and predicted values.MLR parameters include coefficients for each independent variable and an intercept term (Uyanık and Güler 2013).In this study, ridge regression, a regularization technique, is employed to mitigate overfitting by penalizing large coefficients.
(b) Support vector machine (SVM): SVM is a regression analysis method that discovers a hyperplane in a high-dimensional space that best fits the data points.Unlike traditional regression, SVM focuses on defining a function that maintains a specified margin of error ε around the actual values (Awad and Khanna 2015).The model utilizes support vectors, the data points closest to the hyperplane, to delineate this margin.We explore the SVM parameters such as the radial basis function kernel, a regularization parameter set to C = 1, and epsilon (ε) = 0.1, which governs the margin's width and acceptable error.
(c) Random forest regression (RF): RF regression is an ensemble learning technique that aggregates predictions from numerous decision trees to enhance accuracy and counteract overfitting (Tyralis et al 2019).Each decision tree within the forest is trained on a random subset of the training data, making predictions accordingly.In the current study, we adaptively tuned RF parameters to optimize performance, resulting in 500 trees in the forest, a minimum node split of 2 samples, and a minimum leaf sample requirement of 1.
(d) Extreme gradient boosting (XGB): XGB is an advanced gradient-boosting algorithm tailored for speed and performance (Niazkar et al 2024).It constructs an ensemble of weak prediction models, typically decision trees, sequentially, with each subsequent model rectifying errors made by its predecessors (Niazkar et al 2024).XGB employs gradient descent to minimize a pre-defined loss function while incorporating new models.Parameters in XGB encompass the learning rate, which governs the contribution of each new tree, maximum tree depth, ensemble size, and regularization parameters like gamma and lambda to curb overfitting.Furthermore, parameters pertain to the boosting method and data subsampling for each boosting round.In this study, XGB parameters were adaptively tuned, yielding optimal values: Learning Rate (eta): 0.1, Maximum Depth: 5, Number of Trees: 1000, Gamma: 0.1, and Lambda: 1.0.

Figure 1 .
Figure 1.(a) Map of peninsular India depicting the ten river basins.Locations of the basins concerning the country map are shown in the inset, and (b) a bar plot of the number and percentage of stations (in parentheses) considered in this study from each river basin.
Former studies have linked flood characteristics to climate variability modes (Ward et al 2016, Bhatla et al 2020, Mohan et al 2023).Unlike previous research on floods focusing on yearly or seasonal averages of climate variability modes (Najibi and Devineni 2018, Gurrapu et al 2023), our study examines extreme values of climate variability modes.
(a)), the SI (figure 2(b)), and the CSD (figure 2(c)) as the three statistical attributes of AMGL seasonality.The mean direction refers to the

Figure 2 .
Figure 2. The maps presenting the (a) mean direction month, (b) seasonality index (SI), and (c) circular standard deviation (CSD) of the DO of AMGL across the river basins-Top panel.The tables show the corresponding summary statistics for each river-bottom panel.

Figure 3 .
Figure 3. Maps presenting the results of the (a) augmented Dickey-Fuller test, with the red and blue dots representing stationarity and non-stationarity in the DO, respectively; (b) Mann-Kendall trend test representing the direction of trend, with the red, blue, and grey dots denoting decreasing trend, increasing trend, and no significant trend in the DO, respectively; (c) Sen's slope test showing the magnitude of the trend of DO in Julian days.

Figure 4 .
Figure 4. Radial plots reflecting the importance of the predictors influencing the DO of annual maximum floods.The length of the bars represents the importance of the corresponding predictor, with the longest bar denoting the covariate with the highest importance.

Table 1 .
Model calibration and validation values of RMSE, R2, and NSE for each river basin.