The Choice of an Appropriate Information Dissimilarity Measure for Hierarchical Clustering of River Streamflow Time Series, Based on Calculated Lyapunov Exponent and Kolmogorov Measures

The purpose of this paper was to choose an appropriate information dissimilarity measure for hierarchical clustering of daily streamflow discharge data, from twelve gauging stations on the Brazos River in Texas (USA), for the period 1989–2016. For that purpose, we selected and compared the average-linkage clustering hierarchical algorithm based on the compression-based dissimilarity measure (NCD), permutation distribution dissimilarity measure (PDDM), and Kolmogorov distance (KD). The algorithm was also compared with K-means clustering based on Kolmogorov complexity (KC), the highest value of Kolmogorov complexity spectrum (KCM), and the largest Lyapunov exponent (LLE). Using a dissimilarity matrix based on NCD, PDDM, and KD for daily streamflow, the agglomerative average-linkage hierarchical algorithm was applied. The key findings of this study are that: (i) The KD clustering algorithm is the most suitable among others; (ii) ANOVA analysis shows that there exist highly significant differences between mean values of four clusters, confirming that the choice of the number of clusters was suitably done; and (iii) from the clustering we found that the predictability of streamflow data of the Brazos River given by the Lyapunov time (LT), corrected for randomness by Kolmogorov time (KT) in days, lies in the interval from two to five days.


Introduction
Cluster analysis (also called clustering) is employed to identify the set of objects with similar characteristics or identify groups, and has a broad range of applications in science (e.g., biology, computational biology and bioinformatics, medicine, hydrology, geosciences, business and marketing, computer science, social science, and others). The analysis hypothesizes that the objects in the same group are more similar to each other than to those in other groups. The question; however, arises:  Table 1. Basic descriptive statistics of the daily discharge data of the Brazos River for the period ; (the first number indicates the order of the station used in this study).  Daily discharge data were standardized (i.e., for each calendar day "i" means discharge x i and standard deviation SD i , over the year, "j", were computed and then the standardized discharge on day "i" in year "j" was calculated as y i,j = x i,j − x i /SD i [13]. This procedure removes any seasonal effects.

Basic Descriptive Statistics
Basic descriptive statistics of daily discharge data of the gauging stations are summarized in Table 1, where for each station the mean, median, minimum, maximum, interquartile range (IQR), and standard deviation (SD i ) are shown. It is seen from Table 1 that the differences between the maximum and the mean values are in the range of roughly 10 to 40 standard deviations, strongly positively skewed, indicating a power law behavior. Indeed, frequency counts for the USGS 08082500 Brazos River station at Seymour, Texas (USA), displayed in Figure 2 on a log-log scale, demonstrated a power law distribution, with similar behavior also observed at all other stations. Daily discharge data were standardized (i.e., for each calendar day "i" means discharge 〈 〉 and standard deviation SD , over the year, "j", were computed and then the standardized discharge on day "i" in year "j" was calculated as , = , − 〈 〉 /SD [13]. This procedure removes any seasonal effects.

Basic Descriptive Statistics
Basic descriptive statistics of daily discharge data of the gauging stations are summarized in Table 1, where for each station the mean, median, minimum, maximum, interquartile range (IQR), and standard deviation (SD ) are shown. It is seen from Table 1 that the differences between the maximum and the mean values are in the range of roughly 10 to 40 standard deviations, strongly positively skewed, indicating a power law behavior. Indeed, frequency counts for the USGS 08082500 Brazos River station at Seymour, Texas (USA), displayed in Figure 2 on a log-log scale, demonstrated a power law distribution, with similar behavior also observed at all other stations.

Method
In this section, we describe the selected dissimilarity measures (compression-based dissimilarity measure, permutation distribution dissimilarity measure, and Kolmogorov distance), used in the average-linkage clustering hierarchical algorithm, which was applied to streamflow data measured from 12 gauging locations on the Brazos River in Texas (USA). We also briefly consider three information measures (the largest Lyapunov exponent, Kolmogorov complexity, and the highest value of the Kolmogorov spectrum) used for K-means clustering based on these information measures.

Choice of Measures for Characterization of Streamflow for Clustering
Cluster analysis of gauged streamflow records into regions is an important tool for the characterization of hydrologic systems. To that end, the distance between two gauge stations, and , is frequently measured by the Euclidean distance (ED) [14][15][16], which is expressed as:

Method
In this section, we describe the selected dissimilarity measures (compression-based dissimilarity measure, permutation distribution dissimilarity measure, and Kolmogorov distance), used in the average-linkage clustering hierarchical algorithm, which was applied to streamflow data measured from 12 gauging locations on the Brazos River in Texas (USA). We also briefly consider three information measures (the largest Lyapunov exponent, Kolmogorov complexity, and the highest value of the Kolmogorov spectrum) used for K-means clustering based on these information measures.

Choice of Measures for Characterization of Streamflow for Clustering
Cluster analysis of gauged streamflow records into regions is an important tool for the characterization of hydrologic systems. To that end, the distance between two gauge stations, i and j, is frequently measured by the Euclidean distance d ij (ED) [14][15][16], which is expressed as: where {x ik } and y jk k = 1, . . . , p are the streamflow values at the stations, while p is the period which can be daily, monthly, seasonal, or annual. Although there are many other distance metrics, this distance is frequently used as a dissimilarity measure in the clustering algorithms. Gong and Richman [15] showed that the majority of investigators (about 85%) applied this measure in their studies [5]. Despite the popularity of the ED measure in streamflow clustering, it has a drawback in that it assumes that the sample points are distributed about the sample mean in a spherical manner. If the distribution happens to be decisively non-spherical, for example ellipsoidal, then we would expect the probability of a "test point" belonging to the set to depend not only on the distance from the sample mean but also on the direction. The dynamic time warping (DTW) is a more general algorithm, based on ED, that enables the finding of the best alignment between time series that may have different lengths and/or local distortions [17,18]. Besides a shape-based measure, the dissimilarity of time series may be measured by comparing the features extracted from the original time series, such as autocorrelations, cross-correlations, spectral features, wavelet coefficients, and information measures, or by model approach [19]. Among many information measures we have tested in this study, we selected three measures for the characterization of streamflow: compression based dissimilarity measure (NCD), permutation distribution dissimilarity measure (PDDM), and Kolmogorov distance (KD). The normalized compression distance (NCD) provides a computable version of the normalized information distance (NID). It has been recommended for application in bioinformatics, music clustering, linguistics, plagiarism detection, image similarity, question answering, and many other fields [20]. This measure has a broad application in the clustering of heterogeneous data. Therefore, it can be used for clustering streamflow, for example, in optimizing streamflow monitoring networks on the basis of daily streamflow data.
Permutation distribution dissimilarity measure (PDDM) is a complexity-based approach to clustering time series. The dissimilarity of time series is formalized as the squared Hellinger distance between the permutation distributions of embedded time series [21]. This method has not been used in hydrology that often in the past. However, recently some authors used it for multiscale parameter regionalization implemented within a spatially-distributed mesoscale hydrologic model, clustering streamflow time series for regional classification, and establishing relationships between the regionalization and streamflow indices [22][23][24].
The Kolmogorov complexity distance (KD) has become an important tool in a wide variety of applications [25]. It has also been applied in hydrology in scaling problems, since the heterogeneity of catchments and the variability of hydrological processes make scaling (which is performed either in a deterministic or a stochastic framework) so difficult [26]. In this study, we used this measure for clustering streamflow. To our knowledge, this measure for this purpose has not been applied yet.

Normalized Compression Distance
A normalized information distance (NID) between two objects (time series, images, texts) x and y is given by: where the conditional Kolmogorov complexity K(x|y) of x, given y, is the length of the shortest program producing x when y is given as an auxiliary input in the program. The NID is theoretically appealing, but not practical, since it cannot be computed. In this subsection, we consider the normalized compression distance (NCD), an efficiently computable, and thus practically applicable, form of the normalized information distance. One approach for computing NID is approximating Kolmogorov complexity by the length of the compressed objects obtained from some data compressors (gzip, bzip2, xz). Using the approximation K(x|y) ≈ K(xy) − K(y), the normalized compression distance (NCD) is defined as: where C is a chosen data compressor, and C(xy) is the size in bytes of the series x and y concatenated. The function NCD from the R 3.5.1 package TSclust, applied in the calculation, selects the best compression algorithm separately for x,y and concatenated xy [27]. Sometimes, for simplicity, it is advisable to present calculations of the clustering matrix in the form of pseudocode, which is a detailed yet readable description of what a computer program or algorithm must do, expressed in a formally-styled natural language, rather than in a programming language. The pseudocode for calculating the NCD has the following steps:

2.
Set the number of time series compressed by the chosen compressor C to N.

3.
Set all elements of clustering matrix M C (N, N) to zero.

4.
Calculate Kolmogorov complexity by the length of the compressed time series obtained from some data compressors C(x), C(y).

5.
Calculate C(xy) which is the size in bytes of the time series x and y concatenated. 6.
Find the lower value of {C(x), C(y)}.
Set the calculated value into M C (i, j) i = 1, N − 1; j = i + 1, N.

Permutation Distribution Dissimilarity Measure
Permutation distribution dissimilarity measure (PDDM) is based on the distance of distributions of permutations [28]. On the basis of given time permutation ∏(X m ) obtained by sorting X m in the ascending order is recorded, and the distribution of permutations is denoted by P(x t ). The dissimilarity between two time series is measured by the dissimilarity of their permutation distributions. One approach is based on Kullback-Leibler (KL) divergence (relative Shannon entropy). Taylor approximation of KL divergence is the squared Helling distance of discrete probability distributions P = (p 1 , p 2 , . . . , p n ) and Q = (q, q 2 , . . . , q n ): , where 2 is the Euclidean norm. The pseudocode for calculating the PDDM has the steps:

1.
Set all elements of clustering matrix M C (N, N) to zero.
For given time series, the m-dimensional embedding with time delay t is X m = . Sort x m in the ascending order to get permutation ∏(X m ) for each x m . 5.
Obtain the distribution of permutations P(x t ). 6.

Kolmogorov Complexity Distance (KD)
Kolmogorov complexity distance is defined using the conditional complexity as: Since We get While KD, as given by (5), is a non-negative and symmetric quantity, it does not in general satisfy the triangle inequality. Therefore, after calculating KD distance matrix using (5), all pairs are checked: if for a given pair of objects x,y it turns that KD xy > min z KD xz + KD zy , then the distance is set to KD xy = min z KD xz + KD zy , and all the pairs are checked anew. The true distance is computed by iterating this procedure until for all x,y and z the triangle inequality is satisfied KD xy ≤ KD xy + KD zy .
When the matrix D (mxm) of distances of all pairs of p objects is obtained by any of the three selected information distances, hierarchical clustering is performed. The average linkage clustering defines the distance between any two clusters to be the average of distances of all pairs of objects from any member of one cluster from any member of the other cluster. The pseudocode for KD has the following steps:

1.
Set all elements of clustering matrix M C (N, N) to zero.

3.
Check for all pairs: If for a given pair of time series x,y it turns that KD xy > min z KD xy + KD zy then the distance is set to KD xy = min z KD xy + KD zy .

4.
The true distance is computed by iterating this procedure until for all x, y and z the triangle inequality is satisfied KD xy ≤ KD xy + KD zy .

5.
Set the calculated value of KD xy into M C (i, j) i = 1, N − 1; j = i + 1, N.

Calculation of Largest Lyapunov Exponent and Kolmogorov Measures
Because the rate of separation can be different for different orientations of the initial separation vector, there is a spectrum of Lyapunov exponents whose largest value is commonly assigned as LLE. A positive value of this exponent is usually taken as an indication that the system is chaotic. In this study, we obtained the LLE for the standardized daily discharge time series by applying the Rosenstein algorithm [29], which was implemented in MATLAB program [30]. This algorithm is fast, easy to apply, and robust to changes in the embedding dimension, reconstruction delay, length of time series, and noise level. The applied MATLAB program calculates the proper embedding dimension and reconstruction delay. The value of embedding dimension is selected by the FNN (false nearest neighbors) method or the symplectic geometry method in the case of high noisy data [31,32]. The Kolmogorov complexity and its derivates (the Kolmogorov spectrum and its highest value) are calculated using the Lempel Ziv algorithm, which is widely described in [33].

Selection of Information Measures for K-Means Clustering of Daily Streamflow
The question arises: How to select the information measures for K-means clustering? We selected three measures (i.e., the largest Lyapunov exponent (LLE), Kolmogorov complexity (KC), and the highest value of the Kolmogorov complexity (KCM). This choice was made for the following reasons.  Table 1.
(ii) There are many factors, both natural (runoff from rainfall and snowmelt; evaporation from soil and surface-water bodies; transpiration by vegetation; ground-water discharge from aquifers; ground-water recharge from surface-water bodies; sedimentation of lakes and wetlands, etc.) and human-induced (surface-water withdrawals and transbasin diversions; river-flow regulation for hydropower and navigation; dams; construction, removal, and sedimentation of reservoirs and storm water detention ponds; stream channelization and levee construction; drainage or restoration of wetlands; land-use changes, such as urbanization, that alter rates of erosion, infiltration, overland flow, and evapotranspiration; wastewater outfalls; and irrigation wastewater return flow), that cause continuous changes in streamflow time series and; therefore, in its nonlinearity and complexity of the Brazos River and its drainage basin. For example, a huge human intervention was the Morris Sheppard Hydroelectric Power Plant at Morris Sheppard Dam (Possum Kingdom Reservoir) on the Brazos River in Palo Pinto County, built in the period 1938-1941 (11 miles southwest of Graford and 18 miles northeast from Mineral Wells). Currently, "USGS station 3_08088610 (Brazos River near Graford, Texas) is located approximately 1.25 miles downstream of Possum Kingdom Reservoir. As such, this site is largely influenced by regulation. This gage was established to monitor outflow from Possum Kingdom Reservoir. The gage was initially located farther upstream, closer to the outflow from the reservoir. In 1995, the gage was moved downstream to the current location" [34]. Another regulation on the Brazos River was the Aquilla Lake, which is an artificial lake in Hill County. The dam for this regulation was constructed by the U.S. Army Corps of Engineers. This dam is part of the overall flood control project in the Brazos River basin (station 7_08093100).
(iii) Because streamflow processes are unavoidably influenced by measurement at gauging stations (including uncertainties in the single determination of river discharge [35]) and dynamical noises that increase Lyapunov exponents under the influence of noise, then these factors were taken into considerations.

Largest Lyapunov Exponent (LLE)
The perpetual debate over whether hydrological systems are deterministic or stochastic has been taken to a new level by controversial applications of nonlinear dynamics tools. Lyapunov exponents, perhaps the most informative invariants of a complex dynamical process, are also among the most difficult to determine from experimental data, although when using embedding theory to build chaotic attractors in a reconstruction space, extra "spurious" Lyapunov exponents arise that are not Lyapunov exponents of the original system [36,37]. Some hydrologists have discussed the difficulties and uncertainties in discerning between low-dimensional chaotic and stochastic systems using Lyapunov exponents and correlation dimension measures [38][39][40]. Thus, for the analysis of weak chaos, generating two phenomena from the normal functioning of the same system, the LLE has to be utilized carefully. In real physical systems, the structure of chaos is more complex than in truly random processes [41]. Systems with chaotic dynamics usually contain islands of stability. Accordingly, if the larger is the covering factor of the islands of stability, the weaker is the chaos. Intermittency is also one of the manifestations of the weak chaos [42]. Intermittent behavior is frequently observed in fluid flows that are turbulent or near the transition to turbulence. There are also numerous examples of weak chaos in hydrology. A further quantitative measure of weak chaos is the low dimension (close to 2) of the strange attractors characterizing their dynamics [43]. Wu et al. [44] offered another quantification of weak chaos when LLE is less than 0.1. They noted "If emergence is unapparent, the emergent time may be misjudged, which may lead to erroneous calculation of LLE. However, the LLE at a longer time is still positive, which manifests that chaos exists". Table 2 shows the LLE of standardized daily discharge data, indicating that station 3_08088610 (Graford) and1_08082500 (Seymour) had the highest value of LLE (0.394 and 0.158, respectively), while all other stations had values in the interval (0.018, 0.061) (i.e., in the region of weak chaos). Using LLE as an indicator, [45] established the presence of low chaos in the daily streamflow of the Kizilirmak River (Iran), with a positive value of LLE (0.0061). Similarly, in forecasting of daily streamflow of Xijiang River (China), [46] reported that LLE was 0.1604. The streamflow of station 1_08082500 (Seymour) had a value of 0.158 for LLE, which is approximately 2.6-8.8 times larger than for other stations. Following the criterion by [44], the streamflow, measured at this station, exhibited high chaotic behavior. In our opinion it comes from several reasons: (i) Uncertainties as a result of errors in the field determination of discharge [35,47]. In a project report, Ward [48] gave a judgment of the quality of field in the measurement of discharge for eleven selected Texas stream gauges (for the period 1987-2011), presumably based on the conditions of field work. He reported that the uncertainties, as relative standard errors (RSE), for discharge measurements for all stations, in general, were considerably larger than recommended by [47]. Surprisingly, 1_08082500 (Seymour) had the highest values of RSE (relative standard error)-188%-pointing to a high level of variability and a potential source of high nonlinearity (higher values of LLE) and randomness. (ii) The Seymour gauging station posted an extremely high sediment yield, while the next downstream gage (South Bend) showed a considerable decrease. In fact, sediment yields at Seymour (1220 t km −2 yr −1 ) were the highest among all the gauging stations on the Brazos River, whose average annual suspended-sediment yield is generally considered the highest of all rivers in the state of Texas [49]. Having in mind that a nonlinear relationship is inherent in the streamflow-suspended sediment relationship [50], the higher value of LLE could be attributed to this phenomenon. The highest value of LLE (0.394) for 3_08088610 (Graford) station is a result of changed river streamflow dynamics because of the Sheppard Hydroelectric Power Plant at Morris Sheppard Dam (Possum Kingdom Reservoir) that is built on the Brazos River in Palo Pinto County. More details about the change in the nonlinearity and randomness of streamflow for this gauge station can be found in [51]. The Kolmogorov complexity measures applied in this paper sheds additional light on the complex behavior of streamflow. The values of KC and KCM of standardized daily discharge data are shown in Table 2, which shows that the KC values for all daily streamflow time series were relatively small, ranging in the interval from 0.200 to 0.474. Similar behavior was observed for KCM, having values in the intervals from 0.252 to 0.682, which is expected for lowland rivers in contrast to mountain rivers, whose KC values can be up to 0.98 [51,52]. From Table 2 it is seen that there were three peaks for KC: 0.474, 0.352, and 0.316 for stations 3_08088610 (Graford), 7_08093100 (Aquilla), and 9_08098290 (Highbank), respectively. The highest value of KC (3_08088610) was a result of the human activity (i.e., the building of a hydroelectric power plant changing the streamflow dynamics; see the previous subsection). It would be interesting to clarify the appearance of peaks in KC values for 7_08093100 (Aquilla) and 9_08098290 (Highbank) stations. Both stations had low values of LLE (0.055 and 0.061), which belong to the domain of weak chaos (i.e., they are very close to zero). We now had a situation of the occurrence of stochastic behavior (high randomness), although LLE indicated a stable state. Vilela Mendes [53] explained this situation in the following way. The idea is that the dynamics is simple to describe in law, but not that it has simple orbits. In short, a dynamical law with small sophistication but capable of generating orbits of high Kolmogorov complexity. According to [54], "sophistication" is defined as the size of the projectable part of the string's minimal description and formalizes the amount of planning which went into the construction of the string. Note that an additional source of the occurrence of peak for 7_08093100 (Aquilla) station was because of human intervention (i.e., the presence of the Aquilla dam).

Hierarchical Clustering of Daily Streamflow
Starting with a dissimilarity matrix based on the compression-based dissimilarity measure (NCD), permutation distribution dissimilarity measure (PDDM), and Kolmogorov distance (KD) for daily streamflow discharge data from twelve gauging stations on the Brazos River in Texas (USA), for the period 1989-2016, the agglomerative average-linkage hierarchical algorithm was applied. This algorithm consists of a series of successive fusions of the objects into groups culminating in the stage where all objects are in one group. At any stage in the procedure, two objects or groups of objects which are the closest are fused together. The average-linkage clustering defines a distance between any two groups of objects (clusters) to be the average of distances of all pairs of objects, from any member of one cluster to any member of the other cluster. The tree diagram (dendogram) gives the stages in the aggregation of gauging stations in clusters (Figure 3). The vertical axis is used to indicate the distances at which the joining occurs.

Hierarchical Clustering of Daily Streamflow
Starting with a dissimilarity matrix based on the compression-based dissimilarity measure (NCD), permutation distribution dissimilarity measure (PDDM), and Kolmogorov distance (KD) for daily streamflow discharge data from twelve gauging stations on the Brazos River in Texas (USA), for the period 1989-2016, the agglomerative average-linkage hierarchical algorithm was applied. This algorithm consists of a series of successive fusions of the objects into groups culminating in the stage where all objects are in one group. At any stage in the procedure, two objects or groups of objects which are the closest are fused together. The average-linkage clustering defines a distance between any two groups of objects (clusters) to be the average of distances of all pairs of objects, from any member of one cluster to any member of the other cluster. The tree diagram (dendogram) gives the stages in the aggregation of gauging stations in clusters (Figure 3). The vertical axis is used to indicate the distances at which the joining occurs. The dendrogram gives the indication that the stations may be grouped either in three or four clusters. For comparison with the results of the latter analysis, we chose four clusters. The results of grouping were visualized on maps of the geographical locations of gauging stations on the Brazos River used in this study (Figure 4). If the compression-based dissimilarity measure was applied, the stations were distributed as: (i) Cluster 1 (1_08082500, 2_08088000, 5_08090800, and 6_08091000); Cluster 2 (3_08088610); Cluster 3 (4_08089000, 7_08093100, 8_08096500, and 9_08098290); and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). The hierarchical clustering based on permutation distribution dissimilarity measure gave: (i) Cluster 1 (1_08082500); Cluster 2 (2_08088000, 5_08090800, 6_08091000, and 9_08098290); Cluster 3 (3_08088610, 4_08089000, 7_08093100; and 8_08096500), and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). In the case when the dissimilarity measure was Kolmogorov complexity distance, the obtained grouping was: (i) Cluster 1 (1_08082500, 2_08088000, 5_08090800, 6_08091000, 8_08096500, and 9_08098290), Cluster 2 (3_08088610); Cluster 3 (4_08089000 and 7_08093100); and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). It may be noted that, in all cases, stations 10_08111500, 11_08114000, and 12_08116650 belonged to the same cluster. The dendrogram gives the indication that the stations may be grouped either in three or four clusters. For comparison with the results of the latter analysis, we chose four clusters. The results of grouping were visualized on maps of the geographical locations of gauging stations on the Brazos River used in this study (Figure 4). If the compression-based dissimilarity measure was applied, the stations were distributed as: (i) Cluster 1 (1_08082500, 2_08088000, 5_08090800, and 6_08091000); Cluster 2 (3_08088610); Cluster 3 (4_08089000, 7_08093100, 8_08096500, and 9_08098290); and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). The hierarchical clustering based on permutation distribution dissimilarity measure gave: (i) Cluster 1 (1_08082500); Cluster 2 (2_08088000, 5_08090800, 6_08091000, and 9_08098290); Cluster 3 (3_08088610, 4_08089000, 7_08093100; and 8_08096500), and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). In the case when the dissimilarity measure was Kolmogorov complexity distance, the obtained grouping was: (i) Cluster 1 (1_08082500, 2_08088000, 5_08090800, 6_08091000, 8_08096500, and 9_08098290), Cluster 2 (3_08088610); Cluster 3 (4_08089000 and 7_08093100); and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). It may be noted that, in all cases, stations 10_08111500, 11_08114000, and 12_08116650 belonged to the same cluster.   In computer science, the computational complexity of an algorithm is the amount of resources required for running it. The computational complexity of a problem is the minimum of the complexities of all possible algorithms for this problem (including the unknown algorithms). Here we shortly present a comparative analysis of the computational complexity of all the three used algorithms. We have chosen for criterion the computational cost, which depends not only on the size of the dataset, but also on the complexity of many other aspects. For comparison we used three times. The "user time" (UT) is the CPU time charged for the execution of user instructions of the calling process. The "system time" (ST) is the CPU time charged for execution by the system on behalf of the calling process. The first two entries are the total user and system CPU times of the current R (language) process and any child processes on which it has waited, and the third entry is the "real elapsed time" (RET) since the process was started. The focus of this paper is to suggest a suitable information dissimilarity measure for hierarchical clustering of river streamflow time series, but without going down into detailed aspects in comparison of the clustering algorithms we have used. However, in discussion, we cannot avoid the aspect of data size in time series clustering. Shortly, we will do it in the following way. Time series clustering is a very effective approach in discovering valuable information in various systems. However, focusing on the efficiency and scalability of these algorithms to deal with time series data has come at the expense of losing the usability and effectiveness of clustering. Aghabozorgi and Teh [55] proposed a method, which was compared with different algorithms and various datasets of dissimilar length, showing that this method outperformed other conventional clustering algorithms. They emphasized that the user does not require very low-resolution time series for clustering of large datasets; instead, the clustering can be applied on smaller sets of high dimension time series by the prototyping process. That is, the cost of using representatives is much less than the dimension reduction in terms of accuracy. In computer science, the computational complexity of an algorithm is the amount of resources required for running it. The computational complexity of a problem is the minimum of the complexities of all possible algorithms for this problem (including the unknown algorithms). Here we shortly present a comparative analysis of the computational complexity of all the three used algorithms. We have chosen for criterion the computational cost, which depends not only on the size of the dataset, but also on the complexity of many other aspects. For comparison we used three times. The "user time" (UT) is the CPU time charged for the execution of user instructions of the calling process. The "system time" (ST) is the CPU time charged for execution by the system on behalf of the calling process. The first two entries are the total user and system CPU times of the current R (language) process and any child processes on which it has waited, and the third entry is the "real elapsed time" (RET) since the process was started. The focus of this paper is to suggest a suitable information dissimilarity measure for hierarchical clustering of river streamflow time series, but without going down into detailed aspects in comparison of the clustering algorithms we have used. However, in discussion, we cannot avoid the aspect of data size in time series clustering. Shortly, we will do it in the following way. Time series clustering is a very effective approach in discovering valuable information in various systems. However, focusing on the efficiency and scalability of these algorithms to deal with time series data has come at the expense of losing the usability and effectiveness of clustering. Aghabozorgi and Teh [55] proposed a method, which was compared with different algorithms and various datasets of dissimilar length, showing that this method outperformed other conventional clustering algorithms. They emphasized that the user does not require very low-resolution time series for clustering of large datasets; instead, the clustering can be applied on smaller sets of high dimension time series by the prototyping process. That is, the cost of using representatives is much less than the dimension reduction in terms of accuracy.

K-Means Clustering of Daily Streamflow
The first step in the K-means clustering is the determination of the number of clusters K. The 3D scatter plot of points (KC i , KCM i , LLE i ) (i = 1, . . . , 12), which were calculated on the basis of daily discharge data recorded during the period 1989-2016 at twelve gauging stations on the Brazos River in Texas, suggested that K = 4, as shown by the 3D scatter plot in Figure 5. From this figure it can be seen that clustering closely followed the aforementioned discussion about the choice of information measures for K-means clustering. The K-means algorithm consists of repeating three steps until convergence: (i) Determining the centroid coordinate; (ii) determining the distance of each object to the centroids; and (iii) grouping the objects based on minimum distance to their closest cluster center, according to the Euclidean distance function. Any random object may be taken as the initial centroid. We applied the program STATISTICA 13.2 for K-means clustering. The stations were distributed in the following ways: (i) Cluster 1 (1_08082500, 2_08088000, 5_08090800, 6_08091000, 8_08096500, and 9_08098290); Cluster 2 (3_08088610); Cluster 3 (4_08089000 and 7_08093100); and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). Table 3 and Figure 6 show the centroids of the clusters.

K-Means Clustering of Daily Streamflow
The first step in the K-means clustering is the determination of the number of clusters K. The 3D scatter plot of points (KCi, KCMi, LLEi) (i = 1, …, 12), which were calculated on the basis of daily discharge data recorded during the period 1989-2016 at twelve gauging stations on the Brazos River in Texas, suggested that K = 4, as shown by the 3D scatter plot in Figure 5. From this figure it can be seen that clustering closely followed the aforementioned discussion about the choice of information measures for K-means clustering. The K-means algorithm consists of repeating three steps until convergence: (i) Determining the centroid coordinate; (ii) determining the distance of each object to the centroids; and (iii) grouping the objects based on minimum distance to their closest cluster center, according to the Euclidean distance function. Any random object may be taken as the initial centroid. We applied the program STATISTICA 13.2 for K-means clustering. The stations were distributed in the following ways: (i) Cluster 1 (1_08082500, 2_08088000, 5_08090800, 6_08091000, 8_08096500, and 9_08098290); Cluster 2 (3_08088610); Cluster 3 (4_08089000 and 7_08093100); and Cluster 4 (10_08111500, 11_08114000, and 12_08116650). Table 3 and Figure 6 show the centroids of the clusters.      On the basis of the analysis of variance (ANOVA) results (Table 4), it could be concluded that there existed highly significant differences between mean values of four clusters, which confirms that the choice of the number of clusters was correctly done. For the end of discussion, let us consider the question about predictability of streamflow, seen through the light of the aforementioned consideration of clustering the streamflow time series. The Lyapunov exponent relates to the predictability of measured time series, which includes deterministic chaos as an inherent component. Model predictability is here understood as the degree to which a correct prediction of a system's state can be made, either qualitatively or quantitatively. In stochastic analysis, a random process is considered predictable if it is possible to infer the next state from previous observations. In many models; however, randomness is a phenomenon which "spoils" predictability [51]. Deterministic chaos does not mechanically denote total predictability, but means that at least it improves the prognostic power. In contrast, stochastic trajectories cannot be projected into future. If LLE > 1 then streamflow is not chaotic, but is rather stochastic, and predictions cannot be based on chaos theory. However, if 0 < LLE < 1 it indicates the existence of chaos in streamflow. In that case, one can compute the approximate time (often called Lyapunov time (LT)) limit for which accurate prediction for a chaotic system is a function of LLE. It designates a period when a certain process (physical, mechanical, hydrological, quantum, or even biological) moves beyond the bounds of precise (or probabilistic) predictability and enters a chaotic mode. According to [56], that time can be calculated as Δ = 1/LLE. If LLE → 0, implying that Δ → ∞, then long-term accurate predictions are possible. However, many streamflow time series are highly complex. Therefore, ∆ can be corrected for randomness in the following way. Similar to ∆ , we can introduce a randomness time ∆ = 1/KC (in time units, second, hour, or day).
Henceforth, we shall denote this quantity Kolmogorov time (KT), as it quantifies the time span beyond which randomness significantly influences predictability. Then, the Lyapunov time corrected On the basis of the analysis of variance (ANOVA) results (Table 4), it could be concluded that there existed highly significant differences between mean values of four clusters, which confirms that the choice of the number of clusters was correctly done. For the end of discussion, let us consider the question about predictability of streamflow, seen through the light of the aforementioned consideration of clustering the streamflow time series. The Lyapunov exponent relates to the predictability of measured time series, which includes deterministic chaos as an inherent component. Model predictability is here understood as the degree to which a correct prediction of a system's state can be made, either qualitatively or quantitatively. In stochastic analysis, a random process is considered predictable if it is possible to infer the next state from previous observations. In many models; however, randomness is a phenomenon which "spoils" predictability [51]. Deterministic chaos does not mechanically denote total predictability, but means that at least it improves the prognostic power. In contrast, stochastic trajectories cannot be projected into future. If LLE > 1 then streamflow is not chaotic, but is rather stochastic, and predictions cannot be based on chaos theory. However, if 0 < LLE < 1 it indicates the existence of chaos in streamflow. In that case, one can compute the approximate time (often called Lyapunov time (LT)) limit for which accurate prediction for a chaotic system is a function of LLE. It designates a period when a certain process (physical, mechanical, hydrological, quantum, or even biological) moves beyond the bounds of precise (or probabilistic) predictability and enters a chaotic mode. According to [56], that time can be calculated as ∆t lyap = 1/LLE. If LLE → 0 , implying that ∆t lyap → ∞ , then long-term accurate predictions are possible. However, many streamflow time series are highly complex. Therefore, ∆t lyap can be corrected for randomness in the following way. Similar to ∆t lyap , we can introduce a randomness time ∆t rand = 1/KC (in time units, second, hour, or day). Henceforth, we shall denote this quantity Kolmogorov time (KT), as it quantifies the time span beyond which randomness significantly influences predictability. Then, the Lyapunov time corrected for randomness is defined as 0, ∆t lyap ∩ [0, ∆t rand ].
It can be stated that the KT designates the size of the time window within time series where complexity remains nearly unchanged. Figure 7 shows the predictability of the standardized daily discharge data of the Brazos River, given by the Lyapunov time (LT) corrected for randomness (in days). From this figure it is seen that LT corrected for randomness increases from two to five days. Such distribution corresponds to the order of clusters in the 3D scatter plot ( Figure 5) along the diagonal, from the upper right corner to the lower left one.   Figure 7 shows the predictability of the standardized daily discharge data of the Brazos River, given by the Lyapunov time (LT) corrected for randomness (in days). From this figure it is seen that LT corrected for randomness increases from two to five days. Such distribution corresponds to the order of clusters in the 3D scatter plot ( Figure 5) along the diagonal, from the upper right corner to the lower left one.

Conclusion
We compared the average-linkage clustering hierarchical algorithm with three clustering algorithms based on the compression-based dissimilarity measure (NCD), permutation distribution dissimilarity measure (PDDM), and Kolmogorov distance (KD) for daily streamflow discharge data from twelve gauging stations on the Brazos River in Texas (USA), for the period 1989-2016. The algorithm was also compared with K-means clustering based on Kolmogorov complexity (KC), the highest value of Kolmogorov complexity spectrum (KCM), and the largest Lyapunov exponent (LLE). The following conclusions are drawn from this study: 1. We considered the way of selecting suitable information measures for K-means clustering.
Accordingly, we selected three measures (i.e., the LLE, KC, and KCM). 2. This choice was made for the following reasons. There are many factors, both natural and humaninduced, that cause continuous changes in streamflow time series and; therefore, in its nonlinearity and complexity, of the Brazos River, and its drainage basin. Additionally, because streamflow processes are unavoidably influenced by measurement at gauging stations (including uncertainties in the single determination of river discharge) and dynamical noise that increases LLE under the influence of noise. 3. Using a dissimilarity matrix based on NCD, PDDM, and KD for daily streamflow discharge data from twelve gauging stations, the agglomerative average-linkage hierarchical algorithm was applied. We selected the KD clustering algorithm as the most suitable among others. 4. The dendrogram gave the indication that the gauging stations may be grouped either in three or four clusters. For statistical analysis (3D scatter plot specified by the vectors KC, KCM, and LLE,

Conclusions
We compared the average-linkage clustering hierarchical algorithm with three clustering algorithms based on the compression-based dissimilarity measure (NCD), permutation distribution dissimilarity measure (PDDM), and Kolmogorov distance (KD) for daily streamflow discharge data from twelve gauging stations on the Brazos River in Texas (USA), for the period 1989-2016. The algorithm was also compared with K-means clustering based on Kolmogorov complexity (KC), the highest value of Kolmogorov complexity spectrum (KCM), and the largest Lyapunov exponent (LLE). The following conclusions are drawn from this study: