Water quality warnings based on cluster analysis in Colombian river basins

Fresh water is considered one of the most important renewable natural resources in the world. Colombia is one of the countries with the highest water supply, and has five watersheds: the Caribbean, Orinoco, Amazon, Pacific and Catatumbo. It is therefore vital to study and evaluate the water quality of its rivers and/or lotic systems. In recent studies, some scientists have used biological indices to calculate water quality, while others have detected water quality through machine learning techniques. However, these studies do not allow users to easily interpret the results. This motivated us to propose a dataset for generating water quality alerts in the Piedras river basin, based on the analysis of the K-Means clustering algorithm and the C4.5 classification technique.


I. Introduction
Fresh water is considered one of the most important renewable natural resources. Among all countries, Colombia has the largest water supply in the world (Viceministerio de Ambiente, 2010), with five watersheds: the Caribbean, Orinoco, Amazon, Pacific and Catatumbo. It is therefore vital to study and evaluate the water quality of rivers and/or lotic systems. However, determining the environmental status of these systems becomes a particularly complex task when the streams' reference conditions are unknown and when they have been exposed to anthropogenic perturbations for a long period of time (Arango, Álvarez, Arango, Torres, & Monsalve, 2008).
In this type of ecosystem the macro-invertebrate community is highly diverse. Due to their tolerance levels to different environmental changes (Alba-Tercedor, 1996), macro-invertebrates have the potential to be used in lotic system monitoring (Pino et al., 2003), supplemented by the analysis of physicochemical variables (pH, dissolved oxygen, etc.) and environmental variables (temperature, humidity, rainfall, sunlight, etc.). Scientists currently make use of biological indices for calculating water quality (Rico, Paredes, & Fernandez, 2009; Park, Chon, Kwak, & Lek, 2004). In Colombia, the bio-indicators commonly used in river systems are the Biological Monitoring Working Party (BMWP) and the Average Score per Taxon (ASPT), which are adapted for each region of the country due to the diversity of climates and reliefs (Pérez, 2003).
Some studies detect water quality through Support Vector Machines (SVM) (Singh & Gupta, 2012; Bae & Park, 2014; Liu et al., 2012) and Artificial Neural Networks (ANN) (Bucak & Karlik, 2011), with the objective of effectively monitoring and controlling water quality. However, these studies do not define classes that allow users to easily interpret the results obtained by the classifiers. We therefore propose a dataset to generate water quality alerts in the Piedras river basin based on the analysis of clustering algorithms. This paper is organized as follows: Section II presents the process of understanding the data and the techniques used to build it. Section III describes the cluster validation methods and experimental results. Section IV summarizes the study and provides conclusions.

II. Materials and methods
This section describes the data collection process and the techniques used to build the dataset through clustering and classification algorithms.

Physicochemical indicators
The physicochemical variables are described in Pérez (2003), and are as follows:
• Temperature (T): acts on oxygen absorption processes, biological activity, precipitation of compounds, deposit formation and modification of the solubility of substances. Measuring unit: degrees Celsius (°C).
• Conductivity (C): used as an indicator of the concentration of solutes (amount of solids) dissolved in water. Measuring unit: μS/cm.
• Total Dissolved Solids (TDS): a measure of the organic and inorganic substances present in water in molecular, ionized or micro-granular form. Measuring unit: mg/L.
• Dissolved oxygen (DO): amount of oxygen dissolved in the water. It is an indicator of water pollution. A high level of dissolved oxygen indicates better water quality. Measuring unit: mg/L.
• pH: measures the concentration of hydrogen ions in the water. Natural (uncontaminated) waters exhibit a pH range of 5-9.
• Ammonia (Am): formed during the biodegradation of organic nitrogen compounds. A high level causes damage to rivers or ponds. Measuring unit: mg/L.
tion of dead plants and animals and excrement of live animal nitrates is discharged in aquatic ecosystems. Measuring unit: mg/L.
• Nitrite (Nitri): naturally transformed from nitrates, and its presence in water indicates fecal contamination. Measuring unit: mg/L.
• Phosphates (F): are essential nutrients for aquatic organisms in both natural waters and sewage; they are necessary for reproduction and synthesis of new cell tissue. Measuring unit: mg/L.
• Turbidity (Tu): lack of transparency in the water due to insoluble materials in suspension or colloids (clay, silt, dirt, etc.); the more material is in the water (high turbidity), the lower the concentration of oxygen in the same. Measuring unit: NTU.

Biological indicators
Biological samples were collected at the sampling points and classified by taxonomic keys, including: class, order, family, taxon and number of individuals (Pérez, 2003). A brief description of the collected samples is presented below:
• Acari: living in clean and highly oxygenated inland (freshwater), lotic (flowing) and lentic (stagnant) waters.
• Pelecypoda: belong to highly oxygenated marine and inland aquatic ecosystems, are highly sensitive to pollution and are thus considered excellent biomarkers to determine water quality.
• Plecoptera: living in clean, rough, cold, highly oxygenated lotic inland waters; used as biomarkers to determine water quality.
• Lepidoptera: living in both lentic and lotic waters, on stony bottoms, and highly oxygenated submerged vegetation. The species are intolerant to eutrophication (chemical pollution of water).
• Coleoptera: inhabit clean, shallow lotic and lentic inland water with high concentrations of oxygen, average temperatures and low speed.
• Diptera: living in terrestrial niches in deep and shallow lotic and lentic inland waters. This order includes parasites, predators and degraders, and has by virtue of this become of major health significance (poor water quality). Species are tolerant to different degrees of contamination.
ters with well oxygenated low organic content of waste; used as bio-indicators of water quality.
• Hemiptera: living in lotic and lentic low speed inland waters. Some species (Neuston) withstand some degree of salinity and high temperatures; used as a biomarker in surface waters.
• Odonata: living in shallow, low speed lotic and lentic inland waters, surrounded by abundant submerged or emergent aquatic vegetation. Some species can withstand a certain degree of contamination.
• Trichoptera: most lotic species live in inland waters (under rocks, logs and plant material) and a few live in lentic, clean and oxygenated water.
• Tricladida: live in both lentic and lotic shallow water. Most live in well oxygenated waters, but some species can withstand high levels of organic contamination.
• Isopoda: common in marine habitats, but some are freshwater species and many are land-based. Large numbers of species of this order indicate organic enrichment.
• Glossiphoniformes: ectoparasites of fish in inland waters; this order has a high tolerance to water contamination.
• Haplotaxida: most living organisms of this order live in eutrophic lentic and lotic waters, with a muddy bottom and plenty of waste. The species are highly tolerant of organic contamination.

Precipitation period
This describes the precipitation period in which the samples were taken. They are presented by: year, month and sampling point code. These indicators are described in Table 1.

B. Background
Clustering algorithms are unsupervised learning methods that divide a dataset into a number of groups, so that the elements (observations) of the same group are homogeneous (Arbelaitz, Gurrutxaga, Muguerza, Pérez, & Perona, 2013). Several authors (Lin & Chen, 2005; Gan, Ma, & Wu, 2007) have classified clustering algorithms into three types, which are presented below.

Partitional algorithms
Partitional algorithms assume in advance the number of non-overlapping groups into which the dataset should be divided, and determine a split into groups or classes that ensures that each observation belongs to exactly one group (Gan et al., 2007; Arbelaitz et al., 2013). The most commonly used algorithms are K-Means and K-Medoids. K-Means defines k centroids (one for each group) and then assigns each element of the dataset to the nearest centroid. It recalculates the centroid of each group and redistributes each element of the dataset using the same criterion (Velmurugan & Santhanam, 2010). The K-Means algorithm takes the average value of the objects in a group as a reference point (seed), which means that the centroids do not need to match the values of any object in the cluster to which they belong. The K-Medoids algorithm is a variant of K-Means. In contrast to K-Means, K-Medoids chooses points belonging to the dataset as centers or reference points (seeds).
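The assign-and-recalculate loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the study: the function name `kmeans`, the `initial_centroids` argument (seeds supplied by the caller) and the `max_iter` cap are assumptions introduced here, and the Euclidean distance is used throughout.

```python
import math


def kmeans(points, initial_centroids, max_iter=100):
    """Plain K-Means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    # Assumption: the caller supplies the k initial centroids (seeds).
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its group.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a group is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids
```

Because each centroid is a mean, it generally does not coincide with any observation, which is the difference from K-Medoids noted above.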

Hierarchical algorithms
These algorithms establish a hierarchical structure as a result of grouping (Theodoridis, Pikrakis, Koutroumbas, & Cavouras, 2010). Such techniques decompose the dataset into levels or stages, such that each level merges (agglomerative) or divides (divisive) the groups of the previous level (Pang-Ning, Steinbach, & Kumar, 2006; Theodoridis et al., 2010). The agglomerative method starts with as many groups as there are observations in the dataset. From there, groups are merged until, at the end of the process, all cases are contained in a single group. The divisive technique is the reverse process: it ends with as many groups as there are cases.
It should be noted that the hierarchical algorithms used in the environmental sector are of the agglomerative type (Moreno, 2000), and they are much more understandable (easier to interpret) and effective (less complex) than divisive algorithms. In the former, groups are merged according to a high degree of similarity, which makes them more comprehensible, while in the latter the division aims to minimize the overall variance of the group, leading to a much more complex understanding of the groups (Sasirekha & Baby, 2013; de Mantaras & Saitia, 2004). The results of this type of algorithm should therefore be examined thoroughly to ensure that they make sense, and for this reason this work focuses on the analysis of agglomerative methods. The most representative search strategies are minimum distance (single linkage), maximum distance (complete linkage) and average distance (average linkage) (Madhulatha, 2012; Pang-Ning et al., 2006). A key step in hierarchical clustering is selecting the measure that describes the distance between observations; Madhulatha (2012) argues that the most commonly used measure, and one that gives reliable results in most cases, is the Euclidean distance. In this research, we use the Euclidean distance to measure the similarity between clusters.
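As an illustration of the three linkage strategies, the following sketch merges the two closest clusters repeatedly, starting from one cluster per observation. The function names `cluster_distance` and `agglomerate` and the `n_clusters` stopping rule are assumptions made for this example; it is a naive version meant for small datasets, not the implementation used in the experiments.

```python
import math


def cluster_distance(a, b, linkage):
    """Distance between two clusters (lists of points) under a linkage rule."""
    pair_dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":      # minimum distance
        return min(pair_dists)
    if linkage == "complete":    # maximum distance
        return max(pair_dists)
    if linkage == "average":     # average distance
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown linkage: {linkage}")


def agglomerate(points, n_clusters, linkage="average"):
    """Start with one cluster per observation and repeatedly merge the
    two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage.
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(
                clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters
```

Running the loop down to a single cluster and recording each merge would give the full hierarchy that a dendrogram displays.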

Density-based algorithms
This class of algorithms divides the dataset into groups according to the density distribution of the elements, so that each group has a high density of points inside it. One of the algorithms used in this area is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which groups the observations of a dataset into regions of high and low density. Initially, the algorithm defines a set of central objects (objects whose neighborhood contains a number of points greater than or equal to a specified threshold), edge or boundary objects (objects whose neighborhood contains fewer points than the specified threshold, but which lie in the vicinity of a central object) and atypical objects or noise (objects that fall into neither of the above categories). Once the objects are defined, DBSCAN selects an arbitrary element p; if p is a central object, a group containing all observations that are density-reachable from p is built. If p is not a central object, another element of the dataset is visited. The process continues until all elements have been visited. Objects left outside the groups that have been formed are called outliers or noise points, while objects that are neither noise nor central objects are called edge points (Gan et al., 2007; Pang-Ning et al., 2006). It is noteworthy that DBSCAN can handle groups of different shapes and sizes, but has limitations when the dataset has tied dimensions, when groups are overlapping, and in the presence [...]
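The first step of DBSCAN, labeling each observation as a central (core), edge (border) or noise object, can be sketched as follows. The names `classify_points`, `eps` (neighborhood radius) and `min_pts` (density threshold) are assumptions chosen for this example, and the sketch covers only the classification step, not the construction of the groups.

```python
import math


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border' or 'noise' as in DBSCAN."""
    # Neighborhood of a point: all points within eps (itself included).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points have at least min_pts points in their neighborhood.
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")  # not dense itself, but near a core point
        else:
            labels.append("noise")
    return labels
```

Groups are then grown outward from the core points, and border points join the group of a neighboring core point.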

III. Results and discussion
This section presents the assessment of the clustering algorithms discussed above, applied to the dataset described in Section II.

A. Experimental results
An experimental evaluation was conducted to determine the performance of the hierarchical and partitional clustering methods applied to different datasets (Table 2), using only the Indexes Validation Group (IVG), since the correct partition is not known. The IVG are mostly based on the concepts of cohesion within groups and separation between them (Pang-Ning et al., 2006). Cohesion measures the degree of relationship or homogeneity of the objects in a group (membership), while separation measures the degree of distinction between groups (non-membership).
For the experiments, four datasets derived from the Piedras river database were used, containing information on the physicochemical and biological samples collected. These derived datasets were built using the attribute reduction technique Principal Component Analysis (PCA), which aims to select the most appropriate subset of features of the original dataset while discarding redundant (correlated), inconsistent or irrelevant attributes. These attributes are described in Table 2.
The datasets were then processed with the techniques described above, giving the results presented below.

Performance analysis of partitional algorithms
To evaluate the performance of the partitional algorithms K-Means and K-Medoids, we first find the optimal number of groups that best fits the natural grouping of the input data over the tested range K=3 to K=20, evaluating the result of each iteration with the IVG described in the previous section. This procedure was performed for the four datasets described in Table 2.
Figure 2 describes the behavior of the K-Means (Figure 2a) and K-Medoids (Figure 2b) algorithms on the dataset DS0. Blue represents the DB index (based on a similarity measure between groups; its minimum value denotes the best partition) and red the Silhouette index (which combines cohesion and separation: the greater the Silhouette, the more compact and separate the groups, i.e., its maximum value denotes the best partition).
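For readers who want to see how such indices are computed, a minimal sketch of both follows. It assumes that DB denotes the Davies-Bouldin index and that distances are Euclidean; the function names `silhouette` and `davies_bouldin` are chosen for this example, and both expect at least two groups.

```python
import math


def _groups(points, labels):
    """Collect the points of each group, keyed by group label."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return groups


def silhouette(points, labels):
    """Mean Silhouette over all points (higher is better)."""
    groups = _groups(points, labels)
    scores = []
    for p, lab in zip(points, labels):
        own = groups[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton groups
            continue
        # Cohesion: mean distance to the other members of the same group.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Separation: mean distance to the nearest other group.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in groups.items() if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin(points, labels):
    """Davies-Bouldin index (lower is better)."""
    groups = _groups(points, labels)
    centroids = {lab: [sum(c) / len(g) for c in zip(*g)]
                 for lab, g in groups.items()}
    # Scatter: mean distance of a group's points to its centroid.
    scatter = {lab: sum(math.dist(p, centroids[lab]) for p in g) / len(g)
               for lab, g in groups.items()}
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in groups if j != i)
        for i in groups
    ]
    return sum(worst) / len(worst)
```

On a toy example with two well-separated pairs of points, the natural partition gives a higher Silhouette and a lower DB than a partition that mixes the pairs, which is the behavior the two curves in the figure rely on.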
As mentioned above, a maximum value of the Silhouette index and a minimum value of the DB index represent the best partition and thus the optimum value of K. From Figure 2, the optimal number of groups for the K-Means method is therefore between 5 and 6, while for K-Medoids it is 7. In this case, the measures that define the appropriate number of groups are better for K-Medoids than for K-Means, since the DB and Silhouette indices agree on K=7.
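The selection rule used here (maximum Silhouette, minimum DB, and whether both point to the same K) can be written down directly. In the sketch below the helper name `best_k` is an assumption, and the index values are hypothetical numbers invented for illustration; they are not the values plotted in Figure 2.

```python
def best_k(silhouette_by_k, db_by_k):
    """Return (K maximizing Silhouette, K minimizing DB, whether they agree)."""
    k_sil = max(silhouette_by_k, key=silhouette_by_k.get)
    k_db = min(db_by_k, key=db_by_k.get)
    return k_sil, k_db, k_sil == k_db


# Hypothetical index values for a few K, invented for illustration only.
silhouette_by_k = {3: 0.41, 5: 0.47, 7: 0.52, 9: 0.44}
db_by_k = {3: 1.30, 5: 1.12, 7: 0.95, 9: 1.21}
print(best_k(silhouette_by_k, db_by_k))  # (7, 7, True)
```

When the two indices disagree, as reported for K-Means on DS0, the third element is False and the candidate values of K have to be examined individually.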
Similarly, the behavior of these algorithms applied to the DS1 dataset (Figure 3) is analyzed. In Figure 3 (a) and (b), note that the results are reversed compared to the previous case; that is, K-Means identifies the appropriate number of groups precisely (K=3), while K-Medoids yields two possibilities for this value (K=3 or K=4). Figure 4 shows the results obtained when the DS2 dataset is evaluated with K-Means and K-Medoids.

Figure 5. K-Means vs. K-Medoids behavior in DS3 with respect to the Silhouette and DB indices. (a) K-Means (b) K-Medoids
[...] optimal clustering (K=3) versus the K-Medoids algorithm. These results indicate that the DS1 and DS2 datasets are very similar. Figure 5 shows the results for the DS3 dataset, which are similar to those for DS0. From this it can be assumed that there is redundant information between datasets DS0-DS3 and DS1-DS2. In other words, DS3 represents the same information as DS0, even though it has fewer attributes. The same goes for DS1 vs. DS2. This behavior indicates that the datasets DS0 and DS2 can be represented by DS3 and DS1 respectively.
The best values of the DB and Silhouette indices (minimum and maximum respectively) obtained by K-Means and K-Medoids on each of the datasets are summarized below. The best results are obtained when K-Means and K-Medoids process dataset DS1, since they obtain the minimum values of DB (0.458 and 1.101 respectively) and the highest values of Silhouette (0.773 and 0.526 respectively). On DS1, K-Means reaches its best clustering quality with K=3, while K-Medoids reaches it with K=3 or K=4. Given these criteria, it can be concluded that the optimal number of groups is K=3 in both cases. Notably, K-Means presents better values than K-Medoids for both validation indices: a lower DB and a higher Silhouette.

Performance analysis of hierarchical algorithms
This subsection evaluates the single, complete and average linkage algorithms. The performance analysis was carried out for each dataset defined in Table 2, based on the CCC and Silhouette indices, which measure the degree of distortion and the optimal number of clusters respectively. Figure 6 shows the behavior of these algorithms for different numbers of clusters (3-20). All the algorithms obtain their best Silhouette values when grouping the datasets with K=3, the same optimum number of clusters as the partitional methods.
Single Linkage presents the best results for Silhouette compared with Complete and Average Linkage (0.8, 0.88, 0.88 and 0.8 respectively) on the four datasets analyzed. Analyzed in detail, the behavior of the datasets is the same as that obtained from the analysis of the partitional methods, thus corroborating the occurrence of data redundancy.
Moreover, since hierarchical algorithms produce a tiered structure (dendrogram) as a result of the grouping, the best way to obtain an accurate assessment is to use the CCC index. Table 3 lists the degree of distortion of the relations between observations. The Average Linkage algorithm obtained the best values for the CCC index (0.631, 0.681, 0.69 and 0.713 respectively) for the four datasets analyzed. Thus, Average Linkage is the strategy that generated the least distortion in the relationships, or in other words, the one that related observations most appropriately.
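The text does not expand the acronym CCC. The sketch below assumes it denotes the cophenetic correlation coefficient, that is, the Pearson correlation between the original pairwise distances and the dendrogram heights at which each pair of observations is first merged; values closer to 1 mean less distortion. The function name `cophenetic_correlation` is an assumption, and the code is a naive version intended for small datasets with at least three non-identical observations.

```python
import math
from itertools import combinations


def cophenetic_correlation(points, linkage="average"):
    """Correlation between original distances and dendrogram merge heights."""
    n = len(points)
    pairs = list(combinations(range(n), 2))
    original = {(i, j): math.dist(points[i], points[j]) for i, j in pairs}

    def between(a, b):
        # Linkage distance between two clusters of point indices.
        d = [original[(min(i, j), max(i, j))] for i in a for j in b]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        return sum(d) / len(d)  # average linkage

    clusters = [[i] for i in range(n)]
    cophenetic = {}
    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: between(clusters[ab[0]], clusters[ab[1]]))
        height = between(clusters[a], clusters[b])
        # Every pair split across the two clusters is first joined here.
        for i in clusters[a]:
            for j in clusters[b]:
                cophenetic[(min(i, j), max(i, j))] = height
        clusters[a] += clusters[b]
        del clusters[b]  # b > a, so index a is unaffected

    x = [original[ij] for ij in pairs]
    y = [cophenetic[ij] for ij in pairs]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    var_x = sum((xi - mx) ** 2 for xi in x)
    var_y = sum((yi - my) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)
```

Comparing the value returned for "single", "complete" and "average" on the same data mirrors the comparison reported in Table 3.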

B. Understanding the data
In the previous section we analyzed partitional and hierarchical clustering algorithms, where the best results were obtained by K-Means (K=3) and Average Linkage respectively, and DS1 was the dataset that showed the best results. This section analyzes the groups obtained by the K-Means algorithm. The Average Linkage analysis is omitted because, when the dataset is high-dimensional, hierarchical clustering algorithms break down (delivering poorly descriptive and unreliable results) due to their nonlinear time complexity and high cost; the literature shows that this kind of technique is quite effective for low-dimensional datasets (Quiroz, Pla, Badia, & Chover, 2007; Saraçli, Doğan, & Doğan, 2013).
To interpret the composition of the groups generated by K-Means, a C4.5 decision tree was used, since this algorithm is one of the most widely used for such tasks (Corrales, Corrales, & Figueroa-Casas, 2015; Pérez, 2003). The resulting decision tree is shown in Figure 7.
The DS1 dataset consists of a total of 5590 individuals of aquatic macro-invertebrates belonging to 63 taxa, plus the attributes year, month, station code and number of individuals, as indicated in Table 2. After dividing the dataset into three groups with the K-Means clustering technique, each group is interpreted through the C4.5 decision tree, as indicated above. Table 4 shows the distribution of instances obtained by the C4.5 decision tree for the aquatic macro-invertebrates found in the Piedras river basin. The percentage of individuals in each group was categorized following the methodology of Pérez (2003), in which each macro-invertebrate is labeled with a number indicating its degree of sensitivity to pollutants. These numbers range gradually from 1 to 10, where 1 indicates the least sensitivity (accepts contaminants) and 10 the greatest (accepts no kind of contaminant). Additionally, water quality is assessed from taxonomic abundance by calculating the biological indices BMWP and ASPT, and a color is assigned according to the resulting quality.
Thus, to interpret the clusters, this methodology is taken as a starting point: each group has three alert levels for water quality (represented by the individual taxa found), distinguished as blue, green and yellow according to their sensitivity to contaminants. As shown in Table 4, clusters 1 and 3 have the greatest diversity of macro-invertebrates, representing 40.56% and 44.4% of the collected taxa respectively. The percentages of taxa indicating high and good water quality are much higher than those of taxa indicating dubious water quality, which signals alerts of good water quality in the three clusters.
The results obtained by the classifier are discussed in more detail below, organized so as to visualize the behavior of the taxonomic community at different times and sampling points (Puente Alto, Puente Carretera and the El Diviso intake). Figure 8 shows this behavior.
In general, taxa indicating high and good water quality clearly stand out compared with taxa representing water of questionable quality at the three sampling points, thereby yielding alerts of relatively good water quality at the three sites.
Similarly, analyzing this figure more thoroughly, it displays in the first instance for the period 2011 that cluster number one (C1) is a good representative of water quality alerts due to the dominance of individuals belonging to this category (blue), followed by two (C2) and three (C3), where the warnings of good water quality (green) in this study are not considered alerts of wa-agua dudosa, lo que indica las buenas condiciones de alerta en la calidad del agua en las tres agrupaciones.
En general, los indicadores de alta y de buena calidad de agua de taxones resaltan claramente comparados con taxones que representan agua de calidad cuestionable en los tres puntos de muestreo, mostrando de este modo alertas de calidad de agua relativamente buena en los tres sitios.
Del mismo modo, la Figura 8 es estudiada, la cual, muestra en primera instancia al Cluster o agrupación numero 1 (C1) como un fiel representante de alertas de calidad de agua debido al dominio sobre los individuos que pertenecen a su categoría (Azul), seguido del Cluster numero dos (C2) y el numero tres (C3) que representan alertas de buena calidad de agua (Verde). Otras categorías no hacen parte de la presente investigación debido a la falta de diversidad de taxones de este tipo en la base de datos y a la calidad dudosa del agua.
De la misma manera, se detectó que la abundancia taxonómica para 2013 sigue disminuyendo, la cual se obtiene a partir del numero de especies indicadoras de agua de alta calidad, así mismo como la calidad del agua en general de las áreas de muestreo con la excepción del punto de Puente Alto, el cual parece seguir un constante comportamiento desde que la co- munidad pendiente de las especies indicadoras de alta y buena calidad reportan la misma proporción durante los tres años estudiados. Los grupos siguen el mismo patrón para el mismo período, sin embargo, los grupos y alertas C1, C2 y C3 solo representan buena calidad de agua. Con base en los resultados obtenidos es posible asumir que el punto de muestreo de Puente Alto es el sistema que ha sufrido un menor grado de alteraciones por las actividades humanas en comparación con las áreas de muestreo de Puente Carretera y de la toma de agua de Diviso, teniendo en cuenta que posee mayor riqueza de individuos que representan alta calidad en todos los períodos de muestreo.
El objetivo de este estudio se basó en la generación de alertas mediante análisis grupales de calidad de agua en la cuenca del rio Piedras. Diferentes tipos de métodos de validación fueron revisados y abordados con el presente propósito. Después de analizar los resultados de todos los experimentos, se ha llegado a la conclusión de que el agua de las cuencas es de buena calidad en los tres puntos de muestreo analizados, a pesar de que se reduce con el paso del tiempo. La pérdida de la calidad del agua se debe al aumento de aproximadamente 14% en el número de organismos resistentes, el cual, se debe a los diferentes grados de contaminación y la reducción de 16% en el número de individuos indicadores de calidad de agua.
El punto de muestreo de Puente Alto es el sistema que ha sido alterado en un menor grado por las actividades humanas en comparación con Puente Carretera y la toma de agua de Diviso. Este punto tiene la mayor cantidad de individuos que representan alta calidad en todos los períodos de muestreo.
Por otro lado, el uso de la metodología con el algoritmo K-means y el árbol de decisión C.4.5 pueden generar alertas de calidad del agua de fácil interpretación para todos los usuarios. Hay un gran interés en el grupo para poder llevar a cabo este tipo de análisis con otras cuencas donde la abundancia taxonómica de organismos indicadores de baja calidad es significativa.
ter of dubious quality groups, due to the lack of diverse taxa of this type in the database.
Moreover, 2012 shows growth in the indicator species of good and dubious water quality, with a small reduction in the species representative of water quality. This generates alerts for increasing pollutants (compared to 2011) at the three sampling points, most clearly at the Diviso intake and Puente Carretera. The groups behave as in the previous year: C1 represents alerts of high-quality water, while the C2 and C3 alerts still represent good water quality.
Similarly, the taxonomic abundance for 2013 shows that the number of high-quality indicator species continues to decline, and with it the water quality at the sampling points. The exception is the Puente Alto sampling point, which remains approximately constant: there, the community of high-quality and good-quality indicator species shows the same proportions across the three years analyzed. The groups follow the same pattern as in previous years, except that the C1, C2 and C3 groups and alerts now represent only good water quality.
Based on these results, it can be assumed that the Puente Alto sampling point is the system least altered by human activities compared with the sampling points at Puente Carretera and the Diviso intake, since it has the greatest richness of individuals representing high quality in all sampling periods.
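To make the link between clusters and alerts concrete, the following is a minimal sketch of K-Means grouping followed by a simple naming of the resulting groups. It is not the authors' implementation: the sample values, the two features (share of high-quality indicator taxa, share of resistant taxa), the evenly spaced seeding and the function names are all assumptions for illustration, and the C4.5 classification step is not reproduced.

```python
from math import dist

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm with deterministic, evenly spaced seeds."""
    centroids = [tuple(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each sample joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

def alerts_from_clusters(labels, centroids,
                         names=("high quality", "good quality", "dubious quality")):
    """Name clusters by ranking centroids on the first feature (hypothetical rule)."""
    order = sorted(range(len(centroids)), key=lambda c: centroids[c][0], reverse=True)
    name_of = {cluster: names[rank] for rank, cluster in enumerate(order)}
    return [name_of[l] for l in labels]

# Hypothetical samples: (share of high-quality indicator taxa, share of resistant taxa).
samples = [
    (0.60, 0.30), (0.58, 0.33), (0.62, 0.28),
    (0.40, 0.50), (0.42, 0.52), (0.38, 0.55),
    (0.20, 0.72), (0.22, 0.70), (0.18, 0.75),
]

labels, centroids = kmeans(samples, 3)
alerts = alerts_from_clusters(labels, centroids)
for sample, alert in zip(samples, alerts):
    print(sample, "->", alert)
```

In this sketch the alert names are fixed by centroid rank, whereas in the results above the meaning of a group can change from year to year (for example, C1 no longer represents high quality in 2013), so the naming rule would need to be revisited for each period.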
Aquatic macro-invertebrate indicators of water quality accounted for 43% of the community at the three sampling points in 2011. In 2012 and 2013 this share fell to 30% and 27% respectively. By contrast, the population of contamination-resistant taxa increased over time, rising from 53% of the total population in 2011 to 62% and 67% in 2012 and 2013 respectively, which indicates that the waters are declining in quality over time.
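The change between 2011 and 2013 can be checked with simple arithmetic. The short sketch below uses only the percentages reported in the paragraph above; the variable and function names are illustrative. The differences are in percentage points and appear to correspond to the increase of approximately 14% and the reduction of 16% cited in the conclusions.

```python
# Shares (in % of the total population) as reported in the text.
indicator_share = {2011: 43, 2012: 30, 2013: 27}  # water-quality indicator taxa
resistant_share = {2011: 53, 2012: 62, 2013: 67}  # contamination-resistant taxa

def change_in_points(share, start, end):
    """Difference between two yearly shares, in percentage points."""
    return share[end] - share[start]

indicator_change = change_in_points(indicator_share, 2011, 2013)
resistant_change = change_in_points(resistant_share, 2011, 2013)

print("Indicator taxa:", indicator_change, "percentage points")  # -16
print("Resistant taxa:", resistant_change, "percentage points")  # 14
```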

IV. Conclusions
Manual analysis of water quality through traditional methods is cumbersome, expensive and time-consuming when the data set is large. For this reason, the process requires specialized tools suited to the accurate and effective analysis of information, as well as machine-learning techniques that use existing knowledge to arrive at the same conclusions by different and less complicated methods.
The objective of this study was to generate water quality alerts for the Piedras river basin through cluster analysis. For this purpose, different types of validation methods were reviewed and addressed. After analyzing the results of all the experiments, we concluded that the water is of good quality at the three sampling points analyzed, although its quality is declining over time. This decline is attributed to an increase of approximately 14% in the number of resistant organisms, which in turn stems from the different degrees of pollution, together with a 16% reduction in the number of individuals that indicate water quality.
The sampling point at Puente Alto is the system that has been least altered by human activities compared to Puente Carretera and the Diviso intake. This point has the greatest richness of individuals representing high quality in all sampling periods.
In addition, the methodology based on the K-Means algorithm and the C4.5 decision tree can generate water quality alerts that are easy for all users to interpret. The group is strongly interested in performing this type of analysis in other basins where the taxonomic abundance of poor-quality indicator organisms is significant.
As a general recommendation, we suggest constant control and monitoring of activities around the Piedras River. Although the water in the basin is of good quality, the reduction in water quality indicator organisms is a warning sign that calls for continuous monitoring, with the overall objective of preserving the sources of water supply for the city of Popayán in the department of Cauca.

Acknowledgments
The authors thank the Universidad del Cauca, the Grupo de Estudios Ambientales (GEA), the Grupo de Ingeniería Telemática (GIT), the Colciencias doctoral scholarship (awarded to MSc. David Camilo Corrales) and the AgroCloud program of the RICCLISA project for their technical and scientific support.

http://www.icesi.edu.co/revistas/index.php/sistemas_telematica Castillo, E., Gonzales, W., López, I., Figueroa, A., Corrales, D., Hoyos, M. & Corrales, J. (2015).