
A Multicriteria Decision Making Approach for Estimating the Number of Clusters in a Data Set

  • Yi Peng,

    Affiliation School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

  • Yong Zhang,

    Affiliation School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

  • Gang Kou,

    kougang@uestc.edu.cn

    Affiliation School of Management and Economics, University of Electronic Science and Technology of China, Chengdu, Sichuan, China

  • Yong Shi

    Affiliations CAS Research Center on Fictitious Economy and Data Sciences, Beijing, China, College of Information Science & Technology, University of Nebraska at Omaha, Omaha, Nebraska, United States of America

Abstract

Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes an MCDM-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods treat different numbers of clusters as alternatives and the outputs of any clustering algorithm on validity measures as criteria. The proposed method is examined in an experimental study using three MCDM methods, the well-known k-means clustering algorithm, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that the MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study.

Introduction

Cluster analysis, the most widely adopted unsupervised learning process, organizes data objects into groups that have high intra-group similarity and high inter-group dissimilarity without a priori information. Unlike the evaluation of supervised classifiers, which can be conducted using well-accepted objective measures and procedures, the assessment of clustering algorithms' outputs, often called cluster validation, is challenging because of the lack of objective validation criteria and the application-dependent nature of clustering. Nevertheless, cluster validation is necessary to ensure that the resulting clustering structures did not occur by chance [1].

As an essential step in cluster analysis, cluster validation has been an active research area. Two fundamental issues that need to be addressed in cluster validation are to estimate the number of clusters in a data set and to evaluate clustering algorithms [2]. This paper focuses on the first problem. Researchers from several disciplines, such as statistics, pattern recognition, and information retrieval, have studied this issue for years. Marriott (1971) used a heuristic argument to determine the number of clusters in a data set [3]. Hartigan (1975) suggested the statistic H(k) to estimate the number of clusters [4]. Milligan and Cooper (1985) evaluated thirty procedures for determining the number of clusters using artificial data sets with distinct non-overlapping clusters [5]. The procedures, also called stopping rules, were clustering-algorithm independent and were selected from the clustering literature to represent a wide variety of techniques and approaches. Krzanowski and Lai (1988) derived a criterion for determining the number of groups in a data set using sum-of-squares clustering and illustrated that the new criterion performs better than Marriott's criterion [6]. Kaufman and Rousseeuw (1990) used the silhouette statistic to estimate the optimal number of clusters in a data set [7]. Tibshirani et al. (2001) proposed the gap statistic for estimating the number of clusters in a data set and compared the gap method with four other methods in a simulation study [8]. Dudoit and Fridlyand (2002) estimated the number of clusters using a prediction-based resampling method, Clest, and compared the performance of the Clest method with some existing methods using simulated data and gene-expression data [9]. Sugar and James (2003) developed an information theoretic approach for choosing the number of clusters, conducted a simulation study to compare the performance of the proposal with five other methods, and provided a theoretical justification for the proposed procedure [10]. Salvador and Chan (2004) designed the L method to determine the number of clusters for hierarchical clustering algorithms [11].

Different from previously developed approaches, this study examines the problem from a new perspective. Since the determination of the number of clusters in a data set normally involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem [12], [13]. The objective of this paper is to develop an MCDM-based approach to choose the appropriate number of clusters for a data set. MCDM methods treat different numbers of clusters for a data set as available alternatives and the performances of clustering algorithms on validity measures with different numbers of clusters as criteria. Alternatives are then ranked according to the evaluation of multiple criteria. An experimental study is designed to examine the proposed approach using three MCDM methods (i.e., PROMETHEE II, WSM, and TOPSIS), the well-known k-means clustering algorithm, ten relative measures, and fifteen public-domain UCI machine learning data sets. Furthermore, the experimental study applies the ten existing relative measures to estimate the number of clusters and compares their performance with that of the three MCDM methods.

The rest of the paper is organized as follows. The next section describes the proposed method, the selected MCDM methods, the clustering algorithm, and the validity measures. Results and discussion section presents details of the experimental study and analyzes the results. The last section concludes the paper with summaries, limitations, and future research directions.

Methods

Proposed Approach

Estimating the number of clusters for a given data set is closely related to the validity measures and the data set structures. Many validity measures have been proposed and can be classified into three categories: external, internal, and relative [1]. External measures use predefined class labels to examine the clustering results. Because external validation uses the true class labels in the comparison, it is an objective indicator of the true error rate of a clustering algorithm. Internal measures evaluate clustering algorithms by measuring intra- and inter-cluster similarity. An algorithm is regarded as good if the resulting clusters have high intra-class similarities and low inter-class similarities. Relative measures try to find the best clustering structure generated by a clustering algorithm using different parameter values. Extensive reviews of cluster validation techniques can be found in [1], [14], and [15].

Although external measures performed well in predicting the clustering error in previous studies, they require the a priori structure of a data set and can only be applied to data sets with class labels. Since this study concentrates on data sets without class labels, it utilizes relative validity measures. The proposed approach can be applied to a wide variety of clustering algorithms. For simplicity, this study chooses the well-known k-means clustering algorithm. Figure 1 describes the MCDM-based approach for determining the number of clusters in a data set. For a given data set, different numbers of clusters are treated as alternatives, and the performances of the k-means clustering algorithm on the relative measures with different numbers of clusters serve as criteria for the MCDM methods. The output is a ranking of numbers of clusters, which evaluates the appropriateness of different numbers of clusters for the data set based on their overall performance across multiple criteria (i.e., the selected relative measures).

Figure 1. A MCDM-based approach for determining the number of clusters in a dataset.

https://doi.org/10.1371/journal.pone.0041713.g001
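To make this pipeline concrete, the sketch below builds the decision matrix that the MCDM methods consume: rows are candidate numbers of clusters (the alternatives) and columns are validity measures (the criteria). It is a minimal illustration assuming scikit-learn's KMeans, silhouette_score, and davies_bouldin_score as stand-ins for the WEKA k-means implementation and the full set of ten relative measures used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

def build_decision_matrix(X, k_range=range(2, 11)):
    """One row per candidate k: [silhouette (benefit), Davies-Bouldin (cost)]."""
    rows = []
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        rows.append([silhouette_score(X, labels),
                     davies_bouldin_score(X, labels)])
    return np.array(rows)
```

Each MCDM method described next takes such a matrix, together with criteria weights and the benefit/cost orientation of each column, and returns a ranking of the candidate numbers of clusters.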

MCDM Methods

This study chooses three MCDM methods for estimating the number of clusters for a data set. This section introduces the selected MCDM methods (i.e., WSM, PROMETHEE, and TOPSIS) and explains how they are used to estimate the optimal number of clusters for a given data set.

MCDM Method 1: Weighted Sum Method (WSM)

The weighted sum method (WSM) was introduced by Zadeh [16]. It is the most straightforward and widely used MCDM method for evaluating alternatives. When an MCDM problem involves both benefit and cost criteria, two approaches can be used to deal with the conflicting criteria: one is the benefit-to-cost ratio and the other is benefit minus cost [17]. For the estimation of the optimal number of clusters for a data set, the relative indices Dunn, silhouette, and PBM are benefit criteria and have to be maximized, while Hubert, normalized Hubert, the Davies-Bouldin index, SD, S_Dbw, CS, and the C-index are cost criteria and have to be minimized. This study chooses the benefit-minus-cost approach and applies the following formulations to rank different numbers of clusters.

Suppose there are m alternatives, k benefit criteria, and n cost criteria. The total benefit of alternative $A_i$ is defined as follows:

$$P_{benefit}(A_i) = \sum_{j=1}^{k} w_j\, a_{ij}, \quad i = 1, 2, \ldots, m,$$

where $a_{ij}$ represents the performance measure of the $j$th benefit criterion for alternative $A_i$ and $w_j$ is the weight of that criterion. Similarly, the total cost of alternative $A_i$ is defined as follows:

$$P_{cost}(A_i) = \sum_{j=1}^{n} w_j\, c_{ij}, \quad i = 1, 2, \ldots, m,$$

where $c_{ij}$ represents the performance measure of the $j$th cost criterion for alternative $A_i$. Then the importance of alternative $A_i$ is defined as follows:

$$P(A_i) = P_{benefit}(A_i) - P_{cost}(A_i).$$

The best alternative is the one that has the largest WSM score [18].
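As an illustration, the following sketch computes the benefit-minus-cost WSM score. It assumes the criteria are already on comparable scales (otherwise they should be normalized first); the function and variable names are ours, not the paper's.

```python
import numpy as np

def wsm_scores(matrix, weights, is_benefit):
    """Benefit-minus-cost WSM: weighted sum of the benefit criteria minus
    the weighted sum of the cost criteria, for each alternative."""
    M = np.asarray(matrix, dtype=float)     # rows: alternatives, cols: criteria
    w = np.asarray(weights, dtype=float)
    sign = np.where(is_benefit, 1.0, -1.0)  # +w_j for benefit, -w_j for cost
    return M @ (w * sign)                   # P(A_i) = sum_j sign_j * w_j * a_ij

# Toy example: three candidate k values evaluated on silhouette (benefit)
# and the Davies-Bouldin index (cost), with equal weights.
scores = wsm_scores([[0.52, 0.80], [0.61, 0.65], [0.48, 0.90]],
                    weights=[0.5, 0.5], is_benefit=[True, False])
print(scores.argmax())  # index of the best-ranked alternative (here 1)
```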

MCDM Method 2: Preference Ranking Organisation Method for Enrichment of Evaluations (PROMETHEE)

Brans proposed PROMETHEE I and PROMETHEE II, which use pairwise comparisons and outranking relationships to choose the best alternative [19]. The final selection is based on the positive and negative preference flows of each alternative. The positive preference flow indicates how much an alternative outranks all the other alternatives, and the negative preference flow indicates how much it is outranked by all the other alternatives [20]. While PROMETHEE I obtains a partial ranking because it does not compare conflicting actions [21], PROMETHEE II ranks alternatives according to the net flow, which is the balance of the positive and the negative preference flows. An alternative with a higher net flow is better [20]. Since the goal of this study is to provide a complete ranking of different numbers of clusters, PROMETHEE II is utilized. The following procedure, presented by Brans and Mareschal [20], is used in the experimental study:

Step 1. Define aggregated preference indices. Let $a, b \in A$, and let:

$$\pi(a,b) = \sum_{j=1}^{k} P_j(a,b)\, w_j, \qquad \pi(b,a) = \sum_{j=1}^{k} P_j(b,a)\, w_j,$$

where A is a finite set of possible alternatives $\{a_1, a_2, \ldots, a_n\}$, k represents the number of evaluation criteria, and $w_j$ is the weight of each criterion. For estimating the number of clusters for a given data set, the alternatives are different numbers of clusters and the criteria are relative indices. Arbitrary numbers for the weights can be assigned by decision-makers. The weights are then normalized so that they sum to one. $\pi(a,b)$ indicates how a is preferred to b over all the criteria and $\pi(b,a)$ indicates how b is preferred to a over all the criteria. $P_j(a,b)$ and $P_j(b,a)$ are the preference functions for alternatives a and b. The relative indices Dunn, silhouette, and PBM have to be maximized, and Hubert, normalized Hubert, DB, SD, S_Dbw, CS, and C-index have to be minimized.

Step 2. Calculate $\pi(a,b)$ and $\pi(b,a)$ for each pair of alternatives of A. There are six types of preference functions, and the decision-maker needs to choose one type of preference function for each criterion and the values of the corresponding parameters [22]. The usual preference function, which requires no input parameter, is used for all criteria in the experiment.

Step 3. Define the positive and the negative outranking flows as follows. The positive outranking flow:

$$\phi^{+}(a) = \frac{1}{n-1} \sum_{x \in A} \pi(a, x).$$

The negative outranking flow:

$$\phi^{-}(a) = \frac{1}{n-1} \sum_{x \in A} \pi(x, a).$$
Step 4. Compute the net outranking flow for each alternative as follows:

$$\phi(a) = \phi^{+}(a) - \phi^{-}(a).$$

When $\phi(a) > 0$, a outranks all the alternatives on balance over all the evaluation criteria; when $\phi(a) < 0$, a is outranked on balance.
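A compact sketch of PROMETHEE II with the usual preference function is given below. It assumes the same alternatives-by-criteria decision matrix as before and is our illustration, not the D-Sight implementation used in the experiment.

```python
import numpy as np

def promethee_ii(matrix, weights, is_benefit):
    """Net outranking flows with the usual preference function:
    P_j(a, b) = 1 if a strictly beats b on criterion j, and 0 otherwise."""
    X = np.asarray(matrix, dtype=float).copy()
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                             # normalize weights to sum to one
    X[:, ~np.asarray(is_benefit)] *= -1         # flip cost criteria: larger = better
    n = len(X)
    pi = np.zeros((n, n))                       # aggregated preference indices
    for a in range(n):
        for b in range(n):
            if a != b:
                pi[a, b] = w[X[a] > X[b]].sum() # weights of criteria where a beats b
    phi_plus = pi.sum(axis=1) / (n - 1)         # positive outranking flow
    phi_minus = pi.sum(axis=0) / (n - 1)        # negative outranking flow
    return phi_plus - phi_minus                 # net flow; rank by descending value
```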

MCDM Method 3: Technique for Order Preference by Similarity to Ideal Solution (TOPSIS)

The technique for order preference by similarity to ideal solution (TOPSIS) was proposed by Hwang and Yoon [23] to rank alternatives over multiple criteria. It finds the best alternative by minimizing the distance to the ideal solution and maximizing the distance to the nadir, or negative-ideal, solution [24]. This paper uses the following TOPSIS procedure, adopted from [25] and [24], in the empirical study:

Step 1. Calculate the normalized decision matrix. The normalized value $r_{ij}$ is calculated as:

$$r_{ij} = \frac{x_{ij}}{\sqrt{\sum_{j=1}^{J} x_{ij}^{2}}}, \quad j = 1, \ldots, J; \; i = 1, \ldots, n,$$

where J and n denote the number of alternatives and the number of criteria, respectively. For alternative $A_j$, the performance measure of the $i$th criterion $C_i$ is represented by $x_{ij}$.

Step 2. Develop a set of weights $w_i$ for each criterion and calculate the weighted normalized decision matrix. The weighted normalized value $v_{ij}$ is calculated as:

$$v_{ij} = w_i\, r_{ij}, \quad j = 1, \ldots, J; \; i = 1, \ldots, n,$$

where $w_i$ is the weight of the $i$th criterion and $\sum_{i=1}^{n} w_i = 1$.

Step 3. Find the ideal alternative solution $S^{+}$, which is calculated as:

$$S^{+} = \{v_1^{+}, \ldots, v_n^{+}\} = \left\{ \left(\max_{j} v_{ij} \mid i \in I'\right), \left(\min_{j} v_{ij} \mid i \in I''\right) \right\},$$

where $I'$ is associated with benefit criteria and $I''$ is associated with cost criteria. In this study, the benefit and cost criteria of TOPSIS are defined the same as the benefit and cost criteria in WSM.

Step 4. Find the negative-ideal alternative solution $S^{-}$, which is calculated as:

$$S^{-} = \{v_1^{-}, \ldots, v_n^{-}\} = \left\{ \left(\min_{j} v_{ij} \mid i \in I'\right), \left(\max_{j} v_{ij} \mid i \in I''\right) \right\}.$$

Step 5. Calculate the separation measures using the n-dimensional Euclidean distance. The separation of each alternative from the ideal solution is calculated as:

$$D_j^{+} = \sqrt{\sum_{i=1}^{n} \left(v_{ij} - v_i^{+}\right)^{2}}, \quad j = 1, \ldots, J.$$

The separation of each alternative from the negative-ideal solution is calculated as:

$$D_j^{-} = \sqrt{\sum_{i=1}^{n} \left(v_{ij} - v_i^{-}\right)^{2}}, \quad j = 1, \ldots, J.$$

Step 6. Calculate a ratio that measures the relative closeness to the ideal solution:

$$C_j^{*} = \frac{D_j^{-}}{D_j^{+} + D_j^{-}}, \quad j = 1, \ldots, J.$$

Step 7. Rank the alternatives by maximizing the ratio $C_j^{*}$.
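The procedure translates directly into a few lines of array code. The sketch below follows Steps 1 through 6 under our own naming conventions; like the other snippets, it is an illustration rather than the MATLAB code used in the experiment.

```python
import numpy as np

def topsis(matrix, weights, is_benefit):
    """Relative closeness C* to the ideal solution (Steps 1-6)."""
    X = np.asarray(matrix, dtype=float)        # rows: alternatives, cols: criteria
    w = np.asarray(weights, dtype=float)
    benefit = np.asarray(is_benefit)
    R = X / np.sqrt((X ** 2).sum(axis=0))      # Step 1: vector normalization
    V = R * w                                  # Step 2: weighted normalized matrix
    s_plus = np.where(benefit, V.max(axis=0), V.min(axis=0))   # Step 3: ideal S+
    s_minus = np.where(benefit, V.min(axis=0), V.max(axis=0))  # Step 4: negative-ideal S-
    d_plus = np.sqrt(((V - s_plus) ** 2).sum(axis=1))   # Step 5: distance to S+
    d_minus = np.sqrt(((V - s_minus) ** 2).sum(axis=1)) # Step 5: distance to S-
    return d_minus / (d_plus + d_minus)        # Step 6: C*; rank by descending value
```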

Clustering Algorithm

The k-means algorithm, the most well-known partitioning method, is an iterative distance-based technique [26]. The input parameter k predefines the number of clusters. First, k objects are randomly chosen to be the centers of the clusters. All objects are then partitioned into k clusters based on the minimum squared-error criterion, which measures the distance between an object and the cluster center. The new mean of each cluster is calculated and the whole process iterates until the cluster centers remain the same [27], [28]. Let $X = \{x_1, x_2, \ldots, x_n\}$ be the n objects to be clustered and $C = \{C_1, C_2, \ldots, C_k\}$ be the set of clusters. Let $\mu_i$ be the mean of cluster $C_i$. The squared error between $\mu_i$ and the objects in cluster $C_i$ is defined as:

$$SE_i = \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^{2}.$$

The aim of the k-means algorithm is then to minimize the sum of the squared error over all k clusters, that is:

$$WCSS = \sum_{i=1}^{k} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^{2},$$

where WCSS denotes the within-cluster sum of squared errors.

Two critical steps of the k-means algorithm affect the sum of squared error. First, a new partition is generated by assigning each observed point to its closest cluster center:

$$C_i^{(t)} = \left\{ x_j : \lVert x_j - \mu_i^{(t)} \rVert \le \lVert x_j - \mu_l^{(t)} \rVert \;\text{ for all } l = 1, \ldots, k \right\},$$

where $\mu_i^{(t)}$ denotes the mean of cluster $i$ at iteration $t$ and $C_i^{(t)}$ represents the set of objects assigned to cluster $i$ at iteration $t$. Second, the new cluster mean centers are computed as:

$$\mu_i^{(t+1)} = \frac{1}{\lvert C_i^{(t)} \rvert} \sum_{x_j \in C_i^{(t)}} x_j.$$

The algorithm is implemented using WEKA (Waikato Environment for Knowledge Analysis), a free machine learning software package [29].
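For reference, the two update steps translate into the minimal numpy sketch below. The experiment itself uses WEKA's implementation; this illustration ignores practical details such as empty clusters and multiple restarts.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate the assignment and mean-update steps
    until the cluster centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Step 1: assign each point to its closest cluster center.
        labels = np.linalg.norm(X[:, None] - centers, axis=2).argmin(axis=1)
        # Step 2: recompute each center as the mean of its assigned points.
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centers, centers):  # centers unchanged: converged
            break
        centers = new_centers
    wcss = ((X - centers[labels]) ** 2).sum()  # within-cluster sum of squared errors
    return labels, centers, wcss
```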

Clustering Validity Measures

Ten relative measures are selected for the experiment, namely, the Hubert Γ statistic, the normalized Hubert Γ, Dunn's index, the Davies-Bouldin index, the CS measure, the SD index, the S_Dbw index, the silhouette index, PBM, and the C-index. Relative measures can also be used to identify the optimal number of clusters in a data set, and some of them, such as the C-index and silhouette, have exhibited good performance in previous studies [5], [8]. The following paragraphs define these relative measures.

  • Hubert Γ statistic [30]:

$$\Gamma = \frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} P(i,j)\, Q(i,j),$$

where n is the number of objects in a data set, $M = n(n-1)/2$, P is the proximity matrix of the data set, and Q is an $n \times n$ matrix whose $(i,j)$ element is equal to the distance between the representative points ($v_{c_i}$, $v_{c_j}$) of the clusters to which the objects $x_i$ and $x_j$ belong [15]. Γ indicates the agreement between P and Q.

  • Normalized Hubert Γ:

$$\hat{\Gamma} = \frac{\frac{1}{M} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \left(P(i,j) - \mu_P\right)\left(Q(i,j) - \mu_Q\right)}{\sigma_P\, \sigma_Q},$$

where $\mu_P$, $\mu_Q$, $\sigma_P$, and $\sigma_Q$ represent the respective means and variances of the P and Q matrices [14].

  • Dunn's index [31] evaluates the quality of clusters by measuring the inter-cluster distance and the intra-cluster diameter:

$$D = \min_{i=1,\ldots,K} \left\{ \min_{j=i+1,\ldots,K} \left( \frac{d(C_i, C_j)}{\max_{l=1,\ldots,K} \operatorname{diam}(C_l)} \right) \right\},$$

where K is the number of clusters, $C_i$ is the $i$th cluster, $d(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$, and $\operatorname{diam}(C_l)$ is the diameter of the $l$th cluster. Larger values of D suggest good clusters, and a D larger than 1 indicates compact, well-separated clusters.

  • The Davies-Bouldin index is defined as [32]:

$$DB = \frac{1}{K} \sum_{i=1}^{K} \max_{j \ne i} \left( \frac{s_i + s_j}{d_{ij}} \right),$$

where K is the number of clusters, $s_i$ and $s_j$ represent the respective dispersions of clusters i and j, $d_{ij}$ measures the dissimilarity between the two clusters, and $(s_i + s_j)/d_{ij}$ measures the similarity between the two clusters [15]. DB is the average similarity between each cluster and its most similar one [30].

  • The CS measure was proposed to evaluate clusters with different densities and/or sizes [33]. It is computed as:

$$CS = \frac{\sum_{i=1}^{K} \left[ \frac{1}{N_i} \sum_{x_j \in C_i} \max_{x_q \in C_i} d(x_j, x_q) \right]}{\sum_{i=1}^{K} \min_{j \ne i} d(\mu_i, \mu_j)},$$

where $N_i$ is the number of objects in cluster i and d is a distance function. The smallest CS measure indicates a valid optimal clustering.

  • The SD index combines measurements of the average scattering of clusters and the total separation between clusters [15]:

$$SD(k) = Dis(c_{max}) \cdot Scat(k) + Dis(k),$$

where $c_{max}$ is the maximum number of input clusters, $Scat(k)$ is the average scattering of the clusters, and

$$Dis(k) = \frac{D_{max}}{D_{min}} \sum_{i=1}^{k} \left( \sum_{j=1}^{k} \lVert v_i - v_j \rVert \right)^{-1},$$

in which $D_{max}$ is the maximum distance between cluster centers and $D_{min}$ is the minimum distance between cluster centers.

  • The S_Dbw index is similar to the SD index and is defined as [15]:

$$S\_Dbw(k) = Scat(k) + Dens\_bw(k),$$

where $Dens\_bw(k)$ evaluates the inter-cluster density:

$$Dens\_bw(k) = \frac{1}{k(k-1)} \sum_{i=1}^{k} \sum_{\substack{j=1 \\ j \ne i}}^{k} \frac{\sum_{l=1}^{N_{ij}} f(x_l, u_{ij})}{\max\left\{ \sum_{l=1}^{N_i} f(x_l, v_i),\, \sum_{l=1}^{N_j} f(x_l, v_j) \right\}},$$

in which $N_{ij}$ is the number of objects that belong to the clusters $C_i$ and $C_j$, $u_{ij}$ is the middle point of the line segment defined by the cluster centers $v_i$ and $v_j$, and the function $f(x, u)$ is defined as:

$$f(x, u) = \begin{cases} 0 & \text{if } d(x, u) > stdev, \\ 1 & \text{otherwise.} \end{cases}$$

  • The silhouette is an internal graphical display for evaluating clustering methods. It represents each cluster by a silhouette, which shows how well objects lie within their clusters. It is defined as [34]:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where i represents any object in the data set, a(i) is the average dissimilarity of i to all other objects in the same cluster A, and b(i) is the average dissimilarity of i to all objects in the neighboring cluster B, which is defined as the cluster with the smallest average dissimilarity of i to all of its objects. Note that $-1 \le s(i) \le 1$ and the dissimilarity is computed using distance measures. Since a(i) measures how dissimilar i is to its own cluster and b(i) measures how dissimilar i is to its neighboring cluster, an s(i) close to one indicates a good clustering method. The average s(i) over the whole data set measures the quality of the clusters.

  • PBM was developed by Pakhira, Bandyopadhyay, and Maulik [35] and is based on the intra-cluster and inter-cluster distances:

$$PBM = \left( \frac{1}{K} \cdot \frac{E_1}{E_K} \cdot D_K \right)^{2},$$

where $E_K = \sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert$ is the sum of the distances of the objects to their cluster centers, $E_1$ is the same sum computed for the whole data set treated as a single cluster, and $D_K = \max_{i,j} \lVert \mu_i - \mu_j \rVert$ is the largest distance between two cluster centers.
  • The C-index [36] is based on the intra-cluster distances and their maximum and minimum possible values [37]:

$$C = \frac{S - S_{min}}{S_{max} - S_{min}}, \qquad S = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d(x_i, x_j)\, q_{ij},$$

where $q_{ij} = 1$ if the $i$th and $j$th objects are in the same cluster and $q_{ij} = 0$ otherwise, and $S_{min}$ ($S_{max}$) is the sum of the $l$ smallest (largest) pairwise distances in the data set, with $l$ the number of within-cluster pairs. A small C-index indicates good partitions.
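To show how such measures are computed from a clustering solution, here are sketches of two of the ten indices, Dunn's index and the C-index, following the definitions above. The single-linkage inter-cluster distance in the Dunn sketch is one common convention, and the code is our illustration rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def dunn_index(X, labels):
    """Dunn's index: smallest between-cluster distance divided by the
    largest cluster diameter (larger is better)."""
    labels = np.asarray(labels)
    D = squareform(pdist(X))                   # full pairwise distance matrix
    ks = np.unique(labels)
    diam = max(D[np.ix_(labels == k, labels == k)].max() for k in ks)
    sep = min(D[np.ix_(labels == a, labels == b)].min()
              for i, a in enumerate(ks) for b in ks[i + 1:])
    return sep / diam

def c_index(X, labels):
    """C-index: (S - S_min) / (S_max - S_min), smaller is better."""
    labels = np.asarray(labels)
    d = pdist(X)                               # condensed pairwise distances
    i, j = np.triu_indices(len(labels), k=1)   # same pair order as pdist
    within = labels[i] == labels[j]            # q_ij = 1 for within-cluster pairs
    S, l = d[within].sum(), within.sum()       # S and the number of such pairs
    d_sorted = np.sort(d)
    s_min, s_max = d_sorted[:l].sum(), d_sorted[-l:].sum()
    return (S - s_min) / (s_max - s_min)
```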

Results and Discussion

The experiment is designed to examine the proposed MCDM-based approach for estimating the number of clusters in a data set. The data sets, the experimental design, and the results are discussed in sequence.

Data Sets

Fifteen data sets are used in the experiment. They are provided by the UCI machine learning repository (http://archive.ics.uci.edu/ml/) [38]. Table 1 summarizes the characteristics of the data sets.

Table 2. Rankings of numbers of clusters for the yeast data set.

https://doi.org/10.1371/journal.pone.0041713.t002

The breast cancer data set was provided by Dr. William H. Wolberg from the University of Wisconsin Hospitals [39]. Each record has ten attributes that describe cytological characteristics of breast tissue and belongs to either the benign or the malignant class. The breast tissue data set contains impedance measurements of freshly excised tissue samples from the breast [40]. The acute inflammations data set includes examples for diagnosing acute inflammation of the urinary bladder and acute nephritis [41]. The ecoli data set contains protein localization sites [42]. The glass data set describes six types of glass in terms of their oxide content [43]. The Haberman's survival data set includes samples from a study, conducted between 1958 and 1970, on the survival of patients who had undergone surgery for breast cancer [44]. The ionosphere data set describes radar returns from the ionosphere [45]. The iris data set uses the length and width of the sepal and petal to describe three types of iris plant [46]. The Parkinson's data set consists of a range of biomedical voice measurements from people who are either healthy or have Parkinson's disease [47]. The Pima Indians diabetes data set uses several attributes to distinguish females of Pima Indian heritage who are either healthy or have diabetes [48]. The sonar data set contains data obtained by bouncing sonar signals off a metal cylinder and rocks at various angles and under various conditions [49]. The transfusion data set has four attributes of blood donors, i.e., months since last donation, total number of donations, total blood donated, and months since first donation [50]. The wine data set uses constituents found in wines to distinguish three types of wine [51]. The wine quality (red) data set contains inputs from physicochemical tests that describe the red variant of the Portuguese "Vinho Verde" wine [52]. The yeast data set collects amino acid sequence information to predict the cellular localization sites of proteins [53].

Experimental Design

The experiment is designed for two purposes: (1) to examine the effectiveness of the proposed approach and (2) to compare the proposed approach with existing methods. The effectiveness of the proposed approach is examined by applying three MCDM methods to estimate the number of clusters for fifteen public-domain UCI machine learning data sets. The performances of the three MCDM methods are then compared with those of the ten relative measures presented in the previous section using the same sets of UCI data [54].

The experiment is carried out according to the following process:

Input. Fifteen UCI machine learning data sets.

Output. Rankings of different numbers of clusters for each data set by the MCDM methods and the relative measures.

Step 1. Prepare the data sets: remove the class labels and upload the data sets to Weka 3.6.

Step 2. Get clustering solutions using the k-means algorithm for all data sets.

Step 3. For each data set, run the k-means algorithm nine times, each time with a different number of clusters (i.e., from 2 to 10), and compute the ten selected relative measures for each clustering solution.

Step 4. For each data set, generate the optimal number of clusters determined by each relative measure.

Step 5. Twelve domain experts were asked to assign weights to the relative measures for each data set based on their experience. The scores range from 0 to 10 with increasing importance, and the averaged and normalized scores serve as the weights of the relative measures.

Step 6. Generate three rankings of different numbers of clusters using PROMETHEE II, WSM, and TOPSIS for the data sets. For each data set, different numbers of clusters are alternatives and the performances of the k-means algorithm on the relative measures are criteria. PROMETHEE II was implemented with the MCDM software D-Sight, and WSM and TOPSIS were implemented using MATLAB 7.0 [54]. If the top-three ranked numbers of clusters have very close ranking values (i.e., the difference between their values is less than 0.01), both the ranking order and the ranking values should be provided to the decision-maker.
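Step 5 reduces to a simple average-and-normalize computation, sketched below on a hypothetical score matrix; the expert scores shown are made up for illustration and are not the panel's actual assignments.

```python
import numpy as np

expert_scores = np.array([  # rows: experts, columns: relative measures (0-10 scale)
    [7, 5, 9, 6],
    [8, 4, 8, 5],
    [6, 6, 9, 7],
])
avg = expert_scores.mean(axis=0)   # average score per criterion
weights = avg / avg.sum()          # normalized weights used by the MCDM methods
print(weights)
```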

Table 3. Estimations of number of clusters by the relative measures.

https://doi.org/10.1371/journal.pone.0041713.t003

Table 4. Estimations of number of clusters by the MCDM methods.

https://doi.org/10.1371/journal.pone.0041713.t004


For each data set, nine different numbers of clusters (i.e., from 2 to 10) are used as alternatives in the MCDM methods due to the structures of these data sets (refer to Table 1). When the structure of a data set is unknown, reasonable numbers of clusters can be used as alternatives.

The 0–10 scale used by the domain experts indicates increasing importance of the criteria. A score of 0 indicates that the domain expert is not interested in that criterion, and a score of 10 indicates that the domain expert considers the criterion extremely important. A score of 5, the midpoint of the scale, indicates moderate importance. Domain experts can use scores 1 through 4 to represent importance between none and moderate, with increasing strength. Similarly, scores 6 through 9 represent importance between moderate and extreme, with increasing intensity. Since the weights of the criteria have an important impact on the final evaluation of alternatives, some MCDM software packages provide tools to facilitate sensitivity and robustness analyses. For instance, the D-Sight software allows the decision-maker to find the stability intervals of the criteria weights and observe the impact of a change of weight on the final ranking.

Experimental Results and Discussion

To illustrate the values and rankings generated by the MCDM methods for different numbers of clusters [55], Table 2 presents the yeast data set as an example. The number of classes provided by the UCI machine learning repository for yeast is ten. As can be seen from Table 2, PROMETHEE II finds the correct number of clusters for this data set. Both TOPSIS and WSM rank K = 9 as the best alternative and K = 10 as the second best.

Table 3 and Table 4 summarize the best-ranked numbers of clusters for all data sets produced by the ten relative measures and the three MCDM methods, respectively. Both tables have the same structure. The leftmost column lists the data sets and the rightmost column gives the number of classes provided by the UCI machine learning repository for each data set. The entries in the middle of Tables 3 and 4 show the optimal number of clusters for each data set determined by the relative measures and the MCDM methods, respectively. The correctly estimated numbers of clusters are highlighted in boldface and italics. Table 5 summarizes the number of correct determinations for the three MCDM methods and the ten relative measures.

A number of observations can be made based on the experimental study. First, the proposed approach is effective at estimating the optimal number of clusters in data: WSM, TOPSIS, and PROMETHEE II estimate the optimal numbers of clusters correctly for eight, nine, and eleven data sets, respectively. Second, the three MCDM methods outperform the ten existing relative measures considered in this study. The best performance among the relative measures (i.e., silhouette and Hubert) equals the worst performance among the three MCDM methods (i.e., WSM). Furthermore, as can be seen from Tables 3 and 4, the data sets that were missed by the MCDM methods were also missed by the relative measures, except the Parkinson's data set. Third, the estimates of the number of clusters for a given data set generated by different MCDM methods may vary. Fourth, there are situations in which the top-ranked numbers of clusters produced by the MCDM methods have very close ranking values. For instance, 9 and 10 were ranked by WSM as the best and the second-best choices for the yeast data set, respectively (Table 2), but the difference between their WSM scores is only 0.0023. In such a case, both 9 and 10 and their corresponding ranking values should be provided to the decision-maker.

Conclusions

Determining the number of clusters in a data set is intrinsically difficult because it is often a subjective process. This paper has proposed an MCDM-based approach for estimating the optimal number of clusters in a data set, which treats different numbers of clusters as alternatives and clustering validity measures as criteria. Different numbers of clusters are ranked according to the corresponding performances of clustering algorithms on the validity measures. The top-ranked number of clusters is the one with the best overall performance across all the selected validity measures.

The experiment is designed to examine the effectiveness of the proposed method and compare the new approach with existing methods using three MCDM methods (WSM, TOPSIS, and PROMETHEE II), the k-means clustering algorithm, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results demonstrate the effectiveness of the proposed approach in estimating the number of clusters. Specifically, WSM, TOPSIS, and PROMETHEE II estimate the optimal numbers of clusters correctly for eight, nine, and eleven data sets, respectively. The comparative study shows that the three MCDM methods outperform the ten existing relative measures considered in the present study: the best performance among the relative measures (i.e., silhouette and Hubert) equals the worst performance among the three MCDM methods (i.e., WSM).

MCDM methods normally require decision makers or domain experts to provide weights for the criteria involved in the decision problem. In this study, the proposed approach needs domain experts to assign weights to the relative measures. When an automatic decision process is required or criteria weights from domain experts are unavailable, a way to obtain the weights automatically is needed, which is one direction for future research. In addition, different MCDM methods may generate different rankings of the numbers of clusters; how to reconcile these differences is another future research avenue. This study only considers validity indices for crisp clustering. However, many real-life data sets have overlapping clusters whose boundaries are hard to define. A further direction for future work is therefore to introduce validity indices suitable for fuzzy clustering into the MCDM methods.

Acknowledgments

We thank Jun Li, Chen Lu, and Guoxun Wang for helpful discussions. We are grateful to the anonymous referees and Dr. Frank Emmert-Streib for their valuable and constructive comments.

Author Contributions

Conceived and designed the experiments: YP GK YZ YS. Performed the experiments: YZ YP. Analyzed the data: YP YZ. Wrote the paper: YP GK.

References

  1. Jain AK, Murty MN, Flynn PJ (1999) Data Clustering: a Review. ACM Computing Surveys 31: 264–323.
  2. Tan P, Steinbach M, Kumar V (2005) Introduction to Data Mining. Addison-Wesley.
  3. Marriott FHC (1971) Practical problems in a method of cluster analysis. Biometrics 27: 501–514.
  4. Hartigan JA (1975) Clustering Algorithms. Wiley.
  5. Milligan GW, Cooper C (1985) An Examination of Procedures for Determining the Number of Clusters in a Data Set. Psychometrika 50: 159–179.
  6. Krzanowski WJ, Lai YT (1988) A criterion for determining the number of groups in a data set using sum of squares clustering. Biometrics 44: 23–34.
  7. Kaufman L, Rousseeuw P (1990) Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.
  8. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B 63: 411–423.
  9. Dudoit S, Fridlyand J (2002) A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biology 3: research0036.1–0036.21.
  10. Sugar CA, James GM (2003) Finding the number of clusters in a dataset. Journal of the American Statistical Association 98: 750–763.
  11. Salvador S, Chan P (2004) Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. In: Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004). 576–584.
  12. Rokach L (2010) Ensemble-based classifiers. Artificial Intelligence Review 33: 1–39.
  13. Peng Y, Kou G, Wang G, Shi Y (2011) FAMCDM: A Fusion Approach of MCDM Methods to Rank Multiclass Classification Algorithms. Omega 39: 677–689. doi:10.1016/j.omega.2011.01.009.
  14. Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part I. ACM SIGMOD Record 31.
  15. Halkidi M, Batistakis Y, Vazirgiannis M (2002) Cluster validity methods: part II. ACM SIGMOD Record 31.
  16. Zadeh L (1963) Optimality and non-scalar-valued performance criteria. IEEE Transactions on Automatic Control 8: 59–60.
  17. Triantaphyllou E, Baig K (2005) The impact of aggregating benefit and cost criteria in four MCDA methods. IEEE Transactions on Engineering Management 52: 213–226.
  18. Triantaphyllou E (2000) Multi-Criteria Decision Making: A Comparative Study. Dordrecht, The Netherlands: Kluwer Academic Publishers. 320 p.
  19. Brans JP (1982) L'ingénièrie de la décision: Elaboration d'instruments d'aide à la décision. La méthode PROMETHEE. In: Nadeau R, Landry M, editors. L'aide à la décision: Nature, Instruments et Perspectives d'Avenir. Québec, Canada: Presses de l'Université Laval. 183–213.
  20. Brans JP, Mareschal B (2005) PROMETHEE methods. In: Figueira J, Mousseau V, Roy B, editors. Multiple Criteria Decision Analysis: State of the Art Surveys. New York: Springer. 163–195.
  21. Brans JP, Mareschal B (1994) How to decide with PROMETHEE. Available: http://www.visualdecision.com/Pdf/How%20to%20use%20PROMETHEE.pdf.
  22. Brans JP, Vincke P (1985) A Preference Ranking Organisation Method: (The PROMETHEE Method for Multiple Criteria Decision-Making). Management Science 31: 647–656.
  23. Hwang CL, Yoon K (1981) Multiple Attribute Decision Making: Methods and Applications. Berlin Heidelberg: Springer.
  24. Olson DL (2004) Comparison of weights in TOPSIS models. Mathematical and Computer Modelling 40: 721–727.
  25. Opricovic S, Tzeng GH (2004) Compromise solution by MCDM methods: A comparative analysis of VIKOR and TOPSIS. European Journal of Operational Research 156: 445–455.
  26. MacQueen JB (1967) Some Methods for Classification and Analysis of Multivariate Observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press. 281–297.
  27. Witten IH, Frank E (2005) Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. San Francisco: Morgan Kaufmann.
  28. Han J, Kamber M (2006) Data Mining: Concepts and Techniques, 2nd Edition. Morgan Kaufmann.
  29. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, et al. (2009) The WEKA Data Mining Software: An Update. SIGKDD Explorations 11: 10–18.
  30. Theodoridis S, Koutroumbas K (2008) Pattern Recognition, 4th Edition. Academic Press.
  31. Dunn JC (1973) A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. Journal of Cybernetics 3: 32–57.
  32. Davies DL, Bouldin DW (1979) A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1: 224–227.
  33. Chou CH, Su MC, Lai E (2004) A new cluster validity measure and its application to image compression. Pattern Analysis and Applications 7: 205–220.
  34. Rousseeuw PJ (1987) Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis. Journal of Computational and Applied Mathematics 20: 53–65.
  35. Pakhira MK, Bandyopadhyay S, Maulik U (2004) Validity index for crisp and fuzzy clusters. Pattern Recognition 37: 487–501.
  36. Hubert LJ, Levin JR (1976) A general statistical framework for assessing categorical clustering in free recall. Psychological Bulletin 83: 1072–1080.
  37. Vendramin L, Campello R, Hruschka E (2010) Relative Clustering Validity Criteria: A Comparative Overview. Statistical Analysis and Data Mining 3: 209–235.
  38. Frank A, Asuncion A (2010) UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
  39. Mangasarian OL, Wolberg WH (1990) Cancer diagnosis via linear programming. SIAM News 23: 1 & 18.
  40. Jossinet J (1996) Variability of impedivity in normal and pathological breast tissue. Medical & Biological Engineering & Computing 34: 346–350.
  41. Czerniak J, Zarzycki H (2003) Application of rough sets in the presumptive diagnosis of urinary system diseases. In: Artificial Intelligence and Security in Computing Systems, ACS'2002 9th International Conference Proceedings. Kluwer Academic Publishers. 41–51.
  42. Horton P, Nakai K (1996) A Probabilistic Classification System for Predicting the Cellular Localization Sites of Proteins. In: Intelligent Systems in Molecular Biology. 109–115.
  43. Evett IW, Spiehler EJ (1987) Rule Induction in Forensic Science. Technical report, Central Research Establishment, Home Office Forensic Science Service.
  44. Haberman SJ (1976) Generalized Residuals for Log-Linear Models. In: Proceedings of the 9th International Biometrics Conference, Boston. 104–122.
  45. Sigillito VG, Wing SP, Hutton LV, Baker KB (1989) Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest 10: 262–266.
  46. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, Part II: 179–188.
  47. Little MA, McSharry PE, Roberts SJ, Costello DAE, Moroz IM (2007) Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection. BioMedical Engineering OnLine 6: 23.
  48. Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the Symposium on Computer Applications and Medical Care. 261–265.
  49. Gorman RP, Sejnowski TJ (1988) Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets. Neural Networks 1: 75–89.
  50. Yeh I, Yang K, Ting T (2008) Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications.
  51. Aeberhard S, Coomans D, de Vel O (1992) Comparison of Classifiers in High Dimensional Settings. Tech. Rep. no. 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.
  52. Cortez P, Cerdeira A, Almeida F, Matos T, Reis J (2009) Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems 47: 547–553.
  53. Nakai K, Kanehisa M (1992) A Knowledge Base for Predicting Protein Localization Sites in Eukaryotic Cells. Genomics 14: 897–911.
  54. MATLAB (2005) The MathWorks, Inc., Natick, MA 01760. http://www.mathworks.com/products/matlab/.
  55. Peng Y, Kou G, Wang G, Wu W, Shi Y (2011) Ensemble of software defect predictors: an AHP-based evaluation method. International Journal of Information Technology & Decision Making 10: 187–206.
  56. Peng Y, Kou G, Shi Y, Chen Z (2008) A Descriptive Framework for the Field of Data Mining and Knowledge Discovery. International Journal of Information Technology & Decision Making 7: 639–682.