Abstract
Accurate estimation of class membership probabilities is needed for many applications in data mining and decision-making, to which multiclass classification is often applied. Existing estimation methods are designed for binary classification, where only the single score output by a classifier can be used; applying them to multiclass classification therefore requires both decomposing the multiclass classifier into binary classifiers and combining the estimates obtained from each binary classifier into the target estimate. We propose a simple and general method for directly estimating the class membership probability of any class in multiclass classification, without decomposition or combination, using multiple scores: not only the score of the predicted class but also those of other suitable classes. To make it possible to use multiple scores, we modify or extend representative existing methods. As a non-parametric method, drawing on the idea of the binning method proposed by Zadrozny et al., we construct an “accuracy table” in a different way, and we smooth the accuracies in the table, for example with a moving average, to yield reliable probabilities (accuracies). As a parametric method, we extend Platt’s method to multiple (i.e., multinomial) logistic regression. On two different datasets (open-ended data from Japanese social surveys and the 20 Newsgroups), with both support vector machine and naive Bayes classifiers, we show empirically that using multiple scores improves the estimation of class membership probabilities in multiclass classification in terms of cross entropy, the reliability diagram, the ROC curve, and AUC (area under the ROC curve), and that the proposed smoothing method for the accuracy table works quite well. Finally, we show empirically that, in terms of MSE (mean squared error), our best proposed method outperforms a multiclass extension of the PAV method proposed by Zadrozny et al. on both the 20 Newsgroups and Pendigits datasets, but is slightly worse on Pendigits than the state-of-the-art method, a multiclass extension of a combination of boosting and the PAV method.
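To fix ideas, the following is a minimal sketch of the two calibration strategies the abstract outlines. It is not the authors' implementation: it assumes the "multiple scores" used for binning are the top two classifier scores (the paper's actual choice may differ), it uses equal-frequency bins, and all function names are illustrative.

```python
# Hypothetical sketch of the two calibration strategies in the abstract --
# NOT the authors' code. Assumptions: the "multiple scores" are the top two
# classifier scores per example; bins are equal-frequency; names are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

def accuracy_table(scores, y, n_bins=10):
    """Non-parametric method: bin held-out examples by their highest and
    second-highest scores, and record the empirical accuracy per 2-D bin."""
    top = scores.max(axis=1)
    second = np.sort(scores, axis=1)[:, -2]
    correct = scores.argmax(axis=1) == y
    # equal-frequency bin edges along each score axis
    edges_t = np.quantile(top, np.linspace(0, 1, n_bins + 1))
    edges_s = np.quantile(second, np.linspace(0, 1, n_bins + 1))
    i = np.clip(np.searchsorted(edges_t, top) - 1, 0, n_bins - 1)
    j = np.clip(np.searchsorted(edges_s, second) - 1, 0, n_bins - 1)
    hits = np.zeros((n_bins, n_bins))
    counts = np.zeros((n_bins, n_bins))
    np.add.at(hits, (i, j), correct)
    np.add.at(counts, (i, j), 1)
    # empirical accuracy per bin; empty bins stay NaN
    table = np.divide(hits, counts, out=np.full((n_bins, n_bins), np.nan),
                      where=counts > 0)
    return table, (edges_t, edges_s)

def smooth_table(table, k=1):
    """Smooth the accuracy table with a (2k+1)x(2k+1) moving average so that
    sparsely populated bins borrow strength from their neighbours."""
    n_r, n_c = table.shape
    out = np.full_like(table, np.nan)
    for a in range(n_r):
        for b in range(n_c):
            window = table[max(a - k, 0):a + k + 1, max(b - k, 0):b + k + 1]
            if not np.all(np.isnan(window)):
                out[a, b] = np.nanmean(window)
    return out

def fit_multiclass_platt(scores, y):
    """Parametric method: fit a multiple (multinomial) logistic regression
    from the full score vectors to the true labels on held-out data. With
    scikit-learn's default lbfgs solver, a multinomial model is fitted for
    multiclass targets; predict_proba then gives membership probabilities."""
    return LogisticRegression(max_iter=1000).fit(scores, y)
```

At test time, under these assumptions, the probability that an example's predicted class is correct would be read from the smoothed table cell its two scores fall into (non-parametric), or taken from `predict_proba` of the fitted regression (parametric).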
References
Agui T, Nakajima M (1991) Graphical information processing. Morikita Press, Tokyo
Asuncion A, Newman DJ (2007) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences
Bennett PN (2000) Assessing the calibration of naive Bayes’s posterior estimates. In: Technical Report CMU-CS-00-155, School of Computer Science, Carnegie Mellon University, pp 1–8
Caruana R, Niculescu-Mizil A (2005) Predicting good probabilities with supervised learning. In: Proceedings of the American Meteorological Society conference (AMS2005), San Diego
Chan YS, Ng HT (2006) Estimating class priors in domain adaptation for word sense disambiguation. In: Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the ACL (ICCL’06 and ACL’06), pp 89–96
Cheeseman P, Stutz J (1995) Bayesian classification (AutoClass): theory and results. In: Fayyad UM et al (eds) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, pp 61–83
Devarakota PR, Mirbach B, Ottersten B (2007) Confidence estimation in classification decision: a method for detecting unseen patterns. In: Proceedings of the sixth international conference on advances in pattern recognition (ICAPR), Kolkata, India
Fragoudis D, Meretakis D, Likothanassis S (2007) Best terms: an efficient feature-selection algorithm for text categorization. Knowl Inf Syst (KAIS) 8(1): 16–33
Groves RM, Fowler FJ Jr, Couper MP, Lepkowski JM, Singer E, Tourangeau R (2004) Survey methodology. Wiley, Hoboken
Iwai N, Sato H (eds) (2002) Japanese values and behavioral patterns in JGSS. Yuhikaku Publishing, Tokyo
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: Proceedings of the tenth European conference on machine learning (ECML’98), pp 137–142
Jones R, Rey B, Madani O, Griner W (2006) Generating query substitutions. In: Proceedings of the 15th international world wide web conference (WWW’06), pp 387–396
Kita K (1999) Language and computing, volume 4: probabilistic language model. University of Tokyo Press, Tokyo
Kogure A, Sagae M (2005) Estimating probability density from percentiles. Stat Math 53(2): 375–389
Kressel U (1999) Pairwise classification and support vector machines. In: Schölkopf B et al (eds) Advances in kernel methods: support vector learning. MIT Press, Cambridge, pp 255–268
Langford J, Zadrozny B (2005) Estimating class membership probabilities using classifier learners. In: Cowell RG, Ghahramani Z (eds) Proceedings of AISTATS05, Society for Artificial Intelligence and Statistics, pp 198–205
Margineantu DD (2002) Class probability estimation and cost-sensitive classification decisions. In: Proceedings of the 13th European conference on machine learning (ECML’02), pp 270–281
Miwa S, Kobayashi D (eds) (2008) 2005 SSM survey series, no. 1: basic analysis of the 2005 SSM survey in Japan. 2005 SSM Survey Research Group
Niculescu-Mizil A, Caruana R (2005a) Predicting good probabilities with supervised learning. In: Proceedings of the 22nd international conference on machine learning (ICML’05), pp 625–632
Niculescu-Mizil A, Caruana R (2005b) Obtaining calibrated probabilities from boosting. In: Proceedings of the 21st international conference on uncertainty in artificial intelligence (UAI’05), pp 413–420
Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2/3): 103–134
Ohkura T, Kiyota K, Nakagawa H (2006) Browsing system for weblog articles based on automated folksonomy. In: Proceedings of the third annual workshop on the weblogging ecosystem (WWE2006), Edinburgh
Perlich C, Provost FJ, Simonoff JS (2003) Tree induction vs. logistic regression: a learning-curve analysis. J Mach Learn Res 4: 211–255
Platt JC (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In: Smola AJ et al (eds) Advances in large margin classifiers. MIT Press, Cambridge, pp 1–11
Provost FJ, Domingos P (2000) Well-trained PETs: improving probability estimation trees. CeDER Working Paper, #IS-00-04, Stern School of Business, New York University
Provost FJ, Domingos P (2003) Tree induction for probability-based ranking. Mach Learn 52(3): 199–215
Rennie J, Rifkin R (2001) Improving multiclass text classification with the support vector machine. Technical Report AIM-2001-026, MIT Artificial Intelligence Laboratory, Cambridge
Saar-Tsechansky M, Provost FJ (2004) Active sampling for class probability estimation and ranking. Mach Learn 54: 153–178
Sakamoto Y, Ishiguro M, Kitagawa G (1983) Akaike information criterion statistics. Kyoritsu Press, Tokyo
Schohn G, Cohn D (2000) Less is more: active learning with support vector machines. In: Proceedings of the 17th international conference on machine learning (ICML’00), pp 839–846
Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1): 1–47
Takahashi K, Takamura H, Okumura M (2005a) Automatic occupation coding with combination of machine learning and hand-crafted rules. In: Proceedings of the ninth Pacific-Asia conference on knowledge discovery and data mining (PAKDD’05), pp 269–279
Takahashi K, Suyama A, Murayama N, Takamura H, Okumura M (2005b) Applying occupation coding supporting system for coders (NANACO) in JGSS-2003. In: Japanese values and behavioral patterns seen in JGSS in 2003, the IRS at Osaka University of Commerce, pp 225–242
Takahashi K (2008) Automated coding in social surveys. In: Tanioka I, Nitta M, Iwai N (eds) Values and behavioral patterns in Japan. University of Tokyo Press, Tokyo, pp 459–471
Tanioka I, Nitta M, Iwai N (eds) (2008) Values and behavioral patterns in Japan. University of Tokyo Press, Tokyo
Tsuruoka Y, Tsujii J (2003) Training a naive Bayes classifier via EM algorithm with a class distribution constraint. In: Proceedings of the seventh conference on natural language learning (CoNLL), pp 127–134
Vapnik V (1998) Statistical learning theory. Wiley, New York
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Yu PS, Zhou Z-H, Steinbach M, Hand DJ, Steinberg D (2008) Top 10 algorithms in data mining. Knowl Inf Syst (KAIS) 14(1): 1–37
Zadrozny B, Elkan C (2001a) Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Proceedings of the 18th international conference on machine learning (ICML’01), pp 609–616
Zadrozny B, Elkan C (2001b) Learning and making decisions when costs and probabilities are both unknown. In: Proceedings of the seventh international conference on knowledge discovery and data mining (KDD’01), pp 204–213
Zadrozny B (2002) Reducing multiclass to binary by coupling probability estimates. In: Advances in neural information processing systems (NIPS’01)
Zadrozny B, Elkan C (2002) Transforming classifier scores into accurate multiclass probability estimates. In: Proceedings of the 8th international conference on knowledge discovery and data mining (KDD’02), pp 694–699
Cite this article
Takahashi, K., Takamura, H. & Okumura, M. Direct estimation of class membership probabilities for multiclass classification using multiple scores. Knowl Inf Syst 19, 185–210 (2009). https://doi.org/10.1007/s10115-008-0165-z