Abstract
Evaluating the performance of a learning algorithm is one of the basic tasks in machine learning and data science. In this chapter, we review commonly used performance measures and discuss their properties. We show that different measures focus on different aspects of the algorithm; therefore, a learning algorithm is typically evaluated with respect to several criteria. We introduce conceptual tools and provide practical guidelines for the quality assessment of fully trained algorithms. We focus our attention on classification problems, drawing connections to basic concepts in statistics, engineering, and other disciplines. We also discuss regression problems, studying popular residual-based measures. Finally, we suggest that evaluation criteria should also be considered during the design of the algorithm: in this view, the desired criteria determine the objective function before the algorithm is trained. These design considerations are discussed, and several approaches to the problem are introduced.
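As a brief, hedged illustration of the kinds of measures the chapter surveys, the following minimal Python sketch (an illustration only, not the chapter's own code) computes classification accuracy and two popular residual-based regression measures on toy data:

    # Minimal sketch (illustrative only): a few of the performance
    # measures discussed in the chapter, computed with plain Python.

    def accuracy(y_true, y_pred):
        # Fraction of correctly classified examples.
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    def mse(y_true, y_pred):
        # Mean squared error: a residual-based regression measure.
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

    def mae(y_true, y_pred):
        # Mean absolute error: penalizes large residuals less than MSE.
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

    # Toy labels and predictions for a binary classifier and a regressor.
    print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))   # 0.75
    print(mse([2.0, 0.5, 1.0], [1.5, 0.0, 2.0]))  # 0.5
    print(mae([2.0, 0.5, 1.0], [1.5, 0.0, 2.0]))  # 0.666...

Note that accuracy and the residual-based measures summarize different aspects of performance, which is precisely why the chapter advocates evaluating an algorithm against several criteria.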
Copyright information
© 2023 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Painsky, A. (2023). Quality Assessment and Evaluation Criteria in Supervised Learning. In: Rokach, L., Maimon, O., Shmueli, E. (eds) Machine Learning for Data Science Handbook. Springer, Cham. https://doi.org/10.1007/978-3-031-24628-9_9
DOI: https://doi.org/10.1007/978-3-031-24628-9_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24627-2
Online ISBN: 978-3-031-24628-9
eBook Packages: Mathematics and Statistics (R0)