Abstract
Many KDD systems incorporate an implicit or explicit preference for simpler models, but this use of “Occam's razor” has been strongly criticized by several authors (e.g., Schaffer, 1993; Webb, 1996). This controversy arises partly because Occam's razor has been interpreted in two quite different ways. The first interpretation (simplicity is a goal in itself) is essentially correct, but is at heart a preference for more comprehensible models. The second interpretation (simplicity leads to greater accuracy) is much more problematic. A critical review of the theoretical arguments for and against it shows that it is unfounded as a universal principle, and demonstrably false. A review of empirical evidence shows that it also fails as a practical heuristic. This article argues that its continued use in KDD risks causing significant opportunities to be missed, and should therefore be restricted to the comparatively few applications where it is appropriate. The article proposes and reviews the use of domain constraints as an alternative for avoiding overfitting, and examines possible methods for handling the accuracy–comprehensibility trade-off.
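The abstract's claim that "simplicity leads to greater accuracy" fails as a universal principle can be made concrete with a small worked example. The sketch below is hypothetical and not taken from the article: data are generated by a genuinely quadratic process, and the simpler linear model, fit by ordinary least squares, generalizes worse than the more complex quadratic model (fit by solving the 3x3 normal equations with Cramer's rule).

```python
import random

# Hypothetical illustration (not from the article): when the true
# structure is quadratic, preferring the simpler linear model REDUCES
# out-of-sample accuracy.

random.seed(0)

def make_data(n):
    # y = x^2 plus small Gaussian noise, x uniform on [0, 2]
    xs = [random.uniform(0.0, 2.0) for _ in range(n)]
    ys = [x * x + random.gauss(0.0, 0.05) for x in xs]
    return xs, ys

def fit_linear(xs, ys):
    # ordinary least squares for y = a + b*x (closed form)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return lambda x: a + b * x

def det3(m):
    # determinant of a 3x3 matrix by cofactor expansion
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
          - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
          + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(xs, ys):
    # least squares for y = a + b*x + c*x^2: solve the normal
    # equations (X'X) theta = X'y via Cramer's rule
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)
    sy = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))
    A = [[n, s(1), s(2)], [s(1), s(2), s(3)], [s(2), s(3), s(4)]]
    rhs = [sy(0), sy(1), sy(2)]
    d = det3(A)
    theta = []
    for i in range(3):
        Ai = [row[:] for row in A]
        for r in range(3):
            Ai[r][i] = rhs[r]
        theta.append(det3(Ai) / d)
    a, b, c = theta
    return lambda x: a + b * x + c * x * x

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

train_x, train_y = make_data(50)
test_x, test_y = make_data(200)

lin_mse = mse(fit_linear(train_x, train_y), test_x, test_y)
quad_mse = mse(fit_quadratic(train_x, train_y), test_x, test_y)
# here the more complex (quadratic) model has lower test error
```

This does not contradict the first interpretation of the razor discussed in the abstract: the linear model is still easier to read. It only shows that simplicity by itself is no guarantee of accuracy when the simpler hypothesis space does not contain the true structure.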
References
Abu-Mostafa, Y.S. 1989. Learning from hints in neural networks. Journal of Complexity, 6:192-198.
Akaike, H. 1978. A Bayesian analysis of the minimum AIC procedure. Annals of the Institute of Statistical Mathematics, 30A:9-14.
Andrews, R. and Diederich, J. (Eds.). 1996. Proceedings of the NIPS-96 Workshop on Rule Extraction from Trained Artificial Neural Networks, Snowmass, CO: NIPS Foundation.
Bernardo, J.M. and Smith, A.F.M. 1994. Bayesian Theory. New York, NY: Wiley.
Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Oxford, UK: Oxford University Press.
Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M.K. 1987. Occam's razor. Information Processing Letters, 24:377-380.
Breiman, L. 1996. Bagging predictors. Machine Learning, 24:123-140.
Breiman, L. and Shang, N. 1997. Born again trees. Technical Report, Berkeley, CA: Statistics Department, University of California at Berkeley.
Brunk, C., Kelly, J., and Kohavi, R. 1997. MineSet: An integrated system for data mining. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 135-138.
Cestnik, B. and Bratko, I. 1988. Learning redundant rules in noisy domains. Proceedings of the Eighth European Conference on Artificial Intelligence, Munich, Germany: Pitman, pp. 348-356.
Cheeseman, P. 1990. On finding the most probable model. In Computational Models of Scientific Discovery and Theory Formation, J. Shrager and P. Langley (Eds.). San Mateo, CA: Morgan Kaufmann, pp. 73-95.
Chickering, D.M. and Heckerman, D. 1997. Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables. Machine Learning, 29:181-212.
Clark, P. and Matwin, S. 1993. Using qualitative models to guide inductive learning. Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA: Morgan Kaufmann, pp. 49-56.
Clearwater, S. and Provost, F. 1990. RL4: A tool for knowledge-based induction. Proceedings of the Second IEEE International Conference on Tools for Artificial Intelligence, San Jose, CA: IEEE Computer Society Press, pp. 24-30.
Cohen, W.W. 1994. Grammatically biased learning: Learning logic programs using an explicit antecedent description language. Artificial Intelligence, 68:303-366.
Cohen, W.W. 1995. Fast effective rule induction. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 115-123.
Cooper, G.F. 1997. A simple constraint-based algorithm for efficiently mining observational databases for causal relationships. Data Mining and Knowledge Discovery, 1:203-224.
Cover, T.M. and Thomas, J.A. 1991. Elements of Information Theory. New York, NY: Wiley.
Craven, M.W. 1996. Extracting comprehensible models from trained neural networks. Unpublished doctoral dissertation, Department of Computer Sciences, University of Wisconsin—Madison, Madison, WI.
Datta, P. and Kibler, D. 1995. Learning prototypical concept descriptions. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 158-166.
Djoko, S., Cook, D.J., and Holder, L.B. 1995. Analyzing the benefits of domain knowledge in substructure discovery. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 75-80.
Domingos, P. 1996a. Two-way induction. International Journal on Artificial Intelligence Tools, 5:113-125.
Domingos, P. 1996b. Unifying instance-based and rule-based induction. Machine Learning, 24:141-168.
Domingos, P. 1997a. Knowledge acquisition from examples via multiple models. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 98-106.
Domingos, P. 1997b. Why does bagging work? A Bayesian account and its implications. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 155-158.
Domingos, P. 1998a. A process-oriented heuristic for model selection. Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI: Morgan Kaufmann, pp. 127-135.
Domingos, P. 1998b. When (and how) to combine predictive and causal learning. Proceedings of the NIPS-98 Workshop on Integrating Supervised and Unsupervised Learning, Breckenridge, CO: NIPS Foundation.
Domingos, P. 1999. Process-oriented estimation of generalization error. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Stockholm, Sweden: Morgan Kaufmann.
Domingos, P. and Pazzani, M. 1997. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103-130.
Donoho, S. and Rendell, L. 1996. Constructive induction using fragmentary knowledge. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 113-121.
Drucker, H., Cortes, C., Jackel, L.D., LeCun, Y., and Vapnik, V. 1994. Boosting and other machine learning algorithms. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 53-61.
Edgington, E.S. 1980. Randomization Tests. New York, NY: Marcel Dekker.
Elomaa, T. 1994. In defense of C4.5: Notes on learning one-level decision trees. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 62-69.
Fisher, D.H. and Schlimmer, J.C. 1988. Concept simplification and prediction accuracy. Proceedings of the Fifth International Conference on Machine Learning, Ann Arbor, MI: Morgan Kaufmann, pp. 22-28.
Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy: Morgan Kaufmann, pp. 148-156.
Friedman, J.H. 1997. On bias, variance, 0/1—loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1:55-77.
Gams, M. 1989. New measurements highlight the importance of redundant knowledge. Proceedings of the Fourth European Working Session on Learning, Montpellier, France: Pitman, pp. 71-79.
Geman, S., Bienenstock, E., and Doursat, R. 1992. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58.
Grove, A.J. and Schuurmans, D. 1998. Boosting in the limit: Maximizing the margin of learned ensembles. Proceedings of the Fifteenth National Conference on Artificial Intelligence, Madison, WI: AAAI Press, pp. 692-699.
Han, J., Fu, Y., Wang, W., Chiang, J., Gong, W., Koperski, K., Li, D., Lu, Y., Rajan, A., Stefanovic, N., Xia, B., and Zaiane, O. 1996. DBMiner: A system for mining knowledge in large relational databases. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 250-255.
Hasling, D.W., Clancey, W.J., and Rennels, G. 1984. Strategic explanations for a diagnostic consultation system. Developments in Expert Systems, M.J. Coombs (Ed.), London, UK: Academic Press, pp. 117-133.
Haussler, D. 1988. Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221.
Heckerman, D., Geiger, D., and Chickering, D.M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.
Holte, R.C. 1993. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11:63-91.
Imielinski, T., Virmani, A., and Abdulghani, A. 1996. DataMine: Application programming interface and query language for database mining. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 256-262.
Jensen, D. 1992. Induction with Randomization Testing: Decision-Oriented Analysis of Large Data Sets. Unpublished doctoral dissertation, Washington University, Saint Louis, MO.
Jensen, D. and Cohen, P.R. 1999. Multiple comparisons in induction algorithms. Machine Learning, to appear.
Jensen, D. and Schmill, M. 1997. Adjusting for multiple comparisons in decision tree pruning. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 195-198.
Joachims, T. 1998. Text categorization with support vector machines: Learning with many relevant features. Proceedings of the Tenth European Conference on Machine Learning, Chemnitz, Germany: Springer-Verlag.
Kamber, M., Han, J., and Chiang, J.Y. 1997. Metarule-guided mining of multi-dimensional association rules using data cubes. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 207-210.
Kohavi, R. and Kunz, C. 1997. Option decision trees with majority votes. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 161-169.
Kohavi, R. and Sommerfield, D. 1998. Targeting business users with decision table classifiers. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 249-253.
Kong, E.B. and Dietterich, T.G. 1995. Error-correcting output coding corrects bias and variance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 313-321.
Kononenko, I. 1990. Comparison of inductive and naive Bayesian learning approaches to automatic knowledge acquisition. In Current Trends in Knowledge Acquisition, B. Wielinga (Ed.). Amsterdam, The Netherlands: IOS Press.
Langley, P. 1996. Induction of condensed determinations. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR: AAAI Press, pp. 327-330.
Lawrence, S., Giles, C.L., and Tsoi, A.C. 1997. Lessons in neural network training: Overfitting may be harder than expected. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 540-545.
Lee, Y., Buchanan, B.G., and Aronis, J.M. 1998. Knowledge-based learning in exploratory science: Learning rules to predict rodent carcinogenicity. Machine Learning, 30:217-240.
Liu, B., Hsu, W., and Chen, S. 1997. Using general impressions to analyze discovered classification rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 31-36.
MacKay, D. 1992. Bayesian interpolation. Neural Computation, 4:415-447.
Maclin, R. and Opitz, D. 1997. An empirical evaluation of bagging and boosting. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press.
Maclin, R. and Shavlik, J. 1996. Creating advice-taking reinforcement learners. Machine Learning, 22:251-281.
Meo, R., Psaila, G., and Ceri, S. 1996. A new SQL-like operator for mining association rules. Proceedings of the Twenty-Second International Conference on Very Large Databases, Bombay, India: Morgan Kaufmann, pp. 122-133.
Miller, Jr., R.G. 1981. Simultaneous Statistical Inference, 2nd ed. New York, NY: Springer-Verlag.
Mingers, J. 1989. An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4:227-243.
Mitchell, T.M. 1980. The need for biases in learning generalizations, Technical report, New Brunswick, NJ: Computer Science Department, Rutgers University.
Murphy, P. and Pazzani, M. 1994. Exploring the decision forest: An empirical investigation of Occam's razor in decision tree induction. Journal of Artificial Intelligence Research, 1:257-275.
Murthy, S. and Salzberg, S. 1995. Lookahead and pathology in decision tree induction. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1025-1031.
Nédellec, C., Rouveirol, C., Adé, H., Bergadano, F., and Tausend, B. 1996. Declarative bias in ILP. In Advances in Inductive Logic Programming, L. de Raedt (Ed.). Amsterdam, The Netherlands: IOS Press, pp. 82-103.
Oates, T. and Jensen, D. 1998. Large datasets lead to overly complex models: An explanation and a solution. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 294-298.
Ourston, D. and Mooney, R.J. 1994. Theory refinement combining analytical and empirical methods. Artificial Intelligence, 66:273-309.
Padmanabhan, B. and Tuzhilin, A. 1998. A belief-driven method for discovering unexpected patterns. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, New York, NY: AAAI Press, pp. 94-100.
Pazzani, M., Mani, S., and Shankle, W.R. 1997. Beyond concise and colorful: Learning intelligible rules. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 235-238.
Pazzani, M.J. 1991. Influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17:416-432.
Pearl, J. 1978. On the connection between the complexity and credibility of inferred models. International Journal of General Systems, 4:255-264.
Piatetsky-Shapiro, G. 1996. Editorial comments. KDD Nuggets, 96:28.
Provost, F. and Jensen, D. 1998. KDD-98 Tutorial on Evaluating Knowledge Discovery and Data Mining. New York, NY: AAAI Press.
Quinlan, J.R. 1996. Bagging, boosting, and C4.5. Proceedings of the Thirteenth National Conference on Artificial Intelligence, Portland, OR: AAAI Press, pp. 725-730.
Quinlan, J.R. and Cameron-Jones, R.M. 1995. Oversearching and layered search in empirical learning. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada: Morgan Kaufmann, pp. 1019-1024.
Quinlan, J.R. and Rivest, R.L. 1989. Inferring decision trees using the minimum description length principle. Information and Computation, 80:227-248.
Rao, J.S. and Potts, W.J.E. 1997. Visualizing bagged decision trees. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 243-246.
Rao, R.B., Gordon, D., and Spears, W. 1995. For every action, is there really an equal and opposite reaction? Analysis of the conservation law for generalization performance. Proceedings of the Twelfth International Conference on Machine Learning, Tahoe City, CA: Morgan Kaufmann, pp. 471-479.
Rissanen, J. 1978. Modeling by shortest data description. Automatica, 14:465-471.
Russell, S.J. 1986. Preliminary steps towards the automation of induction. Proceedings of the Fifth National Conference on Artificial Intelligence, Philadelphia, PA: AAAI Press, pp. 477-484.
Schaffer, C. 1993. Overfitting avoidance as bias. Machine Learning, 10:153-178.
Schaffer, C. 1994. A conservation law for generalization performance. Proceedings of the Eleventh International Conference on Machine Learning, New Brunswick, NJ: Morgan Kaufmann, pp. 259-265.
Schapire, R.E., Freund, Y., Bartlett, P., and Lee, W.S. 1997. Boosting the margin: A new explanation for the effectiveness of voting methods. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann.
Schölkopf, B., Burges, C., and Smola, A. 1998. Advances in Kernel Methods: Support Vector Machines. Cambridge, MA: MIT Press.
Schölkopf, B., Burges, C., and Vapnik, V. 1995. Extracting support data for a given task. Proceedings of the First International Conference on Knowledge Discovery and Data Mining, Montréal, Canada: AAAI Press, pp. 252-257.
Schuurmans, D. 1997. A new metric-based approach to model selection. Proceedings of the Fourteenth National Conference on Artificial Intelligence, Providence, RI: AAAI Press, pp. 552-558.
Schuurmans, D., Ungar, L.H., and Foster, D.P. 1997. Characterizing the generalization performance of model selection strategies. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 340-348.
Schwarz, G. 1978. Estimating the dimension of a model. Annals of Statistics, 6:461-464.
Shawe-Taylor, J., Bartlett, P.L., Williamson, R.C., and Anthony, M. 1996. Structural risk minimization over data-dependent hierarchies, Technical report No. NC-TR-96-053, Egham, UK: Department of Computer Science, Royal Holloway, University of London.
Shen, W.-M., Ong, K., Mitbander, B., and Zaniolo, C. 1996. Metaqueries for data mining. In Advances in Knowledge Discovery and Data Mining, U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.). Menlo Park, CA: AAAI Press, pp. 375-398.
Smola, A., Bartlett, P., Schölkopf, B., and Schuurmans, D. (Eds.). 1998. Proceedings of the NIPS-98 Workshop on Large Margin Classifiers, Breckenridge, CO: NIPS Foundation.
Srikant, R., Vu, Q., and Agrawal, R. 1997. Mining association rules with item constraints. Proceedings of the Third International Conference on Knowledge Discovery and Data Mining, Newport Beach, CA: AAAI Press, pp. 67-73.
Todorovski, L. and Džeroski, S. 1997. Declarative bias in equation discovery. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, TN: Morgan Kaufmann, pp. 376-384.
Tornay, S.C. 1938. Ockham: Studies and Selections. La Salle, IL: Open Court.
Vapnik, V.N. 1995. The Nature of Statistical Learning Theory. New York, NY: Springer-Verlag.
Wallace, C.S. and Boulton, D.M. 1968. An information measure for classification. Computer Journal, 11:185-194.
Webb, G.I. 1996. Further experimental evidence against the utility of Occam's razor. Journal of Artificial Intelligence Research, 4:397-417.
Webb, G.I. 1997. Decision tree grafting. Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan: Morgan Kaufmann, pp. 846-851.
Wolpert, D. 1992. Stacked generalization. Neural Networks, 5:241-259.
Wolpert, D. 1996. The lack of a priori distinctions between learning algorithms. Neural Computation, 8:1341-1390.
Domingos, P. The Role of Occam's Razor in Knowledge Discovery. Data Mining and Knowledge Discovery 3, 409–425 (1999). https://doi.org/10.1023/A:1009868929893