CONSTRUCTION OF DECISION TREES USING DATA CUBE

  • Conference paper
Enterprise Information Systems VII

Abstract

Data classification is an important problem in data mining. Traditional classification algorithms based on decision trees are widely used because they build models quickly and produce models that are easy to understand. However, existing decision tree algorithms must recursively partition the dataset into subsets according to some splitting criterion; that is, they must repeatedly compute the records belonging to a node (called F-sets) and then compute the splits for that node. For large datasets this requires multiple passes over the original data and is therefore often infeasible. In this paper we present a new approach to constructing decision trees using a pre-computed data cube. We use statistics trees to compute the data cube and then build a decision tree on top of it. Mining aggregated data stored in a data cube is much more efficient than mining flat data files or relational databases directly. Since a data cube server is usually a required component of an analytical system for answering OLAP queries, we essentially provide “free” classification by eliminating the dominant I/O overhead of scanning the massive original dataset. Our new algorithm generates trees with the same prediction accuracy as existing decision tree algorithms such as SPRINT and RainForest, but improves performance significantly. We also give a system architecture that integrates DBMS, OLAP, and data mining seamlessly.
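
The core idea of the abstract, choosing splits from aggregated counts instead of rescanning records, can be illustrated with a minimal sketch. This is not the paper's algorithm: a plain in-memory `Counter` stands in for the data cube (the paper uses statistics trees), the split criterion is a simplified multiway Gini split, and all names (`build_cube`, `class_counts`, `best_split`, `build_tree`) and the toy records are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cube(records, attrs, label):
    """Single pass over the raw records: count every group-by cell per class.

    Assumption: the full cube fits in memory; a real system would use a
    dedicated cube structure instead of a Counter.
    """
    cube = Counter()
    for row in records:
        for k in range(len(attrs) + 1):
            for subset in combinations(attrs, k):
                key = tuple((a, row[a]) for a in subset)
                cube[(key, row[label])] += 1
    return cube


def class_counts(cube, conditions, attrs, classes):
    """Class histogram of a tree node, answered by cube lookups only."""
    key = tuple((a, conditions[a]) for a in attrs if a in conditions)
    return {c: cube[(key, c)] for c in classes}


def gini(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((v / n) ** 2 for v in counts.values())


def best_split(cube, conditions, attrs, domains, classes):
    """Return (attribute, weighted Gini) of the best unused attribute, or None."""
    n = sum(class_counts(cube, conditions, attrs, classes).values())
    best = None
    for a in attrs:
        if a in conditions:
            continue
        score = 0.0
        for v in domains[a]:
            child = class_counts(cube, {**conditions, a: v}, attrs, classes)
            score += sum(child.values()) / n * gini(child)
        if best is None or score < best[1]:
            best = (a, score)
    return best


def build_tree(cube, attrs, domains, classes, conditions=None):
    """Grow the tree recursively without touching the original records."""
    conditions = conditions or {}
    counts = class_counts(cube, conditions, attrs, classes)
    majority = max(classes, key=lambda c: counts[c])
    if gini(counts) == 0.0:
        return majority  # pure node becomes a leaf
    split = best_split(cube, conditions, attrs, domains, classes)
    if split is None:
        return majority  # no attribute left to split on
    attr = split[0]
    children = {}
    for v in domains[attr]:
        child_cond = {**conditions, attr: v}
        if sum(class_counts(cube, child_cond, attrs, classes).values()) > 0:
            children[v] = build_tree(cube, attrs, domains, classes, child_cond)
    return (attr, children)


# Toy data, invented for this example.
records = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no", "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
attrs = ["outlook", "windy"]
classes = ["no", "yes"]
domains = {a: sorted({r[a] for r in records}) for a in attrs}

cube = build_cube(records, attrs, "play")  # the only scan of the raw data
tree = build_tree(cube, attrs, domains, classes)
print(tree)
```

In this sketch the records are read exactly once, when the cube is built. Every later step, including the class histogram of each node and the score of each candidate split, is a lookup of pre-aggregated counts, which is the property the abstract relies on to avoid repeated passes over the original data.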

References

  • Agarwal, S., R. Agrawal, et al. (1996). On The Computation of Multidimensional Aggregates. Proceedings of the International Conference on Very Large Databases, Mumbai (Bombay), India: 506–521.

  • Beyer, K. and R. Ramakrishnan (1999). Bottom-Up Computation of Sparse and Iceberg CUBEs. Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD '99). C. Faloutsos. Philadelphia, PA: 359–370.

  • Chan, C. Y. and Y. E. Ioannidis (1998). Bitmap Index Design and Evaluation. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD '98), Seattle, WA: 355–366.

  • Chaudhuri, S. and U. Dayal (1997). “An Overview of Data Warehousing and OLAP Technology.” SIGMOD Record 26(1): 65–74.

  • Chaudhuri, S., U. Fayyad, et al. (1999). Scalable Classification over SQL Databases. 15th International Conference on Data Engineering, March 23 - 26, 1999, Sydney, Australia: 470.

  • Cheeseman, P. and J. Stutz (1996). Bayesian Classification (AutoClass): Theory and Results. Advances in Knowledge Discovery and Data Mining. R. Uthurusamy, AAAI/MIT Press: 153–180.

  • Comer, D. (1979). “The Ubiquitous B-Tree.” ACM Computing Surveys 11(2): 121–137.

  • Duda, R. and P. Hart (1973). Pattern Classification and Scene Analysis. New York, John Wiley & Sons.

  • Fu, L. (2003). Classification for Free. International Conference on Internet Computing 2003 (IC'03) June 23 - 26, 2003, Monte Carlo Resort, Las Vegas, Nevada, USA.

  • Fu, L. and J. Hammer (2000). CUBIST: A New Algorithm For Improving the Performance of Ad-hoc OLAP Queries. ACM Third International Workshop on Data Warehousing and OLAP, Washington, D.C., USA, November: 72–79.

  • Gehrke, J., V. Ganti, et al. (1999). BOAT - Optimistic Decision Tree Construction. Proc. 1999 Int. Conf. Management of Data (SIGMOD '99), Philadelphia, PA, June 1999: 169–180.

  • Gehrke, J., R. Ramakrishnan, et al. (1998). RainForest - A Framework for Fast Decision Tree Construction of Large Datasets. Proceedings of the 24th VLDB Conference (VLDB '98), New York, USA, 1998: 416–427.

  • Hammer, J. and L. Fu (2001). Improving the Performance of OLAP Queries Using Families of Statistics Trees. 3rd International Conference on Data Warehousing and Knowledge Discovery DaWaK 01, September, 2001, Munich, Germany: 274–283.

  • Han, J. and M. Kamber (2001). Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers.

  • Harinarayan, V., A. Rajaraman, et al. (1996). “Implementing data cubes efficiently.” SIGMOD Record 25(2): 205–216.

  • Inmon, W. H. (1996). Building the Data Warehouse. New York, John Wiley & Sons.

  • Johnson, T. and D. Shasha (1997). “Some Approaches to Index Design for Cube Forests.” Bulletin of the Technical Committee on Data Engineering, IEEE Computer Society 20(1): 27–35.

  • Lakshmanan, L. V. S., J. Pei, et al. (2003). QC-Trees: An Efficient Summary Structure for Semantic OLAP. Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9-12, 2003. A. Doan, ACM: 64–75.

  • Lent, B., A. Swami, et al. (1997). Clustering Association Rules. Proceedings of the Thirteenth International Conference on Data Engineering (ICDE '97), Birmingham, U.K.: 220–231.

  • Lu, H., R. Setiono, et al. (1995). NeuroRule: A Connectionist Approach to Data Mining. VLDB'95, Proceedings of 21th International Conference on Very Large Data Bases, September 11–15, 1995, Zurich, Switzerland. S. Nishio, Morgan Kaufmann: 478–489.

  • Mehta, M., R. Agrawal, et al. (1996). SLIQ: A Fast Scalable Classifier for Data Mining. Advances in Database Technology - EDBT'96, 5th International Conference on Extending Database Technology, Avignon, France, March 25–29, 1996, Proceedings. G. Gardarin, Springer. 1057: 18–32.

  • O'Neil, P. (1987). Model 204 Architecture and Performance. Proc. of the 2nd International Workshop on High Performance Transaction Systems, Asilomar, CA: 40–59.

  • O'Neil, P. and D. Quass (1997). “Improved Query Performance with Variant Indexes.” SIGMOD Record (ACM Special Interest Group on Management of Data) 26(2): 38–49.

  • Quinlan, J. R. (1986). “Induction of Decision Trees.” Machine Learning 1: 81–106.

  • Quinlan, J. R. (1993). C4.5: Programs for Machine Learning, Morgan Kaufmann.

  • Shafer, J., R. Agrawal, et al. (1996). SPRINT: A Scalable Parallel Classifier for Data Mining. VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3–6, 1996, Mumbai (Bombay), India. N. L. Sarda, Morgan Kaufmann: 544–555.

  • Sismanis, Y., A. Deligiannakis, et al. (2002). Dwarf: shrinking the PetaCube. Proceedings of the 2002 ACM SIGMOD international conference on Management of data (SIGMOD '02), Madison, Wisconsin: 464–475.

  • Zhao, Y., P. M. Deshpande, et al. (1997). “An Array-Based Algorithm for Simultaneous Multidimensional Aggregates.” SIGMOD Record 26(2): 159–170.

Copyright information

© 2007 Springer

About this paper

Cite this paper

Fu, L. (2007). CONSTRUCTION OF DECISION TREES USING DATA CUBE. In: Chen, CS., Filipe, J., Seruca, I., Cordeiro, J. (eds) Enterprise Information Systems VII. Springer, Dordrecht. https://doi.org/10.1007/978-1-4020-5347-4_10

  • DOI: https://doi.org/10.1007/978-1-4020-5347-4_10

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-1-4020-5323-8

  • Online ISBN: 978-1-4020-5347-4

  • eBook Packages: Computer Science, Computer Science (R0)
