Abstract
Model trees combine the interpretability of decision trees with the efficiency of multiple linear regressions, which makes them useful for performing predictive analysis dynamically on data streams. However, missing values in the data streams are a problem during the training phase of a model tree. In this article, we compare different approaches to handling incomplete streams in order to measure their impact on the accuracy of the resulting model tree. Moreover, we propose an online method to estimate and adjust the missing values during stream processing. To show the results, a prototype has been developed and tested on several benchmarks.
Acknowledgements
The project is supported by a grant from the Ministry of Economy and External Trade, Grand-Duchy of Luxembourg, under the RDI Law. Moreover, this work has been realized in partnership with the infinAIt Solutions S.A. company (http://infinait.eu), so we would like to thank Gero Vierke and Helmut Rieder for their help.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Parisot, O., Didry, Y., Tamisier, T., Otjacques, B. (2016). Training Model Trees on Data Streams with Missing Values. In: Helfert, M., Holzinger, A., Belo, O., Francalanci, C. (eds) Data Management Technologies and Applications. DATA 2015. Communications in Computer and Information Science, vol 584. Springer, Cham. https://doi.org/10.1007/978-3-319-30162-4_6
DOI: https://doi.org/10.1007/978-3-319-30162-4_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30161-7
Online ISBN: 978-3-319-30162-4
eBook Packages: Computer Science, Computer Science (R0)