Abstract
Advances in hardware technologies allow data to be captured and processed in real time, and the resulting high-throughput data streams require novel data mining approaches. The research area of Data Stream Mining (DSM) develops data mining algorithms that allow us to analyse these continuous streams of data in real time. The creation and real-time adaptation of classification models from data streams is one of the most challenging DSM tasks. Current classifiers for streaming data address this problem by using incremental learning algorithms. However, even though these algorithms are fast, they are challenged by high-velocity data streams, in which data instances arrive at a fast rate. This is problematic if the application requires little or no delay between changes in the patterns of the stream and the absorption of these patterns by the classifier. Problems of scalability to Big Data of traditional data mining algorithms for static (non-streaming) datasets have been addressed through the development of parallel classifiers. However, there is very little work on the parallelisation of data stream classification techniques. In this paper we investigate K-Nearest Neighbours (KNN) as the basis for a real-time adaptive and parallel methodology for scalable data stream classification tasks.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Tennant, M., Stahl, F., Di Fatta, G., Gomes, J.B. (2014). Towards a Parallel Computationally Efficient Approach to Scaling Up Data Stream Classification. In: Bramer, M., Petridis, M. (eds) Research and Development in Intelligent Systems XXXI. SGAI 2014. Springer, Cham. https://doi.org/10.1007/978-3-319-12069-0_4
DOI: https://doi.org/10.1007/978-3-319-12069-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12068-3
Online ISBN: 978-3-319-12069-0
eBook Packages: Computer Science, Computer Science (R0)