Fuzzy clustering-based semi-supervised approach for outlier detection in big text data

Lazhar, Farek

doi:10.1007/s13748-018-0165-5

Regular Paper
Published: 14 September 2018

Volume 8, pages 123–132 (2019)
Progress in Artificial Intelligence

Farek Lazhar ORCID: orcid.org/0000-0002-6958-736X¹

Abstract

Text data is often polluted by outlier documents, which can significantly affect the performance of classification techniques. In this paper, we propose an approach based on fuzzy clustering to detect outlier documents. The approach rests on the assumption that documents assigned to different clusters with very close degrees of membership are candidate outliers. First, a semantic data model is built using the Doc2Vec framework. Second, fuzzy clustering is performed. Third, candidate outlier documents are detected from their degrees of membership. Finally, the objective function is recomputed for each candidate, and a candidate is considered an outlier when it leads to a considerable increase in the objective function score. To show the effectiveness of our approach, two classification tests are applied, one on the original datasets and one on the datasets with outliers removed. Experimental results show that discarding outliers from the datasets improves the performance of classifiers.
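
The steps listed in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: synthetic 2-D vectors stand in for Doc2Vec embeddings, the clustering is plain fuzzy c-means, and the parameter names `gap` and `factor`, their default values, and the rule that compares a candidate's contribution to the objective against a multiple of the median contribution are all assumptions made for illustration.

```python
import numpy as np


def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=200, tol=1e-6, seed=0):
    """Plain fuzzy c-means. Returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships; each row sums to 1.
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared Euclidean distance from every document to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


def candidate_outliers(U, gap=0.2):
    """Indices whose two largest memberships differ by less than `gap`
    (hypothetical threshold for 'very close degrees')."""
    top2 = np.sort(U, axis=1)[:, -2:]
    return np.where(top2[:, 1] - top2[:, 0] < gap)[0]


def objective_contributions(X, centers, U, m=2.0):
    """Per-document term of the objective J = sum_i sum_k u_ik^m * d_ik^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return ((U ** m) * d2).sum(axis=1)


def detect_outliers(X, n_clusters, gap=0.2, factor=3.0, m=2.0):
    centers, U = fuzzy_c_means(X, n_clusters, m=m)
    contrib = objective_contributions(X, centers, U, m)
    # Assumed rule: a candidate is an outlier if including it raises J by
    # more than `factor` times the median per-document contribution.
    threshold = factor * np.median(contrib)
    return [int(i) for i in candidate_outliers(U, gap) if contrib[i] > threshold]


# Synthetic stand-in for document embeddings: two tight groups plus one
# point (index 40) that is roughly equidistant from both.
rng = np.random.default_rng(42)
group_a = rng.normal([0.0, 0.0], 0.3, size=(20, 2))
group_b = rng.normal([10.0, 10.0], 0.3, size=(20, 2))
X = np.vstack([group_a, group_b, [[10.0, 0.0]]])

flagged = detect_outliers(X, n_clusters=2)
print(flagged)
```

With fixed centers, the difference between the objective computed with and without a document is exactly that document's own term, which is why the sketch uses the per-document contribution rather than refitting the clustering for every candidate. Whether the paper refits is not stated in the abstract.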

Author information

Authors and Affiliations

University of Guelma, BP 411, Guelma, Algeria
Farek Lazhar

Corresponding author

Correspondence to Farek Lazhar.

About this article

Cite this article

Lazhar, F. Fuzzy clustering-based semi-supervised approach for outlier detection in big text data. Prog Artif Intell 8, 123–132 (2019). https://doi.org/10.1007/s13748-018-0165-5

Received: 13 April 2018
Accepted: 02 September 2018
Published: 14 September 2018
Issue Date: 01 April 2019
DOI: https://doi.org/10.1007/s13748-018-0165-5
