Frequent Subgraph-Based Approach for Classifying Vietnamese Text Documents

Nguyen, Tu Anh Hoang; Hoang, Kiem

doi:10.1007/978-3-642-01347-8_25

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 24)

Included in the following conference series:

International Conference on Enterprise Information Systems

Abstract

In this paper we present a simple approach to Vietnamese text classification that requires no word segmentation and is based on frequent subgraph mining techniques. A graph-based model, rather than the traditional vector-based model, is used for document representation. The classification model employs structural patterns (subgraphs) and the Dice similarity measure to identify the class of a document. The method is evaluated on a Vietnamese data set to measure classification accuracy. Results show that it can outperform the k-NN algorithm (based on vector and hybrid document representations) in terms of accuracy and classification time.
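
The idea in the abstract can be illustrated with a minimal sketch. The full text is not available here, so everything below is an assumption rather than the authors' method: documents are split on whitespace (no word segmentation), each pair of adjacent tokens is treated as a directed edge, single edges stand in for mined frequent subgraphs, and the Dice coefficient 2|A ∩ B| / (|A| + |B|) scores a document against each class. The function names `doc_edges`, `frequent_edges`, `dice`, and `classify` are hypothetical.

```python
from collections import Counter


def doc_edges(text):
    """Represent a document as a set of directed edges between adjacent
    whitespace-separated tokens (an assumed, simplified graph model)."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def frequent_edges(docs, min_support):
    """Return edges that occur in at least `min_support` (a fraction in
    (0, 1]) of the documents. Single edges are used here as a stand-in
    for frequent subgraphs; a real miner would find larger patterns."""
    counts = Counter()
    for doc in docs:
        counts.update(doc_edges(doc))  # each edge counted once per document
    threshold = min_support * len(docs)
    return {edge for edge, n in counts.items() if n >= threshold}


def dice(a, b):
    """Dice coefficient between two sets: 2|A ∩ B| / (|A| + |B|)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(text, class_patterns):
    """Assign the class whose pattern set is most Dice-similar to the
    document's edge set. `class_patterns` maps class label -> edge set."""
    edges = doc_edges(text)
    return max(class_patterns, key=lambda label: dice(edges, class_patterns[label]))
```

In use, `frequent_edges` would be called once per class on that class's training documents, and the resulting dictionary of pattern sets passed to `classify`. Replacing single edges with patterns from a frequent subgraph miner would bring the sketch closer to the approach the abstract describes.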

Author information

Authors and Affiliations

Faculty of Information Technology, University of Science, VNU, Ho Chi Minh City, Vietnam
Tu Anh Hoang Nguyen
Faculty of Computer Science, University of Information Technology, VNU, Ho Chi Minh City, Vietnam
Kiem Hoang

Editor information

Editors and Affiliations

Department of Systems and Informatics, Institute for Systems and Technologies of Information, Control and Communication (INSTICC) and Instituto Politécnico de Setúbal (IPS), Rua do Vale de Chaves, Estefanilha, 2910-761, Setúbal, Portugal
Joaquim Filipe & José Cordeiro

About this paper

Cite this paper

Nguyen, T.A.H., Hoang, K. (2009). Frequent Subgraph-Based Approach for Classifying Vietnamese Text Documents. In: Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2009. Lecture Notes in Business Information Processing, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01347-8_25

DOI: https://doi.org/10.1007/978-3-642-01347-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01346-1
Online ISBN: 978-3-642-01347-8
eBook Packages: Computer Science (R0)