Frequent Subgraph-Based Approach for Classifying Vietnamese Text Documents

  • Conference paper
Enterprise Information Systems (ICEIS 2009)

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 24)

Abstract

In this paper we present a simple approach to Vietnamese text classification that requires no word segmentation and is based on frequent subgraph mining. Documents are represented with a graph-based model rather than the traditional vector-based model. The classification model uses structural patterns (subgraphs) and the Dice similarity measure to identify the class of a document. The method is evaluated on a Vietnamese data set to measure classification accuracy. Results show that it can outperform the k-NN algorithm (with vector-based and hybrid document representations) in both accuracy and classification time.
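As a rough illustration of the Dice measure mentioned in the abstract, the sketch below scores a document against each class by comparing sets of pattern identifiers. It is a minimal sketch under stated assumptions, not the paper's implementation: the pattern identifiers (`p1`, `p2`, ...), the class names, and the `classify` helper are hypothetical, and the frequent subgraph mining step that would produce the patterns is not shown.

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|); defined here as 0.0 for two empty sets."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def classify(doc_patterns: set, class_patterns: dict) -> str:
    """Return the class whose pattern set is most Dice-similar to the document's patterns."""
    return max(class_patterns, key=lambda c: dice(doc_patterns, class_patterns[c]))


# Hypothetical pattern identifiers standing in for mined frequent subgraphs.
class_patterns = {
    "sports": {"p1", "p2", "p3", "p4"},
    "economy": {"p3", "p5", "p6"},
}
doc_patterns = {"p1", "p2", "p5"}

scores = {c: dice(doc_patterns, p) for c, p in class_patterns.items()}
predicted = classify(doc_patterns, class_patterns)
```

Here the document shares two patterns with the hypothetical "sports" class (score 4/7) and one with "economy" (score 1/3), so `predicted` is `"sports"`.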

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Nguyen, T.A.H., Hoang, K. (2009). Frequent Subgraph-Based Approach for Classifying Vietnamese Text Documents. In: Filipe, J., Cordeiro, J. (eds) Enterprise Information Systems. ICEIS 2009. Lecture Notes in Business Information Processing, vol 24. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01347-8_25

  • DOI: https://doi.org/10.1007/978-3-642-01347-8_25

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-01346-1

  • Online ISBN: 978-3-642-01347-8

  • eBook Packages: Computer Science, Computer Science (R0)
