DOI: 10.3115/1220355.1220530

Combining prediction by partial matching and logistic regression for Thai word segmentation

Published: 23 August 2004

ABSTRACT

Word segmentation is an important part of many applications, including information retrieval, information filtering, document analysis, and text summarization. In Thai, the process is complicated because words are written continuously and their structures are not well defined. A recognized, effective approach to word segmentation is Longest Matching, a dictionary-based method. Nevertheless, this method suffers from character-level and syllable-level ambiguities in determining word boundaries. This paper proposes a two-step technique for Thai word segmentation. First, text is segmented into syllables, whose structures are better defined, using an application of Prediction by Partial Matching. This reduces the former (character-level) type of ambiguity. Then, the syllables are combined into words by a syllable-level longest matching method together with a logistic regression model that takes contextual information into account. The experimental results show a syllable segmentation accuracy of more than 96.65% and an overall word segmentation accuracy of 97%.
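
To make the dictionary-based baseline concrete, here is a minimal Python sketch of greedy longest matching. It is an illustration under stated assumptions, not the paper's implementation: the function name `longest_matching`, the toy Latin-script dictionary, and the single-character fallback for unknown input are all hypothetical choices made for this example.

```python
def longest_matching(text, dictionary):
    """Greedy left-to-right segmentation (illustrative sketch).

    At each position, take the longest dictionary entry that starts there.
    If nothing matches, emit one character as its own token (an assumed
    fallback; other policies are possible).
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink the window.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: single-character token
        tokens.append(match)
        i += len(match)
    return tokens


# Toy dictionary (hypothetical). The string "themend" can be read as
# "them" + "end" or as "the" + "mend"; greedy matching commits to the first.
toy_dictionary = {"the", "them", "end", "mend"}
segments = longest_matching("themend", toy_dictionary)
```

With this toy dictionary the greedy pass returns `["them", "end"]` and never considers `["the", "mend"]`, which is the kind of boundary ambiguity the abstract attributes to longest matching and which a contextual model is meant to help resolve.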

Published in: COLING '04: Proceedings of the 20th international conference on Computational Linguistics, August 2004, 1411 pages
Publisher: Association for Computational Linguistics, United States

