ABSTRACT
Word segmentation is an important part of many applications, including information retrieval, information filtering, document analysis, and text summarization. In Thai, the task is complicated because words are written continuously and word structure is not well defined. A well-established approach to word segmentation is Longest Matching, a dictionary-based method. However, this method suffers from character-level and syllable-level ambiguities in determining word boundaries. This paper proposes a two-step technique for Thai word segmentation. First, the text is segmented into syllables, whose structure is better defined, using an application of Prediction by Partial Matching. This reduces the character-level ambiguity. Then, the syllables are combined into words by a syllable-level longest matching method together with a logistic regression model that takes contextual information into account. The experimental results show a syllable segmentation accuracy of more than 96.65% and an overall word segmentation accuracy of 97%.
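To make the second step more concrete, the following is a minimal sketch of greedy longest matching applied to a sequence of syllables rather than characters. It is an illustration under stated assumptions, not the authors' implementation: the function name `longest_match`, the toy lexicon, and the romanized placeholder syllables are all hypothetical, and the logistic regression model that the paper uses to resolve the remaining ambiguity is left out.

```python
from typing import List, Set, Tuple

Syllables = Tuple[str, ...]


def longest_match(syllables: List[str], lexicon: Set[Syllables]) -> List[List[str]]:
    """Greedy left-to-right longest matching over a syllable sequence.

    At each position the longest lexicon entry that matches is taken as a
    word; a single syllable is used as a fallback when nothing matches.
    """
    # Longest entry in the lexicon bounds how far ahead we need to look.
    max_len = max((len(entry) for entry in lexicon), default=1)
    words: List[List[str]] = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(list(candidate))
                i += n
                break
    return words


# Hypothetical toy lexicon; each entry is a word written as a tuple of
# romanized placeholder syllables (illustrative only).
TOY_LEXICON: Set[Syllables] = {
    ("ma", "ha"),
    ("wit", "tha", "ya"),
    ("ma", "ha", "wit", "tha", "ya", "lai"),
}

if __name__ == "__main__":
    print(longest_match(["ma", "ha", "wit", "tha", "ya", "lai", "dee"], TOY_LEXICON))
```

Because the match is greedy, the six-syllable entry wins over the shorter entries it contains. Working on syllables rather than characters means the matcher can never place a boundary inside a syllable, which is the motivation the abstract gives for the first step; choosing between competing syllable groupings is where a contextual model would come in.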
- Combining prediction by partial matching and logistic regression for Thai word segmentation