ABSTRACT
As in other NLP applications, a serious problem in Chinese word segmentation lies in the ambiguities involved. Disambiguation methods fall into different categories, e.g., rule-based, statistics-based and example-based approaches, each of which may involve a variety of machine learning techniques. In this paper we report our current progress on the example-based approach, including its framework, example representation and collection, example matching and application. Experimental results show that the approach resolves more than 90% of the ambiguities found. Hence, if it is integrated effectively with a segmentation method of precision P > 95%, the resulting segmentation accuracy can, in theory, exceed 99.5%.
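The 99.5% figure can be reproduced with a short calculation. The sketch below is only an illustration, under the assumption (not stated in the abstract) that every error left by the base segmenter is an ambiguity the example-based step can attempt, and that the step introduces no new errors. The function name `combined_accuracy` and its parameters are hypothetical.

```python
def combined_accuracy(base_precision: float, resolution_rate: float) -> float:
    """Estimate accuracy after a disambiguation step corrects part of the base errors.

    Assumption (ours, for illustration): all errors of the base segmenter are
    ambiguity errors, and the disambiguation step never breaks a correct result.
    """
    if not (0.0 <= base_precision <= 1.0 and 0.0 <= resolution_rate <= 1.0):
        raise ValueError("both arguments must be fractions between 0 and 1")
    # Errors that survive: the base error rate times the unresolved share.
    residual_error = (1.0 - base_precision) * (1.0 - resolution_rate)
    return 1.0 - residual_error


if __name__ == "__main__":
    # Boundary case from the abstract: P = 95%, 90% of ambiguities resolved.
    print(f"{combined_accuracy(0.95, 0.90):.3f}")
```

With P = 0.95 and a resolution rate of 0.90, the residual error is 0.05 × 0.10 = 0.005; since both figures in the abstract are lower bounds, the combined accuracy lies above 99.5% under these assumptions.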