DOI: 10.3115/1118824.1118832

Learning case-based knowledge for disambiguating Chinese word segmentation: a preliminary study

Published: 01 September 2002

ABSTRACT

As in other NLP applications, a serious problem in Chinese word segmentation lies in the ambiguities involved. Disambiguation methods fall into different categories, e.g., rule-based, statistics-based and example-based approaches, each of which may involve a variety of machine learning techniques. In this paper we report our current progress within the example-based approach, including its framework, example representation and collection, and example matching and application. Experimental results show that this approach resolves more than 90% of the ambiguities found. Hence, if it is integrated effectively with a segmentation method whose precision is P > 95%, the resulting segmentation accuracy can, theoretically, exceed 99.5%.
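
The 99.5% figure can be checked with a short calculation. The sketch below is not taken from the paper: it assumes that every error left by the base segmenter is an ambiguity error and that the disambiguator corrects a fixed fraction of those errors without introducing new ones. The function name `combined_accuracy` and its parameters are hypothetical.

```python
def combined_accuracy(base_precision: float, resolved_fraction: float) -> float:
    """Idealized accuracy after disambiguation.

    Assumption (not from the paper): all residual errors of the base
    segmenter are ambiguity errors, and `resolved_fraction` of them are
    corrected with no new errors introduced.
    """
    if not (0.0 <= base_precision <= 1.0 and 0.0 <= resolved_fraction <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    # Errors that survive: base error rate times the unresolved share.
    residual_error = (1.0 - base_precision) * (1.0 - resolved_fraction)
    return 1.0 - residual_error


# Boundary case from the abstract: P = 95%, 90% of ambiguities resolved.
print(f"{combined_accuracy(0.95, 0.90):.3f}")  # prints 0.995
```

With P strictly above 95% and more than 90% of ambiguities resolved, the residual error falls below 0.5%, which matches the "beyond 99.5%" bound under these assumptions.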


Published in

SIGHAN '02: Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18
September 2002
122 pages

Publisher

Association for Computational Linguistics, United States
