Learning case-based knowledge for disambiguating Chinese word segmentation: a preliminary study

Authors:
Chunyu Kit, City University of Hong Kong
Haihua Pan, City University of Hong Kong
Hongbiao Chen, City University of Hong Kong
SIGHAN '02: Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18, September 2002, Pages 1–7
https://doi.org/10.3115/1118824.1118832

Published: 1 September 2002

ABSTRACT

Just like other NLP applications, a serious problem with Chinese word segmentation lies in the ambiguities involved. Disambiguation methods fall into different categories, e.g., rule-based, statistical-based and example-based approaches, each of which may involve a variety of machine learning techniques. In this paper we report our current progress within the example-based approach, including its framework, example representation and collection, example matching and application. Experimental results show that this effective approach resolves more than 90% of ambiguities found. Hence, if it is integrated effectively with a segmentation method of the precision P > 95%, the resulting segmentation accuracy can reach, theoretically, beyond 99.5%.
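
The closing estimate in the abstract can be checked with a short calculation. The sketch below is not taken from the paper: it assumes, purely for illustration, that every error left by the base segmenter is an ambiguity the disambiguator gets a chance to resolve, and the function name `combined_accuracy` is hypothetical.

```python
def combined_accuracy(base_precision: float, resolution_rate: float) -> float:
    """Accuracy after a disambiguator fixes a fraction of the base errors.

    Assumption (illustrative only): all errors of the base segmenter are
    ambiguities that the disambiguator has a chance to resolve.
    """
    # Errors that remain are those the base method made and the
    # disambiguator failed to resolve.
    residual_error = (1.0 - base_precision) * (1.0 - resolution_rate)
    return 1.0 - residual_error


# Boundary values from the abstract: P = 95%, 90% of ambiguities resolved.
print(f"{combined_accuracy(0.95, 0.90):.1%}")  # prints 99.5%
```

Under this assumption, any base precision above 95% combined with a resolution rate above 90% leaves a residual error below 0.5%, which matches the "beyond 99.5%" figure.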

Published in

SIGHAN '02: Proceedings of the first SIGHAN workshop on Chinese language processing - Volume 18
September 2002
122 pages
Publisher: Association for Computational Linguistics, United States