Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems

https://doi.org/10.1016/j.eswa.2008.02.013

Abstract

Chinese word segmentation is an essential step in processing Chinese natural language because it benefits Chinese text mining and information retrieval. The lexicon-based Chinese word segmentation scheme is currently widely adopted and can correctly segment Chinese sentences into distinct words in real-world applications. However, the word identification ability of the lexicon-based scheme depends heavily on a well-prepared lexicon with enough lexical entries to cover all Chinese words. In particular, this scheme cannot properly segment texts that change rapidly over time, such as newspaper articles and web documents, because such documents often contain many new words that a lexicon-based Chinese word segmentation system with a fixed lexicon cannot identify. Moreover, maintaining a lexicon manually is inefficient and time-consuming. Therefore, this study proposes a novel statistics-based scheme for extracting new words from the categorized corpora of Google News, retrieved automatically from the Google News site, to improve the word identification ability of lexicon-based Chinese word segmentation systems. Since news corpora contain almost all words used in daily life, extracting new words from news corpora and incrementally adding them to the lexicon of a lexicon-based Chinese word segmentation system helps to construct a professional lexicon automatically and to enhance word identification capability. Experimental results indicated that, compared to another proposed scheme of new word extraction, the proposed scheme not only retrieves new words from the categorized corpora of Google News more accurately but also obtains a larger number of new words.
Moreover, the proposed scheme of new word extraction has been applied to automatically expand the lexicon of the Chinese word segmentation system ECScanner (A Chinese Lexicon Scanner with Lexicon Extension). ECScanner has been published on the Web to provide a Chinese word segmentation service as a Web service. Experimental results also confirmed that ECScanner is superior to CKIP (Chinese Knowledge Information Processing) in identifying meaningful Chinese words.

Introduction

Segmenting texts in English or other Western languages into distinct words is a natural and trivial task. By contrast, it is a very challenging task for Chinese texts, since Chinese texts consist of strings of ideographic characters with no blanks to mark word boundaries; the only delimiters are punctuation marks at the end of each sentence and occasional commas within sentences (Chen and Liu, 1992; Foo and Li, 2004; Yeh and Lee, 1991; Zhang et al., 2004). Nevertheless, word segmentation is a necessary step in Chinese text processing tasks such as machine translation, Chinese text mining and information retrieval. According to past studies (Chen and Liu, 1992; Foo and Li, 2004; Yeh and Lee, 1991; Zhang et al., 2004), Chinese word segmentation approaches fall into three categories: word identification (i.e. the lexicon-based identification scheme), statistical word identification, and hybrid word identification. The basic technique for identifying distinct words in Chinese texts is the lexicon-based identification scheme (Chen & Liu, 1992), which segments text using string matching algorithms supported by a well-prepared lexicon with enough lexical entries to cover as many Chinese words as possible. However, such a large lexicon is difficult to construct or maintain manually, since the set of words is open-ended. Consequently, many words used in Chinese texts are out-of-lexicon words, which degrades the accuracy of Chinese word segmentation. The extraction of new words has therefore become a key technology for lexicon-based Chinese word segmentation systems (Chen and Bai, 1998; Chen and Ma, 2002; Lin and Yu, 2001; Ma and Chen, 2003; Shan and Jie, 2001; Wai et al., 2004).
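To make the lexicon-based scheme concrete, the sketch below uses forward maximum matching, one common string matching strategy. The text above does not say which matching algorithm any particular system uses, so the algorithm choice, the function name `forward_max_match`, the `max_len` parameter and the toy lexicon are all assumptions made for illustration.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (illustrative assumption, not a
    specific system's algorithm): at each position take the longest
    substring found in the lexicon, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; a real lexicon would hold many thousands of entries.
toy_lexicon = {"中文", "斷詞", "系統", "新聞"}

covered = forward_max_match("中文斷詞系統", toy_lexicon)
uncovered = forward_max_match("谷歌新聞", toy_lexicon)
```

With this toy lexicon, the fully covered string is split into its three two-character words, while the out-of-lexicon word "谷歌" breaks into single characters. That fragmentation is the accuracy loss that out-of-lexicon words cause in a lexicon-based segmenter.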

Moreover, poor word segmentation results usually occur when a lexicon-based Chinese word segmentation system uses a general lexicon on domain-specific texts. The best solution is to detect new words in domain-specific corpora and add them to the original lexicon. With the rapid growth of Internet information, automatically extracting new words from large corpora collected from the Internet has become feasible, especially from online daily Web news (Lin et al., 1998; Lu et al., 2002). Currently, the Google News aggregator gathers nearly 10,000 news sources from the World Wide Web with an automatic crawler and presents them as news stories/categories in a searchable format on the Google News site. Google News uses an automatic process to pull together related headlines into a news story/category, which enables people to see many different viewpoints on the same story/category.

Therefore, this paper proposes a novel statistics-based scheme of new word extraction that uses the categorized corpora of Google News, automatically retrieved from the Google News site, to detect new words that appear in daily Google News titles. The proposed scheme is also applied to expand the lexicon of the proposed Chinese word segmentation system ECScanner (A Chinese Lexicon Scanner with Lexicon Extension, 2006) in order to improve its word identification capability. Additionally, to keep an overly large lexicon resulting from new word extension from degrading the performance of Chinese word segmentation, this study also proposes a fuzzy rule based approach that eliminates out-of-date new words based on their inferred confidence degrees. The experimental results revealed that the proposed scheme performs well in terms of both the number of new words extracted and their accuracy rate. Moreover, the expanded lexicon clearly enhances the word identification capability of the proposed lexicon-based word segmentation system ECScanner by reducing the number of unknown words.
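The general idea of frequency-based candidate extraction from news titles can be sketched as follows. This is not the paper's scheme, whose statistics are not given in this preview; it is a generic n-gram counting illustration. The function name `candidate_new_words`, the thresholds and the sample titles are all hypothetical.

```python
from collections import Counter


def candidate_new_words(titles, lexicon, min_count=2, min_len=2, max_len=4):
    """Count character n-grams across titles and keep those that recur
    but are missing from the lexicon (generic illustration only)."""
    counts = Counter()
    for title in titles:
        for size in range(min_len, max_len + 1):
            # Slide a window of `size` characters over the title.
            for start in range(len(title) - size + 1):
                counts[title[start:start + size]] += 1
    # Keep recurring n-grams that the lexicon does not already contain.
    return {
        gram: n
        for gram, n in counts.items()
        if n >= min_count and gram not in lexicon
    }


# Hypothetical titles and lexicon, invented for this example.
sample_titles = ["谷歌新聞上線", "谷歌推出服務"]
sample_lexicon = {"新聞", "服務"}

candidates = candidate_new_words(sample_titles, sample_lexicon)
```

Here only "谷歌" occurs in both titles and is absent from the lexicon, so it is the single candidate. A practical extractor would also need to handle punctuation and to filter frequent substrings that are not words, which raw counts alone cannot do.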

The proposed scheme of new word extraction

This section details the proposed scheme of new word extraction and is organized as follows: Section 2.1 describes why new words are extracted from Google News titles, and Section 2.2 explains the detailed procedures of the proposed scheme of new word extraction. Section 2.3 describes how to develop a Chinese word segmentation system based on the proposed scheme of new word extension, and Section 2.4 presents how to infer the life cycle of the extracted new words for eliminating out-of-date

Experimental results

To demonstrate the performance of the proposed scheme of new word extraction, Section 3.1 first presents the evaluation results of extracting new words from Google News articles, and Section 3.2 presents the performance of the Chinese word segmentation system ECScanner with the proposed scheme of new word discovery. Finally, Section 3.3 assesses the performance of eliminating out-of-date words by inferring the life cycles of extracted new words.

Discussion

This paper has proposed an excellent scheme of new word extraction for supporting lexicon-based Chinese word segmentation systems. However, two critical issues need to be further investigated.

Conclusion

In this paper, a novel statistics-based scheme for extracting new words from the categorized corpora of Google News titles, automatically retrieved from the Google News site, is presented to improve the word identification capability of the lexicon-based Chinese word segmentation system ECScanner. In addition, to avoid reducing the performance of word identification due to an overly large lexicon derived from new word extension, this study also proposes a fuzzy rule knowledge base to eliminate

References (20)

  • S. Foo et al. Chinese word segmentation and its effect on information retrieval. Information Processing & Management (2004)
  • M.-Y. Zhang et al. A Chinese word segmentation based on language situation in processing ambiguous words. Information Sciences (2004)
  • K.J. Chen et al. Unknown word detection for Chinese by a corpus-based learning method. International Journal of Computational Linguistics and Chinese Language Processing (1998)
  • C.-M. Chen et al. Personalized E-news monitoring agent system for tracking user-interested news events. IEEE International Conference on Systems, Man, and Cybernetics (2006)
  • K.J. Chen et al. Word identification for Mandarin Chinese sentences. Proceedings of COLING (1992)
  • K.J. Chen et al. Unknown word extraction for Chinese documents. Proceedings of COLING (2002)
  • G.G. Chowdhury. Introduction to modern information retrieval (2004)
  • CKIP Chinese Parser. (2007)....
  • Cscanner (A Chinese Lexicon Scanner). (2000)....
  • ECScanner (A Chinese Lexicon Scanner with Lexicon Extension). (2007)....
There are more references available in the full text version of this article.

Cited by (25)

  • Anecdotes extraction from webpage context as image annotation

    2015, Emerging Trends in Image Processing, Computer Vision and Pattern Recognition
  • Unknown Chinese word extraction based on variety of overlapping strings

    2013, Information Processing and Management
    Citation Excerpt:

    However, dictionaries for domain specific words may not be always available, and it is costly and time-consuming to compile a list of domain specific words manually. Therefore, methods for automatically extracting unknown Chinese words from text corpora were developed (e.g., Chen & Ma, 2002; Hong, Chen, & Chiu, 2009; Peng, Feng, & McCallum, 2004). Typically, extraction of unknown Chinese word can be conducted in two steps: word extraction and word refinement.

  • Mining term networks from text collections for crime investigation

    2012, Expert Systems with Applications
    Citation Excerpt:

    The terms for the network can be obtained from a given list, output of an NLP parsers, or any key term extraction algorithms (Tseng, 1998). Examples include setting TF * IDF (term frequency and inverse document frequency) lower bound for selecting frequent words as key terms, using an NLP parser to extract specific noun phrases (Schneider, 2006), applying statistical testing for evaluating topic relatedness of each term for word selection (Matsuo & Ishizuka, 2004), association rule learning approach (Chen, Hsieh, & Hsu, 2007), using statistics-based scheme for extraction of new words based on the categorized corpora of Google News (Hong, Chen, & Chiu, 2009), using Bayesian text classification for keyword extraction from different text classification domains of varying characteristics (Lee, Isa, & Choo, 2011), or a combination of multiple techniques (Eck, Waltman, & Noyons, 2010). The use of frequency lower bound may be the simplest way for word selection.
