Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems
Introduction
Splitting text in English or other Western languages into distinct words is a natural and trivial task. By contrast, it is a challenging task for Chinese texts, since a Chinese text is a string of ideographic characters with no blanks to mark word boundaries; the only delimiters are punctuation marks at the end of each sentence and occasional commas within sentences (Chen and Liu, 1992, Foo and Li, 2004, Yeh and Lee, 1991, Zhang et al., 2004). Nevertheless, word segmentation is a necessary step in Chinese text processing applications such as machine translation, Chinese text mining and information retrieval. According to past studies (Chen and Liu, 1992, Foo and Li, 2004, Yeh and Lee, 1991, Zhang et al., 2004), Chinese word segmentation approaches fall into three categories: word identification (i.e. the lexicon-based identification scheme), statistical word identification, and hybrid word identification. The basic technique for identifying distinct words in Chinese texts is the lexicon-based identification scheme (Chen & Liu, 1992), which segments text using string matching algorithms supported by a well-prepared lexicon whose entries cover as many Chinese words as possible. However, such a large lexicon is difficult to construct or maintain manually, since the set of words is open-ended. Consequently, many words that occur in Chinese texts are out-of-lexicon words because the lexicon lacks the corresponding entries, and the accuracy of Chinese word segmentation degrades. The extraction of new words has therefore become a key technology for lexicon-based Chinese word segmentation systems (Chen and Bai, 1998, Chen and Ma, 2002, Lin and Yu, 2001, Ma and Chen, 2003, Shan and Jie, 2001, Wai et al., 2004).
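To make the out-of-lexicon problem concrete, the following is a minimal sketch of lexicon-based segmentation by forward maximum matching, one common string matching strategy. It is an illustration only: the text above does not say which matching algorithm is used, and the function name `forward_max_match`, the `max_len` parameter, and the toy lexicon are all assumptions introduced here.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, taking the longest lexicon entry at each position.

    Illustrative sketch only; not the algorithm of any specific system.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon; the word "谷歌" (Google) is deliberately missing.
lexicon = {"新聞", "中文", "斷詞", "系統"}
text = "谷歌新聞中文斷詞系統"

segmented = forward_max_match(text, lexicon)
# The out-of-lexicon word is broken into single characters.

# After the new word is added to the lexicon, it is identified as one unit.
segmented_extended = forward_max_match(text, lexicon | {"谷歌"})
```

In this sketch the missing entry causes "谷歌" to be split into two single characters, which is the kind of degradation that new word extraction is meant to reduce.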
Moreover, poor word segmentation results usually occur when a lexicon-based Chinese word segmentation system applies a general lexicon to domain-specific texts. The best solution is to detect new words in domain-specific corpora and add them to the original lexicon. With the rapid growth of Internet information, automatically extracting new words from large corpora collected from the Internet has become feasible, especially from online daily Web news (Lin et al., 1998, Lu et al., 2002). Currently, the Google News aggregator gathers nearly 10,000 news sources from the World Wide Web with an automatic crawler and presents them as news stories/categories in a searchable format on the Google News site. Google News uses an automatic process to pull together related headlines into a news story/category, which enables people to see many different viewpoints on the same story/category.
Therefore, this paper proposes a novel statistics-based scheme of new word extraction that uses the categorized corpora automatically retrieved from the Google News site to detect new words appearing in daily Google News titles. The proposed scheme is also applied to expand the lexicon of the proposed Chinese word segmentation system ECScanner (A Chinese Lexicon Scanner with Lexicon Extension, 2006) in order to improve its word identification capability. Additionally, to prevent an overly large lexicon produced by new word extension from degrading the performance of Chinese word segmentation, this study also proposes a fuzzy rule based approach that eliminates out-of-date new words based on their inferred confidence degrees. The experimental results revealed that the proposed scheme performs well in terms of both the number of new words extracted and their accuracy rate. Moreover, the expanded lexicon clearly enhances the word identification capability of the proposed lexicon-based word segmentation system ECScanner because it reduces the number of unknown words.
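As a rough illustration of what a statistics-based extraction step over news titles can look like, the sketch below counts character n-grams that are absent from the lexicon and keeps those that recur across titles. This is not the scheme proposed in the paper, whose statistical details are not given in this excerpt; the function name `candidate_new_words`, the thresholds `min_count` and `max_len`, and the sample titles are all hypothetical.

```python
from collections import Counter


def candidate_new_words(titles, lexicon, min_count=2, max_len=3):
    """Return out-of-lexicon character n-grams that occur at least min_count times.

    A plain frequency filter, used here only as a stand-in for a real
    statistical new word extraction scheme.
    """
    counts = Counter()
    for title in titles:
        for n in range(2, max_len + 1):
            for i in range(len(title) - n + 1):
                gram = title[i:i + n]
                # Strings already in the lexicon are not new words.
                if gram not in lexicon:
                    counts[gram] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Hypothetical titles and lexicon, assumed for illustration only.
titles = ["谷歌新聞上線", "谷歌發表新服務", "新聞報導谷歌"]
lexicon = {"新聞", "服務", "報導", "發表", "上線"}

candidates = candidate_new_words(titles, lexicon)
```

With these toy inputs, only "谷歌" recurs often enough to survive the filter. A practical scheme would need stronger evidence than raw frequency, for example to reject frequent fragments that straddle two known words.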
The proposed scheme of new word extraction
This section details the proposed scheme of new word extraction and is organized as follows: Section 2.1 explains why new words are extracted from Google News titles, and Section 2.2 explains the detailed procedures of the proposed scheme of new word extraction. Section 2.3 describes how to develop a Chinese word segmentation system based on the proposed scheme of new word extension, and Section 2.4 presents how to infer the life cycle of the extracted new words for eliminating out-of-date
Experimental results
To demonstrate the performance of the proposed scheme of new word extraction, Section 3.1 first reports the performance evaluation results for extracting new words from Google News articles, and Section 3.2 presents the performance of the Chinese word segmentation system ECScanner with the proposed scheme of new word discovery. Finally, Section 3.3 assesses the performance of eliminating out-of-date words by inferring the life cycles of extracted new words.
Discussion
This paper has proposed an excellent scheme of new word extraction for supporting lexicon-based Chinese word segmentation systems. However, two critical issues need to be further investigated.
Conclusion
In this paper, a novel statistics-based scheme for extracting new words based on the categorized corpora of Google News titles automatically retrieved from the Google News site is presented to improve the word identification capability of the lexicon-based Chinese word segmentation system ECScanner. In addition, to avoid reducing the performance of word identification due to an overly large lexicon derived from new word extension, this study also proposes a fuzzy rule knowledge base to eliminate
References (20)
- et al. Chinese word segmentation and its effect on information retrieval. Information Processing & Management (2004)
- et al. A Chinese word segmentation based on language situation in processing ambiguous words. Information Sciences (2004)
- et al. Unknown word detection for Chinese by a corpus-based learning method. International Journal of Computational Linguistics and Chinese Language Processing (1998)
- et al. Personalized E-news monitoring agent system for tracking user-interested news events. IEEE International Conference on Systems, Man, and Cybernetics (2006)
- et al. Word identification for Mandarin Chinese sentences. Proceedings of COLING (1992)
- et al. Unknown word extraction for Chinese documents. Proceedings of COLING (2002)
- Introduction to modern information retrieval (2004)
- CKIP Chinese Parser. (2007)....
- Cscanner (A Chinese Lexicon Scanner). (2000)....
- ECScanner (A Chinese Lexicon Scanner with Lexicon Extension). (2007)....