Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems

doi:10.1016/j.eswa.2008.02.013

Expert Systems with Applications

Volume 36, Issue 2, Part 2, March 2009, Pages 3641-3651

https://doi.org/10.1016/j.eswa.2008.02.013

Abstract

Chinese word segmentation is an essential step in Chinese natural language processing because it benefits Chinese text mining and information retrieval. Currently, the lexicon-based Chinese word segmentation scheme is widely adopted, and it can correctly segment Chinese sentences into distinct words in real-world applications. However, the word identification ability of the lexicon-based scheme depends heavily on a well-prepared lexicon with enough lexical entries to cover all Chinese words. In particular, this scheme cannot segment texts that change rapidly over time, such as newspaper articles and web documents, because such documents often contain many new words that a lexicon-based Chinese word segmentation system with a fixed lexicon cannot identify. Moreover, maintaining a lexicon manually is inefficient and time-consuming. Therefore, this study proposes a novel statistics-based scheme for extracting new words from the categorized corpora of Google News, retrieved automatically from the Google News site, to improve the word identification ability of lexicon-based Chinese word segmentation systems. Since news corpora contain almost all words used in daily life, extracting new words from news corpora and incrementally adding them to the lexicon of a lexicon-based Chinese word segmentation system provides benefits in terms of automatically constructing a professional lexicon and enhancing word identification capability. Compared with another proposed new word extraction scheme, the experimental results indicate that the proposed scheme not only retrieves new words from the categorized corpora of Google News more accurately, but also obtains a larger number of new words. Moreover, the proposed new word extraction scheme has been applied to automatically expand the lexicon of the Chinese word segmentation system ECScanner (A Chinese Lexicon Scanner with Lexicon Extension). ECScanner has been published on the Web and provides a Chinese word segmentation service as a Web service. Experimental results also confirm that ECScanner is superior to CKIP (Chinese Knowledge Information Processing) in identifying meaningful Chinese words.

Introduction

Segmenting texts in English or other Western languages into distinct words is a natural and trivial task. By contrast, it is a challenging task for Chinese texts, since they consist of strings of ideographic characters with no blanks to mark word boundaries; the only delimiters are punctuation marks at the end of each sentence and occasional commas within sentences (Chen and Liu, 1992, Foo and Li, 2004, Yeh and Lee, 1991, Zhang et al., 2004). However, word segmentation is a necessary step in processing Chinese texts for tasks such as machine translation, Chinese text mining and information retrieval. According to past studies (Chen and Liu, 1992, Foo and Li, 2004, Yeh and Lee, 1991, Zhang et al., 2004), Chinese word segmentation approaches fall into three categories: word identification (i.e. the lexicon-based identification scheme), statistical word identification, and hybrid word identification. The basic technique for identifying distinct words in Chinese texts is the lexicon-based identification scheme (Chen & Liu, 1992), which segments text using string matching algorithms supported by a well-prepared lexicon with enough lexical entries to cover as many Chinese words as possible. However, such a large lexicon is difficult to construct or maintain manually, since the set of words is open-ended. As a result, many words used in Chinese texts are out-of-lexicon words because the lexicon has too few entries, and the accuracy of Chinese word segmentation is degraded. The extraction of new words has therefore become a key technology for lexicon-based Chinese word segmentation systems (Chen and Bai, 1998, Chen and Ma, 2002, Lin and Yu, 2001, Ma and Chen, 2003, Shan and Jie, 2001, Wai et al., 2004).
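To make the lexicon-based scheme and its out-of-lexicon weakness concrete, the following is a minimal sketch of forward maximum matching, one common string matching strategy for lexicon-based segmentation. It is an illustration only and is not taken from the paper: the function name `forward_max_match`, the `max_len` parameter, and the tiny example lexicon are all assumptions.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation (hypothetical helper).

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character otherwise.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Toy lexicon (assumed for illustration); "部落格" (blog) is a new word it lacks.
lexicon = {"中文", "斷詞", "系統"}
print(forward_max_match("中文斷詞系統", lexicon))  # ['中文', '斷詞', '系統']
print(forward_max_match("中文部落格", lexicon))    # ['中文', '部', '落', '格']

# After the new word is added to the lexicon, it is recognized as one unit.
lexicon.add("部落格")
print(forward_max_match("中文部落格", lexicon))    # ['中文', '部落格']
```

The second call shows the failure mode described above: a word missing from the lexicon is broken into single characters, and extending the lexicon repairs the segmentation.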

Moreover, poor word segmentation results usually occur when a lexicon-based Chinese word segmentation system uses a general lexicon for domain-specific texts. The best solution is to detect new words in domain-specific corpora and add them to the original lexicon. With the rapid growth of Internet information, automatically extracting new words from large corpora collected from the Internet has become feasible, especially from online daily Web news (Lin et al., 1998, Lu et al., 2002). Currently, the Google News aggregator gathers nearly 10,000 news sources from the World Wide Web with an automatic crawler and presents them as news stories/categories in a searchable format on the Google News site. Google News uses an automatic process to pull together related headlines into a news story/category, which enables people to see many different viewpoints on the same story/category.

Therefore, this paper proposes a novel statistics-based new word extraction scheme that uses the categorized corpora of Google News, automatically retrieved from the Google News site, to detect new words that appear in daily Google News titles. In parallel, the proposed scheme is applied to expand the lexicon of the proposed Chinese word segmentation system ECScanner (A Chinese Lexicon Scanner with Lexicon Extension, 2006) in order to improve its word identification capability. Additionally, to avoid degrading the performance of Chinese word segmentation with an overly large lexicon produced by new word extension, this study also proposes a fuzzy rule based approach that eliminates out-of-date new words based on their inferred confidence degrees. The experimental results reveal that the proposed scheme performs well in terms of both the number of new words extracted and their accuracy rate. Moreover, the expanded lexicon clearly enhances the word identification capability of the proposed lexicon-based word segmentation system ECScanner by reducing the number of unknown words.
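The statistics used by the proposed scheme are not given in this excerpt, so the sketch below should not be read as the paper's method. It only illustrates the general idea of statistics-based extraction from news titles under stated assumptions: character n-grams that recur across several titles of one category and are absent from the lexicon are treated as new word candidates, and shorter n-grams are dropped when a longer candidate contains them with the same count. The function name `candidate_new_words`, its parameters, and the sample titles are hypothetical.

```python
from collections import Counter


def candidate_new_words(titles, lexicon, min_n=2, max_n=4, min_titles=2):
    """Return {n-gram: number of titles containing it} for new word candidates."""
    counts = Counter()
    for title in titles:
        # Collect each distinct n-gram once per title (document frequency).
        seen = set()
        for n in range(min_n, max_n + 1):
            for i in range(len(title) - n + 1):
                seen.add(title[i:i + n])
        counts.update(seen)

    frequent = {g: c for g, c in counts.items() if c >= min_titles}

    # Drop an n-gram when a longer frequent n-gram contains it with the same
    # count: it probably never occurs on its own (e.g. "部落" inside "部落格").
    maximal = {
        g: c for g, c in frequent.items()
        if not any(g != h and g in h and c == frequent[h] for h in frequent)
    }

    # Only strings missing from the lexicon count as new words.
    return {g: c for g, c in maximal.items() if g not in lexicon}


# Made-up titles from one hypothetical news category.
titles = ["部落格熱潮來襲", "學生愛寫部落格", "部落格成為新媒體"]
print(candidate_new_words(titles, lexicon={"學生", "媒體"}))  # {'部落格': 3}
```

A real system would also have to split titles at punctuation and validate candidates further; this sketch leaves both out for brevity.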

Section snippets

The proposed scheme of new word extraction

This section details the proposed scheme of new word extraction and is organized as follows: Section 2.1 explains why new words are extracted from Google News titles, and Section 2.2 explains the detailed procedures of the proposed scheme of new word extraction. Section 2.3 describes how to develop a Chinese word segmentation system based on the proposed scheme of new word extension, and Section 2.4 presents how to infer the life cycle of the extracted new words for eliminating out-of-date

Experimental results

To demonstrate the performance of the proposed scheme of new word extraction, Section 3.1 first reports the performance evaluation results for extracting new words from Google News articles, and Section 3.2 presents the performance of the Chinese word segmentation system ECScanner with the proposed scheme of new word discovery. Finally, Section 3.3 assesses the performance of eliminating out-of-date words by inferring the life cycles of extracted new words.

Discussion

This paper has proposed an excellent scheme of new word extraction for supporting lexicon-based Chinese word segmentation systems. However, two critical issues need to be further investigated.

Conclusion

In this paper, a novel statistics-based scheme for extracting new words from the categorized corpora of Google News titles, automatically retrieved from the Google News site, is presented to improve the word identification capability of the lexicon-based Chinese word segmentation system ECScanner. In addition, to avoid reducing the performance of word identification due to an overly large lexicon derived from new word extension, this study also proposes a fuzzy rule knowledge base to eliminate

References (20)

S. Foo et al., Chinese word segmentation and its effect on information retrieval, Information Processing & Management (2004)
M.-Y. Zhang et al., A Chinese word segmentation based on language situation in processing ambiguous words, Information Sciences (2004)
K.J. Chen et al., Unknown word detection for Chinese by a corpus-based learning method, International Journal of Computational Linguistics and Chinese Language Processing (1998)
C.-M. Chen et al., Personalized E-news monitoring agent system for tracking user-interested news events, IEEE International Conference on Systems, Man, and Cybernetics (2006)
K.J. Chen et al., Word identification for Mandarin Chinese sentences, Proceedings of COLING (1992)
K.J. Chen et al., Unknown word extraction for Chinese documents, Proceedings of COLING (2002)
G.G. Chowdhury, Introduction to modern information retrieval (2004)
CKIP Chinese Parser. (2007)....
Cscanner (A Chinese Lexicon Scanner). (2000)....
ECScanner (A Chinese Lexicon Scanner with Lexicon Extension). (2007)....

There are more references available in the full text version of this article.

Cited by (25)

Anecdotes extraction from webpage context as image annotation
2015, Emerging Trends in Image Processing, Computer Vision and Pattern Recognition
Traditional feature-based or text processing techniques tend to assign the same annotation to all the images in the same cluster without considering the latent semantic anecdotes of each image. In this research, we propose the Chinese lexical chain processing method which is a bottom-up concatenating process based on the intensity and the degree of a lexical chain (LC) to extract the most meaningful LCs as anecdotes from a string. It requires minimum computation that allows sharing characters/words and facilitating their use at fine granularities without prohibitive cost. In the experiment, this method achieves a precision rate of 84.6%, and gains acceptance from expert rating and user rating of 84% and 76.6%, respectively. In performance testing, it only takes 0.007 s to process each image in a collection of 18,000 testing data set.
Unknown Chinese word extraction based on variety of overlapping strings
2013, Information Processing and Management
Citation Excerpt:
However, dictionaries for domain specific words may not be always available, and it is costly and time-consuming to compile a list of domain specific words manually. Therefore, methods for automatically extracting unknown Chinese words from text corpora were developed (e.g., Chen & Ma, 2002; Hong, Chen, & Chiu, 2009; Peng, Feng, & McCallum, 2004). Typically, extraction of unknown Chinese word can be conducted in two steps: word extraction and word refinement.
Not all languages, e.g. Chinese, have delimiters for words. To extract words from a sentence in these languages, we usually rely on a dictionary for known words. For unknown words, some approaches rely on a domain specific dictionary or a tailor-made learning data set. However, this information may not be available. Another direction is to use unsupervised methods. These methods rely on a goodness measure to evaluate how likely the words are meaningful based on a statistical argument on the given text. The most challenging issue is to identify low-frequency meaningful words. In this paper, we first show by an empirical study on Chinese texts that all classical goodness measures cannot separate low-frequency meaningful and meaningless words effectively. To solve this problem, we propose a new goodness measure, the overlap variety method. The key idea behind the new measure is not to consider the absolute number of occurrences of the candidate (i.e., a string of Chinese characters) but to compare the goodness measures (we use the accessor variety) of the candidate and those of the strings overlapping the candidate. The candidate is likely to be meaningful if its accessor variety is larger than the accessor varieties of the overlapping strings. We implement an extraction system for unknown Chinese word, UNExtract, based on this overlap variety method. We evaluate our approach using the CIPS-SIGHAN-2010 bake off corpora and show that the proposed measure is more effective than the other five state-of-the-art goodness measures (accessor variety, branch entropy, description length gain, frequency substring reduction, pointwise mutual information), especially for low-frequency words and bi-gram words.
Mining term networks from text collections for crime investigation
2012, Expert Systems with Applications
Citation Excerpt:
The terms for the network can be obtained from a given list, output of an NLP parsers, or any key term extraction algorithms (Tseng, 1998). Examples include setting TF * IDF (term frequency and inverse document frequency) lower bound for selecting frequent words as key terms, using an NLP parser to extract specific noun phrases (Schneider, 2006), applying statistical testing for evaluating topic relatedness of each term for word selection (Matsuo & Ishizuka, 2004), association rule learning approach (Chen, Hsieh, & Hsu, 2007), using statistics-based scheme for extraction of new words based on the categorized corpora of Google News (Hong, Chen, & Chiu, 2009), using Bayesian text classification for keyword extraction from different text classification domains of varying characteristics (Lee, Isa, & Choo, 2011), or a combination of multiple techniques (Eck, Waltman, & Noyons, 2010). The use of frequency lower bound may be the simplest way for word selection.
An efficient term mining method to build a general term network is presented. The resulting term network can be used for entity relation visualization and exploration, which is useful in many text-mining applications such as crime exploration and investigation from vast piles of crime news or official criminal records. In the proposed method, terms from each document in a text collection are first identified. They are subjected to an analysis for pairwise association weights. The weights are then accumulated over all the documents to obtain final similarity for each term pair. Based on the resulting term similarity, a general term network for the collection is built with terms as nodes and non-zero similarities as links. In application, a list of predefined terms having similar attributes was selected to extract the desired sub-network from the general term network for entity relation visualization. This text analysis scenario based on the collective terms of the similar type or from the same topic enables evidence-based relation exploration. Some practical instances of crime exploration and investigation are demonstrated. Our application examples show that term relations, be it causality, subordination, coupling, or others, can be effectively revealed by our method and easily verified by the underlying text collection. This work contributes by presenting an integrated term-relationship mining and exploration approach and demonstrating the feasibility of the term network to the increasingly important application of crime exploration and investigation.
Knowledge integration and sharing for collaborative molding product design and process development
2010, Computers in Industry
This study presents a systematic approach to developing a knowledge integration and sharing mechanism for collaborative molding product design and process development. The proposed approach includes the steps of (i) collaborative molding product design and process development process modeling, (ii) an ontology-based knowledge model establishment, (iii) knowledge integration and sharing system framework design, (iv) ontology-based knowledge integration and sharing methods development, and (v) ontology-based knowledge integration and sharing mechanism implementation. The mechanism can support collaborative molding product design and process development by providing functions of knowledge integration and sharing. Results of this study facilitate the knowledge integration and sharing of collaborative molding product design and process development to satisfy the product knowledge demands of participants, and thus increase molding product development capability, reduce molding product development cycle time and cost, and ultimately increase molding product marketability.
The vision of Google News from academia: scoping review
2024, Doxa Comunicacion
Research on Chinese Medical Entity Relation Extraction Based on Syntactic Dependency Structure Information
2022, Applied Sciences (Switzerland)

