A Multi-word Term Extraction System

Chen, Jisong; Yeh, Chung-Hsing; Chau, Rowena

doi:10.1007/978-3-540-36668-3_153


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4099)

Included in the following conference series:

Pacific Rim International Conference on Artificial Intelligence


Abstract

Traditional statistical approaches to identifying multi-word terms must handle large amounts of noisy data and are extremely time-consuming. This paper introduces a system that extracts multi-word terms from a set of documents based on the co-related text-segments found in those documents. The system uses a short predefined stoplist as initial input to segment the documents into text-segments and calculates the segment-weight of every text-segment. It then recursively applies the shorter text-segments to split the longer ones, based on the weight values, until no text-segment can be divided further. The resulting text-segments are identified as terms according to a specified threshold. Initial experimental results on a set of traditional Chinese documents show that the system achieves a recall of at least 76.39% and a precision of at least 91.05% in retrieving terms that occur multiple times, 18.30% of which are newly identified terms.
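
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the segment-weight or the exact splitting rule, so this sketch makes its own assumptions: the weight of a text-segment is taken to be its raw occurrence count, and a longer segment is split on a shorter one only when the shorter one has a strictly higher weight. The function names (`segment_by_stoplist`, `refine_segments`, `extract_terms`) and the sample documents are hypothetical and are not taken from the paper.

```python
from collections import Counter


def segment_by_stoplist(text, stoplist):
    """Cut text into text-segments wherever a stoplist entry occurs."""
    # Try longer stop entries first; ignore empty strings to avoid looping forever.
    stops = sorted((s for s in stoplist if s), key=len, reverse=True)
    segments, current, i = [], [], 0
    while i < len(text):
        for stop in stops:
            if text.startswith(stop, i):
                if current:
                    segments.append("".join(current))
                    current = []
                i += len(stop)
                break
        else:
            current.append(text[i])
            i += 1
    if current:
        segments.append("".join(current))
    return segments


def refine_segments(weights):
    """Split longer segments on shorter ones until nothing can be divided.

    Assumption: weight = occurrence count, and a split happens only when
    the shorter segment has a strictly higher weight than the longer one.
    """
    weights = Counter(weights)  # work on a copy
    changed = True
    while changed:
        changed = False
        for long_seg in sorted(weights, key=len, reverse=True):
            for short_seg in sorted(weights, key=len):
                if (len(short_seg) < len(long_seg)
                        and short_seg in long_seg
                        and weights[short_seg] > weights[long_seg]):
                    count = weights.pop(long_seg)
                    # Credit the shorter segment and every leftover piece.
                    weights[short_seg] += count * long_seg.count(short_seg)
                    for piece in long_seg.split(short_seg):
                        if piece:
                            weights[piece] += count
                    changed = True
                    break
            if changed:
                break  # restart with the updated weights
    return weights


def extract_terms(documents, stoplist, threshold=2):
    """Return segments whose refined weight reaches the threshold."""
    weights = Counter()
    for doc in documents:
        weights.update(segment_by_stoplist(doc, stoplist))
    refined = refine_segments(weights)
    return {seg: w for seg, w in refined.items() if w >= threshold}


if __name__ == "__main__":
    # Hypothetical toy corpus and stoplist, for illustration only.
    docs = ["資訊檢索的系統", "資訊檢索系統和資訊檢索", "系統的資訊檢索"]
    print(extract_terms(docs, {"的", "和"}, threshold=2))
```

In this toy run the compound segment 資訊檢索系統 occurs once, so it is split by the more frequent 系統, and its count is credited to both 系統 and 資訊檢索. Whether the actual system weights and splits segments this way cannot be determined from the abstract alone.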



Author information

Authors and Affiliations

The Clayton School of Information Technology, Monash University, Clayton, Victoria, 3800, Australia
Jisong Chen, Chung-Hsing Yeh & Rowena Chau


Editor information

Editors and Affiliations

The Hong Kong University of Science and Technology, Hong Kong
Qiang Yang
Clayton School of Information Technology, Monash University, P.O. Box, Australia
Geoff Webb


About this paper

Cite this paper

Chen, J., Yeh, CH., Chau, R. (2006). A Multi-word Term Extraction System. In: Yang, Q., Webb, G. (eds) PRICAI 2006: Trends in Artificial Intelligence. PRICAI 2006. Lecture Notes in Computer Science, vol 4099. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-36668-3_153


DOI: https://doi.org/10.1007/978-3-540-36668-3_153
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-36667-6
Online ISBN: 978-3-540-36668-3
eBook Packages: Computer Science, Computer Science (R0)
