Learning to Order Terms: Supervised Interestingness Measures in Terminology Extraction

doi:10.5281/zenodo.1333849

Published June 26, 2007 | Version 7462

Journal article Open

Term Extraction, a key data preparation step in Text Mining, extracts terms, i.e. relevant collocations of words attached to specific concepts (e.g. genetic-algorithms and decision-trees are terms associated with the concept "Machine Learning"). In this paper, the task of extracting interesting collocations is handled by a supervised learning algorithm that exploits a few collocations manually labelled as interesting or not interesting. From these examples, the ROGER algorithm learns a numerical function that induces a ranking on the collocations. This ranking is optimized using genetic algorithms, maximizing the Area Under the ROC Curve, which summarizes the trade-off between the true positive and false positive rates. The approach represents each word collocation as the vector of values of the standard statistical interestingness measures attached to that collocation. As this representation is general (over corpora and natural languages), generality tests were performed by applying the ranking function learned from an English corpus in Biology to a French corpus of Curricula Vitae, and vice versa, showing good robustness of the approach compared with the state-of-the-art Support Vector Machine (SVM).
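The idea of learning a ranking function by maximizing the Area Under the ROC Curve can be illustrated with a minimal sketch. It is not the ROGER implementation, whose details the abstract does not give. The sketch assumes a linear scoring function over the measure vector and swaps in a simple mutation-only evolutionary loop (keep the best candidate, perturb it with Gaussian noise) for a full genetic algorithm. The names `auc`, `score` and `evolve_ranker`, the parameter values, and the toy feature vectors are all hypothetical.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via the Wilcoxon-Mann-Whitney statistic:
    the fraction of (positive, negative) pairs in which the positive
    example gets the higher score (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def score(weights, features):
    """Assumed linear ranking function: weighted sum of the measures."""
    return sum(w * x for w, x in zip(weights, features))


def evolve_ranker(X, y, generations=200, offspring=10, sigma=0.3, seed=0):
    """Mutation-only evolutionary search for weights that maximize AUC.
    A stand-in for a genetic algorithm, not the authors' procedure."""
    rng = random.Random(seed)
    best = [rng.gauss(0, 1) for _ in range(len(X[0]))]
    best_auc = auc([score(best, x) for x in X], y)
    for _ in range(generations):
        for _ in range(offspring):
            child = [w + rng.gauss(0, sigma) for w in best]
            child_auc = auc([score(child, x) for x in X], y)
            if child_auc >= best_auc:  # accept ties to keep exploring
                best, best_auc = child, child_auc
    return best, best_auc


# Toy data (made up): each row is a hypothetical vector of interestingness
# measures for one collocation; label 1 = interesting, 0 = not interesting.
X = [
    [0.9, 0.8, 0.1],
    [0.8, 0.7, 0.3],
    [0.7, 0.9, 0.2],
    [0.2, 0.3, 0.9],
    [0.1, 0.2, 0.8],
    [0.3, 0.1, 0.7],
]
y = [1, 1, 1, 0, 0, 0]

weights, train_auc = evolve_ranker(X, y)
# Indices of the collocations, ordered from highest to lowest score.
ranking = sorted(range(len(X)), key=lambda i: score(weights, X[i]), reverse=True)
```

Because the learned function only reads the measure vector, the same `weights` could in principle be applied to vectors computed from another corpus, which is the kind of transfer the generality tests examine.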

Files

Name	Size
7462.pdf (md5:540fc90d51f93bea7efe9aa8442145a0)	351.2 kB
