
Instance Filtering for entity recognition

Authors: Alfio Massimiliano Gliozzo, Claudio Giuliano, Raffaella Rinaldi (all ITC-irst, Trento, Italy)

ACM SIGKDD Explorations Newsletter, Volume 7, Issue 1, June 2005, pp. 11–18. https://doi.org/10.1145/1089815.1089818

Published: 01 June 2005


Abstract

In this paper we propose Instance Filtering as a preprocessing step for supervised classification-based learning systems for entity recognition. The goal of Instance Filtering is to reduce both the skewness of the class distribution and the data set size by eliminating negative instances while preserving as many positive ones as possible. Filtering is applied to both the training and test sets, which reduces learning and classification time while maintaining or improving prediction accuracy. We performed a comparative study of a class of Instance Filtering techniques, called Stop Word Filters, which simply remove all tokens belonging to a list of stop words. We evaluated our approach on three entity recognition tasks (Named Entity, Bio-Entity, and Temporal Expression Recognition) in English and Dutch, showing that both the skewness and the data set size are drastically reduced. Consequently, we report a substantial reduction in the computation time required for training and classification, while maintaining (and sometimes improving) prediction accuracy.
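The idea of a Stop Word Filter can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word list, the `"O"` label for negative instances, the toy sentence, and the helper names `stop_word_filter` and `positive_ratio` are all assumptions made for illustration. The sketch removes every token found in the stop word list and compares the share of positive instances before and after filtering.

```python
from typing import List, Set, Tuple

# A token is a (word, label) pair; "O" marks a negative instance (assumed convention).
Token = Tuple[str, str]


def stop_word_filter(tokens: List[Token], stop_words: Set[str]) -> List[Token]:
    """Drop every token whose lowercased word belongs to the stop word list."""
    return [(word, label) for word, label in tokens if word.lower() not in stop_words]


def positive_ratio(tokens: List[Token]) -> float:
    """Fraction of tokens that are positive instances (label other than "O")."""
    if not tokens:
        return 0.0
    positives = sum(1 for _, label in tokens if label != "O")
    return positives / len(tokens)


# Hypothetical stop word list; the paper compares several ways of building one.
STOP_WORDS = {"the", "of", "in", "a", "on", "."}

# Toy annotated sentence, invented for this example.
train = [
    ("The", "O"), ("president", "O"), ("of", "O"), ("Italy", "B-LOC"),
    ("arrived", "O"), ("in", "O"), ("Trento", "B-LOC"), (".", "O"),
]

filtered = stop_word_filter(train, STOP_WORDS)

print(len(train), positive_ratio(train))        # 8 0.25
print(len(filtered), positive_ratio(filtered))  # 4 0.5
```

In this toy case the data set shrinks from 8 to 4 instances and both positive instances survive, so the class distribution becomes less skewed. A stop word that occurs inside an entity would be lost, which is why the filter preserves positive instances only "as much as possible". The same filter would be applied to the test set; how filtered test tokens are then labeled (for example, as negative by default) is left as an assumption here.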


Index Terms

Instance Filtering for entity recognition
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
    2. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction
  2. Information systems applications
    1. Data mining


Published in

ACM SIGKDD Explorations Newsletter, Volume 7, Issue 1: Natural language processing and text mining
June 2005, 81 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1089815

Copyright © 2005 Authors

Publisher: Association for Computing Machinery, New York, NY, United States