Abstract
In this paper we propose Instance Filtering as a preprocessing step for supervised classification-based learning systems for entity recognition. The goal of Instance Filtering is to reduce both the skewness of the class distribution and the data set size by eliminating negative instances while preserving as many positive ones as possible. This process is performed on both the training and test sets, with the effect of reducing the learning and classification time while maintaining or improving the prediction accuracy. We performed a comparative study on a class of Instance Filtering techniques, called Stop Word Filters, that simply remove all the tokens belonging to a list of stop words. We evaluated our approach on three different entity recognition tasks (i.e. Named Entity, Bio-Entity and Temporal Expression Recognition) in English and Dutch, showing that both the skewness and the data set size are drastically reduced. Consequently, we report a substantial reduction of the computation time required for training and classification, while maintaining (and sometimes improving) the prediction accuracy.
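The idea of a Stop Word Filter can be illustrated with a minimal sketch. It is not the authors' implementation: the stop word list, the "O" label for negative instances, the case-insensitive matching, and the names `stop_word_filter` and `positive_ratio` are all assumptions made for illustration. The sketch drops every token found in the stop list and compares the share of positive instances before and after filtering.

```python
from typing import List, Set, Tuple

# An instance is a (token, label) pair; "O" marks a negative instance
# (a token outside any entity). This labeling scheme is an assumption.
Instance = Tuple[str, str]


def stop_word_filter(instances: List[Instance], stop_words: Set[str]) -> List[Instance]:
    """Remove every instance whose token is in the stop word list.

    The label is never consulted, so the same filter can be applied to
    test data. Matching is case-insensitive (an illustrative choice).
    """
    return [(tok, lab) for tok, lab in instances if tok.lower() not in stop_words]


def positive_ratio(instances: List[Instance]) -> float:
    """Fraction of instances that are positive (label other than "O")."""
    if not instances:
        return 0.0
    positives = sum(1 for _, lab in instances if lab != "O")
    return positives / len(instances)


# Hypothetical stop word list and toy training sentence.
STOP_WORDS = {"the", "of", "in", "a", "is", "."}
train = [
    ("The", "O"), ("University", "B-ORG"), ("of", "I-ORG"), ("Trento", "I-ORG"),
    ("is", "O"), ("in", "O"), ("Italy", "B-LOC"), (".", "O"),
]

filtered = stop_word_filter(train, STOP_WORDS)
print(len(train), positive_ratio(train))
print(len(filtered), positive_ratio(filtered))
```

In this toy example the filter also removes the positive token "of", which shows why positive instances can only be preserved "as much as possible": the choice of stop words trades data set reduction against lost positives.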
Index Terms
- Instance Filtering for entity recognition