Instance Filtering for entity recognition

Published: 01 June 2005
Abstract

In this paper we propose Instance Filtering as a preprocessing step for supervised classification-based learning systems for entity recognition. The goal of Instance Filtering is to reduce both the skewness of the class distribution and the data set size by eliminating negative instances while preserving as many positive ones as possible. The filter is applied to both the training and test sets, which reduces learning and classification time while maintaining or improving prediction accuracy. We performed a comparative study on a class of Instance Filtering techniques, called Stop Word Filters, that simply remove all tokens belonging to a list of stop words. We evaluated our approach on three entity recognition tasks (Named Entity, Bio-Entity and Temporal Expression Recognition) in English and Dutch, showing that both the skewness and the data set size are drastically reduced. Consequently, we report a substantial reduction in the computation time required for training and classification, while maintaining (and sometimes improving) the prediction accuracy.
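
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy sentence, the hand-written stop word list and the "O" label for negative instances are all assumptions made for illustration, and the abstract does not say how the stop word lists compared in the study are built. The sketch also assumes that tokens filtered out of the test set are labelled negative by default, which the abstract does not state explicitly.

```python
from typing import List, Sequence, Set, Tuple

NEGATIVE = "O"  # assumed label for tokens that are not part of any entity


def filter_instances(
    tokens: Sequence[str], labels: Sequence[str], stop_words: Set[str]
) -> Tuple[List[int], List[str], List[str]]:
    """Drop every token found in the stop word list.

    Returns the original positions of the kept tokens together with the
    kept tokens and their labels, so predictions can be mapped back later.
    """
    kept = [i for i, tok in enumerate(tokens) if tok.lower() not in stop_words]
    return kept, [tokens[i] for i in kept], [labels[i] for i in kept]


def negative_ratio(labels: Sequence[str]) -> float:
    """Fraction of instances carrying the negative label (a simple skewness measure)."""
    return sum(1 for label in labels if label == NEGATIVE) / len(labels)


def restore_predictions(
    n_tokens: int, kept: Sequence[int], kept_predictions: Sequence[str]
) -> List[str]:
    """Rebuild a full label sequence; filtered tokens default to NEGATIVE (assumption)."""
    full = [NEGATIVE] * n_tokens
    for i, pred in zip(kept, kept_predictions):
        full[i] = pred
    return full


# Toy example (hypothetical data, not taken from the paper's corpora).
tokens = ["The", "meeting", "with", "John", "Smith", "is", "in", "Trento", "."]
labels = ["O", "O", "O", "PER", "PER", "O", "O", "LOC", "O"]
stop_words = {"the", "with", "is", "in", "."}

kept, kept_tokens, kept_labels = filter_instances(tokens, labels, stop_words)

print(kept_tokens)                            # ['meeting', 'John', 'Smith', 'Trento']
print(f"{negative_ratio(labels):.2f}")        # 0.67
print(f"{negative_ratio(kept_labels):.2f}")   # 0.25
```

In this toy case the filter removes five of the nine instances, all of them negative, so the share of negative instances falls from 0.67 to 0.25 and no positive instance is lost. A classifier would then be trained and applied only on the kept tokens, with `restore_predictions` mapping its output back onto the full sentence.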

Published in

ACM SIGKDD Explorations Newsletter, Volume 7, Issue 1
Natural language processing and text mining
June 2005
81 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/1089815

Copyright © 2005 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
