
Integrating Linguistic Resources in TC through WSD

Computers and the Humanities

Abstract

Information access methods must be improved to overcome the information overload that most professionals face nowadays. Text classification tasks, like Text Categorization (TC), help users access the large amount of text they find on the Internet and in their organizations.

TC is the classification of documents into a predefined set of categories. Most approaches to automatic TC are based on the use of a training collection, which is a set of manually classified documents. Other emerging linguistic resources, like lexical databases, can also be used for classification tasks. This article describes an approach to TC based on the integration of a training collection (Reuters-21578) and a lexical database (WordNet 1.6) as knowledge sources.

Lexical databases accumulate information on the lexical items of one or several languages. This information must be filtered in order to make effective use of it in our model of TC. This filtering process is a Word Sense Disambiguation (WSD) task. WSD is the identification of the sense of words in context. This task is an intermediate process in many natural language processing tasks, like machine translation or multilingual information retrieval. We present the use of WSD as an aid for TC. Our approach to WSD is also based on the integration of two linguistic resources: a training collection (SemCor and Reuters-21578) and a lexical database (WordNet 1.6).

We have developed a series of experiments that show that TC and WSD based on the integration of linguistic resources are very effective, and that WSD is necessary to effectively integrate linguistic resources in TC.




Cite this article

Ureña-López, L.A., Buenaga, M. & Gómez, J.M. Integrating Linguistic Resources in TC through WSD. Computers and the Humanities 35, 215–230 (2001). https://doi.org/10.1023/A:1002632712378
