The Effectiveness of a Graph-Based Algorithm for Stemming

Bacchin, Michela; Ferro, Nicola; Melucci, Massimo

doi:10.1007/3-540-36227-4_12

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2555)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

In Information Retrieval (IR), stemming makes it possible to match query and document terms that share the same meaning but appear in different morphological variants. In this paper we propose and evaluate a statistical graph-based algorithm for stemming. Considering that a word is formed by a stem (prefix) and a derivation (suffix), the key idea is that strongly interlinked prefixes and suffixes form a community of sub-strings. Discovering these communities means searching for the word splits that yield the best word stems. We conducted experiments on the CLEF 2001 test subcollections for the Italian language. The results show that stemming improves IR effectiveness. They also show that the effectiveness of our algorithm is comparable to that of an algorithm based on a-priori linguistic knowledge. This is an encouraging result, particularly in a multi-lingual context.
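
The idea of prefixes and suffixes reinforcing one another can be illustrated with a small Python sketch. It is not the authors' implementation: the abstract does not spell out the scoring procedure, so the sketch assumes a HITS-style mutual reinforcement over a bipartite prefix/suffix graph, and it assumes the best split is the one maximising the product of the two scores. The function names (`all_splits`, `score_substrings`, `best_split`) and the English toy vocabulary are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def all_splits(word):
    """Return every (prefix, suffix) split of a word with both parts non-empty."""
    return [(word[:i], word[i:]) for i in range(1, len(word))]


def normalise(scores):
    """Scale scores so they sum to 1, which keeps the iteration numerically stable."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def score_substrings(words, iterations=30):
    """Score prefixes and suffixes by mutual reinforcement (assumed, HITS-style).

    Each distinct split of each word contributes one prefix-to-suffix edge.
    A suffix scores highly when good prefixes point to it, and vice versa.
    """
    edges = {split for word in set(words) for split in all_splits(word)}
    prefix_score = {p: 1.0 for p, _ in edges}
    suffix_score = {s: 1.0 for _, s in edges}
    for _ in range(iterations):
        new_suffix = defaultdict(float)
        for p, s in edges:
            new_suffix[s] += prefix_score[p]
        suffix_score = normalise(new_suffix)

        new_prefix = defaultdict(float)
        for p, s in edges:
            new_prefix[p] += suffix_score[s]
        prefix_score = normalise(new_prefix)
    return prefix_score, suffix_score


def best_split(word, prefix_score, suffix_score):
    """Pick the split whose prefix and suffix scores have the largest product.

    The product criterion is an assumption of this sketch.
    """
    splits = all_splits(word)
    if not splits:
        return word, ""  # single-character words cannot be split
    return max(
        splits,
        key=lambda ps: prefix_score.get(ps[0], 0.0) * suffix_score.get(ps[1], 0.0),
    )


# Hypothetical toy vocabulary: four stems that share the same three suffixes.
VOCABULARY = [
    "walks", "walked", "walking",
    "jumps", "jumped", "jumping",
    "plays", "played", "playing",
    "opens", "opened", "opening",
]

prefix_score, suffix_score = score_substrings(VOCABULARY)
stem, derivation = best_split("walking", prefix_score, suffix_score)
```

On this toy vocabulary the prefixes `walk`, `jump`, `play`, `open` and the suffixes `s`, `ed`, `ing` form the most densely interlinked group, so `walking` is split into `walk` and `ing`. A vocabulary this small is contrived to make the community visible; it says nothing about the effectiveness reported for the CLEF 2001 Italian subcollections.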

Author information

Authors and Affiliations

Department of Information Engineering, University of Padua, Via Gradenigo, 6/a, 35031, Padova, Italy
Michela Bacchin, Nicola Ferro & Massimo Melucci

Editor information

Editors and Affiliations

Nanyang Technological University, Singapore
Ee-Peng Lim, Schubert Foo & Chris Khoo
University of Arizona, USA
Hsinchun Chen
Virginia Tech, USA
Edward Fox
University of Mysore, Mysore
Shalini Urs
IEI-CNR, Italy
Thanos Costantino

About this paper

Cite this paper

Bacchin, M., Ferro, N., Melucci, M. (2002). The Effectiveness of a Graph-Based Algorithm for Stemming. In: Lim, E.P., et al. Digital Libraries: People, Knowledge, and Technology. ICADL 2002. Lecture Notes in Computer Science, vol 2555. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36227-4_12

DOI: https://doi.org/10.1007/3-540-36227-4_12
Published: 16 December 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-00261-1
Online ISBN: 978-3-540-36227-2
eBook Packages: Springer Book Archive