Abstract
Retrieving documents that originate from digitized, OCR-converted paper documents is an important task for modern retrieval systems. The problems that OCR errors cause for the retrieval process have been a subject of research for several years. We approach the problem from a theoretical point of view and model OCR conversion as a random experiment. Our theoretical results, which are supported by experiments, show clearly that information retrieval can cope even with many errors. It is important, however, that the documents are not too short and that recognition errors are distributed appropriately among words and documents. These results indicate that expensive manual or automatic post-processing of OCR-converted documents usually does not make sense, but that scanning and OCR must be performed appropriately and with care.
Cite this article
Mittendorf, E., Schäuble, P. Information Retrieval can Cope with Many Errors. Information Retrieval 3, 189–216 (2000). https://doi.org/10.1023/A:1026564708926
DOI: https://doi.org/10.1023/A:1026564708926