Text Retrieval from Document Images Based on Word Shape Analysis

Applied Intelligence

Abstract

In this paper, we propose a method for retrieving text from document images using a similarity measure based on word shape analysis. Instead of applying optical character recognition, we extract image features directly. Document images are segmented into word units, and features called vertical bar patterns are extracted from these word units by detecting local extrema points. All vertical bar patterns are then used to build document vectors. Finally, the pairwise similarity of document images is obtained as the scalar product of their document vectors. Four corpora of news articles were used to test the validity of the method. In these tests, the similarity of document images computed with this method was compared with the similarity obtained from the ASCII versions of the same documents using an N-gram algorithm for text documents.
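
The text-side baseline mentioned above, N-gram document vectors compared by scalar product, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `ngram_vector` and `similarity`, the choice of character trigrams, and the scaling of each vector to unit length before taking the scalar product are all assumptions made for the example. For document images, the N-gram counts would presumably be replaced by counts of vertical bar patterns, whose extraction the abstract does not specify.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in a text.

    Hypothetical stand-in for a document vector; n = 3 is an assumption.
    """
    # Lowercase and collapse runs of whitespace so layout does not matter.
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def similarity(a: Counter, b: Counter) -> float:
    """Scalar product of two document vectors, each scaled to unit length.

    The unit-length scaling is an assumption; the abstract only states
    that a scalar product of document vectors is used.
    """
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    # Counter returns 0 for n-grams absent from b, so only shared keys add.
    dot = sum(count * b[key] for key, count in a.items())
    return dot / (norm_a * norm_b)
```

With this sketch, a document compared with itself scores approximately 1, two documents sharing no N-grams score 0, and partially overlapping documents fall in between.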



Cite this article

Tan, C.L., Huang, W., Sung, S.Y. et al. Text Retrieval from Document Images Based on Word Shape Analysis. Applied Intelligence 18, 257–270 (2003). https://doi.org/10.1023/A:1023245904128

  • DOI: https://doi.org/10.1023/A:1023245904128
