Text Retrieval from Document Images Based on Word Shape Analysis

Tan, Chew Lim; Huang, Weihua; Sung, Sam Yuan; Yu, Zhaohui; Xu, Yi

doi:10.1023/A:1023245904128

Published: May 2003

Volume 18, pages 257–270, (2003)

Applied Intelligence

Abstract

In this paper, we propose a method of text retrieval from document images using a similarity measure based on word shape analysis. We extract image features directly instead of using optical character recognition. Document images are segmented into word units, and features called vertical bar patterns are then extracted from these word units through local extrema point detection. All vertical bar patterns are used to build document vectors. Lastly, we obtain the pairwise similarity of document images by means of the scalar product of the document vectors. Four corpora of news articles were used to test the validity of our method. In the tests, the similarity of document images obtained with this method was compared with the similarity computed from the ASCII versions of those documents using the N-gram algorithm for text documents.
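
The similarity step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract only states that document vectors are built from vertical bar patterns and compared by scalar product, so the feature labels (`"p1"`, `"p2"`, ...), the function names, and the scaling of each vector to unit length are all assumptions made here for illustration. The `char_ngrams` helper is likewise a generic character N-gram extractor of the kind that could feed the same comparison on ASCII text, not necessarily the N-gram algorithm used in the paper.

```python
from collections import Counter
from math import sqrt


def document_vector(features):
    """Count how often each feature occurs in one document.

    `features` is any list of hashable items, e.g. hypothetical
    vertical-bar-pattern codes or character N-grams.
    """
    return Counter(features)


def scalar_product(u, v):
    """Scalar (dot) product of two sparse count vectors."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(count * v[key] for key, count in u.items())


def similarity(u, v):
    """Scalar product after scaling both vectors to unit length.

    The normalisation is an assumption of this sketch; it keeps the
    score in [0, 1] regardless of document length.
    """
    norm = sqrt(scalar_product(u, u) * scalar_product(v, v))
    return scalar_product(u, v) / norm if norm else 0.0


def char_ngrams(text, n=3):
    """Overlapping character N-grams of an ASCII string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical pattern codes standing in for extracted image features.
doc_a = document_vector(["p1", "p2", "p2", "p3"])
doc_b = document_vector(["p2", "p3", "p3", "p4"])
print(scalar_product(doc_a, doc_b))
print(round(similarity(doc_a, doc_b), 4))
```

Because both the image features and the text N-grams reduce to count vectors, the same `similarity` function can score either representation, which is what makes a side-by-side comparison of the two straightforward.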

Author information

Authors and Affiliations

School of Computing, National University of Singapore, 3 Science Drive 2, Singapore, 117543
Chew Lim Tan, Weihua Huang, Sam Yuan Sung, Zhaohui Yu & Yi Xu

About this article

Cite this article

Tan, C.L., Huang, W., Sung, S.Y. et al. Text Retrieval from Document Images Based on Word Shape Analysis. Applied Intelligence 18, 257–270 (2003). https://doi.org/10.1023/A:1023245904128

Issue Date: May 2003
DOI: https://doi.org/10.1023/A:1023245904128
