Abstract
We propose a method for text retrieval from document images that uses a similarity measure based on word shape analysis. Instead of applying optical character recognition, we extract image features directly. Document images are segmented into word units, and features called vertical bar patterns are extracted from these word units by detecting local extrema points. All vertical bar patterns are used to build document vectors. Finally, we obtain the pair-wise similarity of document images as the scalar product of their document vectors. Four corpora of news articles were used to test the validity of the method. In these tests, the similarity computed between document images was compared with the similarity computed between the ASCII versions of the same documents using the N-gram algorithm for text documents.
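The similarity step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern codes (`p1`, `p2`, ...), the function names, the unit-length normalization, and the choice of character trigrams for the text baseline are all assumptions made for illustration. The abstract only states that document vectors are built from vertical bar patterns and compared by scalar product.

```python
from collections import Counter
from math import sqrt


def document_vector(tokens):
    """Count token occurrences and scale the counts to unit Euclidean length.

    Normalizing to unit length is an assumption; it keeps the scalar
    product of two vectors in the range [0, 1].
    """
    counts = Counter(tokens)
    norm = sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {tok: c / norm for tok, c in counts.items()}


def scalar_product(u, v):
    """Scalar (dot) product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v.get(tok, 0.0) for tok, w in u.items())


def char_ngrams(text, n=3):
    """Overlapping character n-grams of a string (assumed text baseline)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Hypothetical pattern codes standing in for extracted vertical bar patterns.
doc_a = ["p1", "p2", "p2", "p3"]
doc_b = ["p2", "p3", "p3", "p4"]

# Similarity of two "document images" from their pattern vectors.
sim_image = scalar_product(document_vector(doc_a), document_vector(doc_b))

# Similarity of two text strings from their character trigram vectors.
sim_text = scalar_product(
    document_vector(char_ngrams("text retrieval")),
    document_vector(char_ngrams("text retrieved")),
)
```

Using the same vector construction for both the image features and the text n-grams makes the two similarity scores directly comparable, which is the kind of comparison the evaluation describes.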
Cite this article
Tan, C.L., Huang, W., Sung, S.Y. et al. Text Retrieval from Document Images Based on Word Shape Analysis. Applied Intelligence 18, 257–270 (2003). https://doi.org/10.1023/A:1023245904128