Large scale document image retrieval by automatic word annotation

Pramod Sankar, K.; Manmatha, R.; Jawahar, C. V.

doi:10.1007/s10032-013-0207-2

Original Paper
Published: 16 July 2013

Volume 17, pages 1–17 (2014)
International Journal on Document Analysis and Recognition (IJDAR)

K. Pramod Sankar¹,
R. Manmatha² &
C. V. Jawahar³

Abstract

In this paper, we present a practical and scalable retrieval framework for large-scale document image collections in an Indian language script that does not have a robust OCR. OCR-based methods face difficulties in character segmentation and recognition, especially for complex Indian language scripts. We observe that character recognition is only an intermediate step toward labeling words. Hence, we re-pose the problem as one of directly performing word annotation. This new approach has better recognition performance, as well as easier segmentation requirements. However, the number of classes in word annotation is much larger than that for character recognition, making such a classification scheme expensive to train and test. To address this issue, we present a novel framework that replaces naive classification with a carefully designed mixture of indexing and classification schemes. This enables us to build a search system over a large collection of 1,000 Telugu books, consisting of 120K document images or 36M individual words. This is the largest searchable document image collection for a script without an OCR that we are aware of. Our retrieval system performs well, with a mean average precision of 0.8.
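
The abstract reports retrieval quality as mean average precision (mAP). As a reminder of how that metric is commonly computed, the following is a minimal sketch, not the authors' evaluation code. The function names and the toy relevance lists are hypothetical, and the sketch assumes each query's ranked result list has already been marked as relevant or not relevant.

```python
from typing import Optional, Sequence


def average_precision(ranked_relevance: Sequence[bool],
                      total_relevant: Optional[int] = None) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True if the result at rank i + 1 is relevant.
    total_relevant is the number of relevant items in the collection;
    if omitted, the number of relevant items found in the list is used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant result.
            precision_sum += hits / rank
    denominator = total_relevant if total_relevant is not None else hits
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical toy data: two queries with hand-labelled result lists.
    toy_runs = [
        [True, False, True],   # relevant results at ranks 1 and 3
        [False, True],         # relevant result at rank 2
    ]
    print(round(mean_average_precision(toy_runs), 4))
```

Under this definition, a mAP of 0.8 means that, averaged over queries, relevant word images tend to appear near the top of the ranked list with few irrelevant results placed above them.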

Acknowledgments

This work was supported in part by the Ministry of Communications and Information Technology, Govt. of India. R. Manmatha was supported in part by the Center for Intelligent Information Retrieval and by the National Science Foundation grant #IIS-0910884. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

Author information

Authors and Affiliations

Xerox Research Center India, Bengaluru, India
K. Pramod Sankar
Multi-media Indexing and Retrieval Group, Department of Computer Science, University of Massachusetts, Amherst, MA, USA
R. Manmatha
Center for Visual Information Technology, IIIT-Hyderabad, Hyderabad, India
C. V. Jawahar

Corresponding author

Correspondence to K. Pramod Sankar.

About this article

Cite this article

Pramod Sankar, K., Manmatha, R. & Jawahar, C.V. Large scale document image retrieval by automatic word annotation. IJDAR 17, 1–17 (2014). https://doi.org/10.1007/s10032-013-0207-2

Received: 19 June 2012
Revised: 25 April 2013
Accepted: 05 June 2013
Published: 16 July 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s10032-013-0207-2
