Large scale document image retrieval by automatic word annotation

  • Original Paper
International Journal on Document Analysis and Recognition (IJDAR)

Abstract

In this paper, we present a practical and scalable retrieval framework for large-scale document image collections in an Indian language script that lacks a robust OCR. OCR-based methods face difficulties in character segmentation and recognition, especially for complex Indian language scripts. We observe that character recognition is only an intermediate step toward labeling words. Hence, we re-pose the problem as one of directly performing word annotation. This approach yields better recognition performance and has easier segmentation requirements. However, the number of classes in word annotation is much larger than in character recognition, making such a classification scheme expensive to train and test. To address this issue, we present a novel framework that replaces naive classification with a carefully designed mixture of indexing and classification schemes. This enables us to build a search system over a large collection of 1,000 Telugu books, comprising 120K document images or 36M individual words. To our knowledge, this is the largest searchable document image collection for a script without an OCR. Our retrieval system performs well, achieving a mean average precision of 0.8.
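
The abstract reports retrieval quality as mean average precision (mAP). As a reminder of what that figure measures, the following is a minimal sketch of the standard definition, not the authors' evaluation code. The function names and the toy relevance lists are illustrative assumptions, and the sketch assumes every relevant item for a query appears somewhere in its ranked list.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    `relevance[i]` is True when the result at rank i + 1 is relevant.
    Assumption: all relevant items for the query appear in the list,
    so the number of hits equals the total number of relevant items.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant result.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(run) for run in runs) / len(runs)


# Hypothetical ranked results for two queries (illustrative only).
toy_runs = [
    [True, False, True],  # relevant results at ranks 1 and 3
    [False, True],        # relevant result at rank 2
]
toy_map = mean_average_precision(toy_runs)
```

A score of 1.0 would mean every relevant word image is ranked ahead of every irrelevant one for every query, so a value of 0.8 indicates that relevant results are concentrated near the top of the ranked lists.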

Acknowledgments

This work was supported in part by the Ministry of Communications and Information Technology, Govt. of India. R. Manmatha was supported in part by the Center for Intelligent Information Retrieval and by the National Science Foundation grant #IIS-0910884. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

Author information

Corresponding author

Correspondence to K. Pramod Sankar.

About this article

Cite this article

Pramod Sankar, K., Manmatha, R. & Jawahar, C.V. Large scale document image retrieval by automatic word annotation. IJDAR 17, 1–17 (2014). https://doi.org/10.1007/s10032-013-0207-2

  • DOI: https://doi.org/10.1007/s10032-013-0207-2
