An Image Based Approach for Content Analysis in Document Collections

  • Conference paper
Advances in Visual Computing (ISVC 2013)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8034)

Abstract

We consider the task of content-based analysis and categorization in large-scale historical book scanning projects. Mixed content, outdated language, noise, and unexpected distortions suggest an image-based approach. Keypoint extractors combined with the bag-of-features approach are applied to scanned text documents. To incorporate spatial information into the bag-of-features approach, we consider three methods of spatial verification. An approach based on comparing statistics of local keypoint properties such as size, orientation, and scale showed comparable quality in content comparison while being computationally much more efficient. Cluster analysis delivers groups of pages characterized by common properties; in particular, duplicated page content is detected with high reliability.
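
As a rough illustration of the bag-of-features idea mentioned in the abstract, the following minimal sketch quantizes local descriptors against a visual vocabulary and compares two pages by the cosine similarity of their visual-word histograms. It is not the authors' implementation: the two-dimensional toy descriptors, the fixed three-word `VOCABULARY`, and all function names are assumptions made for illustration. A real pipeline would extract high-dimensional descriptors with a keypoint detector and learn the vocabulary by clustering.

```python
import math


def nearest_word(descriptor, vocabulary):
    """Return the index of the visual word closest to the descriptor
    (squared Euclidean distance)."""
    best_index, best_dist = 0, float("inf")
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def bag_of_features(descriptors, vocabulary):
    """Quantize descriptors and return an L1-normalized histogram
    of visual-word counts."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)


def cosine_similarity(h1, h2):
    """Cosine similarity of two histograms; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


# Hypothetical toy data: 2-D "descriptors" stand in for real keypoint descriptors.
VOCABULARY = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
page_a = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.9)]
page_b = [(0.0, 0.1), (1.0, 0.0), (0.9, 0.0), (0.1, 1.1)]   # near-duplicate of page_a
page_c = [(0.0, 1.0), (0.1, 0.9), (-0.1, 1.1), (0.0, 0.8)]  # different content

hist_a = bag_of_features(page_a, VOCABULARY)
hist_b = bag_of_features(page_b, VOCABULARY)
hist_c = bag_of_features(page_c, VOCABULARY)

sim_ab = cosine_similarity(hist_a, hist_b)
sim_ac = cosine_similarity(hist_a, hist_c)
print(f"similarity(a, b) = {sim_ab:.3f}, similarity(a, c) = {sim_ac:.3f}")
```

In this toy setup the near-duplicate pair scores higher than the unrelated pair. The histogram discards where on the page each keypoint lies, which is the gap that spatial verification is meant to close.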



Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huber-Mörk, R., Schindler, A. (2013). An Image Based Approach for Content Analysis in Document Collections. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2013. Lecture Notes in Computer Science, vol 8034. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41939-3_27

  • DOI: https://doi.org/10.1007/978-3-642-41939-3_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41938-6

  • Online ISBN: 978-3-642-41939-3

  • eBook Packages: Computer Science, Computer Science (R0)
