DOI: 10.1145/1982185.1982242
research-article

Pixel accurate document image content extraction

Published: 21 March 2011

ABSTRACT

Versatile algorithms for document image content extraction (DICE) were investigated in [1, 2, 3, 4]. Their goal is to extract the image layers that contain the content of interest, such as handwriting, machine-printed text, photographs, and blank space. A DICE classifier based on tight ground-truth data can delimit the regions of interest approximately. In this paper, taking the output of the DICE classifier as input, we extend that work by attempting to completely separate character pixels from the background and from other content, using image post-processing techniques and pattern recognition methods. First, we apply color space analysis to the detected text regions. Next, we segment the image into regions (connected components) whose pixels have similar colors and content labels, and generate patches containing multiple connected components that lie within a selected distance of their neighbors. Finally, we classify the generated patches using structural features and DICE labels. The preliminary experimental results for the proposed model are promising.
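The segmentation and patch-generation steps described above can be illustrated with a minimal sketch. The abstract does not state the color space, the connectivity rule, the color tolerance, or the distance measure, so everything below is an assumption made for illustration: RGB tuples, 4-connectivity, a per-channel tolerance `tol`, and a bounding-box gap `max_gap`. The function names are hypothetical and are not taken from the paper, and the final patch classification step is not shown.

```python
from collections import deque


def color_distance(a, b):
    """Largest per-channel difference between two RGB tuples (assumed metric)."""
    return max(abs(p - q) for p, q in zip(a, b))


def connected_components(colors, labels, tol=30):
    """Group 4-connected pixels that share a content label and a similar color.

    Each neighbor is compared with the pixel it is reached from, so slow
    color gradients can chain into a single component.
    """
    h, w = len(colors), len(colors[0])
    comp_id = [[-1] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if comp_id[y][x] != -1:
                continue
            cid = len(components)
            comp_id[y][x] = cid
            pixels, queue = [], deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < h and 0 <= nx < w and comp_id[ny][nx] == -1
                            and labels[ny][nx] == labels[cy][cx]
                            and color_distance(colors[ny][nx], colors[cy][cx]) <= tol):
                        comp_id[ny][nx] = cid
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a list of (row, col) pixels."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def box_gap(a, b):
    """Number of empty rows/columns separating two boxes; 0 if they touch."""
    dy = max(a[0] - b[2] - 1, b[0] - a[2] - 1, 0)
    dx = max(a[1] - b[3] - 1, b[1] - a[3] - 1, 0)
    return max(dy, dx)


def group_into_patches(components, max_gap=2):
    """Merge components whose bounding boxes lie within max_gap of each other."""
    boxes = [bounding_box(c) for c in components]
    parent = list(range(len(components)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(j)] = find(i)

    patches = {}
    for i in range(len(components)):
        patches.setdefault(find(i), []).append(i)
    return list(patches.values())


# Toy 3x9 page: black "text" pixels (label "T") on a white "blank" page (label "B").
W, K = (255, 255, 255), (0, 0, 0)
colors = [[W] * 9, [K, K, W, K, W, W, W, W, K], [W] * 9]
labels = [["T" if c == K else "B" for c in row] for row in colors]

comps = connected_components(colors, labels)
text_comps = [c for c in comps if labels[c[0][0]][c[0][1]] == "T"]
patches = group_into_patches(text_comps, max_gap=2)  # [[0, 1], [2]]
```

On this toy page the two strokes on the left are one column apart and fall into the same patch, while the isolated pixel on the right forms a patch of its own. In a fuller pipeline, each patch would then be described by structural features and its DICE labels and passed to a classifier.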

References

  1. C. An and H. S. Baird. The convergence of iterated classification. In Proceedings of 8th International Workshop on Document Analysis Systems, pages 663--670, September 2008.
  2. C. An, H. S. Baird, and P. Xiu. Iterated document content classification. In Proceedings of the 9th International Conference on Document Analysis and Recognition, pages 252--256, September 2007.
  3. H. S. Baird. Towards versatile document analysis systems. In Proceedings of 7th IAPR Document Analysis Workshop (DAS06), 2006.
  4. H. S. Baird, M. A. Moll, J. Nonnemaker, and D. L. Delorenzo. Versatile document image content extraction. In Proceedings of SPIE/IS&T Document Recognition & Retrieval XIII Conf, 2006.
  5. C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  6. A. Clavelli, D. Karatzas, and J. Lladós. A framework for the assessment of text extraction algorithms on complex colour images. In Proceedings of 9th International Workshop on Document Analysis Systems, pages 27--34, June 2010.
  7. R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd Edition. Wiley-Interscience, 2000.
  8. Y. Liu and S. N. Srihari. Document image binarization based on texture features. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(5): 540--544, May 1997.
  9. L. O'Gorman. Binarization and multi thresholding of document images using connectivity. IEEE Trans. Pattern Analysis and Machine Intelligence, 56(6): 494--506, November 1994.
  10. X. Peng, S. Setlur, V. Govindaraju, R. Sitaram, and K. Bhuvanagiri. Markov random field based text identification from annotated machine printed documents. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 431--435, July 2009.
  11. E. Saund, J. Lin, and P. Sarkar. Pixlabeler: User interface for pixel-level labeling of elements in document images. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 646--650, July 2009.
  12. J. Sauvola and M. Pietikainen. Adaptive document image binarization. Pattern Recognition, 33: 225--236, 2000.
  13. F. Shafait, D. Keysers, and T. M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(6): 941--954, 2008.
  14. E. H. B. Smith. An analysis of binarization ground truthing. In Proceedings of 9th International Workshop on Document Analysis Systems, pages 27--34, June 2010.
  15. Y. Zheng, H. Li, and D. Doermann. Machine printed text and handwriting identification in noisy document images. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(3): 337--353, March 2004.

      • Published in

        SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing
        March 2011
        1868 pages
        ISBN: 9781450301138
        DOI: 10.1145/1982185

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
