research-article

Pixel accurate document image content extraction

Authors:
Siyuan Chen, Lehigh University, Bethlehem, PA
Henry S. Baird, Lehigh University, Bethlehem, PA

SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing, March 2011, Pages 245–251
https://doi.org/10.1145/1982185.1982242

Published: 21 March 2011

ABSTRACT

Versatile algorithms for document image content extraction (DICE) were investigated in [1, 2, 3, 4]. The goal of DICE is to extract the image layers that contain the content types of interest, such as handwriting, machine-print text, photographs, and blank space. A DICE classifier based on tight ground-truth data can delimit the regions of interest approximately. In this paper, taking the output of the DICE classifier as input, we extended that work by attempting to separate the pixels of characters completely from the background and from other content, using image post-processing techniques and pattern recognition methods. First, we applied color space analysis to the detected text regions. Next, we segmented the image into regions (connected components) whose pixels have similar colors and content labels, and generated patches that group connected components lying within a selected distance of their neighbors. Finally, we classified the generated patches using structural features and DICE labels. Preliminary experimental results for the proposed model are promising.
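The segmentation and patch-generation steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the grid encoding (one hashable value per pixel standing in for a quantized color and content label, with `None` as background), the 4-connectivity, the bounding-box gap measure, and the function names `connected_components`, `box_gap`, and `group_into_patches` are all assumptions made for illustration. The color space analysis and the final patch classification are not shown.

```python
from collections import deque


def connected_components(labels):
    """Find 4-connected regions of equal value in a 2D grid.

    `labels` is a list of rows. Each cell holds a hashable value that stands
    in for a (color, content label) pair; None marks background (assumption).
    Returns a list of components, each a list of (row, col) pixels.
    """
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or labels[r][c] is None:
                continue
            value = labels[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:  # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and labels[ny][nx] == value):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (min_row, min_col, max_row, max_col) of a pixel list."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def box_gap(a, b):
    """Empty rows/columns separating two boxes (0 if touching or overlapping)."""
    dy = max(b[0] - a[2] - 1, a[0] - b[2] - 1, 0)
    dx = max(b[1] - a[3] - 1, a[1] - b[3] - 1, 0)
    return max(dy, dx)


def group_into_patches(components, max_gap):
    """Merge components whose bounding boxes lie within `max_gap` pixels.

    Uses union-find, so chains of nearby components end up in one patch.
    Returns a sorted list of patches, each a list of component indices.
    """
    boxes = [bounding_box(p) for p in components]
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    patches = {}
    for i in range(len(components)):
        patches.setdefault(find(i), []).append(i)
    return sorted(patches.values())


# Toy input: two nearby "t" strokes and one distant "p" region.
grid = [
    ["t", None, "t", None, None, None, None, "p"],
    ["t", None, "t", None, None, None, None, "p"],
]
components = connected_components(grid)
patches = group_into_patches(components, max_gap=1)
print(patches)
```

In this toy grid the two "t" strokes are one empty column apart and fall into the same patch, while the "p" region is too far away to join them. In a fuller pipeline, each patch would then be passed to a classifier together with its structural features and DICE labels.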

References

[1] C. An and H. S. Baird. The convergence of iterated classification. In Proceedings of 8th International Workshop on Document Analysis Systems, pages 663--670, September 2008.
[2] C. An, H. S. Baird, and P. Xiu. Iterated document content classification. In Proceedings of the 9th International Conference on Document Analysis and Recognition, pages 252--256, September 2007.
[3] H. S. Baird. Towards versatile document analysis systems. In Proceedings of 7th IAPR Document Analysis Workshop (DAS06), 2006.
[4] H. S. Baird, M. A. Moll, J. Nonnemaker, and D. L. Delorenzo. Versatile document image content extraction. In Proceedings of SPIE/IS&T Document Recognition & Retrieval XIII Conf, 2006.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[6] A. Clavelli, D. Karatzas, and J. Lladós. A framework for the assessment of text extraction algorithms on complex colour images. In Proceedings of 9th International Workshop on Document Analysis Systems, pages 27--34, June 2010.
[7] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd Edition. Wiley-Interscience, 2000.
[8] Y. Liu and S. N. Srihari. Document image binarization based on texture features. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(5): 540--544, May 1997.
[9] L. O'Gorman. Binarization and multi thresholding of document images using connectivity. IEEE Trans. Pattern Analysis and Machine Intelligence, 56(6): 494--506, November 1994.
[10] X. Peng, S. Setlur, V. Govindaraju, R. Sitaram, and K. Bhuvanagiri. Markov random field based text identification from annotated machine printed documents. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 431--435, July 2009.
[11] E. Saund, J. Lin, and P. Sarkar. Pixlabeler: User interface for pixel-level labeling of elements in document images. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 646--650, July 2009.
[12] J. Sauvola and M. Pietikainen. Adaptive document image binarization. Pattern Recognition, 33: 225--236, 2000.
[13] F. Shafait, D. Keysers, and T. M. Breuel. Performance evaluation and benchmarking of six-page segmentation algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(6): 941--954, 2008.
[14] E. H. B. Smith. An analysis of binarization ground truthing. In Proceedings of 9th International Workshop on Document Analysis Systems, pages 27--34, June 2010.
[15] Y. Zheng, H. Li, and D. Doermann. Machine printed text and handwriting identification in noisy document images. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(3): 337--353, March 2004.

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document capture
      1. Optical character recognition
2. Computing methodologies
  1. Artificial intelligence
    1. Computer vision
      1. Computer vision problems
        1. Object recognition

Published in
SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing
March 2011
1868 pages
ISBN: 9781450301138
DOI: 10.1145/1982185
Conference Chairs: William Chu (Tunghai University, TaiChung, Taiwan); W. Eric Wong (University of Texas at Dallas, Richardson, Texas)
Program Chairs: Mathew J. Palakal (Indiana University Purdue University, Indianapolis); Chih-Cheng Hung (Southern Polytechnic State University, Marietta)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States