ABSTRACT
Versatile algorithms for document image content extraction (DICE) were investigated in [1, 2, 3, 4]. The goal of DICE is to extract the image layers that contain content of interest, such as handwriting, machine-printed text, photographs, and blank space. A DICE classifier based on tight ground-truth data can delimit the regions of interest only approximately. In this paper, taking the output of the DICE classifier as input, we extend that work by attempting to completely separate the pixels of characters from the background and from other content, using image post-processing techniques and pattern recognition methods. First, we apply color space analysis to the detected text regions. Then we segment the image into regions (connected components) whose pixels share similar colors and content labels, and generate patches containing multiple connected components that lie within a selected distance of their neighbors. Finally, we classify the generated patches using structural features and DICE labels. Preliminary experimental results for the proposed model are promising.
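The segmentation and patch-generation steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-channel color tolerance `tol`, the 4-connectivity, the bounding-box gap used as the "selected distance", and the label names `"MP"` (machine print) and `"BL"` (blank) are all assumptions made for illustration. The final patch classification step is not shown.

```python
from collections import deque

def connected_components(colors, labels, tol=30):
    """Flood-fill 4-connected regions whose neighboring pixels share a
    content label and differ by at most `tol` in every color channel."""
    h, w = len(labels), len(labels[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if not (0 <= ny < h and 0 <= nx < w) or seen[ny][nx]:
                        continue
                    same_label = labels[ny][nx] == labels[cy][cx]
                    close = all(abs(a - b) <= tol
                                for a, b in zip(colors[ny][nx], colors[cy][cx]))
                    if same_label and close:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append({"label": labels[y][x], "pixels": pixels})
    return components

def bbox(pixels):
    """Inclusive bounding box (top, left, bottom, right) of a pixel list."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    return min(ys), min(xs), max(ys), max(xs)

def box_gap(a, b):
    """Number of pixels separating two boxes (0 if they touch or overlap)."""
    dy = max(a[0] - b[2] - 1, b[0] - a[2] - 1, 0)
    dx = max(a[1] - b[3] - 1, b[1] - a[3] - 1, 0)
    return max(dy, dx)

def group_into_patches(components, max_gap=2):
    """Merge components whose boxes lie within `max_gap` pixels (union-find)."""
    boxes = [bbox(c["pixels"]) for c in components]
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if box_gap(boxes[i], boxes[j]) <= max_gap:
                parent[find(i)] = find(j)

    patches = {}
    for i in range(len(components)):
        patches.setdefault(find(i), []).append(i)
    return list(patches.values())

# Toy 2x9 image: three dark vertical strokes on a white background.
W, K = (255, 255, 255), (0, 0, 0)
colors = [[W, K, W, K, W, W, W, W, K],
          [W, K, W, K, W, W, W, W, K]]
labels = [["MP" if c == K else "BL" for c in row] for row in colors]  # hypothetical labels

comps = connected_components(colors, labels)
text = [c for c in comps if c["label"] == "MP"]
patches = group_into_patches(text, max_gap=2)
print(patches)  # indices into `text`, one list per patch
```

In this toy input the two strokes separated by one background column fall into the same patch, while the stroke four columns away forms its own patch. Background components are filtered out before grouping because their bounding boxes would otherwise bridge unrelated strokes.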
- C. An and H. S. Baird. The convergence of iterated classification. In Proceedings of 8th International Workshop on Document Analysis Systems, pages 663--670, September 2008.
- C. An, H. S. Baird, and P. Xiu. Iterated document content classification. In Proceedings of the 9th International Conference on Document Analysis and Recognition, pages 252--256, September 2007.
- H. S. Baird. Towards versatile document analysis systems. In Proceedings of 7th IAPR Document Analysis Workshop (DAS06), 2006.
- H. S. Baird, M. A. Moll, J. Nonnemaker, and D. L. Delorenzo. Versatile document image content extraction. In Proceedings of SPIE/IS&T Document Recognition & Retrieval XIII Conf, 2006.
- C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
- A. Clavelli, D. Karatzas, and J. Lladós. A framework for the assessment of text extraction algorithms on complex colour images. In Proceedings of 9th International Workshop on Document Analysis Systems, pages 27--34, June 2010.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd Edition. Wiley-Interscience, 2000.
- Y. Liu and S. N. Srihari. Document image binarization based on texture features. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(5): 540--544, May 1997.
- L. O'Gorman. Binarization and multi thresholding of document images using connectivity. IEEE Trans. Pattern Analysis and Machine Intelligence, 56(6): 494--506, November 1994.
- X. Peng, S. Setlur, V. Govindaraju, R. Sitaram, and K. Bhuvanagiri. Markov random field based text identification from annotated machine printed documents. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 431--435, July 2009.
- E. Saund, J. Lin, and P. Sarkar. Pixlabeler: User interface for pixel-level labeling of elements in document images. In Proceedings of the 10th International Conference on Document Analysis and Recognition, pages 646--650, July 2009.
- J. Sauvola and M. Pietikainen. Adaptive document image binarization. Pattern Recognition, 33: 225--236, 2000.
- F. Shafait, D. Keysers, and T. M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Trans. Pattern Analysis and Machine Intelligence, 30(6): 941--954, 2008.
- E. H. B. Smith. An analysis of binarization ground truthing. In Proceedings of 9th International Workshop on Document Analysis Systems, pages 27--34, June 2010.
- Y. Zheng, H. Li, and D. Doermann. Machine printed text and handwriting identification in noisy document images. IEEE Trans. Pattern Analysis and Machine Intelligence, 26(3): 337--353, March 2004.
Index Terms
- Pixel accurate document image content extraction