Abstract
Document image classification has attracted considerable interest in business process automation and plays an important role in document image processing (DIP) systems, so an effective framework for the task is needed. Many methods for classifying document images have been proposed in the literature. In this paper we propose an efficient document image classification method that uses vision transformers (ViTs) and exploits the visual information of the document. Transformers are models originally developed for natural language processing tasks. Owing to their strong performance, their architectures have been adapted and applied to other problems, and the ViT is one such model. ViTs have demonstrated impressive performance in computer vision tasks compared with baselines, since they scan the image and model the relations between image patches using multi-head self-attention. Experiments are conducted on a real-world dataset. Despite the limited amount of training data available, our method achieves acceptable performance on document image classification.
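The patch-and-attention mechanism mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the image size, patch size, embedding width, number of heads, and the randomly initialised weight matrices are all assumptions chosen for illustration, and the class token, position embeddings, layer normalisation, and MLP blocks of a full ViT are omitted.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)              # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(tokens, wq, wk, wv, wo, heads):
    """Scaled dot-product self-attention over patch tokens, split into heads."""
    n, d = tokens.shape
    dh = d // heads

    def split(x):                                # (n, d) -> (heads, n, dh)
        return x.reshape(n, heads, dh).transpose(1, 0, 2)

    q, k, v = split(tokens @ wq), split(tokens @ wk), split(tokens @ wv)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dh)   # (heads, n, n)
    attn = softmax(scores, axis=-1)              # each row sums to 1
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)
    return out @ wo, attn


# Hypothetical sizes: a 32x32 RGB "document image" cut into 8x8 patches.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = patchify(image, 8)                     # 16 patches of length 192

dim, heads = 64, 4                               # assumed embedding width and head count
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], dim))
tokens = patches @ w_embed                       # linear patch embedding

# Random projection weights stand in for learned parameters.
wq, wk, wv, wo = (rng.normal(scale=0.02, size=(dim, dim)) for _ in range(4))
out, attn = multi_head_self_attention(tokens, wq, wk, wv, wo, heads)
```

Each row of `attn` weights how strongly one patch attends to every other patch, which is the sense in which the model relates image patches to one another.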
Acknowledgments
This work has been supported by the Kocaeli University Scientific Research and Development Support Program (BAP) in Turkey under project number FBA-2020-2152.
Copyright information
© 2022 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering
About this paper
Cite this paper
Sevim, S., Omurca, S.İ., Ekinci, E. (2022). Document Image Classification with Vision Transformers. In: Seyman, M.N. (eds) Electrical and Computer Engineering. ICECENG 2022. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 436. Springer, Cham. https://doi.org/10.1007/978-3-031-01984-5_6
DOI: https://doi.org/10.1007/978-3-031-01984-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-01983-8
Online ISBN: 978-3-031-01984-5
eBook Packages: Computer Science, Computer Science (R0)