Scene Text Detection Using HRNet and Spatial Attention Mechanism

Tang, Qingsong; Jiang, Zhangyan; Pan, Bolin; Guo, Jinting; Jiang, Wuming

doi:10.1134/S0361768823080212

Published: 24 January 2024

Volume 49, pages 954–965 (2023)
Programming and Computer Software

Abstract

To better extract features from text instances of various shapes, this paper proposes a scene text detector that combines the High-Resolution Network (HRNet) with a spatial attention mechanism. Specifically, we use HRNetv2-W18 as the backbone network to extract features from text instances with complex shapes. Because scene text instances are usually small, and to keep the feature maps from becoming too small, we modify HRNet with deformable convolution and the Smooth Maximum Unit (SMU) activation function so that the network retains more detail and location information about each text instance. In addition, a Text Region Attention Module (TRAM) is added after the backbone so that the network pays more attention to text location information, and a loss function is applied to TRAM so that the network learns the features better. Experimental results show that the proposed method is competitive with state-of-the-art methods. Code is available at: https://github.com/zhangyan1005/HR-DBNet.

Funding

This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.

Author information

Authors and Affiliations

Department of Mathematics, College of Sciences, Northeastern University, 110819, Shenyang, Liaoning, China
Qingsong Tang, Zhangyan Jiang, Bolin Pan & Jinting Guo
Beijing Eyecool Technology Co., Ltd, 100089, Beijing, China
Wuming Jiang

Corresponding authors

Correspondence to Qingsong Tang, Zhangyan Jiang, Bolin Pan, Jinting Guo or Wuming Jiang.

Ethics declarations

The authors declare that they have no conflicts of interest.

Additional information

Publisher’s Note.

Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Tang, Q., Jiang, Z., Pan, B. et al. Scene Text Detection Using HRNet and Spatial Attention Mechanism. Program Comput Soft 49, 954–965 (2023). https://doi.org/10.1134/S0361768823080212

Received: 10 February 2023
Revised: 24 April 2023
Accepted: 05 May 2023
Published: 24 January 2024
Issue Date: December 2023
DOI: https://doi.org/10.1134/S0361768823080212
