Abstract
To better extract features from text instances of various shapes, this paper proposes a scene text detector based on the High-Resolution Network (HRNet) and a spatial attention mechanism. Specifically, we use HRNetv2-W18 as the backbone network to extract features from text instances with complex shapes. Because scene text instances are usually small, we optimize HRNet with deformable convolution and the Smooth Maximum Unit (SMU) activation function to keep the feature maps from becoming too small, so that the network retains more detail and location information about each text instance. In addition, a Text Region Attention Module (TRAM) is added after the backbone so that the network pays more attention to text location information, and a loss function is applied to TRAM so that the network learns these features better. The experimental results show that the proposed method is competitive with state-of-the-art methods. Code is available at: https://github.com/zhangyan1005/HR-DBNet.
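As a rough illustration of the SMU activation mentioned above, the sketch below uses the commonly stated smooth-maximum form, in which SMU approximates max(x, alpha * x), that is, a smoothed Leaky ReLU. The function names and the default values of `alpha` and `mu` are assumptions chosen for illustration; they are not taken from the paper or its repository, where such parameters may be set or learned differently.

```python
import math


def smu(x: float, alpha: float = 0.25, mu: float = 1.0) -> float:
    """Smooth approximation of max(x, alpha * x).

    Uses max(a, b) ~= ((a + b) + (a - b) * erf(mu * (a - b))) / 2
    with a = x and b = alpha * x. Larger mu gives a sharper curve
    that approaches Leaky ReLU. alpha and mu defaults are illustrative.
    """
    return 0.5 * ((1 + alpha) * x + (1 - alpha) * x * math.erf(mu * (1 - alpha) * x))


def leaky_relu(x: float, alpha: float = 0.25) -> float:
    """Reference non-smooth activation that SMU approximates."""
    return max(x, alpha * x)


if __name__ == "__main__":
    # Compare the smooth and non-smooth activations at a few points.
    for value in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(value, smu(value), leaky_relu(value))
```

With a small `mu` the curve is visibly rounded near zero; as `mu` grows, the output converges to that of Leaky ReLU with the same `alpha`.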
Funding
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
Ethics declarations
The authors declare that they have no conflicts of interest.
Additional information
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Tang, Q., Jiang, Z., Pan, B. et al. Scene Text Detection Using HRNet and Spatial Attention Mechanism. Program Comput Soft 49, 954–965 (2023). https://doi.org/10.1134/S0361768823080212
DOI: https://doi.org/10.1134/S0361768823080212