A Multimodal Text Block Segmentation Framework for Photo Translation

  • Conference paper
  • Published in: Image and Graphics (ICIG 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14357)


Abstract

With the rapid development of OCR (Optical Character Recognition) and machine translation, photo translation technology has brought great convenience to people's daily life and study. However, when the content of an image is translated line by line, adjacent, semantically related text lines lose their shared context, which seriously degrades translation quality and makes the output difficult to understand. To tackle this problem, we propose a novel multimodal encoder-decoder model for text block segmentation. Specifically, we construct a convolutional encoder that extracts a multimodal representation for each text line, combining visual, semantic, and positional features. In the decoding stage, an LSTM (Long Short-Term Memory) module, inspired by the pointer network, outputs the predicted segmentation sequence. Experimental results show that our model outperforms the other baselines by a large margin.
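The pointer-network-style decoding the abstract describes can be sketched roughly as follows. This is an illustrative NumPy toy under stated assumptions, not the authors' implementation: `encode_lines`, `pointer_decode`, and the weight matrices are hypothetical names, simple concatenation stands in for the paper's convolutional encoder, and a greedy attention loop stands in for the LSTM decoder.

```python
import numpy as np

def encode_lines(visual, semantic, position):
    """Fuse per-line visual, semantic, and positional features into one
    multimodal vector per text line (concatenation stands in for the
    paper's convolutional encoder)."""
    return np.concatenate([visual, semantic, position], axis=-1)

def pointer_decode(enc, W_q, W_k, v, steps):
    """Greedy pointer-style decoding: at each step, attend over the
    encoded text lines and emit the index of the highest-scoring unused
    line, yielding a segmentation/ordering sequence over the lines."""
    n = enc.shape[0]
    query = enc.mean(axis=0)               # simple initial decoder state
    chosen, used = [], np.zeros(n, dtype=bool)
    for _ in range(steps):
        # additive attention scores over all n lines, shape (n,)
        scores = v @ np.tanh(W_q @ query[:, None] + W_k @ enc.T)
        scores[used] = -np.inf             # never point at a used line
        idx = int(np.argmax(scores))
        chosen.append(idx)
        used[idx] = True
        query = enc[idx]                   # feed the picked line back in
    return chosen

# Toy usage: 5 text lines with 4-d visual, 4-d semantic, 2-d positional
# features, so each fused line vector is 10-d.
rng = np.random.default_rng(0)
enc = encode_lines(rng.normal(size=(5, 4)),
                   rng.normal(size=(5, 4)),
                   rng.normal(size=(5, 2)))
W_q = rng.normal(size=(8, 10))
W_k = rng.normal(size=(8, 10))
v = rng.normal(size=8)
order = pointer_decode(enc, W_q, W_k, v, steps=5)
```

Because each step points back at one of the encoder's own inputs, the output vocabulary is the set of text lines themselves, which is the key property of pointer networks that makes them fit variable-length segmentation sequences.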



Author information

Corresponding author: Jiajia Wu.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wu, J., et al. (2023). A Multimodal Text Block Segmentation Framework for Photo Translation. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14357. Springer, Cham. https://doi.org/10.1007/978-3-031-46311-2_10

  • DOI: https://doi.org/10.1007/978-3-031-46311-2_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46310-5

  • Online ISBN: 978-3-031-46311-2

  • eBook Packages: Computer Science (R0)
