Research Article
DOI: 10.1145/3581783.3612174

Self-Supervised Cross-Language Scene Text Editing

Published: 27 October 2023

ABSTRACT

We propose and formulate the task of cross-language scene text editing: modifying the text content of a scene image into new text in another language while preserving the scene text style and background texture. The key challenges of this task lie in the difficulty of distinguishing text from background, the large distribution differences among languages, and the lack of finely labeled real-world data. To tackle these problems, we propose a novel network named Cross-LAnguage Scene Text Editing (CLASTE), which separates the foreground text from the background and further decomposes the foreground text into content and style. Our model can be trained in a self-supervised manner on unlabeled, multi-language data from real-world scenarios, where the source images serve as both input and ground truth. Experimental results on the Chinese-English cross-language dataset show that our model generates realistic text images, modifying English to Chinese and vice versa. Furthermore, our method is universal and can be extended to other languages such as Arabic, Korean, Japanese, Hindi, and Bengali.
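Since this page carries only the abstract, the following is a minimal, hypothetical PyTorch sketch of the self-supervised training idea described above: the source image is encoded into separate background, text-style, and text-content representations and then decoded back to the source image itself, which serves as its own ground truth. All module names, channel sizes, and the plain L1 reconstruction loss are illustrative assumptions, not CLASTE's actual architecture.

```python
# Hypothetical sketch only: a disentangling autoencoder trained with the
# source image as both input and ground truth, mirroring the abstract's
# self-supervised setup. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(c_in, c_out):
    """3x3 stride-2 conv + InstanceNorm + ReLU, halving spatial size."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class DisentangleEncoder(nn.Module):
    """Separates an image into background features and foreground text
    features, further splitting the foreground into style and content."""

    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.bg_head = conv_block(64, 64)       # background texture
        self.style_head = conv_block(64, 64)    # text style (font, color, ...)
        self.content_head = conv_block(64, 64)  # text content (glyph shapes)

    def forward(self, img):
        h = self.shared(img)
        return self.bg_head(h), self.style_head(h), self.content_head(h)


class Decoder(nn.Module):
    """Recombines background, style, and content features into an image."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(64 * 3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1),
            nn.Tanh(),
        )

    def forward(self, bg, style, content):
        return self.net(torch.cat([bg, style, content], dim=1))


enc, dec = DisentangleEncoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)

# One self-supervised step: reconstruct an unlabeled scene-text crop from its
# own disentangled parts, so no paired (source, edited) data is required.
src = torch.rand(4, 3, 64, 256) * 2 - 1  # stand-in batch, values in [-1, 1]
recon = dec(*enc(src))
loss = F.l1_loss(recon, src)             # the source image doubles as target
opt.zero_grad()
loss.backward()
opt.step()
```

At edit time, the content branch would presumably be fed a rendering of the target-language text while the background and style representations are kept, which is the recombination of disentangled factors the abstract describes.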


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

