Research Article
DOI: 10.1145/3581783.3612174

Self-Supervised Cross-Language Scene Text Editing

Published: 27 October 2023

ABSTRACT

We propose and formulate the task of cross-language scene text editing: modifying the text content of a scene image into new text in another language while preserving the scene text style and background texture. The key challenges of this task lie in the difficulty of distinguishing text from background, the large distribution differences among languages, and the lack of finely labeled real-world data. To tackle these problems, we propose a novel network named Cross-LAnguage Scene Text Editing (CLASTE), which separates the foreground text from the background and further decomposes the foreground text into content and style. Our model can be trained in a self-supervised manner on unlabeled, multi-language data from real-world scenarios, where the source images serve as both input and ground truth. Experimental results on the Chinese-English cross-language dataset show that our model generates realistic text images, modifying English to Chinese and vice versa. Furthermore, our method is universal and can be extended to other languages such as Arabic, Korean, Japanese, Hindi, and Bengali.
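Since this page carries only the abstract, the following is a minimal, hypothetical PyTorch sketch of the self-supervised training idea described above: the source image is encoded into separate background, text-style, and text-content representations and then decoded back to the source image itself, which serves as its own ground truth. All module names, channel sizes, and the plain L1 reconstruction loss are illustrative assumptions, not CLASTE's actual architecture.

```python
# Hypothetical sketch only: a disentangling autoencoder trained with the
# source image as both input and ground truth, mirroring the abstract's
# self-supervised setup. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(c_in, c_out):
    """3x3 stride-2 conv + InstanceNorm + ReLU, halving spatial size."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.InstanceNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class DisentangleEncoder(nn.Module):
    """Separates an image into background features and foreground text
    features, further splitting the foreground into style and content."""

    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.bg_head = conv_block(64, 64)       # background texture
        self.style_head = conv_block(64, 64)    # text style (font, color, ...)
        self.content_head = conv_block(64, 64)  # text content (glyph shapes)

    def forward(self, img):
        h = self.shared(img)
        return self.bg_head(h), self.style_head(h), self.content_head(h)


class Decoder(nn.Module):
    """Recombines background, style, and content features into an image."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(64 * 3, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=8, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1),
            nn.Tanh(),
        )

    def forward(self, bg, style, content):
        return self.net(torch.cat([bg, style, content], dim=1))


enc, dec = DisentangleEncoder(), Decoder()
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)

# One self-supervised step: reconstruct an unlabeled scene-text crop from its
# own disentangled parts, so no paired (source, edited) data is required.
src = torch.rand(4, 3, 64, 256) * 2 - 1  # stand-in batch, values in [-1, 1]
recon = dec(*enc(src))
loss = F.l1_loss(recon, src)             # the source image doubles as target
opt.zero_grad()
loss.backward()
opt.step()
```

At edit time, the content branch would presumably be fed a rendering of the target-language text while the background and style representations are kept, which is the recombination of disentangled factors the abstract describes.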


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%

