Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition

Authors

  • Hao Liu, Tencent YouTu Lab
  • Bin Wang, Tencent YouTu Lab
  • Zhimin Bao, Tencent YouTu Lab
  • Mobai Xue, University of Science and Technology of China
  • Sheng Kang, University of Science and Technology of China
  • Deqiang Jiang, Tencent YouTu Lab
  • Yinsong Liu, Tencent YouTu Lab
  • Bo Ren, Tencent YouTu Lab

DOI:

https://doi.org/10.1609/aaai.v36i2.20062

Keywords:

Computer Vision (CV)

Abstract

We introduce Perceiving Stroke-Semantic Context (PerSec), a new approach to self-supervised representation learning tailored for the Scene Text Recognition (STR) task. Because scene text images carry both visual and semantic properties, we equip PerSec with dual context perceivers that contrast and learn latent representations from low-level stroke and high-level semantic contextual spaces simultaneously, via hierarchical contrastive learning on unlabeled text image data. Experiments in unsupervised and semi-supervised learning settings on STR benchmarks demonstrate that the proposed framework yields more robust representations for both CTC-based and attention-based decoders than other contrastive learning methods. To fully investigate the potential of our method, we also collect a dataset of 100 million unlabeled text images, named UTI-100M, covering 5 scenes and 4 languages. By leveraging unlabeled data at the hundred-million scale, PerSec shows significant performance improvement when the learned representation is fine-tuned on labeled data. Furthermore, we observe that the representation learned by PerSec generalizes well, especially when little labeled data is available.
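
The abstract does not state the loss function. As a rough illustration only, and assuming a standard InfoNCE-style contrastive objective applied once to low-level features and once to high-level features, the idea of contrasting at two levels might be sketched as follows. The function names, the `temperature` value, and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (an assumption, not the paper's loss).

    Returns -log of the softmax probability assigned to the positive
    among the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]

def hierarchical_loss(low, high, weight=1.0):
    """Sum a low-level and a high-level contrastive term.

    `low` and `high` are (anchor, positive, negatives) triples; `weight`
    is a hypothetical balancing factor for the high-level term.
    """
    return info_nce(*low) + weight * info_nce(*high)

# Toy 2-D features: the low-level pair is well aligned, the high-level pair is not.
low = ([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
high = ([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
total = hierarchical_loss(low, high)
```

In this toy setup the low-level term is close to zero and the high-level term is large, so `total` is dominated by the level whose positive is poorly aligned with its anchor.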

Published

2022-06-28

How to Cite

Liu, H., Wang, B., Bao, Z., Xue, M., Kang, S., Jiang, D., Liu, Y., & Ren, B. (2022). Perceiving Stroke-Semantic Context: Hierarchical Contrastive Learning for Robust Scene Text Recognition. Proceedings of the AAAI Conference on Artificial Intelligence, 36(2), 1702-1710. https://doi.org/10.1609/aaai.v36i2.20062

Section

AAAI Technical Track on Computer Vision II