G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Authors

  • Pan Xie Beihang University
  • Qipeng Zhang Beihang University
  • Peng Taiying Beihang University
  • Hao Tang Carnegie Mellon University
  • Yao Du Beihang University
  • Zexian Li Beihang University

DOI:

https://doi.org/10.1609/aaai.v38i6.28441

Keywords:

CV: Computational Photography, Image & Video Synthesis

Abstract

The Sign Language Production (SLP) project aims to automatically translate spoken languages into sign sequences. Our approach focuses on transforming sign gloss sequences into their corresponding sign pose sequences (G2P). In this paper, we present a novel solution for this task that converts the continuous pose-space generation problem into a discrete sequence generation problem. We introduce the Pose-VQVAE framework, which combines Variational Autoencoders (VAEs) with vector quantization to produce a discrete latent representation of continuous pose sequences. We also propose the G2P-DDM model, a discrete denoising diffusion architecture for variable-length discrete sequence data, to model the latent prior. To further improve the quality of pose sequence generation in the discrete space, we present the CodeUnet model, which leverages spatial-temporal information. Lastly, we develop a heuristic sequential clustering method to predict the variable lengths of the pose sequences that correspond to gloss sequences. Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm.
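
As a rough illustration of the vector quantization idea mentioned in the abstract, the sketch below maps each continuous frame vector to the index of its nearest codebook entry, which turns a continuous sequence into a discrete token sequence. It is a generic nearest-neighbour lookup, not the paper's Pose-VQVAE: the function names, the two-dimensional toy "frames", and the three-entry codebook are all hypothetical, and a real model would learn the codebook jointly with an encoder and decoder.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace each frame with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(indices: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look the discrete tokens back up to get approximate frames."""
    return [list(codebook[k]) for k in indices]


# Hypothetical toy data (assumption): 2-D vectors stand in for pose features.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

tokens = quantize(frames, codebook)    # discrete sequence of codebook indices
recon = dequantize(tokens, codebook)   # coarse reconstruction of the frames
```

Once a sequence is expressed as codebook indices like `tokens`, a discrete sequence model can be trained over those indices instead of over raw continuous values.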

Published

2024-03-24

How to Cite

Xie, P., Zhang, Q., Taiying, P., Tang, H., Du, Y., & Li, Z. (2024). G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model. Proceedings of the AAAI Conference on Artificial Intelligence, 38(6), 6234-6242. https://doi.org/10.1609/aaai.v38i6.28441

Section

AAAI Technical Track on Computer Vision V