Conditional Motion In-betweening

Motion in-betweening (MIB) is the process of generating intermediate skeletal movement between given start and target poses while preserving the naturalness of the motion, such as the periodic footsteps of walking. Although state-of-the-art MIB methods can produce plausible motions from sparse key-poses, they often lack the controllability to generate motions that satisfy the semantic contexts required in practical applications. We focus on a method that can handle pose- or semantic-conditioned MIB tasks with a unified model. We also present a motion augmentation method that improves the quality of pose-conditioned motion generation by defining a distribution over smooth trajectories. Our proposed method outperforms the existing state-of-the-art MIB method in pose prediction error while providing additional controllability.


Introduction
The demand for generating natural and expressive 3D human motion proliferates in the film and gaming industries. Despite the demand, however, generating diverse character movements in industry still dominantly relies on Motion Capture (MoCap) machines and traditional approaches [1,2] rather than learning-based approaches, due to the inherent complexity of human motion. A model that can synthesize natural human motion while providing controllability can help professional animators focus on creative and novel motions by reducing the labor involved in simple and redundant motion creation. One unique property of human motion is that it is spatio-temporally constrained by a human's feasible skeletal (kinematic) structure, which distinguishes it from static and structured data such as images. In the context of motion generation, a wide range of studies [3,4] has focused on forecasting natural motion frames (i.e., motion prediction) from a given key-pose or multiple sets of key-poses. Others [5,6,7] investigate methods to generate motion from given semantics such as textual action descriptions. In this paper, we focus on the motion in-betweening (MIB) problem, where MIB is the process of generating skeletal motions that naturally interpolate a given set of key-poses, reducing the great amount of time and manual effort required of animators.
Recent studies [8,9] have shown that learning-based methods can provide plausible solutions for MIB given sparse key-poses. However, they often lack controllability in human motion generation. This paper presents a conditional motion in-betweening (CMIB) method that can handle two types of conditioned motion generation tasks: pose-conditioned and semantic-conditioned MIB. First, pose-conditioned MIB generates a sequence of poses from the start, anchor, and target poses so that the generated motion naturally interpolates the given poses while preserving the naturalness of the motion. We also present a motion augmentation method to extend pose-controllability beyond the given motion dataset. In particular, we augment the motion trajectories by sampling root trajectories from a Gaussian random path distribution [10], which defines a probability distribution over smooth trajectories in a reproducing kernel Hilbert space. Finally, semantic-conditioned MIB enables users to choose the desired semantics at inference time. These conditioning problems are described in Figure 1.
We propose a CMIB method that integrates pose and semantic CMIB in a single Transformer encoder-based architecture. We interpret MIB as a masked motion modeling problem and introduce a randomized shuffled anchor pose, which enables our motion encoder to perform pose-conditioned MIB. We also introduce a semantic embedding token by prepending a sequence's semantic context to the motion representation. To the best of our knowledge, we are the first to propose a controllable motion generation method in the MIB framework.
Our main contributions are summarized as follows:
• We introduce a Transformer encoder-based CMIB model that can perform the following conditional motion generation tasks using a single model:
  - Pose-conditioned: generate a motion sequence that satisfies a given anchor pose while interpolating the start and target poses.
  - Semantic-conditioned: generate a motion sequence with given semantic information (e.g., walk, run, or dance).
• We propose a motion data augmentation strategy that can generate various trajectories from existing motions for pose-conditioned MIB tasks.
• Our proposed method outperforms the existing state-of-the-art MIB method with additional capabilities of conditioned motion generation. Furthermore, we also present a performance evaluation measure to validate semantic-conditioned MIB tasks.

Related Work
Motion Prediction Motion prediction refers to forecasting plausible motion from previous pose(s). The development of deep learning-based approaches has allowed significant advancement in motion synthesis. Several studies adopted Recurrent Neural Networks (RNNs) to address motion prediction problems. Fragkiadaki et al. [3] introduced a three-layer long short-term memory (LSTM-3LR) network and an encoder-recurrent-decoder (ERD) structure, which takes advantage of the autoregressive approach in a latent space formed by a nonlinear encoder and decoder. Martinez et al. [11] suggested a sequence-to-sequence (seq2seq) architecture with residual connections in the decoder. Ghosh et al. [4] integrated a dropout autoencoder (DAE) into LSTM-3LR to mitigate the accumulation of error in RNNs. Gui et al. [12] introduced a geodesic loss in place of the conventional Euclidean loss, together with adversarial training, to predict human motion. Harvey et al. [13] improved ERD by employing additional encoder blocks to inject various contexts into the RNN. Aksan et al. [14] decomposed motion prediction into joint-level predictions by modeling dependencies across joints with a structured prediction layer. Rossi et al. [15] studied motion prediction from the perspective of trajectory prediction by leveraging both an LSTM and a generative adversarial network (GAN).
However, even with improved RNN structures, long-term prediction remains an issue due to the inherent error accumulation of RNNs, which are also difficult to parallelize. Aksan et al. [16] introduced self-attention to directly attend to the previous context and capture dependencies across poses, and showed that the Transformer architecture can produce long motion sequences with reduced error accumulation. Martinez et al. [17] proposed a seq2seq Transformer encoder-decoder model with Graph Convolutional Networks (GCNs) before and after the Transformer block.
Conditioned Motion Generation Conditioned motion generation focuses on generating motion from given semantic information, such as music or action. Semantic information of human motion has been widely studied in action recognition tasks, but conditioned motion generation differs in that it must generate the desired motions for given semantics. Won et al. [5] designed a generate-and-rank approach to choreograph human motion from high-level semantics. Ahn et al. [6] proposed a method to generate motion from text input: they trained an autoencoder between language and action, and the language encoder is then separately trained in an adversarial manner with an attention mechanism. Henter et al. [18] set the stage for using normalizing flows for probabilistic and controllable motion generation. Ling et al. [19] proposed the motion variational autoencoder (motion-VAE), based on the conditional variational autoencoder (CVAE), to achieve controllable motion generation. Guo et al. [7] also studied action-conditioned motion generation, allowing an RNN-based architecture to learn actions by providing an action code with the pose vector while training a CVAE. Since MIB is also a form of motion generation, these studies form an important foundation for our research.
Motion in-betweening In this work, we consider motion in-betweening (MIB) as a motion generation task: completing natural motion from sparse pose information. This task is analogous to image in-painting [20]. However, MIB has the additional difficulty of synthesizing spatio-temporal motion data. More closely, video in-painting [21,22] is similar in that both spatial and temporal aspects must be taken into consideration to connect sparse information naturally.
Frame-based video interpolation [23] also addresses interpolation problems, connecting sparse image frames with image sequences learned from videos. Image-level interpolation has the advantage that it can handle not only motion but also spatially adjacent information such as background or interacting objects. However, utilizing human motion for 3D applications is difficult with this approach since motion is entangled with other information in image space and lacks depth information, which is essential for 3D motion representation.
Rose et al. [24] employed radial basis function (RBF) to interpolate parameterized motions. Mukai et al. [25] proposed a statistical prediction method which optimizes frame-level interpolating kernels for a given parametric space to perform MIB. Lehrmann et al. [26] have shown the possibility of Markov models in MIB tasks.
MIB can also be viewed as a boundary value problem (BVP) for multi-joints. Two-point BVP studies the solution of a differential equation constrained to the provided start and target conditions. Li et al. [27] approached motion planning problems as a trajectory optimization problem for given initial and final conditions. Xie et al. [28] combined optimal planning algorithm with a two-point BVP framework to solve optimal motion planning problem.
As deep neural network approaches were introduced, there have been great improvements in MIB tasks. Harvey et al. [8] demonstrated a studio-quality MIB method based on their motion transition network [13]. They also leveraged the least-squares generative adversarial network (LSGAN) [29] to make the generated motions more natural. However, such autoregressive approaches struggle with long-horizon generation because errors accumulate over time, and they are difficult to parallelize. To overcome this, Kaufmann et al. [9] employed a convolutional autoencoder by representing motion data as a matrix that can be interpreted as an image, and showed that a non-autoregressive method can produce comparable MIB results without degrading visual quality. In contrast to prior work, we propose a controllable motion generation method on top of the MIB framework. The most relevant work to our approach is Harvey et al. [8], which, in contrast, only performs MIB without controllability.

Motion Data Representation
We describe a human pose with joint positions and joint rotations. Joint positions are expressed on a real-world scale. There are multiple ways to represent rotations, such as Euler angles, the axis-angle representation, and quaternions. We compare the model's performance on the three rotation representations in Table 1. In our experiments, the Euler angle representation is highly unstable during training. Empirically, the quaternion format outperforms the other rotation formats in both performance and training stability, so we choose quaternion vectors as our rotation representation, as in previous work [8]. We represent a joint position vector p ∈ R^3 in Euclidean space and adopt the quaternion q ∈ R^4 as the rotational representation of each joint. Positions and rotations at a specific frame t ∈ [1, T] for J joints are denoted in vectorized form as P_t ∈ R^{3J} and R_t ∈ R^{4J}, respectively. All positional and rotational values are represented in the global coordinate system in this paper.

Conditional Motion In-betweening
A wide range of studies have been conducted on motion generation, but motion in-betweening (MIB) has received less attention despite its practical usefulness in industry. MIB can be viewed as a boundary value problem for a motion synthesis network g that satisfies the boundary conditions:

{x̂_1, …, x̂_T} = g(x_1, x_T), subject to x̂_1 = x_1 and x̂_T = x_T, (1)

where x_t denotes the pose at frame t. Recent studies [8,9] approached MIB as a problem conditioned only on both end poses, but we extend the conditioning to an additional intermediate frame and to semantic information. In this work, we focus on two conditioning aspects: pose-conditioned and semantic-conditioned MIB. First, pose-conditioned motion in-betweening connects the start and target poses while also satisfying a given anchor pose. The other constraint is the semantic information of the motion. Motion data contains semantic information such as emotions or action content. Motion stylization [30,31] has covered methods to transfer style information while preserving a motion's content features. Unlike motion stylization, our semantic-conditioned MIB must generate stylized content at once, since no content motion frames are available other than the start and target poses.

Proposed Method
We propose a conditional motion in-betweening (CMIB) method that performs conditioned motion generation for both pose and semantic context, based on a Transformer architecture. Figure 2 illustrates the overall architecture of the proposed method.

Pose Interpolation
To perform MIB, the start, anchor (optional), and target poses should be provided. Instead of initializing key-poses with arbitrary values, we initialize our network with a baseline interpolation to facilitate the learning process. For the given poses, the joint coordinates are interpolated with linear interpolation (Eq. 2), and joint rotations are interpolated by spherical linear interpolation (Eq. 3) in quaternion space. The start and target positions are denoted as p_start and p_target, and the position of the anchor pose given at the k-th frame is written as p_k. Joint rotations of the start (q_start), target (q_target), and anchor (q_k) poses are represented in the same manner.

Figure 2: Overview of the model architecture. During training, we dynamically sample masking frames and replace those frames with interpolation from the given poses. Then, the interpolated sequence is fed to the backbone Transformer encoder network with a prepended semantic embedding. This task aims at generating natural human motion from given poses while providing semantic controllability.
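As a concrete sketch of this baseline interpolation, the following NumPy functions implement LERP for joint positions and SLERP for unit quaternions. This is an illustrative implementation, not the authors' code; the shortest-arc handling and the near-parallel fallback threshold are our own choices.

```python
import numpy as np

def lerp(p0, p1, t):
    """Linear interpolation between joint positions p0 and p1, t in [0, 1]."""
    return (1.0 - t) * p0 + t * p1

def slerp(q0, q1, t, eps=1e-8):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = np.dot(q0, q1)
    # Take the shorter arc on the quaternion hypersphere.
    if dot < 0.0:
        q1, dot = -q1, -dot
    if dot > 1.0 - eps:
        # Nearly parallel quaternions: fall back to normalized LERP.
        q = lerp(q0, q1, t)
        return q / np.linalg.norm(q)
    theta = np.arccos(np.clip(dot, -1.0, 1.0))
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

In practice these would be applied frame-by-frame between the start, anchor, and target key-poses to fill the masked frames before the sequence enters the encoder.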

Model Architecture
We utilize a Transformer encoder to implement CMIB. Since our objective is to create natural motion between given key-poses, we let the Transformer's multi-head self-attention attend to all given input poses without masking. The input motion data is interpolated with LERP and SLERP before being fed into the motion encoder. The input motion at time step t can be compactly represented by concatenating the vectorized positions P_t ∈ R^{3J} and rotations R_t ∈ R^{4J}:

x_t = [P_t ; R_t] ∈ R^d, where d = J · (3 + 4).
Since a non-autoregressive Transformer is leveraged, additional temporal information of the motion needs to be incorporated. In this work, we adopt a learned positional embedding [32]. The dimension of the positional embedding is set equal to that of the pose representation x_t, and the embedding is added before feeding into the Transformer. With this, we can write the input motion matrix I:

I = [x_1 + e_1 ; … ; x_T + e_T] ∈ R^{T×d},

where e_t denotes the learned positional embedding at time step t. Our motion encoder uses multiple encoder layers consisting of multi-head attention and position-wise feed-forward networks. Each single-head attention is computed as:

H_i = softmax((I W_i^Q)(I W_i^K)^T / √d_k) (I W_i^V),

where m is the number of heads, d_k = d/m is the per-head dimension of the motion representation, and W_i^Q, W_i^K, and W_i^V are trainable parameter matrices for the i-th head H_i. This single-head computation is extended to the multi-head version by concatenating the heads, followed by an additional projection with W^O:

MultiHead(I) = [H_1 ; … ; H_m] W^O,

where W^O ∈ R^{m d_k × d}. A fully connected feed-forward network (FCN) projects the encoder's representation dimension to the motion representation dimension. The FCN block is composed of two stacked linear transformations with GeLU activation [33]. Residual connections [34] are used for both the multi-head attention and the FCNs, followed by layer normalization. The Transformer encoder outputs a matrix of the same size as I, representing the complete predicted motion sequence. Since our model learns positions and rotations in the global coordinate system, the link lengths of the predicted output are not guaranteed to match the given kinematic chain. To make the link lengths consistent, we linearly scale the predicted link lengths to match the predefined kinematic chain.
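The unmasked multi-head self-attention described above can be sketched in NumPy as follows. This is an illustrative implementation: the weight shapes are assumptions consistent with d_k = d/m, and the function name is our own.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(I, Wq, Wk, Wv, Wo):
    """Unmasked multi-head self-attention over a motion matrix I of shape (T, d).

    Wq, Wk, Wv have shape (m, d, d_k); Wo has shape (m * d_k, d).
    Every pose attends to every other pose (no causal mask), matching the
    in-betweening setting where the full context is available.
    """
    m, d, d_k = Wq.shape
    heads = []
    for i in range(m):
        Q, K, V = I @ Wq[i], I @ Wk[i], I @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T) attention weights
        heads.append(A @ V)                    # (T, d_k) per-head output
    return np.concatenate(heads, axis=-1) @ Wo  # (T, d)
```

The output has the same shape as I, so stacking such layers (with residuals, layer norm, and FCN blocks, omitted here) yields the complete predicted motion sequence.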

Training Phase

Randomized Shuffled Anchor Pose
Pose conditioning is a way of generating motion that satisfies a given anchor pose during the MIB process. The anchor pose can be located anywhere between the start and target poses and should be physically feasible in the context of MIB. This makes a clear difference from existing MIB methods, which perform MIB only between the start and target poses. Inspired by dynamic masking in RoBERTa [35], we uniformly sample the anchor key-pose from the training motion sequence at every iteration. The sampled anchor pose is provided to the Transformer encoder along with the start and target poses after pose interpolation.
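A minimal sketch of this dynamic anchor sampling, assuming one anchor per iteration drawn uniformly from the intermediate frames (the helper name and the boolean-mask representation are our own):

```python
import numpy as np

def sample_anchor_and_mask(T, rng):
    """Sample one anchor frame per training iteration (a sketch of the
    randomized shuffled anchor pose; the paper's exact sampling may differ).

    Frames 0 and T-1 (start/target) plus the sampled anchor are kept;
    all other frames are marked for replacement by LERP/SLERP interpolation.
    """
    anchor = int(rng.integers(1, T - 1))   # uniform over intermediate frames
    keep = np.zeros(T, dtype=bool)
    keep[[0, anchor, T - 1]] = True
    return anchor, keep
```

Re-sampling the anchor on every iteration exposes the encoder to anchors at all temporal positions, analogous to dynamic masking in masked language modeling.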

Motion Data Augmentation
The quality of pose-conditioned motion in-betweening is affected by the motion path distribution of the training dataset. However, training motion data are often insufficient to cover the full variety of human motions. In the image processing domain, many data augmentation methods (e.g., random cropping or warping) are widely used; however, most existing augmentation methods from the vision domain are not directly applicable to the motion domain due to motion's temporal characteristics. Harvey et al. [13] used a simple augmentation strategy that mirrors the motion data along the character's forward direction. However, this method is limited in that it cannot create qualitatively different motions. Here, we introduce a motion augmentation method that first defines a probability distribution over smooth trajectories interpolating the start and target points. This method generates multiple motion trajectories from existing motions utilizing Gaussian Random Paths (GRP) [10].
Let (x_t, y_t) ∈ R^2 be the two-dimensional root joint position projected onto the ground plane, extracted from the given three-dimensional position vector p_t ∈ R^3, and let (x_P, y_P) = {(x_t, y_t) | t = 1, …, T} be a two-dimensional trajectory of length T. Let (x_a, y_a) = {(x_1, y_1), (x_T, y_T)} be the pair of start and target anchoring root positions. Our proposed augmentation method uses x_P and the anchoring root vectors (x_a, y_a) to sample a new root path ỹ_P. Note that we assume x_P is monotone. Given a squared exponential kernel function k(x, x') = exp(−(x − x')^2 / 2) and x_P, the Gaussian random path distribution over ỹ_P is specified by a mean path μ_P and a covariance matrix Σ_P as follows:

μ_P = k(x_P, x_a) K_a^{-1} y_a,
Σ_P = K_P − k(x_P, x_a) K_a^{-1} k(x_P, x_a)^T,
ỹ_P ∼ N(μ_P, Σ_P), (11)

where k(x_P, x_a) ∈ R^{T×2} is the kernel matrix between the path indices x_P and the anchored start and target. The kernel matrix of the start and target indices is K_a = K(x_a, x_a) ∈ R^{2×2}, and K_P = K(x_P, x_P) ∈ R^{T×T} is the kernel matrix of the given path x_P. Using the Cholesky decomposition LL^T = Σ_P, Eq. 11 is equivalent to ỹ_P = μ_P + Lu, where u ∼ N(0, I).
To augment a motion trajectory from a ground-truth motion, we first rotate the original trajectory so that the start and target positions are aligned on the X-axis, which facilitates the GRP process. In the rotated system, the proposed augmentation process samples a new trajectory with the same x values but new ỹ values drawn from the GRP. We fix the progress along the X and Z axes, since the motion should maintain its start and target positions and changing the Z-axis empirically yields unnatural motion sequences. Therefore, motion augmentation is applied only to 'walking' and 'running' motions. Through the augmentation process, the rotated path (x_P, y_P, z_P) can be augmented into multiple sampled paths (x_P, ỹ_P, z_P). On top of the positional augmentation, we further compute the orientation difference between the original and augmented trajectories in the XY plane and rotate the augmented sequence accordingly so that the generated motion faces the correct direction, consistent with the original motion. Figure 3 illustrates the proposed motion augmentation process.
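The GRP-based trajectory sampling above can be sketched as a Gaussian-process posterior conditioned on the start and target root positions. This is an illustrative implementation: the jitter, random seed handling, and function name are our own assumptions, not the paper's exact settings.

```python
import numpy as np

def sample_grp_path(x, y_start, y_end, n_samples=1, jitter=1e-6, seed=0):
    """Sample smooth y-trajectories anchored at the start/target root positions,
    following the Gaussian random path idea (a sketch, not the authors' code).

    x: (T,) monotone path indices (e.g., root x-coordinates after rotation).
    Returns: (n_samples, T) sampled y-paths passing through y_start and y_end.
    """
    rng = np.random.default_rng(seed)
    # Squared exponential kernel k(a, b) = exp(-(a - b)^2 / 2).
    k = lambda a, b: np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)
    xa = np.array([x[0], x[-1]])            # anchoring indices (start, target)
    ya = np.array([y_start, y_end])         # anchored y values
    K_a = k(xa, xa) + jitter * np.eye(2)
    k_pa = k(x, xa)                         # (T, 2) cross-kernel
    K_p = k(x, x)
    mu = k_pa @ np.linalg.solve(K_a, ya)                # posterior mean path
    cov = K_p - k_pa @ np.linalg.solve(K_a, k_pa.T)     # posterior covariance
    # Sample via Cholesky: y = mu + L u, with L L^T = cov and u ~ N(0, I).
    L = np.linalg.cholesky(cov + jitter * np.eye(len(x)))
    u = rng.standard_normal((len(x), n_samples))
    return (mu[:, None] + L @ u).T
```

Each sampled path keeps the original x progression but replaces the y values, so the augmented trajectory still starts and ends at the original root positions.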

Semantic Embedding
Motion representation and semantic information need to be learned together to achieve semantic-conditioned motion in-betweening. To this end, we introduce a semantic embedding, which maps a motion's high-level semantics through a look-up table. We prepend the semantic embedding as the first token of the motion input matrix so that the Transformer encoder can attend to this information:

Ĩ = [s ; x_1 ; … ; x_T], (14)

where s indicates the semantic embedding. We use this extended input motion representation as the input to the Transformer encoder.
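The token prepending can be sketched as a simple lookup-and-concatenate step (illustrative only; in the actual model the table would be a trainable embedding layer, and the helper name is our own):

```python
import numpy as np

def prepend_semantic_token(I, action_id, embedding_table):
    """Prepend a semantic embedding as the first token of the motion matrix.

    I: (T, d) interpolated motion matrix.
    embedding_table: (num_actions, d) look-up table of semantic embeddings.
    Returns: (T + 1, d) extended input fed to the Transformer encoder.
    """
    s = embedding_table[action_id]          # (d,) embedding for e.g. 'walk'
    return np.vstack([s[None, :], I])
```

Because self-attention is unmasked, every pose token can attend to this first token, which is how the semantic condition propagates to the whole generated sequence.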

Loss
There are three learning losses in our model: semantic loss, position loss, and rotation loss. First, the semantic loss is a reconstruction loss. The first predicted token is a special token representing the motion's semantics, and we make the model reconstruct it for every sequence:

L_sem = (1/N) Σ_{n=1}^{N} ||s^{(n)} − ŝ^{(n)}||_1, (15)

where N is the total number of data and s is the semantic embedding. The position and rotation losses are the average L1 distances between the predicted sequences and their ground-truth values:

L_pos = (1/(N T)) Σ_{n=1}^{N} Σ_{t=1}^{T} ||P_t^{(n)} − P̂_t^{(n)}||_1,
L_rot = (1/(N T)) Σ_{n=1}^{N} Σ_{t=1}^{T} ||R_t^{(n)} − R̂_t^{(n)}||_1.

Figure 3: The original trajectory is rotated to align with the +X-axis (the character's front direction), then GRP is employed to sample a trajectory. The generated trajectory is rotated back to match the original trajectory's start and target positions. The root joint's trajectory is visualized.
Our total loss is defined as:

L = w_1 L_sem + w_2 L_pos + w_3 L_rot.

All losses are scaled to be approximately equal in magnitude before training begins.
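A sketch of the combined loss, assuming L1 reconstruction on the semantic token and average L1 distances for positions and rotations (the function signature is our own; the default weights follow the values reported in the hyperparameter section):

```python
import numpy as np

def cmib_loss(s_true, s_pred, P_true, P_pred, R_true, R_pred,
              w1=1.5, w2=0.05, w3=2.0):
    """Weighted sum of semantic, position, and rotation L1 losses (a sketch;
    not the authors' implementation).
    """
    l_sem = np.abs(s_true - s_pred).mean()   # semantic-token reconstruction
    l_pos = np.abs(P_true - P_pred).mean()   # average L1 over joint positions
    l_rot = np.abs(R_true - R_pred).mean()   # average L1 over quaternions
    return w1 * l_sem + w2 * l_pos + w3 * l_rot
```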

Hyperparameters
The model is trained with the AdamW [36] optimizer with a learning rate of 0.0001, β_1 = 0.9, β_2 = 0.99, and a weight decay of 0.01. We set the Transformer encoder to have 7 heads, 8 encoder layers, a feedforward dimension of 2,048, and a dropout probability of 0.05 during training. The weights w_1, w_2, and w_3 of the total loss are set to 1.5, 0.05, and 2.0, respectively. In all our experiments, the minibatch size is set to 32.

Experiments
This section demonstrates the experimental settings and presents qualitative and quantitative results. Please refer to our project web page 1 for more videos from the experiments.
LAFAN1 [8]: LAFAN1 is a high-quality public motion capture dataset. It contains 15 actions, such as walking, dancing, fighting, and jumping, spans approximately 4.6 hours, and comprises 496,672 frames performed by five actors. It was used in Harvey et al. [8], which first proposed a robust MIB method. We evaluate the performance of our proposed method on this dataset for both unconstrained MIB and CMIB tasks.
HumanEva [37]: This dataset is obtained from a 3D whole-body motion capture system. It is recorded from four subjects performing six common actions, such as walking, jogging, and gesturing. Three RGB cameras and four grayscale cameras are used for the recording.
HUMAN4D [38]: This is a recently collected large, multimodal public motion dataset with various human actions. It is captured with 24 MoCap cameras and four depth sensors. The full dataset is composed of 50,306 frames with 19 activity labels.
MPI-HDM05 [39]: MPI-HDM05 is a research-purpose dataset especially for motion analysis, synthesis, and classification. It contains more than three hours of recorded motion frames.

Motion Data Pre-processing
We adopt the same pre-processing procedures as [8] for a fair comparison. In particular, we split each motion into fixed-width windows, where all windowed motion sub-sequences are rotated to face the +X axis at the 10th frame. For semantic conditioning, we use the labeled actions as the semantic information of the motion. Positions and rotations are represented in the global coordinate frame. We also build a deployment pipeline for the Unity Engine 2 to present qualitative results.

Motion In-betweening Results
We use the L2P and L2Q metrics from [8] to evaluate our model's MIB performance. L2P and L2Q are the average L2 norms of the global positions and quaternions, respectively:

L2P = (1/N) Σ_{n=1}^{N} ||P^{(n)} − P̂^{(n)}||_2,
L2Q = (1/N) Σ_{n=1}^{N} ||R^{(n)} − R̂^{(n)}||_2,

where N indicates the number of evaluated sequences. Note that global positions are normalized before calculating L2P.
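A sketch of the two metrics, assuming per-frame L2 norms averaged over the evaluation set and ground-truth-based position normalization (the exact normalization scheme in [8] may differ; the function name is our own):

```python
import numpy as np

def l2p_l2q(P_true, P_pred, Q_true, Q_pred):
    """Average L2 norms between ground-truth and predicted global positions
    (normalized) and quaternions, as a sketch of the L2P/L2Q metrics.

    Arrays have shape (N, T, D): N sequences, T frames, D flattened values.
    """
    # Normalize positions with ground-truth statistics before computing L2P.
    mu = P_true.mean(axis=(0, 1), keepdims=True)
    sigma = P_true.std(axis=(0, 1), keepdims=True) + 1e-8
    Pn_true = (P_true - mu) / sigma
    Pn_pred = (P_pred - mu) / sigma
    l2p = np.linalg.norm(Pn_true - Pn_pred, axis=-1).mean()
    l2q = np.linalg.norm(Q_true - Q_pred, axis=-1).mean()
    return l2p, l2q
```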
Here, we compare our model with state-of-the-art robust motion in-betweening models on MIB tasks. Since the implementations of RMIB [8] and SSMCT [40] are not publicly available, the quantitative results of both models that are not reported in the original papers are provided by our own reproduced implementations. We note that our reproduced implementations show results similar to those reported in the original papers. Our model is compared with zero-velocity, interpolation, a bidirectional LSTM model, ERD-QV [8], CAE [9], and SSMCT [40]. The zero-velocity baseline fills the missing frames with the latest frame. The interpolation baseline is the same as our baseline pose interpolation, which uses LERP for joint positions and SLERP for rotational (quaternion) values. Table 3 shows the MIB evaluation results on the LAFAN1 test dataset, and Table 4 presents L2P and L2Q score comparisons on other public motion datasets. Our proposed method achieves comparable or better performance in most MIB tasks, even with the additional controllability. Synthesized motions are visualized in Figure 5. Figure 4 presents the computation time for different batch sizes and in-betweening horizons, which supports that the non-autoregressive model has a great advantage in inference speed over autoregressive models.

Pose-conditioned Motion In-betweening Results
Pose-conditioned MIB is the process of generating natural skeletal motion under start, target, and additional anchor pose constraints, which can be thought of as the unconstrained MIB task with an additional anchor pose constraint. Since a non-autoregressive Transformer architecture is utilized, we could leverage multiple anchor poses during training. In this paper, however, we simply assume that a single anchor pose is given for the pose-conditioned MIB task, and during training, the anchor pose is uniformly sampled within the horizon of a motion sequence (i.e., t ∼ U(1, T), where T is the sequence length).
Note that pose-conditioned MIB results are subject to the diversity of trajectories within the dataset: we observe that pose-conditioned generation often fails when the anchor pose is located outside the plausible region. However, this limitation can be alleviated via the proposed motion data augmentation method in Section 4.3.2. To validate the effectiveness of the proposed motion data augmentation on pose-conditioned MIB tasks, we select the poses at t = 20, 40, and 60 of the aiming, dance, fight, run, and walk motions of LAFAN1 (670 motions in total). Then, we add random perturbations to the (x, y) position of the anchor pose and check how well the generated sequences pass through the given anchor pose in the XY plane. We further change the orientation of the start and target poses to match the given anchor pose. Table 5 shows the comparative results of the pose-conditioned MIB task with and without the proposed motion augmentation method in terms of the L2 norm between the given and generated root positions at t = 20, 40, and 60. Our proposed augmentation method greatly improves pose-conditioned MIB performance for all anchor positions. Figure 6 depicts a pose-conditioned MIB result: when given a jumping anchor pose (green), the entire generated motion changes accordingly to produce a jump.

Semantic-conditioned Motion In-betweening Results
In this experiment, we set the conditioning semantics to the 15 action labels provided by the LAFAN1 dataset. We evaluate our model based on the reconstruction error of the semantic token values. For quantitative evaluation, we divide the test dataset by action label and make the model conduct MIB for each available action label. Pose conditioning is disabled for these experiments (i.e., no anchor pose is given). Here, we evaluate the model's performance in terms of L2P, under the assumption that the corresponding ground-truth label should yield the smallest error. Table 6 describes the results, and Figure 7 visualizes the results of semantic-conditioned MIB.
From the visualized results, we find that the infilling horizon affects the visual quality of semantic-conditioned MIB. Long-horizon MIB tends to produce better results than the short-horizon task in semantic-conditioned generation. For example, in the case of jumping motions, it is difficult to generate a complete jumping motion within a 30-frame horizon, while long-horizon infilling is more likely to produce clear jumping motions. This suggests that there is a minimum number of frames required to reflect the semantics of a motion.

Limitation and Discussion
In this work, we confirm that the CMIB model can handle most in-betweening settings, but there are some limitations. First, the generated motion does not take body volume into consideration, which can produce motions with self-penetration of the human body, as visualized in Figure 8. This can be mitigated by adding a post-processing pipeline or by integrating a loss term that penalizes penetrating configurations as a regularizer. The other failure case is conditioning failure. This case is similar to mode collapse in GANs: the model synthesizes nearly identical motions regardless of the conditioning semantics. We suspect that the problem comes from two objectives that are difficult to satisfy simultaneously. For instance, if the given start and target poses are too far apart, the motion in-betweening objective forces the model to generate a running motion regardless of the given semantics, such as 'fight' or 'dance'. This is depicted in Figure 9. Empirically, conditioned generation performs best when the given start and target poses are feasible for the given condition.

Conclusion
In this study, we have focused on the problem of adding controllability to motion in-betweening (MIB) tasks. In particular, we proposed a conditional motion in-betweening (CMIB) method using a non-autoregressive Transformer encoder architecture, where pose-conditioned and semantic-conditioned MIB are achieved via a randomized shuffled anchor pose and a semantic embedding token, respectively. An effective motion data augmentation method is also presented to improve the performance of pose-conditioned MIB. Our proposed method outperforms the existing state-of-the-art MIB method with respect to pose prediction accuracy and inference time while providing additional controllability.
While our approach demonstrates controllability in MIB, there are still promising areas where more research is required. As our model relies on a non-autoregressive structure, the horizon it predicts is limited to a predefined maximum length, and our evaluation metrics are not suited to expressing the diversity of human motions. It would be interesting to combine our method with explicit density estimation methods to better capture the distribution of possible human motions. Furthermore, we will continue to extend controllable motion generation with other modalities, such as natural language commands.