ASMNet: Action and Style-Conditioned Motion Generative Network for 3D Human Motion Generation

Extensive research has explored human motion generation, but generated sequences are strongly shaped by motion style. For instance, walking with joy and walking with sorrow produce distinctly different character motion. Because capturing stylized motion is difficult, the data available for style research are also limited. To address these problems, we propose ASMNet, an action and style-conditioned motion generative network. This network ensures that the generated human motion sequences not only comply with the provided action label but also exhibit distinctive stylistic features. To extract motion features from human motion sequences, we design a spatial temporal extractor. Moreover, we use the adaptive instance normalization layer to inject style into the target motion. Our results are competitive with state-of-the-art approaches and show clear advantages in both quantitative and qualitative evaluations. The code is available at https://github.com/ZongYingLi/ASMNet.git.


Introduction
Current human motion generation methods aim for highly realistic and natural motions but still fall short of the requirements for direct application in the video game and film industries. This is because human motion is diverse, encompassing not only simple actions but also stylistic features that convey a character's personality, emotions, and age. When applying existing research on human motion generation, it is therefore important that the generated motion satisfies criteria for both motion content and character motion style. However, datasets containing diverse motion styles are scarce. Modifying captured motions to match a particular style while preserving the motion content allows motion data to be recycled. The generated motion can greatly enrich motion datasets and be used in downstream tasks, e.g., human behavior recognition [1-7] and robot biomimicry research [8-10]. This approach enables the reuse of motion data while meeting character motion style requirements and reducing the cost of commercial applications.
Existing works such as MotionCLIP [11] and MotionDiffuse [12] not only focus on motion generation but also recognize the impact of motion styles on the generated results. However, they merely acknowledge this impact without emphasizing style features, and simply expect the model to generate stylized motion clips from textual descriptions of motion styles (Fig. 1A). Motion style is abstract and cannot be accurately described semantically. Consequently, these approaches only allow coarse generation and fail to exhibit distinct style traits. In addition, the generated motion appears lifeless in the visual results; it differs from how people with the same style actually move in real life and lacks distinctive characteristics.
To address the above problems, we develop a unified model that generates human motion sequences which not only conform to the given action categories but also take real human motion sequences as style inputs (Fig. 1B). We extract style features from real human motion sequences. Motivated by the excellent performance of the spatial temporal Transformer in tasks such as human motion prediction [13], human motion estimation [14], and human motion recognition [15,16], we also apply the spatial temporal Transformer to action-label-conditioned human motion generation. Our approach not only extracts motion features for each frame in the sequence but also emphasizes feature extraction between different joints within the same frame. We propose a Transformer-based spatial temporal extractor to extract motion features. By training with motion clips corresponding to the action category and with real motion clips as style sources, the model learns to generate, at inference time, motion sequences that are both natural and exhibit distinct style characteristics. For instance, animation designers in the video game and film industry can take an existing segment of character animation with a distinct style and generate character animation performing different actions in that style.
Our contributions are threefold: (a) We address the problem of three-dimensional (3D) motion generation conditioned on action labels while maintaining explicit motion styles throughout the process. To the best of our knowledge, our approach is the first to address this problem. (b) We introduce ASMNet, a novel network capable of generating human motion with distinct styles. (c) In our experiments, we perform a thorough ablation study of the model components and outperform state-of-the-art results on the HumanAct12 dataset [17] and the Xia dataset [18].

Human motion generation
Motion generation can be divided into two main categories: unconstrained motion generation and conditioned motion generation. Unconstrained motion generation models the entire space of possible motions [19-21]. These methods sample from a distribution, allowing the generation of diverse motions, but they lack the ability to control the generation process. Conditioned motion generation produces motion from given control conditions, which can take various forms such as music [22,23], audio [24], speech [25,26], motion clips [27,28], and text information [17,29]. Research on motion generation conditioned on both motion sequences and text descriptions is particularly important. Cui et al. [27,28] use a residual graph convolutional network to capture spatial temporal correlations and rely on history information to generate a deterministic single future motion. Yuan and Kitani [30] use learnable mapping functions to map samples from a single random variable (generated from a given historical information sequence) to a set of correlated latent codes, which are further decoded into a set of correlated future motion sequences. Zhang et al. [31] use a novel variational autoencoder (VAE) for human motion prediction. Aliakbarian et al. [32,33] use a conditional variational autoencoder (CVAE) to enforce diversity and contextual consistency with historical information for the final prediction. Generative adversarial networks (GANs) [34-36] and normalizing flows [37,38] have also been used to build generative models of human motion. Guo et al. [17] and Petrovich et al. [29] propose CVAE-based frameworks that not only perform coupled encoding of labels and motion sequences but also treat action labels as conditions to modulate the latent space, resulting in natural and diverse generated motions. We also use the CVAE framework to build ASMNet, with action labels as conditions, and propose a Transformer-based spatial temporal extractor to construct the motion encoder and motion decoder. In contrast to [17], we do not need multi-view cameras to process monocular trajectory estimates. Additionally, unlike [29], we extract motion features not only along the temporal dimension but also along the spatial dimension. This strengthens both the temporal correlations between frames in the motion sequence and the spatial connections between joints within each frame.

Motion style transfer
Previous work relied on handcrafted features [39-41]. Deep learning-based methods extract style features directly from data. Ma et al. [42] model style variations of individual body parts with latent parameters, controlled by user-defined parameters through a Bayesian network. Xia et al. [18] propose a method that constructs mixtures of autoregressive models online to represent style variations in the current pose and applies linear transformations to control style. However, these approaches have limitations such as instability and long computation time. Many current methods employ neural networks to extract style features. Holden et al. [43,44] propose a style transfer framework composed of a pretrained motion manifold to supervise content and Gram matrices to represent the style of the motion. Mason et al. [45] introduce a residual block for modeling style. Du et al. [46] present a CVAE with Gram matrices to construct the style-conditioned distribution. These approaches require considerable computation time to extract style features through optimization and struggle to capture complex or subtle motion features, making style transfer between motions with significantly different contents ineffective. Aberman et al. [47] apply a GAN-based architecture with adaptive instance normalization (AdaIN), which can transfer style from videos to 3D animations. Following [47], we opt to use real motion clips as input to extract style features. Furthermore, we employ the AdaIN layer to inject style into human motion.
MotionCLIP [11] employs a text loss to align the latent space of human motion features with the CLIP text label space and an image loss to align the motion feature latent space with the CLIP image latent space, thereby unifying the CLIP latent space and the motion latent space. Consequently, it can generate human motion sequences from text labels. MotionCLIP compared the style effectiveness of its generated motion with [47]: of the eight motion styles evaluated, three achieved higher audience preference scores. This evaluation method, relying on a user study, is highly subjective and does not provide robust evidence. Furthermore, the animation results from MotionCLIP indicate that human motion across different styles tends to converge toward the same outcome, lacking clear differentiation among motion styles.
MotionDiffuse [12] incorporates the denoising diffusion probabilistic model (DDPM) into motion generation to establish a mapping between text and motion. MotionDiffuse provides examples of character motion in different styles; however, the transitions between styles in the animation results are not distinguishable. Tevet et al. [11] and Zhang et al. [12] provide textual descriptions of the desired actions and styles, such as "walk angry," whereas Aberman et al. [47] require two motion sequences that provide the content and the style. Accurately describing human motion style in text is challenging due to its abstract nature. To improve the vividness and realism of the generated motion, we use real motion as the style input, aligning it with real-world scenarios.

ASMNet
The whole process is divided into two branches: one for generating motion sequences and the other for injecting style into the motion sequences. Figure 2 illustrates the overall architecture of the proposed method. We employ a CVAE model to accomplish motion generation. The input to the motion encoder is a concatenation of the human motion sequence P_1:T and action-specific learnable distribution tokens. The spatial extractor in the motion encoder extracts local information between skeletal joints, and the temporal extractor captures global information between frames in the motion sequence. The output of the motion encoder associated with these tokens provides the Gaussian distribution parameters, from which a latent vector z_M is sampled. Next, z_M and a learnable bias token b_a^token, determined by the action label a, are fed into the motion decoder. This process restores the motion information embedded in the latent vector z_M into a human motion sequence P̂_1:T of length T. We employ the learnable distribution tokens for two reasons: first, to fit the sequence-level latent space of human motion from motion sequences, enabling the generation of diverse motions given action labels, and second, to help capture spatial temporal motion features more effectively. To achieve style injection, the style extractor extracts latent style features z_S from the source style input S_1:T. These latent style features are then injected into the target sequence P̂_1:T by the style injector, which employs AdaIN layers. The generated sequence Ŝ_1:T exhibits a distinct style while adhering to the action category a. In the testing phase, providing only the desired action label a, duration F, and source style input S_1:T, we obtain the target styled action sequence Ŝ_1:T. Next, we provide details on the spatial temporal extractor ("Spatial temporal extractor" section) in motion generation and the process of motion style injection ("Motion style extraction and injection" section). Finally, we discuss our training process and the associated losses ("Training and loss" section).
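To make the data flow concrete, the following is a minimal PyTorch sketch of the two-branch forward pass described above. The module names (the encoder/decoder passed in as `enc`/`dec`, the style extractor and injector), the log-variance parameterization, and the reparameterization step are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ASMNetSketch(nn.Module):
    """Two-branch pipeline: action-conditioned motion generation + style injection.
    Module internals are placeholders; only the data flow mirrors the description above."""
    def __init__(self, enc, dec, style_extractor, style_injector, num_actions, latent_dim=48):
        super().__init__()
        self.enc, self.dec = enc, dec                      # motion encoder / decoder
        self.ex_s, self.ijec_s = style_extractor, style_injector
        # learnable per-action bias b_a^token that shifts z_M toward the action's latent region
        self.action_bias = nn.Parameter(torch.randn(num_actions, latent_dim) * 0.02)

    def forward(self, motion, action, duration, style_motion):
        # motion:       (B, T, J, 3) ground-truth sequence P_1:T
        # style_motion: (B, T, J, 3) source style sequence S_1:T
        mu, logvar = self.enc(motion)                              # Gaussian parameters from the tokens
        z_m = mu + torch.randn_like(mu) * (0.5 * logvar).exp()     # reparameterized sample z_M
        z_m = z_m + self.action_bias[action]                       # inject action category a
        gen_motion = self.dec(z_m, duration)                       # generated P̂_1:T
        z_s = self.ex_s(style_motion)                              # latent style code z_S
        styled = self.ijec_s(gen_motion, z_s)                      # AdaIN-based injection -> Ŝ_1:T
        return gen_motion, styled, mu, logvar
```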

Spatial temporal extractor
To generate human motion, we develop a motion encoder Enc_M and a motion decoder Dec_M, as shown in Fig. 3. We use the spatial extractor to extract the feature embedding from a single frame. First, we map the motion sequence P_1:T ∈ ℝ^(T×J×3), represented by 3D joint position coordinates, to a high dimension d with a trainable linear projection. Two learnable distribution tokens, μ_token ∈ ℝ^d and Σ_token ∈ ℝ^d, are provided as input to the spatial extractor during training. Within the spatial extractor, we treat each frame as an input and compute the correlations between the 3D position coordinates of the J individual joints within that frame. We consider each token as a joint and concatenate the tokens with the motion embedding, introducing a sense of ordering by sinusoidal positional encoding. Next, the positionally encoded tensor x_seq is fed into Transformer-based spatial blocks, and the output tokens are passed through a linear layer that projects them up to the temporal embedding dimension (J × d). The outputs of the spatial extractor are the tokens μ_spa and Σ_spa and the motion vector x_spa. In the temporal extractor, these are concatenated with the other encoded frames along the temporal dimension, followed by positional encoding. They then pass through Transformer-based temporal blocks, yielding the distribution parameters μ and Σ. In summary, our motion encoder may be formally expressed as

μ, Σ = Enc_M(P_1:T; μ_token, Σ_token),   z_M ∼ N(μ, Σ).

The motion decoder, denoted Dec_M, shares a similar architecture with the motion encoder Enc_M, consisting of a temporal extractor and a spatial extractor. To incorporate category information, a learnable bias b_a^token is used to shift the latent representation toward the motion latent space; it is added to the latent motion feature z_M. Prior to entering the temporal blocks, positional encoding is applied alongside the action duration, denoted F ∈ ℝ^(T×(J×d)), to preserve frame position information. The resulting motion features y_tem are then passed through the spatial extractor and finally mapped back to a lower dimension by a linear transformation. This process yields the resulting sequence P̂_1:T. Our motion decoder can be formally expressed as

P̂_1:T = Dec_M(z_M, a, F),

where a is the action label.
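As an illustration of the per-frame stage, the sketch below shows how a spatial extractor of this kind could be assembled from standard Transformer encoder blocks, with the two distribution tokens prepended to the joint embeddings. Hyperparameters mirror the paper where stated (d = 48, four blocks, 21 joints); the learnable positional embedding (the paper uses sinusoidal encoding), the head count, and all names are assumptions.

```python
import torch
import torch.nn as nn

class SpatialExtractor(nn.Module):
    """Per-frame joint-wise attention with prepended distribution tokens (illustrative)."""
    def __init__(self, num_joints=21, d=48, n_blocks=4, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(3, d)                                  # (x, y, z) -> d
        self.mu_token = nn.Parameter(torch.randn(1, 1, d))
        self.sigma_token = nn.Parameter(torch.randn(1, 1, d))
        self.pos = nn.Parameter(torch.randn(1, num_joints + 2, d))    # sinusoidal in the paper
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_blocks)
        self.to_temporal = nn.Linear(d, num_joints * d)               # project tokens to the (J x d) temporal dim

    def forward(self, frames):
        # frames: (B*T, J, 3) -- each frame is treated as its own "sentence" of joints
        x = self.embed(frames)                                        # (B*T, J, d)
        b = x.shape[0]
        tokens = torch.cat([self.mu_token, self.sigma_token], dim=1).expand(b, -1, -1)
        x_seq = torch.cat([tokens, x], dim=1) + self.pos              # prepend tokens, add position
        out = self.blocks(x_seq)                                      # joint-wise self-attention
        mu_spa = self.to_temporal(out[:, 0])                          # (B*T, J*d)
        sigma_spa = self.to_temporal(out[:, 1])
        x_spa = out[:, 2:]                                            # per-joint motion features
        return mu_spa, sigma_spa, x_spa
```

The temporal extractor would follow the same pattern, attending over the T frame-level vectors instead of the J joints.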

Motion style extraction and injection
Our style injection process is shown in Fig. 4. The source style sequence S_1:T = {S_1, S_2, …, S_T}, where S_1:T ∈ ℝ^(T×J×3), is fed into the style extractor Ex_S. The style extractor is a stack of ConvBlocks (comprising convolutional layers) and LinearBlocks (comprising linear layers), which map S_1:T to a fixed dimensionality independent of sequence length, yielding the latent style encoding

z_S = Ex_S(S_1:T).

Style injection is achieved through the AdaIN layer in the style injector. The generated action sequence P̂_1:T is input to the style injector Ijec_S, which consists of ConvBlocks and ResBlocks formed by convolutional layers. There are two types of residual blocks: one removes the style, and the other injects the style. Instance normalization (IN) removes the style variations in the motion by normalizing the feature statistics of each channel in each training sample. The motion features after style removal are then injected with the source style using AdaIN, which is the key to successful style transfer. This normalization technique injects the style of S_1:T into the input P̂_1:T by adjusting the global distribution statistics as

AdaIN(x, s) = σ_s · ((x − μ(x)) / σ(x)) + μ_s,

where μ(x) and σ(x) denote the channel-wise mean and standard deviation of the content features, and μ_s and σ_s are the corresponding style statistics. Finally, the motion features are mapped back to the same dimension as the input using the corresponding ConvBlock in the style injector Ijec_S, resulting in the styled action sequence Ŝ_1:T. The overall style injection process is

Ŝ_1:T = Ijec_S(P̂_1:T, z_S).
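A minimal sketch of the IN-then-AdaIN step described above. Mapping the style code z_S to per-channel scale and shift through a linear layer is an assumption; only the normalize-then-rescale pattern follows the text.

```python
import torch
import torch.nn as nn

class AdaINInjection(nn.Module):
    """Normalize away the motion's own statistics, then re-scale/shift with style statistics."""
    def __init__(self, channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)   # IN: removes per-channel mean/std
        self.to_stats = nn.Linear(style_dim, channels * 2)      # z_S -> (gamma, beta); assumed mapping

    def forward(self, content, z_s):
        # content: (B, C, T) motion features of the generated sequence P̂_1:T
        # z_s:     (B, style_dim) latent style code from the style extractor
        gamma, beta = self.to_stats(z_s).chunk(2, dim=-1)
        normalized = self.norm(content)                          # style removal
        # AdaIN: inject the source style's channel-wise statistics
        return gamma.unsqueeze(-1) * normalized + beta.unsqueeze(-1)
```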

Training and loss
Given not only a ground-truth pair consisting of the human motion P_1:T and the action category a but also the source style motion S_1:T and a same-style motion S^same_1:T, our training objective combines (a) a reconstruction loss, (b) a KL divergence loss, and (c) a content-preserving loss.

Reconstruction loss
Since ASMNet serves two purposes, the reconstruction loss is applied to both. First, for action generation, we compare the real motion sequence with the sequence generated from the action label using an L2 loss. Second, for style injection, we compare the real motion sequence with the motion sequence after style injection, again using an L2 loss:

L_rec = ‖P_1:T − P̂_1:T‖_2 + ‖P_1:T − Ŝ_1:T‖_2.
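A sketch of the two L2 terms as described; the mean-squared form and the equal weighting are assumptions.

```python
import torch.nn.functional as F

def reconstruction_loss(real_motion, gen_motion, styled_motion):
    # real_motion:   ground-truth P_1:T
    # gen_motion:    decoder output P̂_1:T (action-generation branch)
    # styled_motion: style-injected output Ŝ_1:T (style-injection branch)
    loss_action = F.mse_loss(gen_motion, real_motion)     # L2: generated vs. real
    loss_style = F.mse_loss(styled_motion, real_motion)   # L2: stylized vs. real
    return loss_action + loss_style
```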

KL divergence loss
To regularize the latent space, we utilize class-specific distribution tokens in constructing the motion latent space and encourage the latent distribution to be close to a standard normal distribution N(0, I). Thus, we obtain the term

L_KL = KL(N(μ, Σ) ‖ N(0, I)).
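Assuming a diagonal covariance parameterized by a log-variance, the KL term has the usual closed form, sketched below.

```python
import torch

def kl_divergence_loss(mu, logvar):
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ), averaged over the batch.
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)).mean()
```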

Content preserving loss
To ensure that the style features injected into P̂ have distinguishable style properties, we not only use the source style input S but also select motion sequences S^same from the dataset M that share the same style as S. We compute the L1 norm between the results obtained by injecting S and S^same into the motion:

L_con = ‖Ijec_S(P̂_1:T, Ex_S(S_1:T)) − Ijec_S(P̂_1:T, Ex_S(S^same_1:T))‖_1.
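A sketch of the content-preserving term: injecting the style codes of S and of a same-style clip S_same into the same generated motion should give nearly identical results. The call signature is hypothetical.

```python
import torch.nn.functional as F

def content_preserving_loss(style_injector, gen_motion, z_s, z_s_same):
    # z_s / z_s_same: style codes extracted from S_1:T and from a same-style clip S^same_1:T
    styled_a = style_injector(gen_motion, z_s)
    styled_b = style_injector(gen_motion, z_s_same)
    # Injecting the same style should yield (nearly) the same stylized motion -> L1 penalty on the gap
    return F.l1_loss(styled_a, styled_b)
```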

Experiments
We implemented ASMNet in PyTorch and conducted the experiments on a PC equipped with an NVIDIA Quadro RTX 5000. The motion encoder and motion decoder in ASMNet each use spatial and temporal extractors with four blocks. The latent vector dimension is set to 48 for both. We trained the entire model using the AdamW optimizer with a fixed learning rate of 1 × 10⁻⁴ and a batch size of 20. The motion generation branch is trained for 200 epochs, while the style injection branch is trained for 2,000 epochs.
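For concreteness, the optimizer settings above translate roughly into the following training-loop sketch; the dictionary-style batches and the `loss_fn` interface are assumptions.

```python
import torch

def make_optimizer(model):
    # AdamW with a fixed learning rate of 1e-4, as used in the paper
    return torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_stage(model, loader, epochs, loss_fn, device="cuda"):
    # 200 epochs for the generation branch, 2,000 for style injection; batch size 20 in the loader
    opt = make_optimizer(model)
    model.to(device).train()
    for _ in range(epochs):
        for batch in loader:
            opt.zero_grad()
            loss = loss_fn(model, {k: v.to(device) for k, v in batch.items()})
            loss.backward()
            opt.step()
```
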
In the following, we first describe the benchmark datasets and evaluation metrics ("Datasets and evaluation" section). We then present quantitative and qualitative comparisons with state-of-the-art methods ("Comparison with state-of-the-art results" section). Finally, we analyze the main components of our method ("Ablation studies" section).

Datasets and evaluation
ASMNet is trained on the Xia dataset [18], a motion capture dataset widely used in style research. The complete database includes six actions (walk, run, jump, kick, punch, and transitions) and eight styles (neutral, proud, angry, depressed, strutting, childlike, old, and sexy). The database contains 11 min of motion, captured with a Vicon optical system at 120 Hz, resulting in roughly 79,000 individual frames. In our processing, motions are retargeted to a single 21-joint skeleton with the same skeletal topology as the CMU (Carnegie Mellon University) motion data. The dataset was split into training and testing sets at a ratio of 0.85:0.15. To evaluate the performance of ASMNet in action generation, we also use the widely recognized HumanAct12 dataset [17], which comprises 1,191 motion clips and 90,099 frames in total, organized into 12 action categories (warm up, walk, run, jump, drink, lift dumbbell, sit, eat, turn steering wheel, phone, boxing, and throw).
We employ the same evaluation metrics as Action2Motion [17] to assess the generated motion based on the action category: FID (Fréchet inception distance), accuracy, diversity, and multimodality. Following Action2Motion, we train a standard recurrent neural network (RNN) action recognition classifier on the training set of the Xia dataset [18] and use its final layer as the motion feature extractor. For HumanAct12 [17], we directly use the recognition models provided by Action2Motion [17], which operate on joint coordinates. FID is calculated by comparing the feature distribution of generated motions with that of real motions. We use the RNN action recognition classifier to classify motions and compute the overall recognition accuracy. Diversity measures the variance of the generated motions across all action categories. Two subsets of equal size S_d are randomly selected from the set of all generated motions across different action types, and their motion feature vectors {v_1, …, v_{S_d}} and {v'_1, …, v'_{S_d}} are extracted. The diversity is defined as

Diversity = (1 / S_d) Σ_{i=1}^{S_d} ‖v_i − v'_i‖_2.

Multimodality measures how much the generated motions diversify within each action type. Given a set of motions with C action types, for the c-th action, two subsets of equal size S_l are randomly selected from the generated motions, with feature vectors {v_{c,1}, …, v_{c,S_l}} and {v'_{c,1}, …, v'_{c,S_l}}. The multimodality is formalized as

Multimodality = (1 / (C × S_l)) Σ_{c=1}^{C} Σ_{i=1}^{S_l} ‖v_{c,i} − v'_{c,i}‖_2.

Because commonly used human evaluation methods are subjective and vary among individuals, we also employ quantitative evaluation of style. Similar to [49], we compute content recognition accuracy (CRA) and style recognition accuracy (SRA) for the generated stylized motion sequences. For a fair comparison, we use the same recognition network [50] as [49] to identify the action category and style category of the generated stylized motion Ŝ.
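The diversity and multimodality definitions reduce to averaged pairwise feature distances over randomly drawn subsets; a sketch is given below, with the feature extraction by the pretrained RNN classifier abstracted away and the subset sizes (S_d = 200, S_l = 20) chosen as assumed defaults.

```python
import torch

def diversity(features, s_d=200):
    # features: (N, D) classifier features of generated motions across all action categories
    idx_a = torch.randperm(features.shape[0])[:s_d]   # two independently drawn subsets
    idx_b = torch.randperm(features.shape[0])[:s_d]
    return (features[idx_a] - features[idx_b]).norm(dim=-1).mean()

def multimodality(features_per_action, s_l=20):
    # features_per_action: list of (N_c, D) tensors, one per action category c
    dists = []
    for feats in features_per_action:
        idx_a = torch.randperm(feats.shape[0])[:s_l]
        idx_b = torch.randperm(feats.shape[0])[:s_l]
        dists.append((feats[idx_a] - feats[idx_b]).norm(dim=-1).mean())
    return torch.stack(dists).mean()   # average over the C action categories
```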

Comparison with state-of-the-art results
We first compare motion generation performance based on action labels with previous work, namely Action2Motion [17] and ACTOR [29], on the Xia dataset [18] and the HumanAct12 dataset [17], as shown in Tables 1 and 2. We also compare with other baselines implemented by Action2Motion by adapting other works [35,36]. Both ACTOR and ASMNet achieve significantly lower FID scores than the other methods on the HumanAct12 dataset, implying a stronger resemblance between the generated motion distributions and the distributions observed in real human motion. Moreover, ASMNet outperforms ACTOR in terms of accuracy and multimodality, indicating that ASMNet generates more realistic action sequences closely resembling real human motion. On the Xia dataset, our results also outperform previous work in terms of FID, accuracy, and multimodality while maintaining a high level of diversity, showing the superiority of our approach. We employ the spatial temporal extractor to extract motion features and utilize multiple losses for reconstruction, rather than relying on a GRU block and an autoregressive design like Action2Motion. By extracting latent information between skeletal joints, we also incorporate more local skeletal features, which helps generate better motion sequences than ACTOR.
Moreover, in addition to motion generation, we incorporate a separate branch to inject specific motion styles into the generated motion, which is absent in current motion generation models. We therefore also compare our approach with methods specifically designed for style research, such as [43,47]. As shown in Table 3, our method is comparable to Motion Puzzle [49] in SRA but surpasses previous methods [43,47,49] by a large margin in CRA, indicating that the motion generated by ASMNet maintains distinct style characteristics while achieving higher accuracy in recognizing the content of the actions, resulting in more realistic human-like motion.
In addition to the quantitative analysis, we also provide visualizations of the generated motion for comparison with MotionCLIP [11] and MotionDiffuse [12] in Fig. 5. We choose the "walk" action to demonstrate four different styles. Although all individuals perform the action of walking, there are significant differences in posture and amplitude of movement across the different styles. As seen in Fig. 5, our motion sequences show the most pronounced and visually appealing style variations, whereas MotionCLIP [11] and MotionDiffuse [12] show similar results regardless of the style. Both quantitative and qualitative analyses confirm the strong performance of ASMNet in motion generation with motion style.

Ablation studies
In this section, we evaluate the impact of several components of our framework in a controlled setting. First, we ablate several architectural alternatives: we replace the spatial temporal extractor in the motion encoder and motion decoder with a fully connected autoencoder, a GRU, and a plain Transformer in Table 4A to C. ASMNet outperforms the methods based on fully connected layers and GRU by a large margin, and also outperforms the Transformer-based method. In contrast to fully connected layers and GRU, the spatial temporal extractor in ASMNet excels at modeling the temporal evolution of human motion, providing more accurate modeling of human movement. Moreover, the spatial temporal extractor captures the local correlations between joints in the human skeleton, which the plain Transformer does not.
We also remove the token b_a^token that transforms the action category information into the latent motion space; instead, we convert the action category to a one-hot encoding and input it directly into the motion decoder (Table 4D). We further remove the distribution parameter tokens μ_token and Σ_token from the input of the motion encoder and derive the distribution parameters μ and Σ by averaging the encoder output (Table 4E). As expected, our full model in Table 4 outperforms these variants, demonstrating the superior structure of ASMNet. One-hot encoding feeds the binary encoding of action categories directly to the motion decoder and fails to capture the deep semantic information of action categories. In contrast, the learnable token b_a^token maps action categories to a high-dimensional semantic space, enabling the model to obtain richer semantic information and to understand the similarities between different action categories. This facilitates the model's learning of the complex relationship between action categories and motion sequences, resulting in improved generation of motion sequences that align with specific actions.

Table 1. Comparison with the state of the art in motion generation on HumanAct12 [17]. We compare with recent work ACTOR and Action2Motion. Due to differences in implementation (e.g., random sampling, using a zero shape parameter), our metrics for real data and ACTOR (Real*, ACTOR*) differ slightly from those in [17,29]. † denotes the baselines implemented by [17,29].

Table 2. Comparison with the state of the art in motion generation on the Xia dataset [18]. We compare with Action2Motion [17] and ACTOR [29], using the architectures in [17,29]. Our model shows a significant improvement over Action2Motion and ACTOR.
Finally, we conduct ablation experiments on the number of spatial blocks and temporal blocks in the spatial extractor and temporal extractor, as shown in Fig. 6A. The results demonstrate that the accuracy of the generated motion sequences is highest and the FID score is lowest when the number of blocks is 4, indicating the best performance. We also note that the accuracy of the generated motion reaches its maximum and the FID score its minimum when the latent dimension is set to 48 (Fig. 6B). Therefore, when constructing ASMNet, we set the number of blocks in the spatial temporal extractor to 4 and the latent dimension d to 48.

Conclusion
In this work, we target motion generation with explicit style, conditioned on the action label, and propose a comprehensive model named ASMNet. We extract motion features along both the temporal and spatial dimensions with the spatial temporal extractor, which enhances the accuracy and increases the diversity of motion generation. Moreover, we account for the influence of motion style in the generation process: instead of a simple textual representation, we use real motion to provide the style and use AdaIN for style injection, resulting in generated motion sequences with distinct stylistic attributes. We also provide a detailed analysis of the different components of our approach. The experiments show the superiority of ASMNet, which not only accomplishes motion generation but also incorporates explicit stylistic attributes into the generated motion. Our approach outperforms the state of the art in both quantitative metrics and visual analysis, highlighting its effectiveness. ASMNet nonetheless has limitations: the current study of motion styles relies on motion capture data, and mocap datasets with motion styles remain scarce. Future work can therefore explore obtaining sufficient motion style data from other forms of data to generate motion with more diverse styles.
Table 4. Ablation on the ASMNet architecture. We replace the architecture of the motion encoder with fully connected layers (A), GRU (B), and Transformer (C). We omit b_a^token in the motion decoder and instead convert the action label to a one-hot encoding (D). We remove the input of μ_token and Σ_token to the motion encoder (E) and take only the real motion as input.

Fig. 1. Previous work (A) mostly focused on generating motions based on an action label a, duration F, and latent vector z, with the desired style incorporated within the action labels. In our approach (B), we include real-world motions to incorporate specific motion styles in the generation of motions. This enables the model to learn abstract style features that cannot be easily described in text, which are reflected in the generated motions.

Fig. 2. Overview of the proposed ASMNet framework. Training phase: The motion encoder takes a concatenation of the motion sequence {P_1, P_2, …, P_T} and the action-specific learnable distribution tokens μ_token and Σ_token as input. It outputs tokens μ and Σ that provide the Gaussian distribution parameters, from which a latent vector z_M is sampled. Then, z_M, F, and a are fed into the motion decoder, which reconstructs the motion information encoded in z_M into a motion sequence P̂_1, P̂_2, …, P̂_T of length T corresponding to action category a. To perform style injection, the style extractor extracts the latent style features z_S from the source style {S_1, S_2, …, S_T}. These latent style features are then injected into the generated P̂_1, P̂_2, …, P̂_T using AdaIN in the style injector. As a result, we obtain the styled sequence Ŝ_1, Ŝ_2, …, Ŝ_T. Test phase (yellow area): Providing only the desired action label a, duration F, and source style input {S_1, S_2, …, S_T}, we obtain the target styled motion sequence Ŝ_1, Ŝ_2, …, Ŝ_T.

Fig. 3. Details of the spatial extractor and temporal extractor in motion generation. The figure shows the motion encoder (top) and the motion decoder (bottom), which consist of spatial and temporal extractors (within the gray box) connected in sequential order. The motion encoder takes the distribution tokens μ_token and Σ_token and the motion sequence P_1:T as input. The motion sequence P_1:T is converted to x by a linear layer and then combined with μ_token and Σ_token to form x_seq, which is subjected to positional encoding before being input to the spatial blocks. At the output of the motion encoder, the part corresponding to the distribution tokens is mapped to the temporal latent space via a linear layer to obtain μ_spa and Σ_spa, while the output motion latent tensor is denoted x_spa. μ_spa, Σ_spa, and x_spa are concatenated along the temporal dimension to form the input y for the temporal extractor. y is passed through the temporal extractor, yielding parameters μ and Σ that represent the latent distribution of the motion features. Using the latent motion representation z_M, a duration F, and the action category a as input, the motion decoder outputs the generated motion P̂_1:T. a is mapped to the motion feature latent space using b_a^token, added to z_M, and concatenated with F. The resulting tensor y_seq is subjected to positional encoding before being input to the temporal blocks. The output is the motion features y_tem in the temporal latent space, which are fed into the spatial extractor. Finally, the motion features q_spa are reconstructed by a linear projection into the action sequence P̂_1:T.

Fig. 4. Style injection. The style extractor extracts the style features z_S from the input S_1:T. The style injector first uses instance normalization (IN) to remove the existing style features in P̂_1:T. Then z_S is injected into P̂_1:T using AdaIN. The style injector outputs Ŝ_1:T.

Fig. 6. Ablation over the number of blocks in the spatial temporal extractor (A), with the latent dimension in the spatial extractor fixed to 48, and ablation over the latent space dimension d (B) used in the motion encoder and motion decoder, with the number of blocks in the spatial temporal extractor fixed to 4. The dashed area represents the fluctuation error of the metrics.