Human Motion Prediction via Dual-Attention and Multi-Granularity Temporal Convolutional Networks

Intelligent devices, which significantly improve the quality of life and work efficiency, are now widely integrated into people’s daily lives and work. A precise understanding and analysis of human motion is essential for achieving harmonious coexistence and efficient interaction between intelligent devices and humans. However, existing human motion prediction methods often fail to fully exploit the dynamic spatial correlations and temporal dependencies inherent in motion sequence data, which leads to unsatisfactory prediction results. To address this issue, we proposed a novel human motion prediction method that utilizes dual-attention and multi-granularity temporal convolutional networks (DA-MgTCNs). Firstly, we designed a unique dual-attention (DA) model that combines joint attention and channel attention to extract spatial features from both joint and 3D coordinate dimensions. Next, we designed a multi-granularity temporal convolutional networks (MgTCNs) model with varying receptive fields to flexibly capture complex temporal dependencies. Finally, the experimental results from two benchmark datasets, Human3.6M and CMU-Mocap, demonstrated that our proposed method significantly outperformed other methods in both short-term and long-term prediction, thereby verifying the effectiveness of our algorithm.


Introduction
With the rapid development of artificial intelligence technology, an increasing number of intelligent devices are being applied in industrial production and daily human life. Human motion prediction, a key technology for enhancing device intelligence, aims to capture the intrinsic temporal evolution within historical human motion sequences to generate predictions for future motion. Human motion prediction has been widely applied in fields such as autonomous driving [1], human-computer interaction [2,3], human emotion recognition [4], and human behavior analysis [5][6][7]. However, due to the high dimensionality, joint spatial collaboration, hierarchical human structure, and strong temporality characteristics of human motion, capturing temporal dynamic information and spatial dependency features for precise human motion prediction remains a challenging research hotspot.
Human motion prediction is a typical task in the computer vision field. Traditional human motion prediction algorithms, such as hidden Markov models (HMMs) [8], Gaussian process dynamic models (GPDMs) [9], and restricted Boltzmann machines [10], as shown in Figure 1, often require extensive prior knowledge and assumptions, making it difficult to capture the complexity and diversity of human motion and thereby limiting their practical applicability.
As more and more large-scale motion capture datasets become available, an increasing number of deep learning models have been designed and have demonstrated excellent performance, such as convolutional neural networks (CNNs) [11] and graph neural networks (GNNs) [25]. Nevertheless, two key challenges remain:
(a) Spatial relationship modeling: In most previous studies, spatial joint graphs were designed based on the human physical structure, typically utilizing GNNs [25] to capture spatial correlations. However, GNNs are limited by the local and linear aggregation of node features and may not effectively capture the global and nonlinear dynamics of human motion. Adaptive graphs were introduced to overcome these limitations, but they still have drawbacks, such as overlooking the correlations among critical 3D coordinate information, which results in a loss of relevant internal feature information.
(b) Simultaneously capturing complex short-term and long-term temporal dependencies: Most research has employed temporal learning components to capture temporal correlations. RNNs are a classic approach, but they face gradient vanishing or exploding issues when learning long time sequences. More advanced models such as LSTM and GRU mitigate the issue of vanishing gradients to a certain degree, but remain difficult to train and lack parallel computation capability. Self-attention mechanisms [26,27] attempt to capture temporal dependencies but still struggle to effectively model long-range dependencies. TCNs [22,28] capture long-term dependencies through fixed kernel sizes within an independent module framework and can therefore capture only a single type of dependency at one temporal scale; their fixed receptive fields limit their ability to adaptively learn multi-scale temporal dependencies.
To tackle these challenges in human motion prediction, a novel method based on dual attention and multi-granularity temporal convolutional networks (DA-MgTCNs) was proposed. This approach effectively captures spatial correlations and multi-scale temporal dependencies. Specifically, joint attention and channel attention were combined to design a dual-attention structure for extracting spatial features and capturing information on spatial correlations between and within human joints. TCNs were employed to model long-term temporal dependencies, and the concept of multi-granularity was introduced into the TCN to further enhance performance. The multi-granularity TCN (MgTCN) employed convolution kernels of varying scales in its convolution operations across multiple branches, enabling it to effectively capture multi-scale temporal dependencies in a flexible manner.
The MgTCN module was comprised of a combination of multi-granularity causal convolutions, dilated convolutions, and residual connections. Each branch of the module was composed of multiple causal convolution layers with varying dilation factors. This design enabled the adaptive selection of different receptive fields based on varying motion styles and joint trajectory features for short-term and long-term human motion prediction.
The main contributions of this paper are as follows: (1) We designed a dual-attention model for extracting inter-joint and intra-joint spatial features, more effectively mining spatial relationships between joints and different motion styles, providing richer information sources for motion prediction.
(2) We introduced a multi-granularity temporal convolutional network (MgTCN) that employed multi-channel TCNs with different receptive fields for learning, thus achieving discriminative fusion at different time granularities, flexibly capturing complex short-term and long-term temporal dependencies, and thereby further improving the model's performance.
(3) We conducted extensive experiments on the Human3.6M and CMU-MoCap datasets, demonstrating that our method outperformed most state-of-the-art approaches in short-term and long-term prediction, verifying the effectiveness of the proposed algorithm.
The remainder of this paper is organized as follows: Section 2 reviews related work. Section 3 details the proposed methodology. In Section 4, we describe experiments conducted on two large-scale datasets, comparing the performance of the proposed method with baselines. Section 5 provides a summary and conclusion, as well as a discussion of future work.

Related Work
In this section, we review the literature relevant to our dual-attention multi-granularity temporal convolutional networks (DA-MgTCNs) model, focusing on existing methods for human motion prediction, temporal convolutional networks (TCNs), multi-granularity (Mg) convolutions, and attention mechanisms.

Human Motion Prediction
The development of human motion prediction has evolved through several phases. Traditional methods primarily rely on statistical approaches, such as hidden Markov models (HMMs) [8], Gaussian processes (GPs) [9], and restricted Boltzmann machines [10], to learn underlying patterns and structures from data in order to predict future human motion [15]. Although these methods have achieved some success in certain scenarios, they still face challenges in capturing complex spatial and temporal dependencies, computational efficiency, and scalability.
With the rapid development of deep learning, researchers have started to apply it to human motion prediction tasks. Recurrent neural networks (RNNs) have been widely adopted for the temporal information modeling of human motion. Some representative works include Fragkiadaki et al. [15]'s RNN model, Martinez et al. [17]'s RNN-based joint angle prediction model, and Li et al. [11]'s convolutional recurrent neural network (CRNN) model. Although RNNs have achieved high accuracy in human motion prediction, their recursive computation over the time series can lead to error accumulation, causing predictions to eventually converge to a static mean pose.
To address this issue, researchers have improved RNNs. Chiu et al. [29] used LSTM units to model the underlying structure of human motion hierarchically, but this method did not adequately capture the spatial structure of the human body. Martinez et al. [17] introduced a residual structure using GRUs to model the velocity of human motion sequences, focusing on short-term temporal modeling but ignoring long-term dependencies and spatial structure. Jain et al. [16] combined LSTM and fully connected (FC) layers in a structural RNN model to encode high-level spatio-temporal structures in human motion sequences. Guo et al. [30] employed FC layers and GRUs to model local structures and capture long-term temporal dependencies, but they did not account for the interactions between different limbs. These RNN-based models faced challenges in capturing long-term dependencies and error accumulation.

Temporal Convolutional Networks
The temporal convolutional network (TCN) was developed to address these issues. The fundamental TCN architecture includes causal convolution, dilated convolution, and residual blocks [31]. Compared to RNNs and LSTMs, TCNs offer the advantages of parallel computation and larger receptive fields. Recent research has shown that 1D convolution can effectively represent time-series data [31][32][33], achieving significant success in various sequence learning tasks, such as machine translation [34], speech synthesis [35], video analysis [36], and semantic segmentation [37]. The contextual size of the network can easily be increased by stacking multiple one-dimensional convolutional layers, and the resulting hierarchical feature representations of input sequences enable the efficient modeling of long-term temporal patterns [35,36].

Multi-Granularity Convolution
A single-scale TCN might not be sufficient to capture the multi-scale temporal correlations in motion sequences in human motion prediction tasks. To capture complicated short-term and long-term temporal connections, researchers have developed a method known as multi-granularity convolution, which combines multi-scale information fusion [38]. By adjusting the kernel size, CNN-based deep learning models can quickly gather feature information at various granularities, enabling more accurate decision making by combining and evaluating data from various scales. Recent achievements in the field of computer vision have fully exploited multi-granularity information fusion based on CNNs [39].

Attention Mechanisms
Additionally, understanding spatial relationships is essential when attempting to predict human motion, and attention mechanisms have been incorporated into models to extract the spatial correlations of joints. Despite their widespread use in natural language processing [40,41] and image processing [42], attention mechanisms still hold untapped potential in the area of human motion prediction. Tang et al. [43] employed the attention module for information extraction along the temporal dimension, and Cai et al. [44] used it to model global spatial dependencies among joint trajectories. However, we believe that the intrinsic three-dimensional coordinate information of the human body is crucial for spatial representation.

Problem Formulation
Our goal was to forecast future human pose sequences based on past 3D human pose sequences. Three-dimensional joint positions were employed as the pose representation to avoid the ambiguity introduced by joint angle representations. A graphical representation of the human pose was created by analyzing the properties of human joint positions over time. Let X_{1:T} = [x_1, x_2, ..., x_T] denote the observed joint positions over T time steps, where x_i ∈ R^{J×C}; T specifies the number of input time steps, J the number of human pose joints, and C = 3 the feature dimension (x, y, z). Our goal was to predict the N future poses X_{T+1:T+N} = [x_{T+1}, x_{T+2}, ..., x_{T+N}]. Following the literature [25,45], we first replicated the latest pose x_T N times to build a time series of length T + N. The task thus became mapping the padded input sequence X_{1:T+N} = [x_1, x_2, ..., x_T, x_T, ..., x_T] to the output sequence X̂_{1:T+N}, where each x_i denotes the 3D coordinates of the J body joints.
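As a minimal sketch of the padding step (the (T, J, C) tensor layout and variable names here are our assumptions, not the authors' code):

```python
import torch

def pad_with_last_pose(x, n_future):
    """Repeat the final observed pose N times, as in the formulation
    [x_1..x_T] -> [x_1..x_T, x_T, ..., x_T].  x: (T, J, C) tensor."""
    last = x[-1:].expand(n_future, -1, -1)  # (N, J, C) views of x_T
    return torch.cat([x, last], dim=0)      # (T+N, J, C)

obs = torch.randn(10, 17, 3)     # hypothetical T=10 observed poses, J=17, C=3
padded = pad_with_last_pose(obs, 25)
print(padded.shape)              # torch.Size([35, 17, 3])
```

The model then regresses the whole length-(T+N) sequence at once, rather than autoregressively, which avoids step-by-step error accumulation.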

Overview
We employed a residual depth network consisting of DA-MgTCN modules to capture the global spatial correlation and multi-scale temporal dependence of human motion. Each DA-MgTCN module consisted of a two-branch attention structure module (DA) and a multi-granularity TCN module (MgTCN) connected in series to capture the inter-temporal dependence of historical motion sequences. The DA module was used to extract spatially significant information from joint-level and channel-level dimensions. A combination of multi-granularity convolution and TCN was used in the MgTCN module to increase the prediction quality and adapt to varied forms of human motion and multi-scale time. The complete model architecture was trained end-to-end, with global and local residual connections improving the deep neural network's performance. Each DA-MgTCN component is described in detail below.
Figure 2 shows the overall architecture of the proposed end-to-end motion prediction framework, together with the details of the DA and MgTCN modules described below. We encoded the human poses X_{1:T+N} and fed them into the DA-MgTCN, a series connection of DA and MgTCN modules. The DA module extracted spatially important information at the joint and channel levels, the MgTCN captured temporal dependencies at different scales, and, finally, the decoder module recovered the time dimension length.

Dual Attention (DA)
The self-attention mechanism is regarded as an efficient method for modeling long-range dependencies. Tang et al. [43] and Cai et al. [44] used the attention module for information extraction along the temporal dimension and for the modeling of global spatial dependencies, respectively. However, we observed that the 3D coordinate information of human joints is crucial for spatial representations.
As a result, we proposed a dual-attention module that took into account both joint-level attention and channel-level attention in order to extract joint-related and channel-related information for spatial correlation. The DA module is depicted in the lower left corner of Figure 2 and is described in detail below.
Given a human motion feature X, a linear transformation was first performed using the weight matrices W_q, W_k, and W_v to obtain the query Q, the key K, and the value V. The two branches shared the same embeddings Q, K, and V, which were reshaped into J × CT matrices (for the joint branch) and C × JT matrices (for the channel branch). Joint and channel attention were then used to simultaneously mine the dependencies along the joint and channel dimensions:

F_J = softmax(Q^(J)(K^(J))^T / √d_k) V^(J),
F_C = softmax(Q^(C)(K^(C))^T / √d_k) V^(C),

where Q^(J), K^(J), V^(J) and Q^(C), K^(C), V^(C) represent the reshaped Q, K, and V matrices of the joint-level and channel-level branches, respectively; W_q, W_k, and W_v are trainable weights; and d_k is the dimension of K. F_J and F_C are the output features of the joint-level and channel-level branches. After obtaining the joint-level and channel-level features, we summed them element-wise to obtain the spatially attended feature representation X̂ passed to the MgTCN, as shown in Equation (5):

X̂ = F_J + F_C.

This representation captured joint-level and channel-level contextual information, which is crucial for effective motion prediction modeling.
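A compact sketch of the dual-attention idea follows. The (J, C, T) layout, the per-channel projection matrices, and scaling by the flattened feature dimension are our assumptions; the paper does not publish this code:

```python
import torch

def attend(q, k, v):
    # scaled dot-product attention; each row of q attends over the rows of k
    att = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return att @ v

def dual_attention(x, wq, wk, wv):
    """DA sketch. x: (J, C, T) motion feature; wq/wk/wv: (C, C) shared
    projection weights (hypothetical layout)."""
    J, C, T = x.shape
    q, k, v = (torch.einsum('jct,cd->jdt', x, w) for w in (wq, wk, wv))
    # joint branch: J rows, each a flattened C*T feature vector
    f_joint = attend(*(t.reshape(J, C * T) for t in (q, k, v))).reshape(J, C, T)
    # channel branch: C rows, each a flattened J*T feature vector
    f_chan = attend(*(t.permute(1, 0, 2).reshape(C, J * T) for t in (q, k, v)))
    f_chan = f_chan.reshape(C, J, T).permute(1, 0, 2)
    return f_joint + f_chan  # element-wise sum, as in Eq. (5)

x = torch.randn(17, 3, 35)
w = [torch.randn(3, 3) for _ in range(3)]
out = dual_attention(x, *w)
print(out.shape)  # torch.Size([17, 3, 35])
```

Both branches reuse the same projected embeddings; only the reshaping (J × CT versus C × JT) differs, so the extra cost of the second branch is small.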

Multi-Granularity TCN (MgTCN)
To learn human motion temporal features efficiently, we extended the concept of temporal multi-granularity convolutional kernels to TCN networks and proposed MgTCNs for extracting temporal features at multiple scales for different motion styles. The MgTCN module is shown in the lower right corner of Figure 2 and consisted of multi-granularity causal convolution, dilated convolution, and residual blocks. There were three causal convolution channels in the MgTCN, each using kernels with granularity sizes of 2, 3, and 5 for feature extraction. Each channel consisted of three residual blocks connected in series. These units increased the perceptual field at a dilation rate of [1,2,4] and used ReLU as the activation function. In addition, a dropout unit was included in each residual block for regularization.
Causal convolution: In standard 1D convolution, the output at the t-th timestamp is computed from the k elements surrounding time step t in the previous layer, which is not reasonable for the human motion prediction task [31]. The goal of this research was to find the best function for generating human-like future poses based on previous motion capture sequences; therefore, the predicted pose at time step t could be derived only from representations of previously observed frames and not from later poses. The MgTCN's causal convolution ensured that only past data were used as the model input, preventing future information leakage. This was accomplished by shifting the standard convolution output by a few time steps, as shown in the equation below:

y_t = Σ_{i=0}^{k−1} w_i · x_{t−i},

where w_i is the convolution weight at offset i, and k is the kernel size.
Dilated convolution: Causal convolution alone captures historical information inadequately: its receptive field grows only linearly with the network depth or number of layers, while increasing the depth substantially increases the number of parameters and makes training more difficult. Oord et al. [35] therefore suggested using dilated convolution to extend the receptive field of causal convolutional networks and better capture historical information.
Dilated convolution is implemented by introducing a dilation factor into the sliding convolutional kernel. Compared to traditional deep convolutional networks, dilated convolution can obtain a larger receptive field without significantly increasing the number of parameters, thus capturing information over a longer time range. This approach can attend to both local details and longer-span motion trends when dealing with human motion prediction tasks.
Dilated causal convolution can be expressed by the following equation. For a filter f : {0, . . . , k − 1} → R and an input x ∈ R^T denoting a given 1D time series, the dilated convolution operation F on element s of the sequence is computed as:

F(s) = Σ_{i=0}^{k−1} f(i) · x_{s−d·i},

where d is the dilation factor, k is the size of the filter, and the convolution kernel slides only over the current position and the positions to its left (i.e., past information). The receptive field R of a three-layer convolution is calculated as:

R = 1 + (k − 1)(d_1 + d_2 + d_3),

where d_1, d_2, and d_3 are the dilation factors of the three convolution layers. For our TCN, this gives:

R = 1 + (3 − 1)(1 + 2 + 4) = 15.

Figure 3 shows an example of a three-layer dilated causal convolutional network (TCN) with dilation factors d = 1, 2, 4 and a filter size k = 3.
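The causal/dilated machinery and the receptive-field formula can be sketched with standard left-padded 1D convolutions (a PyTorch illustration under our assumptions, not the authors' implementation):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated causal convolution: left-pad by (k-1)*d so the output at
    time t depends only on inputs at times <= t."""
    def __init__(self, ch, k, d):
        super().__init__()
        self.pad = (k - 1) * d
        self.conv = nn.Conv1d(ch, ch, k, dilation=d)

    def forward(self, x):                       # x: (batch, ch, T)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

def receptive_field(k, dilations):
    # R = 1 + (k - 1) * sum(d_i) for stacked dilated causal convs
    return 1 + (k - 1) * sum(dilations)

net = nn.Sequential(*[CausalConv1d(3, 3, d) for d in (1, 2, 4)])
y = net(torch.randn(1, 3, 25))
print(y.shape, receptive_field(3, (1, 2, 4)))  # torch.Size([1, 3, 25]) 15
```

Perturbing only the last input frame leaves all earlier outputs unchanged, which is exactly the no-future-leakage property the causal design guarantees.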
Multi-granularity convolution: To handle the complex multi-action, multi-joint prediction of the human body, the MgTCN used convolutional kernel filters of different granularities to extract time-series features at different scales, satisfying the need of short-term and long-term prediction to capture temporal features of different lengths. The three time series were processed separately in the MgTCN, which allowed multiple temporal granularities to be combined in the feature extraction process and thus better represented a large range of spatio-temporal features. The remaining challenge was how to integrate the time-series data at these different temporal granularities to obtain better results.
Using the aforementioned spatial and temporal feature extraction steps, the MgTCN produced multi-granularity temporal features (short-term and long-term). To integrate this multi-granularity information, we combined the outputs of the three TCN channels and made predictions using the equation below:

Ŷ = g(Σ_i w_i · F_i),

where F_i is the output of the ith TCN channel, w_i is a learnable parameter that adjusts the weights of the different time granularities, and g(·) is a mapping function that maps the fused features to the predicted values. With this multi-granularity temporal convolution (MgTCN) method, we could both observe the general long-term trend of human motion and capture short-term outlier changes. This temporal correlation improved the model's predictive power.
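The fusion step can be sketched as follows. The three branch convolutions stand in for full TCN channels, and using a 1×1 convolution as the mapping g(·) is our choice; only the kernel sizes (2, 3, 5) and the learnable weights w_i come from the paper:

```python
import torch
import torch.nn as nn

class MgFusion(nn.Module):
    """Sketch of multi-granularity fusion: branches with kernel sizes
    2/3/5, combined with learnable weights w_i and a 1x1-conv mapping g."""
    def __init__(self, ch, kernels=(2, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(ch, ch, k, padding=k - 1) for k in kernels)
        self.w = nn.Parameter(torch.ones(len(kernels)))   # learnable w_i
        self.g = nn.Conv1d(ch, ch, 1)                     # mapping g(.)

    def forward(self, x):                                 # x: (B, ch, T)
        T = x.shape[-1]
        feats = [b(x)[..., :T] for b in self.branches]    # trim to length T
        fused = sum(w * f for w, f in zip(self.w, feats)) # weighted sum
        return self.g(fused)

m = MgFusion(3)
out = m(torch.randn(2, 3, 35))
print(out.shape)  # torch.Size([2, 3, 35])
```

Because the w_i are trained jointly with the rest of the network, the model can shift weight toward the small-kernel branch for fast, short-term dynamics and toward the large-kernel branch for slow, long-term trends.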

Global and Local Residual Connection
A residual connection skips one or more layers of the network and adds the input to the layer's output. This mitigates gradient vanishing by propagating gradients directly from later layers to earlier ones, and it simplifies representation learning in deeper structures. As Figure 2 illustrates, we used global residual connections between the encoder and decoder modules and local residual connections within each DA-MgTCN module to improve the training and performance of the deep network. This design helped the network capture complex data patterns in human motion prediction.
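The local residual pattern can be expressed as a small wrapper (a generic sketch of the skip-connection idea, not the authors' module):

```python
import torch
import torch.nn as nn

class LocalResidual(nn.Module):
    """Local skip connection: out = x + block(x).  Wrapping each
    DA-MgTCN sub-module this way is the pattern described in the text."""
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        return x + self.block(x)

# The global residual (encoder to decoder) follows the same pattern:
#   output = decoder(features) + encoder_input
res = LocalResidual(nn.Identity())
x = torch.randn(4)
print(torch.equal(res(x), 2 * x))  # True: identity block doubles the input
```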

Loss Function
To train our DA-MgTCN model, we employed an end-to-end training strategy. The mean per joint position error (MPJPE) between the predicted motion sequence and the ground truth motion sequence was used as the loss function to measure the difference between the predicted outcomes and the true pose, defined as follows:

L = (1 / (N · T)) Σ_{j=1}^{T} Σ_{i=1}^{N} ||Ŷ_{i,j} − Y_{i,j}||_2,

where N is the number of human joints, T is the number of time steps in the future sequence, Ŷ_{i,j} ∈ R^C is the prediction for the ith joint at the jth time step, and Y_{i,j} is the corresponding ground truth. We optimized the loss function using an improved Adam method (AdamW [46]), which mitigates overfitting by adding a weight decay term and can significantly improve the robustness of the model.
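The MPJPE loss is a one-liner in practice (tensor shapes here are illustrative; coordinates are assumed to occupy the last dimension):

```python
import torch

def mpjpe_loss(pred, target):
    """Mean per joint position error: average Euclidean distance (in mm)
    between predicted and ground-truth joint positions; C = 3 coordinates
    are assumed to be the last dimension."""
    return torch.norm(pred - target, dim=-1).mean()

# hypothetical shapes: N = 17 joints, T = 25 future steps, C = 3
pred = torch.zeros(17, 25, 3)
gt = torch.ones(17, 25, 3)
print(mpjpe_loss(pred, gt))  # sqrt(3) ≈ 1.7321, each joint offset by (1,1,1)
```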

Experiments
In this section, we evaluate the performance of the proposed method using two large-scale human motion capture benchmark datasets: Human3.6M and CMU-Mocap.

Datasets
Human3.6M [47] is the largest existing human motion analysis database, consisting of 7 actors (S1, S5, S6, S7, S8, S9, and S11) performing 15 actions: walking, eating, smoking, discussing, directions, greeting, phoning, posing, purchases, sitting, sitting down, taking photos, waiting, walking a dog, and walking together. Some actions are periodic, such as walking, while others are non-periodic, such as taking photos. Each pose includes 32 joints, represented in the form of an exponential map. By converting these into 3D coordinates and eliminating redundant joints, global rotation, and translation, the resulting skeleton retains 17 joints that provide sufficient human motion detail. These include the key joints that locate the major body parts (e.g., shoulders, knees, and elbows), ensuring that no crucial joints are overlooked. We downsampled the frame rate to 25 fps and used S5 and S11 for testing and validation, while the remaining five actors were used for training.
CMU-MoCap, available at http://mocap.cs.cmu.edu/, accessed on 13 June 2023, is a 3D human motion dataset released by Carnegie Mellon University that used 12 Vicon infrared MX-40 cameras to record the positions of 41 sensors attached to the human body, describing human motion. The dataset can be divided into six motion themes, including human interaction, interaction with environment, locomotion, physical activities and sports, situations and scenarios, and test motions.
These motion themes can be further subdivided into 23 sub-motion themes. The same data preprocessing method as in the literature [25] was adopted, simplifying each human body and reducing the motion rate to 25 frames per second. Furthermore, eight actions (basketball, basketball signals, directing traffic, jumping, running, soccer, walking, and washing the face) were selected from the dataset to evaluate the model's performance. No hyperparameters were adjusted in this dataset, and we only used the training and testing sets, applying a splitting method consistent with the common practice in the literature.

Implementation Details
All experiments in this paper were implemented using the PyTorch deep learning framework. The experimental environment was Ubuntu 20.04 with an NVIDIA A100 GPU. During the training process, the batch size was set to 16, and the AdamW optimizer was used to optimize the model. The initial learning rate was set to 0.003 and decayed by 5% every 5 epochs. The model was trained for 60 epochs, and each experiment was conducted three times, with the average result taken to ensure a more robust evaluation of the model's performance. The input motion sequence length was 25 frames (1000 ms), and the prediction generated 25 frames (1000 ms). The choice and configuration of the relevant hyperparameters are shown in Table 1.
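These settings can be reproduced with a few lines of PyTorch; the `nn.Linear` model is a hypothetical stand-in, and we interpret "decayed by 5% every 5 epochs" as a `StepLR` schedule with `gamma=0.95`, `step_size=5` (an assumption about the exact schedule form):

```python
import torch

model = torch.nn.Linear(51, 51)  # stand-in (51 = 17 joints x 3 coords)
opt = torch.optim.AdamW(model.parameters(), lr=0.003)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=5, gamma=0.95)

for epoch in range(60):
    # ... training loop over batches of size 16 would go here ...
    sched.step()

# after 60 epochs, the lr has decayed 12 times: 0.003 * 0.95**12
print(round(opt.param_groups[0]['lr'], 6))  # 0.001621
```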

Evaluation Metrics and Baselines
The same evaluation metrics as those used in existing algorithms [25,45] were employed for assessing model performance. The standard mean per joint position error (MPJPE) was used to measure the average Euclidean distance (in millimeters, mm) between the predicted joint 3D coordinates and the ground truth, as illustrated in Equation (12). In addition, to further illustrate the advantages of the method, we conducted a comparative analysis of our method with Res. sup. [17], convSeq2Seq [11], DMGNN [13], LTD [25], LPJP [44], Hisrep [48], MSR [49], and ST-DGCN [45].

Experimental Results and Analysis
Human3.6M: Following existing work, we divided the prediction results into short-term (80-400 ms) and long-term (500-1000 ms) predictions. The experimental results are shown in Table 2, which reports the joint position error and mean error for short-term (80 ms, 160 ms, 320 ms, and 400 ms) and long-term (560 ms and 1000 ms) predictions for the 15 kinds of movements. Existing methods usually showed high prediction accuracy when dealing with more periodic and regular movements, such as "walking" and "eating", but their accuracy decreased significantly for more random and irregular movements, such as "directions", "posing", and "purchases". The algorithm proposed in this paper maintained high prediction accuracy even on highly complex, non-periodic, and irregular movements. The experimental results revealed that the proposed DA-MgTCN method outperformed most baseline methods in short-term motion prediction and improved even more significantly in long-term prediction, reaching the optimum on every MPJPE index and obtaining excellent results for both the 560 ms and 1000 ms MPJPE metrics. This success can be attributed to the ability of DA-MgTCN to fully capture spatial correlations and multi-granularity temporal features, which was a key factor in enhancing the model's prediction accuracy.
Qualitative comparison: We visualized the results of the aforementioned motion prediction to further assess the model's performance. Figure 4 illustrates the visualization results for actions including "walking", "discussion", "posing", and "sitting down". The first row in every subplot shows the ground truth pose sequences (in black), followed by the predicted poses (in blue), i.e., each row displays the prediction results of one model. From the visualization results, it was observed that the predictions generated by the DA-MgTCN method showed higher similarity to the actual sequences and exhibited lower distortion and better continuity between frames. This was due to the dual-branch spatial attention and multi-granularity temporal convolution modeling joint motion trajectories, which provided richer and smoother joint motion temporal context information. The model could sufficiently capture global spatial dependencies, allowing it to encode joint information with distant hidden dependencies. For example, in the "sitting down" motion visualization, the motion between the hands and feet was more coordinated and coherent. This demonstrated once again how well the proposed DA-MgTCN forecasted both highly complex irregular movements and complex periodic motions.
CMU-MoCap: To further validate the generalization of the DA-MgTCN method, we compared its performance with existing algorithms on the CMU-MoCap dataset, including Res. sup. [17], convSeq2Seq [11], DMGNN [13], LTD [25], LPJP [44], MSR [49], and ST-DGCN [45]. The experimental results are shown in Table 3, presenting the mean per joint position error and corresponding average error for short-term and long-term predictions across eight actions. From the table, it can be observed that the DA-MgTCN method's short-term and long-term prediction accuracy was significantly higher than that of the other seven existing prediction algorithms, including the method of Cai et al. [44], even when handling relatively complex non-periodic actions. Compared to the state-of-the-art ST-DGCN method, the DA-MgTCN method improved the average prediction accuracy by about 1.5% in short-term prediction and 3% in long-term prediction. Thus, the comprehensive experimental results once again confirmed the effectiveness and generalization capabilities of the DA-MgTCN method.

Ablation Study
To deeply evaluate the contribution of each component in our model, we conducted a series of ablation experiments on the Human3.6M dataset. These experiments focused on the impact of the channel-attention (channel-att) and multi-granularity (Mg) convolution modules on the model's performance. The results are shown in Table 4. Table 4. Influence of the channel-attention (channel-att) and multi-granularity (Mg) convolution modules on the Human3.6M and CMU-MoCap datasets. On average, both components contributed to the model's accuracy. The best results are marked in bold. Regarding channel attention, the prediction accuracy decreased significantly when only joint attention was used instead of dual attention. The multi-granularity TCN module showed excellent performance in capturing long-term temporal dependencies, thus improving the long-term prediction accuracy. Furthermore, when the channel-att or Mg module was removed, the error at 1000 ms increased by 1.9% and 4.0%, respectively, on the Human3.6M dataset, and by 2.9% and 4.0%, respectively, on the CMU-MoCap dataset. The best performance was achieved by combining the two components.

Effects of Multi-Granularity: The multi-granularity model outperformed the single-granularity model, especially for long prediction horizons. In addition, using learnable weight parameters yielded better prediction performance than fixed weights. This suggests that a multi-granularity temporal structure extracts the temporal correlations between different time periods more effectively, thereby improving prediction performance.
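The comparison above can be sketched as parallel temporal convolutions with different kernel sizes (receptive fields) fused by learnable weights. The single-trajectory setup, the causal convolution, and the softmax fusion below are illustrative assumptions, not the paper's exact TCN block:

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1D convolution: output at time t depends only on x[:t+1]."""
    return np.convolve(x, kernel)[:len(x)]

def multi_granularity(x, kernels, logits):
    """Run parallel branches with different kernel sizes over a joint
    trajectory x, then fuse the branch outputs with learnable weights
    (softmax of `logits`); fixing `logits` gives the fixed-weight variant."""
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return sum(wi * causal_conv1d(x, k) for wi, k in zip(w, kernels))

x = np.arange(5.0)                   # toy 1D joint trajectory
kernels = [np.array([1.0]),          # granularity 1: identity tap
           np.array([0.5, 0.5])]     # granularity 2: short smoother
out = multi_granularity(x, kernels, np.zeros(2))  # equal fusion weights
print(out)  # blend of the raw and smoothed trajectories
```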
Effects of the Number of DA-MgTCNs: To validate the effect of stacking multiple DA-MgTCNs in the model, we increased the number of DA-MgTCNs from 6 to 14 in steps of 2 and measured the prediction error and running time on both datasets, as shown in Table 5. The results showed that as the number of DA-MgTCNs increased from 6 to 10, the predicted MPJPE decreased while the time cost continued to grow. With 12 or 14 DA-MgTCNs, the prediction error remained stable at a low level, but the time cost kept increasing. Therefore, 10 DA-MgTCNs were chosen to balance prediction accuracy and operational efficiency. In summary, the experimental results in this paper revealed the importance of the dual-attention and multi-granularity convolutional design of the DA-MgTCN method for performance improvement. Modeling joint motion trajectories with dual-branch spatial attention and multi-granularity temporal convolution provides richer and smoother temporal context for joint motion, adequately models global spatial dependencies, and enables the model to encode hidden long-range dependencies between joints, thereby improving both short-term and long-term motion prediction.

Limitations
In addition to the qualitative results presented in Figure 4, we also investigated cases that challenge the DA-MgTCN model. Figure 5 illustrates a predicted skeleton for the "walking a dog" action, where the last few frames clearly do not align perfectly with the ground-truth pose. This misalignment stems from the high degree of uncertainty inherent in human motion: a given sequence of past poses can lead to many plausible future outcomes, which makes long-term dependencies between joints and frames harder to predict. Furthermore, our experiments were limited by the available data scenarios and experimental conditions, so validating the algorithm in more realistic settings remains an open challenge. In future work, we will consider motion prediction in more intricate scenarios and investigate novel methods for multi-granularity human motion prediction in multi-domain contexts, with the aim of enhancing the adaptability and performance of the model.

Conclusions
In this paper, we proposed a novel human motion prediction method leveraging dual-attention and multi-granularity temporal convolutional networks (DA-MgTCNs) to accurately understand and analyze human motion. Our method combines a dual-attention mechanism, which addresses the challenging problem of extracting inter-joint and intra-joint spatial features, with a multi-granularity temporal convolutional network model. The multi-granularity model employs TCN branches with different convolutional kernel granularities, enabling the learning of richer multi-scale temporal information and further enhancing model performance. Extensive experiments were conducted on two large-scale datasets, Human3.6M and CMU-MoCap. The results demonstrated that the proposed method significantly outperformed other approaches in both short-term and long-term prediction tasks, validating the effectiveness of the proposed algorithm. In future work, we aim to further optimize the network structure and parameter settings and to extend our model to spatio-temporal prediction tasks in real-world scenarios, such as robot perception and interaction.

Data Availability Statement:
The datasets generated and/or analyzed during the current study are publicly available. The Human3.6M dataset can be accessed through the reference in [47]. The CMU-MoCap dataset is publicly available and can be accessed online at http://mocap.cs.cmu.edu/ (accessed on 13 June 2023). The use of these datasets is governed by their respective usage policies.