Hybrid Directed Hypergraph Learning and Forecasting of Skeleton-Based Human Poses

Forecasting 3-dimensional skeleton-based human poses from a historical sequence is a classic task with enormous potential in robotics, computer vision, and graphics. Current state-of-the-art methods use graph convolutional networks (GCNs) to model the relationships between pairs of human joints. However, human action involves complex interactions among multiple joints, which constitute higher-order correlations beyond the pairwise (2-order) connections of GCNs. Moreover, joints are typically driven by their parent joints rather than driving them, yet existing methods ignore this direction of information transmission. In this work, we propose a novel hybrid directed hypergraph convolution network (H-DHGCN) to model the high-order relationships of the human skeleton with directionality. Specifically, our H-DHGCN involves 2 core components. The first is the static directed hypergraph (S-DHG), which is pre-defined according to the human body structure to leverage the natural relations of human joints. The second is the dynamic directed hypergraph (D-DHG), which is learnable and constructed adaptively to capture the unique characteristics of each motion sequence. In contrast to typical GCNs, our method provides a richer and more refined topological representation of skeleton data. On several large-scale benchmarks, experimental results show that the proposed model consistently surpasses the latest techniques.


Introduction
Human motion is a key medium through which robots understand human behavior, and human motion analysis has been widely applied in many fields, such as human-robot interaction, virtual reality, and computer animation [1][2][3]. In this work, we focus on the specific task of human motion forecasting, which aims to estimate future human actions over a time horizon from historical skeleton data. Owing to its significance in human-robot interaction and machine intelligence, this task has attracted increasing attention [4][5][6][7][8].
Contemporary approaches regard the human skeleton as a simplified graph and then resort to graph convolutional networks (GCNs) to formulate this task, achieving impressive results [3][9][10][11]. These methods explicitly consider the natural structure of human poses. However, typical GCN methods can only model pairwise (2-order) connections between human joints. In real scenarios, human action often involves the interactive movement of multiple joints on a specific limb, which goes beyond the pairwise connections that a GCN can represent. For instance, when dancing, the knee interacts with all the other joints of the leg to perform a graceful movement according to kinematic characteristics, and the same holds for the other limbs. This complicated high-order relationship transcends the pairwise connections of traditional GCNs [12,13].
In addition, human joints are typically driven by their parent joints to move on a spherical surface, while they cannot drive the movement of their parents. The directionality of this message passing is an essential characteristic of human action [14]. Most existing methods ignore this asymmetric relationship, which degrades prediction performance [9][10][11].
To address these limitations, we present a novel hybrid directed hypergraph convolution network (H-DHGCN), which is capable of efficiently representing the complex and diversified human skeleton. Our inspiration comes from the fact that, in natural human activities, multiple joints on each limb or the trunk always move interactively; in particular, the parent joint drives each of its child joints to perform a chain of actions. This high-order and directional correlation transcends simple pairwise connections and provides profound kinematic characteristics, which are vital and instructive for human motion prediction [15]. To encode it, we first propose to represent the human skeleton as a static directed hypergraph (S-DHG) composed of several pre-defined hyperedges, in which the degree of hyperedges may exceed 2. In contrast to standard GCNs, hypergraphs can directly model the linkage of multiple joints in a practical action scenario, without indirect joint-by-joint reasoning [14,16,17]. Although the S-DHG describes the natural relationship among joints, it remains non-optimal: the motion patterns of different actions should in principle be depicted by different hypergraphs, whereas the S-DHG is fixed throughout network training.
For this purpose, we also propose a learnable dynamic directed hypergraph (D-DHG), in which the KNN and KMeans approaches are exploited to dynamically construct a suitable topology for different samples. As a supplement to the S-DHG, the proposed D-DHG improves model flexibility and allows the potential high-order directional correlations of human poses to be extracted. Finally, we integrate the D-DHG and S-DHG into a hybrid form, which presents a more elaborate and richer topological representation of the 3-dimensional (3D) skeleton pose, thus facilitating motion forecasting.
Our contributions are as follows: (a) We propose to represent the human skeleton as a hypergraph to capture high-order relations. To the best of our knowledge, this is the first attempt to exploit the hypergraph for human motion forecasting. (b) The proposed H-DHGCN is built on 2 core components, S-DHG and D-DHG, which together access the natural topology of the human skeleton and capture the potential high-order and directional correlations. (c) Experiments demonstrate that the proposed model consistently outperforms state-of-the-art methods on 3 human action benchmarks.
The rest of this paper is organized as follows. The "Related Work" section reviews related work on human pose forecasting and hypergraph learning. The "Methodology" section presents the proposed H-DHGCN in detail. The "Experiments" section reports the results, experimental analysis, and computation overhead. In the "Conclusion" section, we summarize the paper and discuss its limitations and future work.

Related Work

Human motion prediction
With the rapid development of deep learning technology, researchers have attempted various solutions to the problem of human motion prediction [18][19][20]. Since human motion is essentially sequential data, typical methods utilize variants of recurrent neural networks (RNNs) to extract temporal patterns of motion sequences pose-by-pose [21][22][23]. Although promising results have been achieved, RNNs can hardly extract the structural characteristics of the human skeleton. Besides, as mentioned in previous work [24,25], due to the problem of error accumulation, RNN variants often produce an obvious discontinuity between the first predicted frame and the last frame of the historical sequence, and may even converge to a static pose [26][27][28].
Recently, as an alternative, GCNs have been proposed for forecasting human motions. Mao et al. [9] first introduce GCNs to this task, where they consider the human pose to have an unrestricted topology. While impressive results have been reported, the omission of the meaningful human skeletal structure may produce distorted predictions. Cui et al. [11] propose to model the pairwise connections of adjacent joints and geometrically separated joints simultaneously. Although slight improvements have been achieved, they only consider the simple 2-order relationship of the human skeleton, which fails to reflect the complex situation of multi-joint interaction. Li et al. [10] develop a novel multiscale graph to comprehensively analyze the motion sequence; however, because the asymmetric relationship between the joints is not considered, it only produces sub-par results. Li et al. [10] observe that the motion of the human pose becomes more stable, and based on this, they extend [9,10] to a residual multiscale version to capture features from fine to coarse scale. Ma et al. [3] also propose to use GCNs as the basic building block, and further develop 2 key modules: spatial and temporal dense GCNs. Alternatively, Zhong et al. [29] present a spatial-temporal gating-adjacency GCN to encode the spatial-temporal relationship over diverse action categories, and utilize trainable adjacency to improve the generalization ability.
Despite achieving encouraging performance, GCNs are still not able to capture the high-order and directional correlations of the human skeleton. In contrast, our approach leverages the human topology with the pre-defined S-DHG, while the D-DHG allows the potential high-order relations among multiple joints to be analyzed, thus extracting meaningful context for predicting human motions effectively.

Hypergraph neural networks
Data structures in the real world frequently involve high-order correlations among multiple objects, or even more complicated relations [29][30][31][32]. Compared with GCNs, in which the degree of all edges is fixed to 2, a hypergraph can utilize hyperedges of arbitrary degree to express this higher-order interaction (beyond pairwise connections). If all edges in a hypergraph contain only 2 vertices, the hypergraph degenerates into an ordinary graph. Therefore, the hypergraph is essentially a generalization of the ordinary graph, which can theoretically achieve better performance than traditional GCN algorithms.
Feng et al. [12] propose a general hypergraph neural network, which is then applied to citation network classification, achieving remarkable results. Zhang et al. [33] introduce a self-attention hypergraph representation learning model to the task of outsider identification, which obtains significant improvements. Tran et al. [17] apply a directed hypergraph algorithm to page ranking, which considers the directionality of information transfer among nodes on the hyperedge. Our motivation is partly inspired by the above literature. Human activities exhibit high-order interactions and pronounced asymmetries among multiple joints, which are of particular significance in understanding the motion pattern. Our model focuses on these characteristics to extract refined and informative semantics, so as to obtain high-quality generation.
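The relation between hypergraphs and ordinary graphs described above can be illustrated with a minimal sketch. The function names and the 3-joint toy example are hypothetical and are not taken from the paper; the sketch only shows that a hyperedge may join more than 2 vertices and that a hypergraph whose edges all have degree 2 is an ordinary graph.

```python
# Hypothetical helper names; vertices are indexed 0..n_vertices-1.

def incidence_matrix(n_vertices, hyperedges):
    """Return the N x M incidence matrix H with H[v][e] = 1 if vertex v lies on hyperedge e."""
    H = [[0] * len(hyperedges) for _ in range(n_vertices)]
    for e, members in enumerate(hyperedges):
        for v in members:
            H[v][e] = 1
    return H

def edge_degrees(H):
    """Degree of each hyperedge = number of vertices it contains (column sums of H)."""
    return [sum(row[e] for row in H) for e in range(len(H[0]))]

def is_ordinary_graph(H):
    """A hypergraph degenerates into an ordinary graph when every edge has degree 2."""
    return all(d == 2 for d in edge_degrees(H))

# One hyperedge joining three leg joints at once (hip=0, knee=1, ankle=2) ...
leg_hypergraph = incidence_matrix(3, [{0, 1, 2}])
# ... versus the pairwise bones hip-knee and knee-ankle of an ordinary skeleton graph.
leg_graph = incidence_matrix(3, [{0, 1}, {1, 2}])
```

In the toy example, the single hyperedge has degree 3, whereas both edges of the bone graph have degree 2.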

Methodology

Problem statement and definition
Let $X_{-T:0} = [X_{-T}, \ldots, X_0]$ denote the historical human poses, where $X_t \in \mathbb{R}^{N \times D}$ expresses a frame at a certain timestamp with $N$ joints and $D = 3$ dimensions. Then, the predicted actions are defined as $\hat{Y}_{1:\Delta T} = [\hat{Y}_1, \ldots, \hat{Y}_{\Delta T}]$, and $Y_{1:\Delta T}$ denotes the corresponding ground truth. For 3D skeleton data, the high-order and directional relationships between joints reflect the vital and instructive motion patterns. To accurately capture these meaningful connections, we propose a hybrid directed hypergraph convolution to learn a refined mapping $\mathcal{F}: X_{-T:0} \rightarrow \hat{Y}_{1:\Delta T}$, to make $\hat{Y}_{1:\Delta T}$ as close to $Y_{1:\Delta T}$ as possible.
Concretely, the proposed approach mainly involves the following 2 components: S-DHG and D-DHG, to learn the specific high-order topology and the potential one, respectively. Next, we will illustrate the details.

Static directed hypergraph
In the context of human movements, all joints on the limbs or trunk are pulled to move together, with an explicit direction from the parent joint to the child joint. For example, when walking, the joints of the legs operate interactively: the hip drives the knee, and the knee in turn drives the ankle; the joints of the arms move similarly. To capture this complex relationship across multiple human joints, we divide the human skeleton into 6 intersecting subsets, i.e., 2 arms, 2 legs, 1 torso, and 1 head, as shown in Fig. 1A. Within each part, we establish several directed hyperedges. Each hyperedge is directed from a parent joint to its child joints to model the directionality. These parts are not isolated; they exchange messages through their intersections. Because a parent affects all of its child joints, each directed hyperedge has exactly 1 head, and its tail nodes are all the child joints of that head. Figure 1B provides the specific construction of the S-DHG.
Based on the above illustration, we represent the human pose as a directed hypergraph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, W)$. $\mathcal{V}$ is the set of $N$ vertices. $\mathcal{E}$ is the set of $M$ hyperedges, where the degree of each edge can be greater than 2 to construct high-order correlations. The diagonal matrix $W = \mathrm{diag}(w_1, w_2, \ldots, w_M) \in \mathbb{R}^{M \times M}$ assigns a weight to each edge. We write each hyperedge $e \in \mathcal{E}$ as $e = (e^{head}, e^{tail})$, where $e^{head}$ and $e^{tail}$ are called the head and tail of the hyperedge $e$, and $e^{head} \cup e^{tail}$ denotes the vertices on edge $e$. For each directed hyperedge $e$, we require $e^{head} \neq \emptyset$, $e^{tail} \neq \emptyset$, and $e^{head} \cap e^{tail} = \emptyset$.
Instead of the adjacency matrix of GCNs, the directed hypergraph can be expressed by 2 incidence matrices, denoted as $H^{tail}$ and $H^{head}$, where $H^{tail}$ represents the connections with the tail and $H^{head}$ represents the connections with the head. As shown in Fig. 1B, when a vertex $v_i$ belongs to the head of a hyperedge $e$, the corresponding entry of $H^{head}$ is 1; likewise, when it belongs to the tail of the hyperedge, the corresponding entry of $H^{tail}$ is 1. Mathematically, it can be formulated as:
$$H^{head}(v_i, e) = \begin{cases} 1, & v_i \in e^{head} \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
$$H^{tail}(v_i, e) = \begin{cases} 1, & v_i \in e^{tail} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$
The main principle of defining convolution on a directed hypergraph is that propagation should be performed along the information transfer direction of those vertices on the hyperedge.
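The head/tail incidence rule above can be sketched for a kinematic tree given as a parent array. This is a minimal illustration under stated assumptions: the 6-joint toy skeleton, the function names, and the vertex-by-hyperedge (N x M) storage layout are hypothetical and do not reproduce the joint layout or the body-part grouping of Fig. 1.

```python
def static_directed_hyperedges(parents):
    """Group joints into directed hyperedges (head, tails): one parent and all of its children."""
    children = {}
    for joint, parent in enumerate(parents):
        if parent >= 0:  # -1 marks the root joint, which has no parent
            children.setdefault(parent, []).append(joint)
    return [(head, tails) for head, tails in sorted(children.items())]

def directed_incidence(n_joints, hyperedges):
    """Return (H_head, H_tail), each stored as N x M nested lists of 0/1 entries."""
    m = len(hyperedges)
    H_head = [[0] * m for _ in range(n_joints)]
    H_tail = [[0] * m for _ in range(n_joints)]
    for e, (head, tails) in enumerate(hyperedges):
        H_head[head][e] = 1          # the single parent joint is the head
        for t in tails:
            H_tail[t][e] = 1         # every child joint is a tail
    return H_head, H_tail

# Toy skeleton (an assumption for illustration):
# 0 pelvis (root), 1 spine, 2 left hip, 3 right hip, 4 left knee, 5 right knee.
parents = [-1, 0, 0, 0, 2, 3]
edges = static_directed_hyperedges(parents)
H_head, H_tail = directed_incidence(len(parents), edges)
```

Here the pelvis heads one hyperedge whose tails are the spine and both hips (degree 4), while each hip heads a one-to-one hyperedge to its knee.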
To this end, inspired by [17], we directly develop the hypergraph convolution on the S-DHG, called the static directed hypergraph convolution network (S-DHGCN), expressed as Eq. 3, where $x_i^{(l+1)}$ is the output feature of the vertex $v_i$, $x_j^{(l)}$ is the latent code of the vertex $v_j$ at the $l$-th layer, $\sigma(\cdot)$ is the activation function, e.g., Mish [34], and $\Theta \in \mathbb{R}^{C^{(l)} \times C^{(l+1)}}$ is the learnable parameter. Equation 3 indicates that the message is passed from the single parent joint to its child joints.
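The above S-DHGCN can be simplified into a matrix form:
$$X^{(l+1)} = \sigma\left( (H^{head})^{T} W H^{tail} X^{(l)} \Theta \right) \tag{4}$$
where $X^{(l)} \in \mathbb{R}^{N \times C^{(l)}}$ and $X^{(l+1)} \in \mathbb{R}^{N \times C^{(l+1)}}$ are the input and output of the $l$-th layer.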
When multiple layers of Eq. 4 are stacked, training may become significantly unstable. To solve this, we propose to utilize asymmetric normalization for the incidence matrices. Let $D^{head}_v \in \mathbb{R}^{N \times N}$ and $D^{tail}_e \in \mathbb{R}^{M \times M}$ be the degree matrices of the head for the vertex $v$ and of the tail for the hyperedge $e$, respectively (Eqs. 5 and 6). Then, the normalized form of S-DHGCN is obtained:
$$X^{(l+1)} = \sigma\left( (D^{head}_v)^{-1} (H^{head})^{T} W (D^{tail}_e)^{-1} H^{tail} X^{(l)} \Theta \right) \tag{7}$$
where $(D^{head}_v)^{-1} (H^{head})^{T} W (D^{tail}_e)^{-1} H^{tail}$ is the directed hypergraph Laplacian. With the above operation, the hyperedge state is propagated from the vertices connecting it, and the tail nodes receive the information from the head joints. Then, the high-order and directional relationships of the human skeleton can be effectively captured in end-to-end training via a gradient-based optimizer.
For simplicity, we use the subscript $S$ to denote the above S-DHG, and let $\mathrm{Imp}^{head}_S = (D^{head}_v)^{-1} (H^{head})^{T} W$ be the convolution operator that updates head vertices in the S-DHG and $\mathrm{Imp}^{tail}_S = (D^{tail}_e)^{-1} H^{tail}$ be the corresponding operator for tail vertices. Finally, Eq. 7 can be rewritten as:
$$X^{(l+1)} = \sigma\left( \mathrm{Imp}^{head}_S \, \mathrm{Imp}^{tail}_S \, X^{(l)} \Theta \right) \tag{8}$$
Because the operators $\mathrm{Imp}^{head}_S$ and $\mathrm{Imp}^{tail}_S$ are computed from the pre-defined S-DHG, the structure is frozen throughout network training. Therefore, the S-DHGCN gains insight into the natural topology, and captures meaningful high-order and directional information.
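The two-stage, degree-normalized propagation behind the S-DHGCN can be illustrated with a generic sketch: vertex features are first gathered onto each hyperedge and then scattered to the vertices on its other side. This is not the authors' implementation. The function name, the N x M storage layout, and the mean normalization are assumptions; the activation $\sigma$ and the learnable $\Theta$ are omitted; and which incidence matrix plays the source role is left to the caller, since it depends on the index convention adopted for Eq. 7.

```python
def propagate(H_src, H_dst, weights, X):
    """Gather vertex features onto hyperedges via H_src, then scatter them to vertices via H_dst.

    H_src, H_dst: N x M nested lists of 0/1; weights: M hyperedge weights; X: N x C features.
    Each stage is normalized by the corresponding degree; zero degrees are skipped.
    """
    n, m, c = len(X), len(weights), len(X[0])
    # Stage 1: hyperedge state = mean feature of its source vertices.
    edge_state = [[0.0] * c for _ in range(m)]
    for e in range(m):
        src = [v for v in range(n) if H_src[v][e]]
        for v in src:
            for k in range(c):
                edge_state[e][k] += X[v][k] / len(src)
    # Stage 2: destination vertex = weighted mean of the hyperedges that reach it.
    out = [[0.0] * c for _ in range(n)]
    for v in range(n):
        total = sum(weights[e] for e in range(m) if H_dst[v][e])
        if total == 0:
            continue  # this vertex receives nothing (e.g., the root joint)
        for e in range(m):
            if H_dst[v][e]:
                for k in range(c):
                    out[v][k] += weights[e] * edge_state[e][k] / total
    return out

# One hyperedge: hip (0) is the head, knee (1) and ankle (2) are the tails.
H_head = [[1], [0], [0]]
H_tail = [[0], [1], [1]]
X = [[1.0, 2.0], [0.0, 0.0], [0.0, 0.0]]
out = propagate(H_head, H_tail, [1.0], X)  # the parent feature reaches both children
```

With the head set as the source, the feature of the hip is copied to the knee and ankle, while the hip itself receives nothing in this single step.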

Dynamic directed hypergraph
Nevertheless, the S-DHG structure remains non-optimal. The major reason is that the motion patterns of different actions should be depicted by different structures. In the S-DHG, the number of hyperedges and the nodes on each hyperedge are unchanged; hence, it is challenging to adaptively extract the dynamic correlation of complex actions. To solve this, we propose the novel D-DHG to adaptively describe the high-order structure of motion sequences.
Specifically, we construct the D-DHG using the following 2 steps: (a) Intuitively, in embedding space, the closer 2 nodes are, the higher their correlation, and the more reasonable it is to connect them to the same hyperedge. Therefore, for the feature $x_i$ of each human joint, we calculate its Euclidean distance $\mathrm{dis}(\cdot)$ to the features $x_j$ of the other joints, and use KNN to connect $v_i$ and its $k_m - 1$ nearest joints to the same hyperedge. We thus obtain $N$ hyperedges, each of which contains $k_m$ nodes. We denote the node set on each hyperedge as $\mathcal{S}_n$. (b) We then run KMeans on each $\mathcal{S}_n$ until convergence, which generates 2 disjoint subsets $\mathcal{S}_1$ and $\mathcal{S}_2$ with centroids $\mathrm{centroid}_{\mathcal{S}_1}$ and $\mathrm{centroid}_{\mathcal{S}_2}$. Of $\mathcal{S}_1$ and $\mathcal{S}_2$, the subset with the smaller average distance from its joints to its centroid is assigned to $e^{head}$, and the other provides the tail joints $e^{tail}$. Finally, the dynamic directed hyperedge $e_n = (e^{head}_n, e^{tail}_n)$ is established. The overall procedure of the D-DHG construction is described in Algorithm 1.
With the above algorithm, the proposed D-DHG is adaptively constructed, in which each directed hyperedge contains $k_m$ joints (divided into $k_n = 2$ categories, i.e., head or tail joints) to enhance the flexibility. Therefore, the motion patterns of various activities can be considered.
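The 2 construction steps of the D-DHG can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the paper's Algorithm 1: the function names, the deterministic farthest-point initialization of KMeans, the tie-breaking rules, and the toy 2-dimensional features are all hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def knn_group(features, i, k_m):
    """Step (a): joint i together with its k_m - 1 nearest joints."""
    others = sorted((j for j in range(len(features)) if j != i),
                    key=lambda j: dist(features[i], features[j]))
    return [i] + others[:k_m - 1]

def two_means(features, members, iters=20):
    """Step (b): split `members` into 2 clusters with plain KMeans."""
    c1 = features[members[0]]
    c2 = max((features[j] for j in members), key=lambda p: dist(p, c1))
    g1, g2 = members[:1], members[1:]
    for _ in range(iters):
        n1 = [j for j in members if dist(features[j], c1) <= dist(features[j], c2)]
        n2 = [j for j in members if j not in n1]
        if not n1 or not n2:
            break  # degenerate split: keep the previous assignment
        converged = (n1, n2) == (g1, g2)
        g1, g2 = n1, n2
        c1 = mean([features[j] for j in g1])
        c2 = mean([features[j] for j in g2])
        if converged:
            break
    return g1, g2

def spread(features, group):
    """Average distance from the joints of a cluster to its centroid."""
    c = mean([features[j] for j in group])
    return sum(dist(features[j], c) for j in group) / len(group)

def dynamic_hyperedge(features, i, k_m=5):
    """Directed hyperedge (head set, tail set) anchored at joint i."""
    g1, g2 = two_means(features, knn_group(features, i, k_m))
    # The tighter cluster is taken as the head set, the other as the tail set.
    if spread(features, g1) <= spread(features, g2):
        return sorted(g1), sorted(g2)
    return sorted(g2), sorted(g1)

# Toy 2-D joint features (made up for illustration).
features = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 0.0], [3.0, 0.5], [9.0, 9.0]]
head, tail = dynamic_hyperedge(features, 0)
```

For joint 0 with $k_m = 5$, the distant joint 5 is excluded by KNN, and the 3 tightly grouped joints form the head set while the 2 more dispersed joints form the tail set. Repeating the call for every joint yields the $N$ dynamic hyperedges.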
We then develop the dynamic directed hypergraph convolution network (D-DHGCN) on the D-DHG. Similar to the S-DHGCN, the update of the D-DHGCN can be directly expressed as:
$$X^{(l+1)} = \sigma\left( \mathrm{Imp}^{head}_D \, \mathrm{Imp}^{tail}_D \, X^{(l)} \Theta \right) \tag{9}$$
where $\mathrm{Imp}^{head}_D$ is the convolution operator of head nodes in the D-DHG, and $\mathrm{Imp}^{tail}_D$ is that of the tail nodes.

Implementation details
Our model combines 2 key components, namely, the S-DHGCN and D-DHGCN. These 2 components are stacked to form the H-DHGCN, an architecture that takes into account both the specific human topology and the implicit high-order relationships within human poses. We note that the H-DHGCN is dynamically constructed, so the number of hyperedges and the nodes on each hyperedge are not fixed. Therefore, it is capable of adaptively capturing the high-order and directional correlations of human poses. Moreover, the D-DHGCN is built with KMeans clustering ($k_n = 2$) and KNN ($k_m = 5$), which means that each hyperedge contains 1 head node and 4 tail nodes.
To capture the temporal dynamics inherent in motion sequences, we employ temporal convolutional networks (TCNs). The TCN layers are configured with a filter size of $f = 5 \times 1$, as depicted in Fig. 2. Our model consists of 9 residual blocks, each comprising an H-DHGCN layer and a TCN layer, both followed by batch normalization (BN). This design is aimed at extracting rich and refined representations from human action sequences. Additionally, 2 extra H-DHGCN layers are placed at the beginning and end of the network. Within each residual block, the parameter of the H-DHGCN layer is set to $\Theta \in \mathbb{R}^{512 \times 512}$. Both the H-DHGCN and TCN layers are activated using the Mish function [34], with a dropout rate of 0.25 to improve generalization.
For training, we employ the average $L_2$ distance as the loss function, consistent with previous works such as [9,35]. The predicted result $\hat{Y}_{1:\Delta T}$ is compared against the ground truth $Y_{1:\Delta T}$. We use the Adam optimizer [36] with an initial learning rate of 0.01, and a decay rate of 0.98 is applied every 2 epochs to facilitate convergence. Our model is trained across all motion categories, which yields an action-agnostic model. The entire implementation is carried out in the PyTorch framework.
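The learning-rate schedule stated above (initial rate 0.01, decay 0.98 every 2 epochs) can be written as a small function. The sketch assumes that epochs are counted from 0 and that the first decay takes effect at epoch 2; the text does not specify these details, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.98, step=2):
    """Step decay: the rate is multiplied by `decay` once every `step` epochs.

    Assumes 0-based epochs, so epochs 0 and 1 use base_lr and epoch 2 uses base_lr * decay.
    """
    return base_lr * decay ** (epoch // step)

# Learning rates for the first 6 epochs under these assumptions.
schedule = [learning_rate(e) for e in range(6)]
```

In PyTorch, the same behavior is typically obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.98)` stepped once per epoch.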

Datasets
We evaluate on 3 motion capture datasets: Human3.6M, CMU MoCap, and 3DPW MoCap. Human3.6M [37] includes 15 motion categories, namely walking, eating, smoking, discussion, direction, greeting, phoning, posing, purchases, sitting, sitting down, taking photo, waiting, walking dog, and walking together, performed by 7 actors. The Human3.6M dataset contains >1,000 min of motion capture data and is the largest publicly available dataset for human motion prediction. Each pose $X_i \in \mathbb{R}^{17 \times 3}$ is represented by the positions of 17 joints in Cartesian space. In our experiments, all the sequences are downsampled to 25 fps.
Consistent with [38,39], we select the actions of subject 5 (S5) as the test set and S11 as the validation set, and use the rest for training. In CMU MoCap [40], 8 action categories (basketball, basketball signal, directing traffic, jumping, running, soccer, walking, and wash window) are selected to report the result, where the validation set is unavailable and the test/train sets are consistent with previous papers [11,41]. Other preprocessing steps are the same as for the Human3.6M dataset. We also report the experimental results on the 3DPW MoCap dataset [42], where the officially recommended training, validation, and testing sets are used. The 3DPW MoCap dataset contains 60 video sequences (372 min) performed by 18 actors, each of which is annotated with 3D human poses but is not explicitly categorized by action type. The Human3.6M dataset is captured in a controlled environment, the CMU MoCap dataset is captured in an unconstrained environment, and the 3DPW dataset is captured in both indoor and outdoor scenarios. The proposed model and the baseline methods are evaluated on all 3 datasets, which cover different aspects and complement each other, and can therefore verify the effectiveness of our model more comprehensively.

Metrics
In our model, both the input and output are 3D coordinate-based skeleton data. Therefore, for position-based sequences, we first use the mean per joint position error (MPJPE) [37] to report the 3D error in millimeters. Moreover, the predicted positions are also converted into Euler angles, and the angle error of the resulting angle-based prediction is evaluated using the mean angle error (MAE). Besides, the predicted poses are also animated to investigate the qualitative performance.
Baselines. We select the following baselines to evaluate the performance of our model: dynamic multi-scale graph neural network (DMGNN) [10], learning trajectory dependency (LTD) [9], multi-scale residual graph convolution network (MSR) [43], and gradually generating better initial guess (PGBIG) [3]. DMGNN [10] exploits multi-scale GCNs to capture the spatial information of the motion sequences. LTD [9] converts the motion sequence to the frequency domain and then resorts to a fully connected GCN to predict future human actions. MSR [43] extends LTD to extract multi-scale features. PGBIG [3] generates a better initial guess of the target future pose to obtain high-quality future motions. For a fair comparison, our model is compared with their models re-trained using the released code or with the results published in their papers.
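The MPJPE metric described in this section can be sketched for a single frame as follows. The function name and the nested-list input format are assumptions for illustration; the Euler-angle conversion required for MAE is not covered.

```python
import math

def mpjpe(pred_frame, gt_frame):
    """Mean per joint position error of one frame.

    pred_frame, gt_frame: lists of [x, y, z] joint positions in millimeters.
    Returns the Euclidean error averaged over joints, also in millimeters.
    """
    errors = [math.dist(p, g) for p, g in zip(pred_frame, gt_frame)]
    return sum(errors) / len(errors)

# Two joints: the first is predicted exactly, the second is off by a 3-4-5 triangle (5 mm).
pred = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
gt = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
error_mm = mpjpe(pred, gt)
```

Averaging this per-frame value over test sequences at a given horizon gives the kind of number reported in the MPJPE tables.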

Comparisons on the Human3.6M dataset
Following the previous work [10,11], we first visualize the character animation of each time step for qualitative comparison. Besides, the MPJPE and MAE evaluation criteria are used for quantitative comparison, in which the angle error is calculated by transforming the predicted position-based sequence into angle space, while the 3D error is calculated directly. In all experiments, the length of the observed and predicted poses is 1,000 ms.
As shown in Fig. 3, we visualize the animation at each timestamp on the phoning activity from the Human3.6M dataset, in which the red dotted line separates observations from predictions. From top to bottom, we show the ground truth (GT), the generation of the GCN-based methods, i.e., LTD [9], DMGNN [10], MSR [43], and PGBIG [3], and that of the proposed H-DHGCN. The green dotted rectangles indicate unreasonable segments. For short-term prediction (first 10 predicted frames), the baselines and our approach achieve almost indistinguishable results, although the visualization of the proposed model is still slightly superior to that of the competing methods. With the extension of the predicted range, the superiority of the proposed H-DHGCN gradually appears. In particular, the legs and arms predicted by our H-DHGCN match the GT more closely in almost all scenarios. These observations support the qualitative advantage of our model.
Now, we quantitatively compare our approach with all the baselines in terms of MAE. Table 1 presents the numerical results on 4 representative activities. Specifically, we report the results at 80 ms, 160 ms, 320 ms, and 400 ms for short-term prediction, and the predicted pose at 1,000 ms is used for long-term prediction. Generally speaking, our method achieves better accuracy for both short-range and long-range predictions. Such a small error is hardly detectable by human eyes in character animations, which is consistent with the qualitative results discussed above.
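The reported horizons can be related to frame indices with a short calculation. The sketch assumes the 25 fps rate stated in the Datasets section, i.e., 40 ms per frame, and a 1-based count of predicted frames; the function name is hypothetical.

```python
FPS = 25  # frame rate after downsampling, as stated in the Datasets section

def frame_index(horizon_ms, fps=FPS):
    """1-based index of the predicted frame that lies `horizon_ms` after the last observed pose."""
    frames = horizon_ms * fps / 1000
    if frames != int(frames):
        raise ValueError("horizon does not fall on a frame boundary")
    return int(frames)

horizons = [80, 160, 320, 400, 1000]
indices = [frame_index(h) for h in horizons]  # [2, 4, 8, 10, 25]
```

Under these assumptions, the 400 ms horizon coincides with the 10th predicted frame, matching the short-term range used in the qualitative comparison, and the 1,000 ms horizon is the 25th frame.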
Compared with the typical recurrent models [19,21], GCNs are able to establish the spatial dependency of human joints explicitly. Although promising results have been achieved, human motion is a complex natural manifestation in which human joints show significant high-order correlation, whereas GCN-based approaches can only capture pairwise connections and have a low capability for high-order relations. Therefore, as shown in Table 1, those GCN-based methods only achieve subpar results. By contrast, our H-DHGCN relies on both the S-DHG and D-DHG, which efficiently extract complex high-order dependencies among multiple joints and dynamically analyze the potential human structure of different motion sequences. Moreover, it also considers the important asymmetric interaction among joints, which leads to better performance.
Next, we follow the previous work [9,11] and directly measure the mean 3D error on the predicted 3D coordinates. Table 2 provides a detailed comparison over all 15 activities of the Human3.6M dataset. From the results, we observe that the proposed H-DHGCN achieves the best result in almost all scenarios, even compared with the state-of-the-art methods. Although GCN-based baselines are capable of accessing the topology of the human skeleton, they can only establish the 2-order relationship of joint pairs; hence, inaccurate results may be obtained. By comparison, our approach captures the complex high-order correlation and directional connection of human joints simultaneously, thus yielding a higher-quality prediction.

Results on the CMU and 3DPW MoCap datasets
To fully investigate the proposed model, consistent with [3,31], we also report the 3D error in terms of MPJPE on both the CMU and 3DPW MoCap datasets, as shown in Tables 3 and 4. From the results, we observe that our approach outperforms the competitors for both short-term and long-term prediction. These experiments provide further evidence that our approach predicts future actions accurately.

Efficiency analysis
To evaluate the efficiency of our proposed H-DHGCN, we compare it with the latest approaches in terms of running time and number of parameters for long-term prediction (1,000 ms) on the Human3.6M dataset. The results are shown in Table 5.
From the results, we observe that our method has a smaller number of parameters, mainly because the hypergraph convolution is able to directly analyze the high-order spatial correlation among multiple joints, rather than only the 2 nodes connected by an edge in standard graph convolution. On the other hand, the proposed H-DHGCN achieves the second-best running time, which is slightly slower than that of LTD [9]. All experiments are run on a single NVIDIA GeForce RTX 3090 Ti GPU.

Ablation experiments
To study the effect of various aspects of our approach, we run ablation studies that measure the average MPJPE on the Human3.6M dataset.
1. Different representations of the human skeleton. In our model, we exploit 2 hypergraph structures to represent the human pose, namely the S-DHG and D-DHG. Moreover, we perform hypergraph convolution on them (S-DHGCN and D-DHGCN) to consider both the natural human topology and the potential high-order directed correlation. To verify the effectiveness of our S-DHGCN and D-DHGCN, we study the effect of retaining only one of them at a time. As reported in Table 6, the D-DHGCN brings more improvement than the S-DHGCN, and when both are introduced concurrently, a better result is achieved. Therefore, we conclude that simultaneous modeling of the specific high-order connections and the potential relations is beneficial for motion forecasting.
2. Standard hypergraph vs. directed hypergraph. In this work, our inspiration comes from the fact that human joints are typically activated by the parent joint, showing explicitly directed message passing. To verify this, we run an ablation study in which both the proposed S-DHGCN and D-DHGCN are replaced by their standard (undirected) versions. Except for expressing the human body as undirected hypergraphs, all other components remain unchanged. From Table 7, we observe that our directed hypergraph achieves a lower error, which indicates that the motion pattern of human joints involves directional information that is well captured by directed hypergraphs.
3. Effect of different $k_m$ in KNN. We utilize the KNN method to construct the D-DHG, where the number of human joints in each hyperedge is determined by $k_m$. Moreover, KMeans (with $k_n = 2$) is used to select the head and tail joints in the directed hypergraph. Because the cluster number of KMeans is fixed to $k_n = 2$, in this part we only vary $k_m$ to find its best configuration. As shown in Table 8, when $k_m < 5$, the performance declines. The influence of $k_m$ indicates that there is a threshold for the number of joints constituting a directed hyperedge. Notably, $k_m = 5$ achieves the best result, and larger values bring no further benefit.
4. Effect of different filter size $f$ of TCNs. We exploit TCNs to extract the temporal correlation between frames. To verify the influence of filter size, we choose $f \in \{3, 5, 7\}$. From Table 9, we observe that $f = 5$ achieves a balance between long-range and local temporal correlation and, accordingly, produces a better prediction.
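The trade-off between local and long-range temporal context in the filter-size ablation can be made concrete with a small calculation. The sketch assumes stride-1, undilated temporal convolutions and one TCN layer in each of the 9 residual blocks; stride and dilation are not stated in the text, and the function names are hypothetical.

```python
def same_padding(f):
    """Zero-padding per side that keeps the sequence length for an odd filter size f (stride 1)."""
    if f % 2 == 0:
        raise ValueError("expects an odd filter size")
    return (f - 1) // 2

def receptive_field(f, n_layers):
    """Frames seen by one output frame after stacking n_layers stride-1, undilated convolutions."""
    return 1 + n_layers * (f - 1)

# Temporal receptive field of 9 stacked TCN layers for the ablated filter sizes.
fields = {f: receptive_field(f, 9) for f in (3, 5, 7)}  # {3: 19, 5: 37, 7: 55}
```

Under these assumptions, $f = 3$ covers 19 frames, which is shorter than a 25-frame (1,000 ms) observation window, whereas $f = 5$ already covers 37 frames.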

Conclusion
In this work, we have proposed a novel H-DHGCN for predicting future human motions from historical observations. To achieve this, we construct 2 hypergraph structures of the human skeleton (S-DHG and D-DHG) to consider the specific and potential high-order correlations. In contrast to simplistic GCNs, our model flexibly extracts complex patterns of 3D skeleton-based poses and then establishes meaningful semantics. Moreover, with empirical experiments, we verify that the asymmetric (directional) relationship is conducive to human motion modeling as well as forecasting. We also demonstrate that the proposed H-DHGCN outperforms the state-of-the-art approaches for both short-horizon and long-horizon prediction. Our code will be publicly available.
Despite the promising results achieved by our H-DHGCN, there is still room for improvement. For example, the potential of hypergraph convolution in capturing the long-term correlation of human sequences needs to be explored. Moreover, we will consider reducing the computational cost of the dynamic hypergraph construction, which would benefit real-time applications of our model.

Fig. 1. (A) Human skeleton, divided into 6 parts: a head, a trunk, 2 arms, and 2 legs. (B) Static directed hypergraph (S-DHG). $e_\epsilon$ is the hyperedge formed by a single head and many tails (possibly one to one). (C) Dynamic directed hypergraph (D-DHG), where the dotted line denotes the dynamic hyperedge. It can be established adaptively according to the motion pattern of the specific sequence.

Fig. 2. Illustration of our temporal-spatial block consisting of the proposed H-DHGCN and TCN, where the H-DHGCN is formed by adding the outputs of the S-DHGCN and D-DHGCN to extract both the high-order human topology and the semantic directionality. In the S-DHGCN, the hypergraph points from the head to the tails, in the form of one to many, whereas in the D-DHGCN, it is many to many, constructed dynamically. Note that there are $N$ edges in the D-DHGCN, each of which contains $k_m = 5$ joints calculated by KNN. After that, KMeans is used to categorize these nodes into $k_n = 2$ clusters, where the solid circles are the head nodes, and the hollow circles are the tail nodes.

Fig. 3. Qualitative comparison of the predicted human poses under the phoning activity. Predicted human poses (with an interval of 40 ms) from 40 ms to 1,000 ms are shown. In each row, the underlying red skeletons are the ground truth, and the blue ones are the predicted results. The green boxes highlight the unreasonable predicted segments that are visually most distinct from the ground truth. From top to bottom, we show the results of DMGNN [10], LTD [9], MSR [43], PGBIG [3], and the proposed H-DHGCN. From the generated animations, we observe that the proposed H-DHGCN produces a more realistic visualization in almost all scenarios.

Table 1. Comparisons of MAE on 4 representative activities from the Human3.6M dataset, which are calculated by converting the predicted position into angle space. The best result is in boldface, and the second is underlined.

Table 2. MPJPE comparison per action on all 15 activities of the Human3.6M dataset. The best result is highlighted in boldface, and the second is underlined.

Table 3. MPJPE comparison per action on 8 activities of CMU MoCap. The best result is highlighted in boldface, and the second is underlined.

Table 4. Mean 3D error on the 3DPW MoCap dataset. The best result is highlighted in boldface, and the second is underlined.

Table 5. Analysis of the number of parameters and time overhead of different methods. The best result is highlighted in boldface, and the second is underlined.

Table 6. Impact of different hypergraph structures. The best result is highlighted in boldface, and the second is underlined.

Table 7. Standard hypergraph vs. directed hypergraph. The best result is highlighted in boldface. Columns: Hypergraph, 80 ms, 160 ms, 320 ms, 400 ms, 1,000 ms.

Table 8. Impact of different $k_m$ in our KNN. The best result is highlighted in boldface, and the second is underlined.

Table 9. Impact of different filter sizes $f$ in the TCNs. The best result is highlighted in boldface, and the second is underlined.