A Hybrid Deep Learning-Based Intelligent System for Sports Action Recognition via Visual Knowledge Discovery

Intelligent recognition systems for sports actions are in growing demand because they facilitate technical analysis for health management. Such systems rely heavily on deep analysis of frame-level image data from the perspective of visual knowledge discovery. In recent years, the rapid development of deep learning has driven a number of technical breakthroughs in computer vision. In this context, this work takes aerobics as its main object and proposes a hybrid deep learning-based intelligent system for sports action recognition via visual knowledge discovery. Specifically, the human skeleton is represented as a graph based on the physical structure of the human body, and a selective hypergraph convolution network is used to adaptively extract the multi-scale information in the skeleton. A selective-frame temporal convolution is then used to construct the recognition model. On the basis of this feature extraction, a triplet loss-based error measurement method is employed to construct the objective function, and a recurrent neural network structure is further developed to model the characteristics of dynamic action sequences. The data source of this article is mainly the private data compiled by the research group. Finally, experiments are carried out on the CMU motion capture dataset, and the effectiveness of the proposed algorithm is verified by comparing the experimental results with those of existing algorithms.


I. INTRODUCTION
In recent years, artificial intelligence technology has advanced rapidly, and a variety of intelligent devices have emerged [1]. The intelligent processing of all kinds of information in daily life has shown a trend of diversification [2]. In daily life, people's communication is not limited to spoken language; body language is also a very direct and efficient way of communicating [3]. Thus, it is of great significance to recognize aerobics movements [4].
For aerobics as an exercise practice, theoretical research is relatively underdeveloped [5], and the degree of integration with artificial intelligence is low [6]. The improvement of athletes relies on traditional empirical exercises and lacks research on the essential characteristics of the sport, the basic laws of sports technology development, and the main factors affecting sports performance [7]. Since physical training is the basis of athletes' training, athletes' physical fitness and sports technique promote and influence each other [8]. Only when physical training is done well can athletes achieve stronger sports skills. The physical training of track and field athletes must combine theory and practice: theoretical knowledge guides and leads practical training activities, so as to ensure the best effect of physical training [9]. Therefore, in aerobics, analyzing the behavioral posture of the human body and establishing an action recognition model based on convolutional neural networks is the basis of scientific training [10]. The progress of scientific research and the development of sports programs promote each other in a dynamic process, and a lack of theoretical research will certainly become a barrier that restricts the development of sports technology [11]. Therefore, quantifying the indicators of aerobics and integrating new technologies into the sports program will be the future development trend [12].
The associate editor coordinating the review of this manuscript and approving it for publication was Laura Celentano.
From a business point of view, the value lies in the fact that, if the human pose in a given image can be acquired quickly and accurately [13], it can be used on a real-time platform to analyze the human state from the behavior obtained by pose recognition, thus playing a role in human monitoring. With a recognition and analysis system based on deep learning algorithms, the embedded dimension value of the aerobics dynamic poses detected by the system host shows a certain upward trend, but it can be well controlled whether or not the poses are simple. Its average value is always lower than the value detected by the spatio-temporal weight gesture motion feature recognition method, which strongly promotes the accurate capture of the dynamic posture of aerobics [14]. In terms of recognition time, as the value of the joint angle increases, the deep learning algorithm can always effectively control the time required to accurately identify the pose data of aerobics. This plays an important role in many commercial scenarios [15]. The main contributions of this paper can be summarized in three aspects:
• The research issue of computational intelligence-based aerobics action recognition is discussed and put forward.
• A hybrid deep learning-based intelligent system via visual sensing is proposed for this purpose.
• Some experiments are conducted to evaluate the proposed recognition method.

II. RELATED WORK
At present, theoretical research on aerobics is mostly discussed from the aspects of aerobics teaching and the influence of aerobics on the human body [16]. Beyond basic teaching research, academic research on aerobics focuses, on the one hand, on the effects of aerobics on morphological function, psychological quality, female body shape, and other physical aspects and, on the other hand, on the rules and choreography of aerobics itself. Most of these studies are qualitative and relatively superficial, and research on the assessment of aerobics-specific athletic ability and movement technique is very rare [17]. In the area of action recognition, many scholars have discussed the recognition of aerobics actions over the last decade. The research methods fall into two types: traditional methods and deep learning methods [18]. Traditional behavior recognition methods rely on feature extractors that are designed by hand, through manual observation, to characterize the action. Because they require manual parameter setting, they consume a lot of human and material resources and have a low accuracy rate.
Deep learning methods can accomplish end-to-end motion recognition with high accuracy and without a lot of manual labor [19]. Applying deep learning to moving-target detection in video can effectively describe visual features of the target, such as appearance, structure, and color, in order to localize it. In the early stage of research, motion recognition methods based on hand-crafted features converted feature information into feature vectors, which were then input to a classifier for learning and training. Hand-crafted features have two major drawbacks: they have difficulty recognizing complex sequential features, and they cannot adequately reflect the spatial and temporal characteristics of aerobics. In recent years, deep learning methods have been gradually applied to the detection and classification of images and videos, and the field of motion recognition has developed rapidly as a result [20].
The traditional spatio-temporal weight gesture motion feature recognition method determines the time-series modeling relationship between dynamic nodes from the spatio-temporal features of human body pose video frames, and then uses big data technology to recognize the associated nodes on demand. However, this method has poor accuracy in capturing the dynamic poses of aerobics, and because of its relatively large amount of computation, the recognition time for pose data tends to grow excessively. In essence, a deep learning network simulates the nervous system of the human brain and contains multiple hidden layers. In recent years, deep learning techniques have shown powerful modeling capabilities in computer vision and natural language processing, and with the availability of large amounts of skeletal data, they have attracted the attention of many scholars [21]. According to the deep learning technique used, these methods are classified into the following categories: recurrent neural networks (RNNs), including long short-term memory (LSTM) networks; convolutional neural networks (CNNs); and graph convolutional networks (GCNs).

III. AEROBICS MOVEMENT RECOGNITION MODEL BASED ON CONVOLUTIONAL NEURAL NETWORK
A. PROBLEM STATEMENT
In aerobics competition, movements can be divided into power strength, jumping, kicking, and static strength according to the type of strength involved. In this paper, we selected the most suitable aerobics movements from all the difficult movements of competitive aerobics for the construction of the movement recognition model [22]. Taking aerobics as an example, the movements of the human body can be regarded as a series of pose data that appear over time. Compared with other methods, the special kinematic feature model of the human skeleton is able to describe how the posture changes. A total of 100 test athletes performed 30 difficult aerobics movements, and five experts in the field of aerobics scored the complexity of the movements. For the nth movement of the mth athlete, let the scores of the five experts, ordered from highest to lowest, be $a_1, a_2, a_3, a_4, a_5$. After removing the highest and the lowest score, the gymnast's movement score can be expressed as $h_j = (a_2 + a_3 + a_4)/3$.
The overall score Z for this aerobic gymnast is then calculated from the movement scores using the weight factors $l_i$. The final aerobics movement assessment system determined according to the expert scoring is shown in Table 1, and these 10 movements are selected as the main objects of aerobics movement identification and evaluation in this paper.
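The scoring rule above is a trimmed mean, which the following minimal sketch illustrates. The function `movement_score` follows the stated formula; `overall_score` assumes that Z is a weighted sum of the movement scores, which is only one plausible reading because the expression for Z and the weights $l_i$ are not reproduced here. All names and example values are illustrative.

```python
def movement_score(expert_scores):
    """Trimmed mean of five expert scores: drop one highest and one lowest."""
    if len(expert_scores) != 5:
        raise ValueError("expected exactly five expert scores")
    ordered = sorted(expert_scores, reverse=True)  # a1 >= a2 >= ... >= a5
    middle = ordered[1:-1]                         # a2, a3, a4
    return sum(middle) / 3.0


def overall_score(movement_scores, weights):
    """ASSUMPTION: Z is taken as a weighted sum of the movement scores.

    The text only states that l_i is a weight factor, so this form and the
    example weights used with it are hypothetical.
    """
    if len(movement_scores) != len(weights):
        raise ValueError("one weight per movement score is required")
    return sum(l * h for l, h in zip(weights, movement_scores))
```

For example, the expert scores 9.0, 8.5, 8.0, 7.5, and 6.0 give a movement score of (8.5 + 8.0 + 7.5)/3 = 8.0.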

B. ACTION RECOGNITION ALGORITHM BASED ON SELECTIVE HYPERGRAPH CONVOLUTIONAL NETWORK
In skeleton-based motion recognition, previous methods treat the skeleton as a pseudo-image or a sequence, and then use convolutional or recurrent neural networks to extract motion characteristics [23]. This paper instead represents the human skeleton as a graph based on the physical structure of the body, which is a more natural representation. A method based on the description of key features can better identify continuous and interactive actions and is strongly robust.
The left panel of Figure 1 shows the human skeleton modeled as a simple graph with joint-scale information. The two panels on the right are the human skeleton graphs used in this paper. The method introduces hypergraphs containing multi-scale information to compensate for the limited ability of appearance features and motion features to represent behaviors. It uses convolutional neural networks to automatically extract features of the target in the image, which eliminates the instability of manually labeled features and also extracts deep features of the target, improving recognition accuracy. In this section, a selective hypergraph convolutional network is chosen to extract the multi-scale information in the skeleton adaptively. In addition, the model can selectively aggregate temporal keyframe features, thus compensating for keyframes that would otherwise be dropped. Its structure is shown in Figure 3.
In this paper, we propose to represent the human skeleton as a hypergraph, which does not destroy the inherent spatial properties between joints and also maintains the higher-order correlations of the human skeleton in the model. In a selective hypergraph convolutional network, a hyperedge can connect more than two joints, and the advantage of such a hypergraph is that the higher-order correlations between joints can be easily captured. Using the hypergraph convolution operator, graph neural networks can be easily extended to other models and applied to various non-pairwise relations [24].
The hypergraph convolutional network architecture consists of eight selective spatio-temporal hypergraph convolutional blocks and a fully connected classifier with global average pooling. Every convolutional layer is followed by a batch normalization layer and a ReLU activation layer. Multi-scale information is essential for skeleton recognition, yet most spatio-temporal graph convolutional networks perform poorly in this respect because they focus only on single-scale information. Therefore, this paper uses the proposed selective hypergraph convolution (SHC) network to capture multi-scale contexts and to aggregate information from multiple scales through adaptive receptive fields. As shown in Figure 4, the automatic selection mechanism of the scale-selective hypergraph convolution network consists of three stages.
VOLUME 11, 2023 L. Zhao: Hybrid Deep Learning-Based Intelligent System for Sports Action Recognition
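To make the idea of a hyperedge joining several joints concrete, the sketch below applies one hypergraph convolution to a toy skeleton. The operator is not written out in the text, so this assumes the commonly used normalized form $\mathrm{ReLU}(D_v^{-1/2} H W D_e^{-1} H^{\top} D_v^{-1/2} X \Theta)$, with $H$ taken as the joint-to-hyperedge incidence matrix. The five-joint skeleton, its three hyperedges, and all function names are hypothetical.

```python
import numpy as np


def hypergraph_propagation(H, edge_weights=None):
    """Normalized propagation matrix Dv^-1/2 H W De^-1 H^T Dv^-1/2.

    H is a (V, E) incidence matrix: H[v, e] = 1 if joint v lies in hyperedge e.
    Every joint must belong to at least one hyperedge.
    """
    H = np.asarray(H, dtype=float)
    num_edges = H.shape[1]
    w = np.ones(num_edges) if edge_weights is None else np.asarray(edge_weights, dtype=float)
    vertex_degree = H @ w          # weighted number of hyperedges per joint
    edge_degree = H.sum(axis=0)    # number of joints per hyperedge
    M = H / np.sqrt(vertex_degree)[:, None]   # Dv^-1/2 H
    return (M * (w / edge_degree)[None, :]) @ M.T


def hypergraph_conv(X, H, Theta, edge_weights=None):
    """One layer: propagate joint features over hyperedges, project, apply ReLU."""
    A = hypergraph_propagation(H, edge_weights)
    return np.maximum(A @ X @ Theta, 0.0)


# Toy skeleton (hypothetical): 5 joints, 3 hyperedges.
# e0 = {head, torso}, e1 = {torso, left arm, right arm}, e2 = {torso, legs}
H_TOY = np.array([
    [1, 0, 0],   # head
    [1, 1, 1],   # torso
    [0, 1, 0],   # left arm
    [0, 1, 0],   # right arm
    [0, 0, 1],   # legs
])
```

Because hyperedge e1 contains three joints, a single layer already mixes the features of both arms with the torso, which a pairwise graph edge cannot do in one step.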
The process is as follows. First, the input $Y \in S^{C \times T \times V}$ is fed to the three branches, each of which applies a hypergraph convolution, where σ represents the error parameter, X represents the input matrix, and H represents the height matrix. In this paper, the number of nodes in the hypergraph does not change after the hypergraph convolution operation, so the hypergraph convolution features obtained at different scales can be aggregated smoothly [25]. The three branch features are aggregated by element-wise summation. The global contextual information is then collected by global average pooling over the spatial and temporal dimensions in sequence, so that all information in the feature map is averaged without losing too much critical information. A fully connected layer with a nonlinearity is then used to make the selective weights more adaptive and to reduce the feature dimensionality. After multi-layer convolution and pooling of the combined features, valid features are selected by a fully connected layer to construct a mapping to the output.
Then, this paper uses soft attention along the channel dimension of the three branches to adaptively select information at different scales. The features of each channel are reweighted to complete the feature rescaling on the channel dimension, giving the model a better ability to discriminate the features of each channel. The soft attention in the three branches does not share weights and is implemented through a fully connected layer with softmax normalization, which yields the selective weights assigned to the feature maps at different scales. Finally, the selective feature maps are obtained by weighting the features at each scale with their selective weights (Eq. (7)).
In addition, the Non-local module is integrated in this paper to obtain long-range information. Given the feature map $Y_{in} \in S^{C \times T \times V}$, the scale-selection context information is obtained first, and the Non-local module is then used to capture long-range dependencies. In this way, the model obtains more semantic information, which leads to better performance. The algorithm incorporates multiple features to ensure accuracy and improves efficiency by using multi-scale pictures and a filtering process [26]. The original input is first fed into two embedding functions (e.g., λ and ξ) to obtain the encoded features, the element-wise product of the encoded features gives the attention matrix, and higher-order features are then obtained using an aggregation function.
This paper also proposes frame selection time convolution, a selective convolution mechanism based on key frames that replaces strided convolution and adaptively selects the more important frames.
A comparison of traditional strided time convolution and frame-selective time convolution is shown in Figure 5. In this paper, the frame selection time convolution consists of three branches: the frame importance calculation branch, the frame feature aggregation branch, and the residual connection branch. Using only the frame feature aggregation branch and the residual connection branch recovers the original time convolution, which keeps the design consistent with previous methods; adding the frame importance calculation branch allows the time-series frames to be pooled selectively. Combining the advantages of time series models to construct a combined model can improve the prediction accuracy.
Next, the entire selection mechanism is described in Algorithm 1. It is worth noting that the original input features are also selectively pooled to obtain the final result of the residual branch.
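The aggregation steps above, together with the softmax-normalized selective weights introduced in the following paragraph, can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the three branch outputs are taken as given, and the weight matrices `W_reduce` and `W_select`, as well as the function names, are hypothetical.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def select_scales(branches, W_reduce, W_select):
    """Fuse K branch feature maps of shape (C, T, V) with per-channel soft attention.

    W_reduce: (d, C) shared fully connected layer that reduces the dimensionality.
    W_select: (K, C, d) one fully connected layer per branch (weights not shared).
    """
    stacked = np.stack(branches)                  # (K, C, T, V)
    fused = stacked.sum(axis=0)                   # element-wise summation
    context = fused.mean(axis=2).mean(axis=1)     # pool over joints, then over frames -> (C,)
    z = np.maximum(W_reduce @ context, 0.0)       # fully connected layer with ReLU -> (d,)
    logits = W_select @ z                         # (K, C)
    weights = softmax(logits, axis=0)             # weights sum to 1 across the K scales
    selected = (weights[:, :, None, None] * stacked).sum(axis=0)
    return selected, weights
```

Since the weights for each channel sum to one across the branches, the output stays on the same scale as the individual branch features, and a channel can lean almost entirely on one scale when its logit dominates.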
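The three-branch structure can be illustrated with the sketch below. Algorithm 1 is not reproduced in the text, so the details here are assumptions: frame importance is scored by a hypothetical linear map `w_score` applied to joint-averaged features, the top T/stride frames are kept in temporal order, and the aggregated and residual branches are added for those frames.

```python
import numpy as np


def temporal_conv(X, kernel):
    """Temporal convolution with zero padding; X is (C, T, V), kernel has odd length."""
    k = len(kernel)
    pad = k // 2
    padded = np.pad(X, ((0, 0), (pad, pad), (0, 0)))
    T = X.shape[1]
    return sum(kernel[i] * padded[:, i:i + T, :] for i in range(k))


def frame_selective_tcn(X, kernel, w_score, stride=2):
    """Keep the T // stride most important frames instead of every stride-th frame."""
    C, T, V = X.shape
    keep = T // stride
    # Frame importance branch (hypothetical scoring rule): one score per frame.
    scores = w_score @ X.mean(axis=2)                       # (T,)
    top = np.argsort(-scores, kind="stable")[:keep]
    idx = np.sort(top)                                      # restore temporal order
    # Frame feature aggregation branch, pooled at the selected frames.
    aggregated = temporal_conv(X, kernel)[:, idx, :]
    # Residual branch: the original input is pooled at the same frames.
    residual = X[:, idx, :]
    return aggregated + residual, idx
```

With `stride=2` the output has half as many frames as the input, as a strided convolution would, but the retained frames are chosen by importance rather than by position.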

IV. AEROBICS MOVEMENT EVALUATION MODEL BASED ON DEEP METRIC LEARNING
Aerobics movements often vary across sequences, which means that a given pose may appear at different moments in sequences representing the same movement [27]. When comparing body postures, an interesting question is how to evaluate the similarity of two postures and action sequences. As shown in Figure 6, the first two rows of actions are both walking, but both the L2 and DTW metrics consider the unrelated ''standing'' sequence (bottom) to be more similar than the semantically related ''walking'' sequence (top).
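For reference, the two baseline metrics mentioned above can be sketched as follows. Both compare raw pose values only: the L2 distance compares frames at identical time steps, while dynamic time warping (DTW) first aligns the sequences in time. Neither uses any semantic label, which is the limitation the learned metric is meant to address. The function names and the toy one-dimensional sequences in the note below are illustrative.

```python
import numpy as np


def l2_distance(a, b):
    """Euclidean distance between two sequences of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(((a - b) ** 2).sum()))


def dtw_distance(a, b):
    """Classic dynamic time warping cost with Euclidean frame distance."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible alignments.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])
```

For the sequences [0, 1, 2, 3, 3] and [0, 0, 1, 2, 3], which trace the same motion with a one-frame delay, the L2 distance is nonzero while the DTW cost is zero, because DTW absorbs the delay.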

A. LOSS FUNCTION
A loss function is used in the similarity test so that the network reduces the distance between the anchor and the positive samples while increasing the distance to the negative samples [28]. The conventional triplet loss function is computed on triplets sampled in this way and uses a margin as the boundary distance. Denoting the aerobics sequences of a triplet by $Y_1$, $Y$, and $Y_{-1}$, the triplet loss penalizes a triplet whenever the anchor is not closer to the positive sample than to the negative sample by at least the margin b. The margin b is not easily determined, and this problem can be solved by NCA (Neighbourhood Components Analysis), in whose loss function D denotes all categories except the positive samples. Ideally, when iterating over the triplets, samples from the same category are expected to be grouped in the same cluster in the corresponding embedding space. However, finding all possible triplets is very time-consuming and impractical, and the model should be trained on only a small number of meaningful triplets. Metric learning using triplet networks became more popular due to Google's FaceNet, which uses a triplet loss to learn an embedding space of face images so that embeddings of similar faces are closer together and embeddings of different faces are farther apart. For face recognition, positive images are images of the same person as in the anchor image, while negative images are images of people randomly selected from the mini-batch. However, our case does not have a classification that allows easy selection of positive and negative instances. Although it would be costly to find hard-to-score samples in the dataset, it is possible to select them as negative samples. As the parameters are continuously updated, the current positive and negative samples move further apart and may also move closer to other categories, which can shift the whole cluster. This effect is measured by a distance between the two distributions q and r,
where y and y′ are independently and identically distributed sequences drawn from q, z and z′ are independently and identically distributed sequences drawn from r, and h denotes the kernel function. Replacing the expected values with averages over given samples yields an empirical estimate, where $Y = \{y_1, y_2, \cdots, y_n\}$ is the sample set drawn from the q-distribution and $Z = \{z_1, z_2, \cdots, z_n\}$ is the sample set drawn from the r-distribution, and this estimate measures the distance between the two distributions.
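The quantities discussed in this subsection can be sketched as follows. The formulas are standard textbook forms and are assumptions about what the omitted equations contain: a margin-based triplet loss on squared Euclidean distances, an NCA-style loss whose denominator runs over the negative samples only, and the biased empirical estimate of the squared maximum mean discrepancy (MMD). The Gaussian kernel and its bandwidth are also assumptions, since the kernel h is not specified in the text.

```python
import numpy as np


def sq_dists(A, B):
    """Pairwise squared Euclidean distances between rows of A (n, d) and B (m, d)."""
    diff = A[:, None, :] - B[None, :, :]
    return (diff ** 2).sum(axis=-1)


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    d_pos = ((anchor - positive) ** 2).sum()
    d_neg = ((anchor - negative) ** 2).sum()
    return float(max(0.0, d_pos - d_neg + margin))


def nca_loss(anchor, positive, negatives):
    """-log( exp(-d_pos) / sum_n exp(-d_neg_n) ); needs no margin."""
    d_pos = ((anchor - positive) ** 2).sum()
    d_neg = ((negatives - anchor) ** 2).sum(axis=1)
    m = (-d_neg).max()                                  # log-sum-exp stabilization
    log_sum = m + np.log(np.exp(-d_neg - m).sum())
    return float(d_pos + log_sum)


def gaussian_kernel(A, B, bandwidth=1.0):
    """ASSUMED kernel h; the text does not state which kernel is used."""
    return np.exp(-sq_dists(A, B) / (2.0 * bandwidth ** 2))


def mmd2(Y, Z, bandwidth=1.0):
    """Biased estimate of MMD^2: mean h(y, y') - 2 mean h(y, z) + mean h(z, z')."""
    k_yy = gaussian_kernel(Y, Y, bandwidth).mean()
    k_yz = gaussian_kernel(Y, Z, bandwidth).mean()
    k_zz = gaussian_kernel(Z, Z, bandwidth).mean()
    return float(k_yy - 2.0 * k_yz + k_zz)
```

The estimate `mmd2(Y, Z)` is close to zero when the two sample sets come from the same distribution and grows as the distributions separate, which is why it can serve as a set-level distance alongside the per-triplet losses.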

B. TWO-LAYER RECURRENT NEURAL NETWORK STRUCTURE
Based on the triplet loss function of Formula 15, a convolutional recurrent neural network structure incorporating a self-attention mechanism is designed and implemented in this section, and the structure is shown in Figure 7. The data captured from aerobics is a time series in which earlier and later data are correlated with the current data in time and space [29]. The bidirectional Gated Recurrent Unit (GRU) structure learns to model the implicit relationships along the time dimension. SA_MMD_NCA consists of two layers of bidirectional GRU units; every layer has t GRUs, corresponding to the dimension of the input data, and the structure of the GRU units is shown in Figure 8. Given a motion sequence of length t and an input $x_t$, each GRU cell contains an update gate z, and its parameters are represented by the corresponding weight matrices $X_s$, $X_a$, $X_i$, and $X_p$. The SA_MMD_NCA pseudo-code is shown in Algorithm 2.
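The recurrent backbone can be illustrated with the sketch below, which stacks bidirectional GRU layers in plain NumPy. It uses the standard GRU equations with an update gate, a reset gate, and a candidate state; the parameter names are illustrative and do not correspond to the matrices $X_s$, $X_a$, $X_i$, $X_p$, whose roles are not spelled out in the text. The self-attention part of SA_MMD_NCA is not included.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(input_dim, hidden_dim, rng):
    """Random parameters for one GRU direction: (W, U, b) per gate."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {gate: (mat(hidden_dim, input_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim))
            for gate in ("update", "reset", "candidate")}


def gru_step(x, h, p):
    """One standard GRU update for input x and previous hidden state h."""
    Wz, Uz, bz = p["update"]
    Wr, Ur, br = p["reset"]
    Wh, Uh, bh = p["candidate"]
    z = sigmoid(Wz @ x + Uz @ h + bz)               # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)               # reset gate
    h_new = np.tanh(Wh @ x + Uh @ (r * h) + bh)     # candidate state
    return (1.0 - z) * h + z * h_new


def run_gru(xs, p, reverse=False):
    """Run a GRU over a (T, input_dim) sequence; returns (T, hidden_dim)."""
    hidden_dim = p["update"][2].shape[0]
    h = np.zeros(hidden_dim)
    out = np.zeros((len(xs), hidden_dim))
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    for t in order:
        h = gru_step(xs[t], h, p)
        out[t] = h
    return out


def bidirectional_gru(xs, p_fwd, p_bwd):
    """Concatenate forward and backward hidden states at every time step."""
    return np.concatenate([run_gru(xs, p_fwd), run_gru(xs, p_bwd, reverse=True)], axis=1)


def stacked_bigru(xs, layers):
    """Apply several bidirectional layers; layers is a list of (p_fwd, p_bwd) pairs."""
    out = xs
    for p_fwd, p_bwd in layers:
        out = bidirectional_gru(out, p_fwd, p_bwd)
    return out
```

For a two-layer model, the second layer must be initialized with an input dimension of twice the hidden size, because it receives the concatenated forward and backward states of the first layer.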

C. CMU DATASET EXPERIMENTAL RESULTS
To demonstrate the effectiveness of the algorithm, experiments are conducted on the CMU motion capture dataset in this paper. Eleven classes of the CMU motion capture data are selected as the training set and 10 classes are selected as the test set. Since training with sequences of different sizes slows down the training process, this experiment uses a fixed sequence length for training, dividing the action sequence into segments of 120 consecutive frames and leaving 20-frame gaps. Model performance is evaluated using the false positive rate (FPR) at different levels of recall, that is, of the true positive rate (TPR). The results on the CMU dataset compared with different models are shown in Table 2.
As seen in Table 2 and Figure 9, SA_MMD_NCA obtains a lower FPR at different levels of TPR. At a TPR of 80%, the method in this section improves the FPR by nearly 19% compared to the first four methods and by about 5% compared to the three deep learning models. The results show that the proposed method models the data more accurately and performs better, which illustrates its effectiveness and superiority over the selected methods.
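The fixed-length segmentation can be sketched as follows. The wording admits more than one reading, so this sketch assumes that consecutive 120-frame windows are separated by 20 skipped frames, giving a step of 140 frames between window starts; the function name and the choice to drop incomplete trailing windows are also assumptions.

```python
import numpy as np


def split_into_windows(sequence, window=120, gap=20):
    """Cut a (T, D) sequence into fixed-length windows separated by skipped frames.

    ASSUMPTION: the gap is the number of frames skipped between two windows.
    Trailing frames that do not fill a whole window are dropped.
    """
    seq = np.asarray(sequence)
    step = window + gap
    return [seq[start:start + window] for start in range(0, len(seq) - window + 1, step)]
```

Under this reading, a 400-frame recording yields three windows that start at frames 0, 140, and 280.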
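The evaluation measure used above, FPR at a fixed TPR, can be computed as in the sketch below. It assumes that the model outputs a distance for each pair of sequences, that similar pairs are the positives, and that a pair is accepted when its distance does not exceed a threshold; the function name and the example distances are illustrative and are not results from Table 2.

```python
import numpy as np


def fpr_at_tpr(pos_dist, neg_dist, tpr=0.8):
    """False positive rate at the smallest threshold that reaches the target TPR.

    pos_dist: distances of pairs that should match (smaller is better).
    neg_dist: distances of pairs that should not match.
    """
    pos = np.sort(np.asarray(pos_dist, dtype=float))
    neg = np.asarray(neg_dist, dtype=float)
    # Number of positive pairs that must be accepted (small tolerance for rounding).
    k = max(1, int(np.ceil(tpr * len(pos) - 1e-9)))
    threshold = pos[k - 1]
    return float((neg <= threshold).mean())
```

A lower value is better: it means fewer unrelated sequence pairs fall inside the distance threshold needed to recover the requested share of related pairs.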

V. CONCLUSION
In this paper, an aerobics action evaluation system is constructed, and an aerobics action recognition model based on convolutional neural networks is established. A selective-frame temporal convolution (frame selection time convolution) is proposed, which consists of three branches: the frame importance calculation branch, the frame feature aggregation branch, and the residual connection branch. In addition, this paper establishes an aerobics action evaluation model based on deep metric learning. A measurement method based on triplet loss and maximum mean discrepancy is proposed, and the action sequence measurement model based on the MMD-NCA recurrent neural network architecture is described in detail. Finally, experiments are carried out on the CMU motion capture dataset, and the effectiveness of the proposed algorithm is verified by comparing the experimental results with those of existing algorithms.
The aerobics dynamic posture recognition and analysis system designed in this paper can control the adjustment of the RBM coefficients by integrating dynamic posture data, and then define a specialized human skeleton model according to the principle of feedback fine-tuning. Because of the structure of the behavior recognition dataset, the results of signal feature extraction directly affect the ability of the system host to capture pose data.