Multiperson Interactive Activity Recognition Based on Interaction Relation Model

Multiperson activity recognition is a pivotal branch as well as a challenging topic of human action recognition research. This paper adopts a hybrid learning model to capture the spatio-temporal and occlusion relationships among multiple people. Initially, this paper builds an active multiperson interaction relationship estimation framework to capture interpersonal spatio-temporal relations. This model incorporates the interaction relationship estimation framework with the multiperson relationship network. On this ground, it automatically learns from the human interaction dataset in an end-to-end manner and performs reasoning with standard matrix operations. Secondly, this paper proposes an adaptive occlusion state behavior recognition method derived from the semantic knowledge model to address the problems of occlusion and self-occlusion in human action recognition. Then, Petri Nets are used to recognize multiperson interactive actions. The model has been evaluated through extensive experiments on the TV interaction dataset, the Vlog dataset, the AVA dataset, and the MLB-YouTube dataset; experimental results show that its recognition performance is superior to that of the other available models. This paper summarizes and looks ahead to the estimation framework of the interaction relationship and the occlusion semantic-knowledge relationship. Experimental results suggest that the proposed method can capture discriminative relation information for multiperson interactive activity recognition, which further validates the efficiency of the hybrid learning model.


Introduction
Human interactive activity involves social behaviors and interactive behaviors. The former refers to individual behaviors that nevertheless take other individuals' behaviors into account; the latter refers to group behaviors directed toward a shared goal. To recognize different behaviors, collective intelligence is built, aiming to jointly reason about multiple individuals' behaviors. On this ground, this paper proposes a method to recognize group behaviors, which allows us to locate and describe the interactive and collective actions of each individual in context. This perception of social context can be applied to sports analytics, social behavior recognition, and surveillance.
Recent methods for multiperson interactive behavior recognition take a sequential approach. Different from action recognition [1][2][3][4][5][6][7][8], which focuses on recognizing individual actions, multiperson behavior recognition aims at identifying interactive behavior in public-area scenarios, with practical applications such as visual surveillance, intelligent robotics, and sports event analysis. Multiperson activity involves more than one person and can scarcely be identified by considering the action of a single person alone. For example, the action standing (performed by a single subject) is in fact compatible with many multiperson interactive behaviors, as shown in Figures 1-4. Hence, it is important to take the interactions within the related group into consideration when modelling multiperson interactive activity.
Dating back to a decade ago, interactive activity recognition modelling started with modelling each class of collective activity by an interaction matrix [9] and employed systematically organized skeleton features enhanced with directional features for interactive action recognition and real-time detection tasks [10]. An-An et al. [11] proposed visual feature learning incorporated into a multitask learning framework, with a Frobenius-norm regularization term and a sparse constraint term, for joint task modelling and task-relatedness-induced feature learning. Zhu et al. [12] proposed a photonic switched optically connected memory system architecture for deep learning models. However, these descriptors are not able to tell the exact relationship between people in a group. Xu et al. [13] proposed a novel group activity recognition technology based on multimodal relation representation with temporal-spatial attention. Kim et al. [14] proposed a set of predictive features for interbehavior relations based on spatial, temporal, transitional, and environmental contexts. Liu et al. [15] proposed a deep fully connected relation model to learn the interactions between people. Lu et al. [16] proposed a graph attention interaction model (GAIM) embedded with the graph attention block (GAB) to explicitly and adaptively infer unbalanced interaction relations at personal and group levels.

Figure 1: Proposed multiperson interactive activity recognition. The data comes into the framework on the left (inputs). The interaction relation estimation framework and multiperson relation graph network represent the multiperson scene. Secondly, we extract a feature description based on a 3D extension of the SIFT algorithm. Thirdly, multiperson occlusion state reconstruction based on semantic knowledge events is used to describe multiperson interactive behavior events. Finally, we introduce Petri Nets (PNs) for detecting activities in four public datasets (outputs).
In these previous studies, features are first extracted for each person, and the interactions between each pair of features are then explicitly modelled. A key limitation of most existing models lies in the separation of human detection and activity modelling. Existing models weigh more heavily on group behavior or collective behavior modelling. For human detection, they directly adopt bounding boxes output by third-party pedestrian detectors. Unfortunately, this design completely decouples the detector and the collective activity recognizer and abandons the inherent collaboration between these two modules. Outliers who perform individual activities semantically irrelevant to the main group, or missing active participant information, are detrimental to collective activity recognition (misclassified as crossing and walking, respectively), thus affecting group activity recognition. The lack of collaboration between the detector and recognizer brought by the separate learning/processing in previous methods also leads to heavy computation when reasoning about collective activities.
This is because forward propagation through the neural network backbone that performs basic feature extraction is executed twice, once for the detection task and once for the activity recognition task.
We have also observed that the current trend [17][18][19][20][21] of tackling the problem of multiperson interactive activity recognition is to develop a model or framework with increasing complexity to jointly learn more subtasks simultaneously (e.g., detection, tracking, pose estimation, appearance modelling, and interaction).
Although the aforementioned approaches seem reasonable, they have some limitations. Firstly, most of the advanced detection approaches do not involve joint optimization to process multiple objects, but rather rely heavily on heuristic postprocessing; greedy nonoptimal decisions are thus common. Secondly, extracting features for each object separately ignores a large amount of context and interaction, which is productive information for reasoning about multiperson interactions, because the positions and actions of interacting human objects can be highly correlated. Thirdly, separating tracking from detection means the loss of positioning features, whose utilization would make a more effective recognition model. Finally, the sequential approach does not scale in a multiperson scene, as it needs to be run multiple times on a single image.
In short, previous approaches showcase some advantages (e.g., a good joint learning framework can process multiple tasks simultaneously), but they also reveal noticeable limits: (1) the sophisticated models can be difficult to optimize, (2) each subtask has not been fully explored and studied in depth, and (3) the core parts and problem-solving mechanisms of these models remain underexplained, which makes it difficult to guide further research. The approach suggested in our research attempts to overcome the above limitations and address specific problems. Inspired by recent multiperson interaction recognition, we propose the multiperson interactive relation graph (MRG) to simultaneously capture interpersonal appearance and position relations. Secondly, we build the interaction relation estimation framework and multiperson relation graph network to represent the multiperson scene. Finally, we infer human behavior with a Petri network. In the following, Section 2 gives an overview of the proposed method. In Section 3, we discuss the multiperson interaction estimation framework, multiperson occlusion state reconstruction based on semantic knowledge events, and Petri Nets (PNs). Section 4 provides comparative experiments. In the final section, general conclusions are drawn and possible further improvements on this research are stated.

Related Work
Multiperson interactive activity recognition is a comprehensive analytical task that has developed rapidly in recent years. In this section, we briefly review its development. Recognizing the collective activity of a group of participants has become an attractive research topic in recent years. Usually, multiperson interactive activity recognition needs to infer the complex interactions among different activity participants, which is much more challenging than recognizing the action of an individual subject.

Descriptor Learning without Interaction Modelling.

At the early stage, researchers aimed at seeking a discriminative descriptor to summarize the "multiperson interactive activity states" in a collective scenario. For this purpose, Zhao et al. [22] proposed a unified discriminative learning framework of multiple context models for concurrent collective activity recognition. Chen et al. [23] presented a new attribute-based spatio-temporal (AST) feature representation descriptor including spatio-temporal (ST) features and attribute features.
Building on the seminal works [24,25], deep neural networks are used for extracting features with high representational capacity. Sudhakaran et al. [26] presented EgoACO for video action recognition, which learns to pool action-context-object descriptors from frame-level features by leveraging the verb-noun structure of action labels. With the aid of the multiscale feature maps output by a fully convolutional network, features of each individual within consecutive frames are fused together by a recurrent network [17].

Interaction Modelling with Shallow Models.
Several works attempted to explore the interactions between individuals with shallow models. Dong et al. [27] proposed the Residual 3D Network (R3D) and Attention Residual 3D Network (AR3D) human action recognition models. Yang et al. [28] proposed a plug-and-play channel adaptive merging module (CAMM) based on graph convolutional networks (GCNs), specific to the human skeleton graph, which can merge the vertices from the same part of the skeleton graph adaptively and efficiently. Sheng et al. [29] proposed a discriminative subspace learning model (DSLM) to explore the complementary properties between handcrafted shallow feature representations and deep features. Song et al. [30] proposed a video-level 2D feature representation and a temporal attention model within a shallow convolutional neural network to efficiently exploit temporal-spatial dynamics. Yan et al. [31] designed a Hierarchical Graph-based Cross Inference Network (HiGCIN), in which three levels of information are included: the body-region level, the person level, and the group-activity level. Li et al. [32] proposed symbiotic graph neural networks, which contain a backbone, an action-recognition head, and a motion-prediction head.
To sum up, these research studies intended to capture the interactive actions in collective activities by a shallow model, rendering them inapplicable for describing activities with complex interactions.

Deep Interaction Modelling.

Zhou et al. [33] proposed the cascaded parsing network (CP-HOI) for multistage, structured HOI understanding, in which each cascade stage refines HOI proposals and feeds them into a structured interaction reasoning module. Fan et al. [34] proposed a spatio-temporal graph neural network to represent diverse gaze interactions and to infer atomic-level gaze communication by message passing. Qi et al. [35] proposed the graph parsing neural network (GPNN) to infer the HOI graph structure and the node labels. Rahmani et al. [36] proposed the robust nonlinear knowledge transfer model (R-NKTM) for human action recognition. Xiao et al. [37] proposed a dual attention network model that reasons about human-object interactions; this network weighs the important features for objects and actions, respectively.
Whereas the aforementioned methods seem reasonable, they have several drawbacks. First of all, the majority of state-of-the-art detection methods do not use any kind of joint optimization to handle multiple objects, but rather rely on heuristic postprocessing; thus, they are susceptible to greedy nonoptimal decisions. Secondly, extracting features individually for each object discards a large amount of context and interaction, which can be useful for reasoning about multiperson interactive behavior. This point is particularly important because the locations and actions of interacting humans can be highly correlated. Thirdly, independent detection and tracking mean that the representation used for localization is discarded, whereas reusing it would be more efficient. Finally, the sequential approach does not scale well with many people in the scene, as it requires multiple runs for a single image.

Overview of the Proposed Method
To overcome the above limitations and address specific problems, we propose the multiperson interactive relation graph (MRG) to simultaneously capture interpersonal appearance and position relations. Firstly, we build the interaction relation estimation framework and multiperson relation graph network to represent the multiperson scene. Secondly, we discuss the multiperson interaction estimation framework. Section 4 discusses multiperson occlusion state reconstruction based on semantic knowledge events. Finally, we introduce Petri Nets (PNs) for detecting activities in the TV human interaction, Vlog, Atomic Visual Actions, and MLB-YouTube datasets. The system flowchart is shown in Figure 1.

Multiperson Interaction Estimation Framework.
The framework is designed to recognize the multiperson interactive scene through clear-cut relation information, taking the extracted feature results as input, and to generate a set of reliable bounding box coordinates with their corresponding confidence scores for the targets in the detection stage. On this basis, we build the interaction relation estimation framework and multiperson relation graph network to represent the multiperson scene. Responding to the self-occlusion issue in human action recognition, a new adaptive occlusion state behavior recognition approach is presented based on the Petri network. In the following sections, we give detailed descriptions of our approach.

Interaction Estimation Framework.
Firstly, in order to obtain multiperson objects from the surveillance video, we use the method of [17] to detect the multiple person targets.
Given the feature map F ∈ R^{|I|×D} and two dense maps B ∈ R^{|I|×6} and P ∈ R^{|I|} (P represents a segmentation mask encoding which parts of the image contain people, and B represents the coordinates of the bounding boxes of the people present in the scene, encoded relative to the pixel locations), we convert the given ground truth object locations into dense ground truth maps B and P, for detecting the set of bounding boxes {(x_0, y_0), (x_1, y_1), . . . , (x_n, y_n)}, where s_y and s_x are scaling coefficients fixed as the maximum size of the bounding box over the training set images. B_i are defined for i: P_i = 1, and the regression loss is constructed accordingly. The loss includes a weight w that focuses training more on classification or on regression for datasets where classification is easy, such as volleyball. Secondly, we use the method of [18] to acquire the appearance relation and position relation, normalized by a normalization factor, where 1(·) is the indicator function; according to the actual scene, a threshold value determines 1 or 0. d(x_i^s, x_j^s) denotes the Euclidean distance between the center points of multiperson bounding boxes, and μ acts as a distance threshold, which is a hyperparameter. The parameter ξ represents the relative distance among multiperson position relation values using cosine and sine functions of different wavelengths. The feature dimension after embedding is d_s. We then transform the embedded feature into a scalar by the weight vector W_s and bias b_s, followed by a ReLU activation.
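The position relation described above — an indicator on the Euclidean distance between box centers, a sine/cosine embedding of the relative distance, and a learned projection to a scalar followed by a ReLU — can be sketched as follows. This is a minimal NumPy illustration; the function name, the embedding dimension d_s, and the random weights are our own assumptions, not the paper's implementation.

```python
import numpy as np

def position_relation(centers, mu=0.5, d_s=8):
    """Sketch of the distance-thresholded position relation.

    centers : (N, 2) array of bounding-box centre points x^s.
    mu      : distance threshold hyperparameter.
    d_s     : embedding dimension for the sine/cosine encoding.
    """
    # Pairwise Euclidean distances d(x_i^s, x_j^s).
    diff = centers[:, None, :] - centers[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Indicator function: 1 if within the threshold mu, else 0.
    mask = (dist < mu).astype(float)

    # Sine/cosine embedding of the relative distance at d_s/2 wavelengths.
    freqs = 1.0 / (10000 ** (np.arange(d_s // 2) / (d_s // 2)))
    angles = dist[..., None] * freqs                 # (N, N, d_s/2)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    # Learned projection W_s, b_s to a scalar, followed by ReLU.
    rng = np.random.default_rng(0)
    W_s, b_s = rng.normal(size=d_s), 0.0
    scores = np.maximum(emb @ W_s + b_s, 0.0)        # ReLU
    return mask * scores                             # zero out distant pairs

rel = position_relation(np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]))
```

In this toy example the third person is far from the first two, so its position relation to them is zeroed by the indicator.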

In order to recognize the interaction relationships of multiperson behavior, we developed a multiperson relation method in which we defined three indexes [19]: movement time (MT), nonoverlapped movement time (NOMT), and group movement time (GMT). MT denotes the movement time of a single person, GMT is the total movement time of all persons, and NOMT is the nonoverlapped movement time between persons.
Generalizing the equation, in a given duration, if the MT of person_n is MT_n, the GMT of person_1, person_2, ..., person_{n-1} is GMT_{1,2,...,n-1}, and the NOMT of person_1, person_2, ..., person_{n-1} is NOMT_{1,2,...,n-1}, then the GMT and NOMT of person_1, person_2, ..., person_n are calculated accordingly. Assuming that the higher the NOMT, the more likely there is an interaction, the networking method recognizes the interacting persons through four stages: (1) establish a potential interaction group, (2) link another person to the potential interaction group, (3) confirm an interaction relation, and (4) recognize further interaction relations. These stages are described in detail as follows:

Stage 1: establish the potential persons' interaction relation. (1) Calculate the NOMT of all linkable cases between any two persons and take the pair with the highest NOMT. (2) If the two people's NOMT is higher than their respective MTs, a potential interactive relation is identified; otherwise, there is none.

Stage 2: connect other people to the potential interactive relationship. (1) Calculate the NOMT of the potential group and each remaining individual, and connect the person who increases the NOMT the most. After that, the potential multiperson interaction is re-established involving the newly connected person. (2) Repeat the above steps until no person is left who can increase the NOMT. Any people not involved in the potential interaction relation are deemed noninteractive persons.

Stage 3: confirm the interactors. (1) If the GMT of the potential interaction relation is equal to or greater than a threshold value, it is confirmed as an interaction. (2) If the GMT of the potential interaction relationship is less than the threshold, all of the collected relations are considered noninteractions.

Stage 4: recognize group interactions. When there are more than two interactors, the networking method can be applied multiple times to recognize further interactive relations.
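The MT/GMT/NOMT indexes used by the stages above can be illustrated with a simple discretized sketch. This is our own construction, not the paper's implementation: movement is represented as a boolean person-by-time matrix, MT is each person's total moving time, GMT is the duration of the union of all movement, and NOMT is interpreted here as the time during which exactly one person moves (no overlap).

```python
import numpy as np

def movement_stats(moving):
    """Sketch of the MT / GMT / NOMT indexes on a discrete timeline.

    moving : (P, T) boolean array; moving[p, t] is True when person p
             moves during time step t. Each step counts as one time unit.
    """
    mt = moving.sum(axis=1)                  # MT of each person
    gmt = int(moving.any(axis=0).sum())      # union: total group movement time
    # NOMT: time steps where exactly one person moves (no overlap).
    nomt = int((moving.sum(axis=0) == 1).sum())
    return mt, gmt, nomt

# Two people taking turns: a high NOMT suggests an interaction.
moving = np.array([[1, 1, 0, 0, 1],
                   [0, 0, 1, 1, 0]], dtype=bool)
mt, gmt, nomt = movement_stats(moving)
```

Here the two people never move at the same time, so every moving step contributes to NOMT, matching the assumption that turn-taking indicates interaction.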
As mentioned above, we utilize the graph structure to explicitly model pair-wise relation information for group activity understanding. Our design is inspired by the recent success of relational reasoning and graph neural networks [8,38].
Formally, the nodes in our graph correspond to a set of actors A = {(GMT_i, NOMT_i) | i = 1, . . . , n}, where n is the number of actors, x_i^a ∈ A is actor i's appearance feature, and x_i^s = (t_i^x, t_i^y) is the center coordinate of actor i's bounding box. Following the existing method [38], we construct a graph G ∈ R^{N×N} to represent the pair-wise relations among actors, where the relation value G_ij indicates the importance of actor j's feature to actor i. GMT_i(x_i^a, x_j^a) denotes the appearance relation between two actors, and the position relation is computed by NOMT_i(x_i^a, x_j^a). The function h fuses the appearance and position relations into a scalar weight, divided by a normalization factor, where d(x_i^s, x_j^s) denotes the Euclidean distance between the center points of two actors' bounding boxes and μ acts as a distance threshold, which is a hyperparameter.
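In the spirit of [38], the relation graph can be sketched as a dot-product appearance similarity, masked by the distance indicator and normalized per row so that each G_ij reflects the importance of actor j to actor i. The function names and the choice of a masked softmax for the normalization factor are our own assumptions.

```python
import numpy as np

def relation_graph(app_feats, centers, mu=1.0):
    """Sketch of the pair-wise relation graph G (row-normalized).

    app_feats : (N, D) appearance features x^a.
    centers   : (N, 2) bounding-box centres x^s.
    """
    # Appearance relation: scaled dot-product similarity.
    app = app_feats @ app_feats.T / np.sqrt(app_feats.shape[1])
    # Position relation: indicator of Euclidean distance below mu.
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    mask = dist < mu          # each actor always relates to itself (dist 0)
    # Fuse, then normalize each row with a masked softmax so that
    # G_ij gives the importance of actor j's feature to actor i.
    scores = np.where(mask, app, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

app = np.eye(3)                                    # toy appearance features
centers = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0]])
G = relation_graph(app, centers, mu=1.0)
```

The isolated third actor ends up relating only to itself, while the two nearby actors share their attention weight.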

Feature Description.
We use a 3D extension of the SIFT algorithm [39,40], as described in [41,42], to determine the location of interest points. Given a 3D input volume I(x, y, z) and a 3D Gaussian filter G(x, y, z, δ), we form multiscale difference-of-Gaussian (DoG) volumes, similar to [38,39], where D_ij are the second derivatives in the DoG volume.

Unstable points are rejected by a threshold on these second derivatives. Since the ultrasound image is noisy and not as sharp as a normal image, we relax this threshold restriction so as to obtain more feature points [39]: τ_e = 25, a threshold obtained from practical tests that has a certain universality. The subvoxel estimate of an extremum's true location is then achieved by quadratic interpolation on the DoG volume data. After identifying the interest point locations, we define a localized neighborhood function, extending the earlier work [39,40], where w(d, δ) limits the contribution of voxels around the point of interest to those in the local neighborhood, d is the voxel distance from the point of interest to the contributing voxel, and δ determines the extent of the local contribution. For voxel k, d_k is the voxel distance from the adjacent acquired interest point location, with density ρ_k.
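The multiscale DoG construction can be sketched as follows. This is a minimal illustration assuming SciPy's `gaussian_filter`; the scale values are illustrative, and the extremum selection and rejection logic are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_volumes(volume, sigmas=(1.0, 1.6, 2.56)):
    """Sketch of multiscale difference-of-Gaussian (DoG) volumes.

    volume : 3-D array I(x, y, z).
    sigmas : increasing Gaussian scales delta.
    Interest points would be local extrema across space and scale in
    the returned DoG stack (selection logic omitted here).
    """
    blurred = [gaussian_filter(volume.astype(float), s) for s in sigmas]
    # D_k = G(sigma_{k+1}) * I - G(sigma_k) * I
    return np.stack([blurred[k + 1] - blurred[k]
                     for k in range(len(sigmas) - 1)])

vol = np.zeros((32, 32, 32))
vol[16, 16, 16] = 1.0                      # an impulse "feature"
dog = dog_volumes(vol)                     # (2, 32, 32, 32) DoG stack
```

Because each Gaussian preserves total mass, every DoG slice sums to approximately zero; the response is a centre-surround pattern around the impulse.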

Multiperson Occlusion State Reconstruction Based on Semantic Knowledge Event.
This section presents the semantic knowledge techniques employed in the framework, namely, the semantic knowledge interpretation (SKI) components based on the method [25], which, respectively, provide a unanimous representation storage (representation layer) and semantic interpretation and event fusion (interpretation layer). The semantic-knowledge structures and vocabularies are described as follows. Semantic knowledge modelling and storage allow end users to model domain knowledge about (1) goal-oriented protocols, (2) domain observation entities and events, and (3) interactive occlusion behavior contextual models. Semantic knowledge covers the complex activities involved in each scenario. The multiperson occlusion protocol (or scene) can be marked as an instance, which is applicable for preserving multiperson occlusion information of the occlusion state. Participant instances allow profile-related assertions about participants to be defined, such as interaction occlusion and self-occlusion. Protocol steps cover one or multiple tasks, with a start and an end node. Our proposed method implements three protocol steps: directional activities, occlusion activities, and self-occlusion activities.
The term event covers lower-level observation types and high-level activities. The ontology offers a lightweight glossary for basic event-related information, for instance, event hierarchies and temporal extension. Event is the root class with two direct subclasses, observation and activity, for modelling observations and activities, respectively. Four observation types are derived from the observation modelling, namely, locations (e.g., in the surveillance area), postures (e.g., standing up), actions (e.g., drinking water), and objects (e.g., a wallet). These are the basic types under the observation category.
The agents of the events and the temporal context are captured using constructs from DUL [43,44] and OWL Time [45], respectively; the detection of an object, for example, is modelled in this way. To perfect the definition of the interactive behavior structure in the occlusion state, and thus better share and reuse knowledge, this paper chooses an active ontology for occluded interactive behavior modelling. The ontology is able to recognize the occluded activity and observation type, so the context of the occlusion and self-occlusion interactive activity can be stated. The knowledge representation module of this framework creates a model for semantic modelling in context. Contextual information in each occlusion activity is conveyed by class equivalence axioms, which connect interactive behaviors with lower-level observations. More precisely, the activity models encompass the domain semantic knowledge required to detect complex activities; this is manifested by the mutual interdependence of lower-level observations and sophisticated behaviors and is composed of the following knowledge structure.

Action Recognition.
In this paper, we introduce Petri Nets (PNs) for detecting activities in the TV human interaction dataset, Vlog, Atomic Visual Actions, and MLB-YouTube datasets.
Petri Nets are graph-based techniques that can model and visualize various behavior types including parallelism, concurrency, resource sharing, and synchronization [46,47]. A Petri Net is a finite state machine that allows multiple inputs and multiple outputs (a traditional finite state machine is a Petri Net in which each transition is restricted to exactly one output and one input [48]). In a graphical representation [49][50][51], places are drawn as circles and transitions as squares or rectangles. Arcs connect place nodes to transition nodes (input arcs) or transition nodes to place nodes (output arcs). Regular arcs are drawn with arrow heads; inhibitor arcs are drawn with dot heads. Arcs are associated with a weight, also called the arc's multiplicity, taken to be one if not specified. Places connected to a transition by input arcs are called the transition's input places (or input set); similarly, places connected to a transition by output arcs are called the transition's output places (or output set). A place node may contain a number of tokens (another graph component), visualized as black dots within the place node that contains them.
As PN is introduced to recognize the relative velocity of area models and nodes, the proposed method is based on a primitive advanced Petri Net, also known as the PN with area-velocity tokens.
Firstly, we define a model and reasoning procedure similar to that of Dong et al. [27], as follows.
Formally, the basic place/transition PN can be described as a five-tuple PN = {P, T, I, O, M} and can be graphically represented by a directed bipartite graph with two types of nodes: the places P, drawn as circles, and the transitions T, drawn as bars or boxes [15]. P = {p_1, p_2, p_3, . . . , p_n} is a finite set of places. T = {t_1, t_2, t_3, . . . , t_m} is a finite set of transitions. I: (P × T) → N is the input arc function, which can be represented by the input matrix I_{n×m}; if there exists an arc with weight k connecting the place p_i to the transition t_j, then I(p_i, t_j) = k; otherwise, I(p_i, t_j) = 0. O: (P × T) → N is the output arc function, which can be represented by the output matrix O_{n×m}; if there exists an arc with weight w connecting the transition t_j to the place p_k, then O(t_j, p_k) = w; otherwise, O(t_j, p_k) = 0. M: P → N is the current marking of the net and can be represented as a vector M_{1×n}; M_0 is the initial marking, which denotes the initial state of the net.
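The five-tuple above, together with the standard firing rule M' = M − I(·, t) + O(·, t), can be sketched as follows. This is a minimal place/transition net, not the paper's area-velocity extension; the class and method names are ours.

```python
import numpy as np

class PetriNet:
    """Minimal place/transition Petri Net PN = {P, T, I, O, M}."""

    def __init__(self, I, O, M0):
        self.I = np.asarray(I)   # input matrix,  n places x m transitions
        self.O = np.asarray(O)   # output matrix, n places x m transitions
        self.M = np.asarray(M0)  # current marking: tokens per place

    def enabled(self, t):
        # A transition is enabled when every input place holds at least
        # as many tokens as the input arc weight requires.
        return bool((self.M >= self.I[:, t]).all())

    def fire(self, t):
        # Firing removes tokens from input places and adds tokens to
        # output places: M' = M - I(:, t) + O(:, t).
        if not self.enabled(t):
            raise ValueError("transition not enabled")
        self.M = self.M - self.I[:, t] + self.O[:, t]
        return self.M

# Two places and one transition moving a token from p1 to p2.
net = PetriNet(I=[[1], [0]], O=[[0], [1]], M0=[1, 0])
net.fire(0)                       # marking p1 -> p2
```

After firing, the token has moved from p1 to p2 and the transition is no longer enabled, mirroring the "tokens are removed from input places" reasoning described below.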
After applying the newly transformed function, behavior recognition proceeds as follows: S_j is the detector response for action j, T_j is the area-velocity goodness of transition j, and X_i^{old} is the previous score of token i passing through transition j.
After firing a given transition (rule), tokens from its input places are removed. Firing a given transition (rule) and removing tokens can be intuitively interpreted as an execution of reasoning by using this rule in a given reasoning process. Hence, in the next steps, markings of input places of a fired rule are already unnecessary. Such reasoning can be understood as a kind of forward reasoning.

Experiments
Results on the TV human interaction dataset [52] are shown in Figure 2. The TV human interaction dataset consists of 300 video clips collected from over 20 different TV shows, containing four types of interactions: hand shakes, high fives, hugs, and kisses. Noninteraction clips are also included.
Vlog [53] is a large-scale YouTube dataset of lifestyle vlogs that includes daily interactions between people and things. The experimental results on this dataset are shown in Figure 3.
Atomic Visual Actions [54] is a dataset released by UC Berkeley and Google with diverse environments and a large number of classes, labelled using exhaustive frame-level annotation of YouTube videos. 430 clips of 15 to 30 minutes were extracted from movies and television shows featuring famous actors from different countries. The results on the Atomic Visual Actions dataset are shown in Figure 4. MLB-YouTube (Major League Baseball) [55] is a fine-grained activity dataset consisting of 20 baseball games from the 2017 MLB postseason. Its 9 activities (e.g., swing, foul, ball, strike, and no action) are not very distinct, whereas salient repetition of activities is noticeable. The annotations are multilabel and overlapping, with the pitch type (e.g., fastball, curveball, and slider) and the speed of the pitch also given for each pitch.

The results on MLB-YouTube (Major League Baseball) are shown in Figure 5. The experimental results are shown in Tables 1-4, respectively.
It can be seen from Tables 1-4 that the average recognition accuracies of the method in this section are 93.42%, 93.37%, 93.21%, and 92.87%, respectively, each superior to the other recognition algorithms.
Hence, according to the above tables, our approach proved to be more accurate in recognizing human interactions, which further validates the accuracy and efficiency of our algorithm.

Conclusions and Future Work
The present study proposes a newly developed interactive behavior relational modelling and identification framework, which successfully recognizes multiperson interaction behaviors. We build a flexible and efficient model of the multiperson interaction relation estimation framework to capture the appearance and position relations of people, and a new reasoning approach is thus formed. Temporal consistency is handled via a person-level matching recurrent neural network. The integration of the interaction relation estimation framework and the multiperson relation graph network not only facilitates automatic learning from the human interaction dataset in an end-to-end manner but also enables the reasoning process to be performed efficiently with standard matrix operations. As for the self-occlusion issue in human action recognition, this paper proposes a new adaptive occlusion state behavior recognition approach based on semantic knowledge event representation.

Table 1: Compared with other approaches on the TV human interaction dataset.

Method | Average recognition rate (%)
The proposed method | 28.12
Afrasiabi et al. [56] | 30.26
Atto et al. [57] | 31.22
Yoon et al. [58] | 28.23
Zachary and Holder [59] | 27.13
Bagautdinov et al. [60] | 26.33
Hussain et al. [61] | 30.43
Khan et al. [62] | 28.78
Based on four standard interactive behavior databases, we visualize the learned interaction relation estimation framework and the multiperson relation graph (MRG), which demonstrates that the proposed method is able to capture discriminative relation information for multiperson interactive activity recognition. Experimental results reveal that multiperson occlusion state reconstruction based on SKE outperforms other approaches in accuracy.
There are some failure cases in all of the above experiments. During the experiments, if there is strong occlusion (e.g., the occlusion area is over one-third) or highly dynamic video (e.g., interactive actions that switch quickly, such as a cricket bat quickly dropped from the hand) among multiple person objects, the identification accuracy decreases significantly. Strong occlusion and high dynamics cause feature point and interaction relationship positioning to be inaccurate, so locations can only be estimated.
We extracted from the above databases the video data with strong occlusion and highly dynamic content and conducted experiments with the proposed method. The results are shown in Tables 5-8. It can be seen from these tables that the average recognition accuracies of the method in this section are 30.26%, 31.12%, 34.23%, and 28.12%, respectively; the first three are superior to the other recognition algorithms, while the last is inferior.
It is assumed that the proposed method can be extended to more interacting people or objects. Tracking the poses of interacting people, however, will involve more complex factors, for instance, dealing with more variable motion, interperson occlusions, and possible appearance similarity of different people [51].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.