Optimised ARG-Based Group Activity Recognition for Video Understanding

Video understanding identifies and classifies actions and events in video. Many previous works, such as video annotation systems, have shown promising results in producing a general understanding of video content. However, a fine-grained summary of human activities and their interactions is still difficult to produce with state-of-the-art video captioning techniques. A comprehensive description of human actions and collective behavior is important for real-time CCTV video monitoring, medical treatment, sports video analysis, and similar applications. This work proposes a form of video understanding that focuses primarily on group activity recognition, learned from the pairwise appearance similarity between actors. To measure this similarity and construct an actor relation graph, Zero-Mean Normalized Cross-Correlation (ZNCC) and the Zero-Mean Sum of Absolute Differences (ZSAD) are proposed, allowing a graph convolutional network (GCN) to learn to distinguish group actions. We recommend MNASNet as the backbone to extract features from each video frame. A visualization model is also developed that renders every input video frame and predicts individual actions and the group activity, with projected bounding boxes on each human object.


Introduction
A deeply studied topic in the field of video content analysis is video understanding [1]. Traditional video captioning techniques such as LSTM-YT [3] and S2VT [4] train recurrent neural networks, specifically LSTMs [2], on video-sentence pairs [1, 3-5]. These models learn to connect a sequence of video frames with a sequence of words in order to generate a video description [4]. Krishna et al. noted that such video captioners handle only one significant event in a short video [5]. They therefore introduced a new module that describes all events in a video clip using contextual information from the timeline [5]. However, the description of human actions and their interactions in the form of video captions remains very limited [6]. Recent studies in pose estimation, human action recognition, and group activity recognition demonstrate the capability to describe human actions and group actions in more detail [7, 8]. Recognition of human actions and group activities is a major problem in video understanding [9]. Action and activity recognition techniques are widely used, for example, in social behavior understanding, sports video analysis, and video surveillance. To better understand a video scene containing several people, it is important to understand the actions and the collective activity of all individuals. Group activity recognition based on the Actor Relation Graph (ARG) is a state-of-the-art model that captures the appearance and position relations among actors in the scene and identifies individual actions and the group activity [9]. In this paper, we suggest several ways to improve the accuracy and efficiency of the Actor Relation Graph model for video understanding, focused primarily on group activity recognition.
To enhance human action and group activity recognition, MNASNet is used as the CNN backbone, and Zero-Mean Normalized Cross-Correlation (ZNCC) and the Zero-Mean Sum of Absolute Differences (ZSAD) are used to calculate the pairwise appearance similarity when building the Actor Relation Graph. We also implement a visualization model that displays each input video frame and draws predicted bounding boxes on each human object.

Video Captioning
Video captioning is an important field of study for video understanding. In 2015, S. Venugopalan et al. proposed an end-to-end model that used a recurrent neural network (LSTM [2]) trained on video-sentence pairs to map a sequence of video frames to a sequence of words describing the event as a caption [4]. A stack of two LSTMs was employed to learn the temporal structure of the frames and the sequence model of the generated sentences. In this approach the whole video sequence must first be encoded by an initial LSTM network, so long video sequences can lead to the vanishing-gradient problem and prevent successful training of the model [5]. R. Krishna et al. introduced a Dense Captioning Events (DCE) model that detects several events and provides a description for each using past, current, and future contextual information in a single video pass [5]. Their process is divided into two steps: event detection and event description. To localize temporal proposals of interest in both short and long video sequences, the DCE model uses a multi-scale variant of a deep action proposal model. An LSTM captioning model with an attention mechanism is also introduced to exploit past and future contexts. X. Li et al. developed a novel attention-based framework, the residual attention-based LSTM (Res-ATT [10]). This model takes advantage of the existing attention mechanism and integrates residual mapping into a two-layer LSTM network to avoid losing previously generated word information. The residual attention-based decoder model has five parts: a sentence encoder, temporal attention, a visual and sentence fusion layer, a residual layer, and an MLP [10]. The sentence encoder is an LSTM layer that extracts important syntactic information from the sentence, and temporal attention is designed to identify the significance of each frame. The visual and sentence fusion layer mixes natural language information with image features, the residual layer is proposed to reduce transmission loss, and the MLP layer predicts the next word to create a natural language description [10].

Pose Estimation and Activity Recognition
To better understand a video scene that includes several people, it is essential to understand the actions of each individual. OpenPose is a real-time open-source system for 2D multi-person pose detection [8]. It is now widely used on video frames for body and facial keypoint detection [7, 11, 12]. It generates a spatial encoding for a variable number of people, followed by greedy bipartite graph matching that associates each person in the image with their 2D keypoints. The approach refines both the part affinity field (PAF) predictions and the confidence map detections at each stage [8], which improves real-time performance while maintaining the accuracy of each component separately. The OpenPose library provides 2D human pose estimation on single images for our proposed system to detect body, hand, and facial keypoints. Long Short-Term Memory (LSTM) recurrent neural networks (RNNs) are frequently used in emerging human activity recognition methods. F. M. Noori et al. propose an approach in which, after considering movements across the video frames, anatomical keypoints are first extracted from the RGB images using the OpenPose library, and the resulting features are then classified into the corresponding activities by an LSTM RNN [7]. Improved performance is demonstrated across various subjects recorded from different angles; however, efforts to improve the accuracy of multi-person action classification are still underway.
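The keypoint-then-classify pipeline described above needs a preprocessing step before the LSTM sees anything. Below is a minimal sketch of that step: per-frame 2D keypoints (as an OpenPose-style detector would output) are centered and scaled, then flattened into one feature vector per frame. The array shapes and normalization choices are our assumptions for illustration, not the exact pipeline of the cited work.

```python
import numpy as np

def keypoints_to_sequence(keypoints):
    """Convert per-frame 2D keypoints into an LSTM-ready feature sequence.

    keypoints: array of shape (T, K, 2) -- T frames, K body keypoints,
    (x, y) image coordinates, as produced by an OpenPose-style detector.
    Returns an array of shape (T, 2K): each frame is centered on its mean
    keypoint and scaled to unit spread, so a downstream classifier sees
    pose shape rather than absolute image position.
    """
    kp = np.asarray(keypoints, dtype=np.float64)
    T, K, _ = kp.shape
    centered = kp - kp.mean(axis=1, keepdims=True)              # remove translation
    scale = np.linalg.norm(centered, axis=(1, 2), keepdims=True) + 1e-8
    normalized = centered / scale                               # remove scale
    return normalized.reshape(T, 2 * K)

# Example: 4 frames, 18 keypoints (an OpenPose BODY-model-style layout)
seq = keypoints_to_sequence(np.random.rand(4, 18, 2) * 100)
print(seq.shape)  # (4, 36)
```

Each row of the result is one timestep of the sequence fed to the recurrent classifier.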

Actor Relation Graphs for Group Activity Recognition
Recognition of human actions and group activities is a major problem in video understanding [9]. J. Wu et al. proposed the Actor Relation Graph (ARG) to model actor-to-actor relations so that group activity with multiple participants can be recognized [9]. In a multi-person scene, the relations between actors are determined from appearance similarity and relative location and captured in the ARG. Compared with using a CNN to extract person-level features and aggregating them into a scene-level feature, or applying an RNN to collect temporal information from densely sampled frames, learning with the ARG is less computationally costly and more flexible in addressing variations in group activity. Given a video sequence with bounding boxes for the actors in the scene, the trained network can recognize individual actions and the group activity in a multi-person scene. ARG efficiency on long-range video clips is improved by forcing relational connections within a local neighborhood only and by randomly dropping frames while maintaining the diversity of training samples, which reduces the risk of overfitting. First, actor features are extracted from the provided bounding boxes by a CNN and RoIAlign [13]. After the feature vectors of the actors in the scene have been obtained, multiple graphs are created to represent diverse relational information over the same set of actors. Finally, a GCN performs relational learning to identify individual actions and group activities based on the ARG. The pooled ARG features feed two classifiers, one for individual action recognition and one for group activity recognition; the scene-level representation is generated by max-pooling over individual actors and is then used for group activity classification.
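In outline, the GCN step above multiplies the actor feature matrix by the normalized relation graph and a learnable weight matrix, and the scene-level representation is obtained by max-pooling over actors. The following is a minimal numpy sketch under assumed toy sizes (N actors, d-dimensional features); the real model stacks such layers and trains W end to end rather than using random weights.

```python
import numpy as np

def gcn_layer(G, X, W):
    """One graph-convolution layer on an actor relation graph.

    G: (N, N) row-normalized relation graph, X: (N, d) actor features,
    W: (d, d) weight matrix. Returns relational features of shape (N, d).
    """
    return np.maximum(G @ X @ W, 0.0)  # ReLU(G X W)

N, d = 6, 16                              # 6 actors, 16-dim features (toy sizes)
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d))           # actor features from the backbone
E = rng.random((N, N))                    # raw pairwise relation values
G = E / E.sum(axis=1, keepdims=True)      # normalize so each row sums to 1
W = rng.standard_normal((d, d)) * 0.1

Z = X + gcn_layer(G, X, W)                # fuse original and relational features
scene = Z.max(axis=0)                     # max-pool actors -> scene-level feature
print(scene.shape)                        # (16,)
```

The per-actor rows of Z would feed the individual-action classifier, and the pooled `scene` vector the group-activity classifier.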

Proposed Method
We propose an improved model based on the Actor Relation Graph, focused on group activity recognition. An overview of the original ARG model is shown in Fig. 1. First, the model extracts actor features from sampled video frames using a CNN and RoIAlign on the given bounding boxes [14]. It then builds an N-by-d feature matrix, where a d-dimensional vector represents each actor bounding box and N denotes the number of bounding boxes in the video frames. Actor relation graphs are then constructed to capture the appearance and position relations of each actor in the scene. The model then reasons over the ARG using Graph Convolutional Networks (GCN). Finally, two distinct classifiers aggregate the original and relational features to perform individual action and group activity recognition [9]. Because the original study focuses mainly on group activity recognition, individual action predictions are less accurate, since the model uses only the region of interest and the CNN to recognize actions. Although the ARG model can predict group activities with high accuracy, some areas for improvement remain. Figure 2 illustrates the improved ARG-based model for human action and group activity recognition.
We propose to use MNASNet for extracting image feature maps in the CNN stage, and to calculate the pairwise appearance similarity for the Actor Relation Graph using Zero-Mean Normalized Cross-Correlation (ZNCC) and the Zero-Mean Sum of Absolute Differences (ZSAD), as shown in Fig. 2. The Methodology section provides further details on our proposed approach. To give our model a more visual result, we also introduce a visualization model that displays every video frame with predicted bounding boxes on each human object; Fig. 3 shows example outputs.

Methodology
Building the Actor Relation Graph is the key to our model. J. Wu et al. showed that, in each frame, the ARG can represent pairwise information between actors as a graph structure and use this relational information to understand group activity [9]. Both appearance features and position information are used to construct the ARG to better capture the relation between two actors. The relation value between actors i and j is defined as a composite function

e_ij = h(f_a(x_i^a, x_j^a), f_s(x_i^s, x_j^s)),

where f_a denotes the appearance relation and f_s the position relation. Here x_i^a and x_j^a are the appearance features of actors i and j, while x_i^s and x_j^s are their location features (the centers of their bounding boxes). The function h fuses the appearance and position relations into a scalar weight [9].
Normalization with a softmax function is further adopted on every actor node so that the relation values leaving each node always sum to one [9]:

G_ij = exp(e_ij) / Σ_{j=1}^{N} exp(e_ij).
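A minimal sketch of this normalization step follows: pairwise appearance scores are gated by a binary position relation and passed through a row-wise softmax, so each actor's outgoing relation values sum to one. Combining appearance and position by hard masking is our assumption for illustration; the exact fusion function h in the original model may differ.

```python
import numpy as np

def build_relation_graph(appearance_scores, position_mask):
    """Combine appearance and position relations into a normalized ARG.

    appearance_scores: (N, N) pairwise appearance similarities f_a.
    position_mask: (N, N) binary position relation f_s (1 if two actors
    are within the distance threshold, else 0).
    Returns G: (N, N) with each row summing to 1 (softmax over valid pairs).
    """
    scores = np.where(position_mask > 0, appearance_scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    expd = np.exp(scores)                                 # exp(-inf) -> 0
    return expd / expd.sum(axis=1, keepdims=True)

fa = np.array([[0.9, 0.2, 0.5],
               [0.2, 0.8, 0.1],
               [0.5, 0.1, 0.7]])
mask = np.ones((3, 3))                  # all actors within range of each other
G = build_relation_graph(fa, mask)
print(G.sum(axis=1))                    # each row sums to 1
```

Rows of G are the normalized edge weights used by the graph convolution.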

Appearance Relation
In J. Wu's paper, the embedded dot-product is used to calculate the similarity between the appearance features of two actors (the image region inside each actor's bounding box) in an embedding space [9]:

f_a(x_i^a, x_j^a) = θ(x_i^a)^T φ(x_j^a) / sqrt(d_k),

where θ and φ are learnable linear transformations of the form Wx + b, with W and b learnable weights, and d_k is the embedding dimension. The learned transformations of the original features can better capture the relation between two actors in a subspace.

We evaluate two other methods for calculating the appearance relation: Zero-Mean Normalized Cross-Correlation (ZNCC) and the Zero-Mean Sum of Absolute Differences (ZSAD). Normalized cross-correlation (NCC) is a method to evaluate the degree of similarity between two comparable images. Lighting and exposure conditions can cause the images to differ in brightness, so they are first normalized to zero mean to obtain a more accurate similarity score. The advantage of normalized cross-correlation is that it is less sensitive to linear changes in the illumination amplitude of the two compared images, and it can be written as [13]:

ZNCC(x_i^a, x_j^a) = (1/n) Σ_t (x_i^a(t) − μ_i)(x_j^a(t) − μ_j) / (σ_i σ_j),

where μ_i, μ_j and σ_i, σ_j are the means and standard deviations of the two feature patches and n is the number of elements. This quantity varies between −1 and 1, and its value helps capture the appearance relation between the two actors.

The other method we evaluate for calculating the appearance relation of the ARG is the sum of absolute differences (SAD). The SAD computes the distance between two matrices as the sum of the absolute differences of their components; its zero-mean variant is

ZSAD(x_i^a, x_j^a) = Σ_t |(x_i^a(t) − μ_i) − (x_j^a(t) − μ_j)|.

SAD is more robust against extreme data values, making it well suited to comparing appearance features and capturing the relation between appearances.
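The two measures can be sketched directly from their standard definitions. In the toy example below, a uniform brightness offset leaves ZNCC at 1 and ZSAD at 0, which is exactly the illumination invariance argued for above. This is a sketch of the textbook formulas, not the paper's implementation; note that ZSAD is a distance (0 = identical), so turning it into a similarity for the graph, e.g. by negation, is left implicit here.

```python
import numpy as np

def zncc(a, b):
    """Zero-mean normalized cross-correlation between two feature patches.

    Returns a score in [-1, 1]; 1 means identical up to brightness/contrast.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    a = a - a.mean()                      # remove brightness offset
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12
    return float((a * b).sum() / denom)

def zsad(a, b):
    """Zero-mean sum of absolute differences (a distance: 0 = identical)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.abs((a - a.mean()) - (b - b.mean())).sum())

patch = np.arange(16.0).reshape(4, 4)     # toy 4x4 feature patch
brighter = patch + 10.0                   # same patch, uniform brightness change
print(round(zncc(patch, brighter), 6))    # 1.0 -- ZNCC ignores the offset
print(zsad(patch, brighter))              # 0.0 -- ZSAD ignores it too
```

ZNCC is additionally invariant to contrast scaling (a gain applied to the patch), while ZSAD is invariant to the offset only.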

Position Relation
Furthermore, spatial structural information is considered to better capture the relations between actors. A distance mask is applied so that signals are collected only from entities that are not far apart. Since local relations are more crucial to understanding group activity than global ones, the position relation between two actors is taken as an indicator on their Euclidean distance:

f_s(x_i^s, x_j^s) = I(d(x_i^s, x_j^s) ≤ μ),

where I(·) is the indicator function, d(x_i^s, x_j^s) is the Euclidean distance between the center coordinates of the two actors' bounding boxes, and μ is a distance threshold.
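The distance mask described above reduces to a pairwise indicator on bounding-box-center distance. A minimal sketch follows, with the threshold taken as 1/5 of an assumed 1280-pixel frame width (the frame width and the example centers are illustrative values, not taken from the dataset):

```python
import numpy as np

def distance_mask(centers, mu):
    """Binary position relation: 1 if two actors' bounding-box centers are
    within Euclidean distance mu, else 0, so relations stay local."""
    c = np.asarray(centers, dtype=np.float64)    # (N, 2) box centers
    diff = c[:, None, :] - c[None, :, :]         # pairwise coordinate differences
    dist = np.sqrt((diff ** 2).sum(axis=-1))     # (N, N) Euclidean distances
    return (dist <= mu).astype(np.float64)

centers = np.array([[10.0, 20.0],    # actor 0
                    [30.0, 20.0],    # actor 1, close to actor 0
                    [600.0, 20.0]])  # actor 2, far away
mu = 1280 / 5                        # threshold: 1/5 of a 1280-px frame width
print(distance_mask(centers, mu))    # actors 0 and 1 connect; actor 2 is isolated
```

Multiplying this mask into the relation graph zeroes out long-range edges before normalization.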

Datasets
This paper uses the Collective Activity dataset and its augmented version [15] to train and test our model. The dataset contains 74 video scenes, each with several people. Every person is annotated with manually defined bounding boxes and the ground truth of individual actions and group activity.

Experiments
The minibatch size is 16, the learning rate is 0.0001, and we train our network for 100 epochs. The individual-action loss weight is set to λ = 1. The GCN parameters are set to d_k = 256 and d_s = 32, and the distance mask threshold μ is set to 1/5 of the image width. The default CNN backbone for feature extraction is Inception-v3, and the embedded dot-product is the default appearance relation function. Our implementation is based on the PyTorch framework and a single Tesla K80 GPU.

-Experiment 1: Evaluation with different backbone networks
In this section we conduct extensive studies on the Collective Activity dataset to evaluate the relational modeling of the proposed backbone network, using group activity prediction accuracy as the evaluation metric. Table I presents the experimental results. During our two-stage training, the backbone network is initialized with ImageNet-pretrained weights.

-Experiment 2: Evaluation with different appearance relation functions
In this experiment, we analyze group activity recognition performance with different appearance relation functions. The model is first trained and validated with the default Inception-v3 backbone and the embedded dot-product for calculating the appearance relation, giving a best result of 92.06%. The appearance relation function is then replaced with zero-mean normalized cross-correlation (ZNCC), and the best outcome is 93.58%. The zero-mean sum of absolute differences (ZSAD) is further evaluated for calculating appearance similarity, and 94.37% is the best result we achieve. Experiment 2 shows that our proposed model is more accurate with ZNCC or ZSAD as the appearance relation function. Table II presents the experimental results, and Fig. 3 shows the output of the visualization model.

Conclusion & Future Work
This paper uses a model based on the Actor Relation Graph (ARG) for group activity recognition. To improve the performance of our model, we construct the ARG using zero-mean normalized cross-correlation (ZNCC) and the zero-mean sum of absolute differences (ZSAD). We also adopt MNASNet as the backbone network in our proposed model to improve computational speed. Extensive experiments show that the suggested methods are robust and effective at improving accuracy and speed on the Collective Activity dataset. Because our project focuses primarily on group activity recognition, individual action predictions are less accurate, since only the region of interest (RoI) and the CNN are used for action recognition. In future work, we plan to apply skeleton extraction on top of the MNASNet backbone to achieve greater prediction accuracy for individual actions.