Multi-Layered Deep Learning Features Fusion for Human Action Recognition

Human Action Recognition (HAR) has been an active research topic in machine learning for the last few decades. Visual surveillance, robotics, and pedestrian detection are its main applications. Computer vision researchers have introduced many HAR techniques, but these still face challenges such as redundant features and high computational cost. In this article, we propose a new deep-learning-based method for HAR. In the proposed method, video frames are first pre-processed using a global contrast approach and then used to train a deep learning model through domain transfer learning. The pre-trained ResNet-50 model is used as the deep learning model in this work. Features are extracted from two layers: the Global Average Pool (GAP) layer and the Fully Connected (FC) layer. The features of both layers are fused by Canonical Correlation Analysis (CCA), and the fused features are then selected using a Shannon-entropy-based threshold function. The selected features are finally passed to multiple classifiers for final classification. Experiments are conducted on five publicly available datasets: IXMAS, UCF Sports, YouTube, UT-Interaction, and KTH, on which accuracies of 89.6%, 99.7%, 100%, 96.7%, and 96.6%, respectively, are achieved. Comparison with existing techniques shows that the proposed method provides improved accuracy for HAR. The proposed method is also computationally fast in terms of execution time.


Introduction
referred to as an action [4,5]. From the computer vision (CV) point of view, action recognition relates observations such as video data to action sequences [6]. A sequence of human actions accomplished by at least two actors, where one actor must be a person and the other may be a person or an object, is called an interaction [7]. Understanding human activities from videos has become a demanding task in CV. Automated recognition of an activity being performed by a human in a video sequence is the main capability of an intelligent video system [8].
The main aim of action recognition is to supply useful information about subjects' habits. It also permits a system or robot to interact with users more naturally. By interpreting human activities, the occurrence of crimes can be recognized and forecast, helping the police or other agencies react straightaway [9]. Recognizing human actions accurately is extremely difficult due to many problems, e.g., cluttered backgrounds, changing environmental conditions, and viewpoint differences [10].
HAR techniques for video sequences are usually classified into two types: template-based methods and model-based methods. Template-based methods emphasize low- and mid-level features, whereas model-based methods emphasize high-level features [11,12]. In the past few years, a large number of feature extraction methods have been introduced, notably the spatio-temporal interest points (STIP) descriptor [13], the motion energy image (MEI) and motion history image (MHI) [14], spatio-temporal descriptors based on 3-D gradients [15], speeded-up robust features (SURF) [16], 3D SIFT [17], 3D histograms of oriented gradients (HOG3D) [15], and dense optical trajectories [18]; these have achieved fruitful results for HAR on video sequences [19]. Classification of the extracted features is then performed using machine learning (ML) classifiers such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM), decision trees, linear discriminant analysis (LDA), ensemble trees, and neural networks.
Compared to the techniques above, significant performance gains were achieved after the introduction of the deep convolutional neural network (DCNN) in machine learning [20,21]. Several pre-trained deep models are presented in the literature, such as AlexNet, VGG, GoogleNet, ResNet, and Inception V3. DCNN models can act directly on raw inputs without any preprocessing [22], and more complex features can be extracted with every additional layer; the difference in complexity between adjoining layers reduces as the data proceeds to the upper convolutional layers. In recent years, these deep learning (DL) models have been utilized for HAR and show high accuracy [23]. However, when humans perform complex actions that are similar to each other, the performance of these models degrades. Therefore, some researchers presented sequential techniques in which fusion is performed to obtain better information for an entire video sequence. Afza et al. [24] presented a parallel fusion approach named length control features (LCF) and achieved improved performance. Xiaog et al. [25] presented a two-stream CNN model for action recognition, focusing on both optical-flow-based generated images and temporal actions for better recognition. Elharrous et al. [26] presented a combined model for both action classification and summarization; they initially extract the human silhouette and then extract shape and temporal features. In summary, these methods achieve improved results, but they did not focus on computational time. The major challenges handled in this work are: i) changes in human appearance and viewpoint due to camera conditions, the camera being static or dynamic; ii) consumption of more time during the training process; and iii) selection of the most relevant features, which is a key problem for minimizing the error rate of an automated system.
In this article, we propose a fusion-based framework along with a feature reduction scheme for better computational time and improved accuracy. Our major contributions are the CCA-based fusion of features from two ResNet-50 layers and a Shannon-entropy-based selection of the best features. The rest of the manuscript is organized as follows. The proposed methodology, including frame normalization, deep learning features, feature fusion, and selection, is presented in Section 2. Results are presented in Section 3, followed by the conclusion in Section 4.

Proposed Methodology
A unified model is proposed in this article for HAR, built on the fusion of features from multiple deep CNN layers. Five core steps are involved in this work: database normalization; CNN feature extraction from two successive layers; fusion of the information of both layers (AVG-Pool and FC1000); selection of the best features; and finally classification through a supervised learning method. The proposed method is evaluated on five datasets and compared with existing techniques. The architecture of the proposed work is illustrated in Fig. 1.

Preprocessing
The main objective of preprocessing is to improve the image statistics: it suppresses unwanted distortions and enhances image features that are essential for further processing.
In this work, we perform normalization of the video frames. Normalization is a procedure frequently applied as a major part of data preparation for machine learning. The normalization process comprises global contrast normalization (GCN), local normalization, and histogram equalization [27].

GCN:
In Global Contrast Normalization (GCN), the mean intensity is subtracted from each pixel value of an image, and the result is divided by the standard deviation. The main objective of GCN is to prevent images from possessing differing amounts of contrast. Images with very little (but non-zero) contrast carry fewer details and become problematic for action recognition. GCN can be described as:

X'(m, n, l) = (X(m, n, l) - Z) / s,

where m represents a row, n represents a column, l is the color depth, Z is the mean intensity of the full image, and s is its standard deviation. The local contrast is then improved by employing bottom-hat filtering and a log transformation. In the end, the local and global outputs are combined in one matrix to form the final enhanced image. These enhanced images are used for training the deep learning model in further processing.
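The GCN step above can be sketched in a few lines of NumPy. This is an illustrative reconstruction under the stated definition (subtract the image mean, divide by the standard deviation); the paper does not specify any additional scaling constants, so none are used here.

```python
import numpy as np

def global_contrast_normalize(image, eps=1e-8):
    """Subtract the mean intensity Z of the whole image from every pixel
    and divide by the standard deviation, giving frames comparable
    contrast before training. Illustrative sketch of the paper's GCN."""
    image = image.astype(np.float64)
    Z = image.mean()                       # mean intensity of the full image
    centered = image - Z
    std = np.sqrt((centered ** 2).mean())  # standard deviation over all pixels
    return centered / max(std, eps)        # eps guards against zero contrast

frame = np.random.randint(0, 256, size=(224, 224, 3))
normalized = global_contrast_normalize(frame)
# normalized now has zero mean and unit standard deviation
```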

Convolutional Neural Network
A simple CNN model consists of the following layers: convolution layer, pooling layer, ReLU layer, batch normalization layer, fully connected layer, and output layer. The details of each layer are as follows. The convolutional layer receives a volume of size M1 × G1 × D1. This layer needs four hyperparameters: the number of filters C, their spatial extent E, the stride S, and the amount of zero padding P. It generates a volume of size M2 × G2 × D2, where

M2 = (M1 - E + 2P)/S + 1, G2 = (G1 - E + 2P)/S + 1, D2 = C,

and with parameter sharing it introduces E · E · D1 weights per filter. The next layer is the pooling layer. The pooling layer's task is to reduce the spatial dimensions of the image, minimizing the number of variables and calculations within the network; it thus controls overfitting between layers. The pooling layer takes a volume of size M1 × G1 × D1 and needs two hyperparameters: the stride S and the spatial extent E. It generates a volume of size M2 × G2 × D2, defined as:

M2 = (M1 - E)/S + 1, G2 = (G1 - E)/S + 1, D2 = D1.

The next layer is the ReLU activation layer, a kind of activation function described mathematically as z = max(0, y); negative values are converted to zero. In a fully connected layer, the calculation of every component of the output y^(l+1) (z is an alias for y^(l+1)) requires every component of the input y^l; this layer is also known as the feature layer. The last layer is softmax, which is used for classification; it normalizes the outputs into class probabilities for the final decision.
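The output-size arithmetic above can be checked with a small helper. The example values (a 7×7 convolution with 64 filters, stride 2, padding 3 on a 224 × 224 × 3 input) are chosen to match ResNet-50's first convolution; they are illustrative, not taken from the paper's tables.

```python
def conv_output_size(M1, G1, D1, C, E, S, P):
    """Output volume of a convolutional layer with C filters of spatial
    extent E, stride S and zero padding P, per the formulas in the text."""
    M2 = (M1 - E + 2 * P) // S + 1
    G2 = (G1 - E + 2 * P) // S + 1
    return M2, G2, C  # output depth equals the number of filters

def pool_output_size(M1, G1, D1, E, S):
    """Output volume of a pooling layer with extent E and stride S;
    the depth is unchanged."""
    return (M1 - E) // S + 1, (G1 - E) // S + 1, D1

# ResNet-50's first convolution: 224x224x3 input, 64 filters, E=7, S=2, P=3
print(conv_output_size(224, 224, 3, 64, 7, 2, 3))  # (112, 112, 64)
# A 2x2 pooling with stride 2 then halves the spatial size
print(pool_output_size(112, 112, 64, 2, 2))        # (56, 56, 64)
```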

Deep Learning Features
Deep learning has shown great success in machine learning in the last few years for several video surveillance and medical imaging applications. Video surveillance is a hot research area, but researchers face major issues in the form of imbalanced and very large datasets. In this article, we use a pre-trained deep learning model named ResNet-50. Common ResNet models skip over double or triple layers comprising ReLU and batch normalization [28]. The incentive for skipping over layers is to attach the output of a previous layer to the next coming layer, which helps reduce the vanishing gradient problem. Skipping also simplifies the network, since fewer layers are effectively used in the initial training stages; as a result, it accelerates learning. As the network learns the feature space, it gradually reinstates the skipped layers. A neural network without the residual portion must explore a larger feature space, which makes it more fragile. ResNet-50 is a convolutional neural network (CNN) originally trained on the ImageNet dataset, which consists of over a million images from 1000 classes. The depth of this network is 50 layers, and the input size is 224 × 224 × 3. The original and selected layers are presented in Tabs. 1 and 2. We modified the ResNet-50 architecture according to the number of classes: the last layer was removed and replaced with a new layer matching the number of action classes. The modified model was then trained through transfer learning on 70% of the data, with the remaining 30% used for testing. In the training process, the number of epochs is 100, the learning rate is 0.0001, and the mini-batch size is 64. The input size is the same as that of the original deep model. Features are extracted from the last two feature layers, named the Average Pool layer and the Fully Connected layer.
The dimensions of the extracted features are N × 2048 and N × 1000, respectively. In the later stage, the features of both vectors are fused into one vector for further processing.

Features Fusion Using CCA
In multivariate statistical analysis, CCA has similar significance to principal component analysis (PCA) and linear discriminant analysis (LDA), and it is an important multi-data processing technique. CCA is conventionally utilized for analyzing the associations between two groups of variables [29]. It tries to find transformations of two groups of random variables that attain the highest correlation across the two groups of data, while the transformations within each group are uncorrelated. Mathematically, it is formulated as follows. Given two groups of data Z1 ∈ R^(m×p) and Z2 ∈ R^(m×q), CCA finds the linear combinations Z1V1 and Z2V2 that maximize the pair-wise correlations across the two groups. E1, E2 ∈ R^(m×b), with b ≤ min(rank(Z1), rank(Z2)), are known as the canonical variables, and V1 ∈ R^(p×b) and V2 ∈ R^(q×b) are the canonical vectors. Equivalently, the procedure first discovers an initial pair of canonical vectors v1^(1) ∈ R^(p×1) and v2^(1) ∈ R^(q×1) that maximize the correlation of the linear combinations of the two groups of data.
The initial pair of canonical variables is then obtained as e1^(1) = Z1 v1^(1) and e2^(1) = Z2 v2^(1). The remaining b − 1 canonical variables are computed by the same method. By applying further orthogonality restrictions on the columns e^(i) of every group of data, the canonical variables are uncorrelated and possess zero mean and unit variance.
The CCA problem may be posed as an optimization problem using Lagrange multipliers, and the canonical covariates can be computed by solving a generalized eigenvalue problem [29]. Here the columns of V1 and V2 are the eigenvectors of the two matrices S_Z1^(-1) S_Z1,Z2 S_Z2^(-1) S_Z2,Z1 and S_Z2^(-1) S_Z2,Z1 S_Z1^(-1) S_Z1,Z2, where S_Z1,Z2 is the cross-correlation matrix of Z1 and Z2 (S_Z2,Z1 = S_Z1,Z2^T), and S_Z1 and S_Z2 are the autocorrelation matrices of Z1 and Z2, respectively. Hence, the final fusion yields a feature matrix of size N × K, where K represents the number of fused features, which is 1876 in this work, and N represents the sample frames used for training and testing.

Feature Selection
Feature selection is the process of selecting the best features for accurate classification within less execution time. In this work, a new threshold function based on Shannon entropy is proposed to select the best features. Initially, we compute the entropy of the fused vector by the following equation:

H = - Σ_{i=1}^{N} p_i log2(p_i),

where N represents the total number of features, p_i is the probability of each feature, and i represents the index of each feature in the fused vector. We then apply a threshold function to select the best features; the criterion for selection is a fitness function, namely Fine KNN. We initialize 20 iterations, and after each iteration the selected features are evaluated through the fitness function. In the end, the feature set with the highest accuracy is selected for the final classification. Mathematically, the threshold function keeps a feature when its entropy value exceeds the threshold and discards it otherwise. The final selected features are passed to supervised learning classifiers for final classification.
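The selection loop can be sketched as follows. This is a hedged reconstruction: the entropy scores use a histogram estimate of each feature's probabilities, the threshold rule (keep features whose entropy exceeds the mean score) is an assumption since the paper's exact function is not given, and KNN with one neighbor stands in for the Fine KNN fitness function.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def shannon_entropy(column, bins=16):
    """H = -sum(p_i * log2(p_i)) with p_i estimated from a histogram."""
    hist, _ = np.histogram(column, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Synthetic stand-in for the fused feature matrix and its labels.
X, y = make_classification(n_samples=200, n_features=40, n_informative=10,
                           random_state=0)

scores = np.array([shannon_entropy(X[:, i]) for i in range(X.shape[1])])
threshold = scores.mean()            # assumed threshold rule
selected = scores >= threshold       # keep high-entropy features

# Fitness function: KNN with one neighbor, a stand-in for Fine KNN.
fitness = cross_val_score(KNeighborsClassifier(n_neighbors=1),
                          X[:, selected], y, cv=5).mean()
print(selected.sum(), round(fitness, 3))
```

In the paper this evaluation is repeated for 20 iterations and the highest-accuracy subset is kept; the sketch shows a single pass.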

Experimental Results and Discussion
The proposed method is evaluated on four selected datasets: IXMAS, UCF Sports, UT-Interaction, and KTH. Each classifier's performance is measured through the following parameters: recall rate, precision rate, accuracy, FNR, and testing time.

IXMAS Dataset
The recognition accuracy of the proposed method on the IXMAS dataset is presented in Tab. 3. Six different classifiers are utilized, and the best one is selected based on accuracy. From this table, the highest accuracy of 89.6% is achieved by Fine KNN, with a recall rate of 89.58%, a precision rate of 89.75%, an FNR of 10.42%, and a classification time of 277.97 s. The next highest accuracy of 87.8% is attained by Cubic SVM. The minimum accuracy of 79.8% is achieved by Weighted KNN, along with the best recognition time of 194.89 s. The best accuracy of FKNN is confirmed by the confusion matrix given in Tab. 4. Besides, the computation time of each classifier is plotted in Fig. 2; as shown in this figure, WKNN executes fastest among the classifiers.
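The measures reported in the tables (accuracy, precision rate, recall rate, FNR) can all be derived from a confusion matrix. The sketch below uses a small synthetic 3-class matrix, not the paper's Tab. 4, and assumes macro-averaging over classes and FNR = 1 − recall, which matches the reported recall/FNR pairs (e.g., 89.58% and 10.42%).

```python
import numpy as np

def evaluate(cm):
    """Accuracy, macro precision, macro recall and FNR from a confusion
    matrix whose rows are actual classes and columns predicted classes."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = (tp / cm.sum(axis=0)).mean()  # column sums: predicted counts
    recall = (tp / cm.sum(axis=1)).mean()     # row sums: actual counts
    accuracy = tp.sum() / cm.sum()
    fnr = 1.0 - recall                        # FNR as the complement of recall
    return accuracy, precision, recall, fnr

# Synthetic 3-class confusion matrix for illustration only.
cm = [[45, 3, 2],
      [4, 40, 6],
      [1, 2, 47]]
acc, prec, rec, fnr = evaluate(cm)
print(round(acc, 3), round(rec, 3), round(fnr, 3))  # 0.88 0.88 0.12
```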

UCF Sports Dataset
The recognition accuracy of the proposed method on the UCF Sports dataset is presented in Tab. 5. Six different classifiers are used, and the best one is selected based on accuracy. From this table, the highest accuracy of 99.7% is achieved by Linear Discriminant, with a recall rate of 99.76%, a precision rate of 99.76%, an FNR of 0.24%, and a classification time of 49.143 s. The next highest accuracy of 99.2% is achieved by Quadratic SVM and Cubic SVM. The minimum accuracy of 93.5% is achieved by Weighted KNN, along with the best recognition time of 16.524 s. The best accuracy of LDA is further confirmed by the confusion matrix presented in Tab. 6. Besides, the computation time of each classifier is plotted in Fig. 3, which illustrates that WKNN executes fastest among the classifiers.

UT Interaction Dataset
The recognition accuracy of the proposed method on the UT-Interaction dataset is presented in Tab. 7. Six different classifiers are used, and the best one is selected based on accuracy. From this table, the highest accuracy of 96.7% is achieved by Fine KNN, with a recall rate of 97%, a precision rate of 96.66%, and an FNR of 3%. The next highest accuracy of 96.5% is attained by the LDA classifier. The minimum accuracy of 91.2% is achieved by Weighted KNN, along with the best recognition time of 14.604 s. The best accuracy of FKNN is confirmed by the confusion matrix given in Tab. 8. Also, the computation time of each classifier is plotted in Fig. 4; as shown in this figure, WKNN is computationally faster than the rest of the classifiers.

KTH Dataset
The recognition accuracy of the proposed method on the KTH dataset is shown in Tab. 9. Six different classifiers are used, and the best one is selected based on accuracy. From this table, the highest accuracy of 96.6% is achieved by FKNN, with a recall rate of 96.5%, a precision rate of 96.5%, an FNR of 3.5%, and a classification time of 497.09 s. The next highest accuracy of 96.0% is attained by Quadratic SVM. The minimum accuracy of 91.7% is achieved by the Weighted KNN classifier. The accuracy of Fine KNN is further confirmed by the confusion matrix given in Tab. 10. Also, the computation time of each classifier is plotted in Fig. 5; from this figure, it is noted that the WKNN classifier executes fastest among the listed classifiers.

Conclusion
A new deep-learning-based method for the recognition of human actions is presented in this work. The proposed method consists of a few important steps. In the first step, pre-processing is applied, and video frames are resized according to the input of the target model. In the next step, the pre-trained ResNet-50 model is trained using transfer learning (TL). Employing TL, features are extracted from two successive layers and fused using canonical correlation analysis (CCA). Since the fused feature vector contains irrelevant information, the best features are selected using the Shannon entropy approach. Finally, the selected features are classified using supervised learning classifiers, and the best of them is selected based on accuracy. Several well-known datasets are used to evaluate the proposed method, on which it achieves remarkable accuracy. Based on the accuracy, we conclude that features extracted through deep learning give better results when handling large-scale datasets. It is also noted that merging multi-layer features produces better results, although this step affects the efficiency of the system; the selection process therefore improves accuracy while also minimizing the overall time. In future studies, more complex datasets such as HMDB51 and UCF101 will be considered to evaluate the proposed method.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.