Study of Human Motion Recognition Algorithm Based on Multichannel 3D Convolutional Neural Network



Introduction
Human motion recognition technology based on computer vision is widely used in robotics, video surveillance, virtual reality, and other fields [1,2]. The methods for solving the human action recognition problem fall mainly into traditional algorithms and recognition algorithms based on deep learning [3,4]. Traditional algorithms recognize human behaviour through "feature extraction and expression + feature matching", whereas recognition algorithms based on deep learning learn object characteristics through neural networks and directly output the final recognition results [5].
Traditional algorithms recognize human behaviour by analysing its inherent characteristics, including motion information features, spatiotemporal interest points, and geometric features. Among them, spatiotemporal interest points are strongly robust to illumination changes, background differences, and environmental noise, express features more adequately, and achieve the highest recognition rate. Commonly used spatiotemporal interest point descriptors include 3D-SIFT [6], HOG3D [7], and ESURF [8]. However, this type of feature is complicated to extract and time-consuming to match, so it is difficult to meet real-time requirements in practical applications. In contrast, MHI [9], MEI [10], and similar descriptors need only simple processing such as image differencing to extract geometric features. However, the conventional way of using such descriptors for feature expression often compresses the feature dimensions. Although feature extraction becomes faster, the recognition rate is greatly reduced because temporal and spatial information is lost. In recent years, deep learning methods have emerged, and the features obtained through neural networks describe behaviour characteristics more abstractly and comprehensively [11,12]. The convolutional neural network is the most widely used because the shared weights of its convolutional layers effectively reduce the number of parameters to be trained.
In view of the excellent feature expression capabilities of deep neural networks, this paper uses convolutional neural networks instead of traditional feature descriptors as the feature expression method to ensure the accuracy of the algorithm. However, directly using the original video as the input of the neural network slows the algorithm down. To preserve speed, this paper compresses the optical flow pictures and RGB pictures with a compressed sensing random projection matrix, which effectively reduces the complexity of the model; the compression method is simple and efficient, which helps ensure the real-time performance of the algorithm. The compressed data are input into the network, more time dimension features are obtained through six layers of 3D convolution and four layers of 3D pooling, and the adaptability of the model is increased through the dropout layer. All the information is then input to the fully connected layer and the softmax classifier to obtain the final output. Experimental verification on public datasets shows that the accuracy of the algorithm in this paper is improved.

Related Knowledge
With the production and development of various smart devices, in addition to paying more attention to smart machines and equipment, human beings also pay more and more attention to the recognition of human movements. At present, the study of human action behaviour has gradually become one of the important research issues in the field of machine vision.
Literature [13] proposed a detection method based on geometric constraints for extracting action targets. It relies on the principle that background corner points corresponding across successive frames of a video sequence satisfy geometric constraints.
This method first extracts the Harris corner points of the previous frame and then uses the pyramidal Lucas-Kanade optical flow method to obtain the corresponding points in the next frame. The fundamental matrix is estimated by a random sampling algorithm, and the background corner points and foreground corner points are identified. Finally, the moving target region is identified. This method can detect multiple moving targets with high precision against a dynamic background, but the number of image frames processed per second is relatively low. Literature [14] proposed a new spatiotemporal interest point detector.
This method takes the detected spatiotemporal interest points as the centre and establishes a spatiotemporal gradient descriptor based on a polyhedron model to further characterize the visual characteristics of human actions in time and space. The bag-of-words method is then used to build a more effective codebook for the observed action features. Finally, the feature descriptor is combined with manually defined high-level action attributes. An implicit support vector machine and the coordinate descent method are used to find a locally optimal solution of the final recognition model. This method has a high recognition rate, but the feature description is too complicated. The good results of deep learning in image recognition have made deep learning methods, especially convolutional neural networks, widely researched and applied in the field of computer vision. Compared with static images, video sequences have not only appearance information but also motion information [15,16]. Therefore, some recent studies have begun to design action recognition models based on convolutional neural networks that can effectively utilize the appearance and motion information of video sequences. Amin et al. [17] studied and compared three widely used CNN connection methods, namely, late fusion, early fusion, and slow fusion. The experimental results show that none of these methods can make full use of motion information, and each yields only a moderate improvement over a single frame. The dual-stream convolutional neural network proposed by Liao et al. [18] trains a second CNN stream on the optical flow of the video frames to compensate for the defect that the stacked RGB stream cannot make full use of temporal information. This brings a certain performance gain. Al-Hammadi et al. [19] built a three-dimensional convolutional neural network (3DCNN), took the video sequence as the input of the network, and successfully extracted the spatiotemporal information of the behaviour sequence. Saleem [20] obtained temporal information and spatial information separately through a two-stream network and then merged them into behaviour feature descriptors. The premise of the above methods is that the input video sequence needs to cover the entire behaviour sequence. An increase in the action duration deepens the network structure and reduces the real-time performance. Improved dense trajectories (IDT) [21] uses the dense trajectory algorithm. IDT with higher-dimensional encoding (IDTHDE) [22] improves the visual bag-of-words model and fuses multidimensional features to achieve higher-dimensional feature encoding. These two action recognition methods are more traditional methods based on manually designed features. Two-stream (TS) [23] recognizes actions by constructing a two-stream spatiotemporal network model, but the network is relatively shallow. Very deep two-stream (DTS) [24] has a very deep network architecture and improves the network in depth. The KVMF algorithm [25] extracts multiple 3D volumes from the video segment as the input of the network and uses the prediction vector obtained from each volume to represent the category probability of the action.
In recent years, the emerging sparse representation theory has received widespread attention in the fields of machine learning, machine vision, and pattern recognition, providing a theoretical basis for signal classification based on sparse representation [26,27]. Gan et al. [28] proposed a sparse representation based classifier (SRC) built on sparse representation theory. The basic idea is to transform the pattern recognition problem into a signal sparse representation problem without considering other types of sample data. Yan et al. [29] proposed sparse representation classification with random projection (SRC-RP), which performs sparse representation classification on sample data compressed by random projections in order to reduce the energy cost of sensor nodes and improve the action recognition rate. Díaz et al. [30] proposed a fast sparse representation algorithm based on prototype representation, which uses the K-SVD algorithm to construct a small overcomplete dictionary that satisfies the sparse representation condition and converts the sparse representation problem into an l1-norm minimization problem. He et al. [31] proposed a fast sparse representation classification algorithm, which aims to solve the problem of the high computational complexity of the traditional SRC algorithm. The basic idea is to use the K-nearest neighbour (KNN) algorithm to find a smaller training sample set for sparsely representing the test samples, in an attempt to improve the recognition rate while reducing the complexity of the algorithm. Building on this work, this paper combines the random projection algorithm with a deep neural network and proposes an action recognition algorithm based on random projection. Experimental verification on public datasets shows that the accuracy of the algorithm in this paper is improved.

Adaptive Spectral Clustering Algorithm
In this paper, firstly, the convolutional neural network is used instead of the traditional feature descriptor to ensure the accuracy of the algorithm. Secondly, the data of optical flow images and RGB images are compressed based on the compressed sensing random projection matrix, which effectively reduces the complexity of the model. Then, the compressed data are input into the network, more time dimension features are obtained through six layers of 3D convolution and four layers of 3D pooling, and the adaptability of the model is increased through the dropout layer. Finally, all the information is passed to the fully connected layer and the SoftMax classifier to get the final output.

Algorithm Steps.
The basic process of behaviour recognition in this paper is shown in Figure 1. First, the random projection matrix is used to reconstruct the sample set used for training, which reduces the interference of training samples of irrelevant categories, improves the accuracy of action recognition, and reduces the computational cost of classification at the same time.
Then, the training set is used to train the improved convolutional neural network of this paper and obtain the model with its parameters. Finally, the model is used to make predictions on the test set.
Step 1: initialize the training sample set $x = [x_1, x_2, \ldots, x_n]$, the test sample $y$, and the random projection matrix $m$ of each sensor node
Step 2: construct the training sample set and the test sample set
Step 3: normalize each column vector in the training sample set
Step 4: send the RGB pictures compressed by the random projection matrix and their corresponding optical flow picture information into the 3D network, and obtain the action category

Random Projection.
Random projection (RP) is a data compression method that uses a random projection matrix to reduce the dimensionality of high-dimensional data and has the advantage of low computational complexity [32-34].
Assuming that the human motion signal collected by the sensor node is $s \in \mathbb{R}^{n}$, a random projection matrix $m \in \mathbb{R}^{d \times n}$ projects the original signal $s$ into a $d$-dimensional subspace; namely,
$$c_m = m \times s. \quad (1)$$
In the formula, $c_m \in \mathbb{R}^{d}$ is the projected compressed data, and its computational complexity is only $O(dn)$.
In the practical application of projection matrix, the selection of appropriate human motion signals is very important [35,36]. In order to accurately reconstruct and restore the original signal s from the d-dimensional measurement value cm with high probability, the projection matrix m must meet the restricted isometry property (RIP).
That is, for any signal $s$, there is a RIP constant $a \in (0, 1)$ such that the following formula holds:
$$(1 - a)\,\|s\|_2^2 \le \|m' s\|_2^2 \le (1 + a)\,\|s\|_2^2. \quad (2)$$
In the formula, $m'$ is the submatrix formed by the relevant columns of $m$ indicated by the index.
In order to effectively and reasonably use the limited computing resources of the system, a sparse binary matrix that satisfies the properties of RIP is selected as a random projection matrix for data compression, which provides reliable data for the subsequent construction of action recognition and classification algorithms.
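The compression step with a sparse binary matrix can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the matrix sizes, the fixed number of ones per column, and the absence of any scaling factor are all hypothetical choices, and the sketch does not check the RIP condition.

```python
import random


def sparse_binary_matrix(d, n, ones_per_column, seed=0):
    """Build a d x n 0/1 matrix with exactly `ones_per_column` ones in every column."""
    rng = random.Random(seed)
    matrix = [[0] * n for _ in range(d)]
    for col in range(n):
        # Pick distinct rows for the nonzero entries of this column.
        for row in rng.sample(range(d), ones_per_column):
            matrix[row][col] = 1
    return matrix


def project(matrix, signal):
    """Compute c = m x s; only the nonzero entries of each row contribute."""
    return [sum(signal[j] for j, bit in enumerate(row) if bit) for row in matrix]


# Hypothetical sizes: an n = 12 sample signal compressed to d = 4 measurements.
n, d = 12, 4
s = [float(i) for i in range(n)]
m = sparse_binary_matrix(d, n, ones_per_column=2)
c = project(m, s)
print(f"compressed {len(s)} values into {len(c)}: {c}")
```

Because every entry is 0 or 1, the projection needs only additions, and fewer ones per column means fewer additions per measurement, which is the resource saving the text attributes to the sparse binary choice.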
The key to compressing data based on random projections is to find as few training samples as possible to optimally reconstruct the test samples linearly. First, on sensor node $i$, the random projection matrix $m_i$ is used to randomly project the action vector $v_{ij}$ to be recognized. Here, $m_i$ is the random projection matrix $m$ on sensor node $i$.
In the remote data processing centre, the motion vector to be recognized is composed of the projected vectors from the k sensor nodes, as given by formula (4). Denoting by m the diagonal matrix whose diagonal elements are the random projection matrices m_i, formula (4) can be expressed in matrix form. At the same time, the same random projection matrix is applied to the training sample set. If the sparse coefficient β is kept sparse enough, then, based on compressed sensing theory, the problem can be solved as an l1-norm minimization problem. The training sample set is reconstructed for the sparse representation of test samples, which reduces the interference of training samples of irrelevant categories, improves the accuracy of action recognition, and reduces the computational overhead of sparse representation classification.
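The multi-node projection can be illustrated with a small sketch. It assumes that the "diagonal matrix" in the text is block-diagonal, with one block per sensor node, and it uses two hypothetical nodes with tiny hand-written matrices; the helper names `matvec` and `block_diag` are illustrative only. The point is that projecting on each node and concatenating gives the same result as applying one block-diagonal matrix to the stacked vector.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def block_diag(blocks):
    """Place the given matrices along the diagonal of a larger zero matrix."""
    total_cols = sum(len(block[0]) for block in blocks)
    result, offset = [], 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            result.append([0] * offset + list(row) + [0] * (total_cols - offset - width))
        offset += width
    return result


# Two hypothetical sensor nodes, each with its own 2 x 3 projection matrix.
m1 = [[1, 0, 1], [0, 1, 0]]
m2 = [[0, 1, 1], [1, 0, 0]]
v1 = [1.0, 2.0, 3.0]
v2 = [4.0, 5.0, 6.0]

# Projection done separately on each node, then concatenated at the centre.
per_node = matvec(m1, v1) + matvec(m2, v2)
# The same result from one block-diagonal matrix applied to the stacked vector.
stacked = matvec(block_diag([m1, m2]), v1 + v2)
print(per_node == stacked)
```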

Improved Action Recognition Algorithm of 3D Convolutional Neural Network.
The core of the network constructed in this paper is a dual-stream 3D convolutional neural network, whose structure is shown in Figure 2. The algorithm is divided into three stages: data processing, feature extraction, and model fusion.
The input of each group of networks is divided into two streams. The spatial stream CNN input is 16 consecutive frames of RGB picture information compressed by the random projection matrix, and the temporal stream CNN input is the optical flow corresponding to those RGB pictures, also compressed by the random projection matrix. RGB pictures carry more static characteristics in the spatial domain, and optical flow pictures carry dynamic characteristics in the time domain. The RGB pictures and optical flow pictures are input to the improved convolutional neural network, spatial and temporal features are extracted at the same time, and abstract high-dimensional features are obtained through the multilayer convolutional neural network to train and output the behaviour recognition model. In the test, the outputs of the SoftMax layers of the 3DCNNs trained separately on the RGB pictures and the optical flow pictures are weighted and fused to obtain the output of the first model fusion, as given by formula (7). In formula (7), I is the feature vector, α is a constant greater than 0 and less than 1, and n is the number of video frames.
The above experiment is repeated to obtain two weighted fusion models. When testing, the outputs of the SoftMax layers of the two models are fused by averaging to calculate the final probability, and the recognition rate is computed as shown in the following equation:
$$p = \frac{n}{N}. \quad (8)$$
In formula (8), p is the final recognition rate, N is the number of videos tested, and n is the number of videos whose top-1 category label is the same as the real label of the input video.
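The two fusion stages and the recognition rate can be sketched as below. The exact form of the weighted fusion is not recoverable from the text, so this sketch assumes a convex combination of the two SoftMax outputs with weight α; the function names and the example probabilities are hypothetical.

```python
def weighted_fusion(p_rgb, p_flow, alpha):
    """Assumed first fusion: alpha * RGB stream + (1 - alpha) * optical flow stream."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(p_rgb, p_flow)]


def average_fusion(p_first, p_second):
    """Second fusion: plain average of the two fused models' outputs."""
    return [(a + b) / 2 for a, b in zip(p_first, p_second)]


def top1(probabilities):
    """Index of the most probable class."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def recognition_rate(predicted, actual):
    """p = n / N: share of videos whose top-1 label matches the real label."""
    correct = sum(1 for p, y in zip(predicted, actual) if p == y)
    return correct / len(actual)


# Hypothetical SoftMax outputs over three classes for one test video.
model_a = weighted_fusion([0.6, 0.3, 0.1], [0.2, 0.7, 0.1], alpha=0.5)
model_b = weighted_fusion([0.5, 0.4, 0.1], [0.1, 0.8, 0.1], alpha=0.5)
final = average_fusion(model_a, model_b)
print(top1(final), recognition_rate([1, 0, 2, 2], [1, 0, 2, 1]))
```

Keeping α strictly between 0 and 1, as the text requires, means neither stream is ignored and the fused vector still sums to 1 when both inputs do.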
The improved neural network model in this paper is shown in Figure 3, where C represents a convolutional layer and P represents a pooling layer. The AlexNet network has an outstanding classification effect and moderate model complexity, which can meet the real-time requirements of this paper. To further improve the speed of the algorithm, this paper partly optimizes AlexNet. The model has six 3D convolutional layers, four 3D pooling layers, a dropout layer with a ratio of 0.5, two fully connected layers, and a SoftMax layer. The size of the 3D convolution kernel is 3 × 3 × 3, where 3 × 3 is the spatial dimension and 3 is the time dimension. The stride is 1, and the padding is 1. From the first convolutional layer to the sixth, the numbers of convolution kernels are 64, 128, 256, 256, 512, and 512, respectively. Each convolutional layer is followed by a pooling layer. All the pooling in this network structure is max pooling, which can effectively eliminate the estimated mean deviation caused by parameter errors of the convolutional layer. After the dropout layer, the SoftMax layer is used for classification, and the behaviour category output is obtained. Because the feature enhancement effect of the LRN layer is random, it can improve the recognition rate for some tasks but, for others, it causes feature distortion and reduces the recognition rate. Therefore, this paper eliminates the LRN layer in the implementation process.
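As a rough check on the convolutional configuration above, the following sketch computes the output size and the parameter count of each 3D convolutional layer. It assumes a 3-channel input to the first layer and one bias per kernel; neither detail is stated in the text, and the pooling and fully connected layers are left out because their sizes are not given.

```python
def conv3d_output_size(size, kernel=3, stride=1, padding=1):
    """Output length along one axis: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def conv3d_params(in_channels, out_channels, kernel=3):
    """Weights of a kernel^3 filter over all input channels, plus one bias per filter."""
    return (kernel ** 3 * in_channels + 1) * out_channels


filters = [64, 128, 256, 256, 512, 512]  # kernel counts listed in the text
in_channels = 3                          # assumption: 3-channel RGB input
total = 0
for index, out_channels in enumerate(filters, start=1):
    count = conv3d_params(in_channels, out_channels)
    total += count
    print(f"conv{index}: {in_channels} -> {out_channels} channels, {count} parameters")
    in_channels = out_channels

print("convolutional parameters in total:", total)
print("16 frames after one convolution:", conv3d_output_size(16))
```

With kernel 3, stride 1, and padding 1, a convolution leaves the temporal and spatial sizes unchanged, so any reduction in resolution comes from the pooling layers.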
In order to further accelerate the convergence of the neural network, this paper uses dropout for optimization. Dropout refers to discarding hidden nodes in the neural network with a certain probability during training, inactivating some neurons, reducing the complexity of the model, and effectively inhibiting overfitting of the neural network.
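The dropout operation can be sketched as follows. This is a generic illustration that assumes the common "inverted" variant, in which surviving activations are rescaled during training so that no change is needed at test time; the text does not say which variant is used, and the function name and inputs are hypothetical.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero each activation with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        # At test time every neuron stays active and nothing is rescaled.
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [1.0] * 8
train_out = dropout(hidden, rate=0.5, rng=random.Random(0))
test_out = dropout(hidden, rate=0.5, training=False)
print(train_out)
print(test_out)
```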

Experimental Data and Environment.
The training of a deep convolutional neural network places very high requirements on the configuration of the experimental platform. The processor used in this paper is an Intel(R) Core(TM) i7-5930K CPU at 3.50 GHz; the installed memory is 16.0 GB; the system is 64-bit Windows 7; the graphics card is a GTX1080; and the development tools are Caffe and VS2013. The experimental data in this paper are the public action recognition datasets UCF101, WARD, and HMDB51. The UCF101 dataset contains 101 types of actions, with a total of 13,320 video segments. The UCF101 dataset is composed of network videos shot in an unconstrained real environment.
The video frame resolution is relatively low, and the videos contain different lighting conditions, partial occlusion, and camera movement. The dataset divides actions into five types:
(a) Human-human interaction, such as hair cutting, head massage, and 5 other categories
(b) Playing musical instruments, such as flute, violin, and 10 other categories
(c) Body motion only, such as blowing candles, playing Tai Chi, and 16 other categories
(d) Human-object interaction, such as blowing hair, cutting vegetables, and 20 other categories
(e) Sports, such as playing billiards, breaststroke, and 50 other categories

The HMDB51 dataset contains 51 types of actions and a total of 6,849 video segments. Most of the HMDB51 dataset comes from movie clips, and a small part comes from video sites such as YouTube. Similarly, HMDB51 is also divided into five types:
(a) Facial expressions interacting with objects, such as smoking, drinking, and 3 other categories
(b) General facial motions, such as smiling, speaking, and 4 other categories
(c) Human-to-human interaction body movements, such as hugs, kisses, and 7 other categories
(d) Human-to-thing interaction body movements, such as drawing swords, horse riding, and 18 other categories
(e) General body movements, such as applause, and 19 other categories such as handstand

The WARD database places wearable multisensor nodes on the waist, left wrist, right wrist, left ankle, and right ankle of the human body and collects data from 20 subjects performing 13 kinds of movement in the natural state: standing, sitting, lying, walking forward, walking counterclockwise, walking clockwise, turning left, turning right, going upstairs, going downstairs, jogging, jumping, and pushing a wheelchair.
As the evaluation indicator, we selected the recognition accuracy rate, which is the most commonly used evaluation index. The calculation formula is
$$\text{Accuracy} = \frac{C_m}{N_{all}}.$$
In the above formula, $C_m$ represents the number of correct recognitions, and $N_{all}$ represents the total number of samples recognized.

Analysis of the Impact of LRN on the Effect of Behaviour Recognition.
In the training phase, each batch contains 50 training images. Through repeated trials, the base learning rate is set to 0.0001, the learning rate decay coefficient to 0.9, the learning rate decay period to 1,000 iterations, and the maximum number of iterations to 25,000. On the HMDB51 and UCF101 datasets, this paper trains according to the above training strategy and compares the recognition results with and without the LRN layer. Figures 4 and 5 show, respectively, the training and test sample losses and the recognition accuracy as functions of the number of iterations.
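The learning rate schedule implied by these settings can be sketched as below. It assumes a step policy, in which the rate is multiplied by the decay coefficient once per decay period; the text gives the numbers but not the policy, and the function name is hypothetical.

```python
def step_learning_rate(iteration, base_lr=0.0001, gamma=0.9, step=1000):
    """Assumed step policy: multiply the base rate by gamma once every `step` iterations."""
    return base_lr * gamma ** (iteration // step)


for iteration in (0, 999, 1000, 5000, 25000):
    print(iteration, step_learning_rate(iteration))

final_lr = step_learning_rate(25000)
```

Under this assumption the rate has been reduced 25 times by the last iteration, so training ends at roughly 7% of the base learning rate.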
It can be seen from Figure 4 that including the LRN layer has little impact on the training sample loss or the test sample loss: the loss curves decay with the number of iterations in basically the same way. It can be seen from Figure 5 that, after the LRN layer is deleted, the accuracy on the test samples improves slightly. This is consistent with the observation that the LRN layer has a certain randomness in its effect on features. Therefore, this paper removes the LRN layer.

Analysis of Dropout's Influence on the Effect of Behaviour Recognition.
Similarly, following the above training strategy on the HMDB51 and UCF101 datasets, this paper further compares and analyses the influence of dropout on the training results of the neural networks, including training sample loss, test sample loss, and recognition accuracy, as shown in Figures 6 and 7. The dropout ratio is set to 0.5. Figure 6 shows that the training and test sample losses of the networks with and without dropout follow basically the same decreasing trend as the number of iterations grows. However, it can also be seen from Figure 6 that, at the beginning of training, the network without dropout fluctuates very noticeably, whereas after dropout is added the training sample loss decreases smoothly. This is because dropout reduces the complexity of the model by inactivating some neurons and suppresses overfitting of the neural network to a certain extent. Figure 7 shows that dropout improves the accuracy of behaviour recognition by about 1%, which is a relatively small effect. This is because the recognition rate already exceeds 90% without dropout, leaving little room for improvement.

Performance Analysis of Different Optimization Functions.
In this experiment, with all other conditions kept the same, the recognition results of models built with different optimization functions are compared. The optimization functions are SGD, Adam, and RMSProp. The experimental results are shown in Figure 8 and Tables 1 and 2. In Figure 8, the abscissa represents the number of iterations, and the ordinate represents the accuracy rate; the lines with different legends represent different optimization functions. It can be seen from Tables 1 and 2 that the model using the Adam optimization function has the highest recognition rate, with accuracy rates of 0.936 and 0.916.
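For readers unfamiliar with the optimizers being compared, the standard Adam update for a single scalar parameter can be sketched as follows. The hyperparameter defaults are the commonly used ones, not values reported in this paper, and the quadratic objective is a hypothetical toy problem.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are running averages of the gradient and its square."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# The first step moves by about lr regardless of the gradient's magnitude.
first_theta, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, t=1, lr=0.1)

# Ten steps on the toy objective f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 11):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(first_theta, theta)
```

Plain SGD would instead subtract the learning rate times the raw gradient, and RMSProp keeps only the squared-gradient average without the momentum term.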

Performance Analysis of Action Recognition.
Under the same experimental conditions, three different random projection matrices are selected, and the changes in the action recognition rate under different compression ratios are shown in Figure 9. It can be seen that, for the different random projection matrices, the action recognition rate of this algorithm increases as the data compression ratio increases, and the three random matrices obtain almost the same action recognition rate. However, a sparse binary random matrix can further reduce computing resources and power consumption by reducing the number of nonzero elements in each column of the matrix. The above results show that the sparse binary random matrix helps reduce the computational complexity of the algorithm and improve the action recognition rate. Therefore, this paper chooses a sparse binary matrix. Figure 10 compares the recognition accuracy of the method in this paper with that of more typical action recognition methods on the three datasets. On the HMDB51 dataset, the recognition rates of IDT, IDTHDE, TS, DTS, KVMF, and 3DCNN were 0.859, 0.879, 0.881, 0.912, 0.932, and 0.935, respectively. The method in this paper is the highest, at 0.943. IDT uses the dense trajectory algorithm. IDTHDE improves the visual bag-of-words model and integrates multidimensional features to achieve higher-dimensional feature coding. These two action recognition methods are more traditional methods based on manually designed features. TS recognizes actions by constructing a dual-stream spatiotemporal network model. It can be seen from Figure 10 that the improved 3D convolutional neural network proposed in this paper has better action recognition capabilities on the three datasets.

Conclusion
In this research, a method of human action recognition based on random projection is proposed. First, RGB pictures and the optical flow pictures obtained by the Lucas-Kanade algorithm are used as input. Secondly, the data of the optical flow pictures and RGB pictures are compressed based on the random projection matrix of compressed sensing, which effectively reduces power consumption; based on the randomly projected compressed data, the optimal linear representation for reconstructing training samples and test samples can be found effectively. A multichannel 3D convolutional neural network is proposed, and two model fusions are used in the test phase of the network to obtain the final recognition result. Compared with the traditional 3DCNN algorithm, the network constructed in this paper adds dropout in the high-level feature extraction stage of the 3D convolutional network to reduce overfitting during model training and speed up the training. In the output stage of the model, two fusion techniques are used to improve the stability and accuracy of the network model. Experimental results show that the algorithm in this paper not only achieves better recognition performance but also adapts better to human action recognition problems in complex scenes and has good application prospects.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.