A Multi-View Gait Recognition Method Using Deep Convolutional Neural
Network and Channel Attention Mechanism

In many existing multi-view gait recognition methods based on images or video sequences, gait sequences are superimposed and synthesized into energy-map-like templates. However, information may be lost during image synthesis and EMG signal capture, and factors such as period detection may introduce errors and reduce recognition accuracy. To address these problems, a multi-view gait recognition method using a deep convolutional neural network and a channel attention mechanism is proposed. Firstly, the sliding time window method is used to capture EMG signals. Then, the back-propagation learning algorithm is used to train each convolutional layer, which improves the learning ability of the convolutional neural network. Finally, the channel attention mechanism is integrated into the network to improve its ability to express gait features, and a classifier is used to classify gait. Experimental results on two public datasets, OULP and CASIA-B, show that the proposed method achieves recognition rates of 88.44% and 97.25% respectively. Comparative experiments show that the proposed method recognizes gait better than several other recent convolutional neural network methods. Therefore, the combination of convolutional neural network and channel attention mechanism is of great value for gait recognition.


Introduction
Human interaction with the external environment is multimodal. The enhancement of computers' ultra-large-scale parallel computing capabilities and the development of various sensor technologies have provided new research ideas for analyzing and understanding human behavior and intentions. Examples include the combination of deep learning [1][2][3] with voice signals, with vision-based human motion signals, with electromyography signals that indicate human motion intentions, with motion intent signals based on electroencephalogram, and even with human emotion signals based on electroencephalogram. As one form of human body motion recognition [4], gait recognition authenticates or recognizes identity from walking posture, footprints, or the myoelectric signals of human walking, and it is considered one of the most effective methods for long-distance identity recognition. Therefore, it is very important to design an efficient gait recognition method.
Gait recognition faces many difficulties in practical applications. The main one is that pedestrians are affected by the external environment and by their own factors during walking, which leads to strong intra-class variation in the extracted gait characteristics. The viewing angle is one of the most important factors affecting the recognition performance of such systems. A change of view occurs when a pedestrian's walking direction changes, or when the pedestrian moves from one surveillance area to another with a different camera setting. Gait images from different views differ greatly, and gait contours in side views [5,6] contain more valuable information. Thus, feature extraction is mostly based on side contours. However, the recognition performance of traditional single-view gait recognition decreases significantly when the view changes.
Aiming at multi-view gait recognition, this paper proposes a new method combining deep Convolutional Neural Network [7][8][9][10] and Channel Attention Mechanism (CAMCNN). The main innovations of this paper are summarized as follows:
1. A new convolutional neural network architecture is proposed. Each layer is trained by back propagation learning, which enhances the learning ability of the convolutional neural network.
2. A Convolutional Neural Network combined with a channel Attention mechanism (ACNN) is proposed. Compared with the original Convolutional Neural Network (CNN), the convolution layer is further strengthened by the added channel attention mechanism. It can better express gait characteristics, which helps to improve the accuracy of recognition.
The overall structure of this article is as follows: Section 2 introduces the related work for multi-view gait recognition and motivation. Section 3 introduces the architecture and detailed process of the proposed method. Section 4 introduces experimental verification on gait recognition datasets and comparative analysis with other existing methods. Section 5 introduces conclusions and future work.

Related Work
Scholars have proposed lots of methods for the problem of gait recognition. For example, Ghaeminia et al. [11] proposed gait saliency images and classified templates by applying appropriate spatiotemporal filters. However, when the amount of data is large, the effect of this model will be affected greatly. Xu et al. [12] proposed a subspace-based multi-view maximum edge subspace learning method. This method simultaneously minimizes intra-class variation and maximizes local inter-class variation from low-dimensional embedding between views and within views. The data from different views are mapped to the projection matrix of common subspaces to improve model recognition rate. However, this method is prone to overfitting. Shaikh et al. [13] proposed a gait recognition method based on partial contours, which improved the recognition accuracy using a multi-modal system. However, when there are lots of image noises, its effect will be affected greatly and needs to be further improved. More et al. [14] proposed a multi-view gait recognition system that combines two features. It extracts dynamic features by cross wavelets and extracts static features by a bipartite graph model, which improves the accuracy of this system after fusion. However, this method is prone to overfitting. Verlekar et al. [15] proposed a gait recognition method based on four-dimensional space, which identified users' walking direction by fitting the direction of lines. However, the parameters of this model are more difficult to adjust. Xing et al. [16] proposed an invariant gait recognition method based on 3D convolutional neural network, which extracts spatial and temporal information by learning viewpoint-invariant features to improve model performance.
However, more image noise will reduce the recognition accuracy. Zhang et al. [17] proposed a local discriminative gait recognition method, which improves the robustness by extracting robust local weighted histogram feature vectors for training. However, this method ignores some edge features. Wang et al. [18] proposed a generalized LDA gait recognition method based on Euclidean norm. It uses discrete matrices to separate adjacent samples and improves the accuracy of this model. However, when the amount of data is large, the classification effect is not good. Portillo-Portillo et al. [19] proposed a vision-invariant gait recognition algorithm based on joint direct linear discriminant analysis. It improves the stability of this model through the dimensionality reduction provided by direct linear discriminant analysis. However, the features extracted by this method need to be further improved. Chhatrala et al. [20] proposed a gait recognition algorithm based on hidden Markov model. It uses a moving average filter model to denoise the gait data, which improves the accuracy of this model. However, the model takes longer to run. Yu et al. [21] proposed a gait recognition method based on curve transformation and PCANet. It extracts differentiated robust features by non-linear and irreversible reversal of PCANet, which improves the effectiveness of this model. However, some edge features are difficult to extract.
Gadaleta et al. [22] proposed the extraction of invariant gait based on the depth model of auto encoder. It realizes the stepwise synthesis of gait characteristics by multi-layer self-encoding, which improves the accuracy of this model. However, the model is prone to overfitting. Rashwan et al. [23] proposed a gait recognition method based on histogram. This method uses a two-dimensional array of histograms to encode the dynamics of gait cycle and improves the model's accuracy. However, when the amount of data is large, the recognition effect of this method is not good. Sun et al. [24] proposed a vision-invariant gait recognition method based on Kinect skeleton features, which improves the accuracy by fusing static and dynamic features. However, this method is prone to overfitting. Wu et al. [25] proposed a feedback weighted convolutional neural network, which extracts features by controlling weights and improves the model recognition rate. However, the parameters of this model are more difficult to adjust.
Based on the above analysis, deep learning has good modeling and processing capabilities for massive data. Most existing multi-view gait recognition methods based on images or video sequences superimpose and synthesize gait sequences into energy-map-like templates. However, information may be lost during the image synthesis process, and factors such as period detection may introduce errors and affect the accuracy of recognition. To better solve these problems, this paper continues to study multi-view gait recognition based on deep convolutional neural networks. The work has two parts: improving the learning ability of the CNN by strengthening the training of each layer, and expressing gait characteristics better by strengthening the convolutional layer.

Proposed Method
The general process of gait recognition based on two-dimensional visual perception is: data preprocessing, gait contour extraction, gait cycle calculation, gait feature extraction, similarity calculation and gait classification. According to this process, we design a new gait recognition method.

Overall Architecture of the Proposed Method
The flow chart of convolutional neural network combined with channel attention mechanism is shown in Fig. 1. Acc represents acceleration, Gry represents angular velocity, Con represents convolution kernel, Pool represents pooling operation, Attention represents attention mechanism module and Output represents final classification features.

Lower Limb Electromyography Signal Capture
The lower limb EMG signal is a statistical waveform signal with zero mean. From a mathematical perspective, the signal can be considered as a system described by differential equations. Therefore, let the amplitude of the EMG signal be the variable $x$ and let the derivative of $x$ with respect to time be the variable $y$. Then $(x, y)$ can be considered as the coordinate of a state point, and the phase diagram of the EMG signal segment can be drawn in the $x$-$y$ phase plane. Fig. 2 shows the original EMG signal with a sampling frequency of 2000 Hz.
Take three rectangular windows on the signal, and assume a simple harmonic oscillator with mass $m$ and spring constant $k$. As before, the amplitude and velocity are expressed by the variables $x$ and $y$. The state of the oscillator can be expressed by the following dynamic system:
$$\dot{x} = y, \qquad \dot{y} = -\frac{k}{m}x \tag{1}$$
$x$ and $y$ represent the amplitude and velocity of the oscillator respectively. For a simple harmonic oscillator, $x^2$ is proportional to the potential energy of the oscillator and $y^2$ is proportional to its kinetic energy. Therefore, the total energy $E$ of the oscillator is:
$$E = \frac{1}{2}kx^2 + \frac{1}{2}my^2 \tag{2}$$
Convert Eq. (2) to the following format:
$$\frac{x^2}{2E/k} + \frac{y^2}{2E/m} = 1 \tag{3}$$
Eq. (3) describes an ellipse in the $x$-$y$ phase plane. The energy core $S$ is the area enclosed by this ellipse:
$$S = \pi\sqrt{\frac{2E}{k}}\sqrt{\frac{2E}{m}} = \frac{2\pi E}{\sqrt{km}} \tag{4}$$
As can be seen from Eq. (4), the energy core is proportional to the energy of the EMG signal oscillator.
Besides, the energy of the EMG signal is the sum of the energy of each harmonic; that is, the energy of the motor unit action potentials determines the energy of the EMG oscillator. Therefore, $k$ and $m$ reflect the inherent physical characteristics of the action potential conduction medium of the motor unit. For a harmonic, the average energy density is:
$$E = \frac{1}{2}\rho A^2 \omega^2 \tag{5}$$
$\rho$ is the mass density of the medium, $A$ is the amplitude, and $\omega$ is the angular frequency. The frequency of the motor unit action potential corresponds to that of the action potential of the motor neuron, and the dominant frequency component is recorded as $\omega_F$. Then, for a series of harmonics, Eq. (5) can be rewritten as:
$$E = \frac{1}{2}\rho\,\omega_F^2 \sum_i A_i^2 \tag{6}$$
$A_i$ represents the amplitude of the $i$-th component. It can be seen from Eq. (6) that the square root of $E$ is directly proportional to $\omega_F$ and to the signal strength. Therefore, there is a linear relationship between the square root of $E$ and the muscle signal strength.
Through the above analysis, when estimating the EMG signal, the sliding time window method can be used to calculate the energy core $S$ in each window according to Eq. (4). Then, the EMG signal is characterized by $\sqrt{S}$.
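The sliding-window feature can be sketched as follows. This is only an illustration under stated assumptions: the energy core $S$ is taken to be the area of the phase-plane ellipse, the ellipse semi-axes are estimated as $\sqrt{2}$ times the RMS of $x$ and of $dx/dt$ in each window, and the window and step lengths are hypothetical. It uses only the Python standard library.

```python
import math

def derivative(x, dt):
    """Central differences in the interior, one-sided at the two ends."""
    n = len(x)
    inner = [(x[i + 1] - x[i - 1]) / (2 * dt) for i in range(1, n - 1)]
    return [(x[1] - x[0]) / dt] + inner + [(x[-1] - x[-2]) / dt]

def energy_kernel_feature(x, fs, win, step):
    """Return sqrt(S) for each sliding window of the signal x.

    Assumption: S = pi * a * b, the area of the phase-plane ellipse, with
    semi-axes a and b estimated as sqrt(2) * RMS of x and of dx/dt.
    """
    y = derivative(x, 1.0 / fs)
    feats = []
    for start in range(0, len(x) - win + 1, step):
        xs = x[start:start + win]
        ys = y[start:start + win]
        a = math.sqrt(2.0 * sum(v * v for v in xs) / win)
        b = math.sqrt(2.0 * sum(v * v for v in ys) / win)
        feats.append(math.sqrt(math.pi * a * b))
    return feats

fs = 2000                 # sampling frequency given in the text (Hz)
amp, freq = 2.0, 50.0     # hypothetical test tone, not real EMG data
signal = [amp * math.sin(2 * math.pi * freq * n / fs) for n in range(fs)]
features = energy_kernel_feature(signal, fs, win=200, step=100)
```

For a pure sinusoid of amplitude $A$ and angular frequency $\omega$ the ellipse has semi-axes $A$ and $A\omega$, so each window should return a value close to $\sqrt{\pi A^2 \omega}$.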

Data Interpolation and Denoising
Due to the inaccuracy of the software clock, the smartphone's sampling [26] of gait data is uneven, and the sampling intervals of the obtained data are inconsistent. For the convenience of processing, the data is first resampled by cubic spline interpolation to one sample every 5 milliseconds. A complete gait cycle takes about 1 second, so a gait cycle has approximately 200 data points. Then a low-pass Finite Impulse Response (FIR) filter is used to denoise the interpolated data and reduce motion artifacts that may occur at higher frequencies.
Generally, the cut-off frequency is $f = 40$ Hz and the window length is set to 1 second.
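A minimal sketch of the resampling and low-pass filtering steps is given below. Several choices are assumptions made for illustration: linear interpolation stands in for the cubic spline to keep the example short, the FIR filter is a Hamming-windowed sinc design, and the 1 second window is interpreted as 201 taps at 200 Hz. The timestamps and the 1 Hz test component are synthetic.

```python
import math

def resample_linear(t, v, dt):
    """Resample (t, v) onto a uniform grid with spacing dt (linear stand-in)."""
    out, j = [], 0
    n = int((t[-1] - t[0]) / dt + 1e-9) + 1
    for k in range(n):
        tk = t[0] + k * dt
        while j < len(t) - 2 and t[j + 1] < tk:
            j += 1
        w = (tk - t[j]) / (t[j + 1] - t[j])
        out.append(v[j] + w * (v[j + 1] - v[j]))
    return out

def lowpass_fir(cutoff_hz, fs, numtaps):
    """Hamming-windowed sinc low-pass taps, normalized to unit DC gain."""
    fc = cutoff_hz / fs
    mid = (numtaps - 1) / 2.0
    taps = []
    for n in range(numtaps):
        m = n - mid
        h = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        taps.append(h * (0.54 - 0.46 * math.cos(2 * math.pi * n / (numtaps - 1))))
    s = sum(taps)
    return [h / s for h in taps]

def fir_filter(x, taps):
    """Centered (zero-delay) filtering; the ends are padded with the edge values."""
    half = len(taps) // 2
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    return [sum(h * padded[i + k] for k, h in enumerate(taps))
            for i in range(len(x))]

# Hypothetical jittered timestamps around a 5 ms period, carrying a 1 Hz component.
t = [0.005 * k + 0.001 * math.sin(k) for k in range(401)]
v = [math.sin(2 * math.pi * tk) for tk in t]
uniform = resample_linear(t, v, dt=0.005)
taps = lowpass_fir(cutoff_hz=40.0, fs=200.0, numtaps=201)
smooth = fir_filter(uniform, taps)
```

The 1 Hz component lies well inside the passband, so it should pass almost unchanged, while content near the 100 Hz Nyquist limit is strongly attenuated.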

Normalization of Contours
In the acquired image data, target contours occupy only a small part of original images. Besides, since the camera has a fixed angle when shooting, there is a change in the distance between pedestrian and cameras, which directly affects the size of silhouette contours. In this paper, contour images are cropped to obtain the target silhouette image, and then normalized to scale it into a fixed-size template. The bilinear interpolation method is mainly used and Eq. (1) is used to transform the image scale [27]. As shown in Fig. 3,

Gait Cycle Extraction
A complete gait cycle [28] refers to the process from one heel striking the ground to the same heel striking the ground again when walking. It contains the single-step feature of gait and is an important basis for gait recognition. In general, the gait characteristics of the same person are stable and unique, so consecutive gait cycles should be highly correlated. To detect a dynamically changing gait cycle, a gait cycle that is easy to distinguish is first identified, and the gait signal of this cycle is used as a template. Template matching is then used to find the signal segment with the greatest correlation as the next gait cycle. At the same time, the template is updated iteratively so that the detection of the next cycle is more accurate. During gait cycle detection, the gait signal used is mainly the amplitude of the total acceleration. For each sample $i$ ($i = 1, 2, \ldots$), the total acceleration amplitude is calculated as follows:
$$a_{mag}(i) = \sqrt{a_x(i)^2 + a_y(i)^2 + a_z(i)^2}$$
Let the amplitude $a_{mag}(i_0)$ at $i_0$ be a minimum at the beginning of the gait signal. Using $i_0$ as the center, we extract 200 acceleration samples as $Z$, the acceleration template of the first cycle. Let $C(i)$ be the next continuous data segment starting at point $i$ with length $N = 200$, that is:
$$C(i) = [a_{mag}(i), a_{mag}(i+1), \ldots, a_{mag}(i+N-1)]^{T}$$
The correlation distance $V(i)$ between $C(i)$ and the template is calculated as follows:
$$V(i) = \frac{(Z - \bar{Z}1_N)^{T}(C(i) - \bar{C}1_N)}{\|Z - \bar{Z}1_N\|\,\|C(i) - \bar{C}1_N\|}$$
Here, $\bar{Z}$ and $\bar{C}$ represent the means of the elements of the vectors $Z$ and $C(i)$ respectively, $1_N$ is the all-ones vector of length $N$, and $\|\cdot\|$ is the 2-norm of a vector.
Two consecutive maxima of $V(i)$ delimit one gait cycle. The maxima can be located by a simple threshold method; a threshold of 0.5 is sufficient. The first template is used to find the next gait cycle, which is the second cycle $Z'$. The template is then updated as a weighted average of the old template $Z$ and $Z'$. The above process continues until the last gait cycle of the data. In this way, a relatively accurate template is obtained for each new cycle.
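The template-matching loop described above might look like the following sketch. The update weight `alpha`, the search range of half a cycle to one and a half cycles after the previous start, and the synthetic test signal are all assumptions, since the text does not specify them. For simplicity the first template starts at a given index rather than being centered on a detected minimum.

```python
import math

def correlation(z, c):
    """Normalized correlation between the template z and a segment c."""
    mz, mc = sum(z) / len(z), sum(c) / len(c)
    dz = [v - mz for v in z]
    dc = [v - mc for v in c]
    denom = math.sqrt(sum(v * v for v in dz)) * math.sqrt(sum(v * v for v in dc))
    return sum(p * q for p, q in zip(dz, dc)) / denom if denom else 0.0

def detect_cycles(a_mag, start, n=200, threshold=0.5, alpha=0.5):
    """Return the start indices of successive gait cycles.

    alpha (template update weight) and the search range are hypothetical.
    """
    template = a_mag[start:start + n]
    starts, pos = [start], start
    while True:
        lo = pos + n // 2
        hi = min(pos + 3 * n // 2, len(a_mag) - n)
        if lo > hi:
            break
        best = max(range(lo, hi + 1),
                   key=lambda i: correlation(template, a_mag[i:i + n]))
        segment = a_mag[best:best + n]
        if correlation(template, segment) < threshold:
            break
        # The new template is a weighted average of the old template and the new cycle.
        template = [alpha * z + (1 - alpha) * c for z, c in zip(template, segment)]
        starts.append(best)
        pos = best
    return starts

period = 190  # hypothetical true cycle length in samples
a_mag = [1 + 0.5 * math.sin(2 * math.pi * i / period)
         + 0.2 * math.sin(4 * math.pi * i / period) for i in range(1200)]
cycle_starts = detect_cycles(a_mag, start=0)
```

On this periodic test signal the detected starts are spaced exactly one period apart, even though the template length (200) differs from the true period (190).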

Directional Projection
Since the data is collected by a mobile phone in a pants pocket, and the position of the phone is not fixed, the acceleration and angular velocity will be slightly shifted in direction. To this end, a new direction-invariant coordinate system needs to be established for the collected data [29]. The three orthogonal coordinate axes of the new coordinate system are independent of the orientation of the smartphone and aligned with the directions of gravity and motion. Let the length of a gait cycle sample be $N_1$. The acceleration and angular velocity of each sample are expressed as follows:
$$A = [a_x, a_y, a_z], \qquad K = [k_x, k_y, k_z]$$
where $x$ represents the direction perpendicular to the phone screen, $y$ represents the direction from left to right of the phone, and $z$ represents the direction from the bottom to the top of the phone. $A$ represents acceleration and $K$ represents angular velocity. $a_x$, $a_y$ and $a_z$ represent the acceleration vectors in the $x$, $y$ and $z$ directions. $k_x$, $k_y$ and $k_z$ represent the angular velocity vectors in the $x$, $y$ and $z$ directions.
The acceleration in the direction of gravity is the main low-frequency component in the accelerometer data. However, since the position of the smartphone changes during walking, the acceleration in the direction of gravity is not a constant vector in the $(x, y, z)$ coordinate system. To this end, the mean acceleration vector over a gait cycle is used to estimate the gravity acceleration vector, which is expressed as follows:
$$g = [\bar{a}_x, \bar{a}_y, \bar{a}_z]$$
Here, $\bar{a}_x$, $\bar{a}_y$ and $\bar{a}_z$ represent the means of the accelerations in the $x$, $y$ and $z$ directions over a gait cycle. Then, the first coordinate axis direction of the new coordinate system is calculated as follows:
$$f_1 = \frac{g}{\|g\|}$$
To find the second direction, the original acceleration is projected onto the horizontal plane orthogonal to $f_1$. Let $A^1 = [a^1_x, a^1_y, a^1_z]$ be the acceleration data in the horizontal plane, where $a^1_x$, $a^1_y$ and $a^1_z$ are the projected components (each of length $N_1$). Treating $A$ as an $N_1 \times 3$ matrix and $f_1$ as a row vector, the calculation formula is as follows:
$$A^1 = A - (A f_1^{T}) f_1$$
In the horizontal plane, we set the direction in which the acceleration data changes the most (that is, the direction of travel, which has the largest variance) as the second coordinate axis of the new coordinate system. For this reason, Principal Component Analysis (PCA) is used to calculate the direction in which the data variance is greatest. Firstly, the covariance matrix is calculated as follows:
$$h_1 = \frac{1}{N_1}\left(A^1 - 1_{N_1}\bar{A}^1\right)^{T}\left(A^1 - 1_{N_1}\bar{A}^1\right)$$
where $1_{N_1}$ is an all-ones vector of length $N_1$ and $\bar{A}^1$ is the row vector of column means of $A^1$. The eigenvector $H_1$ corresponding to the maximum eigenvalue of $h_1$ is the direction of maximum variance. In this way, the second direction of the new coordinate system is calculated as follows:
$$f_2 = \frac{H_1^{T}}{\|H_1\|}$$
Since the above two directions are orthogonal, the third direction can be obtained by the cross product:
$$f_3 = f_1 \times f_2$$
The original acceleration and angular velocity data are projected into the new coordinate space. Each component is calculated as follows:
$$a_{f_j} = A f_j^{T}, \qquad k_{f_j} = K f_j^{T}, \qquad j = 1, 2, 3$$
The new gait data after coordinate transformation are then:
$$A' = [a_{f_1}, a_{f_2}, a_{f_3}], \qquad K' = [k_{f_1}, k_{f_2}, k_{f_3}]$$
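The construction of the direction-invariant axes can be illustrated with the sketch below. It follows the steps in the text (gravity from the mean acceleration, PCA in the horizontal plane, cross product for the third axis), but the details are assumptions: the dominant eigenvector is found by power iteration to avoid external libraries, and the test data is a synthetic walk with a hypothetical direction of travel.

```python
import math

def unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def gait_axes(acc):
    """acc: list of [ax, ay, az] samples from one gait cycle -> (f1, f2, f3)."""
    n = len(acc)
    gravity = [sum(s[k] for s in acc) / n for k in range(3)]
    f1 = unit(gravity)
    # Remove the component along f1 to obtain the horizontal acceleration.
    horiz = [[s[k] - dot(s, f1) * f1[k] for k in range(3)] for s in acc]
    mean_h = [sum(h[k] for h in horiz) / n for k in range(3)]
    cen = [[h[k] - mean_h[k] for k in range(3)] for h in horiz]
    cov = [[sum(c[i] * c[j] for c in cen) / n for j in range(3)] for i in range(3)]
    # Power iteration for the dominant eigenvector of the 3x3 covariance matrix.
    v = unit([1.0, 0.7, 0.3])
    for _ in range(200):
        v = unit([dot(row, v) for row in cov])
    # Re-orthogonalize against f1 to guard against rounding drift.
    f2 = unit([v[k] - dot(v, f1) * f1[k] for k in range(3)])
    f3 = cross(f1, f2)
    return f1, f2, f3

def project(samples, axes):
    """Express each 3-D sample in the coordinate system given by axes."""
    return [[dot(s, f) for f in axes] for s in samples]

theta = [2 * math.pi * i / 200 for i in range(200)]
d = [math.cos(math.pi / 6), math.sin(math.pi / 6), 0.0]  # hypothetical travel direction
side = [-d[1], d[0], 0.0]
acc = [[2 * math.sin(t) * d[k] + 0.3 * math.cos(2 * t) * side[k]
        + (9.8 if k == 2 else 0.0) for k in range(3)] for t in theta]
f1, f2, f3 = gait_axes(acc)
new_acc = project(acc, (f1, f2, f3))
```

With this synthetic input, `f1` should recover the gravity direction and `f2` the travel direction (up to sign); the same `project` call would be applied to the angular velocity samples.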

Data Normalization
Due to changes in walking speed and stride, each gait cycle has a different duration, which causes the data length of each gait cycle to be inconsistent. Deep learning models usually need input data of consistent length. According to the characteristics of the gait data in this paper, the data length is unified to 200 by interpolation and extraction in the experiment. To obtain better training and classification performance, the data is amplitude normalized to obtain a vector with zero mean and unit variance. Since acceleration and angular velocity each have data in three directions $(x, y, z)$, plus the calculated total acceleration and total angular velocity, there are eight vectors of length 200 for each gait cycle. Together they constitute the input signals in the experiments.
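As an illustration, the eight-channel input could be assembled as in the sketch below. Linear interpolation is assumed for the length unification, the population standard deviation is assumed for the normalization, and the raw cycle of 183 samples is synthetic.

```python
import math

def resample_to_length(v, n=200):
    """Linearly resample the sequence v to exactly n points."""
    if len(v) == 1:
        return [v[0]] * n
    out = []
    for k in range(n):
        pos = k * (len(v) - 1) / (n - 1)
        i = min(int(pos), len(v) - 2)
        w = pos - i
        out.append(v[i] * (1 - w) + v[i + 1] * w)
    return out

def zscore(v):
    """Normalize to zero mean and unit variance (all zeros if v is constant)."""
    m = sum(v) / len(v)
    sd = math.sqrt(sum((x - m) ** 2 for x in v) / len(v))
    return [(x - m) / sd for x in v] if sd else [0.0] * len(v)

def build_input(acc, gyr, n=200):
    """acc, gyr: dicts with keys 'x', 'y', 'z' -> 8 normalized channels of length n."""
    channels = []
    for sig in (acc, gyr):
        total = [math.sqrt(x * x + y * y + z * z)
                 for x, y, z in zip(sig['x'], sig['y'], sig['z'])]
        for ch in (sig['x'], sig['y'], sig['z'], total):
            channels.append(zscore(resample_to_length(ch, n)))
    return channels

n_raw = 183  # hypothetical raw cycle length before normalization
ts = [2 * math.pi * i / n_raw for i in range(n_raw)]
acc = {'x': [math.sin(t) for t in ts],
       'y': [math.cos(t) for t in ts],
       'z': [9.8 + 0.1 * math.sin(3 * t) for t in ts]}
gyr = {'x': [math.sin(t) for t in ts],
       'y': [0.5 * math.cos(2 * t) for t in ts],
       'z': [0.2 * math.sin(3 * t) for t in ts]}
inputs = build_input(acc, gyr)
```

The result is a list of eight channels (x, y, z and total magnitude for acceleration, then the same for angular velocity), each of length 200.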

Gait Feature Extraction
For gait feature extraction, we propose a deep convolutional neural network architecture, which is suitable for gait recognition. The architecture consists of 8 layers, including 4 convolutional layers and 4 sampling layers. Besides, there are 8 feature maps in each layer. In each convolutional layer, 8 convolution filters are used for initialization, and there are 8 sampling maps in each sampling layer. The architecture trains these layers using a back propagation [30] learning algorithm. It also uses the root mean square propagation of stochastic gradient descent with an adaptive learning rate to optimize the algorithm, thereby minimizing the cost function [31].

Convolution Method
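The Xavier uniform variance scaling method is used to initialize the weights of the convolution filters:
$$W \sim U\left[-\sqrt{\frac{6}{n_{in} + n_{out}}},\ \sqrt{\frac{6}{n_{in} + n_{out}}}\right]$$
where $n_{in}$ and $n_{out}$ are the numbers of input and output units of the layer. A convolution filter is applied with a step size of 1 and a size of 5 × 5.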
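A short sketch of Xavier uniform initialization for eight 5 × 5 filters follows. The fan-in and fan-out values are assumptions (one input map and one output map per filter); other conventions for counting them exist, and the random seed is arbitrary.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, shape, rng):
    """Draw weights from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    rows, cols = shape
    return [[rng.uniform(-limit, limit) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
k = 5                    # filter size given in the text
fan_in = 1 * k * k       # assumption: one input map per filter
fan_out = 1 * k * k      # assumption: one output map per filter
filters = [xavier_uniform(fan_in, fan_out, (k, k), rng) for _ in range(8)]
biases = [0.0] * 8       # bias terms start at zero, as stated in the text
```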
The output feature map is added to the bias term and the result is transformed by a non-linear activation function. Each feature map in a convolutional layer is calculated as follows:
$$FM_i = f\left(W_i \otimes FM_{i-1} + b\right) \tag{21}$$
Here, $\otimes$ represents the convolution operation, $W_i$ is the convolution filter, and $FM_{i-1}$ is the feature map of the previous layer. In the first layer, $FM_{i-1}$ represents the original pixels of the GEI. Each feature map has a bias term $b$, which is initialized to zero. The activation function $f(x)$ is applied to $x$, the result of the convolution operation after the bias term of the feature map has been added, as shown in Eq. (21). In the first convolution layer, each unit outputs a feature map of 136 × 136. In layer 3, each unit outputs 64 × 64 feature maps. In layer 5, each unit outputs 28 × 28 feature maps. In the final convolutional layer, each of the 8 filters produces a 10 × 10 output feature map. The output of the final convolutional layer is fed directly to the 8 subsampling units in the pooling layer.

Pooling Method
In the proposed deep convolutional neural network architecture, each pooling layer outputs 8 pooled feature maps and summarizes the output values of the adjacent neuron groups mapped by each kernel. At the same time, the pooling layer helps to reduce spectral changes in the input data and to produce translation-invariant features. Because the body shape in gait recognition is a non-rigid shape that can experience many fluctuations, this advantage is very valuable in gait recognition. In the model of this paper, the pooling unit performs maximum pooling with pooling factor $C = 2$. The data is downsampled using a maximum pooling filter with a pooling unit size of 2 × 2 and a step of 2. The pooling windows in the model do not overlap, and each pooled feature map is obtained by applying MaxP, the maximum pooling operation, to the corresponding feature map of the preceding convolutional layer. In the first sub-sampling layer, each of the 8 pooling filters produces 68 × 68 outputs. In the fourth layer, each pooling filter produces 32 × 32 outputs. In the sixth layer, each pooling filter produces 14 × 14 outputs. In the last pooling layer, each of the 8 pooling filters produces 5 × 5 outputs. In the fully connected part, there are only two layers (input layer and output layer), where softmax is the classifier of this paper.
The proposed architecture does not have any hidden layers. The input layer has 200 neurons, which come from the last pooling layer (5 × 5 × 8).
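The feature-map sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes unpadded ("valid") 5 × 5 convolutions with step 1 and non-overlapping 2 × 2 max pooling; the 140 × 140 input size is not stated in the text and is inferred here from the 136 × 136 output of the first convolution.

```python
def conv_out(size, kernel=5, stride=1):
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output width of a non-overlapping pooling layer."""
    return (size - window) // stride + 1

def trace_sizes(input_size, stages=4):
    """Return (conv size, pooled size) for each conv + pool stage."""
    sizes, s = [], input_size
    for _ in range(stages):
        c = conv_out(s)
        s = pool_out(c)
        sizes.append((c, s))
    return sizes

def max_pool_2x2(fm):
    """2 x 2 max pooling with step 2 on a feature map given as a list of rows."""
    return [[max(fm[r][c], fm[r][c + 1], fm[r + 1][c], fm[r + 1][c + 1])
             for c in range(0, len(fm[0]) - 1, 2)]
            for r in range(0, len(fm) - 1, 2)]

sizes = trace_sizes(140)             # 140 is an inferred input size
flattened = sizes[-1][1] ** 2 * 8    # 5 * 5 * 8 neurons fed to the classifier
pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12],
                       [13, 14, 15, 16]])
```

Under these assumptions the trace reproduces the sizes in the text (136, 68, 64, 32, 28, 14, 10, 5) and the 200 input neurons of the fully connected part.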

Layer Connection
The original deep convolutional neural network architecture consists of millions of parameters and is trained on large data sets [32]. However, gait recognition data sets are relatively small and cannot train all of these parameters, so overfitting may occur.
In the proposed deep convolutional neural network architecture, each feature map $FM_i$ in layer $l$ is connected to only one feature map $FM_i$ from the previous layer $l - 1$. This greatly reduces the computational cost, shortens training time and reduces the number of parameters. Fig. 4 shows an example of a one-to-one connection or a single connection between three layers of cores.
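To give a sense of scale, the sketch below counts the convolutional parameters under one-to-one connections and under full connections between feature maps. It assumes four convolutional layers with eight 5 × 5 filters each, a single-channel input, and one bias per output map; the fully connected classifier is not counted.

```python
def conv_params(in_maps, out_maps, k=5, one_to_one=True):
    """Weights plus one bias per output map for a single convolutional layer."""
    kernels = out_maps if one_to_one else in_maps * out_maps
    return kernels * k * k + out_maps

# (input maps, output maps) for the four convolutional layers; the first
# layer sees a single-channel image, which is an assumption.
layers = [(1, 8), (8, 8), (8, 8), (8, 8)]
single = sum(conv_params(i, o, one_to_one=True) for i, o in layers)
full = sum(conv_params(i, o, one_to_one=False) for i, o in layers)
```

Under these assumptions the one-to-one scheme needs 832 convolutional parameters against 5032 for full connections, roughly a sixfold reduction.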

Gait Classification
As an auxiliary method, the Attention Mechanism (AM) [33] is increasingly being introduced into deep networks to optimize the network structure. It resembles the way human eyes observe things, making the network more focused during learning and thereby improving its learning ability. Generally, Channel Attention (CA) [34] selects and optimizes different channels of the same feature map to obtain re-adjusted channel information. An improved CA is proposed for image processing and is shown in Fig. 5. It operates on a given intermediate feature map $X$ of size $L \times H \times W \times C$, where $L$, $H$ and $W$ represent the spatial dimensions of the feature map and $C$ represents the number of channels. In the attention computation, MaxPooling and AvgPooling represent the global maximum pooling and global average pooling in the spatial direction respectively, ReLU is the activation function, $\sigma$ represents the sigmoid activation function, and FC is a fully connected layer.
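One common way to combine these building blocks is sketched below. It is not necessarily the improved CA of this paper: the sketch assumes that the two pooled descriptors pass through a shared two-layer FC network with ReLU, are summed, and are squashed by a sigmoid to give one weight per channel. The weights `w1` and `w2`, the reduction ratio of 2, and the tiny $H \times W \times C$ feature map (the $L$ dimension is dropped) are all hypothetical.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(fmap, w1, w2):
    """fmap: list of C channels, each an H x W list of rows.

    Returns the re-weighted feature map and the per-channel weights.
    Assumed form: sigmoid(FC(ReLU(FC(avg))) + FC(ReLU(FC(max)))).
    """
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]
    mx = [max(map(max, ch)) for ch in fmap]

    def shared_fc(v):
        return matvec(w2, relu(matvec(w1, v)))

    weights = [sigmoid(a + b) for a, b in zip(shared_fc(avg), shared_fc(mx))]
    out = [[[w * x for x in row] for row in ch] for ch, w in zip(fmap, weights)]
    return out, weights

# Hypothetical 4-channel, 3 x 3 feature map and FC weights (reduction ratio 2).
fmap = [[[0.1 * (c + 1) * (r + k) for k in range(3)] for r in range(3)]
        for c in range(4)]
w1 = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
w2 = [[0.3, -0.5], [0.2, 0.4], [-0.6, 0.1], [0.7, 0.2]]
attended, channel_weights = channel_attention(fmap, w1, w2)
```

Each channel is scaled by a single weight in (0, 1), so informative channels can be emphasized and less useful ones suppressed.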
In the classification stage, the fully connected layer is used to compress the high-dimensional feature $e$ to a lower dimension equal to the number of categories. The probability of each category is then calculated by the classifier:
$$\hat{y} = \mathrm{softmax}(W_2 e + b_2)$$
where $W_2$ is the weight matrix and $b_2$ is the bias. The loss function of the entire network is the classification cross entropy function, computed over the $k$ training samples, where $k$ represents the number of training samples. During the training process, a stochastic gradient descent algorithm is used to optimize and update all parameters in the convolutional neural network combined with channel attention mechanism.
Figure 4: Schematic diagram of one-to-one connection between cores
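The classification step and its loss can be written compactly as below. The sketch assumes integer class labels and averages the cross entropy over the samples; whether the original loss is summed or averaged is not stated, and the tiny weight matrix is purely illustrative.

```python
import math

def softmax(z):
    """Numerically stable softmax."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classify(e, W2, b2):
    """Fully connected layer followed by softmax: softmax(W2 e + b2)."""
    logits = [sum(w * x for w, x in zip(row, e)) + b for row, b in zip(W2, b2)]
    return softmax(logits)

def cross_entropy(probs, labels):
    """Mean categorical cross entropy (averaging is an assumption)."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Hypothetical 2-D feature and 2-class classifier.
probs = classify([1.0, 2.0], W2=[[1.0, 0.0], [0.0, 1.0]], b2=[0.0, 0.0])
loss = cross_entropy([probs], [1])
```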

Experiments and Results
To verify the effectiveness of the proposed convolutional neural network combined with channel attention mechanism (CAMCNN), experimental evaluations were performed on the OULP and CASIA-B datasets. CAMCNN was compared experimentally with the 3-D Convolutional Neural Network (3DCNN) proposed in [16] and the Deep Convolutional Location Weight Descriptor (DLWD) proposed in [25]. These methods are implemented in Python 3.0 using image processing toolbox.

Experimental Datasets
The OULP dataset contains more than 4,000 subjects, and each subject performs 2 normal walks. Each video clip is recorded by a camera at 4 angles (55°, 65°, 75° and 85°).

Comparison of Recognition Results under Different Classifiers
The algorithm model is used to model the angle variables, and the most classic gait feature GEI is used as input. Support Vector Machine (SVM), Random Forest (RF) and Gradient Boosting Decision Tree (GBDT) were used on both OULP dataset and CASIA-B dataset. We performed two sets of experiments separately, and the experimental results are shown in Figs. 8 and 9.
As can be seen from Figs. 8 and 9, CAMCNN can better learn the features because of channel attention mechanism using the same classifier and gait characteristics. Therefore, the recognition results obtained by CAMCNN are superior to several other comparison methods. The reason for poor performance of SVM is that, since its calculation time increases sharply with the increase of training samples, it is not suitable for

Comparison of Recognition Results with Advanced Methods
The simplest nearest neighbor classifier (K-Nearest Neighbor, KNN) [35] is used, and only the scenes with no angular difference in the state of cooperation are considered in the experiment. Since the experiment does not require a training set, the samples at each angle are used for testing. First test the performance of the original dimension on OULP dataset. Then, by the experiments, it is obtained that the computer performance and recognition accuracy have achieved a good balance of feature dimensions. Finally, the reference dataset and the probe data set are exchanged and then the recognition rate is obtained. Similarly, CASIA-B dataset also calculates the recognition rate at 11 angles in the same way. As can be seen from Tabs. 1 and 2, the proposed CAMCNN can achieve a higher recognition rate on gait discrimination task using only the simplest classifier compared with 3DCNN and DLWD which shows that CAMCNN can learn features well. At the same time, it can be found that the recognition rate of this model is not significantly different under different angles. Therefore, the proposed CAMCNN has a certain degree of angle invariance.
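The nearest-neighbor evaluation with exchanged reference and probe sets can be sketched as follows. Euclidean distance is assumed, and the two-dimensional features and three subjects are toy values, not data from OULP or CASIA-B.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_accuracy(gallery, probe):
    """gallery, probe: lists of (subject_id, feature_vector) pairs."""
    correct = 0
    for pid, pf in probe:
        best_id, _ = min(gallery, key=lambda g: sq_dist(g[1], pf))
        correct += best_id == pid
    return correct / len(probe)

# Toy reference (gallery) and probe sets with hypothetical 2-D features.
gallery = [(0, [0.0, 0.0]), (1, [1.0, 0.0]), (2, [0.0, 1.0])]
probe = [(0, [0.1, 0.0]), (1, [0.4, 0.1]), (2, [0.1, 0.9])]

forward = nearest_neighbor_accuracy(gallery, probe)
backward = nearest_neighbor_accuracy(probe, gallery)  # sets exchanged
mean_rate = (forward + backward) / 2
```

Because no classifier is trained, the recognition rate depends only on how well the learned features separate subjects, which is why this setting is used to compare feature quality.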

CMC and ROC Curves
To verify the superiority of the proposed multi-view gait recognition method using deep convolutional neural network and channel attention mechanism, experiments were conducted on OULP and CASIA-B datasets using different convolutional neural networks. To ensure comparability, the feature extraction methods all use multi-scale audio difference normalization algorithm. The experimental results are shown in Figs. 10-13. Since the part is just to verify the quality of the model, to facilitate comparison with similar studies, feature extraction method uses GEI.
As can be seen from Figs. 10-13, compared with the other two existing methods, the recognition rate of the proposed method is higher. As can be seen from the analysis of reasons, the proposed method improves the learning ability of the network by optimizing the cost function. In addition, the proposed method incorporates the attention mechanism, which improves the ability of expressing gait features of the network and suppresses recognition errors effectively.
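For readers unfamiliar with CMC curves, the sketch below computes one from gallery and probe features. It assumes Euclidean distance and one gallery entry per subject; the toy features are hypothetical and unrelated to the reported figures.

```python
def cmc_curve(gallery, probe, max_rank):
    """Cumulative match characteristic: fraction of probes whose true
    identity appears within the top r gallery matches, for r = 1..max_rank."""
    hits = [0] * max_rank
    for pid, pf in probe:
        ranked = sorted(gallery,
                        key=lambda g: sum((a - b) ** 2 for a, b in zip(g[1], pf)))
        ids = [gid for gid, _ in ranked]
        if pid in ids[:max_rank]:
            first = ids.index(pid)
            for r in range(first, max_rank):
                hits[r] += 1
    return [h / len(probe) for h in hits]

# Toy data: hypothetical 2-D features for three subjects.
gallery = [(0, [0.0, 0.0]), (1, [1.0, 0.0]), (2, [0.0, 1.0])]
probe = [(0, [0.1, 0.0]), (1, [0.4, 0.1]), (2, [0.1, 0.9])]
cmc = cmc_curve(gallery, probe, max_rank=3)
```

The first value of the curve is the rank-1 recognition rate, and the curve is non-decreasing by construction.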

Conclusions and Future Works
In this paper, a new multi-view gait recognition algorithm combining channel attention mechanism and convolutional neural network is proposed. The method trains the convolution layers through back propagation learning and uses the root mean square propagation of stochastic gradient descent to optimize the cost function, which improves the learning ability of the convolutional neural network. In addition, the channel attention mechanism is integrated, which makes the network more focused during learning and further improves its ability to express gait features. Two sets of comprehensive experiments are also discussed, which use different feature extraction and classification methods for multi-view gait recognition. The experimental results show that, with the same feature extraction methods or classification algorithms, multi-view gait recognition performs better when trained with the proposed CAMCNN.
Figure 12: ROC curve of OULP dataset under non-cooperative state
In future work on human gait recognition for serialized signals, other variables (such as clothing, backpacks and walking speed) must be considered comprehensively in addition to the angle factor. In addition, for practical deployment of the algorithm, the integration of long-distance face features will be considered to achieve multi-modal identity verification and authentication. On this basis, the algorithm can be deployed as quickly as possible in digital security, digital entertainment and other scenarios that require identity recognition.