A weighted sparse coding model on product Grassmann manifold for video-based human gesture recognition

Classifying multi-dimensional data with complex intrinsic geometry, such as videos of human gestures, is a challenging problem. Manifold structure is a good way to characterize the intrinsic geometry of such data. The recently proposed sparse coding on Grassmann manifold shows high discriminative power in many visual classification tasks. It represents videos on the Grassmann manifold through Singular Value Decomposition (SVD) of the data matrix obtained by vectorizing each frame, but vectorization destroys the spatial structure of videos. To preserve this spatial structure, videos can instead be represented as data tensors. In this paper, we first represent human gesture videos on the product Grassmann manifold (PGM) via Higher Order Singular Value Decomposition (HOSVD) of the data tensor. Each factor manifold characterizes the video from a different perspective and can be understood as capturing the appearance, horizontal motion and vertical motion of the gesture respectively. We then propose a weighted sparse coding model on PGM, where the weights model the importance of the factor manifolds. Furthermore, we propose an optimization algorithm for learning the coding coefficients by embedding each factor Grassmann manifold into a space of symmetric matrices. Finally, we give a classification algorithm; experimental results on three public datasets show that our method is competitive with several strong related methods.


INTRODUCTION
Human action/gesture recognition (Pareek & Thakkar, 2021) is an active research area due to its wide applications, such as human-computer interaction, robot control, security and surveillance, sign language assistance, education and medicine. Roughly speaking, human actions/gestures convey intentional information through the physical movement of body parts. Usually, the term ''action'' is considered to have a higher complexity level than the term ''gesture'' (Zhu et al., 2016). Research on human gesture recognition is mainly divided into two categories: wearable-device-based techniques (Jung et al., 2015) and vision-based techniques (Ji et al., 2012). However, the former requires users to carry specially designed wearable sensors, which are usually quite expensive. For vision-based approaches, videos carry more information for gesture recognition than still images. Moreover, the number of videos available on the Internet has increased significantly with the development of acquisition and storage devices. Hence, video-based human gesture recognition (Ji et al., 2012; Chakraborty et al., 2018; Patil & Subbaraman, 2019) attracts more and more attention.
For video-based human gesture recognition, each video is assigned a class label, and videos of the same class may be performed by different people in different environments. Recognition is made more difficult by large variations in illumination, appearance, pose and scale; variations exist even for the same person. Video-based human gesture recognition is therefore a challenging problem. Basically, its key problems are learning discriminative feature representations for a gesture video and designing an effective recognition method.
For recognition methods, sparse representation classification (SRC) has been shown to deliver notable results in various vision tasks, such as face recognition (Wright et al., 2008; Wright et al., 2010) and subspace clustering (Elhamifar & Vidal, 2013). Furthermore, weighted forms of sparse coding have been proposed for applications such as image denoising (Xu, Zhang & Zhang, 2018), visual tracking (Yan & Tong, 2011) and saliency detection (Li, Sun & Yu, 2015). Although the SRC method and its extensions perform well in many applications, they assume the data come from a linear space. However, much multi-dimensional data may reside in a non-linear manifold, so it is desirable to explore the latent non-linear manifold structure of the data. Recently, many methods based on Grassmann manifold representations of videos/image sets have been proposed and have achieved good performance. For instance, Harandi et al. (2015) proposed a sparse coding algorithm on Grassmann manifold for classification tasks such as gesture classification, scene analysis and dynamic texture classification; Wang et al. (2020) proposed a self-expression learning framework on Grassmann manifolds for video/image-set subspace clustering; Verma & Choudhary (2020) performed Grassmann manifold discriminant analysis for hand gesture recognition from depth data; and Souza et al. (2020a) proposed an enhanced Grassmann discriminant analysis framework for classifying motion sequences.
Although the Grassmann manifold reflects the non-linear structure of data well, single-manifold representation methods lose important information by vectorizing each image in a video. Naturally, videos and image sets can be represented as data tensors, and tensor computing has been successfully applied to many vision applications (Kim & Cipolla, 2008). Lui (2012) factorized a data tensor using Higher Order Singular Value Decomposition (HOSVD) and placed each factorized element on a Grassmann manifold, so that a video can be represented as a point on a product Grassmann manifold (PGM). This representation yielded a very discriminative structure for action recognition. Wang et al. (2016) proposed a low rank representation model on PGM, which achieved good performance for clustering videos and image sets. Wang et al. (2018) proposed an extrinsic least squares regression on PGM for video-based recognition.
In this paper, we represent a human gesture video as a point on PGM. In brief, there are three factor Grassmann manifolds, which reflect the appearance, horizontal motion and vertical motion of a human gesture video respectively. In addition, the relative importance of these three aspects should be taken into account. Hence, we explore a weighted sparse coding method on PGM for video-based human gesture recognition, solved by minimizing the reconstruction error with an l1-norm regularizer.
Our main contributions lie in the following three aspects: (1) Extending the SRC model from the Grassmann manifold to the product Grassmann manifold to handle multi-dimensional data such as videos and image sets. (2) Discussing the different importance of the three factor manifolds and proposing a weighted sparse coding model. (3) Comparing with several classification methods on three datasets to show the effectiveness of the proposed method.
The rest of this paper is organized as follows: 'Product Grassmann Manifold Representation for Data' introduces product Grassmann manifold representation for data; 'Weighted Sparse Coding on Product Grassmann Manifold' gives a weighted sparse coding model on PGM; 'Experiments' shows experiments on different datasets, and experiment results show that the proposed method achieves considerable accuracy; 'Computational Complexity' analyzes the computational complexity of our proposed method; 'Main Findings and Future Directions' gives main findings and future directions.

PRODUCT GRASSMANN MANIFOLD REPRESENTATION FOR DATA
Throughout the paper, we use the commonly used mathematical symbols listed in Table 1.

Product Grassmann manifold
A point on the Grassmann manifold G(p,d) is a p-dimensional subspace of R^d (Absil, Mahony & Sepulchre, 2009). It can be spanned by an orthonormal basis X = [x_1 | x_2 | ... | x_p] ∈ R^{d×p} and is denoted span(X); for the sake of convenience, we use the same symbol X to represent span(X). The distance between two points X and Y on the Grassmann manifold can be defined through the projection embedding Π(X) = XX^T as d_g(X,Y) = (1/√2) ||XX^T − YY^T||_F, where Π maps G(p,d) into Sym(d), the symmetric matrices space of order d (refer to Harandi et al., 2015). The product Grassmann manifold (PGM) PG(p_1,...,p_M | d_1,...,d_M) is defined as the Cartesian product G(p_1,d_1) × ... × G(p_M,d_M), equipped with the weighted distance d([X],[Y])^2 = Σ_{m=1}^{M} ω_m d_g(X_m,Y_m)^2, where each weight ω_m ≥ 0 represents the importance of the factor manifold G(p_m,d_m) and Σ_{m=1}^{M} ω_m = 1.
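As a minimal sketch of these definitions (assuming numpy; the function names are ours, not from the original method), the projection-embedding distance on a factor Grassmann manifold and the weighted distance on PGM can be computed as:

```python
import numpy as np

def grassmann_dist(X, Y):
    """Projection (embedding) distance on G(p, d):
    d(X, Y) = (1/sqrt(2)) * ||X X^T - Y Y^T||_F,
    where X, Y are d x p orthonormal basis matrices."""
    return np.linalg.norm(X @ X.T - Y @ Y.T, "fro") / np.sqrt(2)

def pgm_dist(Xs, Ys, weights):
    """Weighted distance on the product Grassmann manifold:
    d([X], [Y])^2 = sum_m w_m * d(X_m, Y_m)^2."""
    return np.sqrt(sum(w * grassmann_dist(X, Y) ** 2
                       for w, X, Y in zip(weights, Xs, Ys)))
```

Because the distance depends on X only through XX^T, it is invariant to the choice of orthonormal basis of the subspace, which is exactly the property that makes the embedding well defined on the manifold.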

Data representation on PGM
In the real world, much data has a multi-dimensional structure. For example, a video can be represented as a tensor A ∈ R^{J_1×J_2×J_3}, where J_1, J_2 and J_3 are the height, width and length of the video respectively; an image set can be represented as a tensor A ∈ R^{J_1×J_2×J_3}, where J_1, J_2 and J_3 are the height, width and number of images respectively; a light field can be represented as a tensor A ∈ R^{J_1×J_2×J_3×J_4} (Wang & Zhang, 2020), where J_1 and J_2 are its angular resolution and J_3 and J_4 its spatial resolution. Before introducing data representation on PGM, we give a schematic of matrix unfolding for a third-order tensor in Fig. 1; the reader can refer to Kolda & Bader (2009) for more on tensor operations. For ease of understanding, Fig. 2 gives a corresponding example of two videos described as tensors. The corresponding unfolding matrices are discriminative for the two videos with different labels, hence the multi-dimensional information of a video tensor is worth mining for classification.
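The mode-m unfolding used in Fig. 1 can be sketched as follows (a numpy illustration; note that different references order the columns of the unfolding differently, but the row and column spaces, which are what the PGM representation uses, are the same):

```python
import numpy as np

def unfold(A, mode):
    """Mode-m unfolding of a tensor: move mode m to the front,
    then flatten the remaining modes into columns, giving a
    J_m x (prod of the other dimensions) matrix."""
    return np.moveaxis(A, mode, 0).reshape(A.shape[mode], -1)
```

For a 20 x 20 x 20 video tensor, each of the three unfoldings is a 20 x 400 matrix, so each row space lives in R^400, matching the factor dimensions d_m = 400 used in the experiments below.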
In the following, we discuss how to represent multi-dimensional data on PGM. The variation of each mode of a tensor A ∈ R^{J_1×...×J_M} can be captured by HOSVD (following Lui, 2012), which factorizes A as A = S ×_1 U_1 ×_2 U_2 ... ×_M U_M, where S is the core tensor and each orthogonal factor U_m (m = 1,...,M) is obtained from the SVD of the mode-m unfolding A_(m); the rows of A_(m) associated with non-zero singular values span its row space. Taking the first p_m right singular vectors of A_(m) yields an orthonormal basis X_m ∈ R^{d_m×p_m} with d_m = Π_{n≠m} J_n, so that A is represented as the point (X_1,...,X_M) on PG(p_1,...,p_M | d_1,...,d_M). Remark: the value of the parameter p_m (m = 1,...,M) controls how much of the principal information of the data is retained. In brief, the representation may be redundant if p_m is too large and insufficient if p_m is too small. Hence selecting the parameters p_m (m = 1,...,M) is important, and we discuss this problem in detail in our experiments.
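The mapping from a video tensor to a PGM point can be sketched as below (a numpy sketch under the row-space reading of the factorization above; the function name is ours):

```python
import numpy as np

def video_to_pgm(A, ps):
    """Represent a 3rd-order tensor A as a point on the PGM:
    for each mode m, take the top p_m right singular vectors of the
    mode-m unfolding, i.e. an orthonormal basis of its dominant
    row space (a d_m x p_m matrix, d_m = product of the other dims)."""
    factors = []
    for m, p in enumerate(ps):
        Am = np.moveaxis(A, m, 0).reshape(A.shape[m], -1)  # mode-m unfolding
        _, _, Vt = np.linalg.svd(Am, full_matrices=False)
        factors.append(Vt[:p].T)
    return factors
```

For a 20 x 20 x 20 video with (p_1, p_2, p_3) = (8, 18, 12), this yields factors of shapes 400 x 8, 400 x 18 and 400 x 12, i.e. a point on PG(8,18,12|400,400,400).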

WEIGHTED SPARSE CODING ON PRODUCT GRASSMANN MANIFOLD

Weighted sparse coding model on PGM
Let [Y] = (Y_1, Y_2, ..., Y_M) be a query sample on the product Grassmann manifold and let {[D_i] = (D_{1,i}, ..., D_{M,i})}_{i=1}^{N} be a dictionary of N samples on PGM. The sparse coding model on PGM is formulated as min_α || [Y] ⊖ ⊕_{i=1}^{N} α_i ⊙ [D_i] ||^2 + λ ||α||_1, where ⊕ and ⊙ are used to simulate the ''linear'' combination defined on PGM, i.e., addition and scalar multiplication.
To instantiate this model, proper definitions of the distance and the combination operators must be specified. According to the geometric properties of the Grassmann manifold, we use the embedded distance and linear combination in the space of symmetric matrices, mapping each factor X_m to X_m X_m^T. Hence, we construct the weighted sparse coding model on PGM as

min_α Σ_{m=1}^{M} ω_m || Y_m Y_m^T − Σ_{i=1}^{N} α_i D_{m,i} D_{m,i}^T ||_F^2 + λ ||α||_1.  (1)

Algorithm for the weighted sparse coding on PGM
In this subsection, we show how to solve the optimization problem Eq. (1). Expanding the Frobenius norm, for each mode m we define a matrix K_m(X) ∈ R^{N×N} and a vector K_m(X,Y) ∈ R^N whose elements are [K_m(X)]_{ij} = ||D_{m,i}^T D_{m,j}||_F^2 and [K_m(X,Y)]_i = ||Y_m^T D_{m,i}||_F^2. With K = Σ_m ω_m K_m(X) and k = Σ_m ω_m K_m(X,Y), Eq. (1) is equivalent, up to a constant, to min_α α^T K α − 2 k^T α + λ ||α||_1. Since K is symmetric positive semi-definite, the problem is convex and can be solved as a vectorized sparse coding problem.
In detail, let U Δ U^T be the SVD (eigendecomposition) of K, and set A = Δ^{1/2} U^T and b = Δ^{−1/2} U^T k. Then the problem becomes the standard lasso min_α ||b − A α||^2 + λ ||α||_1, which can be solved by any l1 solver. The whole procedure, WSC-PGM, is summarized in Algorithm 1:

Algorithm 1 (WSC-PGM)
Require: query [Y], dictionary {[D_i]}_{i=1}^{N}, weights ω_1,...,ω_M, parameter λ
Ensure: the sparse code α*
1: for m = 1 : M, i = 1 : N, j = 1 : N do compute [K_m(X)]_{ij} and [K_m(X,Y)]_i
2: form K = Σ_m ω_m K_m(X) and k = Σ_m ω_m K_m(X,Y)
3: decompose K = U Δ U^T; set A = Δ^{1/2} U^T, b = Δ^{−1/2} U^T k
4: solve min_α ||b − A α||^2 + λ ||α||_1  (2)
5: return α*
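A compact sketch of this solver follows (assuming numpy; the choice of ISTA as the l1 solver and all names are ours, not prescribed by the original algorithm):

```python
import numpy as np

def wsc_pgm(Y, D, weights, lam=0.1, iters=500):
    """Weighted sparse coding on PGM (a sketch).
    Y: list of M query factors Y_m (d_m x p_m, orthonormal columns).
    D: list of N dictionary atoms, each a list of M factors D_{m,i}.
    Builds K and k, embeds into a vectorized lasso, solves by ISTA."""
    N, M = len(D), len(Y)
    K = np.zeros((N, N))
    k = np.zeros(N)
    for m, w in enumerate(weights):
        for i in range(N):
            k[i] += w * np.linalg.norm(Y[m].T @ D[i][m]) ** 2
            for j in range(N):
                K[i, j] += w * np.linalg.norm(D[i][m].T @ D[j][m]) ** 2
    # Embed: K = U diag(s) U^T, A = diag(sqrt(s)) U^T, b = diag(1/sqrt(s)) U^T k
    s, U = np.linalg.eigh(K)
    keep = s > 1e-10                     # drop numerically zero eigenvalues
    A = np.sqrt(s[keep])[:, None] * U[:, keep].T
    b = (U[:, keep].T @ k) / np.sqrt(s[keep])
    # ISTA for min ||b - A a||^2 + lam * ||a||_1
    alpha = np.zeros(N)
    L = 2 * s.max() + 1e-12              # Lipschitz constant of the gradient
    for _ in range(iters):
        g = 2 * A.T @ (A @ alpha - b)
        z = alpha - g / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return alpha
```

K is the Gram matrix of the embedded atoms, <Π(D_i), Π(D_j)> = ||D_i^T D_j||_F^2, which is why it is positive semi-definite and the eigendecomposition trick is valid.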

Classification rule and algorithm
Once model Eq. (1) is solved, the sparse code α* can be used for classification by the minimal class reconstruction residual. The residual error of a query sample [Y] = (Y_1, Y_2, ..., Y_M) with respect to the samples of class k is defined as r_k(Y) = Σ_{m=1}^{M} ω_m || Y_m Y_m^T − Σ_{i ∈ class k} α*_i D_{m,i} D_{m,i}^T ||_F^2. The estimated class of the query Y is then determined by k* = argmin_k r_k(Y). The procedure of sparse representation classification on the product Grassmann manifold (WSRC-PGM) is summarized in Algorithm 2.
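The classification rule can be sketched as follows (a numpy sketch consistent with the residual definition above; names are ours):

```python
import numpy as np

def classify_pgm(Y, D, labels, alpha, weights):
    """Class residual r_k = sum_m w_m || Y_m Y_m^T -
    sum_{i: label_i = k} alpha_i D_{m,i} D_{m,i}^T ||_F^2;
    assign the query to the class with the smallest residual."""
    classes = sorted(set(labels))
    residuals = []
    for c in classes:
        r = 0.0
        for m, w in enumerate(weights):
            recon = sum(alpha[i] * D[i][m] @ D[i][m].T
                        for i, l in enumerate(labels) if l == c)
            r += w * np.linalg.norm(Y[m] @ Y[m].T - recon) ** 2
        residuals.append(r)
    return classes[int(np.argmin(residuals))]
```

In words: each class reconstructs the embedded query using only its own atoms and the learned coefficients, and the best-reconstructing class wins.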

EXPERIMENTS
In this section, we evaluate the performance of the proposed method against some state-of-the-art methods on three datasets. In the following experiments, all video data are regarded as points on the PGM G(p_1,d_1) × G(p_2,d_2) × G(p_3,d_3), and the parameter λ is set to 0.1 by experience.

Cambridge hand gesture datasets
The Cambridge hand gesture dataset (Kim & Cipolla, 2008) contains 900 video sequences of 9 classes and is divided into 5 sets according to illumination. The 9 classes are flat-leftward (FL), flat-rightward (FR), flat-contract (FC), spread-leftward (SL), spread-rightward (SR), spread-contract (SC), V-shape-leftward (VL), V-shape-rightward (VR) and V-shape-contract (VC). We follow the experimental protocol of Kim & Cipolla (2008): Set5 (normal illumination) is used for training while the remaining sequences (with different illumination characteristics) are used for testing. In this experiment, the original sequences are converted to grayscale and resized to 20 × 20 × 20. Since the experimental results depend on the selection of parameters, we first discuss the parameter setting.
The performance of the model depends on the parameters (p_1, p_2, p_3, ω_1, ω_2) in Eq. (1), so we determine them jointly. For this dataset, p_1, p_2, p_3 are each optimized over the range 2 to 20 in steps of 2, and ω_1, ω_2 are optimized over the range in Table 2. We perform 5-fold cross validation on Set5 to find the optimal (p*_1, p*_2, p*_3, ω*_1, ω*_2): each time one fold is left out for testing and the other four folds are used for training, and we record the correct recognition rate (CRR) of each fold.
Maximizing the average CRR over the five folds leaves 33 candidate parameter combinations. Since we expect the representation to carry more information and thus better fit the testing data, among these 33 combinations we choose the top 5% with the largest p_1 + p_2 + p_3. The selected combinations of parameters (p_1, p_2, p_3, ω_1, ω_2, ω_3) are listed in Table 3, and Table 4 shows the CRRs of the five folds of Set5 under these combinations. To illustrate the parameter selection process, Figs. 3-5 show slices of the CRR's variation along each parameter dimension around the optimal combinations in Table 3.
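One reading of this selection procedure can be sketched as follows (a sketch; `evaluate_cv`, which would return the average 5-fold CRR for a given parameter combination, is a hypothetical user-supplied function, and the tie-breaking details are our interpretation):

```python
import itertools
import numpy as np

def select_params(evaluate_cv, p_grid, w_grid, top_frac=0.05):
    """Grid search over (p1, p2, p3, w1, w2):
    1. evaluate every grid point and keep those attaining the best CRR;
    2. among the ties, keep the top fraction with largest p1 + p2 + p3."""
    combos = [(p1, p2, p3, w1, w2)
              for p1, p2, p3 in itertools.product(*p_grid)
              for w1, w2 in w_grid if w1 + w2 <= 1.0]
    scores = {c: evaluate_cv(c) for c in combos}
    best = max(scores.values())
    ties = [c for c, s in scores.items() if s == best]
    ties.sort(key=lambda c: c[0] + c[1] + c[2], reverse=True)
    n_keep = max(1, int(np.ceil(top_frac * len(ties))))
    return ties[:n_keep]
```

The constraint w1 + w2 <= 1.0 reflects that ω_3 = 1 − ω_1 − ω_2 must be non-negative.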

Experiment result on testing sets
In this experiment, the parameter λ is set to 0.1. With the three combinations of parameters (p_1, p_2, p_3, ω_1, ω_2), the samples of Set1-Set4 are represented as points on PG(8,18,12|400,400,400), PG(20,10,12|400,400,400) and PG(14,12,12|400,400,400) respectively. Table 5 summarizes the correct recognition rates for Set1-Set4 and the average correct recognition rate followed by the standard deviation. As Table 5 shows, WSRC-PGM has superior performance compared with TCCA (Kim & Cipolla, 2008), PM (Lui, 2012), gSC and kgSC (Harandi et al., 2015), and DMD+SC (SCCD2) (Singh et al., 2021). The confusion matrices of our approach on the four testing sets under parameter combination 1 are given in Fig. 6; those for combinations 2 and 3 behave similarly and are omitted here. As Fig. 6 shows, the most misclassified class is SL, and most of its misclassified samples were assigned to the SC class; the second most misclassified class is SC, and most of its misclassified samples were assigned to the VC class.

Figure 3: Slices of the CRR's variation with each parameter for combination 1 on the Cambridge hand gesture dataset. (A) The solid line varies p_1 with (p_2,p_3) fixed at (18,12); the optimal p_1 in this slice is 8. The dotted line varies p_2 with (p_1,p_3) fixed at (8,12); the optimal p_2 is 18. The dashdot line varies p_3 with (p_1,p_2) fixed at (8,18); the optimal p_3 is 12. (B) The heatmap shows the variation of CRR with different (ω_1,ω_2).

[Figure 4 caption, combination 2: the optimal slice values are p_1 = 20 with (p_2,p_3) fixed at (10,12), p_2 = 10 with (p_1,p_3) fixed at (20,12), and p_3 = 12 with (p_1,p_2) fixed at (20,10); panel (B) is the corresponding (ω_1,ω_2) heatmap.]

[Figure 5 caption, combination 3: the optimal slice values are p_1 = 14 with (p_2,p_3) fixed at (12,12), p_2 = 12 with (p_1,p_3) fixed at (14,12), and p_3 = 12 with (p_1,p_2) fixed at (14,12); panel (B) shows the optimal (ω_1,ω_2) = (0.2,0.4).]

... classes such as speed, clothing and motion paths. The frame images are normalized and centered at a fixed size of 20 × 20. We extract 2400 sub-videos in total by randomly sampling 6 frames from each original video exhibiting the same action, and the images are converted to grayscale. We randomly select 1200 samples as the training set and the remainder as the testing set.
Similarly to the parameter setting for the Cambridge hand gesture dataset, we jointly determine the parameters (p_1, p_2, p_3, ω_1, ω_2) by 5-fold cross validation on the training set, where p_1, p_2 range over {2 : 2 : 20}, p_3 ranges over {1 : 1 : 6}, and ω_1, ω_2 range as in Table 2. The top 5% candidate parameter combinations (p_1, p_2, p_3, ω_1, ω_2, ω_3) are listed in Table 6, and the testing samples are represented on PG(10,6,2|120,120,400), PG(10,4,4|120,120,400) and PG(10,2,6|120,120,400) respectively. Table 7 summarizes the average correct recognition rate.
The results show that our algorithm has superior performance compared with some state-of-the-art methods. The confusion matrices of our approach on the testing set under the three parameter combinations are given in Fig. 7.

UMD Keck body-gesture datasets
The UMD Keck body-gesture dataset contains 14 naval body gestures acquired in both static and dynamic backgrounds. In the static backgrounds the subjects and the camera remain stationary, while in the dynamic backgrounds both are moving; 126 videos and 168 videos are collected from the static scene and the dynamic environment respectively. We follow the experimental setting of Lin, Jiang & Davis (2009): in the static background we adopt Leave One Out Cross Validation (LOOCV), while for the dynamic background the gestures acquired in the static background are used for training and the gestures in the dynamic background are used for testing.
In our experiment, videos are first cropped by tracking the region of interest with a simple correlation filter, and then all videos are resized to 32 × 24 × 45. Videos with fewer than 45 frames are padded by repeating the last frame with added Gaussian noise. As before, we jointly determine the parameters (p_1, p_2, p_3, ω_1, ω_2) by 5-fold cross validation on the training set, where p_1 ranges over {2 : 4 : 32}, p_2 over {2 : 4 : 24}, p_3 over {10 : 4 : 45}, and (ω_1, ω_2) as in Table 2. The top 5% candidate parameter combinations (p_1, p_2, p_3, ω_1, ω_2, ω_3) are listed in Table 8, and the testing samples are represented on PG(6,22,14|1080,1440,768). Table 9 shows that WSRC-PGM has higher performance than TB (Lui, 2011), Prototype-Tree (Lin, Jiang & Davis, 2009) and PM (Lui, 2012). The confusion matrix of our approach with parameter combination 1 is given in Fig. 8.
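The frame-padding step described above can be sketched as follows (a sketch; the noise scale `sigma` is an assumption, as the paper does not specify it):

```python
import numpy as np

def pad_video(frames, target_len=45, sigma=5.0, rng=None):
    """Pad a video (sequence of equal-size frames) to `target_len`
    frames by repeating the last frame with additive Gaussian noise;
    `sigma` is an assumed noise scale, not specified in the paper."""
    rng = np.random.default_rng() if rng is None else rng
    frames = [np.asarray(f, dtype=float) for f in frames]
    while len(frames) < target_len:
        frames.append(frames[-1] + rng.normal(0.0, sigma, frames[-1].shape))
    return np.stack(frames, axis=-1)  # height x width x length tensor
```

The noise keeps the appended frames from being exact duplicates, which would otherwise make the mode-3 unfolding rank-deficient in a degenerate way.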

Discussion
The above experiments show that the proposed method is effective for video-based human gesture recognition. The selection of parameters is a key step: we jointly selected the parameters (p*_1, p*_2, p*_3, ω*_1, ω*_2, ω*_3) on a grid by maximizing the average CRR of 5-fold cross validation on the training set. This parameter selection is time-consuming because of the high dimensionality of the parameter space. The limitation might be alleviated by alternating optimization with rational initial values based on prior information about the data distribution, since the number of parameters optimized in each iteration would then be reduced.

COMPUTATIONAL COMPLEXITY
We analyze the time complexity of the WSC-PGM algorithm in this section. The algorithm focuses on improving the correct recognition rate by sparse coding on the product Grassmann manifold; we compare its computational cost with sparse coding on a single Grassmann manifold, gSC (Harandi et al., 2015). Using the notation of WSC-PGM, computing the vectors K_m(X,Y) requires O(N(d_1 p_1^2 + d_2 p_2^2 + d_3 p_3^2)) flops. The gSC algorithm requires O(N d p^2) flops for computing ||Z^T D_j||_F^2 (j = 1,...,N), where span(Z), span(D_j) ∈ G(p,d); the other steps of the two algorithms have the same computational complexity. As a concrete example on the Cambridge hand gesture dataset, combination 1 in our experiment uses d_1 = d_2 = d_3 = 400, p_1 = 8, p_2 = 18, p_3 = 12, while gSC (Harandi et al., 2015) uses d = 400, p = 50. Then d_1 p_1^2 + d_2 p_2^2 + d_3 p_3^2 = 212,800 < d p^2 = 1,000,000, yet the CRR of WSC-PGM is higher than that of gSC (Harandi et al., 2015).
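The flop-count comparison above is a simple arithmetic check:

```python
# Per-query flop counts (up to the shared O(N) factor) for the
# Cambridge settings: WSC-PGM combination 1 vs gSC.
d1 = d2 = d3 = 400
p1, p2, p3 = 8, 18, 12
wsc = d1 * p1**2 + d2 * p2**2 + d3 * p3**2   # cost driver of K_m(X, Y)
gsc = 400 * 50**2                            # d * p^2 in gSC
assert wsc == 212800
assert gsc == 1000000
assert wsc < gsc
```

So WSC-PGM's dominant step costs roughly a fifth of gSC's under these settings.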
We further report the execution time of WSC-PGM for classification in Table 10. All experiments were executed on an Intel(R) Core(TM) i7-10700 CPU with 32 GB RAM.

MAIN FINDINGS AND FUTURE DIRECTIONS
Addressing video-based human gesture recognition, we proposed a novel weighted sparse coding model on the product Grassmann manifold. A video can be viewed as a third-order tensor and represented as a point on the product Grassmann manifold by factorizing the tensor through HOSVD. This representation characterizes the multi-dimensional information of video data, including appearance, horizontal motion and vertical motion, and efficiently exploits the non-linear manifold structure of the data. Based on the PGM representation of videos, we proposed a sparse coding method by embedding the product Grassmann manifold into a product space of symmetric matrices, together with an efficient algorithm, WSC-PGM, and the corresponding classification algorithm, WSRC-PGM. The method improves the correct recognition rate while reducing the time complexity compared with sparse coding on a single Grassmann manifold. Experiments on three public datasets show that our method performs very well.
In future work, we would like to combine the product Grassmann manifold representation with time series models in tensor form, in order to enhance the discriminative power for videos.