Structure-Preserving Binary Representations for RGB-D Action Recognition

In this paper, we propose a novel binary local representation for RGB-D video data fusion with a structure-preserving projection. Our contribution consists of two aspects. To acquire a general feature for the video data, we convert the problem to describing the gradient fields of RGB and depth information of video sequences. With the local fluxes of the gradient fields, which include the orientation and the magnitude of the neighborhood of each point, a new kind of continuous local descriptor called Local Flux Feature (LFF) is obtained. Then the LFFs from RGB and depth channels are fused into a Hamming space via the Structure Preserving Projection (SPP). Specifically, an orthogonal projection matrix is applied to preserve the pairwise structure with a shape constraint to avoid the collapse of data structure in the projected space. Furthermore, a bipartite graph structure of data is taken into consideration, which is regarded as a higher level connection between samples and classes than the pairwise structure of local features. Extensive experiments show not only the high efficiency of binary codes and the effectiveness of combining LFFs from RGB-D channels via SPP on various action recognition benchmarks of RGB-D data, but also the potential power of LFF for general action recognition.


INTRODUCTION
RGB-D sensors such as Kinect have received increasing attention in the computer vision community [1]. They have been widely applied to many areas, such as human activity recognition [2], robot path planning [3], object detection [4], scene labeling [5], interactive gaming [6], and 3D mapping [7]. The combination of RGB and depth information enhances the capabilities of computer vision algorithms. It also provides an alternative way to learn features from video data for action recognition, especially through learning fused RGB-D representations.
To gain a more robust and accurate representation of samples, local feature descriptors such as SIFT [8], HOG3D [9], HOG [10], HOF [11] and MBH [12] have been proposed and have achieved notable success in classification and recognition. Based on these local features, the Bag-of-Words (BoW) model [13] and the Sparse Coding (SC) algorithm [14] have shown their effectiveness for both image classification and action recognition. During the last decade, extensive efforts have been put into the improvement of BoW and SC. However, in most vision-based tasks there are millions of local features with hundreds or even thousands of dimensions, which severely restricts the computational efficiency of similarity search in recognition algorithms. It is, therefore, highly desirable to find a compact and efficient but discriminative representation for local features.
The fast bitwise operations in Hamming space motivate us to propose a local binary representation for RGB-D video data. In this way, similarity search reduces to computing Hamming distances, which is done with the XOR operation, rather than computing Euclidean distances with real-valued additions and multiplications. The efficiency of classification and recognition algorithms is thereby significantly improved. Our proposed scheme is two-fold.
First, towards constructing a common representation applicable to both RGB and depth data, we view a video sequence in either RGB or depth as a scalar field in $\mathbb{R}^3$ with the frame coordinates $(x, y)$ and the temporal axis $t$ (for RGB data, we can use the three channels of red, green, and blue to form three scalar fields in $\mathbb{R}^3$ separately; in the experiments, to alleviate the computational complexity, we only use the gray-scale information). To describe this scalar field, we compute the local flux of its gradient field and obtain a feature vector called Local Flux Feature (LFF) for each pixel. Generally speaking, the local flux $f_r(P)$ at point $P$ is defined as the rate of the gradient field (flow) passing through a sphere surface with radius $r$ centered at $P$. In other words, the local flux at point $P$ captures the information of the orientation and the magnitude of the gradient field over a neighborhood of $P$, and $f_r(P)$, as a continuous function, represents an average quantity of the flow over this neighborhood. Many gradient-based features have been successfully applied to practical situations, since the gradient field represents the direction of the greatest change of a function. Theoretically, the Helmholtz theorem [15] in fluid mechanics states that we only need to know the divergence and curl of a twice continuously differentiable vector field to determine it. Given a $C^2$-smooth function $V(x, y, t): \mathbb{R}^3 \to \mathbb{R}$, its gradient $\nabla V$ satisfies

$$\nabla \times \nabla V = \left(\nabla_{ty} V - \nabla_{yt} V,\; \nabla_{xt} V - \nabla_{tx} V,\; \nabla_{yx} V - \nabla_{xy} V\right) = 0,$$

which means $\mathrm{curl}(\nabla V) = 0$, showing that the divergence of $\nabla V$ provides the vital information for the gradient field. Fortunately, the divergence theorem converts computing the flux $f_r(P)$ through a closed sphere to computing the volume integral of the divergence inside the sphere. Obviously, computing $f_r(P)$ for every pixel is time-consuming and unnecessary. Thus we only calculate the local fluxes for the regions around the interest points or the points selected by dense sampling in RGB data and the corresponding pixels in depth data.
Second, we fuse the LFFs from RGB and depth channels of points into Hamming space. To make the above features more discriminative and meaningful in Hamming space, we propose a Structure Preserving Projection (SPP) method. Generally speaking, SPP preserves two levels of data structure. In terms of low-level features, we consider the relationship among local feature descriptors, i.e., their pairwise structure, which is maintained in the binary representation learning to embed high-dimensional feature descriptors into a lower-dimensional structure-preserved Hamming space. In the learning phase, each pair of local features is given a weak label related to their Euclidean distance. Specifically, a pair of local features is a positive pair if one feature of the pair is within the $k$ nearest neighbors of the other; otherwise, it is a negative pair.

M. Yu, L. Liu and L. Shao are with the Department of Computer Science and Digital Technologies, Northumbria University, Newcastle upon Tyne NE1 8ST, UK. E-mail: {m.y.yu, ling.shao}@ieee.org, li2.liu@northumbria.ac.uk.
Considering the shape of the data distribution, the pairwise structure also includes the angles between each pair of local feature descriptors. Taking two negative pairs $(x_1, x_2)$ and $(x_1, x_3)$ as an example (since the majority of pairs are negative), they are encoded to pairs which have large distances in the Hamming space. Nevertheless, an over-fitting case is that the pair $(x_2, x_3)$ may be mapped to a pair with a small distance, as shown in Fig. 1. Therefore, preserving the angles can be regarded as a shape constraint for the structure of pairwise Euclidean distances. It ensures that the shape of data in the original space does not collapse in the Hamming space while pairwise distances are preserved.
Furthermore, with respect to the high-level connection, we also want to establish links between samples and classes. The bipartite graph (a.k.a. bigraph) consisting of samples and classes shows the relationship between samples and classes. To quantify the edges, we use the image-to-class (I2C) distance, which was first introduced in the naive Bayes nearest neighbor (NBNN) classifier [16] and was also proven to be an optimal distance for classification in [16]. It represents the sum of all distances from the local features of an image to their corresponding nearest neighbors in each class. Although it was proposed for image classification, it can be applied to any kind of samples represented by local feature descriptors. I2C distances can effectively avoid the quantization error in the bag-of-features model. Our algorithm shows that the performance can be enhanced by combining the sample-to-class structure (bigraph regularization) and the pairwise geometrical structure.
It is worthwhile to highlight several properties of the proposed scheme:
- LFF is a continuous feature descriptor without loss of orientations and magnitudes of the gradient field, which makes it more suitable for the discretization of the final binary representation, since every discretization introduces deviation into the results.
- SPP simultaneously preserves two independent aspects of geometrical structure, Euclidean distances and angles, which balance each other and avoid over-fitting.
- SPP considers two levels of the relationship of data structure based on local feature descriptors. Preserving the local structure and the global structure in the original feature space makes local feature descriptors more discriminative in the lower-dimensional space.
- Our scheme fuses RGB and depth information. The fused local feature descriptors have learned the complementary nature of RGB and depth information.
- Our representation is linear and binary. This makes it extremely fast and useful for many practical applications.

RELATED WORK
Feature extraction from RGB video data has been well explored [17], [18], [19], [20]. Detectors such as Spatio-Temporal Interest Points (STIP) [21] and Dollar's detector [22] are usually used to locate interest points before feature extraction. Many video descriptors are extended from their counterparts in the image domain [8], [9], [12], [23], [24]. As 3D versions of SURF [25], SIFT [8] and HOF [11], 3D speeded up robust features (SURF3D) [26], 3D scale invariant feature transforms (3D-SIFT) [27] and 3D motion features [28], [29] have been proposed for action recognition, respectively. The Histogram of Oriented Gradients (HOG), which discretizes the gradient orientations, is widely used in the above schemes. In our work, however, discretization is only performed in the pixel-level computation. Fathi and Mori [30] developed a method to extract mid-level motion features by using the low-level optical flow for action recognition. Recently, dense trajectories [31] achieved high accuracies on most action recognition datasets. However, this method suffers from extremely high computational complexity. More feature extraction methods for action recognition can be found in the survey provided by Poppe [32].

Fig. 1. Basic principle of the projection with angle-preserving in a two-dimensional example. The distances of two negative pairs, $\|x_1 - x_2\|$ and $\|x_1 - x_3\|$, are expected to be maximized after the projection. Without angle-preserving, the shape of $(x_1, x_2, x_3)$ collapses in the Hamming space and therefore loses its discriminative ability.
Compared to conventional RGB cameras, depth cameras are relatively new. Existing features are designed specifically for depth information, since depth data carry far fewer characteristics such as color and texture than RGB data. Motion History Image (MHI) [33] is a typical template matching method for the analysis of depth information and the applications of human motion recognition [34]. Using the depth information only, Shotton et al. [35] proposed a method for human body joints analysis which is the core component of the Kinect gaming system. Nevertheless, more feature extraction methods are designed for fusion with RGB information. Based on HOG, Spinello and Arras [4] proposed a method called Histogram of Oriented Depths (HOD) for depth description and probabilistically combined HOD and HOG into a Combo-HOD to detect people in urban environments. Methods in [36] and [37] simply optimize all available information in their algorithms for object detection and recognition, respectively. Similarly, Ni et al. [38] designed two color-depth fusion schemes for human activity recognition. Using the depth and skeleton information of actions, Wang et al. [2] proposed a new feature called Local Occupancy Pattern (LOP) and an actionlet ensemble model which indicates a structure of features. Recently, the HON4D descriptor [39] was proposed to build the histogram of the normal unit vectors from the depth channel for activity recognition.
Apart from feature extraction, there are also many approaches that analyze actions with a temporal model. A typical one is dynamic time warping (DTW) [40], which was first proposed for speech processing. Due to the time-sequential property, DTW has also been widely used as a measurement method in human action recognition for both depth data [41] and body joints of skeletons [42].
The above works are specifically designed for either RGB or depth data. In our work, LFF is a general descriptor which is suitable for both RGB and depth data. Besides, since LFF is obtained by calculating the local flux of the continuous gradient vector field, there are no bins or histograms in its computation, which avoids the quantization error present in most histogram-based methods. The Gradient Vector Flow (GVF) [43] has been successfully used in active contour alignments by solving the PDEs for an energy minimization problem. Engel and Curio [44] calculated the flux flow on the GVF and adopted it for pedestrian detection. Based on the 3D vector field, a rotation invariant descriptor called 3D-Div [45] was proposed for 3D object recognition by computing the divergence of the vector field. Nonetheless, the pointwise divergence in [45] cannot capture the neighborhood information of each point. In our work, we focus on the discriminative ability of the local flux and its advantage in RGB-D action recognition.
Preserving the intrinsic manifold/subspace structure is also involved in our algorithm to seek a more discriminative representation of local features. Manifold learning methods such as ISOMAP [46], Laplacian Eigenmap (LE) [47] and Locally Linear Embedding (LLE) [48] were designed to preserve the manifold structure of data in the original space. A unified review and other manifold learning algorithms can be found in [49]. Normally, linear methods possess high efficiency. Locality Preserving Projection (LPP) [50] is the first linear projection algorithm that preserves the high-dimensional local structure. Neighborhood Preserving Embedding (NPE) [51] also tries to preserve the local representation of data. Capturing the intrinsic geometrical structure of data, Sparse Concept Coding (SCC) [52], which is a matrix factorization method, provides a sparse representation of the image space. For pairwise structure preserving, a related work for fast vision applications [53] represents each image using a binary vector calculated via boosted coding. In contrast, few works have attempted angle preserving in dimensionality reduction. Caseiro et al. [54] applied the rolling map to the classification problem. Although the angles measured by geodesics in the original manifold are equal to the ones in the mapped manifold, the algorithm is not linear.
However, these works mainly focused on the global representations rather than preserving both pairwise structure of local feature descriptors and bipartite graph structure between samples and classes in the original space for designing efficient binary codes in Hamming space.
In the aspect of hash/binary code learning [55], one classical method is Locality-Sensitive Hashing (LSH) [56]. Another popular technique called Spectral Hashing (SpH) [57] was also proposed to preserve the locality information of data. Recently, a supervised method called Kernel-Based Supervised Hashing (KSH) [58] has shown good discriminative ability of binary codes and outperformed other supervised methods such as Linear Discriminant Analysis Hashing (LDAH) [59], Binary Reconstructive Embeddings (BRE) [60] and Minimal Loss Hashing (MLH) [61]. The above works mainly focus on preserving the pairwise distance, which is one part of SPP. To avoid over-fitting as shown in Fig. 1, SPP also takes the pairwise angle into account. Towards local descriptors, Hamming Embedding (HE) [62] was proposed to map real-valued local features to binary codes. SPP contains a sample-to-class relationship [63] when each sample is represented by a set of local descriptors, since most visual tasks are sample-oriented. Experimental results show that these three terms, i.e., the pairwise distance, the pairwise angle and the sample-to-class relationship, all contribute to the outstanding performance of the proposed method.

LOCAL FLUX FEATURE
Local features extracted from local regions in an image or a video sequence are used to describe the local structure of a sample. Usually, local regions are the neighborhoods of points which are determined by using an interest point detector or by dense sampling of the image plane or video volume. Then, a feature vector is computed for each local region by characterizing its properties. In our algorithm, we compute the new Local Flux Features (LFFs) from the RGB-D video data and then combine the local feature $x_{RGB}$ from RGB information with the local feature $x_{Depth}$ from depth information to obtain a concatenated feature vector $X \in \mathbb{R}^D$.

Flux Computation
The concept of flux has been studied deeply in applied physics, especially in fluid mechanics and electromagnetic theory. The flux of a vector field over a simply-connected closed region (a sphere in this paper) is defined as the quantity of this vector field passing through the boundary of the region. This quantity includes the information of the orientation and the magnitude of the vector field over the region, and it is used as a description of the vector field. To describe a video sequence, which is regarded as a scalar field, we consider its gradient field and compute the local flux of the gradient field.
Given a video sequence $V(x, y, t)$ in either RGB$^1$ or depth, it can be seen as a function $V: \mathbb{R}^3 \to \mathbb{R}$. We assume $V$ is a $C^2$-smooth function, i.e., $V \in C^2(\Omega)$, where $\Omega$ is the domain of the video sequence, usually an $L \times W \times H$ cuboid. In fact, in the discrete setting, derivative computation can be regarded as an approximation by a convolution operation of matrices. Then for the scalar field $V(x, y, t)$, we consider its gradient field $\nabla V(x, y, t) = (\nabla_x V, \nabla_y V, \nabla_t V)$. To describe the gradient field $\nabla V$, we assign an $l \times w \times h$ cuboid centered at each candidate point (interest points or dense samples) and compute the local flux of every pixel (or lattice point if we regard the coordinates of a pixel as integers) in the cuboid. To be specific, denote by $B_P(r) = \{(x', y', t') \mid (x' - x)^2 + (y' - y)^2 + (t' - t)^2 \le r^2\}$ the sphere with center $P = (x, y, t)$ and radius $r$. The local flux at the point $P$ over the sphere surface $\partial B_P(r)$ is calculated as

$$f_r(P) = \iint_{\partial B_P(r)} \nabla V \cdot dS, \tag{1}$$

where $dS$ represents the directed area unit of the boundary surface $\partial B_P(r)$. However, computing on the lattice points on the boundary $\partial B_P(r)$ is difficult and inaccurate. According to the divergence theorem, we have

$$f_r(P) = \iint_{\partial B_P(r)} \nabla V \cdot dS = \iiint_{B_P(r)} \mathrm{div}(\nabla V)\, dB_P(r) = \iiint_{B_P(r)} \Delta V\, dB_P(r), \tag{2}$$

i.e., we only need to compute for the points inside the sphere $B_P(r)$. Note that in the light of the Helmholtz theorem [15] in fluid mechanics, we only need to know the divergence and the curl of a twice continuously differentiable vector field to determine it. Hence, the fact that $\mathrm{curl}(\nabla V) = \nabla \times \nabla V = 0$ implies that the divergence of $\nabla V$ provides the vital information, which is captured by the local flux $f_r(P)$. For realistic computation, we adopt the numerical approximation for the discrete setting of pixels:

$$\iiint_{B_P(r)} \Delta V\, dB_P(r) \approx \sum_{P' \in B_P(r)} \Delta V(P'), \tag{3}$$

where $\Delta$ is the Laplace operator and the sum runs over the lattice points $P'$ inside $B_P(r)$. Suppose there are $D/2$ pixels in an $l \times w \times h$ cuboid; then we compute $D/2$ local fluxes in a specific order$^2$ and obtain an LFF vector $x = (x_1, \ldots, x_{D/2}) \in \mathbb{R}^{D/2}$. Fig. 2 illustrates the outline of the computation of local fluxes. Having computed the LFF $x_{RGB}$ from the RGB channel and $x_{Depth}$ at the corresponding point from the depth channel, we concatenate their normalizations and obtain the new feature

$$X = \left(\frac{x_{RGB}}{\|x_{RGB}\|},\; \frac{x_{Depth}}{\|x_{Depth}\|}\right) \in \mathbb{R}^D. \tag{4}$$

The combined LFF is regarded as the basic feature for the later learning of binary codes in our algorithm.
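The flux computation described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names (`laplacian_3d`, `ball_offsets`, `local_flux`, `lff`, `combine`), the edge-replicated border handling, the raster ordering of the cuboid and the use of the L2 norm for normalization are all assumptions made for the example.

```python
import numpy as np

def laplacian_3d(V):
    """Discrete Laplacian of a 3-D array (x, y, t) by central second differences.
    Borders use edge replication, which is an assumption of this sketch."""
    Vp = np.pad(np.asarray(V, dtype=float), 1, mode="edge")
    return (Vp[2:, 1:-1, 1:-1] + Vp[:-2, 1:-1, 1:-1]
            + Vp[1:-1, 2:, 1:-1] + Vp[1:-1, :-2, 1:-1]
            + Vp[1:-1, 1:-1, 2:] + Vp[1:-1, 1:-1, :-2]
            - 6.0 * Vp[1:-1, 1:-1, 1:-1])

def ball_offsets(r):
    """Integer offsets (dx, dy, dt) with dx^2 + dy^2 + dt^2 <= r^2."""
    g = np.arange(-int(r), int(r) + 1)
    dx, dy, dt = np.meshgrid(g, g, g, indexing="ij")
    keep = dx**2 + dy**2 + dt**2 <= r**2
    return np.stack([dx[keep], dy[keep], dt[keep]], axis=1)

def local_flux(lap, P, offsets):
    """Sum of the Laplacian over the lattice points of the ball centred at P."""
    pts = np.asarray(P) + offsets
    inside = np.all((pts >= 0) & (pts < np.array(lap.shape)), axis=1)
    pts = pts[inside]  # drop lattice points that fall outside the volume
    return lap[pts[:, 0], pts[:, 1], pts[:, 2]].sum()

def lff(V, P, cuboid=(7, 7, 9), r=4):
    """Local fluxes of all lattice points in a cuboid centred at P.
    Raster order is assumed; the paper's own ordering is not specified here."""
    lap, offsets = laplacian_3d(V), ball_offsets(r)
    hx, hy, ht = (s // 2 for s in cuboid)
    return np.array([local_flux(lap, (P[0] + a, P[1] + b, P[2] + c), offsets)
                     for a in range(-hx, hx + 1)
                     for b in range(-hy, hy + 1)
                     for c in range(-ht, ht + 1)])

def combine(x_rgb, x_depth):
    """Concatenate the normalised LFFs of the two channels (L2 norm assumed)."""
    unit = lambda x: x / max(np.linalg.norm(x), 1e-12)
    return np.concatenate([unit(x_rgb), unit(x_depth)])
```

For a 7 × 7 × 9 cuboid this sketch yields 441 fluxes per channel, so the concatenated feature would have D = 882 dimensions.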

STRUCTURE PRESERVING PROJECTION
In this section, we introduce our Structure Preserving Projection (SPP) algorithm. SPP simultaneously preserves the local structure and the integrated shape of local features. In addition, SPP also considers a higher level relationship among local features, i.e., the bipartite graph consisting of samples and classes. SPP aims to seek a specific matrix $Q \in \mathbb{R}^{D \times d}$ ($d < D$) to construct a binary function

$$H(X) = \mathrm{sgn}(Q^T X)$$

such that the discriminative ability of the resulting binary codes for action recognition is improved. For computational convenience, we choose $\{-1, +1\}$ rather than $\{0, 1\}$ to represent binary codes in our algorithm.

Pairwise Structure Preserving
We denote the set composed of all local features by $F = \{X_1, \ldots, X_N\}$, where $N$ is the number of local features in training data. As mentioned above, we aim to seek the binary representations with discriminative ability in the lower-dimensional space. We are concerned about the relationship between every two local features in the high-dimensional space, which should also be retained in the lower-dimensional space.

1. In fact, we only need the gray-scale information in our algorithm.

Pairwise Label
First, we assign a weak label to each pair of local features. With the pairwise labels, acquiring the class information of each local feature is unnecessary. Besides, similar local features with small Euclidean distances may appear in samples from many different classes. Motivated by the binary property of $H(X)$, we employ the pairwise label $\{-1, +1\}$ to represent the relationship between two local features based on the pairwise distance between them. Thus we have the pairwise label

$$\ell_{ij} = \begin{cases} +1, & X_i \in N_k(X_j) \text{ or } X_j \in N_k(X_i), \\ -1, & \text{otherwise}, \end{cases}$$

where $N_k(X)$ is the set of $k$ nearest neighbors of $X$. To maintain the local structure, we make the product of each component in $H(X_i)$ and $H(X_j)$ consistent with their pairwise label $\ell_{ij}$, i.e., $H(X_i)_m H(X_j)_m = \ell_{ij}$ for every component $m = 1, \ldots, d$. We denote $\mathcal{P} = \{(i, j) \mid X_i, X_j \in F\}$. Therefore, we need to minimize the total deviation of these component-wise products from the pairwise labels over all pairs in $\mathcal{P}$. Then equivalently, we only need to maximize

$$\sum_{(i,j) \in \mathcal{P}} \ell_{ij} \left\langle \mathrm{sgn}(Q^T X_i),\, \mathrm{sgn}(Q^T X_j) \right\rangle.$$

The above function reaches its maximum value when $\ell_{ij}\, \mathrm{sgn}(Q^T X_i)$ and $\mathrm{sgn}(Q^T X_j)$ are similarly sorted, due to the rearrangement inequality [64]. In other words, if $\ell_{ij} = 1$, $X_i$ and $X_j$ are similarly encoded, and vice versa.
Considering the effect of noise, we additionally assign a pairwise weight $W^P_{ij}$ to the local feature pair $(i, j)$ to reduce the disturbance. Then the objective function for pairwise labels becomes

$$\sum_{(i,j) \in \mathcal{P}} W^P_{ij}\, \ell_{ij} \left\langle \mathrm{sgn}(Q^T X_i),\, \mathrm{sgn}(Q^T X_j) \right\rangle. \tag{9}$$
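The weak pairwise labels can be computed directly from a k-nearest-neighbor search. The following is a minimal sketch under stated assumptions: the function name `pairwise_labels` is hypothetical, the search is a brute-force O(N²) scan that would not scale to millions of features, and the diagonal pairs (i, i) are labeled +1, which is a choice made for the example rather than something the text specifies.

```python
import numpy as np

def pairwise_labels(X, k):
    """Weak labels for all pairs of local features.

    X : (N, D) array, one local feature per row.
    Returns an (N, N) integer matrix with +1 where one feature lies in the
    k nearest neighbours of the other, and -1 otherwise.
    """
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared Euclidean distances
    np.fill_diagonal(d2, np.inf)                      # a feature is not its own neighbour
    nn = np.argsort(d2, axis=1)[:, :k]                # indices of the k nearest neighbours
    is_nn = np.zeros(d2.shape, dtype=bool)
    is_nn[np.arange(X.shape[0])[:, None], nn] = True
    labels = np.where(is_nn | is_nn.T, 1, -1)         # symmetric "or" rule
    np.fill_diagonal(labels, 1)                       # assumption: (i, i) is a positive pair
    return labels
```

Because the rule is symmetric, a pair is positive even when only one of the two features counts the other among its neighbors, which is why the matrix is built from `is_nn | is_nn.T`.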

Pairwise Angle
In addition to the distance factor, we are also concerned about the shape of the entire set of local features, which is regarded as a constraint for preserving the pairwise Euclidean distances. The shape constraint stabilizes the data structure in the projected space and avoids certain errors caused by the pairwise labels. We denote the angle between two local features $X_i$ and $X_j$ by $\theta_{ij}$. Note that the angle $\theta_{ij}$ has its vertex at the coordinate origin. Thus, the local features should be centralized before the further learning process.
An orthogonal transformation ($d = D$) preserves the lengths of local features and the angles between them, since we have $\langle Q^T X_i, Q^T X_j \rangle = X_i^T Q Q^T X_j = X_i^T X_j = \langle X_i, X_j \rangle$, $\forall i, j$. When $d < D$, however, this property does not hold for an orthogonal projection. We hope the angle $\hat{\theta}_{ij}$ in the projected space$^3$ is (approximately) equal to $\theta_{ij}$.
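A quick numerical check illustrates the distinction drawn here. This sketch uses random data and a random orthogonal matrix obtained from a QR factorization; the variable names and the sizes D = 6, d = 3 are arbitrary choices for the example. The full square matrix reproduces the Gram matrix up to round-off, whereas keeping only d of its columns generally changes the pairwise cosines.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3
X = rng.standard_normal((5, D))
X -= X.mean(axis=0)                      # centralise, so angles have their vertex at the origin

# Random orthogonal D x D matrix; its first d columns form an orthonormal projection.
Q_full, _ = np.linalg.qr(rng.standard_normal((D, D)))
Q_low = Q_full[:, :d]

def cosines(Z):
    """Matrix of cosines of the pairwise angles between the rows of Z."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

# Rows are features, so Q^T x for every feature is X @ Q.
gram_err_full = np.abs(X @ X.T - (X @ Q_full) @ (X @ Q_full).T).max()
angle_err_low = np.abs(cosines(X) - cosines(X @ Q_low)).max()

print("d = D, max Gram-matrix error:", gram_err_full)   # round-off level
print("d < D, max cosine error:    ", angle_err_low)    # clearly non-zero in general
```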
Note that the distances are independent of the angles, i.e., a pair of local features with a long distance can have a small angle and a pair with a short distance may have a large angle. Thus it is desirable to retain the angles of all pairs. We define our optimization problem for angle preserving in the low-dimensional space:

$$\arg\max_{Q} \sum_{(i,j) \in \mathcal{P}} \langle X_i, X_j \rangle \left\langle Q^T X_i,\, Q^T X_j \right\rangle. \tag{10}$$

Although it is an optimization for preserving the inner product, the following proposition shows that the optimal $Q^*$ preserves the pairwise angles.
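Proposition 1. Suppose $Q^*$ is the optimal solution of the optimization problem (10). Then for any $1 \le i, j \le N$, the projection $Q^*$ preserves the angle between the local features $X_i$ and $X_j$.
Proof. According to the Cauchy-Schwarz inequality, we have

$$\sum_{(i,j) \in \mathcal{P}} \langle X_i, X_j \rangle \left\langle Q^T X_i, Q^T X_j \right\rangle \le \Big( \sum_{(i,j) \in \mathcal{P}} \langle X_i, X_j \rangle^2 \Big)^{1/2} \Big( \sum_{(i,j) \in \mathcal{P}} \left\langle Q^T X_i, Q^T X_j \right\rangle^2 \Big)^{1/2},$$

and the equality holds if and only if the vectors $\big(\langle X_i, X_j \rangle\big)_{(i,j) \in \mathcal{P}}$ and $\big(\langle Q^T X_i, Q^T X_j \rangle\big)_{(i,j) \in \mathcal{P}}$ are collinear. We can first set a norm constraint $\sum_{(i,j) \in \mathcal{P}} \langle Q^T X_i, Q^T X_j \rangle^2 = 1$ for $Q$. Then the objective function in Eq. (10) is bounded above by a constant. If $Q^*$ is the optimal solution of the optimization problem (10), the left-hand side of the above inequality reaches its maximum value at $Q^*$. Then there exists a constant $\lambda \in \mathbb{R}$ such that

$$\frac{\left\langle (Q^*)^T X_i,\, (Q^*)^T X_j \right\rangle}{\langle X_i, X_j \rangle} = \lambda, \quad \forall (i, j) \in \mathcal{P}.$$

Therefore, the projected angle $\hat{\theta}_{ij}$ satisfies

$$\cos \hat{\theta}_{ij} = \frac{\left\langle (Q^*)^T X_i,\, (Q^*)^T X_j \right\rangle}{\|(Q^*)^T X_i\|\, \|(Q^*)^T X_j\|} = \frac{\lambda \langle X_i, X_j \rangle}{\sqrt{\lambda}\, \|X_i\| \cdot \sqrt{\lambda}\, \|X_j\|} = \cos \theta_{ij},$$

which implies that the projection matrix $Q^*$ is an angle-preserving projection. □

3. Since Hamming space is a discrete space, we first consider the angles in the linear subspace before taking the sign function.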

Bigraph Regularization
Not only the pairwise structure of local features, but also the connection between samples and classes, which is regarded as a higher level relationship among local features, is considered in our algorithm. We use the image-to-class (I2C) distance to measure the bipartite graph (a.k.a. bigraph) that consists of video samples and classes. Although the I2C distance was first introduced to measure the distances between images and classes, it can also be applied to all kinds of samples that are represented by local features. Our goal is to preserve the I2C distances in the lower-dimensional space. Given the set of local features of a sample $\mathcal{X}_i = \{X_{i1}, \ldots, X_{im_i}\}$, which contains all local features of sample $i$, the distance between sample $i$ and class $c$ is defined as

$$I^c_{\mathcal{X}_i} = \sum_{k=1}^{m_i} \left\| X_{ik} - NN_c(X_{ik}) \right\|^2,$$

where $NN_c(X)$ is the nearest neighbor (NN) of the local feature $X$ in class $c$ and $\|\cdot\|$ is the $L_2$-norm. However, the complexity of NN-search depends linearly on the number of local features, so the nearest neighbor search in such a large-scale space of local features of each class is still time-consuming. Hence, we first run a K-means clustering algorithm for each class. In other words, we first find $K$ centroids for each set $\bigcup_{C(\mathcal{X}_i) = c} \mathcal{X}_i$, $c = 1, \ldots, C$, where $C$ is the number of classes and $C(\cdot) \in \{1, \ldots, C\}$ is the label information function that represents the class label of the input. In this way, the searching range of nearest neighbors is reduced to the set of cluster centers, which has a much smaller size than the original space, i.e., for $c = 1, \ldots, C$, we set $NN_c(X) \in \text{Centroids } \{S_1, \ldots, S_K\}$ of $\bigcup_{C(\mathcal{X}_i) = c} \mathcal{X}_i$.
Having obtained I2C distances, we build a bigraph $G = (V_1, V_2, E)$, where $V_1$ and $V_2$ are the node sets of samples and classes, respectively. $G$ is a complete and weighted bigraph. Each edge in $E$ connecting sample $i$ and class $c$ has the weight $W^{I2C}_{ic}$ determined by the I2C distance, named the I2C similarity. Using the heat kernel, we define the I2C similarity $W^{I2C}_{ic}$ as a Gaussian function of the I2C distance $I^c_{\mathcal{X}_i}$, $i = 1, \ldots, n$, $c = 1, \ldots, C$, where $\sigma$ is the Gaussian smoothing parameter and $n$ is the number of training samples. Correspondingly, we have the I2C distance in the objective Hamming space:

$$\hat{I}^c_{\mathcal{X}_i} = \sum_{k=1}^{m_i} \left\| H(X_{ik}) - H(NN_c(X_{ik})) \right\|^2. \tag{13}$$

With the above defined I2C similarity $W^{I2C}_{ic}$ and the projected I2C distance $\hat{I}^c_{\mathcal{X}_i}$, we can define the following optimization problem to quantify the bigraph regularization, i.e., the I2C structure in the low-dimensional space:

$$\arg\min_{Q} \sum_{i=1}^{n} \sum_{c=1}^{C} W^{I2C}_{ic}\, \hat{I}^c_{\mathcal{X}_i}.$$

By minimizing the above objective, a sample which has a small I2C distance to class $c$ in the high-dimensional space is still close to class $c$ in the low-dimensional space. According to the rearrangement inequality [64], the above objective function reaches its minimum value if and only if $\{\hat{I}^c_{\mathcal{X}_i}\}$ and $\{I^c_{\mathcal{X}_i}\}$ are similarly sorted, which means the projected I2C distances preserve the bigraph structure in the high-dimensional space.
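As an illustration of the sample-to-class weights, the sketch below computes I2C distances against per-class centroids and turns them into heat-kernel similarities. Several points are assumptions of this example rather than statements about the authors' method: the function names are hypothetical, the centroids are taken as given (for instance from any K-means implementation), squared L2 distances are summed, and the kernel is written as exp(-I / sigma^2) without any further normalization.

```python
import numpy as np

def i2c_distance(sample_feats, centroids):
    """I2C distance of one sample to one class.

    sample_feats : (m_i, D) local features of the sample.
    centroids    : (K, D) cluster centres of the class (assumed precomputed).
    Sums, over the local features, the squared distance to the nearest centroid.
    """
    diff = sample_feats[:, None, :] - centroids[None, :, :]
    d2 = np.sum(diff**2, axis=2)          # (m_i, K) squared distances
    return d2.min(axis=1).sum()

def i2c_similarity(I, sigma):
    """Heat-kernel weight of an I2C distance (assumed form, no normalisation)."""
    return np.exp(-np.asarray(I, dtype=float) / sigma**2)

def i2c_weights(samples, class_centroids, sigma):
    """(n, C) matrix of I2C similarities for n samples and C classes."""
    I = np.array([[i2c_distance(s, cen) for cen in class_centroids]
                  for s in samples])
    return i2c_similarity(I, sigma)
```

Whatever the exact kernel, the weight decreases as the I2C distance grows, which is the property the rearrangement argument above relies on.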

Objective Function and Optimization
In addition, to make the projected space more compact, we set the orthogonality constraint on the projection matrix, i.e., $Q^T Q = I$. Combining the objective functions for the pairwise structure and the bigraph regularizer, we obtain our final optimization problem (15) for SPP, where $\beta$ is the regularization parameter.
Optimization. Considering the discreteness of the binary function, we first use the approximation $\mathrm{sgn}(x) \approx x$ to relax the objective function in the optimization problem (15) into a real-valued space. Then the objective function of the pairwise label part (see Eq. (9)) becomes

$$\sum_{(i,j) \in \mathcal{P}} W^P_{ij}\, \ell_{ij} \left\langle Q^T X_i,\, Q^T X_j \right\rangle.$$

For I2C distances, we denote $NN_c(X) = X^c$. Note that after applying the projection matrix $Q$, the nearest neighbors may change. However, for the large-scale local feature space, we approximately adopt the sum of the distances from $Q^T X$ to the projected nearest neighbor $Q^T X^c$. Then the projected I2C distance (see Eq. (13)) after applying matrix $Q$ becomes

$$\hat{I}^c_{\mathcal{X}_i} = \sum_{k=1}^{m_i} \left\| Q^T \Delta X^c_{ik} \right\|^2,$$

where $\Delta X^c_{ik} = X_{ik} - X^c_{ik}$, $k = 1, \ldots, m_i$. Thus, by simple algebraic derivation, the optimization problem (15) is reduced to

$$\arg\max_{Q^T Q = I} \mathrm{tr}\left(Q^T M Q\right), \tag{16}$$

where $M \in \mathbb{R}^{D \times D}$ collects the pairwise-label, pairwise-angle and bigraph terms. Notice that $M^T = M$; thus $M$ is a real-valued symmetric matrix. It is clear that the solution to the optimization problem (16) is given by the eigenvectors corresponding to the largest $d$ eigenvalues of $M$. We summarize our algorithm in Algorithm 1.
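Once the symmetric matrix M has been assembled, the final step is a standard eigendecomposition followed by a sign function. The sketch below shows only that step; it takes M as given and does not attempt to build it. The function names are hypothetical, and mapping a zero projection to +1 is an arbitrary convention chosen for the example.

```python
import numpy as np

def solve_projection(M, d):
    """Columns of the result are the eigenvectors of M with the d largest eigenvalues."""
    M = 0.5 * (M + M.T)                  # symmetrise against round-off
    vals, vecs = np.linalg.eigh(M)       # eigenvalues returned in ascending order
    return vecs[:, ::-1][:, :d]          # reorder to descending, keep the top d

def binary_codes(X, Q):
    """Binary codes sgn(Q^T x) in {-1, +1} for every row x of X.
    Zero projections are mapped to +1 (assumption of this sketch)."""
    return np.where(X @ Q >= 0, 1, -1).astype(np.int8)
```

Because `numpy.linalg.eigh` returns orthonormal eigenvectors for a symmetric input, the constraint Q^T Q = I holds by construction.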

Algorithm 1. Structure Preserving Projection for Local Flux Feature
Input: Training video sequences $V_1, \ldots, V_n$ in gray-scale and $V'_1, \ldots, V'_n$ in depth, the radius $r$ for the sphere $B_P(r)$, the parameter $k$ for pairwise structure preserving, the number of centroids $K$ in K-means, the label information function $C(\cdot) \in \{1, \ldots, C\}$, the regularization parameter $\beta$ and the objective dimension $d$.
Output: The projection matrix $Q$.
1: Detect interest points (or densely sample) $\{P_1, \ldots, P_{m_i}\}$ from the $i$-th training video $V_i$, $i = 1, \ldots, n$;
2: Compute two LFFs for each point in gray-scale and depth respectively by Eq. (3) and combine them by Eq. (4) to obtain the local feature set of the $i$-th training video $\mathcal{X}_i = \{X_{i1}, \ldots, X_{im_i}\}$ and the whole local feature set $F = \bigcup \mathcal{X}_i = \{X_1, \ldots, X_N\}$;
3: Centralize the local features: $X_i \leftarrow X_i - \frac{1}{N} \sum_{j=1}^{N} X_j$, $\forall i$;
4: Construct the local feature pairing set $\mathcal{P} = \{(i, j) \mid X_i, X_j \in F\}$ and their corresponding pairwise labels

Complexity Analysis
In this section, we provide a time complexity analysis of our algorithm. During the training phase, our algorithm mainly consists of three parts. The first part is the computation of LFFs. The derivative computation is actually a convolution of matrices, which needs at most $O(3 D L_m \log L_m)$ time [65], where $L_m = \max\{L, W, H\}$. The second part is the computation of pairwise structure preserving. The k-NN algorithm in the construction of pairwise labels and the computation of pairwise angles cost $O(kN^2)$ and $O(N^2)$ time, respectively. The last part is the construction of the I2C similarity matrix $\left(W^{I2C}_{ic}\right)$. The time complexity of this part is $O(nCKDN)$. In total, the time complexity of the training phase is at most $O(3 D L_m \log L_m + kN^2 + N^2 + nCKDN)$.
In the test phase, binary codes can significantly reduce the runtime of the recognition algorithm, since the distance computation in Hamming space is simply based on the XOR operation. Denote by $t_m$ and $t_{XOR}$ the time of one multiplication and one XOR operation, respectively. Then the computational complexity of NBNN in the original space is $O(N_{train} N_{test} D)\, t_m$, where $N_{train}$ and $N_{test}$ are the numbers of local features in the training and test sets, respectively. With the binary local features, the time complexity is reduced to $O(N_{train} N_{test} d)\, t_{XOR}$. In general, we have $d \ll D$ and $t_{XOR} \ll t_m$. Therefore, when $N_{train}$ and $N_{test}$ are on the order of millions or even greater, the benefit of the hashing algorithm is substantial. We list the run-time in the following section.
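To make the XOR-based distance concrete, the following sketch packs {-1, +1} codes into bytes and computes all pairwise Hamming distances with a bitwise XOR followed by a bit count. The helper names are hypothetical, and the dense (Na, Nb, bytes) intermediate array is only practical for small batches; a production system would process the data in chunks or use a dedicated popcount routine.

```python
import numpy as np

def pack_codes(codes):
    """Pack (N, d) codes in {-1, +1} into (N, ceil(d / 8)) bytes (+1 -> bit 1)."""
    bits = (np.asarray(codes) > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)     # zero-pads the last byte identically for all rows

def hamming_distances(A, B):
    """All pairwise Hamming distances between two sets of packed codes."""
    xor = np.bitwise_xor(A[:, None, :], B[None, :, :])   # differing bits are set to 1
    return np.unpackbits(xor, axis=2).sum(axis=2)         # count the set bits

train = pack_codes([[1, 1, -1, 1], [1, -1, 1, 1]])
test = pack_codes([[1, -1, 1, 1]])
print(hamming_distances(test, train))    # distances of the test code to each training code
```

For codes in {-1, +1} of length d, the Hamming distance equals (d - <a, b>) / 2, so ranking by XOR distance agrees with ranking by inner product of the codes.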

EXPERIMENTS AND RESULTS
In this section, we systematically evaluate our proposed method on three different RGB-D benchmarks: the SKIG hand gesture dataset [66], the MSRDailyActivity3D dataset [2] and the CAD-60 activity dataset [67]. Fig. 3 shows some example frames of these three datasets. Details of the datasets are introduced in the following section.

Datasets and Settings
The SKIG dataset has 2,160 hand gesture sequences (1,080 RGB sequences and 1,080 depth sequences) collected from six subjects.All these sequences are synchronously captured with a Kinect sensor (including a RGB camera and a depth camera).This dataset collects 10 categories of hand gestures in total: circle (clockwise), triangle (anti-clockwise), up-down, right-left, wave, "Z", cross, comehere, turnaround and pat.In the collection process, all these ten categories are performed with three hand postures: fist, index and flat.To increase the diversity, the sequences are recorded under 3 different backgrounds (i.e., wooden board, white plain paper and paper with characters) and 2 illumination conditions (i.e., strong light and poor light).Consequently, for each subject, there are 10ðcategoriesÞ Â 3ðposesÞ Â 3ðbackgroundsÞ Â 2ðilluminationÞ Â 2ðRGB and depthÞ ¼ 360 gesture sequences.The training size for each category is varied as one of f10; 20; 35; 45; 60; 70g and the rest of the sequences are used for testing.
The MSRDailyActivity3D dataset is a human activity dataset captured with the RGB channel and the depth channel using the Kinect sensor. The total sequence number is 640 (i.e., 320 sequences for each channel) with 16 activities: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lie down on sofa, walk, play guitar, stand up, sit down. There are 10 subjects in the dataset and each subject performs each activity twice, once in standing position and once in sitting position. The training size for each subject is chosen as one of $\{5, 10, 15, 20, 25\}$ and the rest is used for testing.
The Cornell Activity dataset (CAD-60) contains 60 RGB-depth sequences acted by four subjects and captured with a Kinect camera. The actions in this dataset are categorized into five different environments: office, kitchen, bedroom, bathroom, and living room. Three or four common activities were identified for each environment, giving a total of twelve unique actions: rinsing mouth, brushing teeth, wearing contact lens, drinking water, opening pill container, cooking (chopping), cooking (stirring), talking on couch, relaxing on couch, talking on the phone, writing on whiteboard, working on computer. The training size for each action is assigned as one of $\{1, 2, 3, 4\}$ and the remaining sequences are adopted for testing.
All the training samples are selected randomly from every class in each dataset and all the procedures are repeated five times. We report the averages as the final results.
For the experimental settings, we fix the size of the cuboid $l \times w \times h$ in the computation of LFF as $7 \times 7 \times 9$. We set $r = 4, 4, 5$ for the three datasets, respectively, based on the comparison of different radii $r$ in Table 1. If the radius $r$ is too small, the LFF degenerates to the second-order derivative; if $r$ is too large, the LFFs of adjacent pixels are almost the same, which makes them less discriminative. We always set $k = 15$ for the pairwise data structure. We utilize the training data as the cross-validation set in SPP. The parameter $K$ of K-means is selected from $\{100, 200, \ldots, 1000\}$ with a step of 100 as the value that yields the best performance under 10-fold cross-validation. The optimal parameter $b$ is likewise selected from $\{0.1, 0.2, \ldots, 1.0\}$ with a step of 0.1 by 10-fold cross-validation on the cross-validation set. In particular, a nested cross-validation strategy is applied to these two parameters, i.e., $K$ and $b$. We first fix $K$ to one value in $\{100, 200, \ldots, 1000\}$ and select the best $b$ from $\{0.1, 0.2, \ldots, 1.0\}$, then assign another value to $K$ and select the best $b$ from $\{0.1, 0.2, \ldots, 1.0\}$ again. In this way, the optimal pair of parameters $K$ and $b$ is obtained under the nested cross-validation strategy.

Table note: The training sizes are 70, 25 and 4 in each class for SKIG, MSRDailyActivity3D and CAD-60, respectively. All the code lengths are 96-bit.
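The selection of $K$ and $b$ described above can be sketched as a two-level search. This is only an illustration under stated assumptions: `cv_score` is a hypothetical callable that would return the mean 10-fold cross-validation accuracy for a given pair $(K, b)$, and `toy_score` is a made-up stand-in so that the sketch runs on its own.

```python
from typing import Callable, Tuple

def select_K_and_b(cv_score: Callable[[int, float], float]) -> Tuple[int, float, float]:
    """Return (K, b, score) with the highest cross-validation score on the grid."""
    K_grid = range(100, 1001, 100)                      # {100, 200, ..., 1000}
    b_grid = [round(0.1 * i, 1) for i in range(1, 11)]  # {0.1, 0.2, ..., 1.0}
    best = (K_grid[0], b_grid[0], float("-inf"))
    for K in K_grid:      # outer level: fix K
        for b in b_grid:  # inner level: search b for this K
            score = cv_score(K, b)
            if score > best[2]:
                best = (K, b, score)
    return best

def toy_score(K: int, b: float) -> float:
    """Hypothetical stand-in for the 10-fold accuracy; peaks at K = 400, b = 0.3."""
    return -((K - 400) / 1000) ** 2 - (b - 0.3) ** 2

best_K, best_b, best_score = select_K_and_b(toy_score)
```

In practice `cv_score` would train SPP with the given parameters on nine folds and evaluate on the held-out fold, so the full search costs 100 cross-validation runs.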
Since using the Hamming distance instead of the $L_2$-norm in the NN-search accelerates NBNN considerably, and the NBNN classifier always outperforms the BoW model, we mainly use NBNN to evaluate our recognition precision.
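For readers unfamiliar with NBNN on binary codes, the decision rule can be sketched as follows: for every local code of a test video, take the distance to its nearest neighbor in each class, sum these distances per class, and choose the class with the smallest sum. The sketch below is a toy illustration, not the evaluation code of this paper; the function names, the two class labels and the synthetic prototypes are assumptions.

```python
import numpy as np

def hamming(code: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """Hamming distance between one 0/1 code and every row of `codes`."""
    return np.count_nonzero(codes != code, axis=1)

def nbnn_predict(test_codes: np.ndarray, class_codes: dict) -> str:
    """NBNN rule: the class minimizing the summed nearest-neighbor distances."""
    totals = {}
    for label, codes in class_codes.items():
        totals[label] = sum(int(hamming(q, codes).min()) for q in test_codes)
    return min(totals, key=totals.get)

def noisy_copies(prototype, n, flips, rng):
    """Toy data: n copies of a prototype code, each with `flips` random bit flips."""
    out = np.tile(prototype, (n, 1))
    for row in out:
        row[rng.choice(prototype.size, size=flips, replace=False)] ^= 1
    return out

rng = np.random.default_rng(0)
d = 96
# Synthetic, well-separated prototypes (assumption, for illustration only).
prototypes = {"wave": np.zeros(d, dtype=np.uint8), "pat": np.ones(d, dtype=np.uint8)}
train = {label: noisy_copies(p, 50, 5, rng) for label, p in prototypes.items()}
test_video = noisy_copies(prototypes["wave"], 20, 5, rng)

prediction = nbnn_predict(test_video, train)
```

For 0/1 vectors the Hamming distance equals the squared Euclidean distance, so this rule is the binary counterpart of NBNN with the $L_2$-norm.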

Compared Results
First of all, we illustrate the effectiveness of all three terms used in SPP, i.e., the pairwise label preserving term, the pairwise angle preserving term and the bigraph regularization. We remove one term at a time, keep the other two, and optimize the problem in (15). The results are listed in Table 2, from which we can observe that the bigraph regularization contributes the most to the accuracies.
Next, for all three datasets, we apply three different schemes to achieve RGB-D video classification: 1) Detected interest points (see footnote 4) + LFF + SPP; 2) Dense sampling (see footnote 5) + LFF + SPP; 3) Detected interest points + LFF + SPP + Bag-of-Words. For schemes 1) and 2), we adopt NBNN as the classifier, while a linear SVM is applied for classification in the third scheme. The codebook length of BoW for each dataset is chosen as one of $\{500, 1000, 1500, 2000\}$ and the best results are reported.
For each scheme, we apply SPP on LFFs from RGB and depth information. According to all the possible combinations, we evaluate four different kinds of local binary codes on the three datasets: LFF(RGB-D)+SPP denotes our full algorithm; LFF(RGB)+SPP only uses RGB information to compute LFFs and then applies SPP; LFF(D)+SPP only uses depth information to compute LFFs and then applies SPP; LFF+SPP(RGB-D) concatenates LFF(RGB)+SPP and LFF(D)+SPP.
From Figs. 4, 5, and 6, we can observe that the performance of our full algorithm is consistently higher than that of the other versions on the three datasets. Dense sampling generally outperforms interest points detection due to the larger number of local feature descriptors. Another observation is that LFF(RGB-D)+SPP always outperforms LFF+SPP(RGB-D), since the former outputs a fused binary representation that takes the structures of the RGB-D features into account. In contrast, LFF+SPP(RGB-D) outputs binary codes separately for RGB and depth features and therefore loses the connection between them.
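The structural difference between a fused code and a concatenated code can be illustrated with a simple stand-in. The sketch below does not implement SPP: it binarizes by the sign of a random linear projection, in the spirit of LSH, purely to show where the concatenation happens. The 882-dimensional inputs follow the LFF dimension quoted in the caption of Fig. 8; the split of the 96 bits into 48 + 48 for the concatenated code, the random data and all names are assumptions.

```python
import numpy as np

def sign_hash(X: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Binarize by the sign of a linear projection (stand-in for a learned projection)."""
    return (X @ P > 0).astype(np.uint8)

rng = np.random.default_rng(0)
n, D, bits = 8, 882, 96
lff_rgb = rng.standard_normal((n, D))    # placeholder RGB descriptors
lff_depth = rng.standard_normal((n, D))  # placeholder depth descriptors

# Fused code: concatenate the descriptors first, then apply ONE projection,
# so every bit can depend on both the RGB and the depth channel.
P_fused = rng.standard_normal((2 * D, bits))
codes_fusion = sign_hash(np.hstack([lff_rgb, lff_depth]), P_fused)

# Concatenated code: hash each channel on its own, then join the codes,
# so every bit depends on a single channel only.
P_rgb = rng.standard_normal((D, bits // 2))
P_depth = rng.standard_normal((D, bits // 2))
codes_cat = np.hstack([sign_hash(lff_rgb, P_rgb), sign_hash(lff_depth, P_depth)])
```

Both variants yield codes of the same length here, but only the fused variant lets a single bit encode a relationship between the two channels.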
In Fig. 7, we also compare the performance of our algorithm with different code lengths by using different point selection methods, i.e., interest points detection (Dollar's detector and STIP) and dense sampling, on the three datasets. It is noticeable that, on the CAD-60 dataset, the accuracy of dense sampling is slightly lower than that of interest points detection because the noise of the background has a negative effect on dense sampling when the code length increases. In this situation, the detection method is more effective than dense sampling.
Finally, Fig. 8 shows the average runtime comparison. Our learned binary codes show a significant advantage over the original real-valued LFF, since NBNN largely depends on NN-search. All the experiments are conducted using Matlab 2013a on a server configured with a 12-core processor and 128 GB of RAM running the Linux OS.

Comparison with Other Methods
In Table 3, we first compare the proposed LFF descriptor with state-of-the-art video descriptors (i.e., HOG, HOF, MBH, HON4D and HOG3D) for RGB-D action recognition. All the methods are computed on the interest points from the RGB channel detected by Dollar's detector and the corresponding points from the depth channel. As we can see, LFF outperforms HOG, HOF, MBH and HOG3D in the RGB and depth channels and in the RGB-D concatenation scheme. Although HON4D, as a descriptor specifically designed for depth sequences, achieves better performance in the depth channel, it can only be extracted from depth data and the recognition accuracies are relatively low. In contrast, our LFF is a general feature descriptor for both RGB and depth data, and LFF in the RGB-D concatenation scheme reaches the highest accuracy in the feature comparison experiment.
Since SPP is a projection for learning binary codes, we can also compare our SPP algorithm with other hashing methods. In our experiments, we compare the proposed method against nine hashing algorithms: KSH [58], BRE [60], MLH [61], LSH [56], SpH [57], AGH [68], PCAH [69], BSSC [53] and RBM [70]. All the above methods are computed on the same extracted LFFs for a unified standard. All the compared methods are then evaluated on five different code lengths (32, 48, 64, 80, 96), and their results at 96-bit, which appear to be the best, are reported. Under the same experimental setting, all the parameters used in the compared methods have been strictly chosen according to their original papers. We list the compared results in Table 3, where RGB channel and depth channel denote employing the methods only on RGB and only on depth data, respectively, RGB-D fusion is the procedure of our algorithm, and RGB-D cat is the concatenation of the features gained in the RGB channel and the depth channel. The results of the above mentioned other hashing methods in RGB-D fusion are not consistently higher than those in RGB-D

4. Dollar's interest points detector [22] is used in our experiments. We only detect the interest points on the RGB data and find the corresponding locations on the depth video as the detected points for depth data.
5. We set the distance between adjacent pixels as 5.

Statistical Significance Test
To show the statistical significance of the improvements, we conduct a t-test on the MAP improvements. In testing the null hypothesis that the population mean is equal to a specified value $\mu_0$, the statistic

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{m}}$$

is used, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation and $m$ is the sample size. The degree of freedom used in the test is then $m - 1$. We set $m = 10$ and code length $d = 96$ for this experiment. Table 4 lists the one-tail results of the t-test, which show that the improvements are statistically significant.
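The one-sample t statistic used in this test is the standard one and can be computed in a few lines. The sketch below uses only the Python standard library; the ten improvement values are made-up placeholders and are not results from this paper.

```python
import math
from statistics import mean, stdev

def one_sample_t(samples, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    m = len(samples)
    x_bar = mean(samples)
    s = stdev(samples)  # sample standard deviation (divides by m - 1)
    t = (x_bar - mu0) / (s / math.sqrt(m))
    return t, m - 1

# Hypothetical improvements over m = 10 runs (placeholder values).
improvements = [1.0, 2.0, 3.0, 2.0, 1.0, 3.0, 2.0, 2.0, 1.0, 3.0]
t_stat, dof = one_sample_t(improvements, mu0=0.0)
```

The resulting statistic is compared against the one-tailed critical value of the t distribution with `dof` degrees of freedom at the chosen significance level.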

Results on RGB Video Dataset
To further illustrate the effectiveness of LFF, in this experiment, we compare the RGB version of LFF with the state-of-the-art feature, i.e., dense trajectory features, on the UCF YouTube [72] and HMDB51 [73] datasets for action recognition.

The UCF YouTube dataset contains 1,168 video sequences collected from 11 action categories. Most of them are sports activities, which are drawn from existing YouTube videos; therefore, the dataset contains large variations and approximates a real-world database. For this dataset, we deliberately use the full-sized sequences without any bounding boxes as the input to evaluate our method's robustness against complex and noisy backgrounds. We use the Leave-One-Out setup, i.e., testing on each original sequence while training on all the other sequences. The HMDB51 dataset contains 6,849 realistic action sequences collected from a variety of movies and online videos. Specifically, it has 51 action classes and each has at least 101 positive samples. We adopt the official setting of [73] with three train/test splits. Each split has 70 training and 30 testing clips for each class. Table 5 illustrates that our proposed LFF ($r = 5$) achieves results competitive with the dense trajectory feature (DTF), which produces the state-of-the-art performance in recent publications [31], [74]. Note that for a fair comparison of feature descriptors, all the compared features are extracted around the same points, i.e., the points on the trajectories.

Notes to Table 3: In the RGB-D fusion scheme, we first concatenate features in RGB and depth, then apply hashing methods. In the RGB-D concatenation (Cat) scheme, we first apply hashing methods to features in RGB and depth separately, then concatenate them. The bold numbers represent the best performance for each dataset. * The action ensemble method adopted the depth and skeleton information with real-valued features. The skeleton information is only available in MSRDailyActivity3D and CAD-60. All the results (except action ensemble, LFF+IFV and HOG3D+IFV) are calculated by the NBNN classifier. The linear SVM is applied to LFF+IFV and HOG3D+IFV.

CONCLUSION
The basic goal of this paper is to obtain a fused local binary representation for RGB-D action recognition. To achieve this goal, we first introduced a continuous local descriptor called Local Flux Feature (LFF) based on the gradient field of video data, which is more suitable for discretization into binary codes than histogram-based local descriptors. After acquiring LFFs from the RGB and depth channels, we applied the Structure Preserving Projection (SPP) to learn discriminative local binary representations. SPP preserves the characteristics at two levels simultaneously, namely the pairwise structure of local features and the relationship between video samples and classes, without the collapse of the data structure. The systematic experiments have shown not only the high efficiency of the proposed local binary representations, but also their superior recognition accuracy over other local features and other hashing methods on three RGB-D datasets.

Fig. 2. Illustration of the computation of local fluxes in the gradient field. The output LFF is regarded as a foundation for learning binary codes.

Fig. 3. Example frames of the three RGB-D datasets we used in the experiments. From top to bottom: SKIG, MSRDailyActivity3D and CAD-60.

Fig. 4. Performance comparison with different training sizes in each category and different versions of LFFs on the SKIG dataset at 96-bit.

Fig. 6. Performance comparison with different training sizes in each action and different versions of LFFs on the CAD-60 dataset at 96-bit.

Fig. 8. Average runtime of one test sample of NBNN by using 96-bit binary codes after SPP and the original 882-dimensional LFF with different training sizes.

TABLE 1 Performance Comparison (%) of NBNN with the LFFs Computed on Detected Points with Different Radii

TABLE 2 Performance Comparison (%) of Different Variants of LFF+SPP to Prove the Effectiveness of the Improvement on RGB-D Fusion
All the code lengths are 96-bit. The bold numbers represent the best performance for each dataset. (SPP 1 is the original SPP without the bigraph regularization; SPP 2 denotes the original SPP without the pairwise label preserving term; SPP 3 represents the original SPP without the pairwise angle preserving term.)

TABLE 3 Performance Comparison (%) of Our Algorithm and Other Coding Methods on Three Datasets

TABLE 5 Recognition Accuracy (%) of LFF and Dense Trajectory Features on the UCF YouTube and HMDB51 Datasets
The LFF features are extracted along the same trajectories in the video sequences as the dense trajectory features.