Using a Selective Ensemble Support Vector Machine to Fuse Multimodal Features for Human Action Recognition

The traditional human action recognition (HAR) method is based on RGB video. Recently, with the introduction of Microsoft Kinect and other consumer class depth cameras, HAR based on RGB-D (RGB-Depth) has drawn increasing attention from scholars and industry. Compared with the traditional method, the HAR based on RGB-D has high accuracy and strong robustness. In this paper, using a selective ensemble support vector machine to fuse multimodal features for human action recognition is proposed. The algorithm combines the improved HOG feature-based RGB modal data, the depth motion map-based local binary pattern features (DMM-LBP), and the hybrid joint features (HJF)-based joints modal data. Concomitantly, a frame-based selective ensemble support vector machine classification model (SESVM) is proposed, which effectively integrates the selective ensemble strategy with the selection of SVM base classifiers, thus increasing the differences between the base classifiers. The experimental results have demonstrated that the proposed method is simple, fast, and efficient on public datasets in comparison with other action recognition algorithms.


Introduction
Video has become the primary carrier of information owing to the rapid popularization and development of video acquisition equipment and broadband networks. With the massive emergence of video data, automating the procurement and analysis of the content has emerged as a problem that needs an urgent solution.
The main purpose of vision-based HAR is to process and analyze the original image or image sequence data collected by the sensor (camera) via computer, to learn and understand human actions and behavior. HAR based on computer vision technology has been extensively used in several fields of human life, such as smart video surveillance [1,2], human-machine interaction [3], robotics [3], video analytics [4], and human activity recognition [5][6][7][8][9].
Most of the existing human action recognition algorithms are based on the traditional RGB video data. However, human action recognition based on RGB information encounters multiple challenges as follows: (1) Complex background, occlusion, shadow, scale change, and different lighting conditions induce tremendous difficulties for recognition. (2) The same action generates different views from different perspectives. (3) The same action performed by different people varies significantly, and two different types of action may have considerable similarity. These inherent defects of RGB visual information limit the performance of human action recognition based on RGB information.
Recently, RGB-D cameras, such as the Kinect v1 and v2 sensors by Microsoft, have made depth images available for human action recognition [5,10,11]. Each pixel in the depth image records the depth value of the scene, instead of light intensity. The introduction of the depth camera expands the ability of the computer system to perceive the 3D visual world and makes up for the dimensional information lost when 3D object information is captured as 2D visual information. Compared with RGB visual information, depth images can greatly reduce the influence of occlusion, complex background, and other factors by providing scene structure information. The color and texture are invariant under different illumination conditions. From a single perspective, if different behaviors have similar 2D projections, the depth images can provide additional body shape information to distinguish them. Furthermore, Kinect also provides a powerful skeleton tracking algorithm, which can output the position of each 3D human joint point in real time. The skeleton joints of the human body are not affected by changes of scale and perspective.
According to the different types of input data, HAR technology based on RGB-D video can be roughly divided into three categories, namely, HAR based on RGB data, depth image data, and skeleton joints data.

Human Action Recognition Based on RGB Image Data.
The early research on human action recognition based on RGB image sequences has been inspired by image processing technology, owing to the rich color and texture features of RGB image sequences. HAR is primarily carried out by extracting spatiotemporal interest points (STIP) in RGB video. Kovashka and Grauman [12] have proposed a human action recognition method based on hierarchical [13] model. This method combines HOG3D [14], HOG (histograms of oriented gradients), and HOF (histograms of optical flow) spatiotemporal domain descriptors and introduces a multicore learning model. Melfi et al. [15] have extended the Harris corner detection operator for video behavior recognition. First, the contour of the moving object is extracted, and then the 3D Harris points of interest are extracted from the moving object for HAR. In [16], the points of interest of the video frames are densely sampled in different scale spaces of the video frames to form dense trajectories. Thereafter, the features of the trajectories, namely, HOG, HOF, and MBH (motion boundary histogram), are extracted. Finally, SVM is used to classify the features.
Recently, owing to the development of machine learning theory, we can also use deep learning to extract features from RGB video data, besides utilizing the spatiotemporal interest points to extract the video image features.
Gammulle et al. [17] have obtained the video frame features through Convolutional Neural Networks (CNN) and then used the dual stream Long Short-Term Memory (LSTM) to train the features to realize HAR. Bilen et al. [18] have proposed to convert a video sequence into a dynamic image using rank pooling and further used a CNN model to extract the features from the dynamic image for HAR. Arif et al. [19] have proposed the concept of motion graph. First, the 3D CNN network is used to extract video features, and thereafter the features of video frames are integrated into the motion map. Subsequent to these steps, the LSTM method is used to improve the accuracy of HAR. Majd and Safabakhsh [20] have first obtained the CNN features of the video frames through the CNN deep learning network. Thereafter, the CNN features are sent to the kernel cross correlation (KCC) filter to realize the automatic estimation of motion information.
Compared with manually designed action features, video features extracted automatically through deep learning have increased the accuracy of action recognition. However, due to the unclear learning mechanism of deep learning, the stability of the extracted features is relatively poor, and a large number of parameter adjustment experiments need to be carried out manually. Therefore, the method based on deep learning has some limitations in practical application.

HAR Based on Depth Image.
HAR based on depth image data primarily uses RGB image feature extraction methods to extract the global and local features from the spatiotemporal volume. Compared with the RGB image, the depth image is not sensitive to illumination changes. Furthermore, it contains rich 3D structure information. However, the depth images also have some shortcomings. Owing to certain specific factors, such as specific materials, reflection, and interference, Kinect cannot estimate the depth of certain parts of the object in the scene. This results in the loss of part of the depth image obtained, forming several holes. Furthermore, the depth images obtained by Kinect lack the color features of objects and contain abundant noise. These factors make it difficult to obtain robust features from depth images. Inspired by the STIP feature extraction algorithm for RGB image sequences, Xia and Aggarwal [21] have obtained Depth Spatial Temporal Interest Points (DSTIP) of the depth image by two-dimensional Gaussian filtering and one-dimensional Gabor filtering. Based on these points of interest, the depth cuboid similarity features (DCSF) are extracted for HAR. Yang and Tian [22] have proposed a feature, namely, the super normal vector, to represent the depth image sequence. The feature combines the local motion information and shape information in the depth image sequence and achieves outstanding experimental results on MSRDailyActivity3D and other datasets. Reza et al. [23] have proposed a weighted depth motion map (DMM) and then extracted the HOG features from the weighted DMM for HAR.
Since the depth image lacks the description of the image color, texture, and other details, and the CNN neural network model is primarily intended to extract the color and texture features of the image, using CNN model to extract the features of depth image cannot achieve satisfactory results. Furthermore, the deep learning model needs a large amount of data for training. However, most of the depth image datasets have a small amount of data, which cannot be used for large-scale training using CNN and other neural networks. Hence the research output in this field is relatively small.

HAR Based on Skeleton Joints Data.
The recognition of human action based on skeleton joint features can be traced back to the moving light display (MLD) experiment by Johansson et al. Owing to the limitation of sensors, early skeleton joint features suffered from high noise in the joint points, which led to a low accuracy of HAR. Owing to the development of computer vision technology, particularly Kinect, robust joint points can now be obtained in real time. Yang and Tian [24] have proposed a bone feature representation method, which is obtained from the position difference of the skeleton nodes between different frames. First, three kinds of skeleton node position differences are extracted, which are the differences in static posture, motion, and offset. Thereafter, the three types of skeleton difference features are combined, and the EigenJoints features are obtained by PCA dimension reduction. Finally, the action recognition is carried out by the naive Bayes classifier. Xia et al. [25] have proposed the usage of the histograms of 3D joints feature to describe a skeleton action. The feature projects the data of 12 main joints of the human body into the spherical coordinate system, obtains their distribution histogram in that coordinate system, and then uses linear discriminant analysis to reduce the dimension of the obtained features. Finally, a hidden Markov model is used to classify and express the features.
Researchers also try to use deep learning to learn features from human skeleton data. The main idea of this approach is to represent the human skeleton data in a suitable image form and then extract features from the skeleton image using CNN and other models for human action recognition. However, the constraints of the current deep learning theory make it very difficult to construct an appropriate skeleton image. Zhang et al. [26] have proposed Multilayer LSTM Networks for the skeleton feature learning and employed a smooth fractional fusion method to fuse the bone features of the multistream LSTM learning, which has improved the accuracy of the human action recognition. Li et al. [27] have proposed 3D skeleton-based action recognition using a novel symbiotic graph neural network, which handles action recognition and motion prediction jointly and uses graph-based operations to capture action patterns.
Briefly, although the state-of-the-art RGB-D-based HAR methods have progressed tremendously, the reliability of their applications in realistic engineering scenarios is still modest. This is owing to the relatively large intraclass variations and small interclass differences of several actions, the variations in action speed, and the extreme computational complexities. This work fully utilizes the multimodal information acquired through a Kinect sensor to extract the features of human actions effectively. Moreover, an integrated multilearner strategy has been adopted for the classification to demonstrate exceptional generalizing capabilities.
The rest of this paper is organized as follows. Section 2 presents a novel selective ensemble-based support vector machine (SESVM) approach to fuse the multimodal features for HAR. Section 3 explains the extraction of multimodal features from RGB-D images by employing different methods. In Section 4, a selective ensemble-based SVM classification framework is deployed for feature recognition. The experimental results on the G3D dataset and Cornell Activity Dataset 60 are presented in Section 5, showing the feasibility and performance of the proposed approach. Finally, a brief conclusion and notes on further work are given in Section 6.

The Framework
The Kinect sensors produced by Microsoft can provide both RGB and depth information of the scene, in addition to the skeleton joint locations of human bodies. The depth images captured by the Kinect sensor can provide light-invariant foreground information with depth geometry structure, and they have the advantages of texture and color invariance and of insensitivity to the influences of illumination, environment, and shadows.
This paper utilizes multimodal data provided by the Kinect sensor and extracts three different features as the descriptors of the actions. Thus, an integrated multiclassifier algorithm is adopted for the classification to exploit the advantages of the different features. Figure 1 shows the system configuration of the proposed approach. It achieves efficient computation by handling simple features while ensuring the robustness and recognition capability of the features. Particularly, our framework consists of the following steps:
(1) Acquire synchronized RGB, depth, and joint images from the Kinect sensor.
(2) Convert the input RGB image to grayscale, and then extract the improved histogram of oriented gradient features.
(3) Compute the depth motion map-based local binary pattern (DMM-LBP) from the depth image, and then extract joint-based hybrid joint features (HJFs) from the acquired 3D skeleton image.
(4) Train the selective ensemble-based support vector machine (SESVM) using the sample sets with combined features.
(5) Apply the same extraction process to the images to be predicted during action recognition, enter the features into SESVM, and output the recognition result.
The major contributions of this paper are summarized as follows:
(1) A novel selective ensemble-based support vector machine (SESVM) method has been proposed to describe the human action features based on multimodal information. This method is capable of depicting human actions from various points of view and has been verified by experiments on public datasets.
(2) The improved RGB-based histogram of oriented gradient (RGB-HOG) feature is adopted in this paper, which is invariant to geometric and optical deformations of the images.

Computational Intelligence and Neuroscience
(3) The depth-based DMM-LBP features are created to maintain the dynamic characteristics of human actions with good local invariance.
(4) The joint-based hybrid joint feature (HJF) has been adopted to provide the spatial structure information about human actions.
(5) The correlation coefficient-based classifier selection algorithm (CCCSA) has been adopted to select classifiers from the existing ones for constructing the ensemble classifiers. This accelerates prediction, reduces the storage space requirements, and further improves the classification accuracy. By using fewer classifiers, the prediction speed can be accelerated because the computational overhead of prediction is reduced. In addition, due to the small number of individual classifiers in the selective ensemble learning system, the storage overhead is also reduced, because only a small number of individual models need to be saved.

Feature Extraction
This section introduces the feature extraction methods for various modalities. Particularly, Section 3.1 describes the improved HOG features for the RGB modality, Section 3.2 introduces the DMM-LBP features for the depth modality, and Section 3.3 explains the HJF features for the joint modality.

RGB-HOG Feature.
Dalal and Triggs first proposed the HOG feature to detect pedestrians in static images [28]. Thereafter, multiple researchers have presented improved HOG features [29]. The HOG algorithm is a feature extraction method used in recent target recognition research. However, the HOG feature extraction algorithm can only calculate the direction information of a single gradient per pixel, which is not comprehensive enough and has certain defects in describing the directional features of the target. We have used the steerable filter algorithm, which can obtain multidirectional information, to make up for this deficiency of the HOG algorithm. This method expands the single-directional information of a pixel to N-directional information.
Freeman and Adelson [30] first proposed the steerable filter, which convolves the image with templates generated in different directions to obtain the edges of the image. The convolution process increases the weight of the effective pixels and decreases the weight of invalid pixels by a weighting operation. The general form of the steerable filter is given as

$$G^{\alpha} = \sum_{i=1}^{N} k_i(\alpha)\, G_i, \quad (1)$$

where N is the number of base filters and $G_i$ is the ith fundamental filter. Further, $k_i(\alpha)$ represents the coefficients of the filter related to the direction angle $\alpha$, and $G^{\alpha}$ is the filter in the $\alpha$ direction. We have used the method of obtaining a multidirectional filter by the linear combination of a group of basic filters derived from the two-dimensional Gaussian function, as expressed in equation (2). In the specific expressions and their corresponding coefficients, $G_1^{0}(x, y)$ and $G_1^{2\pi/3}(x, y)$, respectively, represent the second derivative of image pixels in the corresponding direction, that is, the basis filter in the corresponding direction. The amplitude information in any direction can be calculated by the linear combination of the three expressions.
We have combined the steerable filter algorithm with the traditional HOG algorithm. First, the steerable filter algorithm is used to calculate the direction number and amplitude information with the highest direction value, and then the HOG algorithm is used to obtain the statistical direction histogram features. The algorithm flow, which shows the specific calculation, is depicted in Figure 2.
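The steering idea in equation (1) can be illustrated with a minimal sketch. It assumes the standard Freeman and Adelson construction for the second derivative of the Gaussian $e^{-(x^2+y^2)}$, with three basis directions $0$, $\pi/3$, and $2\pi/3$ and interpolation coefficients $k_j(\alpha) = \frac{1}{3}[1 + 2\cos(2(\alpha - \theta_j))]$; the exact basis and coefficients used in this paper are not recoverable from the text, so the function names and constants below are illustrative only.

```python
import math

# Assumed basis directions (0, 60, and 120 degrees), as in the classic
# second-derivative-of-Gaussian steerable filter.
BASIS_ANGLES = (0.0, math.pi / 3, 2 * math.pi / 3)


def g2(x, y, theta):
    """Second derivative of exp(-(x^2 + y^2)) along direction theta."""
    u = x * math.cos(theta) + y * math.sin(theta)
    return (4.0 * u * u - 2.0) * math.exp(-(x * x + y * y))


def steering_coeffs(alpha):
    """Interpolation coefficients k_j(alpha) for the three basis filters."""
    return [(1.0 + 2.0 * math.cos(2.0 * (alpha - t))) / 3.0 for t in BASIS_ANGLES]


def steered_g2(x, y, alpha):
    """Filter value in direction alpha as a linear combination of the basis."""
    coeffs = steering_coeffs(alpha)
    return sum(k * g2(x, y, t) for k, t in zip(coeffs, BASIS_ANGLES))
```

Under these assumptions, `steered_g2(x, y, alpha)` agrees with the directly rotated filter `g2(x, y, alpha)` at every point, which is the property that lets a small fixed bank of basis filters produce responses for arbitrary directions.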
The implementation sequence of the HOG feature extraction algorithm can be described as follows:
Step 1. Normalize the Gamma space and the color space. To reduce the influence of illumination, the image needs to be normalized first. The contribution of local surface exposure to the texture strength is relatively large. Therefore, this type of compression can effectively reduce the local variations in the shadow and illumination of the image. The image is first converted to grayscale as the color information contributes little. The Gamma compression formula is given as $I(x, y) = I(x, y)^{\gamma}$, where I(x, y) is the input image and $\gamma$ usually takes the value of 1/2.
Step 2. Let p(x, y) be the pixel of the gray image. Construct two mutually perpendicular steerable filters at pixel p (the directions of the filters are $\alpha$ and $\beta$, respectively, with $\alpha + \beta = \pi/2$), and record them as $F^{(\alpha)}$ and $F^{(\beta)}$, respectively. Then, the gradient values of point p in the $\alpha$ and $\beta$ directions are obtained from these two filters.
Step 3. Compute the gradient of the image. Compute the gradients along the horizontal and vertical axes, which determine the gradient orientation of each pixel. The computation of derivatives can capture the contours, human figures, and certain texture information from the image, besides further reducing the influence of illumination. The gradient amplitude and angle of a pixel (x, y) in the image are given as $G(x, y) = \sqrt{G_x(x, y)^2 + G_y(x, y)^2}$ and $\theta(x, y) = \arctan\left(G_y(x, y) / G_x(x, y)\right)$, where $G_x(x, y)$, $G_y(x, y)$, $G(x, y)$, and $\theta(x, y)$ are the horizontal gradient, the vertical gradient, the gradient amplitude, and the gradient angle at pixel (x, y), respectively.
Step 4. Construct a histogram of the oriented gradient for each cell. This provides coding for the local image area and is capable of maintaining the invariance to human postures and appearances in the image. We divide the image into a number of "unit cells," and each cell contains 6 × 6 pixels, for instance. Suppose that we use a 9-bin histogram to collect the gradient information of these 6 × 6 pixels, i.e., to divide the 180 degrees of gradient orientation of the cell into nine orientation bins of 20 degrees each. For example, if the gradient orientation of the pixel is 20-40 degrees, then the 2nd histogram bin count will be increased by 1. By doing so, every pixel in the cell is projected with a weight onto the histogram by its gradient orientation (mapped into a specific angle range). Consequently, the histogram of the oriented gradient of the cell is obtained, which is the 9D feature vector of the cell (since there are nine bins).
Step 5. Concatenate cells into blocks and normalize the oriented gradient histograms within each block. The strength of the gradient changes significantly owing to the variations in the local illumination strength and the contrast between foreground and background. Hence, the gradient strength needs to be normalized. The normalization can further compress the illumination, shadow, and edges. The implementation sequence is as follows: (1) combine the unit cells into large and spatially connected blocks; (2) concatenate the feature vectors from all cells in the block to generate the HOG feature of the block. Since the blocks overlap, the feature vector of each cell may appear in the final feature vector multiple times. We call this normalized block descriptor (vector) "the HOG descriptor."
Step 6. Collect the HOG features. This last step is to collect the HOG features from all overlapping blocks in the testing window and combine them into the final feature vector to be used in the classification.
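The gamma compression, gradient, cell histogram, and normalization steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes grayscale values in [0, 1], central differences for the gradients, unsigned orientations in 0-180 degrees split into nine 20-degree bins (which matches the "20-40 degrees goes to the 2nd bin" example), and votes weighted by gradient magnitude. The steerable filter stage is omitted, and all function names are hypothetical.

```python
import math


def gamma_compress(gray, gamma=0.5):
    """Step 1: power-law compression of a grayscale image (values in [0, 1])."""
    return [[v ** gamma for v in row] for row in gray]


def gradients(gray):
    """Step 3: central-difference gradient magnitude and unsigned angle."""
    h, w = len(gray), len(gray[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang


def cell_histogram(mag, ang, top, left, cell=6, bins=9):
    """Step 4: magnitude-weighted orientation histogram of one cell."""
    width = 180.0 / bins
    hist = [0.0] * bins
    for y in range(top, top + cell):
        for x in range(left, left + cell):
            b = min(int(ang[y][x] // width), bins - 1)
            hist[b] += mag[y][x]
    return hist


def l2_normalize(vec, eps=1e-6):
    """Step 5: L2 normalization of a concatenated block vector."""
    norm = math.sqrt(sum(v * v for v in vec) + eps * eps)
    return [v / norm for v in vec]
```

For a block, the histograms of its cells would be concatenated and passed through `l2_normalize`; the final descriptor is the concatenation of all normalized block vectors in the detection window.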

DMM-LBP Feature.
In the depth motion map (DMM), i is the index of the frame in the time sequence, $MAP_V^i$ represents the projection of frame i on view V, and s and e represent the start frame and the end frame, respectively.
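A minimal sketch of how such a motion map can be accumulated is given below. It assumes the commonly used definition $DMM_V = \sum_{i=s}^{e-1} |MAP_V^{i+1} - MAP_V^{i}|$, which is consistent with the symbols defined here but is not confirmed by the surviving text; the function name and the list-of-lists image representation are illustrative.

```python
def depth_motion_map(projections, start=0, end=None):
    """Accumulate absolute differences between consecutive projected frames.

    projections: list of 2D lists, one projected depth map per frame for a
    single view V. start and end play the roles of s and e.
    """
    if end is None:
        end = len(projections) - 1
    h, w = len(projections[0]), len(projections[0][0])
    dmm = [[0.0] * w for _ in range(h)]
    for i in range(start, end):
        cur, nxt = projections[i], projections[i + 1]
        for y in range(h):
            for x in range(w):
                dmm[y][x] += abs(nxt[y][x] - cur[y][x])
    return dmm
```

Pixels that never change across the sequence stay at zero in the resulting map, so the map highlights where motion energy accumulates in the chosen view.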
Table 1 (excerpt): The selective ensemble-based SVM classification algorithm (SESVM). ... $T_{Tr}$ by using the base classification algorithm SVM and added to the set $\Theta$; (5) End for; (6) Selecting process: the error rates of $A^{(0)}$, $A^{(1)}$, $A^{(2)}$, and $A^{(3)}$ on the verification set $T_{Val}$ are calculated, with the minimum error rates saved.
Table 2: Relations between a pair of classifiers. Note: $N^{AB}$ is the number of samples in the dataset classified correctly (A = 1) or incorrectly (A = 0) by $SVM_i$, and correctly (B = 1) or incorrectly (B = 0) by $SVM_j$.

Several pixel values in the depth image are 0, which is not helpful for the description of action features. Hence, the region of interest operation should be performed for each frame image. To further filter the pixels in DMM, the local binary pattern (LBP) operation is performed on DMM. LBP, first proposed by Ojala et al. [32], is an effective operator for describing the local texture features of an image, and it is highly robust to changes of illumination and rotation. For a given point $DMM_V(x_c, y_c)$ on the image $DMM_V(x, y)$, LBP is calculated from m sampling points $\{f(x_i, y_i)\}_{i=1}^{m}$ located at the sampling radius r around the pixel $f(x_c, y_c)$. The LBP feature extraction algorithm of the depth image is as follows:
Step 1. The region of interest of the depth image is extracted as the detection window.
Step 2. Get the projection view of the depth map in three different directions.
Step 5. For a pixel in each Cell i (x, y), the pixel value of its adjacent eight pixels is compared with it. If the value of the surrounding pixels is greater than the value of the center pixel, the position of the pixel is marked as 1; otherwise, it is 0. Accordingly, the eight points in the 3 * 3 domain can be compared to generate 8 bit binary number; that is, the LBP value of the center pixel of the window can be obtained.
Step 6. Calculate the histogram of each cell, i.e., the frequency of each number, and normalize the histogram.
$$LBPhog_i \leftarrow \mathrm{BinCount}\left(LBP_i(x, y)\right), \quad i = 1, 2, \ldots, 16. \quad (18)$$
Step 7. Finally, the statistical histogram of each cell is connected into a feature vector, which is the LBP feature vector of the whole depth image.
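The comparison and histogram steps of the LBP procedure can be sketched as follows. The sketch follows the text in marking a neighbor as 1 only when it is strictly greater than the center pixel; the bit ordering (clockwise from the top-left neighbor) and the 256-bin histogram are assumptions, since the text does not specify them, and the function names are hypothetical.

```python
# (dy, dx) offsets of the eight neighbors, clockwise from the top-left
# corner. The ordering is an assumption; any fixed ordering gives a valid code.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(img, y, x):
    """8-bit LBP code of the pixel at (y, x) from its 3 x 3 neighborhood."""
    center = img[y][x]
    code = 0
    for bit, (dy, dx) in enumerate(OFFSETS):
        if img[y + dy][x + dx] > center:
            code |= 1 << bit
    return code


def lbp_histogram(cell):
    """Normalized 256-bin histogram of LBP codes over the interior of a cell."""
    hist = [0.0] * 256
    count = 0
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            hist[lbp_code(cell, y, x)] += 1.0
            count += 1
    return [v / count for v in hist] if count else hist


def lbp_feature(cells):
    """Concatenate the per-cell histograms into one feature vector."""
    feature = []
    for cell in cells:
        feature.extend(lbp_histogram(cell))
    return feature
```

With the 16 cells indicated in equation (18), `lbp_feature` would return a vector of 16 × 256 values under these assumptions.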

HJF Feature.
An RGB-D sensor can quickly obtain the human joint positions and the three-dimensional skeleton from the depth image information. These data contain rich information, which brings new ideas and methods to HAR. For example, the Kinect sensor released by Microsoft provides the positions of 20 human 3D skeleton joints, from which features can be extracted. Further, the feature dimension becomes very small, which is conducive to speeding up the calculation and improving the real-time performance.
Different human actions are reflected not only in the difference of joint position information but also in the energy features of the joint point sequence. We have used the joint kinetic energy features, direction change features, and joint potential energy features as the hybrid joint features.

To calculate the kinetic energy information of the human joint points, it is necessary to obtain the three-dimensional coordinates of the human joint points P(x, y, z). Therefore, according to the coordinate changes between two adjacent frames, the kinetic energy $KEF_{i,t}$ of the ith joint in frame $F_t$ is calculated, where k is the kinetic energy parameter (in the experiment, k can be taken as 1) and $\Delta t$ is the time interval between the two adjacent frames. Human action is related to the information of the current and past positions. In different action states, the speed of movement of the joints varies with time, and the direction of change may also vary. According to the coordinates of the human 3D joint points, the direction change vector of each joint point is calculated as the human motion feature, where $DC_{i,t}$ represents the direction change vector of the ith joint point in frame $F_t$ relative to the ith joint point in the previous frame $F_{t-1}$. Further, $x_{i,t}$, $y_{i,t}$, and $z_{i,t}$ represent the spatial three-dimensional coordinates of the joint point in frame $F_t$.
We have combined the features of the joint kinetic energy and joint direction change into a new feature, which is defined as the hybrid joint feature, given as

$$x_{HJF} \leftarrow \left[KEF_1, KEF_2, \ldots, KEF_{20}, DC_1, DC_2, \ldots, DC_{20}\right].$$
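A minimal sketch of the hybrid joint feature for one pair of adjacent frames is given below. The exact formulas are not recoverable from the text, so the sketch assumes $KEF_{i,t} = \frac{1}{2} k \left(\|P_{i,t} - P_{i,t-1}\| / \Delta t\right)^2$ for the kinetic energy and the coordinate difference $DC_{i,t} = P_{i,t} - P_{i,t-1}$ for the direction change; both are plausible readings rather than confirmed definitions, and the function names are hypothetical.

```python
def kinetic_energy(prev, cur, dt=1.0, k=1.0):
    """Assumed kinetic energy of one joint: 0.5 * k * (displacement / dt)^2."""
    dist_sq = sum((c - p) ** 2 for p, c in zip(prev, cur))
    return 0.5 * k * dist_sq / (dt * dt)


def direction_change(prev, cur):
    """Assumed direction change vector: coordinate difference between frames."""
    return tuple(c - p for p, c in zip(prev, cur))


def hybrid_joint_feature(prev_joints, cur_joints, dt=1.0, k=1.0):
    """Concatenate all joint kinetic energies, then all direction changes."""
    pairs = list(zip(prev_joints, cur_joints))
    kef = [kinetic_energy(p, c, dt, k) for p, c in pairs]
    dc = [d for p, c in pairs for d in direction_change(p, c)]
    return kef + dc
```

With 20 joints, this layout yields 20 kinetic energy values followed by 60 direction change components per frame pair.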

Feature Fusion.
Feature fusion is an effective method to clearly distinguish human action features. Currently, the major feature fusion methods include the pixel-level, feature-level, and decision-level fusions. We employ the Compared with the single action features, these composite features show excellent robustness as they are a collection of the advantages of every single feature and more suitable for describing the human action features.

Recognition Method
Recently, the research on the theory and algorithms of ensemble learning has been a hotspot in the field of machine learning. The construction of an ensemble learning machine is divided into two steps, namely, the generation step and the merging step. The key is to effectively generate base learning machines with strong generalization ability and great differences. In other words, the accuracy and diversity of the base learning machines are two important factors. In general, the predictive effect of the ensemble learning machine is significantly better than that of a single base learning machine. However, the predictive speed of the ensemble learning machine is significantly slower than that of a single base learning machine. Moreover, as the number of the base learning machines increases, the needed storage space increases sharply, which is a serious problem for online learning. Zhou et al. [33] have proposed the "selective ensemble" to eliminate the base learners with poor performance and, hence, to select certain ones to build the set for better prediction effect.
We propose a selective ensemble-based SVM classification framework for recognition. Assume that $T_{Tr} = \{(x_i, y_i)\}_{i=1}^{N_{Tr}}$ is a given training set. For each training sample $(x_i, y_i)$, the input variable is the action feature vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iM}) \in R^M$, the output variable is the action category $y_i \in \Omega = \{\omega_1, \omega_2, \ldots, \omega_c\}$, and c is the number of action classes. At the same time, let $T_{Val} = \{(x_i, y_i)\}_{i=1}^{N_{Val}}$ denote the verification set with the capacity of $N_{Val}$. Table 1 shows the selective ensemble-based SVM classification algorithm (SESVM).

Selective ensemble learning assumes that the multiple base learning machines have been generated, and only some of them are selected to construct the final ensemble based on a certain selection strategy. In the selective ensemble learning, diversity among the base classifiers plays an important role in explaining the working mechanism of multiclassifier systems and constructing effective ensemble systems. Current diversity measures can be divided into two kinds, namely, (i) the paired diversity measures for calculating the diversity between two basic classifiers and (ii) the unpaired diversity measures targeted at all basic classifiers. Paired diversity measures include Q statistics, correlation coefficient, disagreement measure, and double error measure. Disagreement measure method is used in this study as it features simple calculation, wide application, and favorable results in most cases. Suppose that SVM i and SVM j are two different classifiers whose relationship is given in Table 2.
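The paired diversity measures mentioned above can be computed from the four counts $N^{11}$, $N^{10}$, $N^{01}$, and $N^{00}$ of a pair of classifiers. The sketch below uses the standard textbook definitions of the disagreement measure, $(N^{01} + N^{10}) / (N^{11} + N^{10} + N^{01} + N^{00})$, and of the correlation coefficient; whether the paper uses exactly these forms is an assumption, and the function names are illustrative.

```python
import math


def pair_counts(pred_i, pred_j, labels):
    """Count samples by (classifier i correct?, classifier j correct?)."""
    n = {(1, 1): 0, (1, 0): 0, (0, 1): 0, (0, 0): 0}
    for a, b, y in zip(pred_i, pred_j, labels):
        n[(int(a == y), int(b == y))] += 1
    return n


def disagreement(n):
    """Fraction of samples on which exactly one classifier is correct."""
    total = sum(n.values())
    return (n[(0, 1)] + n[(1, 0)]) / total


def correlation(n):
    """Correlation coefficient between the correctness of two classifiers."""
    num = n[(1, 1)] * n[(0, 0)] - n[(0, 1)] * n[(1, 0)]
    den = math.sqrt(
        (n[(1, 1)] + n[(1, 0)]) * (n[(0, 1)] + n[(0, 0)])
        * (n[(1, 1)] + n[(0, 1)]) * (n[(1, 0)] + n[(0, 0)])
    )
    return num / den if den else 0.0
```

A higher disagreement, or a lower correlation, indicates a more diverse pair, which is the property a selection strategy would favor when pruning the pool of base classifiers.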
Thus, in the confidence-based majority voting, the classifiers with poorer performance are given smaller weights while those with better performance are given larger weights. The base classifier set $A = \{SVM^*_1, SVM^*_2, \ldots, SVM^*_N\}$ is obtained after screening by CCCSA. Then, the voting weight of each base classifier is determined based on its precision. The voting weight of a base classifier $SVM^*_i$ depends on its error rate $\varepsilon_i$, which is defined through the indicator function I(p): if the predicate p: $SVM^*_i(x_i) \neq y_i$ is true, I(p) = 1; otherwise, it is 0. The weight $w_i$ of the base classifier $SVM^*_i$ is defined from $\varepsilon_i$ such that if $\varepsilon_i$ approaches 0, then $w_i$ is a large positive value, and if $\varepsilon_i$ approaches 1, then $w_i$ is a large negative value. The classification result of the set of N classifiers, $SVM^*(x)$, is obtained by the weighted majority vote.
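The weighted voting stage can be sketched as follows. The weight formula is not recoverable from the text; the sketch assumes $w_i = \ln\left((1 - \varepsilon_i) / \varepsilon_i\right)$, which matches the stated behavior (large positive as $\varepsilon_i$ approaches 0, large negative as it approaches 1), and assumes the ensemble outputs the class with the largest summed weight. The clipping constant and function names are illustrative.

```python
import math
from collections import defaultdict


def error_rate(preds, labels):
    """Fraction of validation samples a base classifier gets wrong."""
    return sum(p != y for p, y in zip(preds, labels)) / len(labels)


def vote_weight(eps, tiny=1e-6):
    """Assumed weight ln((1 - eps) / eps), clipped to avoid division by zero."""
    eps = min(max(eps, tiny), 1.0 - tiny)
    return math.log((1.0 - eps) / eps)


def weighted_vote(predictions, weights):
    """Return the class whose supporting classifiers have the largest total weight."""
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w
    return max(scores, key=scores.get)
```

For example, one accurate classifier (error rate 0.05) voting for "run" outweighs two weaker classifiers (error rate 0.4 each) voting for "walk", so the ensemble output under these assumptions is "run".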

Experiments and Results.
In this section, we validate the feasibility and efficiency of the proposed method in two experiments. Cross-validation has been adopted in the experiments to train the classification model and to test its performance. First, we test the recognition rate on the G3D dataset and CAD60, based on the single feature and the algorithm in this paper. In the second experiment, we compare our method to alternative algorithms. The result of the first experiment is presented using the confusion matrix. The element (i, j) is the percentage of actions of class i that are classified as actions of class j. Therefore, the classification result is better when the diagonal elements are larger.
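The confusion matrix described here, with element (i, j) as the percentage of class i samples classified as class j, can be computed with a short sketch such as the following. The function name and the label lists are illustrative and are not taken from the experiments.

```python
def confusion_matrix_percent(true_labels, pred_labels, classes):
    """Row-normalized confusion matrix in percent.

    Element [i][j] is the percentage of samples of class i predicted as class j.
    """
    index = {c: k for k, c in enumerate(classes)}
    counts = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, pred_labels):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([100.0 * v / total if total else 0.0 for v in row])
    return matrix
```

Each row sums to 100 when the class has at least one sample, so the diagonal entries can be read directly as per-class recognition rates.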
In Figures 6-8, the recognition rates using the single feature on the G3D dataset are illustrated with confusion matrices. Figure 9 shows the recognition rate of the proposed method using multimodal fusion information. From the experimental results shown in Figures 6-9, we can see that the recognition accuracy using combined features is higher than that using single features. This shows that the representation of human action features directly affects the recognition effect of human action recognition methods. A single feature is often affected by human appearance, environment, camera setting, and other factors, and its recognition effect is limited. From Figure 9, we can see that the recognition rate of four actions (defend, tennis serve, throw bowling ball, and clap) is 100%, while the recognition rate of three actions (walk, run, and jump) is low and these actions are easy to confuse. Through the analysis, it is found that for actions such as walk, run, and jump, the feature that can really distinguish these actions is the motion frequency, which requires using the correlation between the information of multiple frames and the characteristics of adjacent frames when training the action model. In Figures 10-12, the recognition rates using the single modal feature on CAD60 are illustrated with confusion matrices. Figure 13 shows the recognition rate of the proposed method using multimodal features on CAD60. Through comparison, the proposed method achieves a recognition rate of 91.7% on CAD60. Table 4 shows the recognition rates using the single modal feature and multimodal features in terms of precision. It can be observed that the recognition rates of the proposed method using multimodal features are higher than those of the methods using a single modal feature.
In the second experiment, we have compared the proposed method to alternative ones. Table 5 shows the comparison between our algorithm, boosting, bagging, support vector machine (SVM), and artificial neural networks (ANNs). Accordingly, the integrated multilearner recognition algorithm based on multimodal features has achieved the highest recognition rate of 92%. Table 6 compares the average class accuracy of our method with results reported by other researchers. In this comparison, our method outperforms the existing state-of-the-art approaches. Note that a precise comparison between the approaches is difficult, since the experimental set-ups, e.g., the training strategies, differ slightly between approaches.

Conclusion
This paper presents a novel approach to HAR, which is a challenging research topic. A Kinect sensor has been deployed to acquire RGB-D image data, and the multimodal features (RGB-HOG features, DMM-LBP features, and HJF features) were extracted. The selective ensemble-based support vector machine (SESVM) has been adopted to fully utilize the biasing effects from different learners. The experiments have been conducted on standard public datasets and achieved good recognition rates. However, a large number of tagged video training samples is required for the classifier to achieve a good generalizing capability. This demands abundant manual tagging work and thus increases the practical difficulties. Therefore, our future work will focus on the utilization of the abundant untagged video samples in hand, to enhance the system performance.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.