Latent Hierarchical Model of Temporal Structure for Complex Activity Classification

Modeling the temporal structure of sub-activities is an important yet challenging problem in complex activity classification. This paper proposes a latent hierarchical model (LHM) to describe the decomposition of complex activity into sub-activities in a hierarchical way. The LHM has a tree-structure, where each node corresponds to a video segment (sub-activity) at certain temporal scale. The starting and ending time points of each sub-activity are represented by two latent variables, which are automatically determined during the inference process. We formulate the training problem of the LHM in a latent kernelized SVM framework and develop an efficient cascade inference method to speed up classification. The advantages of our methods come from: 1) LHM models the complex activity with a deep structure, which is decomposed into sub-activities in a coarse-to-fine manner and 2) the starting and ending time points of each segment are adaptively determined to deal with the temporal displacement and duration variation of sub-activity. We conduct experiments on three datasets: 1) the KTH; 2) the Hollywood2; and 3) the Olympic Sports. The experimental results show the effectiveness of the LHM in complex activity classification. With dense features, our LHM achieves the state-of-the-art performance on the Hollywood2 dataset and the Olympic Sports dataset.


I. INTRODUCTION
HUMAN activity classification is an important yet difficult problem in computer vision [1]-[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content-based video retrieval [8]. The challenges of activity classification come from several aspects. Firstly, there always exist large intra-class appearance and motion variations within the same activity class. Background clutter, illumination and viewpoint changes, and activity speed variations also increase the difficulty of classification. Secondly, compared with still images, activity video has a higher dimensionality, which increases not only the computational cost but also the difficulty of developing robust classification algorithms. Finally, human activity always consists of a sequence of sub-activities, each of which further includes gestures and motions of different body parts.
While activity exhibits complex temporal structure, its sequential decomposition yields an important cue for activity recognition. Complex activity usually is composed of several phases (see Fig. 1). Each phase corresponds to a relatively simple sub-activity, and there exists a temporal order among these phases. The importance of temporal structure in activity classification has been demonstrated in previous works [9]- [15]. However, the effective modeling of temporal structure is still challenging due to the following two problems.
The first problem is that "sub-activity" usually has no precise definition given a complex activity type. Sub-activity is a relatively "simple" part of a "complex" activity. Its definition depends on the temporal scale we are considering, which can be ambiguous. For example (see Fig. 1), high-jump in a long temporal scale can be divided into three sub-activities, namely running, jumping, and landing. However, in a finer temporal scale, running can be further decomposed into several primitive sub-activities, such as waiting, starting running, and speeding up. The decomposition of complex activity corresponds to a coarse-to-fine process.
The second problem is how to automatically decompose complex activity into several sub-activities given a specific video. It is a difficult problem because the sub-activities usually have various durations and temporal displacements due to the speed variations of motion. For instance, in the activity of basketball-layup, some may have a long running time before they layup the basketball, while others may have a short running time. Therefore, classification algorithm needs to automatically determine the starting and ending time points of each sub-activity.
In order to address both problems effectively, we propose a Latent Hierarchical Model (LHM) for complex activity recognition. LHM makes use of its tree structure to decompose an activity into sub-activities automatically, which allows us to deal with the ambiguity of sub-activity. Nodes at the higher layers correspond to activities at a long temporal scale. Each activity is divided into several sub-activities at the next layer with a relatively shorter temporal scale. The decomposition repeats recursively until it reaches the leaf nodes. For each video segment, we use the Bag of Visual Words (BoVW) representation, for its simplicity and compactness, to summarize motion and appearance features. Besides, the locations of all sub-activities are specified by latent variables. The latent variables are adapted to each video, which makes our model flexible and effective in dealing with the duration variations and temporal displacements of sub-activities. We formulate the learning and inference problems of LHM in the latent SVM framework [16]. Since LHM has a deep structure with many latent variables, it is infeasible to traverse all possible configurations of sub-activities during classification. We develop a cascade inference algorithm based on dynamic programming and pruning techniques, which greatly reduces the computational cost.
Fig. 1. Sub-activity decomposition is related to temporal scale. High jump can be divided into running, jumping, and landing at a long temporal scale. However, running is further composed of waiting, start-running, and speeding up if it is observed at a short temporal scale.
The main contributions of this paper can be summarized as follows:
• We propose a latent hierarchical model (LHM), which describes the temporal structure of activity in a coarse-to-fine manner. It introduces two latent variables to denote the starting and ending time points of each sub-activity. Thus, LHM is flexible in dealing with duration variation and temporal displacement (Section III).
• We formulate the learning problem of LHM in the latent SVM framework, and we extend the traditional linear latent SVM by introducing non-linear kernels. Therefore, we can use the $\chi^2$ kernel for the BoVW representation, which plays an important role in the final recognition performance (Section IV-A).
• Because the latent variables admit a large number of possible configurations, we develop a cascade inference algorithm based on dynamic programming and pruning techniques to improve classification efficiency (Section IV-B).
• We conduct experiments on the challenging Hollywood2 and Olympic Sports datasets, and achieve recognition performance superior or comparable to that of the state-of-the-art approaches. Our experimental results also demonstrate the effectiveness of the hierarchical structure and the latent variables (Section V).

II. RELATED WORK
Human activity classification has been studied extensively in recent years. In this paper, complex activities refer to those with long temporal structures such as Sports actions [12], Cooking Composite actions [17], and so on. Here we only overview a few related works and readers can refer to [1]- [3] for good surveys.
Video Representation. Video representation has been a central issue of activity recognition. Low-level local features have turned out to be effective in action recognition [18]. In recent years, researchers have developed many different spatiotemporal detectors for video, such as 3D-Harris [19], 3D-Hessian [20], Cuboids [21], and Dense [22]. A local 3D region is then extracted around each interest point, and a histogram descriptor is computed to capture the appearance and motion information. Typical descriptors include Histogram of Gradient and Histogram of Flow (HOG/HOF) [23], Histogram of Motion Boundary (MBH) [22], 3D Histogram of Gradient (HOG3D) [24], Extended SURF (ESURF) [20], the Co-occurrence descriptor [25], and so on. Finally, a global representation is obtained for each video clip via a statistical model.
Among these statistical models, Bag of Visual Words (BoVW) is a common choice in action recognition [26]. Based on local features, BoVW construction usually consists of two steps: (i) encoding of the local features, and (ii) feature pooling and normalization. There is a large body of research on encoding methods such as Vector Quantization (VQ) [27], Soft-assignment Encoding (SA) [28], Fisher Vector (FV) [29], Sparse Coding (SPC) [30], Locality-constrained Linear Encoding (LLC) [31], and so on. These methods focus on minimizing information loss and improving encoding efficiency. For pooling, there are two typical methods, sum pooling [27] and max pooling [30]; for normalization, typical choices include $\ell_1$-normalization, $\ell_2$-normalization, and power-normalization [29].
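The two BoVW steps above can be illustrated with a minimal sketch, assuming vector quantization for encoding, sum pooling, and $\ell_1$-normalization. The function name `bovw_histogram` and the toy codebook are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Vector-quantize local descriptors and sum-pool them into an l1-normalized histogram."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    hist = np.zeros(len(codebook))
    if len(descriptors) == 0:
        return hist                                  # empty segment: all-zero histogram
    # Squared Euclidean distance from every descriptor to every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                      # (i) encoding: hard assignment (VQ)
    np.add.at(hist, nearest, 1.0)                    # (ii) sum pooling
    return hist / hist.sum()                         # l1 normalization

# Toy example with a 3-word codebook (illustrative values only).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
descriptors = np.array([[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [3.9, 4.2]])
print(bovw_histogram(descriptors, codebook))
```

Soft-assignment or Fisher Vector encoding would replace the `argmin` step; the pooling and normalization steps stay structurally the same.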
In addition to these low-level local features and BoVW representation, there were some research works on mid-level and high-level representations such as motionlet [32], motion atom and phrase [15], action bank [33], and so on.
Temporal Structure. The importance of temporal structure in recognizing human activity has been studied in previous research [9]-[15], [34]. Probabilistic graphical models were usually adopted to model the temporal structure of human activity or motion trajectories, such as Hidden Markov Models (HMMs) [5], [9], Hidden Conditional Random Fields (HCRFs) [10], [11], and Dynamic Bayesian Networks (DBNs) [4], [34]. The learning and inference of graphical models were usually conducted by approximate methods such as Expectation Maximization, Variational Methods, and Sampling Methods [35]. The learning process is complex and usually needs a large amount of data to avoid overfitting. In addition to graphical models, some research works resorted to Max-Margin Methods [12], [14]. They formulated the learning problem using Latent SVM [16], which has been shown to be effective in object detection, and made use of it to estimate the model parameters and conduct inference. The learning of LHM is formulated in the same latent SVM framework as these methods, but our model focuses on decomposing complex activity into sub-activities in a hierarchical manner. Our experimental results show that the hierarchical structure plays an important role in improving the recognition performance.
Hierarchical Model. Hierarchical tree-structured models are biologically inspired by the brain architecture and the vision system [36], [37]. They have been widely used in computer vision and have achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hierarchical model with a deep structure for object detection. In their method, an object was represented by a mixture of hierarchical tree models whose nodes represent object parts. The experimental results indicated that deep structures can convey rich descriptions of shape and appearance features. Similarly, we model human activity in a tree-structured manner: the root corresponds to the whole activity, while the other nodes represent sub-activities at different temporal scales. We find that the deep structure yields much better results than a single-layer one in our activity classification experiments, which agrees with the conclusion of [40] on object detection.

III. LATENT HIERARCHICAL MODEL FOR ACTIVITY CLASSIFICATION
In this section, we firstly develop a Latent Hierarchical Model (LHM) to describe the temporal structure of activity video in a coarse-to-fine manner in Section III-A. Then, we summarize the key properties of LHM in Section III-B.

A. Latent Hierarchical Model
Latent Hierarchical Model (LHM) is a tree-structured model that captures the hierarchical decomposition of a complex activity into sub-activities. As shown in Fig. 2, LHM can be seen as a tree decomposition of a complex activity, where each node represents a video segment (activity or sub-activity) at a certain temporal scale. The root node describes the whole activity (e.g. long jump) in a rough manner. The root node is divided into several sub-activities in the next layer (e.g. run, jump, land). Each sub-activity can be further decomposed recursively down to the leaf nodes, which represent atomic activities (e.g. start run, speed up, jump up, rolling). In essence, LHM is a generalization of the STAR model [45] with the independence assumption that child nodes are independently placed in a coordinate system determined by their parent node. This generalization gives LHM more descriptive capacity and yet allows for efficient inference algorithms thanks to the independence assumption.
The parameters that describe the structure of LHM are the depth of the tree $d$ and the number of nodes in each layer $\{n_1, \ldots, n_d\}$. In the example of Fig. 2, the depth is set to 3 and each non-leaf node has 3 children. In principle, the structure is flexible and can be set differently. By default we adopt the 1-3-9 structure, and we explore other structures in the experiments. LHM enables us to divide each video into $N$ segments at different temporal scales, and each segment $S_i$ is specified by a pair $z_i = (s_i, e_i)$ of starting and ending time points. The latent variables of all segments are collected in $h = (z_1, \ldots, z_N)$. For activity classification, we define a discriminant function of LHM for each video $V$ given the configuration of latent variables $h$:
$$f(V, h) = \sum_{i=1}^{N} \Phi_i(V, z_i) + \sum_{(i,j) \in E} \Psi_{i,j}(z_i, z_j), \tag{1}$$
where $\Phi_i(V, z_i)$ is the localized segment model, measuring the compatibility between the video features and the segment model; $E$ denotes the set of pairs of parent and child nodes; and $\Psi_{i,j}(z_i, z_j)$ is the temporal deformation model, incorporating the structural constraints between the parent and child segments. We would like to maximize the discriminant function over all possible configurations of latent variables for each video $V$, so that our model can find the best location for each segment:
$$f^*(V) = \max_{h \in H(V)} f(V, h),$$
where $H(V)$ denotes the set of all possible configurations of the latent variables $h$ in video $V$.
Segment Model. We denote by $\phi(V, z_i)$ the feature representation extracted from segment $z_i$ of video $V$. Then we can linearly parameterize the segment model as
$$\Phi_i(V, z_i) = w_i \cdot \phi(V, z_i),$$
where $w_i$ is the weight vector of the $i$-th segment. In this way, each segment model acts like a linear classifier. Because local low-level features and the bag of visual words (BoVW) representation [26] are widely used, we adopt them as our features. Specifically, we use the spatiotemporal interest points (STIPs) [19] with HOG/HOF descriptors [23]. Then, we choose vector quantization encoding and sum pooling to construct the BoVW representation. Besides, in the further exploration part of Section V, we also use Dense Trajectories [22] as low-level features of LHM due to their good performance. We observe that using the dense features enables us to further boost the recognition performance of LHM.
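To make the discriminant function and its maximization over latent configurations concrete, the following is a minimal sketch under several assumptions: visual words are given with their time stamps, each segment is described by a sum-pooled, $\ell_1$-normalized histogram, the deformation term is a simple quadratic penalty, and the maximization is done by exhaustive enumeration (feasible only for tiny models). All names (`segment_hist`, `score`, `best_config`, `offsets`) and all numbers are hypothetical.

```python
import itertools
import numpy as np

def segment_hist(times, words, vocab_size, s, e):
    """BoVW histogram of the visual words whose time stamp falls in [s, e)."""
    mask = (times >= s) & (times < e)
    hist = np.bincount(words[mask], minlength=vocab_size).astype(float)
    total = hist.sum()
    return hist / total if total > 0 else hist

def score(config, weights, edges, deform, times, words, vocab_size):
    """f(V, h): sum of linear segment scores plus deformation scores on the edges."""
    f = 0.0
    for i, (s, e) in enumerate(config):
        f += float(weights[i] @ segment_hist(times, words, vocab_size, s, e))
    for child, parent in edges:
        f += deform(child, config[child], config[parent])
    return f

def best_config(candidates, **model):
    """f*(V): exhaustive maximization over all candidate configurations h."""
    return max(itertools.product(*candidates), key=lambda h: score(h, **model))

# Toy video: word 0 occurs in frames 0-2, word 1 in frames 3-5.
times = np.arange(6)
words = np.array([0, 0, 0, 1, 1, 1])
weights = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])   # root, child 1, child 2
edges = [(1, 0), (2, 0)]                                      # (child, parent) pairs
offsets = {1: (0, -3), 2: (3, 0)}                             # anchor = parent (s, e) + offset

def deform(child, z_c, z_p):
    # Assumed quadratic penalty on the displacement from the anchor point.
    ds = z_c[0] - (z_p[0] + offsets[child][0])
    de = z_c[1] - (z_p[1] + offsets[child][1])
    return -0.1 * (ds ** 2 + de ** 2)

candidates = [[(0, 6)], [(0, 2), (0, 3), (0, 4)], [(2, 6), (3, 6), (4, 6)]]
model = dict(weights=weights, edges=edges, deform=deform,
             times=times, words=words, vocab_size=2)
h_star = best_config(candidates, **model)
print(h_star, score(h_star, **model))
```

The exhaustive search grows exponentially with the number of nodes, which is the motivation for the cascade inference of Section IV-B.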
Fig. 2. An example of the latent hierarchical model for activity video. In this example, LHM has a tree structure with three layers. The top layer has only one node (i.e. Root) and the middle layer has three nodes (i.e. Seg $i$, $i \in \{1, 2, 3\}$). There are in total nine nodes (i.e. Seg $i$ $j$, $i, j \in \{1, 2, 3\}$) at the bottom layer. Nodes of different layers correspond to sub-activities at different temporal scales. Note that we choose the 1-3-9 structure in this example, and other structures can also be used for LHM in practice.
Fig. 3. Illustration of the temporal displacement between a child node and the anchor point determined by its parent node.
Temporal Deformation Model. We denote $(ds_i, de_i) = (s_i, e_i) - ((s_j, e_j) + v_i)$ as the temporal displacement of a child node relative to its anchor point $(s_j, e_j) + v_i$ determined by the parent node (see Fig. 3). The temporal deformation model $\Psi_{i,j}(z_i, z_j)$ is then defined as a function of this displacement. It can be interpreted as a flexible term that allows the child node to shift from its anchor point while penalizing large deformations. In fact, this term can be interpreted through a Gaussian distribution of the child node location relative to its anchor point, whose covariance $\Sigma_i$ is set to a diagonal matrix: the log probability $\log P(z_i \mid z_j)$ then takes the form of the deformation term, which provides a probabilistic explanation for our temporal deformation model.
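As a worked illustration of this probabilistic reading, suppose (as an assumption, since the exact parameterization is not restated here) that the displacement $(ds_i, de_i)$ follows a zero-mean Gaussian with diagonal covariance $\Sigma_i = \mathrm{diag}(\sigma_{s,i}^2, \sigma_{e,i}^2)$. The log probability is then a separable quadratic penalty on the displacement plus a constant:

```latex
\log P(z_i \mid z_j)
  = -\frac{ds_i^{2}}{2\sigma_{s,i}^{2}}
    -\frac{de_i^{2}}{2\sigma_{e,i}^{2}}
    -\log\!\left(2\pi\,\sigma_{s,i}\,\sigma_{e,i}\right)
```

Under this assumption, larger variances correspond to smaller penalties, that is, to sub-activities whose boundaries are allowed to move more freely around their anchor points.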

B. Model Properties
LHM considers the hierarchical decomposition of complex activity into sub-activities in a recursive manner. The key properties of LHM can be summarized as follows:
• Hierarchical Structure. LHM is a hierarchical model and has a deep structure. It provides more descriptive power for complex activities and captures the temporal structure of an activity in a coarse-to-fine way. At the root, we use a global BoVW to describe the whole activity roughly. In the next several layers, we focus on modeling the sub-activities in a finer manner. In addition to rich descriptive power, the hierarchical structure can prune many unreasonable structures and allows us to design an efficient cascade inference algorithm, which will be discussed in Section IV-B.
• Temporal Structure. In addition to the hierarchical structure, LHM also models the temporal structure among different sub-activities. Each sub-activity occurs at a different temporal location in the whole activity, and there exists an order among them. LHM imposes temporal constraints among sub-activities, and thus provides rich information for complex activity recognition.
• Flexibility. LHM introduces two latent variables to indicate the starting and ending time points of each sub-activity in each video. The latent model not only reduces the human annotation work during training, but also increases the flexibility of our approach. During the inference phase, our model searches for the best match for each sub-activity, and thus the temporal location is adapted to each specific video. Our model is effective in dealing with intra-class variation and is able to align the location of each sub-activity automatically.
• Independence of Low-level Representation. LHM is a general model that concentrates on modeling the hierarchical structure and temporal structure of complex activity based on latent variables, and it does not depend on a specific video representation. In our experiments, we resort to the bag of visual words (BoVW) representation of local spatiotemporal features. We first use the 3D Harris detector and the HOG/HOF descriptor [23] for fair comparison with other methods. Then, we explore dense trajectory features [22] to boost the recognition performance of LHM. In addition to BoVW representations, other mid-level and high-level features such as Motionlet [32], Motion Atom and Phrase [15], and Action Bank [33] can also be used. Furthermore, detection and tracking techniques can be incorporated into LHM to help determine the spatial location of the activity. These extensions are beyond the scope of this paper.

IV. LATENT LEARNING AND CASCADE INFERENCE OF LHM
In this section, we investigate how to learn the model parameters from a set of weakly labeled training samples (i.e. each training sample is only with a category label, without the detailed annotation of each sub-activity), and formulate the learning problem in a latent kernelized SVM framework in Section IV-A. Then we consider the inference problem of how to determine the locations of all sub-activities for each given video in Section IV-B. We design a cascade inference algorithm to search for the best match for each sub-activity given a video. Finally, we provide the implementation details of learning and inference algorithm in Section IV-C.

A. Latent Learning
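The learning task is to estimate the model parameters in Equation (1). Given a set of training samples $\{(V_m, y_m)\}_{m=1}^{M}$ with class labels $y_m \in \{+1, -1\}$, we minimize the following objective:
$$\min_{w} \; \frac{1}{2}\|w\|^2 + C \sum_{m=1}^{M} \max\bigl(0,\; 1 - y_m f^*(V_m)\bigr), \tag{6}$$
where $C$ is a hyperparameter that balances the regularization term and the loss term, $\|\cdot\|$ denotes the $\ell_2$ norm, and $f^*(V_m)$ is the maximum of the discriminant function in Equation (1):
$$f^*(V_m) = \max_{h \in H(V_m)} w \cdot \Upsilon(V_m, h),$$
where $w$ and $\Upsilon(V_m, h)$ are the concatenations of the model parameters and of the video features, respectively: $\Upsilon(V_m, h)$ stacks the segment features $\phi_1(V_m, z_1), \ldots, \phi_N(V_m, z_N)$ followed by the temporal deformation features of all parent-child pairs. During the training process, each training sample $V_m$ has only a class label $y_m$. Unlike the traditional SVM [46], the problem (Equation (6)) is not convex since $f^*(V_m)$ contains a maximum operation over $h$; this problem is called Latent SVM in [16]. It can be shown that the problem becomes convex in the model parameters $w$ when the latent variables $h$ are fixed. This allows us to develop an iterative learning algorithm that alternates between estimating the latent variables $h$ and optimizing the model parameters $w$. In practice, we optimize the learning problem in a "coordinate descent" approach:
• Step 1. We initialize the model parameters $w$ by a simple method, which will be discussed in Section IV-C.
• Step 2. With $w$ fixed, we estimate the latent variables $h$ of each training sample.
• Step 3. With $h$ fixed, we optimize the model parameters $w$ by solving the resulting convex problem.
Steps 2 and 3 are repeated alternately. Note that there are many optimization algorithms to solve the convex problem in Step 3. In [16], the authors develop a stochastic gradient descent algorithm to solve the primal problem. This algorithm is efficient but cannot deal with non-linear kernels. Although there are a large number of works on kernel extensions for the traditional SVM [46], few works have been done for latent SVM. Here, we propose to solve the dual problem of Step 3 in order to incorporate non-linear kernels into the latent SVM framework. Specifically, based on the estimated latent variables, we transform the learning problem (Equation (6)) into a standard SVM problem over a set of instances: each positive sample $V_m$ contributes the instance $\Upsilon(V_m, h_m^*)$, and each negative sample $V_m$ contributes the instances $\Upsilon(V_m, h_m)$ with $h_m \in H_m$, where $h_m^*$ is the latent variable maximizing the score of the positive sample $V_m$, and $H_m = \{h_m \mid f(V_m, h_m) \geq -1\}$ denotes the set of hard negative instances from sample $V_m$.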
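The alternation between latent variable estimation and parameter optimization can be sketched as follows. This is a simplified illustration, not the paper's implementation: it assumes a linear model, represents every sample by a small matrix of candidate joint features (one row per latent configuration), and replaces the dual kernelized solver of Step 3 with plain batch subgradient descent on the hinge loss. The names `train_linear_svm` and `latent_svm` and the toy data are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Step 3 stand-in: batch subgradient descent on 0.5*||w||^2 + C * hinge loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        active = y * (X @ w) < 1                       # instances violating the margin
        grad = w - C * (X[active] * y[active][:, None]).sum(axis=0)
        w -= lr * grad
    return w

def latent_svm(candidates, labels, rounds=5, C=1.0):
    """candidates[m] has shape (num_configs, dim): one joint feature row per configuration h."""
    # Step 1: initialize from the default (first) configuration of every sample.
    X = np.stack([c[0] for c in candidates])
    w = train_linear_svm(X, np.asarray(labels, dtype=float), C)
    for _ in range(rounds):
        feats, ys = [], []
        for c, label in zip(candidates, labels):
            scores = c @ w
            if label > 0:
                # Step 2 (positives): keep the best-scoring configuration h*.
                feats.append(c[scores.argmax()])
                ys.append(1.0)
            else:
                # Step 2 (negatives): keep hard instances with score >= -1.
                hard = c[scores >= -1]
                if len(hard) == 0:
                    hard = c[[scores.argmax()]]        # keep at least one instance
                feats.extend(hard)
                ys.extend([-1.0] * len(hard))
        # Step 3: with h fixed, the problem is a standard convex SVM.
        w = train_linear_svm(np.stack(feats), np.array(ys), C)
    return w

# Toy data: two candidate configurations per sample, two positives and two negatives.
pos = np.array([[1.0, 1.0], [3.0, 0.0]])
neg = np.array([[-1.0, 1.0], [-3.0, 0.0]])
w = latent_svm([pos, pos, neg, neg], [1, 1, -1, -1])
print(w)
```

In the kernelized setting described in the text, Step 3 would instead be solved in the dual with a standard SVM package, and the scores `c @ w` would be computed through kernel evaluations against the support instances.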
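Using the Lagrangian function, we obtain the dual form of this problem, in which the training instances appear only through dot products. Here $\alpha$ are the dual variables, and their relationship with $w$ is determined by
$$w = \sum_{m} \sum_{h_m} \alpha_{m,h_m}\, y_m\, \Upsilon(V_m, h_m).$$
In the dual problem (10), we can replace the dot product $\Upsilon(V_m, h_m) \cdot \Upsilon(V_n, h_n)$ with a non-linear kernel $K(\Upsilon(V_m, h_m), \Upsilon(V_n, h_n))$. In practice, we use a linear kernel for the temporal deformation model and the $\chi^2$ kernel for the BoVW representation, defined as follows:
$$K_{\chi^2}(S_1, S_2) = \exp\Bigl(-\frac{1}{2S} \sum_{r=1}^{D} \frac{(S_{1,r} - S_{2,r})^2}{S_{1,r} + S_{2,r}}\Bigr),$$
where $S$ denotes the mean distance among training samples, $S_{1,r}$ denotes the $r$-th element of histogram $S_1$, and $D$ is the dimension of the BoVW histogram. Then, the kernel for two training instances (13) is defined as the sum of the $\chi^2$ kernels between the BoVW histograms of corresponding segments and the linear kernels between the corresponding temporal deformation features. Note that, due to the non-linear kernel, we cannot calculate $w$ explicitly, and the calculations of $w \cdot \Upsilon(V_k, h_k)$ are replaced by the following formula:
$$w \cdot \Upsilon(V_k, h_k) = \sum_{m} \sum_{h_m} \alpha_{m,h_m}\, y_m\, K\bigl(\Upsilon(V_m, h_m), \Upsilon(V_k, h_k)\bigr).$$

Algorithm 1. Cascade inference for LHM.
Input: video V, model parameters, pruning thresholds {t_j}.
Output: the maximum score of V.
- For each possible location z of the root node n_0, compute F[n_0, z] = Evaluate(n_0, z).
- Return maximum of score: MAX(F[n_0, z]).
Function Evaluate(n_i, z_i):
  if n_i has no child then return Φ_i(V, z_i).
  else
    foreach child node n_j of n_i do
      foreach possible location z_j of n_j do
        // Deformation pruning
        if Ψ_{j,i}(z_j, z_i) < t_j then skip z_j.
        else update s[n_j] with Evaluate(n_j, z_j) + Ψ_{j,i}(z_j, z_i).
      // Hypothesis pruning
      if s[n_j] < t_j then prune n_i at location z_i.
    return Φ_i(V, z_i) + Σ_j s[n_j].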

B. Cascade Inference
The inference task of LHM is to predict the class label $y$ and the latent variables $h$ given the video $V$ and the model parameters $w$. The main challenge comes from the fact that the number of possible configurations of the latent variables $h$ is large, which prevents us from using a brute-force approach to evaluate the discriminant function over all possible $h$. In [12], Niebles et al. used dynamic programming and distance transform techniques in a similar fashion to [16]. They claim that this matching scheme is efficient once the appearance similarities between the video sequences and each motion segment classifier are computed. However, in our problem, evaluating the appearance similarities is the bottleneck due to the $\chi^2$ kernel calculation. Besides, our LHM is a deep model and introduces two latent variables for each segment, so it is very time-consuming to calculate the appearance similarities of all possible configurations in advance. Inspired by the cascade object detection method in [47], we design a cascade inference algorithm for LHM. The core idea of our algorithm is to make use of dynamic programming and pruning techniques to constrain the search space of $h$ and accelerate inference.
First, we convert the inference problem into subproblems using dynamic programming. For a node $n_i$ at location $z_i$, specified by the starting and ending time point pair $(s_i, e_i)$, its largest discriminant value $F(n_i, z_i)$ can be calculated by the following recursive function:
$$F(n_i, z_i) = \Phi_i(V, z_i) + \sum_{n_j \in \mathrm{ch}(n_i)} \max_{z_j} \bigl[ F(n_j, z_j) + \Psi_{j,i}(z_j, z_i) \bigr],$$
where $\mathrm{ch}(n_i)$ denotes the set of child nodes of $n_i$, which is empty for a leaf node. Then, we evaluate the score of each node in a depth-first-search (DFS) order. The cascade inference algorithm for a tree-structured model with $n + 1$ nodes has $2n$ intermediate thresholds for two kinds of pruning techniques. As shown in Algorithm 1, during the DFS process we use two kinds of pruning techniques, namely deformation pruning and hypothesis pruning:
• Deformation pruning: we skip the segment specified by $z_j$ if the temporal deformation term $\Psi_{j,i}(z_j, z_i)$ is smaller than a threshold $t_j$. Intuitively, adding such a strongly negative deformation term would decrease the total score greatly. This pruning technique constrains the child node to move within a reasonable interval.
• Hypothesis pruning: if the maximum score of a child node, $s[n_j]$, is less than a threshold $t_j$, then we prune its parent node $n_i$ at location $z_i$. Intuitively, if the parent node $n_i$ is located correctly, the maximum score of its child node should not be smaller than a threshold, so a small child score may indicate that the location of the parent node is not correct.
During the DFS process, once we evaluate the response of node $n_i$ at location $z_i$, we store its value to avoid calculating it again. Using the cascade inference algorithm, we can find the maximum score for each video $V$ efficiently. Besides, during the inference process we keep the location of each node, so we can also recover the best configuration of the latent variables $h$.
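The recursion and the two pruning rules can be sketched as a memoized depth-first search. This is an illustration under assumptions, not the paper's implementation: segment scores are looked up from a table instead of being computed with kernels, the deformation term is an assumed quadratic penalty, and the child maximum $s[n_j]$ is taken to include the deformation term. All names and numbers are hypothetical.

```python
import math

def cascade_inference(root, children, candidates, seg_score, deform, t_deform, t_hyp):
    """Memoized depth-first evaluation of a tree model with two pruning rules."""
    memo = {}

    def F(i, z_i):
        if (i, z_i) in memo:                       # stored response, never recomputed
            return memo[(i, z_i)]
        total = seg_score(i, z_i)
        for j in children.get(i, []):
            best = -math.inf                       # s[n_j]: best score of child j
            for z_j in candidates[j]:
                d = deform(j, z_j, z_i)
                if d < t_deform[j]:                # deformation pruning: skip z_j
                    continue
                best = max(best, F(j, z_j) + d)
            if best < t_hyp[j]:                    # hypothesis pruning: reject (n_i, z_i)
                total = -math.inf
                break
            total += best
        memo[(i, z_i)] = total
        return total

    # Best (score, location) of the root over all its candidate locations.
    return max((F(root, z), z) for z in candidates[root])

# Toy 1-2 tree: root 0 with children 1 and 2 (illustrative numbers only).
children = {0: [1, 2]}
candidates = {0: [(0, 6)], 1: [(0, 2), (0, 3)], 2: [(3, 6), (4, 6)]}
seg = {(0, (0, 6)): 0.5, (1, (0, 2)): 1.0, (1, (0, 3)): 0.8,
       (2, (3, 6)): 0.7, (2, (4, 6)): 2.0}
offsets = {1: (0, -3), 2: (3, 0)}                  # anchor of child = parent (s, e) + offset

def seg_score(i, z):
    return seg[(i, z)]

def deform(j, z_j, z_i):
    ds = z_j[0] - (z_i[0] + offsets[j][0])
    de = z_j[1] - (z_i[1] + offsets[j][1])
    return -0.1 * (ds ** 2 + de ** 2)

no_prune = {1: -math.inf, 2: -math.inf}
print(cascade_inference(0, children, candidates, seg_score, deform, no_prune, no_prune))
```

With all thresholds at negative infinity the search is exact. Tighter thresholds trade exactness for speed: in this toy example, a deformation threshold of -0.05 discards the high-scoring but displaced placement (4, 6) of child 2.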

C. Implementation Details
Initialization. Unlike the heuristic initialization of [12], we propose a simple method to initialize our model structure and training samples. We set the anchor point of child node relative to parent node in a regular grid layout. For training samples, we initialize latent variable h according to the model structure i.e. ds = 0, de = 0. Then we get a set of instances for the first round of standard SVM training.
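A regular grid initialization can be sketched as follows, assuming that "regular grid layout" means splitting each parent segment into equal, contiguous child segments (this reading, and the names `grid_anchors` and `init_tree`, are assumptions made for illustration).

```python
def grid_anchors(parent, n_children):
    """Split the parent segment (s, e) into n_children equal, contiguous child segments."""
    s, e = parent
    step = (e - s) / n_children
    return [(s + k * step, s + (k + 1) * step) for k in range(n_children)]

def init_tree(video_len, branching=(3, 3)):
    """Layer-by-layer initial segments of a 1-3-9 style tree, i.e. ds = de = 0 everywhere."""
    layers = [[(0.0, float(video_len))]]
    for b in branching:
        layers.append([child for seg in layers[-1] for child in grid_anchors(seg, b)])
    return layers

layers = init_tree(90)
print(layers[1])
```

Other structures can be obtained by changing `branching`, for example `(2, 2)` for a 1-2-4 tree.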
Updating Latent Variables. During the step to estimate latent variables h, the duration of root node is restricted to cover at least 80% of the whole video. The positions of the child nodes are ensured to overlap with the corresponding reference box. These restrictions can suppress some unreasonable structures and improve search efficiency.
Thresholds of Cascade Inference. During the training process, we search all possible configurations of the latent variables $h$ without using pruning techniques. For each node, we keep the minimum score of its child node over all positive samples. The hypothesis pruning thresholds $t_j$ are set to these minima multiplied by a ratio $\beta_1$ ($\beta_1 = 0.5$ in experiments). We also store the values of the temporal deformation term for the different parent and child node pairs. The deformation pruning thresholds $t_j$ are set to the minima of the deformation term over positive training samples multiplied by a ratio $\beta_2$ ($\beta_2 = 1.3$ in experiments). Note that the deformation term is usually negative, so a ratio larger than 1 yields a looser threshold.
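The threshold rule can be written in a few lines. The helper name `pruning_threshold` and the sample values are hypothetical; only the ratios 0.5 and 1.3 come from the text.

```python
def pruning_threshold(values_on_positives, ratio):
    """Threshold = ratio * (minimum value observed over positive training samples)."""
    return ratio * min(values_on_positives)

# Hypothesis pruning: child scores collected on positive samples, beta_1 = 0.5.
t_hyp = pruning_threshold([0.8, 1.5, 2.0], 0.5)
# Deformation pruning: deformation terms (usually negative), beta_2 = 1.3.
t_def = pruning_threshold([-0.2, -1.0, -0.5], 1.3)
print(t_hyp, t_def)
```

Because the deformation minimum is negative, multiplying it by 1.3 moves the threshold further below every value seen on positive samples, so none of those placements would have been pruned.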

V. EXPERIMENTS
We firstly conduct experiments on three public action datasets: the KTH [48], the Hollywood2 [49], and the Olympic Sports Dataset [12]. Then we further explore some important aspects of LHM. For the three datasets, we use LIBSVM package [50] to solve the standard SVM problem in the learning framework of Section IV-A. For multi-class classification, we apply the one-vs-all training scheme.

A. KTH Dataset
The KTH is a relatively simple dataset among the three and it contains 6 action classes: boxing, hand-clapping, hand-waving, jogging, running, and walking [48]. Each action is performed by 25 actors in four controlled environments: outdoors, outdoors with scale variation, outdoors with different clothes, and indoors. There is no camera motion in these videos and the intra-class variations are relatively small compared with other datasets. Some video frames and their detected STIPs are shown in Fig. 4. We follow the experimental settings described in [48] and the codebook size is 1,000.
Experimental results are shown in Fig. 5 and Table I. From the results, we see that our method can achieve high accuracy rates for the actions of boxing, hand-waving, handclapping and walking. But for the action of running and jogging, the performance of our method decreases because the two actions are similar to each other and there is a strong confusion between these two kinds of action.
Comparison with Other Methods. We compare LHM with three other methods in Table I. The method of [48] is based on spatiotemporal jets at the center of each detected interest point using normalized derivatives, and uses the BoVW representation and an SVM classifier. The other two methods [23], [12] are both based on HOG/HOF features. The method of [23] uses the traditional BoVW and the method of [12] uses a single-layer segment model. From the comparison, we find that the three methods using HOG/HOF features obtain similar performance, which is much better than that of spatiotemporal jets. LHM is comparable to the other methods using local features. The actions in KTH are relatively simple, and the detected local spatiotemporal features provide sufficient information for activity recognition.

B. Hollywood2 Dataset
The Hollywood2 action dataset [49] is collected from 69 different Hollywood movies. In total, there are 1,707 action samples, composed of 823 training samples and 884 testing samples. The authors provide clean and noisy versions of the dataset and we use the clean version. There are 12 action classes: answer-phone, drive-car, eat, fight-person, get-out-car, hand-shake, hug-person, kiss, run, sit-down, sit-up, and stand-up. Some video frames and their detected STIPs are shown in Fig. 4. As all the video clips are segmented from movies, the video quality is very high and there is no camera shaking. The performance is evaluated by average precision following [49], and the codebook size is set to 4,000.
The final recognition results are shown in Table II. We see that the Hollywood2 dataset is more difficult than the KTH dataset, and our method obtains a mean average precision of 48.1%. For action classes such as drive-car and fight-person, LHM performs relatively well and obtains average precisions larger than 70%. However, for the remaining action classes, the recognition rate is relatively low. The videos are all extracted from realistic movies and the intra-class variance is very large compared with the KTH dataset.
Comparison with Other Methods. We compare our method with three other methods: the BoVW model [19] (baseline), the context model [49], and the Convolutional Gated RBM (GRBM) [51]. The context model exploits static scenes as a cue for action recognition. The Convolutional Gated RBM aims to learn features directly from the video intensity with deep models. The BoVW baseline is implemented by ourselves and uses the same features and the same codebook as LHM. We find that its result is similar to that of a recent empirical study of local features [18]. From the comparison, we observe that our method outperforms the other two methods in 8 action classes. In terms of mean average precision, our method is higher than the traditional BoVW by 2.8% and than GRBM by 1.5%.

C. Olympic Sports Dataset
The Olympic Sports Dataset is collected by [12] and has 16 sports classes: basketball-layup, bowling, clean-and-jerk, discus-throw, diving-platform, diving-springboard, hammer-throw, high-jump, javelin-throw, long-jump, pole-vault, shot-put, snatch, tennis-serve, triple-jump, and gym-vault. All the videos are from YouTube and each activity class contains a complex temporal structure compared with the activities in the KTH and Hollywood2 datasets. Note that the authors only release part of their dataset on their website. There are 649 sequences for training and 134 sequences for testing. We conduct experiments according to the settings released on their website. In order to compare our method with those proposed by [12] and [14], we use the same feature representation and the codebook size is 1,000. The final performance is evaluated by computing the average precision (AP) for each of the action classes and reporting the mean AP over all the classes (mAP).
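The evaluation protocol can be sketched as follows. AP has several slightly different definitions in the literature; this sketch assumes the non-interpolated variant (mean of the precision values at the rank of each positive sample), which may differ from the exact evaluation script used for the reported numbers. The function names are hypothetical.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP: mean precision at the rank of each positive sample."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels == 1)                    # positives retrieved up to each rank
    ranks = np.arange(1, len(labels) + 1)
    precision_at_hits = (hits / ranks)[labels == 1]
    return float(precision_at_hits.mean()) if precision_at_hits.size else 0.0

def mean_average_precision(score_matrix, true_classes):
    """One-vs-all mAP: column c of score_matrix holds the scores of the class-c classifier."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    true_classes = np.asarray(true_classes)
    aps = [average_precision(score_matrix[:, c], (true_classes == c).astype(int))
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))

print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In the example, the positives are ranked first and third, so the AP is the mean of 1/1 and 2/3.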

Our experimental results are shown in Table III and Fig. 8. Our method obtains a relatively high performance on the Olympic Sports Dataset, with mAP = 69.2%. For some activity categories, such as basketball-layup, diving-springboard, diving-platform, and gym-vault, our method performs well and achieves average precision above 90%. As shown in Fig. 8, our model can automatically divide gym-vault into three sub-activities: running, rolling in the air, and landing; long-jump into three sub-activities: starting running, speeding up, and jumping; and clean-and-jerk into three sub-activities: beginning, clean phase, and overhead jerk phase. The duration of each sub-activity varies and adapts to each activity video. Each sub-activity is further decomposed into more primitive sub-activities in the bottom layer.
However, for some activity categories, such as tennis-serve, high-jump, triple-jump, and discus-throw, our method performs poorly and the average precision is low. We analyze the reasons as follows. Firstly, we find that there is strong confusion among some activity categories. For example, triple-jump, long-jump, and high-jump are highly similar: the three activities share sub-activities such as running and jumping. For activities such as hammer-throw, discus-throw, and shot-put, the whole processes of the activities [...]
[...] [23] and two kinds of temporal models [12], [14] in Table III. The method of [12] models the temporal structure of decomposable motion segments and formulates the problem in a similar framework. The model of [14] is based on the variable-duration hidden Markov model and achieves the state-of-the-art performance with local features on this dataset.
From the comparison, the proposed LHM achieves higher average precision for 10 of the 16 activity classes. For mean average precision, our LHM is higher than the baseline by 11% and than the state-of-the-art by 2.4%. These results show that hierarchical decomposition of sub-activities and automatic adaptation of starting and ending time points are effective for complex activity classification.

D. Further Explorations
Our LHM provides a general framework for hierarchically modeling the temporal structure of complex activity. In this section, we study different aspects of LHM in more detail. Firstly, we explore different structure settings and their influence on the final recognition performance. Secondly, we investigate the effectiveness of latent variables by comparing the recognition performance of LHM with that of temporal pyramids [52]. Temporal pyramids decompose each video into segments of equal duration, while LHM automatically aligns videos by efficient search in the latent variable space. Then, we investigate the inference efficiency of the proposed cascade algorithm. Finally, we incorporate denser and richer local features based on dense trajectories [22] into LHM to boost the final recognition performance.
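The contrast between equal-duration segmentation and a search over segment boundaries can be illustrated with a small sketch. It is not the cascade inference of LHM: it is a generic dynamic program over a single layer, and it assumes a hypothetical per-frame score `frame_scores[t, k]` for assigning frame t to sub-activity k, whereas the actual segment scores of LHM come from the learnt model. All function names are hypothetical.

```python
import numpy as np


def uniform_boundaries(n_frames, n_parts):
    """Equal-duration split, as in a temporal pyramid layer."""
    return [(i * n_frames) // n_parts for i in range(n_parts + 1)]


def segmentation_score(frame_scores, bounds):
    """Sum of per-frame scores when part k covers frames [bounds[k], bounds[k+1])."""
    frame_scores = np.asarray(frame_scores, dtype=float)
    return float(sum(frame_scores[bounds[k]:bounds[k + 1], k].sum()
                     for k in range(len(bounds) - 1)))


def best_boundaries(frame_scores, min_len=1):
    """Search the boundaries (the latent variables in this toy setting) by DP.

    frame_scores: (T, K) array of hypothetical frame-to-part scores.
    Returns the best total score and the K + 1 boundaries.
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    T, K = frame_scores.shape
    if T < K * min_len:
        raise ValueError("video too short for the requested number of parts")
    # prefix[t, k] = sum of frame_scores[:t, k], so a segment score is O(1).
    prefix = np.vstack([np.zeros((1, K)), np.cumsum(frame_scores, axis=0)])
    best = np.full((K + 1, T + 1), -np.inf)
    back = np.zeros((K + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, K + 1):
        for end in range(k * min_len, T + 1):
            for start in range((k - 1) * min_len, end - min_len + 1):
                if best[k - 1, start] == -np.inf:
                    continue
                cand = best[k - 1, start] + prefix[end, k - 1] - prefix[start, k - 1]
                if cand > best[k, end]:
                    best[k, end] = cand
                    back[k, end] = start
    # Backtrack from the last frame to recover the boundaries.
    bounds = [T]
    for k in range(K, 0, -1):
        bounds.append(int(back[k, bounds[-1]]))
    return float(best[K, T]), bounds[::-1]


# Toy video: 10 frames whose three sub-activities last 2, 5, and 3 frames.
toy = np.zeros((10, 3))
toy[0:2, 0] = 1.0
toy[2:7, 1] = 1.0
toy[7:10, 2] = 1.0
score, bounds = best_boundaries(toy)
```

On this toy input the searched boundaries follow the unequal durations, while `uniform_boundaries(10, 3)` cuts two of the sub-activities in the wrong place and therefore obtains a lower `segmentation_score`.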
Hierarchical Model is Better. To explore how the performance of LHM depends on the model structure, we conduct additional experiments on the Olympic Sports Dataset. We choose three other structure settings, 1−2−4, 1−4, and 1−9, and the results are shown in Fig. 6. The structure 1−3−9 obtains the best performance (69.2%), followed by 1−2−4 (67.7%). The deep structures are better than the shallow ones: 1−9 (63.4%) and 1−4 (66.0%). We conclude that a deep structure is useful for complex activity classification. LHM models the decomposition of complex activity into sub-activities in a coarse-to-fine manner. The deep structure provides extra descriptive power to LHM and contributes to more accurate alignment of different video samples. The comparison results indicate that a hierarchical model is better for activities with complex and long temporal structure.
Latent Model is Better. We also implement the temporal pyramid representation on the Olympic Sports Dataset. For different structures, we compare the recognition performance of LHM with that of temporal pyramids [52], which use fixed temporal segmentation; the results are depicted in Fig. 6. We observe that LHM performs much better than temporal pyramids: 1−3−9 (69.2% vs. 54.2%), 1−9 (63.4% vs. 52.1%), 1−2−4 (67.7% vs. 58.9%), and 1−4 (66.0% vs. 59.0%). These results indicate that a model with latent variables, which are determined adaptively for different videos, can describe complex activity more effectively. Besides, we observe that the recognition rates of the temporal pyramid representation are similar to or even lower than those of the traditional BoVW method. This implies that if there are strong temporal displacements among different videos, the temporal pyramid representation may harm the final performance. This observation can be ascribed to the fact that the assumption of approximate temporal correspondence underlying the temporal pyramid may not hold for the training and testing samples.
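To make the fixed segmentation of the temporal pyramid baseline concrete, the following sketch builds a concatenated BoVW histogram for a given structure such as 1−3−9. It is an illustrative assumption of how such a representation is commonly built, not the implementation used in the experiments: the function name, the input layout (one frame index and one visual-word index per local feature), and the L1 normalization of each segment histogram are all hypothetical choices.

```python
import numpy as np


def temporal_pyramid(frame_idx, word_idx, n_frames, codebook_size, levels=(1, 3, 9)):
    """Concatenate per-segment BoVW histograms over equal-duration segments.

    frame_idx: frame index of each local feature, in [0, n_frames).
    word_idx:  visual-word index of each local feature, in [0, codebook_size).
    levels:    number of equal segments in each layer, e.g. (1, 3, 9).
    """
    frame_idx = np.asarray(frame_idx, dtype=int)
    word_idx = np.asarray(word_idx, dtype=int)
    blocks = []
    for n_seg in levels:
        # Fixed assignment: the segment depends only on the frame index.
        seg = np.minimum(frame_idx * n_seg // n_frames, n_seg - 1)
        for s in range(n_seg):
            hist = np.bincount(word_idx[seg == s], minlength=codebook_size).astype(float)
            total = hist.sum()
            # L1-normalize non-empty segments (an assumed normalization).
            blocks.append(hist / total if total > 0 else hist)
    return np.concatenate(blocks)


# Toy example: 4 local features in a 9-frame video, codebook of 4 words.
feat = temporal_pyramid([0, 1, 4, 8], [0, 1, 2, 3], n_frames=9,
                        codebook_size=4, levels=(1, 3))
```

Because the segment of each feature is fixed by its frame index, shifting the same sub-activity by a few frames can move its features into a different block of the vector, which is the sensitivity to temporal displacement discussed above.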
Efficiency of Cascade Inference. We explore the efficiency of cascade inference. For a 300-frame video, the number of segments that need to be evaluated during inference with and without the cascade for the 1−3−9 structure is shown in [...]
[...] [19] with HOG/HOF descriptor [23] is a common choice for local features. However, Wang et al. proposed a much denser and richer feature called dense trajectory [22], which turns out to be effective in capturing the motion and appearance information for human activity recognition. From the experimental results shown in Table IV, BoVW with dense trajectory features obtains much better results than with STIP features.
In this part, we explore incorporating dense features into LHM, which combines the richness of low-level features with the descriptive and flexible power of LHM to further boost recognition performance. In the experiment, we use four kinds of descriptors (HOG, HOF, MBHX, and MBHY) with a codebook size of 4,000. We obtain recognition performance of 59.9% on the Hollywood2 Dataset and 83.2% on the Olympic Sports Dataset. For the Hollywood2 Dataset, the action types are relatively simple and have no complex temporal structure. Thus, LHM yields only a slight improvement over BoVW. However, for the Olympic Sports Dataset, the advantage of LHM is more evident, and LHM obtains a considerable performance improvement. Recently, Wang et al. [53] further improved their recognition performance by incorporating structural information into the BoVW framework with spatiotemporal pyramids (STP) and obtained the best results on the two datasets. Our LHM with dense trajectory features is comparable to the best results on the Hollywood2 Dataset and much better than the best results on the Olympic Sports Dataset, even though we do not consider any spatial information in our model. In conclusion, dense features are richer and more effective than STIPs. With dense features, we can further boost the recognition performance of LHM and obtain state-of-the-art results on the challenging Hollywood2 Dataset and Olympic Sports Dataset.
[Figure caption] The numbers denote the indexes of frames in the video. Each video is decomposed into sub-activities in a 1−3−9 structure. In each layer, the video is divided into several segments, whose durations are determined in inference. Each segment corresponds to a learnt sub-activity, where the colored lines indicate the durations of sub-activities. From the result, we see that our LHM can automatically decompose a complex activity into several sub-activities hierarchically. Each complex activity video is represented as a whole segment in the root node and is divided into several sub-activities in the middle layer. Each sub-activity is further decomposed into more primitive actions in the bottom layer.

VI. CONCLUSION
This paper has proposed a Latent Hierarchical Model (LHM) for classifying complex activities. LHM is a hierarchical model with a deep structure, which decomposes an activity into sub-activities in a coarse-to-fine manner. We develop a latent learning algorithm to estimate the parameters of LHM. We also present a cascade inference algorithm to improve the efficiency of activity classification. The starting and ending time points of each sub-activity, indicated by latent variables, are determined automatically in the inference process. LHM is flexible and effective in dealing with the duration variation and temporal displacement of each sub-activity. The experimental results show that the proposed method with dense features achieves recognition performance superior or comparable to that of previous methods on two challenging action datasets: the Hollywood2 and the Olympic Sports. In particular, LHM is more suitable for activities with longer and more complex temporal structure, for which it gains considerable improvement in recognition performance.