Feature Extraction and Clustering for Static Video Summarization

Numerous limitations of shot based and content based key frame extraction approaches have encouraged the development of cluster based methods. This work proposes OTMW (Optimal Threshold and Maximum Weight), a novel cluster based key frame extraction method. The video feature dataset is constructed by computing the color, texture and information complexity features of frame images. An optimization function, constrained by the fidelity and ratio measures, is developed to compute the optimal clustering threshold ̺. We conduct an empirical study of the proposed method on multi-type video key frame extraction tasks and compare it with popular cluster based methods including Mean-shift, DBSCAN, GMM and K-means. The OTMW method achieves an average fidelity of 96.12 and an average ratio of 97.13. Experimental results demonstrate that OTMW delivers higher fidelity and ratio performance while remaining competitive with other cluster based methods, and can accurately extract key frames from multi-type videos.


Introduction
The concept of video summarization allows users to browse video data in a friendly and quick way without losing key information. Since the 1990s, video summarization technology has been studied extensively because of its simplicity and usefulness [1].
Static video summarization is the collection of representative frames (key frames) extracted from the original video sequence. Key frame extraction methods can be divided into three categories: shot based, content based and cluster based. Several shot based techniques have been developed in computer vision and image processing. Huang C. extracts representative frames from each shot by computing the frame image difference in saliency and edge-map features [2]. Mehmood I. analyzes the difference between frame images in a shot by modeling auditory and perceptual attention features [3]. Song G. H. computes the color difference within one shot by employing the average histogram method [4]. Shot based methods typically segment the original video into several shots first; however, the shot segmentation process is computationally expensive. Content based methods can avoid this problem. Rachida H. proposes MSKVS, a content based method that measures the inter-frame distance by temporal and visual features and achieves superior performance over other content based methods [5]. Gianluigi C. evaluates his content based method on six news and sport-competition videos; the experimental results demonstrate that it effectively extracts key frames [6]. Generally, content based methods analyze the video content by extracting color, texture or motion features. Their limiting factor is the computational cost incurred in processing the frame image features [7]. This limitation encourages the development of cluster based methods, which group similar frames together and extract one representative frame per class. They avoid the shot segmentation error of shot based methods, decrease the inter-frame difference analysis frequency of content based methods, and are well suited to key frame extraction in related fields [8,9,10].
In this paper, a novel cluster based key frame extraction approach is proposed. The benefits of this approach are as follows:
• A video content analysis method is proposed to fully characterize the video information using three visual features: color, texture and information complexity.
• We develop a threshold optimization function to alleviate the task of manually choosing a cluster threshold.
• We fuse the frame density, inter-cluster distance and intra-cluster distance to filter the key frame candidates, and employ the maximum weight factor to further refine them.
The rest of this paper is organized as follows. Section 2 describes the details of cluster based methods. Section 3 presents the implementation of the feature extraction method. Section 4 provides a detailed description of the proposed cluster based key frame extraction method. Section 5 explores the performance of the proposed approach. Finally, the major work is discussed and wrapped up in Section 6.

Related work
Basic cluster based methods fall into two categories: automatic and semi-automatic. In general, a semi-automatic cluster based method requires manually determining the initial cluster centers and the number of clusters. Such methods were widely used early on, but are not applicable to today's abundant videos. Among automatic clustering schemes, Kuanar uses the Delaunay method and analyzes video content by computing color and texture features [8]. Liu and his colleagues calculate the initial cluster centers by employing a hierarchical method [11]. Another work uses a spectral algorithm to cluster the color histograms of frame images [10]. In the artificial intelligence research community, automatic cluster based methods have bright application prospects [9]. The core idea behind such methods is to set a favorable threshold. In the literature, researchers usually obtain the threshold by defining a formula or by setting a fixed value. For example, Kuanar S. K. computes the cluster threshold with the formula 2(1 − ε) [8], Jeong D. J. selects 0.0001 as the cluster threshold [10], and other researchers compute the threshold with self-defined formulas [9,11]. These methods habitually neglect the mutation characteristics of individual videos. For representative frame extraction, some cluster based techniques take the cluster centers or centroids as the representative frames of each class. In [8], the author selects the frames closest to the cluster centroid as the representative frames. In [11,12], the authors directly extract the cluster centers as the representative frames. These methods evaluate the representativeness of the frames in one cluster by a single image feature; however, a single feature cannot fully characterize the frame content and complexity. To address this problem, researchers have proposed variants that compute frame representativeness from multiple features, including entropy, motion information or regions of interest [13,14,15,16,17].

Methods
In this section, we provide a detailed description of the proposed OTMW method, which comprises feature extraction and key frame extraction. First, the color, texture and information complexity features are computed to express the video content. Then, an optimization function is developed to compute the optimal clustering threshold. Next, the frame density, inter-cluster distance and intra-cluster distance are computed and fused into the clustering weight factor. Finally, a Max Weight step extracts the representative frame of each cluster. The proposed approach is summarized in figure 1.

Figure 1: Framework of the proposed cluster based method.

Feature extraction
In this section, we describe the proposed video content analysis method, which distinguishes frames by computing feature data [18]. We extract the color, texture and information complexity features to discriminate different frame images. The feature vector of the i-th frame is constructed as FFV_i = (C_i, T_i, E_i), where C, T and E represent the color, texture and information complexity features, respectively.

Color feature
We take the color feature as the first feature to characterize the difference between frame images. We compute the first, second and third color moments in the H, S and V channels to construct the color feature vectors of frame images. The first color moment reflects the brightness and is calculated by

C_m = (1 / (w·h)) Σ_{p=1..w} Σ_{q=1..h} f_i(x_p, y_q),

where w and h are the pixel width and height, f_i(x_p, y_q) is the pixel value at position (x_p, y_q), 1 ≤ p ≤ w and 1 ≤ q ≤ h. The second color moment reflects the color distribution range and is computed by

C_v = ( (1 / (w·h)) Σ_p Σ_q (f_i(x_p, y_q) − C_m)^2 )^{1/2}.

The third color moment represents the color distribution symmetry and is computed by

C_s = ( (1 / (w·h)) Σ_p Σ_q (f_i(x_p, y_q) − C_m)^3 )^{1/3},

where C_m, C_v and C_s denote the first-moment mean, second-moment variance and third-moment slope (skewness), respectively.
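As a sketch, the three moments above can be computed per channel with NumPy; the signed cube root for the third moment and the per-pixel normalization are assumptions based on the standard color-moment definitions:

```python
import numpy as np

def color_moments(channel):
    """Three color moments of one image channel (H, S or V plane).

    Returns (C_m, C_v, C_s): mean, standard deviation and the signed
    cube root of the third central moment.
    """
    f = channel.astype(np.float64).ravel()
    c_m = f.mean()                           # first moment: brightness
    c_v = np.sqrt(np.mean((f - c_m) ** 2))   # second moment: spread
    c_s = np.cbrt(np.mean((f - c_m) ** 3))   # third moment: asymmetry
    return c_m, c_v, c_s
```

Stacking the three moments of the three HSV planes yields the nine-dimensional color part C_i of the frame feature vector.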

Texture feature
We take the texture feature as another feature to characterize the difference between frame images in terms of surface structural organization [19]. We compute the mean of the angular second moment, contrast, correlation and homogeneity texture features over the 0, 45, 90 and 135 degree directions to construct the frame texture feature vectors. Let p(i, j) denote the normalized gray-level co-occurrence matrix. The angular second moment characterizes the coarseness and gray-distribution uniformity of an image:

ASM = Σ_i Σ_j p(i, j)^2.

The contrast characterizes the groove depth and clarity of an image:

CON = Σ_i Σ_j (i − j)^2 p(i, j).

The correlation characterizes the local gray similarity in the row or column direction:

COR = Σ_i Σ_j (i − μ_x)(j − μ_y) p(i, j) / (σ_x σ_y),

where μ_x, μ_y, σ_x and σ_y are the means and standard deviations of the row and column marginals of p. The homogeneity characterizes the local gray-level uniformity of an image:

HOM = Σ_i Σ_j p(i, j) / (1 + (i − j)^2).
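A minimal NumPy sketch of these four statistics for a single pixel offset follows; the paper averages over four directions, and the formulas here are the standard GLCM definitions, which we assume match the original equations:

```python
import numpy as np

def glcm_features(img, dx=1, dy=0, levels=8):
    """ASM, contrast, correlation and homogeneity from a normalized
    gray-level co-occurrence matrix for one pixel offset (dx, dy).
    img must already be quantized to integer values in [0, levels).
    """
    img = np.asarray(img)
    h, w = img.shape
    P = np.zeros((levels, levels), dtype=np.float64)
    for y in range(h - dy):
        for x in range(w - dx):
            P[img[y, x], img[y + dy, x + dx]] += 1.0
    P /= P.sum()                                 # co-occurrence probabilities
    i, j = np.indices(P.shape)
    asm = np.sum(P ** 2)                         # angular second moment
    contrast = np.sum((i - j) ** 2 * P)
    mu_i, mu_j = np.sum(i * P), np.sum(j * P)
    sd_i = np.sqrt(np.sum((i - mu_i) ** 2 * P))
    sd_j = np.sqrt(np.sum((j - mu_j) ** 2 * P))
    corr = np.sum((i - mu_i) * (j - mu_j) * P) / (sd_i * sd_j + 1e-12)
    homog = np.sum(P / (1.0 + (i - j) ** 2))
    return asm, contrast, corr, homog
```

Averaging the returned tuples over offsets (1, 0), (1, 1), (0, 1) and (-1, 1) reproduces the four-direction mean described above.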

Information complexity feature
We take the information complexity as the last feature to characterize the difference between frame images in aggregated and spatial terms. Information entropy, proposed by Shannon [20], measures the information complexity of an image from a holistic perspective; a larger entropy means a larger degree of internal non-uniformity. The two-dimensional information entropy E_{f_i} can be calculated by

E_{f_i} = − Σ C_f log2 C_f,

where C_f is the occurrence probability of each gray level (paired with its neighborhood mean in the two-dimensional case) in the i-th frame image.
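The one-dimensional version of this entropy can be sketched as follows; the two-dimensional variant additionally pairs each gray level with its neighborhood mean, which is omitted here for brevity:

```python
import numpy as np

def image_entropy(img, levels=256):
    """Shannon entropy of the gray-level distribution of an image.

    C_f is estimated as the normalized histogram; empty bins are
    skipped since 0 * log2(0) is taken to be 0.
    """
    hist = np.bincount(np.asarray(img).ravel(), minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))
```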

Clustering for key frame extraction
In this section, we describe the proposed cluster based key frame extraction method, which develops a new optimization function to compute the optimal threshold.

Threshold optimization
We narrow the search interval of the optimal threshold by computing the function values at trial points. In cluster based methods, the fidelity [6] and ratio [5] are negatively and positively correlated with the threshold, respectively. Therefore, we infer that the quality of the key frames is optimal when the fidelity and ratio are as close as possible. We introduce a new parameter FR to characterize this relationship and to obtain the optimal key frames. The distance between frames x^(i) and x^(j) is computed as the Euclidean distance

d_ij = ‖x^(i) − x^(j)‖,   i, j ∈ {1, 2, ..., m}.

The average distance d_c is the mean of d_ij over all frame pairs. The threshold is defined as

̺ = d_c + ε · std_ij,

where ε is a variable factor and std_ij is the standard deviation of d_ij. We define the new parameter as

FR = fidelity − ratio,

and the threshold optimization function as f(̺) = FR(̺). We compute the optimal threshold by the steps given in the appendix.
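The quantities above can be sketched as follows, assuming the Euclidean metric and the additive form of the threshold; the fidelity and ratio functions themselves are left abstract:

```python
import numpy as np

def pairwise_stats(FFV):
    """Pairwise distances d_ij between frame feature vectors, plus the
    average distance d_c and its standard deviation std_ij over all
    unordered frame pairs."""
    X = np.asarray(FFV, dtype=np.float64)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    d = D[np.triu_indices(len(X), k=1)]      # each pair (i, j), i < j, once
    return D, d.mean(), d.std()

def threshold(d_c, std_ij, eps):
    """Candidate clustering threshold; eps is the variable factor
    searched over [-3, 3] by the appendix procedure."""
    return d_c + eps * std_ij
```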

Initial cluster centers and the number of clusters
In this section, we compute the initial cluster centers and the number of clusters by clustering the data in FFV. The process is shown in figure 2, and the pseudocode of the proposed approach is given in Algorithm 1. Algorithm 1 calculates the density ρ_i of each sample i, the intra-cluster distance ϕ_i between samples i and j of the same class, the inter-cluster distance η_i between sample i and a cluster center j of a different class, and the weight factor ω_i that fuses these three quantities.
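Since the defining equations are not reproduced in this excerpt, the following is only one plausible reading of the density and weight computation, following the density-peaks convention; the multiplicative fusion in `weight_factor` is an assumption:

```python
import numpy as np

def local_density(D, thr):
    """Density rho_i: number of samples within the threshold of
    sample i (the sample itself is excluded). D is the pairwise
    distance matrix."""
    D = np.asarray(D)
    return (D < thr).sum(axis=1) - 1

def weight_factor(rho, eta, phi, eps=1e-12):
    """Hypothetical fusion omega_i: prefer dense samples that are far
    from other clusters (large eta_i) and compact within their own
    cluster (small phi_i)."""
    return np.asarray(rho) * np.asarray(eta) / (np.asarray(phi) + eps)
```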

Key frame extraction
In this section, we classify the frame images into k clusters C = {C_1, C_2, ..., C_k} and extract representative frames from these clusters. The error square sum criterion is used as the criterion function. The frame images are classified into different clusters by employing Algorithm 2. We then calculate the parameters ρ_i, η_i, ϕ_i and ω_i of the clusters C = {C_1, C_2, ..., C_k} and extract as the representative of each cluster the frame with the maximum weight factor.
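A minimal sketch of the Algorithm 2 refinement loop (nearest-center assignment followed by mean updates, stopping when the centers no longer move); the convergence tolerance is an assumption:

```python
import numpy as np

def cluster_ffv(X, centers, max_iter=100):
    """Assign each feature vector to its nearest cluster center and
    recompute the means until the center vectors stop changing."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(centers, dtype=float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
        labels = d.argmin(axis=1)                # lambda_j = argmin_i d_ji
        new_C = np.array([X[labels == i].mean(axis=0)
                          if np.any(labels == i) else C[i]  # keep empty-cluster center
                          for i in range(len(C))])
        if np.allclose(new_C, C):                # centers not updated: stop
            break
        C = new_C
    return labels, C
```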

Results and discussion
In this section, we report an empirical study of the proposed OTMW method on key frame extraction tasks and compare it with the cluster based methods Mean-shift [21], DBSCAN [22], GMM [23] and K-means [24]. We abbreviate these methods and ours as Ms, DB, GM, Km and my, respectively. We report improved performance across the Open Video dataset. We conduct experiments on four different video datasets: surveillance, documentary, TV lecture and phone recording. These videos are publicly shared at https://open-video.org/. In this section, we take the Hcil2000 01 video, a randomly chosen video from the Open Video dataset, as an example to report the performance of the OTMW method.


Extraction of key frame
Key frames are extracted by the OTMW method across the Open Video dataset. The key frames of the Hcil2000 01 video are shown in figure 3, and the fidelity and ratio results are shown in table 2. The fidelity measures of the different videos range from 93 to 98, with an average of 96.12; the ratio measures range from 95 to 98, with an average of 97.13. The extracted key frames are consistent with human judgment.

Comparisons between OTMW and other cluster based methods
We compare OTMW with popular cluster based algorithms in terms of the fidelity and ratio measures. In the experiments, the number of clusters of the semi-automatic cluster based methods is the same as for the OTMW method. The fidelity and ratio results are shown in figure 4 and table 3.

Algorithm 2: Cluster the frame feature values
Input: frame feature values FFV = {x^(1), x^(2), ..., x^(m)}, the initial cluster centers {μ^(1), μ^(2), ..., μ^(k)} and the number of clusters k.
Output: the clusters C = {C_1, C_2, ..., C_k}.
1: repeat
2:   for each sample x^(j) do
3:     compute the distance d_ji between the sample x^(j) and each cluster center μ^(i);
4:     compute λ_j = argmin_i d_ji;
5:     divide the sample x^(j) into the nearest cluster C_{λ_j} = C_{λ_j} ∪ {x^(j)};
6:   end for
7:   for each cluster, update the mean vector if it changed; else keep the current mean vector unchanged;
8: until the current cluster center vectors are not updated

Conclusions
In this paper, an innovative cluster based key frame extraction method is presented for multi-type videos. The method analyzes the video content by extracting color, texture and information complexity features. The threshold optimization function is constrained by the fidelity and ratio measures; by computing the optimal threshold, it avoids the fixed-threshold dependence of traditional cluster based methods. The density ρ_i, inter-cluster distance η_i, intra-cluster distance ϕ_i and weight factor ω_i are used to compute the initial cluster centers and the number of clusters, and to extract representative frames from the k clusters. The method shows promising results on different video datasets. Meanwhile, OTMW achieves competitive and even better fidelity and ratio performance when compared with other cluster based methods. Overall, we find that OTMW is well suited to the key frame extraction problem in static video summarization.

Appendix
The computation of the optimal threshold proceeds as follows. The search bounds are a = d_c − 3 × std_ij and b = d_c + 3 × std_ij.
Step 1: Compute f(a), f(b) and f(c), where c = a + 0.618 × (b − a).
Step 2: If f(a) = f(b) = f(c), turn to Step 4; else turn to Step 3.
Step 3: If f(c) < 0, change c to increase fidelity and decrease ratio, b = c; else change c to decrease fidelity and increase ratio, a = c. Then return to Step 1.
Step 4: Set c = a + 0.382 × (b − a) and return to Step 1.
Step 5: If f(c) = f(a) = f(b) and (b − a) ≤ 0.001 in three successive computations, turn to Step 6; else return to Step 4.
Step 6: Compute the optimal threshold ̺ = (a + b) / 2.
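Under the assumption that f(̺) = fidelity − ratio, the steps above behave like a bracketing search; a simplified sketch (without the three-successive-equality bookkeeping of Steps 2-5) is:

```python
def optimal_threshold(f, a, b, tol=0.001, max_iter=200):
    """Shrink the interval [a, b] using the trial point
    c = a + 0.618*(b - a): when f(c) < 0 the fidelity must rise, so the
    upper bound b is lowered; otherwise a is raised. The midpoint of
    the final interval is returned as the optimal threshold."""
    for _ in range(max_iter):
        if b - a <= tol:
            break
        c = a + 0.618 * (b - a)
        if f(c) < 0:
            b = c        # increase fidelity, decrease ratio
        else:
            a = c        # decrease fidelity, increase ratio
    return (a + b) / 2.0
```

Because fidelity falls and ratio rises with the threshold, f is decreasing, so this bracket always contains the crossing point where the two measures meet.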

Figure 2 :
Figure 2: The computation of initial cluster centers and the number of clusters

Algorithm 1 (fragment): Computation of the initial cluster centers
1: compute the average distance d_c;
2: for each sample i ∈ D do
3:   compute the density ρ_i;
4: end for
5: while D ≠ ∅ do
   ...

Optimization of threshold

The parameters of the Hcil2000 01 video in threshold optimization are shown in table 1. In the Hcil2000 01 video, the average distance d_c and the standard deviation std_ij of d_ij are 2.3107 and 0.2259, respectively. Therefore, the parameters are a = 1.6442 and b = 2.9993, and the variable interval of parameter c is [1.6442, 2.9993]. The parameter c is computed by c = a + 0.618 × (b − a) = 2.481652. In the sixth iteration, f(a) = f(b) = f(c) = 0.0063, so the calculation of c changes to c = a + 0.382 × (b − a) in subsequent iterations. As shown in table 1, the value f(a) = f(b) = f(c) = 0.0063 does not change in the next two iterations; however, b − a = 0.122156 > 0.001. Finally, b − a = 0.000509 < 0.001 in the 13th iteration, so the optimal threshold is ̺ = (a + b)/2 = 1.644709.

Figure 3 :
Figure 3: Results of the video from Open video dataset.

Figure 4 :
Figure 4: The fidelity measure performance of cluster based methods

Figure 5 :
Figure 5: The ratio error-bar of cluster based methods

Figure 6 :
Figure 6: The fidelity results of cluster based methods on different datasets.

The OTMW method achieves a 10.63-12.49 fidelity improvement over the other cluster based methods. The fluctuations of the ratio measure on different videos are shown in figure 5. The OTMW method has a ratio variance of 0.73, while the ratio variances of the Mean-shift and DBSCAN cluster based methods are 22.11 and 11.12, roughly 30 and 15 times larger than that of OTMW, respectively.

Table 2 :
The fidelity and ratio performance of videos. Here Nf represents the number of frames and Nrf represents the number of key frames.

Extraction of key frames on various datasets. To assess the performance of the OTMW method, we consider key frame extraction tasks on the surveillance, documentary, TV lecture and phone recording datasets. The Mean-shift cluster based method achieves average fidelities of 82.42, 86.90, 84.36 and 81.22, respectively, and average ratios of 95.76, 92.73, 94.767 and 99.05. The DBSCAN cluster based method achieves average fidelities of 88.40, 84.07, 81.02 and 89.57, and average ratios of 92.19, 97.37, 97.42 and 92.13. The average fidelities of the K-means cluster based method are 86.39, 83.77, 85.51 and 87.62, and the average ratios of the GMM cluster based method are 86.39, 83.18, 85.94 and 86.45. The OTMW method achieves average fidelities of 97.07, 94.40, 95.87 and 97.54 and average ratios of 96.43, 97.83, 97.01 and 98.37. The fidelity measures on the various datasets are shown in figure 6. The OTMW method achieves a 9.91-11.66 fidelity and 0.91-2.77 ratio improvement over the Mean-shift, DBSCAN, GMM and K-means cluster based methods.

Table 3 :
The ratio measure performance of cluster based methods