Feature fusion and clustering for key frame extraction

Numerous limitations of shot-based and content-based key-frame extraction approaches have encouraged the development of cluster-based algorithms. This paper proposes an Optimal Threshold and Maximum Weight (OTMW) clustering approach that allows accurate and automatic extraction of video summaries. First, the video content is analyzed using image color, texture and information complexity, and a video feature dataset is constructed. Then a Golden Section method is proposed to determine the optimal solution of the threshold function. The initial cluster centers and the cluster number k are obtained automatically by the improved clustering algorithm, and k clusters of video frames are produced with the K-means algorithm. The representative frame of each cluster is extracted using the Maximum Weight method, yielding an accurate video summary. The proposed approach is tested on 16 multi-type videos; the averages of the key-frame quality evaluation indices Fidelity and Ratio are 96.11925 and 97.128, respectively, and the extracted key frames are consistent with human visual judgement. The performance of the proposed approach is compared with several state-of-the-art cluster-based algorithms: Fidelity is increased by 12.49721, 10.86455, 10.62984 and 10.4984375, respectively, and Ratio is increased by 1.958 on average with small fluctuations. The experimental results demonstrate the advantage of the proposed solution over several related baselines on sixteen diverse videos and validate that the proposed approach can accurately extract video summaries from multi-type videos.


Introduction
Video summarization, the task of condensing an original video into a short summary, captures the most salient video information. Since the 1990s, video summarization technology has gained considerable domestic and international attention. It makes a significant contribution to quickly understanding video content and is an efficient tool for fast browsing and retrieval of videos [1].
Key-frame extraction is an indispensable component of static video summarization: it characterizes the principal video contents with a set of representative frames, providing a convenient way to grasp video information quickly and comprehensively. The prevailing goal is to extract a video summary accurately and automatically. Key-frame extraction methods can be divided into three categories: shot-based, content-based and cluster-based [2]. Several shot-based techniques have been developed in the areas of computer vision and image processing [3]. Huang C. extracts representative frames from each shot by computing inter-frame differences in saliency and edge-map features [4]. Mehmood I. analyzes the differences between frames in a shot by modeling auditory and perceptual attention features [5]. Song G. H. computes the color difference within one shot by employing an average-histogram method [6]. Shot-based methods typically segment the original video into several shots first; however, the shot-segmentation process is computationally expensive. Content-based methods avoid this problem. Rachida H. proposes MSKVS, a content-based method that measures inter-frame distance using temporal and visual features; MSKVS outperforms other content-based methods [7]. Gianluigi C. conducts experiments on six news and sport-competition videos with his content-based method, and the results demonstrate that it can effectively extract key frames [8]. Generally, content-based methods analyze video content by extracting color, texture or motion features. Their limiting factor is the computational cost incurred in extracting frame-image features [9]. This limitation encourages the development of cluster-based methods.
Cluster-based techniques work by clustering similar frames together and extracting one representative frame from each class. They avoid the shot-segmentation errors of shot-based methods, decrease the inter-frame difference analysis frequency of content-based methods, and are well suited to key-frame extraction in related fields [10][11][12].
The limitations above have encouraged the development of cluster-based key-frame extraction algorithms [13]. The prevailing steps are to cluster the frame data, aggregate video frames into multiple clusters, and extract representative frames to compose a video summary [14]. Cluster-based key-frame extraction not only avoids shot-segmentation error and complexity, but also decreases the inter-frame difference analysis frequency. However, cluster-based methods still face several problems: a single image feature cannot fully represent a frame; manually determining the number of clusters and the initial cluster centers leads to a low degree of automation; and the extracted key frames may not represent the original video. Therefore, this paper aims to improve the accuracy and automation of key-frame extraction, and a novel cluster-based key-frame extraction approach is proposed. The benefits of this approach are as follows: 1) A video content analysis method is proposed to improve the representativeness of the video feature data by fusing three visual features: color, texture and information complexity. 2) A threshold optimization method is developed to avoid manual selection of the clustering threshold, which improves the automation of cluster-based key-frame extraction. 3) The frame density, inter-cluster distance and intra-cluster distance are fused to filter key-frame candidates, and a maximum weight factor is employed to further refine them, which improves frame representativeness and overall fidelity. The rest of this paper is organized as follows. Section 2 reviews cluster-based methods. Section 3 presents the implementation of the feature extraction method. Section 4 provides a detailed description of the proposed cluster-based key-frame extraction method. Section 5 explores the performance of the proposed approach.
Finally, the major work is discussed and concluded in Section 6.

Related work
It is quite common for cluster-based methods to transform frame images into data points in a feature space and cluster these points to extract key frames. Like general clustering algorithms, they gather similar elements together and take the cluster centers as the representatives of the clusters. Recently, several cluster-based key-frame extraction methods have been proposed in the literature [15,16].
Basic cluster-based methods fall into two categories: automatic and semi-automatic [17]. In general, semi-automatic cluster-based methods require manual determination of the initial cluster centers and the number of clusters. Setting the number of clusters in advance may affect the key-frame extraction results [14]; it is more reasonable to determine the number of clusters from the video content during the clustering process. Therefore, automatic key-frame extraction technology is more practical. Among the automatic schemes, Kuanar extracts color and texture features and proposes an automated key-frame extraction method using dynamic Delaunay graph clustering [10]. In [11], a fused key-frame extraction framework is proposed that generates video summaries by combining sparse selection with an agglomerative hierarchical clustering method based on mutual information; candidate key frames are extracted by an improved MIAHC algorithm, which improves both fidelity and efficiency. In [12], the authors propose coarse-then-fine clustering for key-frame extraction: a traditional spectral clustering method with simple histogram features first removes most of the redundant frames, and an image classification method based on sparse coding of SIFT features then performs fine clustering within each time period. In [19], Liu and his colleagues propose a key-frame extraction method combining the k-means algorithm and hierarchical clustering: initial clusters are obtained with an improved hierarchical clustering algorithm and then optimized with k-means to obtain the optimal clusters. In [20], the authors propose a novel cluster-based algorithm inspired by the density-peak clustering algorithm, which gathers similar frames into classes by integrating important attributes of the video.
In [21], the proposed cluster-based method combines image information entropy with a density clustering algorithm to extract key frames from gesture videos. In cluster-based key-frame extraction, the way the clustering threshold is calculated has a great impact on the fidelity and compression ratio of the key frames; the core idea behind such automatic methods is to set a favorable threshold. In the literature, researchers usually obtain the threshold from a defined formula or a fixed value. Kuanar S. K. computes the cluster threshold with a predefined formula [10]. Jeong D. J. selects 0.0001 as the cluster threshold [12]. Other researchers compute the cluster threshold by self-defined formulas [11,19]. For representative-frame extraction, some cluster-based techniques take the cluster centers or centroids as the representative frames of each class. In [10], the author selects the frames closest to the cluster centroids as the representative frames. In [19,20], the authors directly extract the cluster centers as the representative frames.
In general, these cluster-based methods may retain redundant frames, because the clustering threshold setting influences optimal key-frame extraction. Moreover, these methods evaluate the representativeness of the frames in a cluster by a single image feature, but a single feature cannot fully characterize the content and complexity of a frame. A more reasonable approach is to compute an optimal threshold and fuse multiple features in cluster-based key-frame extraction.

Materials and method
In this section, we provide a detailed description of the proposed OTMW method, which includes feature extraction and key-frame extraction. First, the color, texture and information complexity features are computed to express the video content. Then, an optimization function is developed to compute the optimal clustering threshold. Next, the frame density, inter-cluster distance and intra-cluster distance are computed and fused into a clustering weight factor. Finally, a Maximum Weight method is proposed to extract the representative frame of each cluster. The proposed approach is summarized in Figure 1.

Feature extraction
In this section, we describe the proposed video content analysis method, which distinguishes frames by computing feature data [22]. We extract color, texture and information complexity features to discriminate different frame images. The feature vector of the $i$-th video frame is constructed as $F_i = (C_i, T_i, E_i)$, where $C$, $T$ and $E$ represent the color, texture and information complexity features, respectively.

Color feature
We take the color feature as the first feature to characterize the differences between frame images. We compute the first, second and third color moments in the H, S and V channels to construct the color feature vector of each frame. The first color moment reflects the brightness, and is calculated by
$$C_m = \frac{1}{wh}\sum_{p=1}^{w}\sum_{q=1}^{h} f_i(x_p, y_q),$$
where $w$ and $h$ are the pixel width and height, $f_i(x_p, y_q)$ is the pixel value at position $(x_p, y_q)$, and $1 \le p \le w$, $1 \le q \le h$. The second color moment reflects the color distribution range, and is computed by
$$C_v = \left(\frac{1}{wh}\sum_{p=1}^{w}\sum_{q=1}^{h}\left(f_i(x_p, y_q) - C_m\right)^2\right)^{1/2}.$$
The third color moment represents the symmetry of the color distribution, and is computed by
$$C_s = \left(\frac{1}{wh}\sum_{p=1}^{w}\sum_{q=1}^{h}\left(f_i(x_p, y_q) - C_m\right)^3\right)^{1/3},$$
where $C_m$, $C_v$ and $C_s$ are the first-moment mean, second-moment variance and third-moment skewness, respectively.
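As a minimal NumPy sketch of the three moments above (not the paper's implementation; the function name, the 9-dimensional H/S/V layout, and the signed cube root for the third moment are our own illustrative choices):

```python
import numpy as np

def color_moments(hsv_frame):
    """First (mean), second (std) and third (skew) color moments for each
    of the H, S, V channels of an (h, w, 3) array -> 9-dim feature vector."""
    feats = []
    for ch in range(3):
        x = hsv_frame[..., ch].astype(np.float64).ravel()
        cm = x.mean()                         # first moment: brightness
        cv = np.sqrt(((x - cm) ** 2).mean())  # second moment: distribution range
        s3 = ((x - cm) ** 3).mean()           # third central moment
        cs = np.sign(s3) * np.abs(s3) ** (1.0 / 3.0)  # signed cube root: symmetry
        feats.extend([cm, cv, cs])
    return np.array(feats)
```

For a uniformly colored frame, the second and third moments vanish, so the vector degenerates to the three channel means, as expected.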

Texture feature
We take the texture feature as another feature to characterize the differences between frame images in terms of the structural organization of the image surface [23]. We compute the means of the angular second moment, contrast, correlation and homogeneity texture features over the 0°, 45°, 90° and 135° directions of the gray-level co-occurrence matrix $p(i,j)$ to construct the frame texture feature vector. The angular second moment characterizes the coarseness and the uniformity of the gray distribution, and is calculated by
$$T_{asm} = \sum_{i}\sum_{j} p(i,j)^2.$$
The contrast characterizes the depth and clarity of the texture grooves, and is calculated by
$$T_{con} = \sum_{i}\sum_{j} (i-j)^2\, p(i,j).$$
The correlation characterizes the local gray-level similarity in the row or column direction, and is calculated by
$$T_{cor} = \frac{\sum_{i}\sum_{j} (i-\mu_i)(j-\mu_j)\, p(i,j)}{\sigma_i \sigma_j},$$
where $\mu_i$, $\mu_j$, $\sigma_i$ and $\sigma_j$ are the means and standard deviations of the row and column marginal distributions of $p(i,j)$. The homogeneity characterizes the local gray-level uniformity of the image, and is calculated by
$$T_{hom} = \sum_{i}\sum_{j} \frac{p(i,j)}{1+(i-j)^2}.$$
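Assuming a distance-1 gray-level co-occurrence matrix, the four texture features can be sketched as follows (a NumPy-only illustration; the 8-level quantization and the correlation fallback for constant images are our assumptions, not from the paper):

```python
import numpy as np

def glcm_features(gray, levels=8):
    """ASM, contrast, correlation and homogeneity, averaged over the
    0/45/90/135-degree directions at pixel distance 1."""
    # quantize the 0..255 gray range to a small number of levels
    q = np.clip((gray.astype(np.float64) / 256.0 * levels).astype(int), 0, levels - 1)
    offsets = [(0, 1), (-1, 1), (-1, 0), (-1, -1)]  # 0, 45, 90, 135 degrees
    feats = np.zeros(4)
    h, w = q.shape
    for dy, dx in offsets:
        glcm = np.zeros((levels, levels))
        for y in range(h):
            for x in range(w):
                y2, x2 = y + dy, x + dx
                if 0 <= y2 < h and 0 <= x2 < w:
                    glcm[q[y, x], q[y2, x2]] += 1
        p = glcm / glcm.sum()                 # joint gray-pair probabilities
        i, j = np.indices(p.shape)
        asm = (p ** 2).sum()                  # angular second moment
        contrast = ((i - j) ** 2 * p).sum()
        mu_i, mu_j = (i * p).sum(), (j * p).sum()
        si = np.sqrt(((i - mu_i) ** 2 * p).sum())
        sj = np.sqrt(((j - mu_j) ** 2 * p).sum())
        corr = (((i - mu_i) * (j - mu_j) * p).sum() / (si * sj)) if si * sj > 0 else 1.0
        homog = (p / (1.0 + (i - j) ** 2)).sum()
        feats += np.array([asm, contrast, corr, homog])
    return feats / len(offsets)
```

In practice a library routine such as scikit-image's `graycomatrix`/`graycoprops` computes the same quantities far faster; the loop above only makes the definitions explicit.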

Information complexity feature
We take the information complexity as the last feature to characterize the differences between frame images in terms of aggregated and spatial features. Information entropy, proposed by Shannon, measures information complexity from a holistic perspective [24] and can characterize the aggregated and spatial features of an image: a larger image entropy indicates a greater degree of internal non-uniformity and a higher diversity level. The information entropy $E_i$ of the $i$-th frame can be calculated by
$$E_i = -\sum_{g} C_f(g) \log_2 C_f(g),$$
where $C_f(g)$ is the occurrence probability of gray level $g$ in the $i$-th frame image.
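A minimal sketch of the gray-level entropy described above (here only the gray-level histogram is used; a two-dimensional variant would additionally condition on a neighborhood statistic):

```python
import numpy as np

def frame_entropy(gray):
    """Shannon entropy (in bits) of the gray-level distribution of a frame."""
    hist = np.bincount(gray.astype(np.uint8).ravel(), minlength=256)
    p = hist / hist.sum()
    p = p[p > 0]                     # zero-probability levels contribute nothing
    return float(-(p * np.log2(p)).sum())
```

A frame split evenly between two gray levels has entropy exactly 1 bit, while a uniformly colored frame has entropy 0, matching the intuition that more diverse frames carry more complexity.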

Clustering for key frame extraction
In this section, we describe the proposed cluster-based key-frame extraction method, which develops a new optimization function to compute the optimal threshold.

Threshold optimization
We narrow the search interval of the optimal threshold by computing the function values at trial points. In cluster-based methods, the fidelity [8] and ratio [7] are negatively and positively correlated with the threshold, respectively. Therefore, we infer that the quality of the key frames is optimal when fidelity and ratio are as close as possible. We introduce a new parameter, FR, to characterize this relationship and obtain the optimal key frames. The distance between frames $x_i$ and $x_j$ is computed as the Euclidean distance
$$d_{ij} = \left(\sum_{n}\left(x_i^{(n)} - x_j^{(n)}\right)^2\right)^{1/2},$$
where $x_i^{(n)}$ is the $n$-th feature component of frame $i$.
The threshold is defined as
$$t = \mu \cdot std_{ij},$$
where $\mu$ is a variable factor and $std_{ij}$ is the standard deviation of the distances $d_{ij}$. We define the new parameter FR as
$$FR(\mu) = \left|\,Fidelity(\mu) - Ratio(\mu)\,\right|,$$
and the threshold optimization function as
$$\mu^{*} = \arg\min_{\mu \in [a,b]} FR(\mu).$$
Given an interval $[a, b]$ with $a < b$, the Golden Section method places two trial points $\mu_1 = b - 0.618(b-a)$ and $\mu_2 = a + 0.618(b-a)$, compares $FR(\mu_1)$ and $FR(\mu_2)$, and discards the sub-interval that cannot contain the minimum; this narrowing is repeated until the interval is sufficiently small.
The density $\rho_i$ of frame $i$ is defined as the number of frames whose distance to frame $i$ is smaller than the threshold $t$. The intra-cluster distance of the $i$-th frame is defined as
$$\alpha_i = \frac{1}{|C|-1}\sum_{j \in C,\, j \neq i} d_{ij},$$
where $i$ and $j$ belong to the same cluster $C$. The inter-cluster distance between the $i$-th frame and the other clusters is defined as
$$\beta_i = \min_{j \in \Omega} d_{ij},$$
where $i$ is a frame in one cluster and $\Omega$ is the set of cluster centers that have already been determined; $\Omega$ may contain multiple elements. The weight factor is defined as the product
$$\omega_i = \rho_i \cdot \alpha_i^{-1} \cdot \beta_i.$$
The initial cluster centers and the cluster number $k$ are directly determined by $\omega(x)$; the procedure is shown in Algorithm 2. First, the density of each sample is calculated, and the maximum-density frame is selected as the first initial cluster center $c_1$. The frames whose distance to $c_1$ is less than $t$ are classified into the first cluster, and these frames are removed from the dataset $D$. Then $\omega_i$ is calculated for the remaining frames in $D$, and the second initial cluster center $c_2$ is obtained as the frame maximizing $\omega(x)$. Similarly, the samples satisfying the same condition are classified into the second, third, ..., $k$-th cluster and removed from $D$. Finally, all samples are assigned to clusters, and the initial clusters $C_1, C_2, \ldots, C_k$ and their number $k$ are obtained. The clustering process is shown in Figure 3.
Given the clusters $C_1, C_2, \ldots, C_k$, the representative frames are extracted as follows:
Step 1: Calculate $\rho_i$, $\alpha_i$ and $\beta_i$ for the frames in cluster $C_1$ by Eqs. (13), (14) and (15).
Step 2: Compute the maximum weight factor by Eq. (17) and select the key frame $f_1$ from cluster $C_1$.
Step 3: Similarly, obtain the key frame $f_k$ of cluster $C_k$ by computing the maximum weight factor with Eq. (17).
Step 4: Repeat Step 3 until the representative frames of all clusters are selected.
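The Golden Section search used for threshold optimization can be sketched as follows (a minimal illustration; the quadratic objective in the usage line merely stands in for FR(μ), which in the paper requires a full clustering pass per evaluation):

```python
def golden_section_min(f, a, b, tol=1e-5):
    """Shrink [a, b] around the minimum of a unimodal function f by
    comparing two interior trial points at the golden ratio."""
    r = (5 ** 0.5 - 1) / 2                 # golden ratio conjugate, ~0.618
    x1, x2 = b - r * (b - a), a + r * (b - a)
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:                        # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1
            x1 = b - r * (b - a)
            f1 = f(x1)
        else:                              # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + r * (b - a)
            f2 = f(x2)
    return (a + b) / 2

# usage: minimize a stand-in objective over a hypothetical interval [0, 3]
mu = golden_section_min(lambda m: (m - 1.3) ** 2, 0.0, 3.0)
```

Each iteration reuses one of the two previous trial evaluations, so only one new objective evaluation (one clustering pass) is needed per shrink, which is what makes the method attractive when FR(μ) is expensive.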

Algorithm 3 Cluster the frame feature values
Input: the frame feature values, the initial cluster centers
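The clustering step of Algorithm 3 can be sketched as a standard k-means pass seeded with the initial centers obtained above (a NumPy-only illustration; the function name and convergence test are our own choices, not the paper's implementation):

```python
import numpy as np

def kmeans(features, centers, max_iter=100):
    """Standard k-means: assign each frame feature vector to its nearest
    center, then recompute each center as the mean of its cluster."""
    centers = centers.astype(np.float64).copy()
    for _ in range(max_iter):
        # distances of every sample to every center -> nearest-center labels
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        new_centers = np.array([
            features[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(len(centers))
        ])
        if np.allclose(new_centers, centers):  # centers stopped moving
            break
        centers = new_centers
    return labels, centers
```

Because the centers are seeded by the density/weight-factor procedure rather than at random, the refinement typically converges in a few iterations and the cluster count k needs no manual tuning.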

Results and discussion
In this section, we turn to an empirical study of the proposed OTMW method on key-frame extraction tasks and compare it with state-of-the-art cluster-based key-frame extraction methods: HVM [25], ACSC [26], RGPH [27] and FCME [28]. We report improved performance across the Open Video dataset. We conduct a set of experiments using four different types of videos: surveillance, documentary, TV lecture and phone recording. These videos are publicly shared at https://open-video.org/. In this section, we take the Hcil2000_01 video, chosen at random from the Open Video dataset, as an example to report the performance of the OTMW method.

Optimization of threshold
The parameters of the Hcil2000_01 video in threshold optimization are shown in Table 1.

Extraction of key frames

Comparison with several state-of-the-art methods
We compare OTMW with state-of-the-art cluster-based algorithms in terms of the fidelity and ratio measures. In the experiments, the number of clusters of the semi-automatic cluster-based methods is set to be the same as that of the OTMW method. The fidelity and ratio results are shown in Figure 4. The OTMW method achieves a ratio variance of 0.73, while the ratio variances of the HVM and ACSC cluster-based methods are 22.11 and 11.12, about 30 and 15 times larger than that of OTMW, respectively. The OTMW method achieves a 1.56-2.24 ratio improvement over the other cluster-based methods, with small fluctuation.

Extraction of key frames on various datasets
To assess the performance of the OTMW method, we consider key-frame extraction tasks on the surveillance, documentary, TV lecture and phone recording datasets.

Conclusion
In this paper, an innovative cluster-based key-frame extraction method is presented for multi-type videos. The method analyzes video content by extracting color, texture and information complexity features. The threshold optimization function is constrained by the fidelity and ratio measures; by computing the optimal threshold, it avoids the dependence on a fixed threshold found in traditional cluster-based methods. The frame density, intra-cluster distance, inter-cluster distance and weight factor are used to compute the initial cluster centers and the number of clusters, and to extract representative frames from the k clusters. The method shows promising results on different video datasets. Meanwhile, OTMW achieves competitive and even better fidelity and ratio performance compared with several state-of-the-art cluster-based methods. Overall, we find OTMW well suited to key-frame extraction in the field of static video summarization. However, whether the proposed method also applies in real-life production environments remains to be verified. In the future, we will explore applying the proposed method to real-time video surveillance, and investigate how to integrate it into camera clients and apply it to daily production and life.