Multi-feature combinatorial optimization for keyframe extraction

Abstract: Recent advances in network and multimedia technologies have facilitated the distribution and sharing of digital videos over the Internet. Long videos often have highly complex content, and it is challenging to cover that content with as few frames as possible without losing too much information. There are at least two ways to describe complex video content with minimal frames: keyframes extracted from the video, or a video summary. The former emphasizes covering the whole video content as completely as possible; the latter emphasizes covering the content of interest. Consequently, keyframes are widely used in many areas such as video segmentation and object tracking. In this paper, we propose a keyframe extraction method based on multiple features via a novel combinatorial optimization algorithm. Keyframe extraction is modeled as a combinatorial optimization problem. A fast dynamic programming algorithm based on a forward non-overlapping transfer matrix, running in polynomial time, and a 0-1 integer linear programming algorithm based on an overlapping matrix are proposed to solve our maximization problem. To evaluate our approach quantitatively, a long-video dataset named 'Animal world' was constructed by the authors, and segmentation evaluation criteria are introduced. Good results are achieved on the 'Animal world' dataset and the publicly available Keyframe-Sydney (KFSYD) dataset [1].

video segments. The longest non-overlapping segments can cover the content of the video as completely as possible. In addition, a combinatorial optimization problem without thresholds or a pre-specified number of keyframes was solved by either a fast dynamic programming approach [29] or a 0-1 integer linear programming approach [30].
iii) Optimization problem: One main optimization problem is the set cover problem. Set cover problem: Given a universe U = {e_1, e_2, ..., e_n} of n elements, a collection of subsets of U, S = {S_1, S_2, ..., S_k}, and a cost function c : S → Q^+, find a minimum-cost subcollection of S that covers all elements of U.
The minimum set cover problem can be formulated as the following integer linear program:

min c = w^T y, s.t. y^T 1{e_i ∈ S} ≥ 1, ∀i ∈ {1, 2, ..., n}, y ∈ {0, 1}^k, (1.1)

where w is the weight vector and y is a binary vector: y_i = 1 denotes that S_i is selected and y_i = 0 otherwise. 1{e_i ∈ S} is an indicator vector of k dimensions; if e_i is in S_j, then the j-th element of 1{e_i ∈ S} is 1, and 0 otherwise. The set cover problem is NP-complete [31]. It has been widely applied in saliency detection [32] and keyframe extraction [1,33]. In [32], social group candidate selection was formulated as a set cover problem, represented as a quadratic integer program and solved with a branch-and-bound method. In [1], a global keypoint pool is formed by matching keypoints, and the keypoints of the keyframes should cover the global keypoint pool as much as possible; this was formulated as a variation of the set cover problem and tackled approximately with a greedy algorithm. The set cover problem was also employed in [33], where a suboptimal solution to the minimum covering problem was found using parts of the Quine-McCluskey algorithm. The disadvantage of that approach is obvious: it obtains only a suboptimal solution rather than the optimal one. Closest to our approach is the combinatorial optimization problem. Combinatorial optimization consists of finding an optimal object from a finite set of objects [34]. Chang et al. [33] constructed a tree-structured keyframe hierarchy in which a video segment for a given degree of fidelity is represented by a compact set of keyframes, and used a depth-first search scheme with pruning to enable efficient content-based retrieval.
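For illustration, the greedy heuristic adopted in [1] for (a variant of) this NP-complete problem repeatedly picks the subset with the best cost per newly covered element. The instance below is a made-up toy, not taken from the paper:

```python
# Greedy approximation for weighted set cover (a logarithmic-factor
# heuristic; the exact ILP is NP-complete). Toy instance, illustrative only.

def greedy_set_cover(universe, subsets, costs):
    """Pick subsets until the universe is covered, each time choosing the
    subset with the lowest cost per newly covered element."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # cost-effectiveness: cost / number of new elements covered
        best = min(
            (i for i in range(len(subsets)) if subsets[i] & uncovered),
            key=lambda i: costs[i] / len(subsets[i] & uncovered),
        )
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen

U = {1, 2, 3, 4, 5}
S = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
w = [1.0, 1.0, 1.0, 0.5]
print(greedy_set_cover(U, S, w))  # → [3, 0]
```

The greedy choice first takes the cheap subset {4, 5}, then {1, 2, 3}, which together cover U; the heuristic does not always return the optimum, which is why [33]'s suboptimality criticism applies to greedy covers as well.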
Energy minimization or maximization based methods extract keyframes by solving an optimization problem: [35] solves a rate-constrained problem, and [36] extracts a small number of keyframes within a shot by maximizing the divergence between video objects in a feature space. This approach has two advantages: it is intuitive, and it can reach an optimal solution. The main contributions of this paper can be summarized as follows:
• We propose a keyframe extraction method based on multiple features via a novel combinatorial optimization algorithm.
• The proposed method achieves promising performance on the Animal-world and Keyframe-Sydney (KFSYD) datasets. Extensive qualitative and quantitative experiments demonstrate the proposed method's effectiveness.
The rest of the paper is organized as follows. Section 2 introduces the proposed method, including feature extraction, candidate video segment generation, and key segment selection. In Section 3, the optimization methods are presented, including the dynamic programming approach and the 0-1 integer linear programming approach. In Section 4, some related analyses and discussions are presented. Experimental results are shown in Section 5 to verify the proposed approach. Finally, concluding remarks are given in Section 6. Figure 1 shows the main framework of the proposed keyframe extraction algorithm. Our algorithm includes three stages (i.e., feature extraction, candidate video segment generation, and key segment selection). The details are introduced below. Figure 1. Overview of our approach on the KFSYD dataset.

Feature extraction
There is a strong relationship among neighbouring frames. Meanwhile, blurring may exist in the video, caused either by the quick movement of objects in the shot or by jitter of the camera lens; this interferes with shot detection and the segmentation of the video sequence. Illumination variation is another challenge. Errors in keyframe selection can arise when only color or texture features are used to cluster the video. Intuitively, the more features are used, the more detail is captured. In our problem, we use the following features to cluster the video sequence: f1 = HSV histogram, f2 = SIFT [37], f3 = GIST [38], and f4 = PHOG [39]. These features are commonly used in keyframe extraction. These low-level visual features describe the image from various aspects, such as color, shape, and context, which favors an accurate and comprehensive representation of the video's content.
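As an illustration of feature f1 only, the sketch below computes an HSV histogram with 16 bins per channel (matching the experimental settings later). It is pure Python for clarity; a real pipeline would use an image library, and pixel values are assumed to be RGB in [0, 1]:

```python
# Sketch of the HSV-histogram feature f1: 16 bins per channel, channels
# concatenated into one 48-dimensional descriptor. Illustrative only.
import colorsys

def hsv_histogram(pixels, bins=16):
    hist = [0] * (3 * bins)
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)  # all components in [0, 1]
        for c, value in enumerate((h, s, v)):
            # clamp value == 1.0 into the last bin
            hist[c * bins + min(int(value * bins), bins - 1)] += 1
    n = len(pixels)
    return [x / n for x in hist]  # normalize so frames are comparable

frame = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]
h = hsv_histogram(frame)
print(len(h), abs(sum(h) - 3.0) < 1e-9)  # → 48 True (mass 1 per channel)
```

Per-frame descriptors like this one are what the subsequent k-means clustering operates on.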

Candidate video segment generation
The candidate video segment generation includes the following three steps, performed in order.
• Step 1: For each feature, k-means clustering is used to obtain the initial video segments by clustering the video sequence.
• Step 2: The initial video segments are further pruned to remove invalid segments. For example, as shown in Figure 2, the video sequence is clustered into four clusters with the GIST feature according to Step 1. We obtain five initial video segments, where the first and last segments belong to the same cluster c_1. In Step 2, the second video segment, which belongs to cluster c_2, is pruned since its frame length is less than 10; such a segment is unlikely to compose a complete shot. The remaining video segments are labeled s_1 ∼ s_4.
• Step 3: We mix the segment set {s_n^v}, v = 1, ..., V, n = 1, ..., N_v, where V is the number of features and N_v is the number of segments generated by the v-th feature in Step 2. More than two segments in this set may share the same beginning and ending times; in that case, only one of these duplicated segments, chosen at random, ultimately remains in the segment set.
The remaining video segments are called candidate segments. These candidate segments are sorted by beginning time and then divided into D groups that do not overlap in time. The whole process of generating candidate video segments corresponds to Stage 2 of Figure 1.
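A minimal sketch of Steps 1-3, with the per-feature k-means clustering assumed already done so that each feature contributes a sequence of frame-level cluster labels (names and toy data are illustrative):

```python
# Sketch: turn per-feature cluster-label sequences into candidate segments.
from itertools import groupby

def label_runs(labels):
    """Split a label sequence into contiguous (start, end) runs (end exclusive)."""
    runs, start = [], 0
    for _, grp in groupby(labels):
        n = len(list(grp))
        runs.append((start, start + n))
        start += n
    return runs

def candidate_segments(label_seqs, min_len=10):
    segments = set()                      # Step 3: dedupe identical (start, end)
    for labels in label_seqs:             # one label sequence per feature
        for s, e in label_runs(labels):   # Step 1: initial segments
            if e - s >= min_len:          # Step 2: prune short segments
                segments.add((s, e))
    return sorted(segments)               # sorted by beginning time

f1 = [0] * 12 + [1] * 3 + [0] * 15       # the 3-frame run is pruned
f2 = [2] * 12 + [3] * 18                 # (0, 12) duplicates f1's first run
print(candidate_segments([f1, f2]))      # → [(0, 12), (12, 30), (15, 30)]
```

Deduplication here keeps one copy deterministically via the set; the paper keeps one duplicate at random, which is equivalent when the duplicates share identical boundaries.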

Key segment selection
The candidate segments in the d-th group are denoted as a set S_d = {s_j}, j ∈ {1, ..., n}, where n is the number of candidate segments in the group. The segments selected from the candidate segments are called key segments.
Combinatorial optimization problem: The target of our problem is to find the optimal subset S*_d that covers the video sequence as completely as possible. S*_d has the property that no two segments selected from the d-th group overlap in time; S*_d is the set of key segments that we seek. The problem then becomes how to find the longest non-overlapping segments in each group. For the d-th group, it can be formulated as follows:

max_{q_1, ..., q_k} Σ_{i=1}^{k} l_{q_i}, s.t. s_{q_i} ∩ s_{q_j} = ∅, ∀i ≠ j, q_i ∈ {1, 2, ..., n},

where n is the number of candidate segments in the d-th group and q_i is the index of the i-th selected segment in this group. The i-th element l_i of the column vector L = (l_1, ..., l_n)^T denotes the number of frames of the i-th candidate segment, and k denotes the number of key segments selected from the current group. The first constraint denotes that the selected segments s_{q_i} and s_{q_j} do not overlap in time.
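On a toy group (illustrative intervals, not from the paper's datasets), the objective can be checked by brute force: choose the pairwise non-overlapping subset of candidate segments whose total frame count is largest:

```python
# Brute-force check of the maximization objective on a toy group.
# Segments are (start, end) with end exclusive; illustrative data only.
from itertools import combinations

def overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]

def best_nonoverlapping(segments):
    best, best_len = (), 0
    for k in range(1, len(segments) + 1):
        for subset in combinations(segments, k):
            if any(overlap(a, b) for a, b in combinations(subset, 2)):
                continue  # violates the non-overlap constraint
            total = sum(e - s for s, e in subset)
            if total > best_len:
                best, best_len = subset, total
    return best, best_len

segs = [(0, 30), (20, 50), (45, 60), (55, 90)]
print(best_nonoverlapping(segs))  # → (((0, 30), (55, 90)), 65)
```

Exhaustive search is exponential in the number of candidates, which is exactly why the following sections develop a polynomial-time dynamic program and a 0-1 ILP formulation.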

Definition of symbols
A forward non-overlapping transfer matrix is defined to represent a directed graph G = (V, E) with a temporal non-overlapping relationship, as shown in Figure 3, where the segments correspond to the nodes V = {v_1, ..., v_6}, and a directed edge e_ij connects node i to node j when the corresponding segments do not overlap and segment s_i is in front of segment s_j. The forward non-overlapping transfer matrix is formulated as an n × n upper triangular matrix P, where n is the number of nodes in the graph. Figure 3. A directed graph with a temporal non-overlapping relationship. A node in the graph denotes a video segment in one group; an edge denotes that the two connected video segments do not overlap.
where P_ij = 1 (i < j) denotes that the i-th and j-th segments have no overlap, and P_ij = 0 if the j-th segment is in front of the i-th segment or the two segments overlap. For any i ∈ {1, 2, ..., n−1}, P_{i,i+1} = 0 (i.e., the elements on the diagonal right above the main diagonal are all zeros). This is because the i-th and (i+1)-th segments overlap with each other; otherwise, these two segments would belong to different groups. Additionally, the element P_ij of matrix P means that there exists a path of length 1 from node i to node j if P_ij = 1 and no such path if P_ij = 0, according to [40]. In the same way, the element P^k_ij of the power matrix P^k means that there exist P^k_ij path(s) of length k from node i to node j. The transfer matrix serves as the building block throughout the following algorithm.
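As a sketch of these definitions, the transfer matrix for a toy group (illustrative intervals, not from the paper's data) can be built directly from segment boundaries, and the path-counting property of its powers checked numerically:

```python
# Forward non-overlapping transfer matrix for a toy group, plus the
# path-counting property of its powers: (P^k)[i, j] counts length-k paths.
import numpy as np

segs = [(0, 30), (20, 50), (45, 70), (65, 90), (85, 120)]  # sorted by start
n = len(segs)
P = np.zeros((n, n), dtype=int)
for i in range(n):
    for j in range(i + 1, n):
        if segs[i][1] <= segs[j][0]:   # segment i ends before segment j begins
            P[i, j] = 1

print(P)                               # upper triangular, zero superdiagonal
print(np.linalg.matrix_power(P, 2))    # counts of 2-step paths
```

Consistent with the text, P is zero on the diagonal just above the main diagonal (adjacent segments in a group overlap), and here P^2 has a single nonzero entry, the one 2-step chain s_1 → s_3 → s_5.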
We define a selection matrix A_k, where k is the number of key segments; the elements of A_k are the total lengths of the selected key segments. We set P^0 = I, where I is the identity matrix. An AND operator for non-negative integer vectors is defined as c = a • b, where c = logical(a) & logical(b), logical(·) denotes element-wise binarization of a vector, and & is the logical AND operator.

Dynamic programming algorithm (DP)
As mentioned previously, the candidate segments in each group can be represented as a graph. The difficulties in solving the maximization problem above include the following: the variables are binary, and the number of key segments is not given in advance and has to be optimized. To deal with these problems, we propose the following dynamic programming algorithm based on the graph. For the d-th subgraph, it proceeds as follows:
• Step 1: Construct the transfer matrix P and initialize the selection matrix A_k = l · 1^T with k = 1, where l is the column vector of segment lengths.
• Step 2: Determine whether P^k is an all-zeros matrix. If not, determine which elements of the i-th row A_k(i, :) are activated based on the activation indicator vector a = P^{k-1}(i, :) • P(:, j), where P^{k-1}(i, :) and P(:, j) denote the i-th row of P^{k-1} and the j-th column of P, respectively. Then, the maximum activated value A_k(i, r) in A_k(i, :) and the length l_r of the corresponding r-th segment are added to form A_{k+1}(i, j). If P^k is all zeros, go to Step 4.
• Step 3: Set k = k + 1 and go to Step 2.
• Step 4: Return the maximum value in {A_1, ..., A_k} and its corresponding key segments.
Figure 4 displays a toy example, a subgraph of graph G, for our dynamic programming. The forward non-overlapping transfer matrix and the initial selection matrix are represented as follows. Figure 4. Toy candidate segments s_1 ∼ s_6 in a group. The length of each segment is shown above the corresponding ellipse.
As mentioned earlier, the forward non-overlapping transfer matrix represents whether there exists a path of step length 1 from one node to a subsequent node. In other words, P(i, j) = 1 denotes that s_i and s_j do not overlap in time. The activation indicator vector a = P^0(1, :) • P(:, 3) = [1 0 0 0 0 0] denotes that only one path, A_1(1, 1), is activated. The segment s_3 is added to the path: A_2(1, 3) = A_1(1, 1) + l_3. In the same way, we can calculate A_2(i, j), where i, j are the indices of the nonzero elements in P. Since A_2 and P have the same data structure, the other elements of A_2 are all zeros. The activated elements are shown boxed in A_1 and A_2, where A_2 is an upper triangular matrix and A_2(i, j) ≠ 0 denotes that two candidate segments are selected.
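Because the segments are sorted by start time, the selection-matrix recursion can be condensed into an equivalent longest-path recursion on the same DAG: the best chain ending at node j extends the best chain over all predecessors i with P(i, j) = 1. The sketch below is a simplification that matches the outcome of the procedure, not its selection-matrix bookkeeping, and uses illustrative intervals:

```python
# Longest non-overlapping chain via a longest-path recursion on the DAG
# of forward non-overlapping segments. Toy intervals, illustrative only.

def longest_chain(segs):
    segs = sorted(segs)
    lens = [e - s for s, e in segs]
    best = lens[:]                  # best[j]: max frames of a chain ending at j
    prev = [-1] * len(segs)
    for j in range(len(segs)):
        for i in range(j):
            # edge i -> j exists iff segment i ends before segment j begins
            if segs[i][1] <= segs[j][0] and best[i] + lens[j] > best[j]:
                best[j] = best[i] + lens[j]
                prev[j] = i
    j = max(range(len(segs)), key=best.__getitem__)
    chain = []
    while j != -1:                  # backtrack the selected key segments
        chain.append(segs[j])
        j = prev[j]
    return chain[::-1], max(best)

segs = [(0, 30), (20, 50), (45, 70), (65, 90), (85, 120)]
print(longest_chain(segs))  # → ([(0, 30), (45, 70), (85, 120)], 90)
```

The nested loops make the per-group cost quadratic in the number of candidates, comfortably within the cubic bound discussed in the complexity analysis.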
Complexity analysis: Our algorithm is parallelizable, and it is easy to predict the iteration step k subject to P^k = 0 for each subgraph. In addition, the transfer matrix P is zero on the first diagonal above the main diagonal and below it for any subgraph, which facilitates accelerating the calculation of P^n, n = 2, 3, ..., k. It is easy to verify that P^n is zero on the (2n − 1)-th diagonal above the main diagonal and below it. Since A_{n+1} and P^n have the same data structure, A_{n+1} can be calculated quickly. The computational complexity is O(m^3), where m is the maximum number of nodes over all subgraphs. Once the longest key segments are found, the frames closest to the corresponding feature centroids of the key segments are selected as keyframes.

0-1 integer linear programming algorithm (0-1 ILP)
The pairwise overlapping relationships of the segments are represented as an overlapping matrix A ∈ {0, 1}^{m×n}, where m is the number of overlapping segment pairs and n is the number of segments. If the i-th segment and the j-th segment (j > i) overlap in time, then the entries of the corresponding row of A at columns i and j are 1. In addition, A · 1 = 2, i.e., exactly one overlapping segment pair is represented in each row of A, where 1 is a column vector of all ones and 2 is a column vector of all twos. For example, for the toy example in Figure 2, the overlapping matrix can be constructed accordingly. We define a selection indicator vector y ∈ {0, 1}^n, where y_i = 1 denotes that the i-th segment is selected and y_i = 0 otherwise. The problem is then formulated as

max L^T y, s.t. A y ≤ 1, y ∈ {0, 1}^n,

where 1 is a column vector of all ones. The inequality constraint means that at most one segment can be selected from each overlapping pair. The built-in MATLAB function bintprog is used to optimize the aforementioned problem. Once the optimal key segments are found, the frames closest to the corresponding feature centroids of the key segments are selected as the keyframes.
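On a toy instance (illustrative intervals), this 0-1 ILP can be checked by exhaustive search over y ∈ {0, 1}^n. The paper uses MATLAB's bintprog; any 0-1 ILP solver, or brute force at this scale, gives the same optimum:

```python
# Brute-force solution of the toy 0-1 ILP: maximize L^T y subject to
# A*y <= 1 row-wise, where each row of A flags one overlapping pair.
from itertools import product

segs = [(0, 30), (20, 50), (45, 70), (65, 90)]   # illustrative intervals
L = [e - s for s, e in segs]
pairs = [(i, j) for i in range(len(segs)) for j in range(i + 1, len(segs))
         if segs[i][1] > segs[j][0]]             # overlapping pairs -> rows of A

best_y, best_val = None, -1
for y in product((0, 1), repeat=len(segs)):
    if all(y[i] + y[j] <= 1 for i, j in pairs):  # A*y <= 1, row by row
        val = sum(l * yi for l, yi in zip(L, y)) # objective L^T y
        if val > best_val:
            best_y, best_val = y, val
print(best_y, best_val)  # → (0, 1, 0, 1) 55
```

Several selections tie at 55 covered frames here, which illustrates the observation below that the optimal combination of key segments is not unique.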

Relation to set cover problem and shortest path problem
The main differences between our problem and the set cover problem, as well as the shortest path problem, are described as follows.
There are four main differences between our combinatorial problem and the set cover problem: 1) the subsets in S have no order in the set cover problem, but they are sorted in time order in our problem; 2) all the elements of the universe must be covered in the set cover problem, but they should be covered as much as possible in our problem; 3) the subsets must not overlap with each other in our problem, while the set cover problem has no such constraint; and 4) the set cover problem is a minimization problem, but ours is a maximization problem.
Shortest path problem: In graph theory, the shortest path problem is the problem of finding a path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized. The main difference between our problem and the shortest path problem is that the start node and the end node are not given in our problem; they must be inferred by our proposed combinatorial optimization.

Comparisons between DP and 0-1 ILP
The 0-1 integer linear programming (ILP) approach is very time-consuming when the dimensions of the constraint matrix are large, but it is very easy to implement. It takes about 10 minutes to select key segments from one group's candidate segments on the Animal-world dataset with the 0-1 ILP approach, while the DP approach takes less than 20 s to extract the same length of key segments. For the longest key segments, the combination of key segments is not unique. The dynamic programming (DP) approach searches for optimal key segments from smaller combinations of candidate segments to larger ones; in this process, it selects the minimal combination of candidate segments with the longest sequence coverage. The 0-1 ILP approach, however, selects more key segments than the DP approach for the same length of sequence coverage. Consequently, the average gap between neighboring keyframes extracted by the DP approach is larger than that of the 0-1 ILP approach; a larger average gap reduces the redundancy between neighboring keyframes, and longer key segments cover the representative content of the video more completely. In the experimental part, the DP approach is utilized to extract the key segments.

Specifying the number of keyframes
When the number of keyframes is specified as k, all of the video segments are treated as a whole: in both the DP approach and the 0-1 ILP approach, the forward non-overlapping transfer matrix covers all of the video segments. For the DP approach, the element P_{i,i+1} = 1 if the i-th and (i+1)-th segments are in different groups. The selection matrix A_k is obtained after running the dynamic programming algorithm, and the combinatorial segments corresponding to the maximum value in A_k are the desired key segments. For the 0-1 ILP approach, a similar overlapping matrix is constructed, and the new constraint y^T 1 = k means that exactly k segments are selected as key segments.

Experiment
To evaluate the performance of the proposed keyframe extraction algorithm, two types of experiments were performed. The first compared video segmentation performance given the groundtruth of keyframes on the Animal-world dataset. The second was a quantitative evaluation on the KFSYD dataset [1].

Databases and experimental settings
Databases: 1) The Animal-world dataset is a long video sequence of 8000 frames constructed by ourselves. The resolution of the frames is 352 × 288. It contains six different types of objects: leopard, rhinoceros, hyena, zebra, elephant, and human. We chose 36 shots, comprising 4213 frames of the Animal-world dataset, to evaluate our experiment. Instances of the same object class in the same frame were given different labels according to the order of their emergence. The groundtruth was annotated frame by frame, which took almost two months to complete. The dataset contains frequent shot changes, visual transitions (such as dissolves), occlusions, and complex scenes. 2) The KFSYD dataset [1] was constructed from the open video project (http://www.open-video.org) for quantitative evaluation. It consists of 10 video shots across several genres. The resolution of the frames is 352 × 240. The groundtruth keyframes of the videos were manually selected by three students with video processing backgrounds. The main differences between these two datasets are listed in Table 1.
Experimental settings: For feature f1, 16 histogram bins were used to represent the feature in each color channel. For feature f2, SIFT features were computed on a uniform grid, and k-means clustering was used to obtain 300 centroids for the bag-of-words representation [41]. The whole video sequence was clustered into C clusters by k-means for each feature. In our experiments, the HSV (hue, saturation, value) histogram had 32 bins; the SIFT (scale-invariant feature transform) descriptor [37], quantized into visual words, had 300 dimensions; the GIST [38] had 512 dimensions; and the PHOG (pyramid histogram of oriented gradients) [39] had 628 dimensions. All the experiments were conducted on an Intel(R) Core(TM) i7-8700K CPU @ 3.70 GHz and implemented in MATLAB.

Comparison of results from video segmentation on Animal-world dataset
Given the groundtruth of keyframes extracted by [42,43], we used the publicly available code of [42]* to segment the whole video. The algorithm can detect shot changes and determine the pixel-level label of each frame by transferring the groundtruth of either the nearest left keyframe or the nearest right keyframe, based on the propagation of errors.
In our approach, the k-means parameter C was experimentally set to 50 for each feature. The ultimate 52 key segments produced by our dynamic programming approach covered 4197 frames, and the mean interval between neighbouring keyframes was 84.1. For the keyframe extraction approach in [42], a cost matrix was calculated with accumulative errors from the pixel flow propagation, and a dynamic programming algorithm was used to extract the keyframes based on the cost matrix; the number of keyframes was set to 52. A Markov Random Field (MRF) model was utilized to achieve the video segmentation. For the keyframe extraction approach in [43], keyframe extraction was formulated as a conditional k-keyframe selection problem, and a dynamic programming algorithm was proposed to search for the optimal k keyframes; the number of keyframes was also set to 52. Given the groundtruth of keyframes extracted by the three different algorithms, a similar MRF-based approach was used to segment the video.
The comparison between an algorithm's output and the groundtruth was performed with four evaluation metrics (i.e., average precision (AP), average recall (AR), average F-score (AF), and average intersection-over-union (AIoU)).
Pre_i = (1/n_i) Σ_{k=1}^{n_i} σ(y_i^k = ĝ_i^k), Rec_i = (1/m_i) Σ_{k=1}^{m_i} σ(y_i^k = ĝ_i^k), where R_i^k is the k-th region of the semantic segmentation in the i-th frame and y_i^k is its label, Ĝ_i^k is the k-th groundtruth semantic region of the i-th frame and ĝ_i^k is its groundtruth label, n_i is the number of segmented objects (including background) in the i-th frame, m_i is the number of groundtruth objects (including background) in the i-th frame, and σ is an indicator function.
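A simplified sketch of such segmentation metrics on toy per-pixel label maps (flattened to 1-D lists here; in this simplification precision and recall collapse to pixel-level agreement, whereas the paper normalizes per region):

```python
# Toy segmentation metrics: pixel agreement plus per-class IoU, averaged.
# Simplified, illustrative version of precision/recall/F-score/IoU.

def frame_metrics(pred, gt):
    classes = set(pred) | set(gt)
    correct = sum(p == g for p, g in zip(pred, gt))
    precision = recall = correct / len(pred)   # pixel-level agreement
    ious = []
    for c in classes:
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        ious.append(inter / union)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f, sum(ious) / len(ious)

pred = [0, 0, 1, 1, 1, 2]   # predicted per-pixel labels (toy frame)
gt   = [0, 0, 1, 1, 2, 2]   # groundtruth per-pixel labels
p, r, f, iou = frame_metrics(pred, gt)
print(round(p, 3), round(iou, 3))  # → 0.833 0.722
```

Averaging these per-frame values over all frames yields the AP, AR, AF, and AIoU figures reported below.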
The results of the video segmentation are shown in Figure 5. As illustrated, our algorithms, the 0-1 integer linear programming approach (ILP-1) and the dynamic programming approach (DP-1), yield much better results. The higher the values of AF and AIoU, the more representative the keyframes. The illumination variation, blurring, and occlusion in the Animal-world dataset have a big impact on keyframe extraction based on the label propagation errors and the appearance transfer in [42]. Though that algorithm can detect shot changes, it may fail in the cases mentioned above.
The video segmentation performance can be further improved by merging the uncovered frames, as shown in Figure 1, into key segments and transferring the appearance model based on flows between neighboring frames within the key segments of our dynamic programming approach, as shown in ILP-2 and DP-2. The video segmentation is then restricted within each segment. According to the segmentation method proposed in Section 3.1 of [42], the unary terms include two parts (i.e., a unary potential based on an appearance model and a unary potential based on the transferred label). The binary term is based on both the appearance and motion similarity of neighbouring pixels within each frame. An MRF method is then used to infer the label of each pixel. As shown in Figure 5, the improvement in segmentation performance is obvious. Figure 5. Comparison of segmentation performance. ILP-1 is our 0-1 integer linear programming approach. DP-1 is our dynamic programming approach. Compared with the original ILP-1 and DP-1, a minor modification is made to the video segmentation process for ILP-2 and DP-2. The number of keyframes is 52. The same MRF segmentation propagation proposed in [42] is used to propagate the groundtruth of keyframes extracted by the different algorithms, including [42,43], ILP-1, and DP-1, but not ILP-2 and DP-2.
Execution times: The computational cost of our approach is dominated by the multiple-feature extraction and clustering, which take about 20 minutes. The dynamic programming optimization itself takes less than 30 seconds, while the 0-1 integer linear programming approach takes more than 20 minutes for key segment selection. The method in [42] takes about 40 hours to compute its cost matrix, and its keyframe selection process takes about five minutes. The method in [43] has an execution time of about 90 minutes.

Quantitative evaluation on KFSYD dataset
A keyframe is considered a true keyframe if it is no more than 15 frames away from a groundtruth keyframe in the KFSYD dataset. A groundtruth keyframe is matched with at most one keyframe. The matching procedure can be achieved by the Hungarian algorithm [44].
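On toy data, the matching rule can be sketched as follows. The maximum one-to-one matching is found here by brute force over assignments (feasible at this scale, and assuming no more extracted keyframes than groundtruth keyframes); the paper uses the Hungarian algorithm [44]:

```python
# Toy sketch of the keyframe matching rule: each groundtruth keyframe is
# matched with at most one extracted keyframe within 15 frames.
from itertools import permutations

def max_matches(extracted, groundtruth, tol=15):
    """Maximum one-to-one matching count (extracted assumed <= groundtruth)."""
    best = 0
    # each injective assignment maps extracted keyframes to distinct GT indices
    for perm in permutations(range(len(groundtruth)), len(extracted)):
        best = max(best, sum(abs(e - groundtruth[j]) <= tol
                             for e, j in zip(extracted, perm)))
    return best

extracted   = [10, 100, 250]      # frame indices of extracted keyframes
groundtruth = [5, 95, 110, 260]   # frame indices annotated by one student
print(max_matches(extracted, groundtruth))  # → 3
```

The matched counts M_ij feed directly into the precision and recall metrics defined next.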
To evaluate the quality of keyframes, we used three evaluation metrics (i.e., average precision (AP), average recall (AR), and average F-score (AF)), defined as follows:

AP = (1/30) Σ_{i=1}^{10} Σ_{j=1}^{3} M_ij / B_i, AR = (1/30) Σ_{i=1}^{10} Σ_{j=1}^{3} M_ij / N_ij, AF = 2 · AP · AR / (AP + AR),

where M_ij denotes the number of keyframes extracted from the i-th video that are matched with the groundtruth keyframes selected by the j-th student for the i-th video, B_i is the number of keyframes extracted from the i-th video by the proposed approach, and N_ij is the number of groundtruth keyframes selected by the j-th student for the i-th video. Figure 6. Quantitative evaluation on the KFSYD dataset [1] in terms of average precision, average recall, and average F-score. Our approach is compared against five state-of-the-art approaches, including clustering [25] and iso-content distance [...]. ILP-1(2) is the 0-1 integer linear programming approach; DP-1(2) is the dynamic programming approach. In ILP-1 and DP-1, the frames closest to the feature centroids of the key segments are selected as keyframes. Compared with ILP-1 and DP-1, the beginning frame of the first key segment and the ending frame of the last key segment are selected as keyframes in ILP-2 and DP-2.
In our approach, the number of clusters for each feature is set to C = 5. As illustrated in Figure 6, our two approaches achieve improved performance with respect to the state-of-the-art clustering approach [25] and the method in [19]. A clustering-based method is unlikely to select either the first or the last frame as a keyframe, yet our approach still gives good results in ILP-1 and DP-1. As shown in Figure 6, the performance of the proposed method is lower than that of the existing KBKS and KBKS-fast methods; the main reason is, again, that the proposed clustering-based method is unlikely to select either the first or the last frame as a keyframe. If the beginning frame of the first key segment and the ending frame of the last key segment are selected as keyframes, as in the method of [42], our approach achieves a further improvement, as in ILP-2 and DP-2. These results show that the keyframes extracted with our approach are effective. The visualization results of DP-1 are shown in Figure 7. Our method covers the primary contents of the video: as shown in Figure 7(a), it captures not only the posture change of an airplane but also the shape change of the clouds. The keyframes extracted by our method also have few redundancies, as shown in Figure 7.

Discussion
In order to explore the influence of the k-means parameter C on the number of keyframes, experiments were conducted on the Animal-world dataset by varying C from 20 to 200 over five intervals. As shown in Figure 7, the parameter C does influence the number of keyframes: as C increases, the video is separated into more segments, the number of candidate segments increases, and the final number of keyframes becomes larger. However, when C exceeds 50, the number of keyframes is far less than C as C increases, which shows that the relation between them is not very strong.
The main advantages of our proposed method are as follows: 1) From the results on the Animal-world dataset, our proposed method adapts to long videos, including those with complex content; 2) Since our proposed method is based on multiple-feature clustering, it is easy to integrate other image descriptors; 3) The number of keyframes can be determined semi-automatically, with some influence from the parameter C, or specified by the user; and 4) Our proposed method is easy to implement and obtains an optimal solution. However, every coin has two sides, and our proposed method is no exception. The main disadvantages or limitations are as follows: (i) The feature extraction and clustering procedures take too much time, so the method is not ideal for real-time applications; with current popular hashing technology [45], the large-scale image clustering problem can be addressed very quickly; (ii) The k-means parameter C has some influence on the number of keyframes and requires human intervention to determine; and (iii) The motion information between frames and the local descriptors within frames have not been taken into consideration. Mining this information is valuable and will require further investigation into integrating it into keyframe extraction.

Conclusions
In this paper, we present a keyframe extraction approach based on a combinatorial optimization problem. A novel dynamic programming approach and a 0-1 integer linear programming approach are developed to solve the combinatorial optimization problem. The experimental results on the Animal-world dataset and the KFSYD dataset demonstrate that the performance of the proposed approach is very promising. More experiments on other large and diverse datasets will be explored in the future.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.