Adaptive Edge-Oriented Shot Boundary Detection

We study the problem of video shot boundary detection using an adaptive edge-oriented framework. Our approach is distinct in its use of multiple multilevel features in the required processing. Adaptation is provided by a careful analysis of these multilevel features, based on shot variability. We consider three levels of adaptation: at the feature extraction stage using locally adaptive edge maps, at the video sequence level, and at the individual shot level. We show how to provide adaptive parameters for the multilevel edge-based approach, and how to determine adaptive thresholds for the shot boundaries based on the characteristics of the particular shot being indexed. The result is a fast adaptive scheme that provides slightly better performance in terms of robustness, and a five-fold efficiency improvement in shot characterization and classification. The reported work has applications beyond direct video indexing, and could be used in real-time applications, such as dynamic monitoring and modeling of video data traffic in multimedia communications, and real-time video surveillance. Experimental results are included.


Introduction
Video shot boundary detection (also called video partitioning or video segmentation) is a fundamental step in video indexing and retrieval, and in video data management in general. The general objectives are to segment a given video sequence into its constituent shots, and to identify and classify the different shot transitions in the sequence. Different algorithms have been proposed, based, for instance, on simple color histograms [1,2], pixel color differences [3], color ratio histograms [4], edges [5], and motion [6-8]. In this work, we study the problem of video partitioning using an edge-based approach. Unlike ordinary colors, edges are largely invariant under local illumination changes and are much less affected by possible motion in the video. To ensure robustness, we use both edge-based and color-based features under a multilevel decomposition framework. With the multiple decompositions, we can avoid the time-consuming problem of motion estimation by a careful choice of the decomposition level to operate at. Improvements in video partitioning have been recorded by performing a dynamic classification of the shots as the video is being analyzed, and then adaptively choosing the shot partitioning parameters based on the predicted class of the shot [9]. Automatic shot classification can also serve as an important step in approaching the elusive problem of capturing semantics or meaning in the video sequence (see, e.g., [14]).
We note that the problem of video shot partitioning (or segmentation) is not only relevant to video indexing and video data management (see [11-14] for discussion on video query, browsing, and video object management). It is also an important issue in other areas of video communication, such as video compression and video traffic modeling [15]. In particular, for problems such as video traffic characterization and modeling, shot-level adaptation becomes mandatory if the network is to dynamically allocate limited network resources in response to changing video data traffic.
In this work, we introduce adaptation at different stages in the video analysis process, both at the feature extraction stage and at the later stage of frame difference comparison. We propose a new method for the fast shot characterization and classification required for such adaptation, using a new set of edge-based features. We also introduce a method for automated threshold selection for adaptive scene partitioning schemes. In the next section, we describe recently reported work that is closely related to our approach. Section 3 presents the multilevel edge-response vectors, the basic features we propose for video partitioning. Shot characterization and adaptation in video partitioning in the context of the edge-based features are described in Section 4. Section 5 presents results on real video sequences. We conclude the paper in Section 6.

Related Work
The first step in content-based video data management is shot boundary detection. Simply put, it is the process of partitioning a given video sequence into its constituent shots. The purpose is to determine the beginning (and/or end) of the different types of transitions that may occur in the sequence. The problem of video partitioning is compounded by the various changes that might occur in the video (say, due to illumination, motion, and/or occlusion), and by the different types of shot transitions (such as fades and dissolves). The inherent variability in video shot characteristics, even for shots from the same sequence, introduces further complication. The partitioning algorithm depends on the specific features used and the similarity evaluation functions adopted. Earlier methods for video shot partitioning are described in [2-4, 16-18]. See [19-21] for surveys.
Most approaches to video partitioning make use of the color (or gray level) information in the video. The limitations of color in video partitioning are the problems of illumination variation and motion-induced false alarms. Edge-based methods have thus been proposed to reduce the sensitivity to illumination and motion. Zabih et al. [5] made explicit use of edges in video indexing, and showed how the exiting and entering edges can be used to classify different types of shot breaks. Related methods that exploit edge information for shot detection directly in the compressed domain were proposed in [4,18,22,23]. In [4], color ratio features were proposed as an alternative to color histograms, and were used to identify different types of shot changes without decompressing the video. The motivation was that color ratios capture the color boundaries or color edges in the frames. In [18,23], methods were proposed to extract edges directly from the DCT coefficients, which can then be used for video partitioning. In [22], Abdel-Mottaleb and Krishnamachari described the edge-based information used as part of the descriptions in MPEG-7. Edge descriptors were given as 4-bin histograms, where each bin is for one of the four directions: vertical, horizontal, left-diagonal, and right-diagonal. Other related compressed-domain methods are reported in [9,13,16].
More recent approaches to the video partitioning problem have been proposed in [9, 24-27]. Li and Lai [28] described methods for video partitioning using motion estimation, where the motion vectors are extracted using optical flow computations. To account for potential changes in the lighting conditions, the optical flow computations included a parameter to model the local illumination changes during motion estimation. Cooper et al. [25,29] proposed partitioning techniques that exploit possible self-similarity in the video, by classifying temporal patterns in the video sequence using kernel-based correlation. Li and Lee [26] studied video partitioning, with special emphasis on gradual transitions. Yoo et al. [27] studied both gradual and abrupt shot transitions, and proposed methods based on localized edge blocks. For abrupt shot boundaries, they proposed a correlation-based method, based on which localized edge gradients are then used for detecting gradual shot transitions.
The need for adaptation in the video indexing process was first identified in [30] (see also [9]), where it was shown that video shots vary considerably from one shot to the other, even for shots that come from the same video sequence. It was thus suggested that the results of an indexing scheme could be improved by treating different shots differently, for instance, by use of a different set of analysis parameters. Since then, there has been increasing attention to the problem. In [31], detailed experiments were carried out using television news video. It was concluded that the selection of similarity thresholds was a major problem, and hence there is a need for adaptive thresholds to capture the different characteristics of broadcast news video. Vasconcelos and Lippman [10,32] considered the duration of video shots, and showed that the shot duration can be used to predict the position of a new shot partition, and that the shot duration depends critically on the video content. They used a statistical model of the shot duration to propose shot break thresholds. By classifying video shots in terms of the shot complexity and shot duration, and then performing indexing adaptively based on the video shot classes, it was shown in [9,30] that adaptation could indeed be used to improve both the precision and recall simultaneously, without introducing an intolerable amount of extra computation. Dawood and Ghanbari [15] used a similar classification to model MPEG video traffic. The problem of video indexing and retrieval is very closely related to that of image indexing. Surveys on video (and/or image) indexing and retrieval can be found in [19,21,33,34]. Video partitioning or segmentation has been reviewed in [20].
In this paper, we study the use of both color and edges in adaptive video partitioning. Our approach is distinct in its use of multilevel edge-based features in video partitioning, and in the provision of adaptation by a careful analysis of these multilevel features, based on the notion of shot variability. Adaptation is provided at three levels: at the feature extraction stage for the locally adaptive edge maps, at the video sequence level, and at the individual shot level.

Multilevel Edge-Response Vectors
In our approach, we place emphasis on the structural information in the video, as this is generally invariant under various changes in the video, such as illumination changes, translation, and partial occlusion. Thus, in addition to the intensity values, we also make use of the edges in computing the features to be used. In particular, we use multiscale edges, since these can more easily capture localized structures in the video frames.

Multilevel Image Decomposition.
Let $I(x, y)$ be an $M_1 \times M_2$ image, with $x = 1, 2, \ldots, M_1$; $y = 1, 2, \ldots, M_2$. Given $I(x, y)$, we decompose it into different blocks. For each block, we consider its content at different scales, and compute edge-based features at each of these scales. We then use the features to compare two adjacent frames in the video sequence. For simplicity in the discussion, we assume images are square, that is, $M_1 = M_2 = M$. We also assume $M = 2^p$, for some integer $p$. The ideas can easily be extended to the general rectangular image.
Let $b$ be the number of blocks at a given decomposition level. We choose $k$, the level of decomposition, such that $b = 1, 4, 16, \ldots, 2^{2k}$, for $k = 0, 1, 2, \ldots, \log_2 M$. Let $s$ be the scale, $s = 0, 1, 2, \ldots, S - 1$. Then, given the original image $I(x, y)$, we can select relevant areas of the image at different scales $s$. Let $I_s(i, j)$ be the sub-image selected at scale $s$, where $i, j = 1, 2, \ldots, (M/2)(1 + 1/2^s)$. At the lowest scale ($s = 0$), we will have the entire image, viz.: [...] where $x = 1, 2, \ldots, M$; $y = 1, 2, \ldots, M$, and $x_0, y_0$ are starting positions (typically, $x_0 = y_0 = 0$). Let $x_0^s, y_0^s$ be the corresponding starting positions at scale $s$ (these are with respect to $x_0, y_0$ in $I(x, y)$). Then we have: [...] At a given scale $s$, the starting positions are computed as [...] The size of the image block selected at scale $s$ will therefore be $m_s \times m_s$, where $m_s = (M/2)(1 + 1/2^s)$, so that $m_0 = M$ covers the entire image. For a given decomposition level, we consider each of the $m_s \times m_s$-sized blocks and compute the required image features. If we fix the number of scales to 1 (i.e., $s = 0$) at each level $k$ (i.e., at each level, we select all the image positions within the block to compute the feature), then the multiscale scheme defaults to a simple multilevel representation of the image. Thus, using $s = 0$ with a maximum of $L$ levels (i.e., $k = 0, 1, 2, \ldots, L - 1$), we will have an $N$-dimensional feature vector, where $N = \sum_{k=0}^{L-1} 2^{2k} = (4^L - 1)/3$. Clearly, the number of features grows quickly with increasing $L$ (e.g., at $L = 4$, $N = 85$; at $L = 5$, $N = 341$). For the multiscale representation, we have more than one scale per level. With $S$ as the number of scales per level (i.e., $s = 0, 1, 2, \ldots, S - 1$), we will have $S \cdot N$ feature values for each particular feature. In the following, we will assume a single scale (i.e., $S = 1$, and $s = 0$). Figure 1 shows a schematic diagram of an image at different levels of decomposition, and a tree representation of the individual blocks from each level.
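The block counts quoted above (85 features at four levels, 341 at five) can be checked with a short sketch of the single-scale decomposition. The function names, the 0-based half-open pixel ranges, and the row-major block order are illustrative assumptions, not part of the original description.

```python
def num_blocks(level: int) -> int:
    """Number of blocks at decomposition level k: 2^(2k) = 4^k."""
    return 4 ** level


def feature_dimension(num_levels: int) -> int:
    """Total number of blocks over levels k = 0..L-1 (single scale, S = 1)."""
    return sum(num_blocks(k) for k in range(num_levels))


def block_bounds(size: int, level: int):
    """Yield (row_start, row_end, col_start, col_end) for every block of a
    size x size image at the given level.

    Ranges are 0-based and half-open; blocks are visited in row-major order
    (both are assumptions made for this sketch).
    """
    per_side = 2 ** level          # blocks along each image side
    step = size // per_side        # block side length in pixels
    for block_row in range(per_side):
        for block_col in range(per_side):
            yield (block_row * step, (block_row + 1) * step,
                   block_col * step, (block_col + 1) * step)
```

For an 8 x 8 image, level 1 yields four 4 x 4 blocks and level 2 yields sixteen 2 x 2 blocks; the feature dimension is simply the total block count over all levels.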

Edge-Oriented Features.
To reduce the possible effect of noise in the video, we first apply Gaussian smoothing to the input image before computing the edge-based features.
Given the image $I(x, y)$, the edge gradients are defined as $G_x(x, y) = \partial I/\partial x$, $G_y(x, y) = \partial I/\partial y$. The gradients are obtained using appropriate edge kernels: $G_x(x, y) = H_x * I(x, y)$, $G_y(x, y) = H_y * I(x, y)$, where $H_x$ and $H_y$ are the horizontal and vertical gradient masks, respectively, and $*$ represents convolution. The gradient amplitude is given by $G_A(x, y) = \sqrt{G_x^2(x, y) + G_y^2(x, y)}$, which can be approximated using the simple absolute sum: $G_A(x, y) \approx |G_x(x, y)| + |G_y(x, y)|$. The phase angle is given by $G_\phi(x, y) = \tan^{-1}(G_y(x, y)/G_x(x, y))$. These will be calculated once for each frame, but will be used at different levels of decomposition.
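As a rough illustration of the gradient step, the following sketch computes the absolute-sum amplitude and the phase angle for a small gray-level image stored as nested lists. Several choices here are assumptions rather than the paper's specification: the Sobel masks (the text only calls for "appropriate edge kernels"), the zero-valued border, the use of `atan2` to avoid division by zero, and the mask being applied without flipping (for these masks that differs from true convolution only in sign, which leaves the amplitude unchanged).

```python
import math

# Sobel masks: an assumed choice of horizontal and vertical gradient masks.
H_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
H_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def apply_mask(image, mask):
    """Apply a 3x3 mask at every interior pixel (no kernel flip).

    Border pixels are left at 0, an assumption made to keep the sketch short.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            out[y][x] = sum(
                mask[dy + 1][dx + 1] * image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
    return out


def gradient_features(image):
    """Return (amplitude, phase) maps for the image.

    Amplitude uses the absolute-sum approximation |G_x| + |G_y|;
    phase uses atan2(G_y, G_x), in radians.
    """
    gx = apply_mask(image, H_X)
    gy = apply_mask(image, H_Y)
    amplitude = [[abs(a) + abs(b) for a, b in zip(row_x, row_y)]
                 for row_x, row_y in zip(gx, gy)]
    phase = [[math.atan2(b, a) for a, b in zip(row_x, row_y)]
             for row_x, row_y in zip(gx, gy)]
    return amplitude, phase
```

Because both maps are computed once per frame, the per-block statistics at every decomposition level can be derived from them without recomputing the gradients.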

Locally Adaptive-Edge Map.
The major motivation for a multilevel approach is that certain variations in an image, such as those due to edges, are local in nature, and hence will be better captured by the use of local (rather than global) information. For video in particular, this becomes very important. Although some variations in the video (such as panning, tilting, and illumination) could be global with respect to a particular frame, object motion and some other camera operations (such as zooming) are more easily modeled as local phenomena. (Note that, although zooming could also be global over the video frame, the direction of the motion vectors will vary from one area of the image to the other.) We capture global information by using information from the lower levels of decomposition (smaller values of $k$).
With higher levels, we can obtain information about more localized structures in the frame.Such localized structures could be treated differently for improved performance.
We use locally adaptive thresholds to define the edge map at different decomposition levels. Each block is considered using its own local threshold. For a given block $r$ at the $k$th level ($r = 1, 2, 3, \ldots, 2^{2k}$), we define the edge map as follows: [...] where $G^k_{A,r}$ is the corresponding gradient response in the $r$th block at level $k$, and $\tau^k_r$ is a local threshold. We can choose the threshold simply as [...] where $(m^k_r)^2$ is the size of the $r$th block at level $k$, and $\alpha$ is a constant. While the above approach to local thresholds is conceptually simple, it considers each block independently of the other blocks in the frame. It might be advantageous to consider the local threshold with respect to the global image variations [35]. At a given $k$, we can write $m^k_r = m^k$, $\forall r$, since the block size is the same for every block $r$.
EURASIP Journal on Image and Video Processing
Figure 1: Multilevel decomposition; tree representation for resultant sub-image blocks.
Define the overall global image threshold as [...] where $\alpha$ is a constant (which can be determined empirically).
The local threshold for a given block $r$ at each level $k$ is then given by [...] where $\mu^k_{e,r}$ and $\sigma^k_{e,r}$ are, respectively, the edge response mean and standard deviation for block $r$ at level $k$.

Edge-Based Features.
At a given level $k$, and for each given block $r$, we compute the following features.
(i) Color, $\mu^k_{c,r}$, $\sigma^k_{c,r}$: color mean and standard deviation using $I(x, y)$.
(ii) Edge response, $\mu^k_{e,r}$, $\sigma^k_{e,r}$: edge response mean and standard deviation using $G_A(x, y)$.
(iii) Phase angle, $\mu^k_{\phi,r}$, $\sigma^k_{\phi,r}$: mean phase angle and standard deviation using $G_\phi(x, y)$.
(iv) Edge length, $\mu^k_{\lambda,r}$: edge length using the edge map $E^k_r(x, y)$, where $\mu^k_{\lambda,r} = \sum_{x,y} E^k_r(x, y)$.
(v) Edge response at the edge points, $\mu^k_{p,r}$, $\sigma^k_{p,r}$: edge response mean and standard deviation computed only at the edge points, as defined by the edge maps.
The edge points are the pixel positions that lie on the edges, as determined by the thresholds above. We call the combined features, including the color features, multilevel edge-response vectors (MERVs).
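The edge-related entries of this feature list can be sketched for a single block as follows. The threshold is passed in as a parameter, so the sketch does not commit to any particular local-threshold formula. The strict comparison used to mark edge points, the population (rather than sample) standard deviation, the zero fallback for blocks without edge points, and the dictionary keys are all assumptions made for illustration.

```python
import statistics


def block_edge_features(gradient_block, threshold):
    """Compute edge-based features for one block of gradient amplitudes.

    gradient_block: nested list of G_A values for the block.
    threshold: local threshold for the block (supplied by the caller).
    """
    values = [v for row in gradient_block for v in row]
    # Binary edge map: 1 where the response exceeds the threshold (assumed '>').
    edge_map = [[1 if v > threshold else 0 for v in row]
                for row in gradient_block]
    edge_values = [v for v in values if v > threshold]
    return {
        "mu_e": statistics.fmean(values),        # edge response mean
        "sigma_e": statistics.pstdev(values),    # edge response std. dev.
        "mu_lambda": sum(map(sum, edge_map)),    # edge length: sum of edge map
        # Statistics restricted to edge points; 0.0 if the block has none.
        "mu_p": statistics.fmean(edge_values) if edge_values else 0.0,
        "sigma_p": statistics.pstdev(edge_values) if edge_values else 0.0,
    }
```

The color and phase-angle features follow the same mean and standard deviation pattern, applied to $I(x, y)$ and $G_\phi(x, y)$ instead of $G_A(x, y)$.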

Similarity Evaluation Using MERVs.
Having extracted the features, the next question is how to find appropriate metrics to compare two video frames using these features. Given two images $I_1(x, y)$ and $I_2(x, y)$, we can compute the distance between them using the general Minkowski distance, or some other metric. In the following, we use the simple city-block distance.
For the edge length, there will be no standard deviation, hence the distance will be [...] For the other features, we need to consider both the mean and the standard deviation. For example, for the edge response feature, we will have [...] Similarly, we obtain the corresponding distances $d_c(\cdot)$, $d_\phi(\cdot)$, and $d_p(\cdot)$ for color, phase angle, and edge response at edge points, respectively. The overall distance between the two images is then determined as a simple weighted average of the individual distances from the different features: $D(I_1, I_2) = w_c d_c + w_e d_e + w_\phi d_\phi + w_\lambda d_\lambda + w_p d_p$ (17), where $w_c + w_e + w_p + w_\phi + w_\lambda = 1$. The parameters $w_c$, $w_e$, $w_\phi$, $w_\lambda$, $w_p$ are the respective weights for features based on the color, edge response, phase angle, edge length, and edge response at edge points. By setting a weight to zero, we can completely ignore the contribution of any particular feature. For the weights above to be meaningful, however, we need to be sure that the range of values for the individual distances will be similar. Thus, we either have to normalize all the features to the same range of values, or compute the distance such that the overall distance from each feature is normalized. We take the latter approach, and perform normalization at the time of distance computation, based on the model-data feature pairs [...] where again $w_\mu$ and $w_\sigma$ are weights, with $w_\mu + w_\sigma = 1$. The normalized distances can then be used with the weights in (17) to obtain the overall distance between the frames.
Another important issue is the effect of each individual block on the overall difference. Let $w^k_{f,r}$ be the weight of feature $f$ from the $r$th block at level $k$, where $f \in \{c, e, \phi, \lambda, p\}$, and $c, e, \phi, \lambda, p$ denote the respective features based on color, edge response, phase angle, edge length, and edge response at edge points. A simple approach is to adopt a method whereby, for a chosen feature $f$, the contribution from every block at each level is given an equal weight. Effectively, $w^k_{f,r} = 1/\sum_{k=0}^{L-1} N_k = 1/N$, where $N_k$ is simply the number of blocks at the $k$th level. This gives more importance to the features from the lower levels of the decomposition tree (larger $k$), since these levels contain more blocks. As the number of decomposition levels $L$ increases, these lower-level features will dominate in the computation of the overall difference, which will then become very sensitive to small spatial differences in the frames, and hence more susceptible to noise and minute motion variations in the video. For shot classification, however, this can be beneficial, since the domination of global movement or features in the video can be avoided.
A better approach could be to divide up the contribution to the overall difference among the $L$ levels. The blocks that make up the $k$th level will then share the contribution allocated to that level. A simple way to do this is to distribute the contribution equally over all the levels: $w^k_{f,r} = 1/(L \cdot N_k)$. In all cases, we must have $\sum_k \sum_r w^k_{f,r} = 1$. The effect of the weights using the two cases considered above can be appreciated from Table 1.
Considering the weights at each level, the distance between adjacent frames can be computed using the per-block weights $w^k_{e,r}$ (20), or in weighted and normalized form: [...]
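The difference between the two block-weighting schemes discussed above can be made concrete with a small sketch. It assumes the equal-per-block scheme assigns $1/N$ to every block and the equal-per-level scheme assigns $1/(L \cdot N_k)$ to each block at level $k$, with $N_k = 4^k$; the function names are illustrative.

```python
def equal_block_weights(num_levels):
    """Per-block weight at each level when every block counts equally (1/N)."""
    total_blocks = sum(4 ** k for k in range(num_levels))
    return {k: 1.0 / total_blocks for k in range(num_levels)}


def equal_level_weights(num_levels):
    """Per-block weight at each level when every level gets a 1/L share,
    split evenly among the 4^k blocks of that level."""
    return {k: 1.0 / (num_levels * 4 ** k) for k in range(num_levels)}


def level_share(block_weights):
    """Total contribution of each level: per-block weight times block count."""
    return {k: w * 4 ** k for k, w in block_weights.items()}
```

With four levels, the equal-per-block scheme gives the finest level (64 blocks) a share of 64/85 of the overall difference, whereas the equal-per-level scheme caps every level at 1/4. This is the dominance effect the text describes.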

Adaptive Video Partitioning
When the distance $D(\cdot)$ is computed for a series of adjacent video frames, the result will be a sequence of frame differences, or FD-sequence for short. The actual video partitioning is performed by a further analysis of the FD-sequence. Let $D_i = D(f_i, f_{i+1})$ be the difference between two adjacent frames, $f_i$ and $f_{i+1}$. The FD-sequence is defined as $\mathrm{FD} = \{D_i\}$, $i = 1, 2, \ldots, n - 1$, where $n$ is the number of frames in the video. The FD-sequence is usually characterized by significant peaks at frame positions where a shot change has occurred. With the FD-sequence, the video partitioning problem then becomes that of determining appropriate thresholds to isolate these "significant peaks" from other peaks that might occur in the sequence. The shot threshold is defined as $\tau_s = \tau \cdot \max_i \{D(f_i, f_{i+1})\}$. We declare a shot partition at frame $t$ whenever the distance exceeds the threshold, that is, whenever $D_t = \mathrm{FD}(t) > \tau_s$.
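The static thresholding rule can be sketched in a few lines. The function name and the 0-based indexing (position `t` refers to the difference between frames `t` and `t + 1`) are assumptions for illustration.

```python
def detect_shot_boundaries(fd_sequence, tau):
    """Return positions t where the frame difference exceeds the shot threshold.

    fd_sequence: list of frame differences D_t (0-based in this sketch).
    tau: relative threshold, a fraction of the largest frame difference.
    """
    if not fd_sequence:
        return []
    tau_s = tau * max(fd_sequence)   # shot threshold tau_s = tau * max_i D_i
    return [t for t, d in enumerate(fd_sequence) if d > tau_s]
```

For example, with differences `[0.1, 0.2, 0.9, 0.1, 0.15, 1.0, 0.2]` and `tau = 0.5`, the threshold is 0.5 and the two peaks at positions 2 and 5 are reported. Because the threshold depends on the maximum over the whole sequence, this form needs the full FD-sequence before any boundary can be declared.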

Adaptation at the Video Sequence Level.
The description above assumes that video sequences are homogeneous, and hence can all be considered using the same set of parameters. However, video sequences vary considerably from one sequence to the other. First, we consider adapting the video analysis algorithm based on the entire video sequence. That is, for each video sequence, we determine the set of analysis parameters that will produce the best results. This set of parameters is then used to analyze all the frames or shots in the video sequence. Given the weights on the multilevel features (see (17)), we can parameterize the analysis algorithm in terms of these weights, $w = (w_c, w_e, w_p, w_\phi, w_\lambda)$, and the threshold, $\tau$. For adaptation at the sequence level, rather than considering all the features for the distance calculation, we consider only the features that are relevant to the video being analyzed. Thus, based on the particular video, we can determine the best $(w, \tau)$ pair for segmenting the video.
To check the effect of the weights and the thresholds on different video sequences, we used a combination of the weights at different thresholds.Based on empirical analysis, we chose 32 combinations of the weights (Table 2) and 9 thresholds (Table 3).
We observed that different videos may require different contributions from each feature (i.e., different weights, $w$) for best results. Also, at a given $w$, different thresholds could produce different results (see Table 6, Section 5). Similarly, for a given video sequence, various sets of weights can produce the same (best) results, but at different thresholds. Conceptually, adaptation at the sequence level should be simple. But there are several problems. First, at the sequence level, the video is still being considered at a very coarse granularity. Video shots are known to vary greatly, even for shots in the same video. Hence, different shots in the same video sequence could be very different in content. More importantly, automated mapping of the $(w, \tau)$ pair for each given video is a major problem, requiring a two-pass approach. This makes sequence-level adaptation unsuitable for real-time applications, or for network applications where dynamic modeling of video data traffic is required.

Shot-Level Adaptation.
The above problems can be addressed by considering the individual shots that make up the video. In [9], shots were characterized based on the activity and motion in the shots, and the respective shot duration. Using the characterization, video shots were grouped into nine classes, based on which video partitioning was performed by adaptively choosing different thresholds for each shot class. In the current work, we take a different approach to the problems of video characterization and classification.

Estimating Video Shot Complexity.
To make the thresholds sensitive to the different shot classes, we need some method to make such thresholds locally adaptive. The overall video shot complexity depends on the activity and the motion, while the shot class depends on both the complexity and the duration of the shot. The shot duration has a strong correlation with the amount of motion in the video. The length of the shot is typically inversely proportional to the amount of motion in the video [9]. We can determine the temporal duration as we analyze the shot. We could also determine the motion complexity by computing the motion vectors using motion estimation techniques [36]. However, motion estimation is very computationally intensive. Since we do not need accurate motion estimation to classify the shots or for adaptive indexing, an estimate of the amount of motion in the shot is enough. Thus, we can approximate the amount of motion using the differences between adjacent frames (e.g., by analyzing the FD-sequence), rather than by direct computation of the motion vectors. A similar observation has been made by Tao and Orchard [37], who noticed that the residual signal generated after motion-compensated prediction is highly correlated with the gradient magnitude: the motion-compensated error is larger, on average, for pixels with larger gradient magnitude. They thus suggested that the gradient (from one frame to the other) could be estimated from the reconstructed image using the motion estimates. In this work, we are interested in the reverse procedure: given the gradient information (as captured by the edge response vectors), we wish to estimate the amount of motion in the shot, without explicit motion estimation.
We can estimate both the image activity and the motion by using the already available multilevel edge response vectors, with appropriate weights. For example, if we use $w_i = 1/N$, $\forall i$ (e.g., $w_i = 1/85$ for 4 decomposition levels), or if we ignore the global averages altogether (i.e., the contributions from level $k = 0$), then the lower-level features in the decomposition tree (larger $k$, which are increasingly localized) can be used to predict the amount of motion. We could also ignore further higher-level features, for instance, levels at $k \le L/2$. We can estimate the activity by using the MERVs from just one frame in a given shot.
The motion and activity will generally result in an overall variability of the shot. The shot complexity depends directly on this shot variability. To estimate the shot variability, we use the mean and standard deviation of the frame-difference sequence (the FD-sequence) within the shot. We compute this for each of the MERV features, and use a weighted average to determine the shot variability. Given two time instants, $t_1$ and $t_2$ ($t_2 > t_1$), we compute the shot variability as follows. Let $T = t_2 - t_1$ be the duration. Let $\mathrm{FD}_\lambda$ be the frame difference sequence using a particular multilevel feature, say $\lambda$: [...] Similarly, we compute the corresponding values for color, edge response, edge response at edge points, and the phase angle. Then, as with the between-frame distances, we obtain the shot variability using a weighted combination from all the features: [...] The weights here may not necessarily be the same as those used for the distances. In [4,9], different methods were proposed for computing the motion and image complexities, for instance, using the spectral entropy and other metrics. With the above approach, one problem will be computing the standard deviation at each frame as the shot is progressing. This problem can be solved by doing the computations only at defined periodic intervals (the periods could also be chosen adaptively). However, one advantage of using the shot variability defined above is that the parameters required can be computed incrementally, using the preceding values. We can do this from the general definition of mean and standard deviation.
Given a data ensemble, $X = \{x_1, x_2, x_3, \ldots, x_{k-1}, x_k, x_{k+1}, \ldots\}$, and the mean of the first $k$ items, $\bar{x}_k$, we can estimate the mean when the $(k + 1)$th item is added: $\bar{x}_{k+1} = (k \bar{x}_k + x_{k+1})/(k + 1)$. Similarly, for the variance (or standard deviation), we have [...] Solving (25) simultaneously, we obtain the incremental formula [...] We can use these to incrementally estimate the shot variability using the available FD-sequence. Based on the shot variability, we classify the shots into nine classes, as follows.
Given $\mu(t_1, t_2)$ and $\sigma(t_1, t_2)$ for a given shot, we classify each into three classes, namely, low (I), medium (II), and high (III), based on an equiprobable classification. Let $\mu_c(t_1, t_2) \in \{\mathrm{I}, \mathrm{II}, \mathrm{III}\}$ be the classification due to $\mu(t_1, t_2)$. Similarly, let $\sigma_c(t_1, t_2) \in \{\mathrm{I}, \mathrm{II}, \mathrm{III}\}$ be the classification due to $\sigma(t_1, t_2)$. Using the classifications from the two dimensions of shot variability, we define a simple mapping function $f(\cdot)$ to determine the overall shot class, viz.: [...] Table 4 shows the classification results for the test video sequence, based on the above scheme.
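One way to realize the incremental computation is the standard running update for the mean and the population variance, sketched below. This is a well-known textbook recurrence and is not claimed to be the exact form of the paper's equations; the class and attribute names are illustrative.

```python
import math


class RunningStats:
    """Incrementally track the mean and population variance of a stream."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.variance = 0.0   # population variance of the items seen so far

    def update(self, x):
        """Fold item x_{k+1} into the statistics of the first k items."""
        k = self.count
        delta = x - self.mean                       # x_{k+1} - mean_k
        self.mean += delta / (k + 1)                # mean_{k+1}
        # var_{k+1} = k/(k+1) * (var_k + delta^2 / (k+1))
        self.variance = (k / (k + 1)) * (self.variance + delta * delta / (k + 1))
        self.count = k + 1

    @property
    def std(self):
        return math.sqrt(self.variance)
```

Feeding each new frame difference of the current shot into `update` keeps the shot-variability estimate current at constant cost per frame, without storing or rescanning the FD-sequence.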
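The two-step classification can be sketched as follows. The tertile cut-offs are one simple reading of "equiprobable" (each level receives about a third of the training values), and the mapping to a class number between 1 and 9 is a hypothetical choice of $f(\cdot)$ made for illustration only; the levels 1, 2, 3 stand for I, II, III.

```python
def tertile_cutoffs(values):
    """Return (low, high) cut-offs that split the values into three groups
    of roughly equal size."""
    ordered = sorted(values)
    n = len(ordered)
    return ordered[n // 3], ordered[(2 * n) // 3]


def level_of(value, cutoffs):
    """Map a value to level 1 (low), 2 (medium), or 3 (high)."""
    low, high = cutoffs
    if value < low:
        return 1
    if value < high:
        return 2
    return 3


def shot_class(mu_level, sigma_level):
    """Hypothetical mapping f: combine the two levels into a class in 1..9."""
    return 3 * (mu_level - 1) + sigma_level
```

In use, the cut-offs for the mean and for the standard deviation would each be estimated from the shot-variability values of a training set, and then applied to the running estimates of the shot being analyzed.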

Adaptive Shot Thresholds.
Having characterized and classified the shots based on the shot variability, the next question is how to determine the parameters for video shot partitioning for a given shot. Ideally, given the FD-sequence (and assuming that it was obtained from a distance measure, and not a similarity measure), we expect that the threshold for shot changes should decrease with increasing shot length, but increase with increasing shot complexity (or variability). Formally, given a video shot $s_j$, we classify it into a certain shot class, $c_i \in \{\mathrm{I}, \mathrm{II}, \mathrm{III}, \ldots, \mathrm{IX}\}$. The problem of shot-level adaptation then is to determine the parameter set (i.e., the $(w, \tau)$ pair) that will produce the best results for all shots, $s_j \in c_i$, $\forall i, j$. Here, best results are defined in terms of the information retrieval measures of precision and recall.
We take a pragmatic approach to the problem of determining the parameters.Using a training set of video shots, we use a simple clustering technique to determine the (w, τ) pairs that produce the best results for each shot class in the training set.We then use these pairs for analysis of the test video sequences.
Let $P = (w, \tau)$ be the weight-threshold pair that defines the parameter set for video segmentation. Let $V$ be the number of video sequences used for the training set. We use the edge-response vectors to analyze the video shots in the training set, using all the available weights and thresholds (i.e., 32 weights and 9 thresholds in all; see Tables 2 and 3). Let $P^c_j$ denote the set of $(w, \tau)$ pairs that produced correct partitioning results for the class $c$ shots in video sequence $j$. To select the best $(w, \tau)$ pair for a given shot class $c$, all we need is the intersection of $P^c_j$ over all the $V$ sequences: $P^c = \bigcap_{j=1}^{V} P^c_j$. When we have $|P^c| > 1$, any member of $P^c$ can be used as the best parameter set. The major problem is when $P^c = \emptyset$, that is, the intersection is empty, implying that no single parameter set always produced correct results for all the class $c$ shots in the training sequences. Two approaches can be used to address this problem.
For each shot in a given video sequence, we define an array $a_{i,j}$, $i = 1, 2, \ldots, w_{\max}$, $j = 1, 2, \ldots, \tau_{\max}$, such that $a_{i,j} = 1$ if the shot is correctly partitioned with the parameter set given by the $(i, j)$ pair, and $a_{i,j} = 0$ otherwise. We use $w_{\max} = 32$, $\tau_{\max} = 9$ in our implementation. Let $a^c_{i,j}(q)$ denote the cumulative value in the $a_{i,j}$ arrays for all the class $c$ shots in video sequence $q$. Then, the best parameter set for the class $c$ shots is determined as $P^c = \arg\max_{i,j} \{a^c_{i,j}\}$, where $a^c_{i,j} = \sum_{q=1}^{V} a^c_{i,j}(q)$. The above selects the parameter set that produced the best overall result, over all the shots of a given class in the training set. This could be dominated by one video sequence that has many shots of the given type. A variation could be to use the parameter set that produced the best result over the shots of a given class from most of the sequences, although it may not necessarily produce the best results over all shots. Thus, [...] where $P^c(q) = \arg\max_{i,j} \{a^c_{i,j}(q)\}$.
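The three selection rules can be sketched with plain sets and counters. Parameter sets are represented as `(i, j)` index pairs, and the per-sequence counts as dictionaries from pairs to the number of correctly partitioned class-$c$ shots. Reading "best over most of the sequences" as a majority vote over the per-sequence winners is an assumption, as are the function names and the arbitrary tie-breaking.

```python
from collections import Counter


def select_by_intersection(correct_sets):
    """Pairs that gave correct results for the class in every sequence."""
    return set.intersection(*correct_sets) if correct_sets else set()


def select_by_cumulative(counts_per_sequence):
    """Pair with the highest count summed over all training sequences."""
    total = Counter()
    for counts in counts_per_sequence:
        total.update(counts)           # adds the counts pair by pair
    return max(total, key=total.get)


def select_by_majority(counts_per_sequence):
    """Pair that is the per-sequence winner in the largest number of
    sequences (assumed reading of the second fallback)."""
    winners = Counter(max(counts, key=counts.get)
                      for counts in counts_per_sequence)
    return winners.most_common(1)[0][0]
```

If one sequence contributes ten shots that favor pair `(1, 1)` while two shorter sequences favor `(2, 3)`, the cumulative rule returns `(1, 1)` and the majority rule returns `(2, 3)`, which illustrates the dominance issue raised in the text.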

Results
To test the performance of the proposed edge-based adaptive method, we ran experiments using two sets of video sequences. The first set had 6 sequences taken from standard MPEG-7 sequences and from available online video sources [31]. For each video sequence, the frame size was fixed at 352 × 288. The second set had 5 sequences taken from the US National Institute of Standards and Technology (NIST) benchmark TRECVID 2001 test sequences. The frame size for sequences in this set was 320 × 240. The experiments were carried out in a MATLAB Version 7.3.0.267 (R2006b) environment using a personal computer with an Intel(R) CPU T2400 running at 1.83 GHz, with 1.99 GB RAM. We measure performance in terms of the information retrieval measures of precision and recall. We use the following notation: $D$ = set of all positions of true scene cuts in a test video sequence, $B$ = set of all positions of scene cuts returned by the system, $C$ = subset of $B$ that are true scene cuts (i.e., correct detections, or $C = B \cap D$). Then, precision $Pr = |C|/|B|$, and recall $Rc = |C|/|D|$.
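The two measures follow directly from the set definitions. The sketch below assumes exact frame-position matching (no tolerance window around a true cut) and returns 0.0 when a denominator would be empty; both are illustrative choices, as is the function name.

```python
def precision_recall(true_cuts, detected_cuts):
    """Return (precision, recall) for detected scene-cut positions.

    true_cuts: positions of true scene cuts (the set D).
    detected_cuts: positions returned by the system (the set B).
    """
    d, b = set(true_cuts), set(detected_cuts)
    c = b & d                                    # correct detections, C = B ∩ D
    precision = len(c) / len(b) if b else 0.0    # |C| / |B|
    recall = len(c) / len(d) if d else 0.0       # |C| / |D|
    return precision, recall
```

For instance, with true cuts at frames 10, 50, 90, and 130 and detections at 10, 50, and 70, two of three detections are correct (precision 2/3) and two of four true cuts are found (recall 1/2).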

Effectiveness of MERVs on Non-Adaptive Partitioning.
First, we tested the effectiveness of the proposed edge-response vectors in video partitioning, without consideration for adaptation. This is important, since the results of the adaptive schemes will also be influenced by the inherent robustness of the edge-based features. The results are shown in Table 5. As can be seen, the edge-oriented approach achieved about 90% precision and recall.

Adaptive Partitioning.
Table 6 shows the results for adaptation at the video sequence level. The last two columns show the weight-threshold parameter pairs that were used to produce the indicated results. Where there are more than two entries, it means that the indicated entries all produced the same result. The table shows a significant improvement over the non-adaptive approach. The sequence-level adaptation is a two-pass method. That is, it needs a first pass on the data to determine the analysis parameters, and a second pass to perform the analysis. For some applications, such as real-time video streaming, the two-pass approach may not be applicable. Shot-level adaptation avoids the two-pass problem. Table 7 shows the results for shot-level adaptation, based on shot characterization and classification using the proposed shot variability measure. The results are slightly worse than those of the two-pass method using sequence-level adaptation, but generally better than those of the static approach.

Comparative Results.
We performed a comparative experiment using other popular techniques. Table 8 shows the results. For color histograms, we used region-based histograms with 16 blocks ((M 1 /4) × (M 2 /4) regions) per frame, where M 1 and M 2 are the frame dimensions. Analysis using the motion-vector-based method [28] is based on 20 × 20 sub-blocks. The specific kernel size used for Cooper et al.'s DCT-based method [25] is also indicated in the table, as this varied significantly from sequence to sequence. In all cases, we have reported results using the parameters that gave the best overall result for a given video sequence, or for the test video set used. Apart from the results in [9], none of the other methods used adaptive partitioning. Thus, we can compare the performance of the static (non-adaptive) method using the proposed MERVs as features with the results from the other schemes. The table shows that MERV features are very competitive, with performance comparable to that of the correlation-based method [27], the best performing of the other schemes tested. While the simple color histogram did well on some video sequences, it performed poorly on the CROPS and CANYON video sets. This is mainly because these two sequences have both indoor and outdoor scenes involving significant variation in illumination. Color features are easily affected by this variation, and hence the precision of the color-histogram-based method was quite low for these sequences. The same explains the poor performance of the motion-vector-based method. Illumination variation between frames often leads to poor motion detection and thus a significant error in the motion vectors (even with the special parameter for illumination handling used in [28]). Overall, the results from the adaptive schemes are generally better than those from the non-adaptive schemes. This can be explained by the fact that the adaptive schemes first spend time analyzing each shot before deciding on the analysis parameters. Thus, they are able to adapt better to the changing nature of shot characteristics as we move along in the video sequence.
We then tested the methods on another set of video sequences, this time using five sequences from the NIST benchmark TRECVID 2001 video sequences. The sequences and the annotations by NIST, such as the positions of true scene cuts, are available via the NIST TRECVID website (http://www-nlpir.nist.gov/projects/trecvid/revised.html). (We could not get access to more recent sequences used in the TRECVID series. The most recent versions are available only to competitors in the TRECVID challenge. All the same, we believe that the 2001 data still provides another independent data set suitable for testing the algorithms.) The results on the TRECVID sequences are shown in Table 9. The overall result is not too different from that of Table 8. Both the proposed method and the correlation method produced better results than the others. The two had similar average performance, with the proposed method performing slightly better in recall (0.922 versus 0.906).
We also compared the proposed adaptive scheme with the scene-adaptive method proposed in [9]. The major difference was in terms of scene characterization. After characterization, we then used the same MERV features to perform video partitioning. Thus, this essentially compares the performance of the proposed shot variability measure for video shot characterization and classification against that of characterization using explicit motion and activity. Using shot variability for shot characterization and classification is slightly superior to using motion and activity complexity measures [9], with (precision, recall) values of (0.96, 0.93) versus (0.94, 0.91). A more striking difference, however, can be observed by considering the computational requirements for the two approaches. Using shot variability as a shot complexity measure is about 5 times faster than using motion and activity. The shot variability measure does not involve explicit motion estimation and activity characterization, but rather uses the same features (i.e., the FD-sequence) that were used in the analysis. Thus, it is generally more efficient than using motion and activity. Table 10 shows the overall time taken by the different methods in video partitioning. The reported time represents the average feature extraction time per frame required in analyzing a given video sequence.
Experimental results show that the proposed multilevel edge-based features provide a performance of about 90% in terms of average precision and recall. Using the same multilevel edge-based features, the adaptive schemes perform better than the non-adaptive approaches, with video sequence level adaptation producing about 99% performance. Further, the use of shot variability as a measure of shot complexity resulted in a slightly superior performance (about 2% improvement in precision) over a previously proposed method of explicit motion estimation and shot activity analysis. The larger gain is in efficiency, where using shot variability led to a fivefold improvement. The reported work has applications beyond video indexing and retrieval. In particular, given the significant reduction in computations, the approach becomes attractive for real-time applications, such as dynamic monitoring, characterization and modeling of video data traffic, and real-time video surveillance.

Figure 1: Multilevel decomposition (a) an image at three levels of decomposition; (b) tree representation of the decomposition.

Table 1: Weights for two choices for the contribution from each decomposition level (L = 4).

Table 4: Shot classification results based on shot variability.

Table 5: Effectiveness of MERVs for non-adaptive video partitioning.

Table 6: Results for proposed sequence-level adaptive partitioning.

Table 7: Results for adaptive partitioning using proposed shot variability measure.