A fuzzy video content representation for video summarization and content-based retrieval
Introduction
The increasing amount of digital image and video data has stimulated new technologies for efficiently searching, indexing, content-based retrieving and managing multimedia databases. The traditional approach of keyword annotation for accessing image or video information has the drawback that, apart from the large effort required to develop annotations, text alone cannot efficiently characterize rich visual content. For this reason, content-based retrieval algorithms have recently been proposed and have attracted great research interest in the image processing community [10], [19]. Examples of content-based retrieval systems, either academic or in the first stage of commercial exploitation, include the QBIC [8], Virage [11] and VisualSeek [21] prototypes. In this framework, the Moving Picture Experts Group (MPEG) is currently defining the new MPEG-7 standard [17], which specifies a set of descriptors for an efficient interface to multimedia information.
The aforementioned systems are mainly restricted to still images and cannot easily be applied to video databases [4]. This is because the standard representation of video as a sequence of consecutive frames results in significant temporal redundancy of visual content, so it is very inefficient and time-consuming to perform queries on every video frame. Furthermore, most video databases are often located on distributed platforms and impose both large storage and transmission bandwidth requirements, even when compressed. Such a linear representation of video sequences is also not adequate for the new emerging multimedia applications, such as video browsing, content-based indexing and retrieval. For this reason, a content-based sampling algorithm is usually applied to video data to extract a small but “meaningful” amount of the video information [3], [13]. This results in a video summarization scheme similar to that used in document search engines, where a brief text summary corresponds to one or multiple documents.
However, efficient implementation of content-based retrieval algorithms and video summarization schemes requires a more meaningful representation of visual content than the traditional pixel-based one, since semantic information is lacking at the pixel level. For this reason, several approaches toward more efficient image/video representation have been presented in the literature. A hidden Markov model has been investigated in [15] for color image retrieval, while in [5] an approach to image retrieval based on user sketches has been reported. A hierarchical color clustering method has been presented in [22]. For video summarization, construction of a compact image map or image mosaics has been described in [13], while a pictorial summary of video sequences based on story units has been presented in [24].
In the context of this paper, a fuzzy representation of visual content is proposed for both video summarization and content-based indexing and retrieval. This representation increases the flexibility of content-based retrieval systems, since it provides an interpretation closer to human perception [14]. It also yields a more robust description of visual content, since possible instabilities of the segmentation used to describe the visual content are reduced. In particular, the adopted fuzzy representation is applied to both video summarization and content-based retrieval. In the first case, a small set of key-frames is extracted which provides an efficient description of visual content. This is performed by minimizing a cross-correlation criterion among the video frames by means of a genetic algorithm. The correlation is computed on fuzzy feature vectors built from several features extracted by a color/motion segmentation. In the second case, the user provides queries in the form of images or sketches, which are analyzed in the same way as the video frames in the summarization scheme. A metric distance or similarity measure is then used to find the set of frames that best match the user's query.
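The key-frame selection step described above can be illustrated with a small sketch. The paper's actual formulation is not reproduced here; the fitness function, operator probabilities and the function name `keyframe_ga` are illustrative assumptions, showing only the general idea of a genetic algorithm minimizing the mean pairwise correlation among selected frames.

```python
import numpy as np

def keyframe_ga(features, n_keys=3, pop_size=30, generations=40, rng=None):
    """Select n_keys frame indices whose feature vectors are least
    cross-correlated, via a simple genetic algorithm (illustrative sketch).

    features : (n_frames, d) array of per-frame feature vectors.
    Fitness (minimized) = mean pairwise correlation of the chosen frames.
    """
    rng = rng or np.random.default_rng(0)
    n = len(features)

    def fitness(idx):
        c = np.corrcoef(features[idx])          # pairwise correlation matrix
        iu = np.triu_indices(len(idx), 1)
        return c[iu].mean()                     # lower = more diverse frames

    # initial population: random subsets of unique frame indices
    pop = [rng.choice(n, n_keys, replace=False) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)
        pop = pop[: pop_size // 2]              # selection: keep best half
        children = []
        while len(pop) + len(children) < pop_size:
            a, b = rng.choice(len(pop), 2, replace=False)
            pool = np.union1d(pop[a], pop[b])   # crossover: merge parent genes
            child = rng.choice(pool, n_keys, replace=False)
            if rng.random() < 0.2:              # mutation: swap in a new index
                cand = rng.integers(n)
                if cand not in child:
                    child[rng.integers(n_keys)] = cand
            children.append(child)
        pop += children
    return sorted(min(pop, key=fitness))
```

Because subsets of near-duplicate frames are highly correlated, the search is driven toward frames with dissimilar visual content.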
This paper is organized as follows: In Section 2, the video sequences are analyzed by applying a color/motion segmentation algorithm. The features extracted for each color or motion segment are fuzzy classified, as presented in Section 3. Application of the proposed fuzzy representation scheme to video summarization is discussed in Section 4, while the application to content-based retrieval is discussed in Section 5. Furthermore, several practical implementation issues, such as selected parameters and numerical values, are also mentioned in these sections. Experimental results on a large image/video database are presented in Section 6, along with comparisons with other known techniques demonstrating the good performance of the proposed scheme. Finally, Section 7 concludes the paper.
Section snippets
Video sequence analysis
Semantic segmentation, i.e., extraction of meaningful entities, is essential in a content-based retrieval environment. However, this remains one of the most difficult problems in the image analysis community, especially if no constraints are imposed on the kind of video sequences examined [6], [7], [9]. For this reason, a color/motion segmentation algorithm is applied in this paper for visual content description.
A multiresolution implementation of the recursive shortest spanning tree (RSST)
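The core idea of RSST-style segmentation can be conveyed with a toy sketch. The paper uses a multiresolution 2-D implementation; the 1-D version below, with mean gray-value difference as the link weight and the hypothetical name `rsst_1d`, is only an illustration of the recursive merging principle.

```python
def rsst_1d(pixels, n_regions):
    """Toy RSST-style merging on a 1-D row of gray values:
    repeatedly merge the pair of adjacent regions joined by the
    cheapest link (smallest mean difference) until n_regions remain.
    Illustration only; the paper uses a multiresolution 2-D variant."""
    regions = [[p] for p in pixels]               # start: one region per pixel
    while len(regions) > n_regions:
        means = [sum(r) / len(r) for r in regions]
        # link weight = |mean difference| between adjacent regions
        i = min(range(len(regions) - 1),
                key=lambda k: abs(means[k] - means[k + 1]))
        regions[i] = regions[i] + regions[i + 1]  # merge across cheapest link
        del regions[i + 1]
    return regions
```

In the full 2-D algorithm the same recursion runs over a region adjacency graph, and the multiresolution implementation applies it first on a coarse image to reduce computational cost.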
Fuzzy visual content representation
The size, location and average color components of all color segments are used as color properties. In a similar way, motion properties include the size, location and average motion vectors of all motion segments. Since the segment number is not constant for each video frame, the aforementioned properties cannot be directly included in a feature vector, because the size of this vector is not constant. Thus, direct comparison between vectors of different frames is practically impossible. For
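One common way to obtain a fixed-size vector from a variable number of segments is a fuzzy histogram: each feature axis is partitioned into fuzzy sets and every segment contributes its membership degrees, weighted by segment size. The sketch below uses triangular membership functions and a single normalized scalar feature per segment; both choices, and the names `tri_membership`/`fuzzy_histogram`, are simplifying assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def tri_membership(x, centers):
    """Triangular fuzzy membership of scalar x in [0, 1] to each of the
    uniformly spaced set centers."""
    width = centers[1] - centers[0]
    return np.clip(1 - np.abs(x - centers) / width, 0, 1)

def fuzzy_histogram(segments, n_bins=5):
    """Map a variable number of segments to a fixed-size fuzzy feature vector.

    segments : list of dicts with 'value' (a normalized scalar feature,
               e.g. average hue in [0, 1]) and 'size' (fraction of frame area).
    """
    centers = np.linspace(0, 1, n_bins)
    hist = np.zeros(n_bins)
    for seg in segments:
        # each segment spreads its area over the fuzzy sets it belongs to
        hist += seg["size"] * tri_membership(seg["value"], centers)
    s = hist.sum()
    return hist / s if s > 0 else hist
```

Because every frame now maps to a vector of the same length regardless of its segment count, frames (and queries) become directly comparable, and small segmentation instabilities shift membership degrees only gradually.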
Video summarization
Fig. 5 depicts the block diagram of the proposed video summarization scheme. Since a video sequence is a collection of different shots, each of which corresponds to a continuous action of a single camera operation, a shot cut detection algorithm is first applied, to identify video frames of similar visual content. The algorithm proposed in [22] has been adopted for this purpose, since it presents high accuracy and low computational complexity compared to other techniques [3]. Then, analyzing
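A shot cut detector of the general family discussed above can be sketched as follows. This is a generic color-histogram-difference detector, not the specific algorithm of [22]; the bin count, threshold and the name `detect_shot_cuts` are illustrative assumptions.

```python
import numpy as np

def detect_shot_cuts(frames, n_bins=16, threshold=0.4):
    """Flag a cut wherever the normalized color-histogram difference
    between consecutive frames exceeds a threshold.

    frames : iterable of HxWx3 uint8 arrays. Returns cut frame indices.
    """
    cuts, prev = [], None
    for i, frame in enumerate(frames):
        hist, _ = np.histogram(frame, bins=n_bins, range=(0, 256))
        hist = hist / hist.sum()                       # normalize to sum 1
        # half the L1 distance lies in [0, 1]; large jumps indicate a cut
        if prev is not None and 0.5 * np.abs(hist - prev).sum() > threshold:
            cuts.append(i)
        prev = hist
    return cuts
```

Each detected cut starts a new shot, within which the key-frame extraction of the previous section is then applied.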
Content-based retrieval
The problem of content-based retrieval from image and video databases is discussed in this section. In particular, for content-based video retrieval the aforementioned video summarization scheme is applied, so that all redundant temporal video information is discarded. At this point, the problem of content-based retrieval from a video database actually reduces to still-image retrieval, since video queries are applied to the selected key-frames. The proposed fuzzy representation scheme is
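Once the query and the key-frames share the same fixed-size fuzzy feature representation, matching is a nearest-neighbor ranking under a metric distance. A minimal sketch, assuming Euclidean distance and the hypothetical name `retrieve` (the paper's actual similarity measure may differ):

```python
import numpy as np

def retrieve(query_vec, keyframe_vecs, top_k=3):
    """Rank key-frame feature vectors by Euclidean distance to the
    query vector and return the indices of the top_k closest matches.

    query_vec     : (d,) fuzzy feature vector of the query image/sketch.
    keyframe_vecs : (n_keyframes, d) array of key-frame feature vectors.
    """
    d = np.linalg.norm(keyframe_vecs - query_vec, axis=1)
    return list(np.argsort(d)[:top_k])
```

The returned key-frame indices point back to their shots, so a video-level result can be presented by displaying the matched key-frames together with their source sequences.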
Experimental results
The proposed fuzzy representation of visual content has been evaluated both for video summarization and content-based indexing and retrieval, using a large database consisting of MPEG coded video sequences and several images compressed in JPEG format. The Optibase Fusion MPEG encoder at a bit-rate of 2 Mbits/s has been used to encode the video sequences.
Fig. 11 illustrates a shot used to demonstrate the performance of the key-frame extraction algorithm. The shot comes from an educational series
Conclusions
A new approach for efficient visual content representation has been presented in this paper. In particular, in the proposed framework, the traditional pixel-based representation of visual content is transformed to a fuzzy feature-based one, which is more suitable for the new emerging multimedia applications, such as video browsing, content-based image indexing and retrieval and video summarization. First, an analysis of video sequences is performed by applying a color/motion segmentation
Acknowledgements
The authors would like to thank Georgios Akrivas, for providing them with an efficient implementation of the key-frame selection technique presented in [23].
References (24)
- A. Alatan, L. Onural, M. Wollborn, R. Mech, E.Tuncel, T. Sikora, Image sequence analysis for emerging interactive...
- Y. Avrithis, A. Doulamis, N.D. Doulamis, S. Kollias, An adaptive approach to video indexing and retrieval using fuzzy...
- Y. Avrithis, A. Doulamis, N. Doulamis, S. Kollias, A stochastic framework for optimal key frame extraction from MPEG...
- S.-F. Chang, W. Chen, H.J. Meng, H. Sundaram, D. Zhong, A fully automated content-based video search engine supporting...
- A. Del Bimbo, P. Pala, Visual image retrieval by elastic matching of user sketches, IEEE Trans. Pattern Anal. Mach....
- N. Doulamis, A. Doulamis, D. Kalogeras, S. Kollias, Very low bit-rate coding of image sequences using adaptive regions...
- A. Doulamis, N. Doulamis, S. Kollias, On line retrainable neural networks: improving the performance of neural networks...
- M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D....
- L. Garrido, F. Marques, M. Pardas, P. Salembier, V. Vilaplana, A hierarchical technique for image sequence analysis, in...
- V.N. Gudivada, J.V. Raghavan (Eds.), Special Issue on Content-Based Image Retrieval Systems, IEEE Comput. Mag. 28 (9)...