A hybrid domain enhanced framework for video retargeting with spatial–temporal importance and 3D grid optimization
Introduction
According to a comScore survey [1], 181 million U.S. Internet users watched nearly 40 billion online video clips in January 2012. Video applications and services involve significant amounts of computation and storage, especially when serving millions of video users at the same time. Mobile devices have increasingly come to rely on the cloud because they are limited in memory, computing power, and battery life. With cloud computing, Internet users can receive services from a cloud as if they were using a supercomputer. Video services can therefore store their video data in the cloud, run their applications with software deployed in the cloud, and make ubiquitous video access possible through the cloud.
However, video services in the cloud face great challenges. As different types of devices have different display capabilities, one big challenge for the cloud is that it must provide video adaptation for presenting videos properly on diverse devices, including cell phones, PDAs, laptops, PSPs, and digital cameras. In video cloud computing, video retargeting is necessary to adapt videos for display on devices with clear and smooth imaging quality. Given a device screen, the original video is adapted to a suitable version in terms of scale and aspect ratio. Fig. 1 demonstrates our video retargeting framework, which has wide application scenarios. (1) Home video service: owing to the widely used high-definition TV (HDTV) technologies, video retargeting is required to alleviate the distortion of mapping 4:3 programs to a 16:9 wide screen. (2) Mobile video service: the heterogeneity of mobile devices creates a demand for video adaptation that exploits content analysis to outperform traditional solutions of cropping, squashing, or black padding, especially on small screens. (3) Public video service: diverse public facilities (e.g. entertainment zones, LED stadium walls, and digital bulletin boards) need video retargeting for content adaptation across different displays. (4) Internet video service: the success of online video sharing communities has demonstrated the popularity of user-generated content (UGC), where video retargeting is desirable to meet users' requirements across different devices.
With video retargeting, users with different requirements, interests, and purposes can enjoy pervasive video services anytime and anywhere, without considering the technical limitations of their mobile devices or worrying about where the videos come from.
After retargeting, videos may undergo changes such as content removal, insertion, or distortion in order to fit different display sizes. To maximize viewing comfort, there are trade-offs in how resources are allocated to different regions of a video (e.g. foreground and background subjects) when generating a retargeted video. Retargeted videos need to retain the content that attracts user interest and helps viewers follow the video's story. Therefore, both visual attention and higher-level features (e.g. visual concepts) should be involved in determining regions of importance. Visual attention describes the regions within a video frame that attract different levels of user awareness. In this paper, high-level features are defined as the visual concepts that are important components of videos (as shown in Table 1) and play a significant role in helping users understand video content. We argue that a reliable video retargeting scheme needs to consider both visual attention and higher-level concepts.
To achieve comprehensive video retargeting, we must face two major challenges. The first comes from user preference regarding video content: video retargeting should allocate resources to user-preferred content, and the content that matters to users varies across video domains. For example, playfields, players, and balls are important in sports videos; product images, spokesperson faces, and brand logos in ad videos; blackboards and lecturers in lecture videos; moving objects in surveillance videos; and news reports in news videos. A hybrid video retargeting framework therefore needs to incorporate domain-specific knowledge.
Second, the dual requirements of information maximization and deformation minimization make video retargeting difficult. Toward ubiquitous cloud video services, we consider that a satisfactory video retargeting method should achieve:
- •
Information maximization (IM): after removing redundancy, the retargeted video must keep both the scene layout (background) and the focused regions (objects) identifiable.
- •
Deformation minimization (DM): the retargeted video should introduce minimum distortion within salient regions. The distortion comes from two aspects:
- 1.
Spatial distortion: caused by aspect ratio change, and
- 2.
Temporal incoherence: caused by discontinued frames.
Our preliminary research using spatial–temporal grid optimization for sports and advertisement video retargeting was published in [2]. Building on that grid optimization, this paper proposes a hybrid, domain-enhanced video retargeting framework with three main components: video parsing, spatial–temporal importance determination, and 3D grid optimization. The framework exploits domain-specific knowledge, i.e. the different high-level visual concepts that attract user interest in different video domains. Through video parsing, these high-level domain-specific visual concepts are first identified; the framework then combines them with visual attention to determine important regions within videos, yielding a domain-enhanced retargeting solution. By allocating more resources to user-preferred video content, the proposed framework maximally improves the viewing experience. To further enhance viewing comfort, we propose a grid optimization method that considers both spatial importance and temporal consistency. By incorporating different domain-specific knowledge, the proposed framework can easily adapt to various video domains.
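The three-stage pipeline described above can be sketched as follows. Every function body here is a trivial stand-in of our own (e.g. nearest-neighbor scaling in place of grid optimization), intended only to show how the stages compose, not the authors' implementation:

```python
import numpy as np

def parse_video(frames, domain):
    """Stage 1: hierarchical, domain-specific parsing.
    Placeholder: returns empty concept masks (no concepts detected)."""
    return [np.zeros(f.shape, dtype=bool) for f in frames]

def importance_maps(frames, concept_masks):
    """Stage 2: spatial-temporal importance determination.
    Placeholder: uniform importance everywhere."""
    return [np.ones(f.shape, dtype=float) for f in frames]

def grid_optimize(frames, imp_maps, target_shape):
    """Stage 3: 3D grid optimization.
    Placeholder: nearest-neighbor rescaling that ignores importance."""
    h, w = target_shape
    out = []
    for f in frames:
        ys = np.arange(h) * f.shape[0] // h  # source row per target row
        xs = np.arange(w) * f.shape[1] // w  # source column per target column
        out.append(f[np.ix_(ys, xs)])
    return out

def retarget(frames, target_shape, domain="sports"):
    masks = parse_video(frames, domain)
    imps = importance_maps(frames, masks)
    return grid_optimize(frames, imps, target_shape)
```

A real instantiation would replace each placeholder with the corresponding component described in Sections 4 to 6.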
Compared to our existing work in [2], our major contributions are summarized as follows:
- •
Unlike most existing work, the proposed retargeting framework considers user preference for video content. High-level visual concepts directly affect how users understand a video and therefore constitute user-preferred content. By combining visual attention with these high-level visual concepts, the proposed framework allocates resources to the most important video content.
- •
Our framework incorporates domain-specific knowledge, which is used to determine the important content in videos. Retargeting is then achieved by allocating more resources to this important content while respecting both spatial importance and temporal consistency. Because the domain knowledge is pluggable, the proposed framework can easily be adapted to different video domains; we investigate the popular domains of advertisement, sports, news, lecture, and surveillance videos.
- •
We combine visual attention, semantic analysis, and temporal consistency to generate a spatial–temporal importance map, and exploit it in an optimal manner within a novel spatial–temporal 3D rectilinear grid framework. In this way, we preserve the proportions of important regions as well as perceptual coherency, minimizing both spatial and temporal distortion.
The rest of the paper is organized as follows. In Section 2, related research is reviewed. Section 3 briefly introduces our hybrid video retargeting framework. In Section 4, combining visual concepts with a visual attention model, we detect important content in videos using high-level concepts. Spatial–temporal importance determination is presented in Section 5. The 3D grid optimization that minimizes both spatial and temporal distortion is proposed in Section 6. Experimental results, including comparisons with state-of-the-art methods, are presented and discussed in Section 7. Conclusions are drawn in Section 8.
Section snippets
Image retargeting
Many content-aware retargeting methods have been proposed, such as cropping [3], [4], [5], [6], seam carving [7], [8], [9], [10], [11], warping [12], [13], [14], and hybrid approaches [15], [16], [17]. Cropping methods try to find a window of the target size that covers the most important content, and take this window as the result while completely discarding the part outside. Seam-based methods [7] find an optimal seam, which is a continuous chain of pixels, one from each row or
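As a concrete illustration of the seam-carving idea summarized above, the minimal-cost vertical seam can be found by dynamic programming over an energy map. The following is our own minimal sketch (the energy function, e.g. gradient magnitude, is assumed to be computed elsewhere), not the full method of [7]:

```python
import numpy as np

def find_vertical_seam(energy):
    """Return one column index per row forming the minimal-cost
    8-connected vertical seam through the (H, W) energy map."""
    h, w = energy.shape
    cost = energy.astype(np.float64).copy()
    # Forward pass: accumulate the cheapest path cost into each pixel.
    for i in range(1, h):
        left = np.r_[np.inf, cost[i - 1, :-1]]
        up = cost[i - 1]
        right = np.r_[cost[i - 1, 1:], np.inf]
        cost[i] += np.minimum(np.minimum(left, up), right)
    # Backward pass: trace the seam from the cheapest bottom pixel.
    seam = np.empty(h, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(h - 2, -1, -1):
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, w)
        seam[i] = lo + int(np.argmin(cost[i, lo:hi]))
    return seam
```

Removing (or duplicating) the returned pixel chain shrinks (or grows) the frame by one column while avoiding high-energy regions.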
Overview of our hybrid video retargeting
As shown in Fig. 1, our video retargeting framework has four main components: video storage, video parsing, spatial–temporal importance determination, and 3D grid optimization. Our methods and novelties lie mainly in the last three components, on which the rest of the paper focuses:
- 1.
Hierarchical parsing of videos. This component parses video contents from low-level features to high-level visual concepts to extract semantically meaningful information.
- 2.
Spatial–temporal
Domain specific video parsing
Syntactic and semantic analysis [34] are important stages in content-aware video retargeting for identifying important visual concepts as well as the important content in videos. In our research, we focus on video shot classification and visual concept extraction. We use five types of videos (surveillance, ad, news, lecture, and sports videos) to demonstrate the effectiveness of our proposed method. Video type classification can be achieved automatically
Spatial–temporal importance determination
Determining proper importance is critical for video retargeting. In this section, we combine visual concepts, visual attention, and temporal consistency to compute a spatial–temporal importance map for videos. At the frame level, we take advantage of visual attention and visual concepts to derive a semantics-aware importance map.
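One way to realize this combination is sketched below. The linear blending weights `alpha` and `beta` are our own illustrative choices, not the paper's exact formulation; the per-frame attention maps and concept masks are assumed to come from the attention model and the parsing stage, respectively:

```python
import numpy as np

def spatiotemporal_importance(attention, concept_masks, alpha=0.5, beta=0.7):
    """Blend per-frame visual attention with concept masks, then smooth
    the result across time.

    attention: list of (H, W) attention maps, values in [0, 1].
    concept_masks: list of (H, W) boolean masks of detected concepts.
    alpha: weight of attention vs. concept evidence within a frame.
    beta: temporal smoothing factor between consecutive frames.
    """
    maps, prev = [], None
    for sal, mask in zip(attention, concept_masks):
        frame_imp = alpha * sal + (1.0 - alpha) * mask.astype(float)
        if prev is not None:
            # Temporal consistency: pull toward the previous frame's map.
            frame_imp = beta * frame_imp + (1.0 - beta) * prev
        prev = frame_imp
        maps.append(frame_imp)
    return maps
```

The smoothing term prevents the importance map, and hence the grid deformation, from flickering between adjacent frames.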
Visual attention alone cannot adequately represent semantically meaningful importance. For example, in sports videos, the prominent red clothes of cheering fans
3D Grid optimization
Based on the importance map, we build a 3D rectilinear grid for temporally consistent retargeting of heterogeneous consumer videos. Our goal is to maintain the proportions of important regions and achieve visual consistency. Unlike [13], the rectilinear grid, which acts as a scaleplate, reduces structural deformation with fewer parameters.
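Because the grid is rectilinear, each axis can be resized by choosing one width per grid column (and one height per grid row). The following 1D sketch shows the core idea of importance-proportional allocation; the proportional rule and the `min_width` floor are our simplification for illustration, not the paper's optimization:

```python
import numpy as np

def rectilinear_column_widths(imp_map, target_width, min_width=1.0):
    """Distribute target_width over grid columns in proportion to summed
    column importance, flooring each column so no quad collapses.

    imp_map: (H, W) importance map of one frame.
    Returns per-column widths summing to target_width.
    """
    col_imp = imp_map.sum(axis=0) + 1e-8         # avoid all-zero columns
    widths = col_imp / col_imp.sum() * target_width
    widths = np.maximum(widths, min_width)       # keep every quad visible
    return widths / widths.sum() * target_width  # renormalize after flooring
```

Important columns thus keep close to their original proportions while low-importance columns absorb most of the shrinkage; applying the same rule per row, and coupling frames over time, extends this to the 3D grid.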
Each frame is represented as a rectilinear grid (M, Q), in which M is the 2D grid coordinate and Q is a set of quads at the diagonal from lower
Experiments
To visually demonstrate the performance of our method, we compare our retargeting with state-of-the-art methods, including uniform scaling, non-homogeneous resizing [27], seam carving [7], and the scale-and-stretch mesh method [13], on both image retargeting and video retargeting.
Since perceptual satisfaction is more important than content completeness for retargeting, we compare our results with cropping and resizing methods and conduct user studies on images and
Conclusions
We have proposed a hybrid video retargeting framework that preserves object proportions and produces coherent visual effects. Instead of relying on local saliency features alone, we integrate hierarchical semantic parsing and visual attention to build a semantic importance map as a more accurate descriptor. Combined with a temporal consistency map, we develop a 3D spatial–temporal grid framework for non-homogeneous and smooth video retargeting, which is optimized by incorporating both spatial
Acknowledgements
This work was supported by 973 Program (2010CB327905) and National Natural Science Foundation of China (61273034, 61070104, 61005027 and 61272329).
References (47)
- et al., Patchwise scaling method for content-aware image resizing, Signal Processing (2012)
- et al., Learning saliency-based visual attention: a review, Signal Processing (2013)
- et al., Video semantic analysis based on structure-sensitive anisotropic manifold ranking, Signal Processing (2009)
- comScore, comScore releases January 2012 U.S. online video rankings, 2012. URL...
- L. Shi, J. Wang, L. Duan, H. Lu, Consumer video retargeting: context assisted spatial–temporal grid optimization, in:...
- et al., A visual attention model for adapting images on small displays, ACM Multimedia Systems Journal (2004)
- H. Liu, X. Xie, W.-Y. Ma, H. Zhang, Automatic browsing of large pictures on mobile devices, in: Proceedings of ACM...
- B. Suh, H. Ling, B. Bederson, D. Jacobs, Automatic thumbnail cropping and its effectiveness, in: Proceedings of ACM...
- A. Santella, M. Agrawala, D. Decarlo, D. Salesin, M. Cohen, Gaze-based interaction for semiautomatic photo cropping,...
- et al., Seam carving for content-aware image resizing, ACM Transactions on Graphics (2007)
- Improved seam carving for video retargeting, ACM Transactions on Graphics
- Feature-aware texturing, Computer
- Optimized scale-and-stretch for image resizing, ACM Transactions on Graphics
- Multi-operator media retargeting, ACM Transactions on Graphics
- PatchMatch: a randomized correspondence algorithm for structural image editing, ACM Transactions on Graphics