Exploring Neighbor Correspondence Matching for Multiple-hypotheses Video Frame Synthesis

Abstract
Video frame synthesis, which encompasses interpolation and extrapolation, is an essential video processing technique applicable to various scenarios. However, most existing methods cannot handle small objects or large motion well, especially in high-resolution videos such as 4K videos. To overcome these limitations, we introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis. Since the current frame is not available in video frame synthesis, NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel. Building on the strong motion representation capability of NCM, we propose a heterogeneous coarse-to-fine scheme for intermediate flow estimation. The coarse-scale and fine-scale modules are trained progressively, making NCM computationally efficient and robust to large motion. We further explore the mechanism of NCM and find that neighbor correspondence is powerful because it provides multiple motion hypotheses for synthesis. Based on this analysis, we introduce a multiple-hypotheses estimation process for video frame extrapolation, resulting in a more robust framework, NCM-MH. Experimental results show that NCM and NCM-MH achieve 31.63 dB and 28.08 dB for interpolation and extrapolation, respectively, on the most challenging X4K1000FPS benchmark, outperforming all other state-of-the-art methods that take two reference frames as input.
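To give a rough intuition for correspondence matching in a local neighborhood (this is an illustrative sketch, not the authors' NCM implementation, which additionally matches across temporal neighbors and multiple scales in a current-frame-agnostic fashion): each pixel's feature vector in one reference frame is correlated against features within a small spatial window of the other reference frame, and the best-correlated displacement is kept as a coarse motion estimate. The feature maps, window radius, and dot-product similarity below are all illustrative assumptions.

```python
import numpy as np

def local_correspondence(f0, f1, radius=2):
    """For each pixel of feature map f0, correlate its feature vector with
    features of f1 inside a (2*radius+1)^2 spatial window, and return the
    displacement (dy, dx) with the highest dot-product similarity.

    f0, f1: arrays of shape (H, W, C). Returns an (H, W, 2) integer array of
    per-pixel displacements, i.e. a coarse discrete flow estimate.
    """
    H, W, _ = f0.shape
    pad = radius
    # Pad f1 so every window stays in bounds near the image border.
    f1p = np.pad(f1, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    flow = np.zeros((H, W, 2), dtype=np.int64)
    best = np.full((H, W), -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # f1 shifted by (dy, dx), aligned with f0.
            shifted = f1p[pad + dy : pad + dy + H, pad + dx : pad + dx + W]
            corr = np.sum(f0 * shifted, axis=-1)  # dot-product similarity
            better = corr > best
            best = np.where(better, corr, best)
            flow[..., 0] = np.where(better, dy, flow[..., 0])
            flow[..., 1] = np.where(better, dx, flow[..., 1])
    return flow
```

Note that keeping only the argmax discards the rest of the correlation window; retaining the full window of similarity scores is what provides the multiple motion hypotheses exploited by NCM-MH.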