Exploring Neighbor Correspondence Matching for Multiple-hypotheses Video Frame Synthesis

Abstract
Video frame synthesis, which encompasses interpolation and extrapolation, is an essential video processing technique applicable to various scenarios. However, most existing methods cannot handle small objects or large motion well, especially in high-resolution videos such as 4K videos. To overcome these limitations, we introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis. Since the current frame is not available in video frame synthesis, NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel. Building on the strong motion representation capability of NCM, we propose a heterogeneous coarse-to-fine scheme for intermediate flow estimation. The coarse-scale and fine-scale modules are trained progressively, making NCM computationally efficient and robust to large motion. We further explore the mechanism of NCM and find that neighbor correspondence is powerful because it provides multiple motion hypotheses for synthesis. Based on this analysis, we introduce a multiple-hypotheses estimation process for video frame extrapolation, resulting in a more robust framework, NCM-MH. Experimental results show that NCM and NCM-MH achieve 31.63 dB and 28.08 dB for interpolation and extrapolation, respectively, on the most challenging X4K1000FPS benchmark, outperforming all other state-of-the-art methods that take two reference frames as input.
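To give a rough intuition for correspondence matching in a local neighborhood (this is an illustrative sketch, not the authors' NCM implementation, which additionally matches across temporal neighbors and multiple scales in a current-frame-agnostic fashion): each pixel's feature vector in one reference frame is correlated against features within a small spatial window of the other reference frame, and the best-correlated displacement is kept as a coarse motion estimate. The feature maps, window radius, and dot-product similarity below are all illustrative assumptions.

```python
import numpy as np

def local_correspondence(f0, f1, radius=2):
    """For each pixel of feature map f0, correlate its feature vector with
    features of f1 inside a (2*radius+1)^2 spatial window, and return the
    displacement (dy, dx) with the highest dot-product similarity.

    f0, f1: arrays of shape (H, W, C). Returns an (H, W, 2) integer array of
    per-pixel displacements, i.e. a coarse discrete flow estimate.
    """
    H, W, _ = f0.shape
    pad = radius
    # Pad f1 so every window stays in bounds near the image border.
    f1p = np.pad(f1, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    flow = np.zeros((H, W, 2), dtype=np.int64)
    best = np.full((H, W), -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # f1 shifted by (dy, dx), aligned with f0.
            shifted = f1p[pad + dy : pad + dy + H, pad + dx : pad + dx + W]
            corr = np.sum(f0 * shifted, axis=-1)  # dot-product similarity
            better = corr > best
            best = np.where(better, corr, best)
            flow[..., 0] = np.where(better, dy, flow[..., 0])
            flow[..., 1] = np.where(better, dx, flow[..., 1])
    return flow
```

Note that keeping only the argmax discards the rest of the correlation window; retaining the full window of similarity scores is what provides the multiple motion hypotheses exploited by NCM-MH.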