
Cross-Attention Transformer for Video Interpolation

  • Conference paper in Computer Vision – ACCV 2022 Workshops (ACCV 2022)

Abstract

We propose TAIN (Transformers and Attention for video INterpolation), a residual neural network for video interpolation that interpolates an intermediate frame given the two consecutive frames around it. We first present a novel vision transformer module, named Cross-Similarity (CS), which globally aggregates input image features whose appearance is similar to that of the predicted interpolated frame. These CS features are then used to refine the interpolated prediction. To account for occlusions in the CS features, we propose an Image Attention (IA) module that lets the network favor the CS features of one frame over those of the other. On the Vimeo90k, UCF101, and SNU-FILM benchmarks, TAIN outperforms existing methods that do not require flow estimation and performs comparably to flow-based methods while remaining computationally efficient in terms of inference time.
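
To make the two ideas in the abstract concrete, below is a minimal sketch, not the authors' implementation: a Cross-Similarity block in which features of the current interpolation prediction attend globally (single-head dot-product cross-attention, an assumption here) to the features of one input frame, and an Image Attention gate that blends the two resulting CS feature maps per pixel. The module names `CrossSimilarity` and `ImageAttention`, the 1x1-convolution projections, and the sigmoid gate are all illustrative choices; the actual model is in the repository linked under Notes below.

```python
import torch
import torch.nn as nn


class CrossSimilarity(nn.Module):
    """Cross-attention: prediction features query one input frame's features."""

    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)  # queries from the predicted frame
        self.to_k = nn.Conv2d(dim, dim, 1)  # keys from an input frame
        self.to_v = nn.Conv2d(dim, dim, 1)  # values from the same input frame
        self.scale = dim ** -0.5

    def forward(self, pred_feat, input_feat):
        b, c, h, w = pred_feat.shape
        q = self.to_q(pred_feat).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.to_k(input_feat).flatten(2)                  # (B, C, HW)
        v = self.to_v(input_feat).flatten(2).transpose(1, 2)  # (B, HW, C)
        attn = torch.softmax((q @ k) * self.scale, dim=-1)    # global similarity
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)


class ImageAttention(nn.Module):
    """Per-pixel gate that favors CS features of one frame over the other."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(2 * dim, 1, 1)

    def forward(self, cs1, cs2):
        a = torch.sigmoid(self.gate(torch.cat([cs1, cs2], dim=1)))
        return a * cs1 + (1.0 - a) * cs2  # e.g., downweight an occluded frame


if __name__ == "__main__":
    dim = 64
    pred = torch.randn(1, dim, 32, 32)       # features of the current prediction
    f1, f2 = torch.randn(2, 1, dim, 32, 32)  # features of the two input frames
    cs = CrossSimilarity(dim)
    ia = ImageAttention(dim)
    refined = pred + ia(cs(pred, f1), cs(pred, f2))  # residual refinement
    print(refined.shape)  # torch.Size([1, 64, 32, 32])
```

Note the residual form of the last step, matching the paper's description of TAIN as a residual network: the CS/IA output refines, rather than replaces, the current prediction.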


Notes

  1. Code is available at https://github.com/hannahhalin/TAIN.


Acknowledgments

This research is based upon work supported in part by the National Science Foundation under Grant No. 1909821 and by an Amazon AWS cloud computing award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information

Correspondence to Hannah Halin Kim.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Kim, H.H., Yu, S., Yuan, S., Tomasi, C. (2023). Cross-Attention Transformer for Video Interpolation. In: Zheng, Y., Keleş, H.Y., Koniusz, P. (eds) Computer Vision – ACCV 2022 Workshops. ACCV 2022. Lecture Notes in Computer Science, vol 13848. Springer, Cham. https://doi.org/10.1007/978-3-031-27066-6_23


  • DOI: https://doi.org/10.1007/978-3-031-27066-6_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-27065-9

  • Online ISBN: 978-3-031-27066-6

  • eBook Packages: Computer Science, Computer Science (R0)
