Learning Video-Independent Eye Contact Segmentation from In-the-Wild Videos

Wu, Tianyi; Sugano, Yusuke

doi:10.1007/978-3-031-26316-3_4

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13844)

Included in the following conference series:

Asian Conference on Computer Vision

Abstract

Human eye contact is a form of non-verbal communication and can have a great influence on social behavior. Since the location and size of eye contact targets vary across videos, learning a generic, video-independent eye contact detector remains a challenging task. In this work, we address the task of one-way eye contact detection for videos in the wild. Our goal is to build a unified model that can identify when a person is looking at their gaze targets in an arbitrary input video. Considering that this requires time-series relative eye movement information, we propose to formulate the task as a temporal segmentation problem. Due to the scarcity of labeled training data, we further propose a gaze target discovery method to generate pseudo-labels for unlabeled videos, which allows us to train a generic eye contact segmentation model in an unsupervised way using in-the-wild videos. To evaluate our proposed approach, we manually annotated a test dataset consisting of 52 videos of human conversations. Experimental results show that our eye contact segmentation model outperforms the previous video-dependent eye contact detector and achieves 71.88% framewise accuracy on our annotated test set. Our code and evaluation dataset are available at https://github.com/ut-vision/Video-Independent-ECS.
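As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of framewise accuracy for a single video, assuming binary per-frame labels (1 for eye contact, 0 otherwise). The function name `framewise_accuracy` and the label encoding are assumptions made for this example and are not taken from the paper or its repository. Whether the reported figure pools frames across videos or averages per-video scores is also not stated in the abstract.

```python
from typing import Sequence


def framewise_accuracy(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Fraction of frames whose predicted label equals the annotated label.

    Hypothetical helper: labels are assumed to be 1 (eye contact) or 0 (none),
    with one entry per video frame.
    """
    if len(pred) != len(truth):
        raise ValueError("prediction and annotation must cover the same frames")
    if len(truth) == 0:
        raise ValueError("cannot score an empty video")
    # Count frames where the prediction matches the annotation.
    correct = sum(1 for p, t in zip(pred, truth) if p == t)
    return correct / len(truth)


# Toy example: four frames, one mismatch at the second frame.
score = framewise_accuracy([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case three of four frames agree, so `score` is 0.75.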

Acknowledgement

This work was supported by JST CREST Grant Number JPMJCR1781.

Author information

Authors and Affiliations

Institute of Industrial Science, The University of Tokyo, Tokyo, Japan
Tianyi Wu & Yusuke Sugano

Corresponding author

Correspondence to Tianyi Wu.

Editor information

Editors and Affiliations

University of Wollongong, Wollongong, NSW, Australia
Lei Wang
University of Bonn, Bonn, Germany
Juergen Gall
University of Adelaide, Adelaide, SA, Australia
Tat-Jun Chin
National Institute of Informatics, Tokyo, Japan
Imari Sato
Johns Hopkins University, Baltimore, MD, USA
Rama Chellappa

About this paper

Cite this paper

Wu, T., Sugano, Y. (2023). Learning Video-Independent Eye Contact Segmentation from In-the-Wild Videos. In: Wang, L., Gall, J., Chin, TJ., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13844. Springer, Cham. https://doi.org/10.1007/978-3-031-26316-3_4

DOI: https://doi.org/10.1007/978-3-031-26316-3_4
Published: 02 March 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26315-6
Online ISBN: 978-3-031-26316-3
eBook Packages: Computer Science, Computer Science (R0)
