Multi-stream pose convolutional neural networks for human interaction recognition in images

https://doi.org/10.1016/j.image.2021.116265

Highlights

  • We explore the contribution of poses to human–human interaction recognition in images.

  • Several pose-based representations are proposed.

  • Multiple pose-based CNN streams are utilized, and various fusion architectures for the pose-based representations are explored.

  • An extended benchmark dataset for human–human interaction recognition in images is collected.

  • Experiments show that paying more attention to poses has a positive effect on recognizing interactions.

Abstract

Recognizing human interactions in still images is quite a challenging task since, compared to videos, a single image offers only a glimpse of the interaction. This work investigates the role of human poses in recognizing human–human interactions in still images. To this end, a multi-stream convolutional neural network architecture is proposed, which fuses different levels of human pose information to recognize human interactions better. In this context, several pose-based representations are explored. Experimental evaluations on an extended benchmark dataset show that the proposed multi-stream pose Convolutional Neural Network is successful in discriminating a wide range of human–human interactions, and that human poses, when used in conjunction with the overall context, provide discriminative cues about human–human interactions.

Introduction

Human interaction recognition in images is a challenging task for the following reasons: (1) Some interactions (such as dining, partying, or speeches) can exhibit a great deal of intra-class variation. (2) While some interactions (such as kissing and handshaking) are usually performed by two people, others (such as dining, partying, or speeches) may involve hundreds or thousands of people, making the granularity of the problem quite diverse. (3) There is no constraint on image characteristics; the camera position, lighting, or clutter can vary dramatically from image to image. (4) There can be other people in the scene who are not involved in the interaction, which makes it harder to focus on the actual interaction. (5) The lack of temporal movement data makes it harder to distinguish the interaction from the background.

This paper aims at proposing a deep learning-based framework for recognizing human–human interactions from still images. Although recognizing human actions [1], [2], [3] or human–object interactions [4], [5], [6] has attracted many researchers, recognizing human–human interactions has been studied to a lesser extent, especially when it comes to still images. There are a few key features for identifying human–human interactions, such as context, scene, and poses. Among these features, one can argue that human poses are the main ingredient of an interaction being performed, since the context and scene information can change dramatically, while the poses should remain members of a recognizable set of discriminative elements.

In this context, a multi-stream convolutional neural network (CNN) architecture is formulated, which focuses mainly on pose information. Several pose-based inputs are formulated: (1) Pose Masks, where poses are extracted and represented as binary masks. (2) Pose Only Images, where poses are cropped out of the original images. (3) Pose Highlighted Images, where, instead of directly cropping poses, they are highlighted with a Gaussian mask and the background regions are de-emphasized. Fig. 1 shows examples of these pose-based representations. This work proposes to use a single stream for each of these inputs and to fuse the knowledge from each stream within a multi-stream CNN architecture. Experiments show that these pose sets are good enough to describe and discriminate interactions to a great extent. Moreover, the proposed multi-stream architecture yields better results than its single-stream counterparts.
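To make the three representations concrete, the following is a minimal sketch of how such inputs could be derived, assuming per-person binary masks are already available from an off-the-shelf person segmenter (e.g., Mask R-CNN); the function names and the Gaussian smoothing parameter are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sketch of the three pose-based inputs (assumptions, not
# the paper's exact code). `person_masks` is a list of (H, W) binary
# masks, one per person, e.g., from a Mask R-CNN person segmenter.
import numpy as np
import cv2

def pose_mask(person_masks, shape):
    """Pose Mask (PM): union of the binary person masks on a black canvas."""
    pm = np.zeros(shape[:2], dtype=np.uint8)
    for m in person_masks:
        pm = np.maximum(pm, m.astype(np.uint8))
    return pm * 255

def pose_only(image, person_masks):
    """Pose Only (PO): keep the original pixels inside person regions only."""
    keep = np.zeros(image.shape[:2], dtype=bool)
    for m in person_masks:
        keep |= m.astype(bool)
    out = np.zeros_like(image)
    out[keep] = image[keep]
    return out

def pose_highlighted(image, person_masks, sigma=31):
    """Pose Highlighted (PH): soften the hard mask with a Gaussian so the
    background is de-emphasized rather than removed (sigma is a guess)."""
    weight = pose_mask(person_masks, image.shape).astype(np.float32) / 255.0
    weight = cv2.GaussianBlur(weight, (0, 0), sigma)
    weight = np.clip(weight / (weight.max() + 1e-8), 0.0, 1.0)
    return (image.astype(np.float32) * weight[..., None]).astype(np.uint8)
```

In this sketch, the PH weighting keeps person pixels near full intensity while the surrounding background fades smoothly to black, matching the de-emphasis described above.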

The benchmark datasets that focus on human interaction recognition are mainly limited to videos [7], [8], [9]. While several datasets have been proposed recently [10], [11], [12], each has limitations: [10] is not available for download; [11] contains only three classes (family, group, and wedding photos); and [12] defines its classes as touch codes of interactions, such as hand touching hand, hand touching shoulder, and hand touching torso, which semantically differs from the problem addressed in this work. Amongst the available datasets, the dataset of [13] is the only one focusing solely on human–human interactions. Therefore, this benchmark dataset is used to test the proposed approach, and an extended version of it is also collected, since it is a relatively small dataset for training deep models.

To sum up, the main contributions of this work are:

  • A variety of pose-based inputs for interaction recognition is proposed.

  • A multi-stream pose CNN is formulated that aggregates information from different pose streams.

  • An existing benchmark dataset [13] for human interaction recognition in still images is extended.

The rest of the paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 presents the proposed approach to the problem. Section 4 introduces the dataset extended in this work. Experimental results are reported in Section 5, and finally Section 6 presents the conclusions and discusses possible future directions.

Section snippets

Related work

Recognition of human actions, human–object interactions, and human–human interactions are widely studied research areas for video sequences. However, when it comes to still images, the number of works decreases dramatically. There is a vast literature on recognizing human–human interactions in videos, such as [15], [16], [17], using traditional classification methods. Nguyen and Yoshitaka [18] use handcrafted features with a three-layer convolutional neural network to train their model.

Technical approach

The human pose is one of the most distinctive features for recognizing human–human interactions. In order to make the most of the pose information, several different representations of poses, and their utilization within CNN architectures, are explored. For this purpose, three representation forms are employed along with the original RGB inputs: (i) pose mask (PM) images to describe poses for interactions, (ii) pose only (PO) images to estimate context information, and (iii) pose highlighted (PH) images to emphasize person regions while de-emphasizing the background.
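As a rough illustration of how such streams could be fused, the PyTorch sketch below implements one plausible late-fusion variant (feature concatenation over per-representation ResNet streams); the backbone depth, feature dimensionality, and fusion scheme are assumptions, since the paper explores several fusion architectures rather than a single fixed one.

```python
# A hypothetical late-fusion multi-stream network (a sketch, not the
# authors' exact architecture). Single-channel inputs such as pose
# masks are assumed to be replicated to 3 channels beforehand.
import torch
import torch.nn as nn
from torchvision import models

class MultiStreamPoseCNN(nn.Module):
    def __init__(self, num_classes=10, num_streams=4):
        super().__init__()
        def backbone():
            net = models.resnet50(weights="IMAGENET1K_V1")
            net.fc = nn.Identity()  # expose the 2048-d pooled features
            return net
        # one stream each for RGB, pose mask, pose only, pose highlighted
        self.streams = nn.ModuleList(backbone() for _ in range(num_streams))
        self.classifier = nn.Linear(2048 * num_streams, num_classes)

    def forward(self, inputs):  # inputs: list of (B, 3, H, W) tensors
        feats = [stream(x) for stream, x in zip(self.streams, inputs)]
        return self.classifier(torch.cat(feats, dim=1))

# Usage: logits = MultiStreamPoseCNN()([rgb, pm, po, ph])
```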

Dataset

With the rise of deep learning, the hunger for more data has increased substantially. Small datasets with images at the scale of hundreds have been shown to be insufficient for deeper convolutional neural network architectures to converge [63]. For human–human interaction recognition in images, the size and the number of available datasets are also quite limited. A thorough search of the relevant literature yielded only one image dataset available online for human–human interaction recognition, namely the dataset of [13].

Implementation details

To evaluate the effectiveness of the proposed multi-stream pose networks, individual CNN streams are trained by fine-tuning the CNN architectures AlexNet [64], DenseNet [65], ResNet [62], Inception-V3 [66], and VGG-19 [67], all pretrained on ImageNet [61]. ResNet [62] is used as the main architecture for testing, tuning, and inference, since it is faster to train and generally performs better than the DenseNet, Inception, and VGG architectures. On the other hand, to make a fair comparison with …
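A minimal fine-tuning loop for a single ImageNet-pretrained stream might look like the sketch below; the ResNet-50 depth, optimizer, learning rate, and dummy data loader are all illustrative assumptions rather than the paper's reported settings.

```python
# Illustrative fine-tuning of one stream (hyperparameters are guesses,
# not the paper's settings).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 10)  # 10 interaction classes

# dummy stand-in data; a real loader would yield preprocessed images
data = TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,)))
loader = DataLoader(data, batch_size=4, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```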

Conclusion

In this work, the role of poses in classifying human–human interactions in still images is examined. To this end, multi-stream pose CNNs are introduced, where each stream operates over a different human pose representation. Three pose input variations are proposed, namely, pose masks, pose only images, and pose highlighted images. To evaluate the proposed approach, an existing benchmark dataset is extended, which contains ten classes with more than 10,000 images in total. It has been shown that the proposed multi-stream architecture yields better results than its single-stream counterparts.

CRediT authorship contribution statement

Gokhan Tanisik: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization. Cemil Zalluhoglu: Software, Investigation, Resources, Validation, Data curation. Nazli Ikizler-Cinbis: Conceptualization, Supervision, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This project was partially supported by TUBITAK, Turkey, under project no. 112E149.

References (72)

  • F. Siyahjani, S. Motiian, H. Bharthavarapu, S. Sharlemin, G. Doretto, Online geometric human interaction segmentation...
  • R. Li, P. Porfilio, T. Zickler, Finding group interactions in social clutter, in: IEEE Conference on Computer Vision...
  • S. Motiian et al., Online human interaction detection and recognition with multiple cameras, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • W. Gong et al., A new image dataset on human interactions
  • A.C. Gallagher, T. Chen, Understanding images of groups of people, in: IEEE Conference on Computer Vision and Pattern...
  • Y. Yang, S. Baker, A. Kannan, D. Ramanan, Recognizing proxemics in personal photos, in: IEEE Conference on Computer...
  • K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask R-CNN, in: IEEE International Conference on Computer Vision (ICCV),...
  • Y. Zhou, B. Ni, R. Hong, M. Wang, Q. Tian, Interaction part mining: A mid-level approach for fine-grained...
  • Y. Kong et al., Interactive phrases: Semantic descriptions for human interaction recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
  • M. Tapaswi, M. Bäuml, R. Stiefelhagen, StoryGraphs: Visualizing character interactions as a timeline, in: IEEE...
  • N. Nguyen, A. Yoshitaka, Human interaction recognition using independent subspace analysis algorithm, in: IEEE...
  • Y. Yan, B. Ni, X. Yang, Predicting human interaction via relative attention model, in: International Joint Conference...
  • S. Hochreiter et al., Long short-term memory, Neural Comput. (1997)
  • C.J. Taylor, Reconstruction of articulated objects from point correspondences in a single uncalibrated image, in: IEEE...
  • Y. Wang, H. Jiang, M.S. Drew, Z.-N. Li, G. Mori, Unsupervised discovery of action classes, in: IEEE Computer...
  • L. Li, L. Fei-Fei, What, where and who? Classifying events by scene and object recognition, in: International...
  • N. Ikizler, R.G. Cinbis, S. Pehlivan, P. Duygulu, Recognizing actions from still images, in: International Conference...
  • C. Thurau, V. Hlavac, Pose primitive based human action recognition in videos or still images, in: IEEE Conference on...
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: IEEE Computer Society Conference on...
  • V. Delaitre et al., Recognizing human actions in still images: a study of bag-of-features and part-based representations
  • B. Yao, X. Jiang, A. Khosla, A.L. Lin, L. Guibas, L. Fei-Fei, Human action recognition by learning bases of action...
  • Y. Zheng, Y. Zhang, X. Li, B. Liu, Action recognition in still images using a combination of human pose and context...
  • K. Raja, I. Laptev, P. Pérez, L. Oisel, Joint pose estimation and action recognition in image graphs, in: IEEE...
  • F. Sener et al., On recognizing actions in still images via multiple features
  • Z. Wang et al., Geometric pose affordance: 3D human pose with scene constraints (2019)
  • Z. Wang et al., Predicting camera viewpoint improves cross-dataset generalization for 3D human pose estimation
