Multi-stream pose convolutional neural networks for human interaction recognition in images
Introduction
Human interaction recognition in images is a challenging task for the following reasons: (1) Some interactions (such as dining, partying, or giving a speech) exhibit a great deal of intra-class variation. (2) While some interactions (such as kissing and handshaking) are usually performed by two people, others (such as dining, partying, or speeches) may involve hundreds or thousands of people, making the granularity of the problem quite diverse. (3) There is no constraint on image characteristics; the camera position, amount of light, or clutter can vary dramatically from image to image. (4) There can be people in the scene who are not involved in the interaction, which makes it harder to focus on the actual interaction. (5) The lack of temporal motion data makes it harder to distinguish the interaction from the background.
This paper proposes a deep learning-based framework for recognizing human–human interactions in still images. Although recognizing human actions [1], [2], [3] or human–object interactions [4], [5], [6] has attracted many researchers, human–human interactions have been studied to a lesser extent, especially in still images. A few key cues help identify human–human interactions, such as context, scene, and poses. Among these, one can argue that human poses are the main ingredient of an interaction being performed, since context and scene information can change dramatically, while the poses remain within a recognizable set of discriminative elements.
In this context, a multi-stream convolutional neural network (CNN) architecture is formulated, which focuses mainly on pose information. Several pose-based inputs are formulated: (1) Pose Mask images, where poses are extracted and represented as binary masks. (2) Pose Only images, where poses are cropped out of the original images. (3) Pose Highlighted images, where, instead of directly cropping the poses, they are emphasized with a Gaussian mask that de-emphasizes the background regions. Fig. 1 shows examples of these pose-based representations. This work uses a single stream for each of these inputs and fuses the knowledge from the streams within a multi-stream CNN architecture. Experiments show that these pose representations describe and discriminate interactions to a great extent. Moreover, the proposed multi-stream architecture yields better results than its single-stream counterparts.
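The Gaussian highlighting idea can be sketched as follows. This is a minimal numpy illustration, not the authors' exact formulation: it assumes person regions are already available as hypothetical bounding boxes (`boxes`), places a Gaussian bump over each person, and dims the background pixels.

```python
import numpy as np

def gaussian_pose_mask(h, w, boxes, sigma_scale=0.5):
    """Combine per-person Gaussian bumps into one attention mask in [0, 1].
    `boxes` are hypothetical (x, y, box_w, box_h) person regions."""
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w))
    for (x, y, bw, bh) in boxes:
        cx, cy = x + bw / 2.0, y + bh / 2.0
        sx, sy = sigma_scale * bw, sigma_scale * bh
        bump = np.exp(-(((xx - cx) ** 2) / (2 * sx ** 2)
                        + ((yy - cy) ** 2) / (2 * sy ** 2)))
        mask = np.maximum(mask, bump)  # overlapping people: keep the stronger bump
    return mask

def highlight(image, boxes, dim=0.2):
    """De-emphasize the background: keep pose regions bright, dim the rest."""
    m = gaussian_pose_mask(image.shape[0], image.shape[1], boxes)
    weight = dim + (1.0 - dim) * m[..., None]
    return (image * weight).astype(image.dtype)
```

The `dim` floor keeps a faint trace of the background rather than removing it entirely, which distinguishes this input from the hard-cropped Pose Only images.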
The benchmark datasets that focus on human interaction recognition are mainly limited to videos [7], [8], [9]. Although several image datasets have been proposed recently [10], [11], [12], each has limitations: [10] is not available for download; [11] contains only three classes (family, group, and wedding photos); and [12] defines its classes as touch codes of interactions (hand touch hand, hand touch shoulder, hand touch torso), which semantically differs from the problem addressed in this work. Among the available datasets, that of [13] is the only one focusing solely on human–human interactions. Therefore, this benchmark dataset is used to evaluate the proposed approach, and an extended version of it is collected, since the original is relatively small for training deep models.
To sum up, the main contributions of this work are:
- A variety of pose-based inputs for interaction recognition are proposed.
- A multi-stream pose CNN is formulated that aggregates information from different pose streams.
- An existing benchmark dataset [13] for human interaction recognition in still images is extended.
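The aggregation across pose streams can be illustrated with a simple late-fusion sketch. The snippet below is an assumption-laden simplification (the exact fusion layer is not specified here): it averages the per-stream class posteriors, optionally with per-stream weights, for hypothetical RGB, pose mask, pose only, and pose highlighted streams.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(stream_logits, weights=None):
    """Weighted average of per-stream class posteriors (late fusion).
    `stream_logits`: list of (num_classes,) arrays, one per stream."""
    probs = np.stack([softmax(z) for z in stream_logits])
    if weights is None:
        weights = np.full(len(stream_logits), 1.0 / len(stream_logits))
    else:
        weights = np.asarray(weights, dtype=float)
    fused = (weights[:, None] * probs).sum(axis=0)
    return fused, int(fused.argmax())
```

Score averaging lets a confident pose stream override a weak RGB stream (and vice versa) without retraining the individual networks; learned fusion layers are a common alternative.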
The rest of the paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 presents the proposed approach. Section 4 introduces the extended dataset. Experimental results are reported in Section 5, and Section 6 presents the conclusions and discusses possible future directions.
Related work
Recognition of human actions, human–object interactions, and human–human interactions are widely studied research areas on video sequences. However, when it comes to still images, the number of works decreases dramatically. There is a vast literature on recognizing human–human interactions in videos, e.g. [15], [16], [17], using traditional classification methods. Nguyen and Yoshitaka [18] use handcrafted features with a three-layer convolutional neural network to train their model.
Technical approach
The human pose is one of the most distinctive features for recognizing human–human interactions. In order to make the most of the pose information, several different representations of poses, and their utilization within CNN architectures, are explored. In accordance with this purpose, three representation forms are employed along with the original RGB inputs: (i) pose mask (PM) images to describe poses for interactions, (ii) pose only (PO) images to estimate context information, and (iii) pose highlighted (PH) images, where poses are emphasized with a Gaussian mask.
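The first two representations can be sketched in a few lines. This is a minimal illustration under a stated assumption: per-person binary silhouette masks are already available (e.g. from some segmentation or pose-estimation step, not shown), and the PM/PO images are derived from their union.

```python
import numpy as np

def pose_inputs(image, person_masks):
    """Build two pose-based inputs from hypothetical per-person
    binary silhouette masks (each H x W, bool).
    - pose mask (PM): union of person silhouettes as a binary image
    - pose only (PO): original pixels kept inside silhouettes, zero elsewhere
    """
    union = np.zeros(image.shape[:2], dtype=bool)
    for m in person_masks:
        union |= m
    pm = union.astype(np.uint8) * 255      # binary mask image
    po = image * union[..., None]          # background zeroed out
    return pm, po
```

PM discards appearance entirely and keeps only the pose silhouettes, while PO keeps appearance inside the person regions; the PH variant (Gaussian highlighting) sits between the two.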
Dataset
With the rise of deep learning, the hunger for more data has increased substantially. Small datasets with images at the scale of hundreds have been shown to be insufficient for deeper convolutional neural network architectures to converge [63]. For human–human interaction recognition in images, the size and the number of available datasets are also quite limited. A thorough search of the relevant literature yielded only one image dataset available online for human–human interaction recognition, namely that of [13].
Implementation details
To evaluate the effectiveness of the proposed multi-stream pose networks, individual CNN streams are trained by fine-tuning the AlexNet [64], DenseNet [65], ResNet [62], Inception-V3 [66], and VGG-19 [67] architectures pretrained on ImageNet [61]. ResNet [62] is used as the main architecture for testing, tuning, and inference, since it is faster to train and generally performs better than the DenseNet, Inception, and VGG architectures. On the other hand, to make a fair comparison with
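The core idea of fine-tuning a pretrained stream can be illustrated framework-free. The sketch below is only a toy analogue of what a deep-learning framework does: it assumes backbone features are already extracted and frozen, and trains a fresh softmax classification head on them with gradient descent. The actual streams fine-tune full CNNs, which this deliberately does not reproduce.

```python
import numpy as np

def train_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Train only a new softmax head on features from a frozen,
    pretrained backbone -- the simplest form of fine-tuning.
    `features`: (N, D) array of backbone activations (assumed given)."""
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((features.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)       # softmax probabilities
        g = (p - onehot) / len(labels)             # cross-entropy gradient
        W -= lr * features.T @ g                   # update head only;
        b -= lr * g.sum(axis=0)                    # backbone stays frozen
    return W, b
```

In practice the frameworks used with AlexNet, ResNet, and the other backbones also allow unfreezing earlier layers at a lower learning rate; only the replace-and-retrain-the-head pattern is shown here.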
Conclusion
In this work, the role of poses in classifying human–human interactions in still images is examined. To this end, multi-stream pose CNNs are introduced, where each stream operates over a different human pose representation. Three pose input variations are proposed, namely pose masks, pose only images, and pose highlighted images. To evaluate the proposed approach, an existing benchmark dataset is extended to ten classes with more than 10,000 images in total. It has been shown that the proposed multi-stream architecture yields better results than its single-stream counterparts.
CRediT authorship contribution statement
Gokhan Tanisik: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization. Cemil Zalluhoglu: Software, Investigation, Resources, Validation, Data curation. Nazli Ikizler-Cinbis: Conceptualization, Supervision, Writing - review & editing, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This project was partially supported by TUBITAK, Turkey, project no. 112E149.
References (72)
- et al., Facial descriptors for human interaction recognition in still images, Pattern Recognit. Lett. (2016)
- et al., Going deeper into action recognition: A survey, Image Vision Comput. (2017)
- et al., Multi-stream CNN: Learning representations based on human-related regions for action recognition, Pattern Recognit. (2018)
- et al., Region based multi-stream convolutional neural networks for collective activity recognition, J. Vis. Commun. Image Represent. (2019)
- et al., Long-term temporal convolutions for action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, B. Russell, Actionvlad: Learning spatio-temporal aggregation for action...
- et al., An end-to-end spatio-temporal attention model for human action recognition from skeleton data
- et al., Learning human-object interactions by graph parsing neural networks
- Y. Chao, Y. Liu, X. Liu, H. Zeng, J. Deng, Learning to detect human-object interactions, in: IEEE Winter Conference on...
- et al., Learning models for actions and person-object interactions with transfer to question answering
- Online human interaction detection and recognition with multiple cameras, IEEE Trans. Circuits Syst. Video Technol.
- A new image dataset on human interactions
- Interactive phrases: Semantic descriptions for human interaction recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Long short-term memory, Neural Comput.
- Recognizing human actions in still images: a study of bag-of-features and part-based representations
- On recognizing actions in still images via multiple features
- Geometric pose affordance: 3d human pose with scene constraints
- Predicting camera viewpoint improves cross-dataset generalization for 3d human pose estimation