Multi-stream pose convolutional neural networks for human interaction recognition in images
Introduction
Human interaction recognition in images is a challenging task for the following reasons: (1) Some interactions (such as dining, partying, or giving a speech) exhibit a great deal of intra-class variation. (2) While some interactions (such as kissing and handshaking) are usually performed by two people, others (such as dining, partying, or speeches) may involve hundreds or thousands of people, making the granularity of the problem quite diverse. (3) There is no constraint on image characteristics; the camera position, amount of light, or clutter can vary dramatically from image to image. (4) There can be people in the scene who are not involved in the interaction, which makes it harder to focus on the actual interaction. (5) The lack of temporal motion data makes it harder to distinguish the interaction from the background.
This paper proposes a deep learning-based framework for recognizing human–human interactions in still images. Although recognizing human actions [1], [2], [3] or human–object interactions [4], [5], [6] has attracted many researchers, human–human interactions have been studied to a lesser extent, especially in still images. A few key cues help identify human–human interactions, such as context, scene, and poses. Among these, one can argue that human poses are the main ingredient of an interaction being performed, since context and scene information can change dramatically, while the poses remain within a recognizable set of discriminative elements.
In this context, a multi-stream convolutional neural network (CNN) architecture is formulated, which focuses mainly on pose information. Several pose-based inputs are formulated: (1) Pose Mask images, where poses are extracted and represented as binary masks. (2) Pose Only images, where poses are cropped out of the original images. (3) Pose Highlighted images, where, instead of directly cropping the poses, they are emphasized with a Gaussian mask that de-emphasizes the background regions. Fig. 1 shows examples of these pose-based representations. This work uses a single stream for each of these inputs and fuses the knowledge from the streams within a multi-stream CNN architecture. Experiments show that these pose representations describe and discriminate interactions to a great extent. Moreover, the proposed multi-stream architecture yields better results than its single-stream counterparts.
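The Gaussian highlighting idea can be sketched as follows. This is a minimal numpy illustration, not the authors' exact formulation: it assumes person regions are already available as hypothetical bounding boxes (`boxes`), places a Gaussian bump over each person, and dims the background pixels.

```python
import numpy as np

def gaussian_pose_mask(h, w, boxes, sigma_scale=0.5):
    """Combine per-person Gaussian bumps into one attention mask in [0, 1].
    `boxes` are hypothetical (x, y, box_w, box_h) person regions."""
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w))
    for (x, y, bw, bh) in boxes:
        cx, cy = x + bw / 2.0, y + bh / 2.0
        sx, sy = sigma_scale * bw, sigma_scale * bh
        bump = np.exp(-(((xx - cx) ** 2) / (2 * sx ** 2)
                        + ((yy - cy) ** 2) / (2 * sy ** 2)))
        mask = np.maximum(mask, bump)  # overlapping people: keep the stronger bump
    return mask

def highlight(image, boxes, dim=0.2):
    """De-emphasize the background: keep pose regions bright, dim the rest."""
    m = gaussian_pose_mask(image.shape[0], image.shape[1], boxes)
    weight = dim + (1.0 - dim) * m[..., None]
    return (image * weight).astype(image.dtype)
```

The `dim` floor keeps a faint trace of the background rather than removing it entirely, which distinguishes this input from the hard-cropped Pose Only images.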
The benchmark datasets that focus on human interaction recognition are mainly limited to videos [7], [8], [9]. Although several image datasets have been proposed recently [10], [11], [12], each has limitations: [10] is not available for download; [11] contains only three classes (family, group, and wedding photos); and [12] defines its classes as touch codes of interactions (hand touch hand, hand touch shoulder, hand touch torso), which semantically differs from the problem addressed in this work. Among the available datasets, that of [13] is the only one focusing solely on human–human interactions. Therefore, this benchmark dataset is used to evaluate the proposed approach, and an extended version of it is collected, since the original is relatively small for training deep models.
To sum up, the main contributions of this work are:
- A variety of pose-based inputs for interaction recognition are proposed.
- A multi-stream pose CNN is formulated that aggregates information from different pose streams.
- An existing benchmark dataset [13] for human interaction recognition in still images is extended.
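The aggregation across pose streams can be illustrated with a simple late-fusion sketch. The snippet below is an assumption-laden simplification (the exact fusion layer is not specified here): it averages the per-stream class posteriors, optionally with per-stream weights, for hypothetical RGB, pose mask, pose only, and pose highlighted streams.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_streams(stream_logits, weights=None):
    """Weighted average of per-stream class posteriors (late fusion).
    `stream_logits`: list of (num_classes,) arrays, one per stream."""
    probs = np.stack([softmax(z) for z in stream_logits])
    if weights is None:
        weights = np.full(len(stream_logits), 1.0 / len(stream_logits))
    else:
        weights = np.asarray(weights, dtype=float)
    fused = (weights[:, None] * probs).sum(axis=0)
    return fused, int(fused.argmax())
```

Score averaging lets a confident pose stream override a weak RGB stream (and vice versa) without retraining the individual networks; learned fusion layers are a common alternative.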
The rest of the paper is organized as follows. Section 2 gives a brief overview of related work. Section 3 presents the proposed approach. Section 4 introduces the extended dataset. Experimental results are reported in Section 5, and Section 6 presents the conclusions and discusses possible future directions.
Related work
Recognition of human actions, human–object interactions, and human–human interactions are widely studied research areas on video sequences. However, when it comes to still images, the number of works decreases dramatically. There is a vast literature on recognizing human–human interactions in videos, e.g. [15], [16], [17], using traditional classification methods. Nguyen and Yoshitaka [18] use handcrafted features with a three-layer convolutional neural network to train their model.
Technical approach
The human pose is one of the most distinctive features for recognizing human–human interactions. In order to make the most of the pose information, several different representations of poses, and their utilization within CNN architectures, are explored. In accordance with this purpose, three representation forms are employed along with the original RGB inputs: (i) pose mask (PM) images to describe poses for interactions, (ii) pose only (PO) images to estimate context information, and (iii) pose highlighted (PH) images, where poses are emphasized with a Gaussian mask.
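The first two representations can be sketched in a few lines. This is a minimal illustration under a stated assumption: per-person binary silhouette masks are already available (e.g. from some segmentation or pose-estimation step, not shown), and the PM/PO images are derived from their union.

```python
import numpy as np

def pose_inputs(image, person_masks):
    """Build two pose-based inputs from hypothetical per-person
    binary silhouette masks (each H x W, bool).
    - pose mask (PM): union of person silhouettes as a binary image
    - pose only (PO): original pixels kept inside silhouettes, zero elsewhere
    """
    union = np.zeros(image.shape[:2], dtype=bool)
    for m in person_masks:
        union |= m
    pm = union.astype(np.uint8) * 255      # binary mask image
    po = image * union[..., None]          # background zeroed out
    return pm, po
```

PM discards appearance entirely and keeps only the pose silhouettes, while PO keeps appearance inside the person regions; the PH variant (Gaussian highlighting) sits between the two.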
Dataset
With the rise of deep learning, the hunger for more data has increased substantially. Small datasets with images at the scale of hundreds have been shown to be insufficient for deeper convolutional neural network architectures to converge [63]. For human–human interaction recognition in images, the size and the number of available datasets are also quite limited. A thorough search of the relevant literature yielded only one image dataset available online for human–human interaction recognition, namely that of [13].
Implementation details
To evaluate the effectiveness of the proposed multi-stream pose networks, individual CNN streams are trained by fine-tuning the AlexNet [64], DenseNet [65], ResNet [62], Inception-V3 [66], and VGG-19 [67] architectures pretrained on ImageNet [61]. ResNet [62] is used as the main architecture for testing, tuning, and inference, since it is faster to train and generally performs better than the DenseNet, Inception, and VGG architectures. On the other hand, to make a fair comparison with
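The core idea of fine-tuning a pretrained stream can be illustrated framework-free. The sketch below is only a toy analogue of what a deep-learning framework does: it assumes backbone features are already extracted and frozen, and trains a fresh softmax classification head on them with gradient descent. The actual streams fine-tune full CNNs, which this deliberately does not reproduce.

```python
import numpy as np

def train_head(features, labels, num_classes, lr=0.5, epochs=200):
    """Train only a new softmax head on features from a frozen,
    pretrained backbone -- the simplest form of fine-tuning.
    `features`: (N, D) array of backbone activations (assumed given)."""
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((features.shape[1], num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)       # softmax probabilities
        g = (p - onehot) / len(labels)             # cross-entropy gradient
        W -= lr * features.T @ g                   # update head only;
        b -= lr * g.sum(axis=0)                    # backbone stays frozen
    return W, b
```

In practice the frameworks used with AlexNet, ResNet, and the other backbones also allow unfreezing earlier layers at a lower learning rate; only the replace-and-retrain-the-head pattern is shown here.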
Conclusion
In this work, the role of poses in classifying human–human interactions in still images is examined. To this end, multi-stream pose CNNs are introduced, where each stream operates over a different human pose representation. Three pose input variations are proposed, namely pose masks, pose only images, and pose highlighted images. To evaluate the proposed approach, an existing benchmark dataset is extended to ten classes with more than 10,000 images in total. It has been shown that the proposed multi-stream architecture yields better results than its single-stream counterparts.
CRediT authorship contribution statement
Gokhan Tanisik: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing - original draft, Visualization. Cemil Zalluhoglu: Software, Investigation, Resources, Validation, Data curation. Nazli Ikizler-Cinbis: Conceptualization, Supervision, Writing - review & editing, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This project was partially supported by TUBITAK, Turkey, project no. 112E149.
References (72)
- et al., Facial descriptors for human interaction recognition in still images, Pattern Recognit. Lett. (2016)
- et al., Going deeper into action recognition: A survey, Image Vision Comput. (2017)
- et al., Multi-stream CNN: Learning representations based on human-related regions for action recognition, Pattern Recognit. (2018)
- et al., Region based multi-stream convolutional neural networks for collective activity recognition, J. Vis. Commun. Image Represent. (2019)
- et al., Long-term temporal convolutions for action recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2018)
- R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, B. Russell, Actionvlad: Learning spatio-temporal aggregation for action...
- et al., An end-to-end spatio-temporal attention model for human action recognition from skeleton data
- et al., Learning human-object interactions by graph parsing neural networks
- Y. Chao, Y. Liu, X. Liu, H. Zeng, J. Deng, Learning to detect human-object interactions, in: IEEE Winter Conference on...
- et al., Learning models for actions and person-object interactions with transfer to question answering
- Online human interaction detection and recognition with multiple cameras, IEEE Trans. Circuits Syst. Video Technol.
- A new image dataset on human interactions
- Interactive phrases: Semantic descriptions for human interaction recognition, IEEE Trans. Pattern Anal. Mach. Intell.
- Long short-term memory, Neural Comput.
- Recognizing human actions in still images: a study of bag-of-features and part-based representations
- On recognizing actions in still images via multiple features
- Geometric pose affordance: 3d human pose with scene constraints
- Predicting camera viewpoint improves cross-dataset generalization for 3d human pose estimation