1 Introduction

Recent advances in deep learning architectures, together with improvements in the communication mechanisms of Graphics Processing Unit (GPU) hardware and in software stacks, have boosted support for computationally complex tasks such as Multimodal Human Action Recognition (MHAR). Understanding activities in a multimodal information setting is a complex and resource-hungry task, and it has become an important research problem in computer vision. Audio- and video-based action recognition has many potential applications, including visual acoustics-based platforms, as described by Gao and Grauman [9]; sports analytics, as noted by Vinyes Mora and Knottenbelt [64]; smart social surveillance for COVID-19, as detailed by Huu et al.; self-driving vehicles, as identified by Kala [23]; and content-based search systems, as examined by Gibbon and Liu [13].

An action could be defined as: “Action is the most elementary human-surrounding interaction with a meaning” [19, p.2]. Action recognition typically aims to discover a class of short, segmented, atomic actions. Owing to its multifaceted nature, action recognition is variously referred to as plan recognition, goal recognition, intent recognition, behavior recognition, location estimation, event recognition and interaction recognition, as evident in Shaikh and Chai [51], Vandersmissen et al [63], Slade et al [54] and Jing et al [21]. Human Action Recognition (HAR) is the process of labeling actions performed by humans within a given sequence of images, i.e., the classification of the goals of human agents in a series of image frames.

It is observable that many objects in our daily life operate concurrently to give meaning to an activity. This interaction relies on quantifiable, yet multimodal, signals. Accordingly, joint learning of these multimodal quantities may result in better feature representations for downstream analysis. Multimodal understanding is a natural human ability: from an early age, humans apply different senses to acquire visual as well as acoustic information from their surroundings in order to understand an event. In the context of MHAR, the goal is to improve recognition performance by collecting critical features from distinct modalities. These candidate features are then combined using an optimal aggregation strategy, which contributes to the overall learning of an action category. However, recent approaches have generally ignored the contribution of audio data in enhancing action recognition. Further, multimodal data fusion has benefits that can be applied across different fields Boehm et al [3].

Audio signals have significant properties that help with efficient recognition in videos, as audio contains rich, dynamic, contextual temporal information Gaver [12]. Most importantly, audio has a much lower computational cost compared to video frames. Across an entire video, audio can also help select critical features that are useful for recognition. For example, the sound of a person talking at the start of a video can suggest that the actual action has not yet started, while the sound of an electric saw may indicate that the action is taking place. Figure 1 shows a scenario where audio features combined with video features yield a significant improvement in recognizing classes where audio features are the discriminating factor. In Fig. 1a, the sample instance was incorrectly classified as “Boxing Speed Bag” with video-only features. However, with the fusion of audio-image and video features, the same instance was correctly recognized as “Boxing Punching Bag” due to distinct audio features (shown in Fig. 1b).

The video modality contains spatial information, which is inherently helpful for CNN-based classification architectures. To better capture the multimodal aspects of action data, a recent trend has been to combine information from different modalities, such as optical flow, RGB-difference and warped optical flow Wang et al [66]. For example, as shown in Fig. 1, within a short clip of the action “Boxing a Punching Bag”, a single audio-image frame contains most of the dynamic contextual information in the audio (i.e., the sound of a boxing glove hitting the bag), while the accompanying video clip contains useful cues about the spatial dynamics.

Advanced computer vision technologies and artificial intelligence algorithms have been proposed to recognize human actions. In particular, Convolutional Neural Networks (CNNs) have been successful in image classification Paoletti et al [45], Seo and Shin [50], Wan et al [65], Sharma et al [53], digit recognition Baldominos et al [2], Tao et al [60], Kulkarni and Rajendran [27], object detection Jung et al [22], Deng et al [6] and many information retrieval domains. In action recognition research, CNN studies have predominantly used visual data, which is rich in spatial information. However, most of these works have not exploited the temporal and contextual information that lies in the accompanying audio.

Fig. 1: Illustration of an example that is (a) misclassified with unimodal (video-only) data and (b) correctly classified with multimodal data (audio-image + video) using the proposed MAiVAR framework

In this paper, we pose and seek to answer the following questions: (Q1) How well can CNNs encode audio signals using image-based representations? (Q2) How does information from one modality influence the model's ability to learn an action efficiently? (Q3) How do we fuse the information found in both modalities so that it improves model performance? While related questions may have been studied in the literature, to the best of our knowledge, no study has been conducted to answer these questions as a whole. Hence, we propose the Multimodal Audio-Image and Video Action Recognizer (MAiVAR) framework, a CNN-based audio-image to video fusion technique that can be applied to video and audio features for classifying actions. Additionally, we propose an approach to extract meaningful image representations of audio, which can significantly improve classification scores when used with CNN-based models.

The benefits of the proposed MAiVAR are threefold. Firstly, MAiVAR outperforms state-of-the-art models under the same data configuration for video representation when evaluated on the UCF51 Takahashi et al [58] and Kinetics Sounds Arandjelovic and Zisserman [1] datasets. Secondly, MAiVAR naturally supports both audio and video inputs and can be applied to different tasks without any change of architecture, even though the models we used have different architectures for video and audio feature extraction. In contrast, existing video-only models typically require variants of the RGB modality, such as optical flow, warped optical flow and RGB difference, to obtain optimal performance. Thirdly, compared with state-of-the-art video-based CNN models, experimental results show that MAiVAR features converge faster during training. To the best of our knowledge, MAiVAR is the first audio-image to video fusion-based action classification framework that uses image-based representations of audio to leverage CNN architectures for better action recognition. The key contributions of our work can be summarized as follows:

  • We introduce MAiVAR, a new multi-staged multimodal framework that supports a novel dominant head audio-visual fusion. Our fusion approach eliminates expensive fusion operations, significantly reducing the computational complexity of the model. Our framework can be applied to different tasks with minimal changes to the overall architecture.

  • We build a new feature representation strategy to select the most informative candidate representations for audio-visual fusion.

  • Unlike previous methods, we propose a high-level weights assignment algorithm for better audio-visual interaction and convergence.

  • We achieve state-of-the-art or competitive results on standard public benchmarks, validating the generalizability of our proposed approach through an extensive evaluation.

Fig. 2: Overview of our proposed multi-staged architecture. In stage 1, along the audio stream, the 2D audio is transformed into a grid-compatible representation, which is resized to \(224\times 224\), passed through the chroma diffuser for feature extraction and then linearly projected onto stage 2 for multimodal fusion. Along the visual stream, the visual input is passed through a segment diffuser and then linearly projected onto stage 2 for fusion. The output of stage 2 is used for classification with a linear layer. For the fusion network, weights from the Video MLP are used to initialize stage 2

This paper is a significant extension of our previous work Shaikh et al [52] in a number of aspects. We extend the original MAiVAR framework to validate its scalability and generalization by testing on a larger dataset (i.e., Kinetics-Sounds). We also provide a more detailed discussion of technical aspects and a more comprehensive review of the related literature to better contextualize our contribution.

The remainder of the paper is structured as follows. Section 2 describes how our work differs from other closely related works on MHAR using audio and visual modalities. Section 3 describes our proposed methodology. Section 4 discusses the dataset and the training environment of the system. Section 5 provides an analysis and comparison of the results obtained using the proposed approach, and Sect. 6 presents the conclusion of this paper.

2 Related work

Much research has been conducted in the field of MHAR Takahashi et al. [58], Long et al. [40], Tian et al. [61], Brousmiche et al. [5], Li et al. [30,31,32, 35], Li and Tang [34], where methods have been designed to combine information from distinct modalities. This section reviews existing works related to our approach, examining them in terms of feature extraction and multimodal action recognition approaches.

2.1 Feature extraction

Feature extraction is the process of deriving critical information from raw instances, which in turn contributes to the learning process. The Temporal Segment Network (TSN) performs temporal pooling of frame-level features and has been widely used as an efficient video feature extractor for a range of problems. The Gate-Shift Module (GSM) can turn a 2D CNN into a highly efficient spatio-temporal feature extractor; for example, when GSM is plugged into TSN Sudhakaran et al. [56], an accuracy improvement of 32% is achieved. Furthermore, Yang et al. [68] have used TSN with a soft attention mechanism to capture important frames from each segment. Moreover, Zhang et al. [69] have used the TSN model as a feature extractor with ResNet101 for efficient behavior recognition of pigs.

Recently, TSN has been adopted as a backbone in video understanding scenarios Lei et al. [29], Girdhar et al. [14], Li et al. [33], Zhou et al. [70], Kwon et al. [28] and is typically used in conjunction with a succeeding module. In Kwon et al. [28], TSN was employed as a 2D CNN backbone to learn motion dynamics in videos. Inception-ResNet-v2 (IRV2), in turn, has been used for feature extraction from images Mei et al. [43], helping with different image restoration and enhancement tasks Gu et al. [16], Yan et al. [67].

MAFnet Brousmiche et al. [5] is a multi-level attention-based fusion network that uses a lateral connection in the form of a Feature-wise Linear Modulation (FiLM) layer to incorporate modality conditioning between the audio and video clip streams. MAFnet uses both intermediate feature fusion and late fusion of features to project the effect of the combined features, which adds an overhead to the overall workflow. The Spatial-Temporal Network (StNet) He et al. [17] adapts IRV2 to model local and global spatio-temporal features. The closest works to ours are Takahashi et al. [58] and Wang et al. [66]. Takahashi et al. [58] classified audio events using a 3D CNN with certain representations of audio action data, whereas we propose a different audio representation strategy. Similarly, the temporal segment-based approach of Wang et al. [66] applies optical flow modality variants alongside RGB, while our work uses RGB-based features with TSN as the feature extractor for the visual stream.

2.2 Multimodal action recognition

Multimodal action recognition employs multi-stream approaches to incorporate different modalities. Motivated by successes in image classification, CNNs have also been applied in various visual action recognition works, where several approaches have been designed to exploit appearance information. TSN Wang et al. [66], TRN Zhou et al. [70] and TSM Lin et al. [37] are all based on 2D CNNs, and all three employ a two-stream approach that uses both RGB and optical flow. Besides the RGB and optical flow streams, the Temporal Binding Network (TBN) Kazakos et al. [25] adds audio as an additional modality. SlowFast Feichtenhofer et al. [8] uses two RGB streams with different resolutions and frame rates.

IMGAUD2VID Gao et al. [10] introduces an audio-aided video skimming mechanism for untrimmed videos to eliminate both short-term and long-term redundancies. It uses audio to extract dynamic scene information along with a single frame that captures most of the appearance information, in order to form an image-audio pair. These pairs are then used to select key moments from the video for action recognition. Unlike IMGAUD2VID, our approach captures more spatial information along with the holistic dynamic information of the scene from the image representations of audio.

An optimal strategy to fuse features from different modalities is critical to take maximum advantage of each modality. Existing multimodal action recognition approaches have used fusion at early Feichtenhofer et al. [7], middle Roitberg et al. [48] and late Patel et al. [47] stages of neural networks. Different levels of fusion in the network have been tested by Long et al. [39]. However, most of these approaches have used only visual modalities. In Long et al. [38], concatenation was used to fuse the visual and audio modalities. Complex fusion techniques, such as multimodal compact bilinear pooling (MCB) Gao et al. [11] or dual multimodal residual fusion (DMR) Tian et al [61] have also been studied.

In recent years, the surge in mobile device usage has underscored the need for robust security measures, particularly given the integral role smartphones play in our lives and the looming threats to user privacy. To address this, Li et al. [32] introduced ADFFDA, an innovative mobile continuous authentication system that integrates an adaptive deep feature fusion scheme and a transformer-based GAN for data augmentation, employing common smartphone sensors such as the accelerometer, gyroscope and magnetometer, and achieving a remarkably low mean equal error rate of 0.01%. Similarly, DeFFusion, another system by Li et al. [31], harnessed the same sensors but focused on CNN-based continuous authentication by transforming time-domain data into the frequency domain, with an error rate of 1.00% in a 5-second window. Furthermore, FusionAuth exploited these sensors to capture user behavior, employing two novel feature fusion strategies and achieving error rates as low as 1.47% Li et al. [30]. Besides authentication, the massive influx of community-contributed images has led to advancements in image understanding. For instance, the Deep Collaborative Embedding (DCE) model proposed by Li et al. [35] seeks to understand these images by unifying the latent space for images and tags. Concurrently, the weakly supervised deep metric learning (WDML) algorithm leverages both visual content and user tags for more efficient image retrieval, addressing challenges such as noisy or subjective tags Li and Tang [34]. These diverse yet interrelated studies collectively highlight the evolving landscape of device security and image understanding, stressing the importance of leveraging deep learning techniques and sensor data for better results.

2.3 Chromagram representations

Chroma features have been widely used in action recognition because they capture not only the musical and rhythmic structure of an action but also the spectral information of an audio signal, compactly and efficiently. The main justification for using chroma features for action recognition lies in their ability to represent the pitch content of an audio signal in a musically meaningful way. Chroma features are derived from the short-time Fourier transform (STFT) of an audio signal, where each frame of the STFT is mapped onto a 12-dimensional pitch class profile. This pitch class profile summarizes the energy of each pitch class in the frame, thereby providing a concise representation of the harmonic content of the signal. For example, in a dance performance, the rhythm of the music and the dancer’s movements are highly correlated. Chroma features can capture this correlation by highlighting the prominent pitch classes in the music and the corresponding rhythmic patterns in the dancer’s movements. Similarly, in sports action recognition, the sound of the ball being hit or kicked can be captured by chroma features, which can then be used to identify the type of sport being played. Another advantage of chroma features is their robustness to noise and variations in the audio signal. Since chroma features capture only the pitch content of an audio signal, they are less sensitive to changes in the timbre or dynamics of the signal. This makes them particularly useful for recognizing actions in noisy or challenging environments where other features may be more prone to error.
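As a concrete illustration, the following minimal sketch computes such a 12-bin pitch class profile with the Librosa library (the same library used for audio pre-processing in Sect. 4.2); the file path and STFT parameters here are placeholders rather than the paper's exact settings.

```python
import librosa
import numpy as np

# Load the audio track of an action clip (placeholder path).
y, sr = librosa.load("action_clip.wav", sr=None)

# STFT followed by mapping each frame onto 12 pitch classes: each column of
# `chroma` holds the energy of the 12 pitch classes in one analysis frame.
chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=2048, hop_length=512)

print(chroma.shape)          # (12, number_of_frames)
print(np.argmax(chroma, 0))  # dominant pitch class per frame
```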

Gouyon et al. [15] have shown that rhythmic descriptors based on chroma features are effective for musical genre classification. Moreover, chroma features are among the most effective audio features for human action recognition, particularly for recognizing actions that are accompanied by music, and they are robust to variations in the audio signal. Lidy and Rauber [36] evaluated feature sets and fusion strategies for genre classification of music, finding that chroma features are relatively robust to variations in the timbre and dynamics of the signal. Similarly, Ravanelli and Bengio used chroma features to learn speaker representations for speaker diarization and found that they are robust to noise and reverberation. Overall, the selection of chroma features for action recognition is justified by their ability to capture the musical and rhythmic structure of an action robustly and effectively. These characteristics make them a popular choice for many researchers and practitioners in the field.

3 Proposed methodology

In this section, we describe our overall proposed framework, which fuses multimodal (audio-image and video) data for action recognition. We start by providing a brief overview of our proposed approach. We then discuss individual components of the framework, which consist of video and audio streams. Furthermore, we discuss network architectures of visual and audio feature extractors. This is followed by a brief discussion about the multimodal fusion network, which fuses the features from individual streams.

Fig. 3: Scheme of the multimodal model architecture. The audio-DNN and video-DNN models output independent action predictions \(\mathcal {L}_1\) and \(\mathcal {L}_2\), respectively, based on their respective input data. The features from the average pooling layers of the audio-DNN and video-DNN are combined to feed the fusion module, which outputs the multimodal action prediction \(\mathcal {L}_{multi}\)

3.1 Overview

Figure 2 presents the schematics of the MAiVAR framework. We split each video V into k segments of equal duration, so that the ith video sample is defined as

$$\begin{aligned} V_i = \{s_1, s_2, \ldots , s_k\}. \end{aligned}$$
(1)

A short snippet is then randomly selected from each segment. We then extract visual feature maps \((\{v_1,...,v_T\},v_t\in \mathbb {R}^{D_{video}})\) and audio feature maps \((\{a_1,...,a_{T}\},a_t\in \mathbb {R}^{D_{audio}})\) with pretrained modality-specific CNNs. The feature maps are then reduced using average pooling, which extracts \(t = 1,...,T\) temporal features across \(k = 1,...,K\) modalities. Accordingly, we obtain a richer multimodal representation of the entire video. These features are then passed through a Multimodal Fusion Network (MFN), which outputs the learned classifications. The output of the framework is \(y \in \mathbb {R}^{N}\), with N being the number of classes. An example scheme of feature fusion is shown in Fig. 3. To go beyond simple fusion, we initialize the fusion network with weights from a trained Video Multi-Layer Perceptron (VMLP) model. The weights from the VMLP are injected directly into the fusion model, which helps the fusion model converge more quickly.
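The sketch below illustrates this pipeline end to end; the segment sampling and the two-layer fusion MLP are simplified stand-ins for the actual extractors and MFN, and the tensor sizes are only indicative.

```python
import torch
import torch.nn as nn

def sample_snippets(frames: torch.Tensor, k: int):
    """Split a video of shape (T, C, H, W) into k equal-duration segments
    and randomly pick one snippet (frame) from each segment."""
    segments = torch.chunk(frames, k, dim=0)
    return [seg[torch.randint(len(seg), (1,)).item()] for seg in segments]

class FusionMLP(nn.Module):
    """Simplified Multimodal Fusion Network: concatenate pooled audio and
    video features and classify them into N action classes."""
    def __init__(self, d_video, d_audio, n_classes, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_video + d_audio, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, v_feat, a_feat):
        return self.net(torch.cat([v_feat, a_feat], dim=-1))

# Toy usage with random tensors standing in for pooled CNN features.
video_features = torch.randn(4, 1024)   # pooled visual features
audio_features = torch.randn(4, 1536)   # pooled audio-image features
mfn = FusionMLP(1024, 1536, n_classes=51)
logits = mfn(video_features, audio_features)   # shape (4, 51)
```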

3.2 Visual stream

The video input has a duration of t seconds and is converted into a frame-based representation that is compatible with TSN. This is then fed as input to the TSN feature extractor, with each video clip representing one action example. We extract features from TSN using the AvgPool2d layer while pruning the original average consensus layer. TSN produces an embedding of size \(1024 \times 25\), allowing the model to capture the spatio-temporal structure of the RGB video. The resulting sequence is then fed as input to the fusion model. The advantages of this setup are: 1) the standard TSN architecture is easy to implement and reproduce, as it is available off-the-shelf in TensorFlow and PyTorch, and 2) we intend to apply transfer learning for the fusion MLP, whereby a standard architecture makes transfer learning easier. We adapted the MMAction2 interface for using TSN as a feature extractor.
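Since the exact MMAction2 configuration is not reproduced here, the following sketch shows only the general recipe with a torchvision ResNet-18 as a stand-in backbone: prune the classification head and collect per-frame pooled features for the 25 frames of a clip (the actual extractor in the paper is TSN with a BNInception backbone).

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone; the paper uses TSN (BNInception) via MMAction2.
backbone = models.resnet18(weights=None)
backbone.fc = nn.Identity()             # prune the classification layer
backbone.eval()

frames = torch.randn(25, 3, 224, 224)   # 25 frames of one video clip
with torch.no_grad():
    per_frame = backbone(frames)        # (25, 512) pooled features per frame

# Flatten frame-level features into one clip-level embedding for fusion
# (512 x 25 with this stand-in; 1024 x 25 with the BNInception backbone).
clip_embedding = per_frame.flatten()
```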

Fig. 4: Audio and video representations of data samples at different stages of the framework

3.3 Audio stream

For audio with a duration of t seconds, we convert the signal into an image-based representation. This results in a \(299\times 299\) image, referred to as an audio-image, for the whole action data sample. This audio-image is then fed as input to the IRV2 backbone. We extract the audio features from IRV2 using the AvgPool2D layer while pruning the original classification layer. IRV2 produces an embedding of size 1536, allowing the model to capture the spectral audio information through the image-based representation. The resulting sequence is then fed as input to the fusion model. The advantage of this simple setup is that the standard IRV2 architecture performs better on audio-image representations than several other CNN-based models (discussed in Sect. 5.2), while being comparatively easy to reproduce, as it is available off-the-shelf in PyTorch.
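A minimal sketch of this step, assuming the timm package provides a pretrained Inception-ResNet-v2 (with `num_classes=0`, timm returns the globally pooled 1536-dimensional embedding directly):

```python
import torch
import timm

# Pretrained IRV2 with the classification head removed: num_classes=0
# makes timm return the globally pooled features instead of class logits.
irv2 = timm.create_model("inception_resnet_v2", pretrained=True, num_classes=0)
irv2.eval()

audio_image = torch.randn(1, 3, 299, 299)  # one 299x299 audio-image
with torch.no_grad():
    audio_embedding = irv2(audio_image)    # shape (1, 1536)
```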

3.4 Segment diffuser

The segment diffuser uses a TSN-based Wang et al [66] feature extractor, pretrained on Kinetics-400, to project our visual embeddings. In TSN, the original BNInception backbone is used, and the consensus layer is removed to expose features from the AvgPool2D layer of the model. The TSN features are then fed as input to the visual multilayer perceptron, which classifies the visual features into class labels. The weights from the trained VMLP are then used to initialize the fusion network.

3.5 Chroma diffuser

The chroma diffuser uses an IRV2-based feature extractor, initialized with ImageNet weights, to create audio embeddings for the audio MLP. In IRV2, all weights were kept frozen and the last layer was removed to expose features from the average pooling layer of the model. The IRV2 features were then fed as input to the fusion module, which fuses them with the visual features.

3.6 Dominant head fusion

Dominant Head Fusion (DHF) processes the unimodal feature representations separately and then learns a joint representation using a middle layer. DHF concatenates the visual and audio feature vectors and classifies the combined feature vector into action labels. Since the VMLP model produces higher classification accuracy, the weights from the trained visual network were used to initialize the fusion network and fine-tune the DHF process.

All hidden layers, except the last fully connected layer, are equipped with Rectified Linear Unit (ReLU) nonlinearity. The fusion network was trained by minimizing the cross-entropy loss L with \(l_1\) regularization using back-propagation:

$$\begin{aligned} W^* = \arg \min _{W} \sum _{i,j} L (x^i_j+z^i_j,y^i_j,W) \end{aligned}$$
(2)

where \(x^i_j\) and \(z^i_j\) are the jth input vectors from the audio and video features, respectively, \(y^i_j\) is the corresponding class label, and W is the set of network parameters. For the audio-video fusion model, we used initialization weights from the video classification model (see Algorithm 1). Empirically, the weights from the video model boost the convergence speed of the fusion model.

Algorithm 1: Weights assignment algorithm
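Since the listing of Algorithm 1 is not reproduced here, the sketch below shows one plausible weight-transfer scheme in PyTorch: parameters of the trained VMLP whose names and shapes match layers of the fusion network are copied over before fine-tuning. The layer sizes and the matching rule are illustrative assumptions, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

def transfer_matching_weights(src: nn.Module, dst: nn.Module):
    """Copy parameters from a trained source model (VMLP) into a target model
    (fusion network) wherever parameter names and shapes match; all other
    parameters keep their fresh initialization."""
    src_state = src.state_dict()
    dst_state = dst.state_dict()
    transferred = {
        name: tensor
        for name, tensor in src_state.items()
        if name in dst_state and dst_state[name].shape == tensor.shape
    }
    dst_state.update(transferred)
    dst.load_state_dict(dst_state)
    return sorted(transferred)

# Illustrative models: a trained video-only MLP and a fusion MLP whose input
# layer takes the concatenated (video + audio) features, so only the layers
# with matching shapes (here, the output head) receive the VMLP weights.
vmlp = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 51))
fusion = nn.Sequential(nn.Linear(1024 + 1536, 1024), nn.ReLU(), nn.Linear(1024, 51))
print(transfer_matching_weights(vmlp, fusion))  # e.g. ['2.bias', '2.weight']
```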

4 Experimental setup

4.1 Datasets

We evaluated MAiVAR on two popular action recognition datasets: UCF51 Takahashi et al [58] and Kinetics Sounds Arandjelovic and Zisserman [1].

UCF51 is a subset of the standard benchmark dataset UCF101 Soomro et al [55] for action recognition. UCF101 consists of 13,320 videos covering 101 human action categories, such as Apply Eye Makeup, Blow Dry Hair and Table Tennis; however, only 6,836 videos from 51 action classes have audio channels. The average video length is 7.0 s. The dataset is partitioned into three splits for training and testing. The number of samples per class is shown in Fig. 5.

Kinetics Sounds is a subset of the Kinetics Human Action Video dataset (referred to as Kinetics-400) Kay et al [24]. The original Kinetics-400 dataset consists of 306,245 videos divided into 400 action categories with \(1-150\) clips per action. Kinetics Sounds, however, consists only of action classes that are both visually and aurally recognizable. The average video length is 9.7 seconds. The subset contains more than 22,000 videos from 27 different action classes.

Fig. 5: Number of samples per action class in UCF51 (classes with an audio channel)

4.2 Training

Input Pre-processing: We consider two modalities: RGB and audio. For RGB, video frames grouped into 25 frames per segment are used as input, following Wang et al [66] for visual pre-processing and augmentation. For audio, we used the Librosa library McFee et al [42] to generate six distinct image-based representations of the audio samples.
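For illustration, a hedged sketch of how such image-based representations could be produced with Librosa and Matplotlib is shown below; the figure sizing, DPI and output paths are our own choices rather than the paper's exact settings.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("sample_action.wav", sr=None)  # placeholder path

# Four of the six representations used in the paper; wave plots and MFCC
# feature scaling are omitted here for brevity.
representations = {
    "mfcc": librosa.feature.mfcc(y=y, sr=sr),
    "chromagram": librosa.feature.chroma_stft(y=y, sr=sr),
    "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
    "spectral_rolloff": librosa.feature.spectral_rolloff(y=y, sr=sr),
}

for name, feat in representations.items():
    fig, ax = plt.subplots(figsize=(3, 3), dpi=100)
    librosa.display.specshow(feat, sr=sr, ax=ax)    # render as an image
    ax.set_axis_off()
    fig.savefig(f"{name}.png", bbox_inches="tight", pad_inches=0)
    plt.close(fig)
# The saved images are later resized to the 299x299 input expected by IRV2.
```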


Data Augmentation: Audio-image representations were normalized using the ImageNet statistics and augmented with random horizontal and vertical flips. We followed Wang et al [66] for the visual data transformations.
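A small sketch of the corresponding torchvision transform pipeline for the audio-images (the 299-pixel resize reflects the IRV2 input size; exact parameters are assumptions):

```python
from torchvision import transforms

audio_image_transform = transforms.Compose([
    transforms.Resize((299, 299)),                    # IRV2 input size
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```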


Feature Extraction: The TSN-based Wang et al [66] feature extractor was used with BNInception Ioffe and Szegedy [20] as the backbone for visual data, while IRV2 Szegedy et al [57], which uses residual inception blocks, was used for audio feature extraction.

Environment: The training environment consisted of an NVIDIA GeForce GTX 1080 Ti GPU with 12 GB of memory and an Intel(R) Xeon(R) E5-2650 v4 CPU.


Implementation: An IRV2 Szegedy et al [57] model pretrained on the ImageNet Russakovsky et al [49] dataset was used to extract candidate feature representations from the image representations of audio. The models were implemented using the PyTorch library Paszke et al [46]. Similar to Takahashi et al [58], we used split 1 of the standard train-test splits provided with UCF101 Soomro et al [55].


Hyperparameters: The hidden layer of the MFN block contained 1024 neurons. Batch sizes of 768 for the MFN and 512 for the VMLP were used. The networks were trained with the cross-entropy loss and the Adam optimizer Kingma and Ba [26], using learning rates of \(3 \times 10^{-5}\) for the MFN and \(10^{-4}\) for the VMLP. The learned weights from the video model were transferred to the fusion model for better weight initialization.
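This configuration can be sketched as follows; the two Sequential models are placeholders standing in for the MFN and VMLP of Sect. 3.

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the MFN and VMLP described in Sect. 3.
mfn = nn.Sequential(nn.Linear(1024 + 1536, 1024), nn.ReLU(), nn.Linear(1024, 51))
vmlp = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 51))

criterion = nn.CrossEntropyLoss()
mfn_optimizer = torch.optim.Adam(mfn.parameters(), lr=3e-5)    # batch size 768
vmlp_optimizer = torch.optim.Adam(vmlp.parameters(), lr=1e-4)  # batch size 512
```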

5 Results and discussion

Tables 1 and 2 compare the recognition performance of the MAiVAR framework with previous state-of-the-art methods on the two datasets. Firstly, we compare the performance of the framework against baselines on both the UCF51 and Kinetics Sounds datasets. Secondly, we describe the considerations that shaped the design of our framework: we present a comparison of different CNN-based feature extractors, followed by a study of the efficiency of different audio-image representations. Finally, we analyze the impact of multimodal fusion compared with the unimodal alternatives. The benchmarks were reproduced using accuracy over the standard train and test split. Following the evaluation protocol of Takahashi et al [58], we used accuracy to evaluate the performance of the models, calculated as:

$$\begin{aligned} \text{Accuracy} = \frac{\sum _{i=1}^{I} TP_{i}}{\sum _{i=1}^{I} TP_{i} + \sum _{i=1}^{I} FP_{i}} \end{aligned}$$
(3)

where \(TP_{i}\) and \(FP_{i}\) denote the numbers of correct and incorrect predictions for the ith class, respectively, and I is the number of classes.
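Since the per-class true positives and false positives in Eq. (3) sum to the total number of test samples, the metric reduces to overall accuracy; a minimal sketch:

```python
import numpy as np

def accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Eq. (3): correct predictions summed over all classes, divided by the
    total number of predictions (correct plus incorrect)."""
    return float(np.mean(y_true == y_pred))

print(accuracy(np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2])))  # 0.75
```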

5.1 Comparisons with the state-of-the-art methods

System performance is evaluated on the validation samples of the datasets' action categories. Table 1 shows the performance of MAiVAR compared to other methods that use audio-visual modalities, including AENet Takahashi et al [58], C3D Tran et al [62], IMGAUD2VID Gao et al [10] and other TSN-based techniques Wang et al [66]. The UCF51 dataset comprises fewer classes, all with relevant audio information; MAiVAR was therefore able to exploit the significant features that lie in the audio signals better than the other architectures that use audio-visual features.

To demonstrate the generalization of MAiVAR, we conducted experiments on a larger dataset, Kinetics-Sounds, which also focuses on human actions in daily life and has potentially more samples that are recognizable from both acoustic and visual information. As shown in Table 2, our proposed approach yielded competitive results compared to other methods, indicating that MAiVAR is better able to benefit from both modalities. Moreover, the proposed framework demonstrated better mixing of audio and visual information (Sect. 3.6) than the other baseline models.

Table 1 Classification accuracy of MAiVAR compared to the state-of-the-art methods on UCF51 dataset
Table 2 Classification accuracy of MAiVAR compared to the state-of-the-art methods on Kinetics Sounds dataset

5.2 Ablation study

5.2.1 Audio-image feature extraction

Several experiments were conducted to determine the optimal design configuration for the proposed framework, covering: (1) different feature extractors for audio-image representations, (2) the optimal audio-image representation, and (3) the influence of hyper-parameter settings. We evaluated several CNN-based feature extractors for audio-image representations, including smaller as well as wider ResNets He et al [18], EfficientNet Tan and Le [59] and Inception-ResNet Szegedy et al [57]. Among these, IRV2 showed relatively better performance than the ResNet101 and ResNet18 variants (see Fig. 6).

Fig. 6: Performance analysis of audio-image feature extractors on the UCF51 dataset

5.2.2 Convergence of the audio representations

We also evaluated the performance of six different audio-image representations for fusion with video features: (i) wave plot, (ii) MFCC, (iii) MFCC feature scaling, (iv) spectral centroids, (v) spectral rolloff and (vi) chromagram. A sample of each audio-image representation is shown in Fig. 9, along with a visual representation of the same data sample. As the structure of the chromagram is well suited to CNN-based models, chromagram-based representations performed better than the competing representations in the fusion process. The convergence of each audio-image representation after fusion with the video modality is illustrated in Fig. 7. However, it can also be observed that the unimodal representation with the highest accuracy is not necessarily the optimal candidate for multimodal fusion: as presented in Table 3, the MFCC feature-scaling representation produced the best audio-only accuracy, whereas the chromagram-based representation produced the best features for fusion.


Fig. 7: Convergence of audio representations on the UCF51 dataset

Table 3 Validation accuracy of different audio representations before and after fusion on UCF51

5.2.3 Multimodal fusion analysis

In this section, we analyze the impact of feature fusion. Fusing the extracted audio-image and video features provides better results than using either modality without fusion. Moreover, empirical results show that, beyond simple concatenation, weights from one modality can influence fusion with the other modality. The best results are obtained when the visual features and the weights from the VMLP model interact with the audio-image features; we found that visual features and weights contribute more than the acoustic signals in this regard, which aligns with previous observations in Brousmiche et al [4, 5].

Fig. 8: t-SNE visualization of the Kinetics Sounds embeddings of different features (video (left), audio (middle) and fused (right)) along the downstream path. Note that the t-SNE embeddings do not use the class labels; labels are only used for the final visualization

Figure 8 compares the embeddings of the different features extracted at different phases of the framework (shown in Fig. 2) on Kinetics-Sounds. For compatibility with t-SNE van der Maaten and Hinton [41], we transformed the feature maps from the average pooling layer to reduce the embedding dimensions to 2D. We observed better clustering of the different action classes after the fusion of audio and video features in the framework.
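A hedged sketch of this visualization step with scikit-learn and Matplotlib follows; the feature matrix and labels are random stand-ins for the pooled embeddings, and the t-SNE settings are our own choices.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins: pooled fused features (n_samples x feature_dim) and class labels.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 2560))
labels = rng.integers(0, 27, size=500)   # 27 Kinetics Sounds classes

# t-SNE is fit on the features only; labels are used just for coloring.
embedding = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab20")
plt.title("t-SNE of fused audio-visual features")
plt.savefig("tsne_fused.png", dpi=150)
```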

Fig. 9: Examples of segmented video input and six different audio-image representations of the same action

6 Conclusion

In this paper, we proposed a multimodal audio-image to video action recognition framework called the Multimodal Audio-Image and Video Action Recognizer (MAiVAR). We generated several audio-image representations and compared their efficiency, then extracted audio and visual features using CNN-based models and fused them. The framework achieves 87.91% accuracy on the UCF51 dataset and 79.01% accuracy on the Kinetics Sounds dataset. Experiments demonstrated that the visual modality alone is not sufficient for optimal recognition and that audio provides complementary information. MAiVAR performed better in comparison with other baseline methods.

Future research could explore transformer-based architectures, such as the Video Transformer Network (VidTr) Neimark et al [44], together with other optimal feature representations. A drawback of the fusion-based model is that complex modality representations greatly increase its computational complexity. To address this issue, data and model parallelism could be used to compute the audio and visual modality streams simultaneously instead of sequentially.