Modeling Long-Term Multimodal Representations for Active Speaker Detection With Spatio-Positional Encoder

In this study, we present an end-to-end framework for active speaker detection to achieve robust performance in challenging scenarios with multiple speakers. In contrast to recent approaches, which rely heavily on the visual relational context between all speakers in a video frame, we propose collaboratively learning multimodal representations based on the audio and visual signals of a single candidate. Firstly, we propose a spatio-positional encoder to effectively address the problem of false detections caused by indiscernible faces in a video frame. Secondly, we present an efficient multimodal approach that models the long-term temporal contextual interactions between audio and visual modalities. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that our framework notably outperforms recent state-of-the-art approaches under challenging multi-speaker settings. Additionally, the proposed framework significantly improves the robustness against auditory and visual noise interference without relying on pre-trained networks or hand-crafted training strategies.


I. INTRODUCTION
Active speaker detection (ASD) aims to determine who speaks in a video by combining the analysis of both facial movements and voice. This multimodal (audio-visual) task is a key component in a wide range of modern practical applications, such as human-robot interaction [1], [2], [3], speech separation [4], and speaker tracking [5], [6]. In particular, by identifying speech for on-screen speakers, ASD improves audio-visual speaker diarization [7], [35], [36], [37]. However, earlier efforts on ASD were limited to short sequences of frontal faces, which did not accurately reflect the complexities of real-world situations [8], [9], [10]. With the release of the AVA-ActiveSpeaker dataset [13], the first large-scale standard benchmark for ASD, several high-performing large networks have been proposed to model various types of relational information in audio-visual scenes.
Despite the success of recent studies, there are potential limitations in their performance. Most of the existing approaches rely on two-stage pipelines in order to capture both audio-visual correlations and relational contextual information from multiple candidates simultaneously. Firstly, they optimize a multimodal encoder using relatively short-term audio-visual signals (approximately 1 to 2 seconds) owing to the high computational cost of representing multiple candidates [11], [16], [17], [28]. This encoder then serves as a fixed feature extractor for the second stage, which models the visual relational context between multimodal embedded features from multiple speakers. However, these approaches may limit the interpretation of longer-term audio and visual information, as they do not fully leverage the learning capabilities of progressive neural architectures to directly capture the evidence of speaking activities. Moreover, as the number of speakers in a video frame increases, they rely heavily on the visual relational context between multiple candidates in a video. As a result, they often lack robustness and yield unsatisfactory results owing to the presence of auditory and visual noise interference, particularly in the case of videos with multiple speakers and long durations.
To address these limitations, we deviate from the recent standard approach and devise a strategy to achieve robust multimodal feature representations by exploring the relationship between individual visual tracklets and their corresponding audio streams. TalkNet [18], a previous work on audio-visual correlation learning, uses a transformer-style architecture [19] to capture the interactions between audio and visual signals. However, this method suffers from visual noise, such as low resolution and indiscernible faces in video frames, leading to relatively lower performance compared to recent approaches.
In this study, we develop a novel framework that aims to address the auditory and visual noise problem and achieve robust performance in challenging situations with multiple speakers. Fig. 1 provides an overview of the proposed framework. This framework is designed as an end-to-end pipeline that takes both the facial video and its corresponding audio as input and classifies whether the person is speaking in each video frame. To deal with visual noise, we assume that understanding the spatial location of each face in each video frame can determine the usefulness of the visual modality and prevent false detections caused by indistinguishable facial images. Therefore, we propose a spatio-positional encoder that effectively leverages the spatial information of each face. Moreover, we present an efficient multimodal approach, which we call the cross-conformer, that captures the long-term temporal contextual interactions between the audio stream and visual lip movements. Further details on the model architecture are provided in Section II.
We demonstrate the effectiveness of our framework through extensive experiments conducted on the AVA-ActiveSpeaker dataset [13]. Our framework outperforms recent state-of-the-art approaches by a large margin under challenging multi-speaker scenarios involving 2-3 speakers within a single video. Specifically, the experimental results show that the proposed approach significantly improves the robustness of ASD against auditory and visual interference without relying on pre-trained networks or hand-crafted training strategies.
The primary contributions of this paper are summarized as follows:
• We propose an end-to-end trainable framework for active speaker detection that notably outperforms the state-of-the-art in challenging multi-speaker settings without the need for large-scale pre-training or hand-crafted training strategies.
• The proposed multimodal approach is designed to model long-term temporal audio-visual representations by directly capturing the correspondence between the audio and visual lip movements for speaking activities.
• Our framework effectively tackles the problem of auditory and visual noise interference by reducing the number of false positive detections when dealing with longer-duration videos.

II. PROPOSED METHOD
The proposed framework is an end-to-end pipeline consisting of two main components: an audiovisual temporal encoder and a cross-conformer, as illustrated in Fig. 1.
The audiovisual temporal encoder takes the facial video and its associated audio as input and generates frame-level sequences of audio and visual features that represent the temporal context of each modality. Our audiovisual temporal encoder is based on TalkNet [18], which is designed to extract long-term temporal features. In this paper, we describe the spatio-positional encoder and training strategies that aim to enhance robustness against auditory and visual noise compared to TalkNet. Additionally, to efficiently model the multimodal representations in an end-to-end manner, we design the cross-conformer, which consists of a cross-attention module and an integration module. The cross-attention module captures contextual correlations between the audio and visual representations. To model temporal relationships, the integration module learns longer-term context in the multimodal representations.

A. AUDIOVISUAL TEMPORAL ENCODER
The audiovisual temporal encoder is a two-stream architecture consisting of an audio-temporal encoder and a visual-temporal encoder with a spatio-positional encoder.
The audio-temporal encoder takes a sequence of audio frames as input, represented as 80-dimensional log-Mel spectrograms. Similar to TalkNet [18], we use a ResNet-34 network with a squeeze-and-excitation module [20]. The ResNet-34 is specifically designed with dilated convolutions to match the temporal length of the audio embeddings with that of the visual embeddings. During training, we apply SpecAugment [21] (time and frequency masking only) to make our method robust to auditory noise and variations in acoustic environments. In addition, we augment the audio data using negative sampling, as introduced in [18]. This technique randomly selects an audio stream from another video in the same batch and adds it as noise, thereby effectively increasing the number of training samples by utilizing negative samples from the entire training dataset. Similar to the audio stream, the visual-temporal encoder represents the face video as a sequence of visual embeddings with the same temporal length. First, the visual embeddings are encoded by a 3D convolutional layer followed by a 2D ResNet-18 network [22] to explore the spatial information within each video frame. These embeddings are then fed into a video temporal convolutional block that comprises a residual-connected depth-wise separable convolutional layer [23] with ReLU activation and batch normalization [31]. The dimension of the visual embeddings is reduced to 128 using a 1D convolutional layer for computational efficiency.
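To make this concrete, the snippet below is a minimal PyTorch sketch of the video temporal convolutional block and the 128-dimensional projection described above; the module name `VisualTemporalBlock`, the 512-channel input width, and the kernel size are illustrative assumptions, and the 3D convolution and ResNet-18 trunk are omitted for brevity.

```python
import torch
import torch.nn as nn

class VisualTemporalBlock(nn.Module):
    """Residual depth-wise separable 1D convolution over the frame axis."""
    def __init__(self, channels=512, kernel_size=5):
        super().__init__()
        self.block = nn.Sequential(
            # Depth-wise temporal convolution (one filter per channel).
            nn.Conv1d(channels, channels, kernel_size,
                      padding=kernel_size // 2, groups=channels),
            # Point-wise convolution mixes channels.
            nn.Conv1d(channels, channels, 1),
            nn.BatchNorm1d(channels),
            nn.ReLU(inplace=True),
        )
        # 1D projection that reduces the embedding dimension to 128.
        self.proj = nn.Conv1d(channels, 128, 1)

    def forward(self, x):            # x: (batch, channels, time)
        x = x + self.block(x)        # residual connection
        return self.proj(x)          # (batch, 128, time)

# Example: 25 video frames, each encoded into a 512-D vector by the 2D backbone.
frames = torch.randn(2, 512, 25)      # (batch, feature dim, time)
visual_emb = VisualTemporalBlock()(frames)
print(visual_emb.shape)               # torch.Size([2, 128, 25])
```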
As shown in Fig. 2, it is difficult to distinguish whether a person is speaking, particularly if the face has a very low resolution or if the shape of the mouth is indistinct. In this study, we assume that active speakers are often situated at the center of the scene and have higher visual saliency, particularly in movies. Therefore, understanding the spatial information of speakers can determine the utility of the visual modality and help eliminate unlikely candidates. To exploit the spatial information of each face, we propose a spatio-positional encoder that projects the 4D spatial descriptor of each face region, parameterized by the normalized central coordinates, height, and width, into a 128D spatial feature vector using a single fully connected layer. The resulting spatial feature vector is incorporated into the visual embeddings in the form of feature-wise biasing. In the experimental section, we demonstrate the effectiveness of the spatio-positional encoder as an additional inductive bias.
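A minimal sketch of the spatio-positional encoder is given below, assuming the 4D descriptor is (center x, center y, height, width) normalized to [0, 1]; the class and variable names are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatioPositionalEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Single fully connected layer: 4-D box descriptor -> dim-D spatial vector.
        self.fc = nn.Linear(4, dim)

    def forward(self, visual_emb, boxes):
        # visual_emb: (batch, time, dim), boxes: (batch, time, 4)
        spatial = self.fc(boxes)        # (batch, time, dim)
        return visual_emb + spatial     # feature-wise biasing

# Example: per-frame face boxes added as a bias to the 128-D visual embeddings.
visual_emb = torch.randn(2, 25, 128)
boxes = torch.rand(2, 25, 4)
biased = SpatioPositionalEncoder()(visual_emb, boxes)
print(biased.shape)                     # torch.Size([2, 25, 128])
```

Because the encoder is a single 4-to-128 linear projection, the added inductive bias comes at a negligible parameter cost.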

B. CROSS-CONFORMER
In the field of speech enhancement, a cross-attentional conformer model [34] has been proposed; it is an attention-based architecture for context modeling within a single audio modality. In this study, we propose a multimodal architecture specifically designed to detect active speakers. Fig. 3 shows a block diagram of the cross-attention module. Firstly, to effectively model the contextual correlations between the audio and visual representations, we developed a cross-attention module that separately learns features for each modality while considering the constraints imposed by the other modality. This approach enables the learned features for each modality to encode intermodal relationships while preserving exclusive and meaningful intramodal characteristics. The input audio and visual features are first processed through the sequence of a half-step feed-forward module, a convolution module, and layer normalization. Within this sequence, the convolution module aggregates contextual information from each feature to implicitly encode relative positional information [25], [26]. The processed features are then fed into a multi-headed cross-attention module that uses the processed audio (source) features to derive the key $K_a$ and value $V_a$ vectors and the processed video (target) features to derive the query $Q_v$ vector for the audio attention features $X_{att,a}$, as formulated in Eq. (1):

$$X_{att,a} = \operatorname{softmax}\!\left(\frac{Q_v K_a^{\top}}{\sqrt{d}}\right) V_a, \tag{1}$$

where $d$ denotes the dimension of $Q_v$, $K_a$, and $V_a$.
Similar to $X_{att,a}$, the visual attention features $X_{att,v}$ are generated for the visual modality using Eq. (2):

$$X_{att,v} = \operatorname{softmax}\!\left(\frac{Q_a K_v^{\top}}{\sqrt{d}}\right) V_v. \tag{2}$$
Thus, in this study, two separate modules, an audio-attended module and a visual-attended module, are employed, enabling independent attention over the audio and video signals and yielding the audio attention features $X_{att,a}$ and the visual attention features $X_{att,v}$, respectively. The architecture combines convolution and self-attention to capture relative-offset-based local and global interactions. The cross-attention modules can be stacked for a more extensive exploration of cross-modal information. As illustrated in Fig. 4, after obtaining the audio and video attention features, we feed them individually as target features into their corresponding subsequent layers within the stacked cross-attention modules.
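The sketch below illustrates one attended module in PyTorch, using the built-in multi-head attention in place of the full half-step feed-forward and convolution sub-modules; class and variable names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Target features provide the query; source features provide key and value."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, target, source):
        # target, source: (batch, time, dim)
        q = self.norm_t(target)
        kv = self.norm_s(source)
        attended, _ = self.attn(q, kv, kv)   # softmax(Q K^T / sqrt(d)) V
        return attended

audio = torch.randn(2, 25, 128)              # processed audio features
video = torch.randn(2, 25, 128)              # processed visual features
audio_attended = CrossAttention()            # Eq. (1): Q from video, K/V from audio
visual_attended = CrossAttention()           # Eq. (2): Q from audio, K/V from video
x_att_a = audio_attended(video, audio)
x_att_v = visual_attended(audio, video)
joint = torch.cat([x_att_a, x_att_v], dim=-1)  # concatenated multimodal features
```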
Thereafter, the attended audio and visual features are concatenated to yield the multimodal representations.
We propose an integration module to capture both the local and global dependencies of the processed multimodal features. This module is motivated by the conformer architecture [24], which combines convolution and self-attention to model short- and longer-term correlations in the input features. In contrast to the conformer architecture, the convolution module precedes the self-attention module in our model. This allows the self-attention blocks to omit relative positional embeddings [25], because positional information can be partially captured by the temporal convolution [26].
As illustrated in Fig. 5, the processed features pass through the sequence of a convolution module, a multi-headed self-attention module, a half-step feed-forward module, and layer normalization, with residual connections between the processing modules. The self-attention module derives the query, key, and value of the attention layer from the joint audiovisual feature. Paralleling the stacked design of the conformer architecture, we repeatedly apply the integration module to learn deeper multimodal representations. The resulting features are then fed into a fully connected layer, followed by a softmax function, to obtain the final prediction sequence.
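As an illustration, the following is a minimal PyTorch sketch of one integration module over the concatenated 256-dimensional audio-visual features; the class name, SiLU activation, and feed-forward expansion factor are assumed choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class IntegrationModule(nn.Module):
    def __init__(self, dim=256, heads=8, kernel_size=15, ff_mult=4):
        super().__init__()
        # Depth-wise separable temporal convolution captures local context first.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.Conv1d(dim, dim, 1),
            nn.BatchNorm1d(dim),
            nn.SiLU(),
        )
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * ff_mult), nn.SiLU(), nn.Linear(dim * ff_mult, dim)
        )
        self.norm_attn = nn.LayerNorm(dim)
        self.norm_out = nn.LayerNorm(dim)

    def forward(self, x):                        # x: (batch, time, dim)
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]             # global self-attention
        x = x + 0.5 * self.ffn(x)                 # half-step feed-forward
        return self.norm_out(x)

# The concatenated audio-visual features (128 + 128 dims) pass through a stack
# of integration modules before the final classifier.
joint = torch.randn(2, 25, 256)
out = IntegrationModule()(joint)
print(out.shape)                                  # torch.Size([2, 25, 256])
```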

III. EXPERIMENTS
In this section, we provide an experimental analysis and a comparative evaluation to demonstrate the effectiveness of the proposed framework. We conducted experiments on the AVA-ActiveSpeaker dataset [13], which contains 262 YouTube videos from various film industries worldwide. The dataset presents several challenges, such as multiple languages, varying frame rates, numerous low-resolution face crops, and noisy audio. We used the official evaluation code, which computes the mean Average Precision (mAP). When available, the Area Under the Receiver Operating Characteristic Curve (AUC) is also reported.

A. IMPLEMENTATION DETAILS
Our framework and training algorithm were implemented using the PyTorch library [29]. We used the Adam optimizer [30] with an initial learning rate of 0.0001, which decreased by 5% after every epoch. All face crops were converted to grayscale and resized to 112×112. During training, we performed visual augmentation by randomly flipping, rotating, and cropping the original images, as well as audio augmentation, as described in Section II-A. Our framework consists of four cross-attention modules and four integration modules, each with eight attention heads. The kernel size of the depth-wise convolutional layer within the convolution module was set to 15. Finally, we used the cross-entropy loss to compare the predicted sequence with the ground-truth sequence.
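The snippet below shows one way to realize this optimization setup (Adam, 5% per-epoch learning-rate decay, frame-level cross-entropy) in PyTorch; the stand-in model and batch shapes are placeholders, not the authors' training code.

```python
import torch
import torch.nn as nn

# Stand-in for the full framework: any module producing per-frame logits works here.
model = nn.Sequential(nn.Linear(256, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # 5% decay per epoch
criterion = nn.CrossEntropyLoss()

for epoch in range(2):                                   # toy run with random data
    features = torch.randn(4, 25, 256)                   # (batch, time, dim)
    labels = torch.randint(0, 2, (4, 25))                # per-frame speaking labels
    logits = model(features)                             # (batch, time, 2)
    loss = criterion(logits.flatten(0, 1), labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```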

B. COMPARISON WITH THE STATE-OF-THE-ART
We summarize the performance comparisons of our framework with recent state-of-the-art methods on the validation set of the AVA-ActiveSpeaker dataset in Table 1. The proposed framework notably outperforms all other approaches without relying on hand-crafted training or pre-training strategies. It is worth highlighting that most previous methods [11], [12], [16], [17], [28] rely on a more restrictive approach, which includes scene layout and speaker suppression, or on large-scale pre-training to reliably optimize a multi-stage pipeline. In contrast, our proposed approach relies exclusively on the contextual interaction of audio-visual signals to maximize the agreement between the audio and visual lip movements.

1) NUMBER OF FACES
As the number of faces in a video increases, detecting active speakers becomes increasingly challenging. Table 2 shows that our framework consistently outperforms all other state-of-the-art approaches under these challenging multi-speaker conditions. Notably, when three speakers are present in a video frame, our performance significantly surpasses the previous state-of-the-art [17], improving on its 82.5% mAP by 2.0%. This result represents a significant advancement: unlike existing approaches that focus on utilizing the relational context among all speakers in a video, our framework achieves this superior performance without requiring extra memory or computational resources to handle multiple candidates simultaneously.

2) FACE SIZE
Dealing with low-resolution faces is challenging, as described in Section II-A. Following previous evaluation procedures [18], [33], the validation set is divided into three bins according to the width of the detected faces: small (width less than 64 pixels), middle (width between 64 and 128 pixels), and large (width larger than 128 pixels). From Table 3, we observe that our framework exhibits a behavior similar to recent state-of-the-art results, in which smaller faces are more difficult to detect (76.3% mAP). This result is nonetheless favorable, as our framework shows improved performance in low-resolution scenarios compared to the baseline [18] (a 12.6% mAP improvement for small faces). Furthermore, for middle faces, our framework outperforms all other state-of-the-art methods, with an mAP of 91.4%. For large faces, we also record the highest performance of 96.8%. These results demonstrate that the proposed spatio-positional encoder effectively represents the spatial information, thereby contributing to a significant performance improvement.
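For clarity, the binning rule can be expressed as in the short sketch below; the function name and the handling of widths of exactly 64 or 128 pixels are assumptions, since the text only specifies the three ranges.

```python
def face_size_bin(width_px: float) -> str:
    """Assign a detected face to a size bin by its width in pixels."""
    if width_px < 64:
        return "small"
    elif width_px <= 128:
        return "middle"
    return "large"

# Example face widths mapped to their evaluation bins.
print([face_size_bin(w) for w in (48, 100, 200)])   # ['small', 'middle', 'large']
```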

3) LENGTH OF THE VIDEO
We present the performance comparisons by video length on the AVA-ActiveSpeaker validation set in Tables 4 and 5. To the best of our knowledge, performance has not previously been analyzed as a function of video length. For evaluation, we partitioned the dataset into three categories based on the duration of the videos: short (0-3 seconds), medium (4-8 seconds), and long (9-12 seconds). From Table 4, we observe that our framework significantly outperforms existing methods when handling longer-duration videos. In particular, our framework shows remarkably improved performance on the "long" subset compared to the baseline [18], with an improvement of 1.8% mAP. On the "short" subset, our framework achieves 92.7% mAP, an improvement of 2.0% over the baseline [18]. However, our experimental results also indicate that our framework performs relatively worse than recent approaches when applied to short videos.
Additionally, to further analyze the efficacy of our framework, we computed the False Positive Rate (FPR), as shown in Table 5. We observe that our framework greatly reduces the number of false positive detections for both medium- and long-duration video inputs. These results confirm the effectiveness of the proposed framework in modeling intermodal relationships and long-term temporal contexts between the audio and visual representations.
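Here, FPR denotes the fraction of non-speaking frames that are predicted as speaking, i.e., FPR = FP / (FP + TN); a minimal sketch of this computation over per-frame binary predictions is shown below, with illustrative variable names.

```python
import numpy as np

def false_positive_rate(pred: np.ndarray, label: np.ndarray) -> float:
    """FPR = FP / (FP + TN) over frames whose ground truth is 'not speaking'."""
    fp = np.sum((pred == 1) & (label == 0))
    tn = np.sum((pred == 0) & (label == 0))
    return fp / max(fp + tn, 1)

pred = np.array([1, 0, 1, 0, 0])
label = np.array([0, 0, 1, 0, 1])
print(false_positive_rate(pred, label))   # 1 FP out of 3 negative frames ≈ 0.33
```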

C. NOISE ROBUSTNESS
To evaluate the auditory noise robustness of our framework, test data were generated at five signal-to-noise ratio (SNR) levels by adding speech from the Librispeech dataset [27], which contains utterances from a diverse speaker inventory, to the validation set of the AVA-ActiveSpeaker dataset [13]. The level of the additive noise is scaled relative to the power of each individual example. Table 6 shows that our framework is relatively robust compared with TalkNet [18], whose performance quickly degrades as the noise becomes more intense.
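The snippet below is a hedged sketch of how additive speech noise can be mixed at a target SNR by scaling the noise relative to the power of each clean example; it illustrates the protocol rather than reproducing the exact generation code used in our experiments, and all names are placeholders.

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    noise = noise[: len(clean)]                    # match lengths
    p_clean = np.mean(clean ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(p_clean / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

clean = np.random.randn(16000)                      # 1 s of audio at 16 kHz
noise = np.random.randn(16000)                      # interfering speech segment
noisy = mix_at_snr(clean, noise, snr_db=5.0)
```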

D. ABLATION STUDY
Table 7 summarizes the contributions of each component, namely, Audio-Aug (audio data augmentation), Spatio-Pos (spatio-positional encoder), Conformer (cross-attention module + conformer [24]), and Integration (cross-attention module + integration module). Note that these components were built and evaluated on top of the baseline [18]. Compared to the baseline (92.3% mAP), each component helps to improve performance. In particular, audio data augmentation works synergistically with the spatio-positional encoder (+1.1% mAP). We observed that this improvement was particularly pronounced in videos with auditory and visual noise interference. The use of the integration module within the proposed cross-conformer model resulted in a 0.6% performance improvement over the conformer architecture. The key differences are the removal of the feed-forward module that was previously positioned before the self-attention module and the omission of relative positional embeddings. These adjustments led to improved performance and computational efficiency. Finally, by incorporating all the proposed modules, we achieved 94.4% mAP, an improvement of 2.1% over the baseline.
In addition, we provide a more in-depth analysis of the effects of the cross-conformer model. Table 8 presents experimental results examining whether the proposed model enhances temporal modeling performance. The proposed model achieves an average performance improvement of approximately 0.9% mAP across all subsets. In particular, on the "long" subset, we outperform TalkNet [18] + audio data augmentation + spatio-positional encoder (94.4% mAP) by a substantial margin of 1.4%. Based on these results, we conclude that the cross-conformer architecture is more suitable for long-term temporal modeling of multimodal representations.
E. QUALITATIVE ANALYSIS
Fig. 6 illustrates several examples that compare our results with those of SPELL [17], providing a qualitative analysis of our framework. The selected video examples contain multiple speakers and visual interference. In the first example, which involves changes among the other speakers over a long duration of approximately 7 seconds, SPELL produces both false-positive and false-negative results, whereas our framework correctly identifies every speaker. The second example presents a situation in which the person positioned in the center of the video is a non-speaker but still moves his mouth. In the third example, a person located on the right side of the video, though not speaking, exhibits large movements in the mouth region unrelated to speech. In these instances, our framework accurately identifies all active speakers, while SPELL yields numerous false positives owing to the visual interference. These results indicate that our framework significantly enhances robustness against auditory and visual noise interference by directly capturing long-term temporal contextual interactions between audio and visual lip movements.

IV. CONCLUSION
In this study, we proposed a novel framework for active speaker detection that optimizes the correspondence between audio and visual lip movements in an end-to-end manner. We achieved robust multimodal representations by capturing the long-term temporal contextual interactions between individual visual tracklets and their corresponding audio streams. Experimental results demonstrate that the proposed framework notably outperforms recent state-of-the-art methods without the need for large-scale pre-training or hand-crafted training strategies, particularly in challenging multi-speaker situations with auditory and visual interference. Moreover, our framework effectively reduces the number of false positive detections, especially when handling lengthy video inputs. However, our approach exhibited lower performance compared to recent existing methods when targeting short videos. Future work aims to enhance the performance of the proposed framework by more effectively integrating the spatio-positional encoder and cross-conformer with cross-modal alignments, such as IRRA [38]. Furthermore, we will evaluate the robustness of the proposed framework across various datasets and investigate data augmentation for short-duration videos.

FIGURE 1. Overview of the end-to-end pipeline of our framework.

FIGURE 2. Example of how the visual modality becomes ambiguous depending on the spatial locations of speakers in the ASD problem. In (a) and (b), the low resolution of the visible faces makes it difficult to determine whether the persons are speaking in the scene. In contrast, in (c), the speaker's visual features, particularly the shape of the mouth, are distinct, making detection easier.

FIGURE 4. Example of the stacked cross-attention modules.

FIGURE 6. Qualitative results: the green box indicates active speakers, while the red box indicates non-active speakers. The number displayed on the boxes represents the prediction score.

TABLE 1. Performance comparisons with the state-of-the-art methods on the AVA-ActiveSpeaker validation set.

TABLE 2. Performance evaluation by the number of faces, measured in terms of mAP (%).

TABLE 3. Performance evaluation by face size, measured in terms of mAP (%).

TABLE 4. Performance evaluation by video length, measured in terms of mAP and AUC (%).

TABLE 5. Performance evaluation by video length, measured in terms of FPR (%).

TABLE 6. Performance evaluation at different SNR levels of auditory noise, measured in terms of mAP (%).

TABLE 7. Performance comparisons of different ablative settings, measured in terms of mAP (%).

TABLE 8. Effect of the cross-conformer by video length, measured in terms of mAP (%).