Deep Metric Learning-Assisted 3D Audio-Visual Speaker Tracking via Two-Layer Particle Filter

For speaker tracking, integrating multimodal information from audio and video provides an effective and promising solution. The current challenges center on the construction of a stable observation model. To this end, we propose a 3D audio-visual speaker tracker assisted by deep metric learning within a two-layer particle filter framework. First, an audio-guided motion model is applied to generate candidate samples in a hierarchical structure consisting of an audio layer and a visual layer. Then, a stable observation model is constructed with a designed Siamese network, which provides a similarity-based likelihood to calculate particle weights. The speaker position is estimated using an optimal particle set, which integrates the decisions from audio particles and visual particles. Finally, a template update strategy based on a long short-term mechanism is adopted to prevent drift during tracking. Experimental results demonstrate that the proposed method outperforms single-modal trackers and comparison methods, achieving efficient and robust tracking both in 3D space and on the image plane.


Introduction
Audio-visual speaker tracking is a key technology of human-machine interaction, driven by applications such as intelligent surveillance, smart spaces, and multimedia systems. By analyzing the audio-visual data captured by multimodal sensor arrays, the positions of the speakers in the scene are continuously tracked, providing the underlying basis for subsequent action recognition and interaction. Compared with conventional single-modal tracking, complementary information from the audio and video streams is utilised to improve tracking accuracy and robustness [1].
Current methods for speaker tracking are built on probabilistic generative models due to their ability to process multimodal information. As representative state-space approaches based on the Bayesian framework, the Kalman filter (KF) [2], the extended KF (EKF), and the particle filter (PF) [3] are commonly used. Among them, PF can recursively approximate the filtering distribution of tracking targets by using dynamic models and random sampling. However, the traditional PF assumes that the number of targets is known a priori, which is not suitable for natural scenes containing multiple speakers with random motion. The probability hypothesis density (PHD) filter [4], another random method based on finite set statistics (FISST) theory, is introduced to solve this problem. Different from the above Bayesian methods, the speaker number is estimated during the PHD-based tracking process, and therefore, the PHD filter is considered promising for multispeaker tracking. However, the PHD filter restricts the propagation of the multitarget posterior distribution to the first-order moment, resulting in the loss of higher-order cardinality information, which leads to speaker number estimation errors in low signal-to-noise ratio situations [5]. PF is selected as the tracking framework in this paper since it easily approaches the Bayesian optimal estimates without being constrained by linear systems and Gaussian assumptions [6].
Many works try to improve the architecture of PF to integrate data streams from different modalities into a unified tracking framework. The direction of arrival (DOA) derived from the audio source is used to reshape the typical Gaussian noise distribution of the particles in the propagation step of PF, and the weights of the particles are recalculated according to their distance to the DOA results [7]. Tracking efficiency and accuracy usually depend on the number of particles and the noise variance used in the state model and the propagation equation. Accordingly, as an enhanced version of PF, an adaptive algorithm is proposed to dynamically adjust the number of particles and the noise variance by using audio-visual information [5]. The audio information obtained from the generalized cross-correlation (GCC) algorithm and the video information extracted by the continuous adaptive mean shift (CAMShift) method are combined using a particle swarm optimization- (PSO-) based fusion technique [8].
The PSO algorithm can also be utilised to optimize particle sampling in PF and improve particle convergence to the active speaker region by incorporating an interaction mechanism [9].
To analyze and infer the dynamic system applied for speaker tracking, Bayesian theory provides an effective framework, which includes a state model and an observation model. The state model describes the evolution of the state over time, and the observation model associates the observed information with the state of the speaker [10]. The prevailing fusion strategies in PF-based frameworks modify the observation model to fuse the observations collected from multisensor devices [5]. Specifically, audio and visual likelihoods are constructed separately in the observation model to update the particle weights. A joint observation model is proposed in [11], which fuses audio, shape, and structure observations derived from audio and video in a multiplicative likelihood. The visual observation model in [12] is derived from a face detector and reverts to a color-based generative model during misdetection. Furthermore, the visual observation is used to calculate a video-assisted global coherence field- (GCF-) based audio likelihood by limiting the acoustic map to the horizontal plane determined by the predicted height of the speaker [13]. Probabilities of the visual and acoustic observations are combined using an adaptive weighting factor, which is adjusted dynamically according to an acoustic confidence measure based on the generalized cross-correlation with phase transform (GCC-PHAT) approach [14]. PF has also been combined with a pretrained convolutional neural network (CNN), which provides a generic target representation. Conventional color histogram-based appearance models cannot deal with sudden changes effectively, whereas a more stable observation model is obtained by fusing deep features and handcrafted features [15]. The purpose of this work is to adopt deep metric learning to optimize the observation model in the PF tracker. The method can effectively describe the similarity between samples by learning a distance metric.
By designing the network structure and constructing a distance-based cost function, the similarity between the particle diffusion area and the matching template can be obtained. The likelihood function based on the network output can better define the particle weights and reflect the confidence of observations from different modalities.
This work is based on a two-layer PF framework, which achieves audio-visual fusion through a hierarchical structure comprising an audio layer and a visual layer [16].
In the propagation step, two groups of particles from the audio and video streams are diffused through the audio-guided motion model. In the update step, the similarity between the particle diffusion area and the template is obtained through a pretrained Siamese network to calculate the particle weights. In the estimation step, an optimal particle set is constructed to determine the speaker position. Finally, the target template is updated by a long short-term mechanism. The main work of the proposed 3D audio-visual speaker tracker is discussed in the following sections: Section 2 describes the methodology of the tracker in detail, including the motion model, the observation model, the ensemble method, and the template update method; Section 3 presents the experimental results and detailed analysis; finally, Section 4 concludes this work.

Methodology
For the state-space model in the tracking task, recursive filtering is commonly used to realize the dynamic system estimation based on Bayesian theory. In PF, the state of the speaker is estimated according to the posterior probability distribution p(x_t | a_{1:t}, v_{1:t}), which is approximated by a set of random particles with associated weights, where x_t is the state vector at time t and (a_t, v_t) denotes the current observations from the audio signals and the video frames. Assuming that the target state transition is a first-order Markov process, the required posterior distribution is formulated in terms of the likelihood function L(a_t, v_t | x_t) and the state transition model p(x_t | x_{t−1}):

  p(x_t | a_{1:t}, v_{1:t}) ∝ L(a_t, v_t | x_t) ∫ p(x_t | x_{t−1}) p(x_{t−1} | a_{1:t−1}, v_{1:t−1}) dx_{t−1}.  (1)

The sampling importance resampling (SIR) algorithm is applied to implement the recursive Bayesian filter by Monte Carlo (MC) simulations [10]. The importance density is chosen to be the prior density p(x_t | x_{t−1}); as the particles are drawn from this proposal importance density, the particle weights are updated in proportion to the likelihood.

Figure 1 depicts the framework of the proposed audio-visual speaker tracker. Audio and visual particles, x^a and x^v, are propagated in the audio layer and the visual layer, respectively, driven by the audio-guided motion model (Section 2.1).
Through the pretrained Siamese network, the likelihoods are obtained to weight the particles (Section 2.2), and the speaker position is estimated using a set of optimal particles (Section 2.3). Finally, a long short-term mechanism is applied to update the templates of the target (Section 2.4).
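The SIR recursion underlying the tracker can be sketched in a few lines. This is a minimal, generic illustration on a toy 1D state, not the paper's tracker: the `propagate` and `likelihood` callables stand in for the audio-guided motion model and the Siamese observation model.

```python
import numpy as np

def sir_step(particles, weights, propagate, likelihood, rng):
    """One SIR iteration: resample by weight, propagate through the motion
    model, and reweight. With the prior transition density as the importance
    density, the new weights are proportional to the likelihoods."""
    n = len(particles)
    idx = rng.choice(n, size=n, p=weights)   # resampling step
    particles = propagate(particles[idx])    # propagation step
    weights = likelihood(particles)          # update step
    return particles, weights / weights.sum()

# Toy example: track a static 1D position from noisy observations.
rng = np.random.default_rng(0)
true_pos = 2.0
particles = rng.normal(0.0, 5.0, size=500)
weights = np.full(500, 1.0 / 500)
for _ in range(30):
    obs = true_pos + rng.normal(0.0, 0.3)
    particles, weights = sir_step(
        particles, weights,
        propagate=lambda p: p + rng.normal(0.0, 0.1, size=p.shape),
        likelihood=lambda p: np.exp(-0.5 * ((p - obs) / 0.3) ** 2),
        rng=rng)
estimate = np.sum(weights * particles)       # posterior mean estimate
```

In the paper's two-layer setting, this loop runs once per layer, with the likelihood supplied by the Siamese observation model of Section 2.2.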

Audio-Guided Motion Model.
In audio processing, the DOAs of the input audio signals are adopted as position observations to assist particle diffusion. To obtain the DOA observations, the two-step sam-sparse-mean (SSM) approach is employed for sound source localization [17]. The first step is to perform sector-based detection and localization.
The space around the microphone array is discretized into multiple sectors, each with a corresponding SSM activity value. The existence of an active source in a sector is determined by comparing the activity value with an adaptive threshold. The second step is to conduct a point-based search in each active sector. A parametric approach is utilised for localization, which uses the SRP-PHAT cost function to optimize the spatial position parameters.
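Both GCC and SRP-PHAT build on phase-transform-weighted cross-correlation. As a hedged illustration of the underlying operation (not the two-step SSM approach itself), the following sketch estimates the time delay between two microphone channels with GCC-PHAT:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay of sig relative to ref with GCC-PHAT:
    the cross-power spectrum is whitened by its magnitude so that only
    phase information drives the correlation peak."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n=n)
    max_shift = n // 2
    # Rearrange so negative lags precede positive lags.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

# Toy check: a 5-sample delay at fs = 16 kHz.
fs = 16000
rng = np.random.default_rng(1)
x = rng.standard_normal(2048)
delayed = np.concatenate((np.zeros(5), x))[:2048]
tau = gcc_phat(delayed, x, fs)
```

In an SRP-PHAT-style search, such pairwise correlations would be summed over all microphone pairs at the lags implied by each candidate source position.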
The proposed audio-visual speaker tracker is equipped with a set of audio particles x_j^a, j = 1, ..., N_s, and a set of visual particles x_j^v, j = 1, ..., N_s, where N_s is the number of particles. An audio particle propagated in the 3D world is modeled in spherical coordinates and represented by the state vector

  x_t^a = [α_t, β_t, γ_t, α̇_t, β̇_t, γ̇_t]^T,  (2)

where (α_t, β_t, γ_t) denotes the azimuth, elevation, and radius of the audio particle at time t and (α̇_t, β̇_t, γ̇_t) denotes the corresponding velocities. Unlike the above definition, a visual particle is propagated on the image plane and modeled in rectangular coordinates. The state vector of a visual particle is defined as

  x_t^v = [h_t, l_t, ḣ_t, l̇_t]^T,  (3)

where h_t and l_t are the horizontal and vertical coordinates in the image frame and (ḣ_t, l̇_t) represents the velocities in the corresponding directions.
In the propagation step, owing to the inaccuracy of elevation and radius estimation, the relatively accurate azimuth estimates are used to optimize the motion model based on the Langevin process [18]. In the azimuth direction of the audio particles, the Langevin motion model is expressed as

  α̇_t = λ_a α̇_{t−1} + ζ_a Γ_a,  α_t = α_{t−1} + ΔT α̇_t,  (4)

where λ_a and ζ_a are designed parameters, Γ_a is zero-mean Gaussian-distributed noise, and ΔT is the time interval between two consecutive frames. In addition, the position of the particles is further modified according to the DOA estimation results:

  α_t ← α_t + ρ_a (θ_t − α_t),  (5)

where ρ_a is a correction factor and θ_t is the DOA (azimuth) estimated from the microphone measurements. In the elevation and radius directions, the particles are only diffused through the Langevin model without additional adjustment.

For the motion model of the visual particles, additional operations related to coordinate transformation are required. A pinhole camera model is employed to project the point (θ_t, β_{t−1}, γ_{t−1}), located in 3D world coordinates, onto the image plane:

  ϖ [h̃_t, l̃_t, 1]^T = M [x_t, y_t, z_t, 1]^T,  (6)

where β_{t−1} and γ_{t−1} are the elevation and radius from the tracking result of the previous frame, (x_t, y_t, z_t) are the Cartesian coordinates of (θ_t, β_{t−1}, γ_{t−1}), and ϖ and M are the normalization coefficient and the projection matrix. (h̃_t, l̃_t) is the projected point on the image plane, and its direction relative to the microphone array center can be calculated as

  φ_t = arctan((l̃_t − l̄) / (h̃_t − h̄)),  (7)

where (h̄, l̄) indicates the coordinates of the array center on the image plane. A visual particle is first propagated through the Langevin model, yielding the coordinate (h_t′, l_t′). The audio-guided motion model and coordinate transformation for the visual particles are expressed as

  h_t = h_t′ + ρ_v (h̃_t − h_t′),  l_t = l_t′ + ρ_v (l̃_t − l_t′),  (8)

where (h_t, l_t) is the modified particle coordinate and ρ_v is a correction factor. In this way, the DOA information is projected onto the image plane through the pinhole camera model, and the particles converge toward the sound source direction by moving toward the projection point.

Figure 1: Overall framework of the proposed 3D audio-visual speaker tracking method.
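The azimuth propagation, a Langevin step followed by a DOA-based correction pulling particles toward the estimate, can be sketched as follows. The exponential damping discretization and the parameter values are illustrative assumptions; only the overall structure (Langevin diffusion plus DOA pull) follows the text.

```python
import numpy as np

def propagate_azimuth(alpha, alpha_dot, doa, dt=0.04,
                      lambda_a=5.0, zeta_a=0.25, rho_a=0.3, rng=None):
    """Langevin step on particle azimuths, then a DOA correction.
    The discretization a = exp(-lambda_a * dt) is an assumed standard form."""
    rng = np.random.default_rng() if rng is None else rng
    a = np.exp(-lambda_a * dt)                       # velocity damping
    noise = rng.standard_normal(alpha.shape)
    alpha_dot = a * alpha_dot + zeta_a * np.sqrt(1 - a**2) * noise
    alpha = alpha + dt * alpha_dot                   # Langevin position update
    alpha = alpha + rho_a * (doa - alpha)            # pull toward the DOA
    return alpha, alpha_dot

# Particles scattered around 0 rad converge toward a DOA at 1 rad.
rng = np.random.default_rng(2)
alpha = rng.normal(0.0, 0.5, size=50)
alpha_dot = np.zeros(50)
for _ in range(20):
    alpha, alpha_dot = propagate_azimuth(alpha, alpha_dot, doa=1.0, rng=rng)
center = np.mean(alpha)
```

Elevation and radius would use the same Langevin step without the correction term, as described above.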

Deep Metric Learning-Assisted Observation Model.
The observation model is constructed to measure the candidate samples determined by the particles. Deep metric learning with a Siamese network provides a solution for similarity metric tasks, converting the tracking problem into a similarity problem in the feature space of the known target and the search area [19]. The framework of the adopted Siamese network is shown in Figure 2. It is equipped with two subnetworks with the same structure and identical shared parameters. Each branch consists of three convolutional layers with 15 kernels each, each followed by a 2 × 2 max-pooling layer; the filter sizes are 5 × 5, 3 × 3, and 3 × 3. The input of the network is an image pair (I_p, I_s) with a label Y ∈ {0, 1}, where Y indicates whether the image pair represents the same speaker (Y = 0 for a positive, same-speaker pair and Y = 1 for a negative pair). Each image is fed to one branch of the CNN. Through the network, the feature mapping function G_W(I) with parameters W is trained to map the input image pair into the target feature space. In this space, a distance-based metric function, E_W(I_p, I_s) = ‖G_W(I_p) − G_W(I_s)‖, is used to measure the similarity of the two images. The loss function proposed in [19] is adopted:

  L(W) = Σ_{n=1}^{N} [(1 − Y^n) L_G(E_W^n) + Y^n L_F(E_W^n)],  (9)

where E_W^n = E_W(I_p^n, I_s^n), n is the index of the image pair, N is the number of training sample pairs, and L_G and L_F denote the losses of positive and negative image pairs, respectively. The constant Q is set to the upper bound of E_W.

The rectangular boxes around the particles are cropped as candidate samples; the audio particles in 3D world coordinates are projected onto the image plane to obtain their rectangular boxes. Candidate samples are fed into the network, and the outputs indicate the similarity of each sample to the template, which is used to calculate the likelihood

  L_v(v_t | x_t) = exp(−κ_v E_W′(x_t^s)²),  (10)

where κ_v is a designed parameter, E_W′ is the normalized network output, and x_t^s is the bounding box of the particle. The observation noise is assumed to follow a Gaussian distribution.
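As a concrete toy example of the distance metric and the pairwise loss, the sketch below uses a fixed linear map as a stand-in for the learned mapping G_W. The L_G and L_F forms follow the widely used contrastive loss with upper bound Q and are an assumption about the exact formulation in [19].

```python
import numpy as np

def embed(x, W):
    return W @ x                                     # stand-in for G_W(I)

def pair_energy(x_p, x_s, W):
    """Distance-based metric E_W between two embedded inputs."""
    return np.linalg.norm(embed(x_p, W) - embed(x_s, W))

def contrastive_loss(E, Y, Q=10.0):
    """Y = 0: same speaker (L_G pulls embeddings together);
    Y = 1: different speakers (L_F pushes them apart). Assumed form."""
    L_G = (2.0 / Q) * E**2
    L_F = 2.0 * Q * np.exp(-2.77 * E / Q)
    return (1 - Y) * L_G + Y * L_F

rng = np.random.default_rng(3)
W = rng.standard_normal((4, 8))                      # toy embedding matrix
a = rng.standard_normal(8)
E_same = pair_energy(a, a + 0.01 * rng.standard_normal(8), W)
E_diff = pair_energy(a, rng.standard_normal(8), W)
loss_same = contrastive_loss(E_same, Y=0)
loss_diff = contrastive_loss(E_diff, Y=1)
```

Training drives E_W down for positive pairs and up for negative pairs, so at test time a small network output signals a likely match between a candidate sample and the template.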
To prevent tracking failures due to occlusion, deformation, and the speaker walking out of the camera view, an audio likelihood is added to modify the particle weights. Using the current DOA estimation result, the audio likelihood is defined as

  L_a(a_t | x_t) = exp(−κ_a (θ_t − α_t)²),  (11)

where κ_a is a designed parameter and α_t is the azimuth of the particle. The reverse process of the pinhole camera model is used to reconstruct the 3D coordinates of a visual particle to obtain its azimuth; this process requires a prior radius, which is taken from the audio particle closest to the visual particle. The weight of the particles is calculated as

  ω_t^j ∝ L_v(v_t | x_t^j),                      if max_j L_v(v_t | x_t^j) ≥ ρ_w,
  ω_t^j ∝ L_v(v_t | x_t^j) + L_a(a_t | x_t^j),   otherwise,  (12)

where ρ_w is a user-defined threshold. When the likelihoods of all particles are less than the threshold, the visual observations are considered unreliable; therefore, the audio observations are added to improve the particle weights.
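The conditional weighting rule can be sketched as follows; the exact way the audio likelihood enters the weights is an assumption based on the description above (additive only when all visual likelihoods fall below the threshold).

```python
import numpy as np

def particle_weights(lik_v, lik_a, rho_w=0.1):
    """Visual likelihoods drive the weights; the audio likelihood is added
    only when every visual likelihood is below the reliability threshold."""
    if np.max(lik_v) < rho_w:        # visual observation deemed unreliable
        w = lik_v + lik_a
    else:
        w = lik_v.copy()
    return w / w.sum()               # normalize to a proper weight vector

lik_a = np.array([0.6, 0.3, 0.1])
w_ok = particle_weights(np.array([0.9, 0.5, 0.2]), lik_a)      # vision trusted
w_bad = particle_weights(np.array([0.05, 0.02, 0.01]), lik_a)  # audio added
```

In the unreliable case, the particle nearest the DOA direction recovers the largest weight even though all visual responses are weak.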

Ensemble Method with Optimal Particle Set.
The ensemble method is used to integrate the decisions of the audio and visual particles to estimate the speaker position. By comparing the weights of all particles, the N_s particles with the largest weights are defined as the optimal particles and form the optimal particle set

  X_t^opt = {x_t^{opt_a, j}} ∪ {x_t^{opt_v, k}},  (13)

where x_t^{opt_a, j} and x_t^{opt_v, k} represent the optimal audio particles and optimal visual particles, respectively. The speaker position is estimated from the optimal particle set as

  p_t = Σ_{i=1}^{N_s} ω_t^{opt, i} x_t^{opt, i},  (14)

where ω_t^{opt, i} denotes the normalized weight of the i-th optimal particle. Finally, the optimal particle set is utilised to reset the audio and visual particles at the next frame, which ensures the effectiveness and diversity of the particles.
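A minimal sketch of the ensemble step, with illustrative 3D particle positions and random weights (both assumptions for the toy example): pool the audio and visual particles, keep the N_s with the largest weights, renormalize, and take the weighted mean as the position estimate.

```python
import numpy as np

def estimate_position(particles, weights, n_opt):
    """Select the n_opt largest-weight particles from the pooled audio and
    visual sets, renormalize their weights, and return the weighted mean."""
    idx = np.argsort(weights)[-n_opt:]            # optimal particle set
    w_opt = weights[idx] / weights[idx].sum()     # renormalized weights
    return np.sum(w_opt[:, None] * particles[idx], axis=0)

rng = np.random.default_rng(4)
true_pos = np.array([1.0, 2.0, 0.5])
audio_p = rng.normal(true_pos, 0.05, size=(50, 3))    # toy audio particles
visual_p = rng.normal(true_pos, 0.05, size=(50, 3))   # toy visual particles
particles = np.vstack([audio_p, visual_p])
weights = rng.uniform(0.0, 1.0, size=100)
pos = estimate_position(particles, weights, n_opt=50)
```

The selected set would then seed both particle groups at the next frame, as described above.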

Template Update with Long Short-Term Mechanism.
The traditional method performs speaker tracking according to the template provided in the first frame, which is not updated in subsequent frames so as to avoid contamination of the target features. However, in real scenes, nonrigid targets such as speakers undergo various deformations and thus show large differences in appearance. Therefore, a template update method is used to adapt to changes in the speaker's appearance and to prevent the drift problem in tracking.

Complexity
An indicator is proposed to measure the tracking confidence and selectively update the template. The confidence of the tracking result p_t is defined as

  R(p_t) = L(p_t) / L_min,  (15)

where L_min is the smallest likelihood among all particles and L(p_t) is the likelihood of p_t, calculated by equation (12). A threshold ξ_2 is set: when R(p_t) ≥ ξ_2, the tracking success rate is considered high enough to update the template, and the area around p_t is cropped as a short-term template I_p^s. In addition, the past appearance of the speaker is essential for tracking, and it is inevitable that noise is merged into the template through successive updates. Therefore, the target image defined by the user in the first frame is continuously adopted as a long-term template I_p^l. The samples are matched with I_p^l and I_p^s, respectively, and the similarities are measured by the Siamese network. The modified likelihood is defined as

  L′(v_t | x_t) = ρ_u L_l(v_t | x_t) + (1 − ρ_u) L_s(v_t | x_t),  (17)

where L_l and L_s are the likelihoods computed with the long-term and short-term templates, respectively, and ρ_u is a designed weighting factor by which the two templates are combined. Equation (17) is substituted into equation (12) to calculate the particle weight.
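The long short-term blend and the confidence-gated update can be sketched as below. The linear blend and the ratio form of the confidence are assumptions: the text gives only the ingredients (the weighting factor ρ_u, the likelihood L(p_t), and the smallest particle likelihood L_min).

```python
import numpy as np

def combined_likelihood(lik_long, lik_short, rho_u=0.6):
    """Blend the similarities to the fixed first-frame (long-term) template
    and the latest confident (short-term) template. Assumed linear form."""
    return rho_u * lik_long + (1.0 - rho_u) * lik_short

def should_update(lik_est, lik_min, xi2=2.0):
    """Gate the short-term template update on tracking confidence, taken
    here as the ratio of the estimate's likelihood to the worst particle's
    likelihood (an assumed form of R(p_t))."""
    return lik_est / max(lik_min, 1e-12) >= xi2

lik = combined_likelihood(0.8, 0.4)      # long-term strong, short-term weak
update = should_update(lik_est=0.9, lik_min=0.3)
```

Keeping the first-frame template in the blend bounds how far successive updates can drift the appearance model away from the true target.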

Experimental Setting.
The proposed tracker is evaluated on the AV16.3 corpus [20], a commonly used dataset for audio-visual speaker tracking captured by spatially distributed audio-visual sensors. The corpus was collected in a conference room with three cameras on the walls and two microphone arrays on a table. Audio signals are recorded at 16 kHz by two 8-microphone uniform circular arrays of 10 cm radius. Video sequences are recorded at 25 Hz by 3 monocular cameras; each frame is a color image of 360 × 288 pixels. The camera calibration information provided with the dataset is used for the coordinate conversion in the pinhole camera model. The trackers are evaluated in the single-speaker case with three camera views on seq08, seq11, and seq12, where the speaker is moving and speaking at the same time with some challenging poses, such as moving outside the camera view, not facing the cameras, or fast motion. In each experiment, we use the data streams from one camera and one microphone array. The numbers of audio particles, visual particles, and optimal particles are each set to 50. The audio-guided motion model is set with (λ_a, ζ_a) = (5, 0.25) and (λ_v, ζ_v) = (4, 5), which depend on the particle velocities in the different coordinates, and (ρ_a, ρ_v) = (0.3, 0.25). The parameters of the likelihood functions are set as (κ_a, κ_v) = (15, 5).
Thresholds and weighting factors are set empirically.

Experimental Results.
First, the proposed 3D audio-visual speaker tracker is compared with the PF-based single-modal approaches, referred to as the audio-only (AO) tracker and the video-only (VO) tracker. Figures 3(a)-3(c) display the 3D tracking results on the three coordinates of azimuth (rad), elevation (rad), and radius (m) on seq11c2. The AO method uses the sound source localization result as the observation in the PF algorithm; it performs well in azimuth localization, but it is difficult for it to track accurately in the other two directions. The VO tracker achieves effective tracking on the image plane; assuming a speaker height of 1.7 m, the tracking results on the image plane can be projected into 3D space. The proposed tracker obtains accurate azimuth estimates owing to the proposed motion model, while the errors in elevation and radius are mainly caused by the speaker walking out of the camera field of view and by large movements. Figure 3(d) shows the 3D MAE of the above three methods, reflecting the superiority of tracking with audio-visual fusion.

The effectiveness of the proposed deep metric learning-assisted observation model is evaluated by comparison with a two-layer particle filter (2-LPF), which uses color histogram matching to measure the similarity between the rectangle around a particle and the reference template. The HSV color model is extracted to calculate the Bhattacharyya distance, thereby defining the likelihood and particle weights in the same form as equation (10). Partial 3D trajectories of the two trackers on seq08c1 and seq11c2 are shown in Figures 3(e) and 3(f), where the tracking results of the proposed method (yellow) are closer to the ground-truth trajectories (green). Compared with traditional color features, the features extracted by the network are more discriminative; the Siamese network better handles the similarity measurement of input image pairs, based on which more accurate particle weights are applied.
The trajectory errors shown in the figures are mainly due to the large error in the image-to-3D reconstruction process when the speaker, microphone, and camera lie in the same vertical plane.
In order to investigate the effect of the proposed template update method on tracking performance, another comparative experiment is conducted. In the comparison method, a fixed template, the target image defined in the first frame, is used without updating during the tracking process. Figure 4(a) shows the MAE on the image plane (pixels) of the two methods on seq11c1. The first peak of the error curve is caused by the speaker leaving the screen for a while and then reentering the scene. After frame 460, the target scale and appearance change markedly as the speaker moves close to the camera. As shown in the frame samples in Figure 4(b), the tracker equipped with the template update mechanism achieves stable tracking (red), while the rectangular box produced by the comparison method (green) deviates from the target.
Finally, the tracking accuracy of the proposed tracker and of the existing audio-visual trackers [5, 12, 21] is tested on nine single-person sequences captured by three cameras. Table 1 lists the MAE on the image plane (pixels) and in 3D (m). The proposed tracker achieves outstanding performance both on the image plane and in 3D space.

Conclusions
This paper presents a deep metric learning-assisted 3D audio-visual speaker tracker, which integrates a designed Siamese network into the two-layer PF framework. In the proposed observation model, the similarity measures between the template and the particle diffusion areas are calculated to update the weights of the audio particles and visual particles. A template that adapts to changes in the speaker's appearance is obtained through a long short-term update mechanism, which prevents drift during tracking. Audio information and video information are fused through the audio-guided motion model, the conditional weighting formula, and the optimal particle set. The proposed algorithm is evaluated on single-speaker sequences and achieves substantial performance improvements compared to the trackers using individual modalities and the compared audio-visual methods. Future work will focus on acoustic feature extraction and multimodal confidence evaluation methods.
Data Availability
The AV16.3 corpus used in this study is an open-source dataset provided by the Idiap Research Institute. The dataset is publicly available at https://www.idiap.ch/dataset/av16-3/. We cite this dataset in Section 3.1 of the article, with the corresponding entry in the reference list as [20].