Robust Indoor Speaker Recognition in a Network of Audio and Video Sensors

Situational awareness is achieved naturally by the human senses of sight and hearing in combination. Automatic scene understanding aims at replicating this human ability using microphones and cameras in cooperation. In this paper, audio and video signals are fused and integrated at different levels of semantic abstraction. We detect and track a speaker who is relatively unconstrained, i.e., free to move indoors within an area larger than in comparable reported work, which is usually limited to round-table meetings. The system is relatively simple, consisting of just 4 microphone pairs and a single camera. Results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. System evaluation is performed on both single- and multi-modality tracking. The performance improvement given by the audio-video integration and fusion is quantified in terms of tracking precision and accuracy as well as speaker diarisation error rate and precision-recall (recognition). Improvements over the closest works are evaluated: a 56% reduction in sound source localisation computational cost over an audio-only system, an 8% improvement in speaker diarisation error rate over an audio-only speaker recognition unit and a 36% improvement on the precision-recall metric over an audio-video dominant speaker recognition method.


Introduction
The establishment of the digital era has created applications which combine audio and video to automate human activity analysis and understanding. We highlight the main areas of interest. First, surveillance applications, i.e., detecting a person's biometric features to ensure that there are no intruders in a restricted area [1]. Second, understanding people's social behaviour and interaction to determine their "role" and their intentions [2]. Third, detecting a possible threat in a public place [3] and, consequently, beam-forming and segmenting a dialogue [4]. Typical surveillance scenarios are characterised by the use of many wide-area, distributed sensors covering unconstrained scenarios. Scene monitoring is often required to be real-time, so computationally inexpensive algorithms are fundamental to the development of an effective system. This is not always evident in the literature at present, and it is the challenge we address in this work.

Related work
The first step towards full audio-video (AV) human activity analysis and understanding systems is detecting and tracking speakers through significant occlusions. State-of-the-art sound source localisation algorithms [5,6] are still computationally expensive, hence they are not suitable for "real-time" (or frame-rate) applications. Solving large video occlusions is still an inherently challenging research problem: many existing papers solve the problem by using advanced multi-camera 3-dimensional (3D) systems [7], which are prone to error when the camera views do not overlap. They are also computationally expensive, requiring GPU/FPGA implementations to function at frame-rate when parallelisation is possible. Complementary use of audio and video is able to compensate for noisy, missing and erroneous data, reducing the number of sensors and the computational resources required at the expense of minimal effort in integrating or fusing signals [2,3,8–22].
Audio and video fusion can be achieved in several ways, chiefly using variations of sampling techniques [8,14,15,19,21]. Existing AV person tracking system architectures work well only in highly sanitised, i.e., constrained and predictable, scenarios: principally meeting rooms and diarisation [13,16,18,20], in which people are essentially stationary, e.g., talking while seated around a fixed table. Existing systems use large sensor networks in which microphones and cameras are often very close or even attached to people [13–16]. A hierarchical system is more likely to achieve robust situational awareness: such systems are more robust, equally accurate and require lower algorithmic and hardware complexity [2,16]. Their weakness is that they often treat the two signals as if they were derived from truly independent processes, assuming that one source of noise can affect only one kind of signal. None of the previous work explores whether an underlying relation between audio and video exists and seeks to exploit it fully.
AV event or anomaly detection literature is generally based on inferring AV signal correlations to recognise whether a relevant event has happened in some scenario of interest [3,11,12,17]. The same correlation approach may be used to pick out the dominant speaker from a group of speaking people, without audio beamforming, filtering, blind source separation and data association [23,24]. The definition of dominant speaker is clearly useful: a high degree of gesticulation and speaking activity are the fundamental cues to define dominance [25–28]. In fact, gesturing is associated with speaking activity 80-90% of the time [29]. Focussing on gesticulation detection is particularly suitable for low resolution video, where fine lip motion detection is not applicable and where close microphones may not be available.
To aid the reader, a schematic is shown in Fig. 1, and links to the different sections of the paper are made explicitly in the caption.
Section 2 presents the integration of audio and video data at the signal level and their fusion at decision level for speaker detection and tracking (see [30]). A speaker voice recognition unit is implemented to make the multimodal tracking robust to occlusions (see also [31]). In Section 3, the experiments and the results related to the first part of the system are described. Here, the benefits of fusing multimodal data are highlighted, showing that standalone trackers perform worse than the AV solution. Then, a possible solution to the problem of tracking the current speaker identity through occlusions by recognising speakers' voices is demonstrated. Section 4 presents how to visualise the dominant speaker identity in large indoor surveillance-like scenarios when multiple people speak simultaneously, without resorting to sophisticated algorithms (see also [32]). Finally, in Section 5 the conclusions of this research study are highlighted and future avenues of research enumerated.
The exact contributions of this work relative to the published literature are: (a) definition of a new, high accuracy, fast audio source localisation algorithm augmented by video (stochastic region contraction with height estimation (SRC-HE)) which outperforms the baseline method stochastic region contraction (SRC) of Do et al. [6]; (b) extension of AV techniques for speaker tracking and event detection where people dynamically move and interact which outperforms the baseline method of Izadinia et al. [17]; (c) exploitation of a small sensor network, deploying only a single camera and 8 microphones, which operates in open rooms vs. "standard" meeting rooms with constrained participants; (d) detection and tracking of speaker identity through occlusion and speech overlaps in a joint audio-video algorithm outperforming the state-of-the-art (Tables 1 and 2).
Fig. 1. A detailed schematic diagram of the overall system presented in this paper. The schematic in (a) shows how the audio and video features cooperate at different levels of semantic abstraction. Block cooperations are represented by highlighted arrows which coincide with the novelties of this work. In (b) an audio localisation algorithm is cued by video data which becomes faster and not less accurate (see Section 2.1). In (c) it is shown how Mel-frequency cepstral coefficients (MFCC) voice signature recognition helps video ID tracking to be consistent through occlusions and ID swaps (see Section 2.7). In (d) the system describes how the correlation between optical flow associated with gesturing and sound signature of the scene helps the speaker ID recognition through speech interferences (see Section 4.3). Fundamentally, this system represents the combination of the detections of three "weak" classifiers into one robust process.
Early elements of this work were already presented in [30–32], and this paper makes two additional contributions relative to these papers. First, we give a unified presentation of the earlier work in a broader and fuller context; second, we present additional, new material, specifically graphs (Figs. 1a and 4), Tables 3 and 4 and original results (Figs. 3 and 6).

Audio-video speaker tracking
Bayesian inference is the foundation of most of the existing joint AV tracking schemes. The Kalman filter and its extended version [10,15], the particle filter [8,19,21] as well as hybrid approaches using Markov chain Monte Carlo [13] have all been used to tackle the problem. However, these filters work in meeting room scenarios and use close-field sensors deployed in large array configurations [13–16]. In this section, an AV speaker identity (ID) localisation and tracking algorithm which works in less constrained situations using a small sensor network is presented. Participants are not forced to wear sensors or to orient themselves towards the sensors. Audio source position estimates are computed by the SRC sound source localisation algorithm [6]. The novelty here is that SRC is aided by available video information, which estimates head height over the whole scene and gives a speed improvement of 56% over the original SRC algorithm [6]. We call this approach SRC-HE. Finally, audio and video data are combined in a Kalman filter (KF) which fuses person-position likelihoods and tracks speaker positions and identities through occlusions, demonstrating that the global audio-video tracker (AVT) outperforms single-modality trackers.

Feature extraction
Height detection and video tracking: The appearance-based video tracker extracting person height is based on a GPU-accelerated particle filter with ellipsoid models [33]. The implementation was first described by Limprasert [34] and we direct the reader there for more details. In our work the video data coming from a single camera are exploited. Height measurements $z_h$ for each detected target $i = 1, 2, \ldots, N_v$ ($N_v$ being the number of detected targets) are extracted to cue the audio localisation algorithm (Fig. 2), since they directly correspond to a good estimate of the speaker's head position. Each detected position $z_v$ can be described at each time step $t$ as a $2 \times N_v$ vector, i.e., one 2D position per detected target.

Audio source localisation cued by video height and audio tracking:
A popular method of audio source tracking is to extract the time difference of arrival (TDOA) values that maximise the generalised cross correlation phase transform (GCC-PHAT) of the signals from a pair of microphones, computed in the frequency domain (see Knapp for full details [35]).
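The TDOA extraction described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the naive DFT, the function names `dft` and `gcc_phat_delay`, the zero-padding choice and the synthetic test signal are all assumptions made for illustration, and a practical system would use an FFT library.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (adequate for short illustrative signals)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def gcc_phat_delay(sig, ref, max_lag):
    """Return the lag (in samples) by which `sig` is delayed relative to `ref`."""
    n = len(sig) + len(ref)  # zero-pad so the correlation does not wrap around
    a = dft(list(sig) + [0.0] * (n - len(sig)))
    b = dft(list(ref) + [0.0] * (n - len(ref)))
    cross = [x * y.conjugate() for x, y in zip(a, b)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    phat = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    cc = [v.real for v in dft(phat, inverse=True)]
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cc[lag % n])

# Synthetic check: the second channel is the first one delayed by 3 samples.
random.seed(0)
noise = [random.uniform(-1.0, 1.0) for _ in range(29)]
ref = noise + [0.0] * 3
sig = [0.0] * 3 + noise
delay = gcc_phat_delay(sig, ref, max_lag=8)
print(delay)
```

Dividing the lag by the sampling rate gives the TDOA in seconds for that microphone pair.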
Fig. 2. Video detected height data is used to reduce the search space for the audio sound source localisation (SSL) algorithm SRC [6]. Note that the third person is not picked up by the height estimation algorithm as she was part of the background right from the start of the video signal processing. In fact, the latter is updated on the basis of a background subtraction algorithm.
A method more robust to reverberation, the steered response power (SRP), makes use of the GCC-PHAT to build an energy map in a system with multiple microphones [6]. This is the sum over all pairs $(m, n)$ of microphones of the corresponding value of the GCC-PHAT for the TDOA. Evaluating the SRP across an entire room is computationally costly. In this work, an enhanced version of the SRC algorithm is used to localise an audio source faster and more accurately. SRC works by sampling the SRP randomly and choosing a subset of the largest samples to form a new region to sample within. This is repeated until the process has discovered a maximum. In order to further improve upon the SRC, instead of sampling uniformly over height, a different sampling distribution is used, centred around a head height.
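The contraction loop can be sketched as follows. This is a simplified illustration rather than the SRC of [6]: the fixed sample counts, the fixed number of iterations, the function name `src_maximise` and the quadratic stand-in for the SRP map (`toy_srp`) are all assumptions.

```python
import math
import random

def src_maximise(f, bounds, n_samples=200, n_keep=20, n_iter=15, seed=0):
    """Stochastic region contraction: sample, keep the best points, shrink the box, repeat."""
    rng = random.Random(seed)
    best_point, best_value = None, float("-inf")
    for _ in range(n_iter):
        points = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
                  for _ in range(n_samples)]
        points.sort(key=f, reverse=True)
        top = points[:n_keep]
        if f(top[0]) > best_value:
            best_point, best_value = top[0], f(top[0])
        # New search region: bounding box of the highest-valued samples.
        bounds = [(min(p[d] for p in top), max(p[d] for p in top))
                  for d in range(len(bounds))]
    return best_point, best_value

# Hypothetical stand-in for the SRP map, with a single peak at a known position.
peak = (1.2, 2.5, 1.7)

def toy_srp(p):
    return -sum((a - b) ** 2 for a, b in zip(p, peak))

best, value = src_maximise(toy_srp, [(0.0, 3.0), (0.0, 4.0), (0.0, 3.0)])
print(best, math.dist(best, peak))
```

Each call to `f` corresponds to one functional evaluation of the SRP, which is the cost that the height cue is meant to reduce.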
Around each person, a tracking algorithm can be relatively confident of their height. Further away from them, the decreasing confidence is modelled by increasing the variance of the sampling probability density function (PDF). Hence, the variance at a distance $l$ metres from a speaker is modelled by a sigmoid function $q$, given in (1), which is a scaled error function. This function is zero at the origin and asymptotically approaches a constant as its argument tends towards infinity. All the variances around each detected speaker height are combined to form a global variance: at any point $p$ in space, the appropriate variance to use is the sigmoid function evaluated at the minimum of the set of all 2-dimensional (2D) Euclidean distances from $p$ to the known source locations. The minimum is chosen to ensure that the change in variance remains smooth even for overlapping sigmoids from multiple sources.
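A minimal sketch of this variance model is given below. The asymptotic variance `max_var` and the length scale `scale` are hypothetical values chosen for illustration; the scaling actually used in (1) is not reproduced here.

```python
import math

def height_variance(point, sources, max_var=0.25, scale=1.0):
    """Variance of the height-sampling PDF at a 2D point.

    A scaled error function of the distance to the nearest known speaker:
    zero at a speaker position, approaching max_var far from every speaker.
    """
    nearest = min(math.dist(point, s) for s in sources)
    return max_var * math.erf(nearest / scale)

speakers = [(1.0, 1.0), (3.0, 3.0)]  # hypothetical known speaker locations (m)
for p in [(1.0, 1.0), (1.5, 1.0), (2.0, 2.0), (10.0, 10.0)]:
    print(p, round(height_variance(p, speakers), 4))
```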
From a sparse set of people, the head height at every x-y coordinate in the SRP map needs to be defined. This is achieved using interpolation and extrapolation. When doing the interpolation, there is a trade-off between the smoothness of the curve produced and the size of ripples produced. The interpolation should not contain severe ripples as they would lead to large errors in the head height estimation across the room. Ideally, it should be monotonic and one way to achieve this is to use Delaunay triangulation [36] on the set of speakers, which creates a surface which can be evaluated at any 2D point.
To choose head height, existing knowledge of the current positions and heights of people in the room, obtained from a camera (Fig. 2), is used (SRC-HE). In particular, the height data is updated on each iteration to the height of the last SRP peak found. Finally, the height $h_{sub}$ to use at each time step for every 2D point is drawn from a distribution defined by the set of known speaker locations and the set $H$ of interpolated heights. This mixes a Gaussian with a uniform distribution across $h_r$, the entire height of the room. The resulting SRP value for the point $p$ is then evaluated at this height (see Algorithm 1 and Fig. 1b). The SRC-HE algorithm allows for direct speaker position calculation. Nevertheless, speaker position estimates are characterised by missing and false detections. This is mostly due to speech pauses and room reverberation, respectively. Thus, SRC estimated positions are filtered by a KF. The resulting signal vector is denoted $z_a$, to which the speaker ID at any given time $t$, $S_A(t)$, is assigned.
Algorithm 1. Finding the global maximum using video height. Input: video detected heights.
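The Gaussian/uniform mixture used for height sampling can be sketched as below. The mixing weight `uniform_weight`, the redrawing of Gaussian samples that fall outside the room, and the numerical values are assumptions for illustration only; the head height is assumed to lie inside the room.

```python
import math
import random

def sample_height(head_height, variance, room_height, uniform_weight=0.2, rng=random):
    """Draw a candidate source height from a Gaussian/uniform mixture."""
    if rng.random() < uniform_weight:
        # Uniform component: anywhere between floor and ceiling.
        return rng.uniform(0.0, room_height)
    # Gaussian component centred on the head height, redrawn until inside the room.
    while True:
        h = rng.gauss(head_height, math.sqrt(variance))
        if 0.0 <= h <= room_height:
            return h

rng = random.Random(1)
samples = [sample_height(1.7, 0.01, 3.0, rng=rng) for _ in range(1000)]
near_head = sum(abs(h - 1.7) < 0.3 for h in samples) / len(samples)
print(f"fraction of samples within 0.3 m of the head height: {near_head:.2f}")
```

Most samples concentrate around the video-derived head height, while the uniform component keeps a small probability of finding sources elsewhere, e.g., a person the video tracker has missed.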

Fusion of audio and video decisions
As previously stated, to speed up SRC search time, the speaker's height (computed by the video particle filter (PF)) is input into the audio unit to drive height sampling (SRC-HE). Then, after the audio and video data have been aligned, the posteriors of the KF audio tracker and of the video PF, $x_a$ and $x_v$ respectively, are fused in a common KF node. As data are gathered simultaneously and used all at once in a centralised fashion, audio and video PDFs are assumed to be independent of one another. On the basis of the a priori local estimates for the state predicted by the single-modality trackers at each time step, we evaluate the joint state estimate (Algorithm 2).
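Under the independence assumption, fusing two Gaussian estimates reduces to inverse-variance weighting. The sketch below shows this static special case for diagonal covariances; it is not the KF node itself, and the function name and numerical values are hypothetical.

```python
def fuse_independent(x_a, var_a, x_v, var_v):
    """Fuse two independent Gaussian position estimates, coordinate by coordinate."""
    fused, fused_var = [], []
    for xa, va, xv, vv in zip(x_a, var_a, x_v, var_v):
        w = vv / (va + vv)  # weight on the audio estimate (larger when video is noisier)
        fused.append(w * xa + (1.0 - w) * xv)
        fused_var.append(va * vv / (va + vv))
    return fused, fused_var

# Hypothetical estimates: audio is less certain than video in both coordinates.
x_av, var_av = fuse_independent([1.0, 2.0], [0.4, 0.4], [2.0, 2.0], [0.1, 0.1])
print(x_av, var_av)
```

The fused variance is always smaller than either input variance, which is the formal reason why the joint estimate can outperform both single-modality estimates when the independence assumption holds.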
The final, joint AV output is fed back into the individual audio and video trackers as the best estimate of the previous time step to improve the single-modality estimation. It is important to note that, since people are assumed to speak alternately (which is a strong assumption for a normal conversation), a single audio signal corresponds to several video measurements at a time, one for each of the detected targets. By basing the audio-to-video data association step on spatial proximity, i.e., nearest neighbour (NN) (more than one speaker cannot exist at the same point in space), speaker segmentation and ID recognition can also be obtained as long as people are resolved by the AV tracker. Its measurements can be considered robust with respect to the speaker motion model (see Algorithm 2).
In particular, the speaker ID inferred by the joint AVT is equal to that of the $i$-th target if that target is the nearest neighbour of the audio position estimate. Once a visual ID $i$ has been assigned to every target in an image, the speaker change detection output by the audio unit is used to solve video occlusion. In particular, when a pair of video detections falls within a certain region $D$ (which depends on the video tracker accuracy for each pair $i, j$ of video detected targets), only audio contributes to the KF innovation. If audio and video do not both fall within a certain region $A$, based on both audio and video tracker accuracy, then a new speaker is conservatively considered to be detected according to the audio ID guess $S_A$, successfully resolving occlusions (Algorithm 2). However, in a large reverberant room audio false positives do exist and compromise the speaker ID recognition based on positional data only. The integration of a speaker recognition (SR) module is proposed to make the multimodal AVT more robust to video occlusions in reverberant rooms where people move around.
Algorithm 2 returns the joint state estimate $x_{av}$ and the speaker ID $S_{AV}$.
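The nearest-neighbour association with a gating region can be sketched as follows. The circular gate and its radius `gate_radius` are assumptions standing in for the accuracy-dependent region $A$; the function name and target positions are hypothetical.

```python
import math

def associate_speaker(audio_xy, video_targets, gate_radius=0.5):
    """Nearest-neighbour association of one audio estimate to the video targets.

    video_targets maps a video ID to its 2D position. Returns the ID of the
    nearest target, or None when no target lies inside the gate.
    """
    if not video_targets:
        return None
    nearest_id = min(video_targets,
                     key=lambda tid: math.dist(audio_xy, video_targets[tid]))
    if math.dist(audio_xy, video_targets[nearest_id]) > gate_radius:
        return None  # no video target is close enough: audio-only detection
    return nearest_id

targets = {1: (0.5, 1.0), 2: (2.0, 3.0)}  # hypothetical video detections (m)
print(associate_speaker((1.8, 2.7), targets))
print(associate_speaker((4.0, 4.0), targets))
```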

Dealing with occlusion
In Section 2.2 it is pointed out that when people occlude each other, as in normal social interactions, Bayesian multimodal speaker tracking based on audio and video position detections cannot, in certain situations, distinguish the actual speaker ID in a conversation. This mainly occurs when the video tracks merge or cross over and the signal to reverberation ratio (SRR) is too low. As long as the video target ID recognition is based on general properties such as characteristic clothing, the natural dynamic and ambiguous behaviour of such a feature may lead to situations, such as occlusions, in which it is completely useless, e.g., two people who wear clothes of the same colours will have very similar associated colour histograms (see Fig. 9 for an example of such a situation). In the literature, this is normally solved either by using proximity models or by placing physical constraints on people. However, if target ID is decided on the basis of a more specific feature such as voice, the fact that it is seldom observable could reduce the number of cases in which visual ID determination is compromised, representing a more elegant and less invasive solution. Voice spectral features are now calculated for each speaker and this information is incorporated into the AVT, so as to simplify the video-to-audio nearest neighbour data association step. By doing so, it is demonstrated that the AVT ID tracking performance improves. In turn, when speakers are distant from the microphones, recognising a speaker by their voice can be very complicated [37,38]. Thus, exploiting audio-video positional cues also benefits the speaker voice recognition task at a distance (see Section 2.5).

SRC-HE vs. GCC-PHAT audio tracking
Despite the fact that SRC-HE reduces the number of functional evaluations (FEs), audio measurement extraction based on SRC would still not be suitable for real-time applications [39]. The previous SRC-HE module is therefore replaced by the generalised cross correlation phase transform (GCC-PHAT) introduced in Section 2.1, as this does not involve cumbersome point function estimations. The drawback is that the basic GCC algorithm can only detect one source at a time and is known to be sensitive to room reverberation [5]; however, it is still effective in moderately reverberant environments ($T_{60} \approx 0.3$ s) [40]. For these reasons, experiments are at first carried out where only one speaker is active at any given time, as often happens in a polite conversation between two or more people. Speech segments are further extracted using a voice activity detector (VAD) [41] and processed using a GCC-PHAT step, for the signal to be more robust to reverberation. Thus, the measurement vector obtained $z_a$ (see Section 2.1) can now be rewritten as $z_a(t) = [\tau_1(t), \ldots, \tau_M(t)]^T$, where each component $\tau_m$ is the TDOA collected at the $m$-th of the $M$ microphone pairs at each time step $t$. Since TDOAs are not linear in the speaker position, they must be input into an extended Kalman filter (EKF), as in [10], to obtain an audio position estimate.
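The non-linearity that motivates the EKF can be made explicit with a small sketch of the TDOA measurement model for one microphone pair and its gradient with respect to the 2D speaker position. The speed of sound, the microphone coordinates and the function names are assumptions for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def tdoa(p, mic_a, mic_b):
    """TDOA (s) at a microphone pair for a source at 2D position p."""
    return (math.dist(p, mic_a) - math.dist(p, mic_b)) / SPEED_OF_SOUND

def tdoa_jacobian(p, mic_a, mic_b):
    """Gradient of the TDOA with respect to p: one row of the EKF measurement Jacobian."""
    da, db = math.dist(p, mic_a), math.dist(p, mic_b)
    return [((p[k] - mic_a[k]) / da - (p[k] - mic_b[k]) / db) / SPEED_OF_SOUND
            for k in range(2)]

# Hypothetical geometry: a microphone pair 0.3 m apart and a speaker about 2 m away.
mic_a, mic_b = (0.0, 0.0), (0.3, 0.0)
speaker = (1.0, 2.0)
print(tdoa(speaker, mic_a, mic_b), tdoa_jacobian(speaker, mic_a, mic_b))
```

Stacking one such row per microphone pair gives the linearised measurement matrix that an EKF evaluates at the predicted position at every time step.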

Text-independent speaker recognition
We propose that, since the microphones already gather audio information for tracking purposes, the temporal spectral content of the signal can be used to extract speakers' voice features and recognise their ID. Specifically, an SR module is chosen which performs text-independent speaker identification based on a Gaussian mixture model (GMM) [42], under the assumption that there exist $N_v$ possible speaker identities (as many as the detected video targets), whose "voiceprint" models $p(s_i)$ are learned beforehand. In particular, speaker voice models are calculated on the basis of a 60 s training signal for each speaker. From every voice sequence 12 sets of Mel-frequency cepstral coefficients (MFCC) [43] are extracted. Each model is represented by a 32-mixture GMM, whose parameters are estimated on the basis of the extracted MFCC vectors by expectation maximisation (EM) [44]. The test conversation sequences, which are not recorded in matching conditions (as would be the case in a surveillance scenario characterised by different noises day by day), are framed in small speech-only subsegments which are considered to be long enough to detect a speaker change. For each speech subsegment $S$, its MFCCs are extracted and compared to the available database of speaker models to determine the likelihood $p(S \mid s_i)$ that a particular speaker ID uttered it. Finally, the speaker ID opinion is output as the identity that maximises this likelihood, $\arg\max_i p(S \mid s_i)$, and its GMM likelihood is used as a confidence measure. Our experiments here are characterised by just one speaker change detection point. The performance of the SR unit is better evaluated in terms of speaker verification [45]. Hence, Fig. 3 shows the comparison of each voice in our database against each other voice model. Fig. 3a illustrates the performance for 1 close microphone pair recording (2 channels) and Fig. 3b for 3 far microphone pairs recording (6 channels), to highlight the difficulties of detecting a speaker at a distance despite the increased number of channels. In particular, the equal error rate (EER) [46] is 0 in the first case whereas it rises to 5.12% in the second. Finally, Fig. 3c shows results from all 8 channels; note that despite adding in the 2 close recordings used for the first measure (Fig. 3a), the far distance microphones detrimentally affect the global performance, whose EER is still as high as 4.96%.
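The scoring step of GMM-based speaker identification can be sketched as below. The models are tiny hand-written stand-ins (two diagonal-covariance components in two dimensions) rather than 32-mixture models trained by EM on MFCC vectors, and all names and values are hypothetical.

```python
import math

def gmm_log_likelihood(frames, gmm):
    """Total log-likelihood of feature frames under a diagonal-covariance GMM.

    gmm is a list of (weight, mean, variance) components.
    """
    total = 0.0
    for x in frames:
        comp = []
        for weight, mean, var in gmm:
            log_pdf = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
                          for xi, m, v in zip(x, mean, var))
            comp.append(math.log(weight) + log_pdf)
        peak = max(comp)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total

def identify_speaker(frames, models):
    """Return the speaker ID with the highest log-likelihood, plus all scores."""
    scores = {sid: gmm_log_likelihood(frames, gmm) for sid, gmm in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical voiceprint models and test frames (stand-ins for MFCC vectors).
models = {
    "A": [(0.5, (0.0, 0.0), (1.0, 1.0)), (0.5, (2.0, 2.0), (1.0, 1.0))],
    "B": [(0.5, (5.0, 5.0), (1.0, 1.0)), (0.5, (7.0, 7.0), (1.0, 1.0))],
}
frames = [(5.1, 4.8), (6.9, 7.2), (5.3, 5.0)]
speaker_id, scores = identify_speaker(frames, models)
print(speaker_id, scores)
```

The margin between the best and the second-best score is one possible confidence measure for the decision.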

Speaker conversation model
A new speaker switching probability is now introduced. We call this a conversation model (CM). The CM is initially triggered by the $i$-th speaker ID detection obtained as a weighted speaker score fusion of the AVT and the SR modules [38]. The actual speaker ID in this case is the one selected by this fused score.
Algorithm 3. Audio-video tracking aided by speaker recognition. Input: audio $z_a$ and video measurements.
Once a video ID $i$ has been assigned to every target in each frame, the person recognition score derived from an SR+CM combination may be used in order to recover tracking ID data when occlusions occur. In such a case, competitive association hypotheses exist for the AVT, i.e., the AVT confidence drops below a certain threshold; thus, the SR and the CM opinions ratify the actual speaker ID, according to a weighted sum fusion rule [47] where weights are decided on the basis of their estimate confidences [38]. Hence, the speaker ID is first fed back into the AVT to aid resolving the nearest neighbour AV association and subsequently to correct the wrongly inferred speaker ID. Secondly, it is sent to the video tracker to indirectly re-assign the correct appearance models to the targets, thus resolving the occlusion (see Algorithm 3 and Fig. 1c).
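A weighted sum fusion of module opinions can be sketched as follows. The per-module scores, the confidence values and the normalisation of the weights by their sum are assumptions for illustration; the exact weighting of [38,47] is not reproduced.

```python
def fuse_speaker_scores(opinions):
    """Weighted sum fusion of per-speaker scores.

    opinions is a list of (scores_by_id, confidence) pairs, one per module;
    each weight is the module confidence normalised by the sum of confidences.
    """
    total_conf = sum(conf for _, conf in opinions)
    fused = {}
    for scores, conf in opinions:
        for sid, score in scores.items():
            fused[sid] = fused.get(sid, 0.0) + conf / total_conf * score
    return max(fused, key=fused.get), fused

# Hypothetical occlusion case: the AVT cannot separate the two targets,
# while the SR and CM modules both favour speaker 2.
avt = ({1: 0.5, 2: 0.5}, 0.2)
sr = ({1: 0.2, 2: 0.8}, 0.6)
cm = ({1: 0.3, 2: 0.7}, 0.2)
speaker, fused_scores = fuse_speaker_scores([avt, sr, cm])
print(speaker, fused_scores)
```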

Experimental evaluation
Since the presented overall system (Fig. 4) is composed of several modules, results are presented separately for each to show the validity of the proposed approach and to aid readability. Nevertheless, a summary of all the experiments conducted is presented in Table 3 to clarify the evolution of their rationale.
It is worth noting that, given the wide range of human activity analysis applications, scenarios of interest and sensor configurations are varied, hence no standard data set has yet been collected for general purpose benchmarking. Systematic evaluation and comparison of the different fusion techniques for the specific task of AV speaker localisation and tracking is therefore not possible, and for this work it was decided to develop a custom setup with fewer constraints on people, in contrast to classic meeting room applications.
One camera and 4 pairs of directional microphones are used to record AV data in a typical open office room of size 11 m × 10.1 m, where the area of interest is 3 m × 4 m and where people can move freely. A picture of the room and the sensor layout is presented in Fig. 5, and a graphical explanation of the microphone placement is presented in Fig. 6. This positioning of the microphone pairs was chosen to maximise the performance of the TDOA estimation in the analysed room. In particular, three configurations of microphones were compared, i.e., (a) 1 linear array of 8 microphones (1 × 8), (b) 2 arrays of 4 microphones (2 × 4) at two sides of the area, and (c) 4 pairs of microphones (4 × 2) at the four sides of the analysed area, among which the winning solution is the one proposed (4 pairs of microphones). Fig. 6 shows the contour maps of the room in the xy plane. As can be seen, the contours for the 4 microphone pairs are the most dense and distinct all around the room, unlike the other configurations.
Ground-truth data were hand labelled on a ground plane common to camera and microphones. Audio signals were sampled by the audio interface with 24-bit resolution at 44.1 kHz, whereas the camera recorded 640 × 480 RGB video frames at a rate of ≈7.5 Hz. No attempt to reduce normal background noise (desk fans, footsteps, talking, etc.) was made and a reverberation time $T_{60} \approx 0.5$ s was measured [48]. Synchronisation of the data was achieved by processing audio and video streams according to the camera frame rate, i.e., every 133 ms. Filters were initialised using the video detected position of their corresponding targets and static matrices Q and R [10], whose values were chosen on the basis of an optimisation step.
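The relation between the two sampling rates can be checked with a short calculation. It assumes a nominal frame rate of exactly 7.5 Hz (the reported value is approximate), and the helper `audio_window` is hypothetical.

```python
AUDIO_RATE_HZ = 44_100
VIDEO_RATE_HZ = 7.5  # nominal value; the reported frame rate is approximate

frame_period_ms = 1000.0 / VIDEO_RATE_HZ
samples_per_frame = round(AUDIO_RATE_HZ / VIDEO_RATE_HZ)

def audio_window(frame_index):
    """Start and end audio sample indices aligned with a given video frame."""
    start = frame_index * samples_per_frame
    return start, start + samples_per_frame

print(f"{frame_period_ms:.1f} ms per frame, {samples_per_frame} audio samples per frame")
print(audio_window(2))
```

Each video frame therefore corresponds to a block of 5880 audio samples, which is the window over which one audio measurement is extracted per time step under this assumption.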
Results are described in terms of multiple object tracking precision (MOTP) and multiple object tracking accuracy (MOTA) [49]. The tracker is considered to have correctly hit the target if the distance between its output and the ground truth is within 0.5 m. Furthermore, the ability of the system to detect speaker ID by localising their voice is measured in terms of diarisation error rate (DER) [46], expressed through the speaker error only, i.e., the percentage of speech assigned to the wrong speaker.
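A simplified sketch of how such scores can be computed for a single target per frame is given below. It follows the commonly used CLEAR MOT definitions (MOTP as the mean position error over hits, MOTA as one minus the rate of misses, false positives and ID mismatches); the exact normalisation used for the percentages reported in this paper may differ, and the frame data are hypothetical.

```python
import math

def clear_mot(frames, hit_radius=0.5):
    """Simplified single-target MOTP/MOTA.

    frames is a list of (truth_xy, estimate_xy or None, id_correct) tuples.
    """
    hits, error_sum, misses, false_pos, mismatches = 0, 0.0, 0, 0, 0
    for truth, estimate, id_correct in frames:
        if estimate is None:
            misses += 1
        elif math.dist(truth, estimate) <= hit_radius:
            hits += 1
            error_sum += math.dist(truth, estimate)
            if not id_correct:
                mismatches += 1
        else:
            misses += 1      # the target was not covered by any output
            false_pos += 1   # and the output matched no target
    motp = error_sum / hits if hits else float("nan")
    mota = 1.0 - (misses + false_pos + mismatches) / len(frames)
    return motp, mota

frames = [
    ((0.0, 0.0), (0.1, 0.0), True),   # hit with correct ID
    ((0.0, 0.0), (0.3, 0.0), False),  # hit with wrong ID
    ((0.0, 0.0), None, True),         # miss
    ((0.0, 0.0), (1.0, 0.0), True),   # estimate outside the 0.5 m radius
]
motp, mota = clear_mot(frames)
print(motp, mota)
```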

Experiments and results
The first set of experiments is designed to simulate a 60 s long personal and intimate conversation between two people, according to Hall's classification of social interpersonal distance in relation to physical interpersonal distance [50]. Specifically: Experiment 'Formal Conversation' (Fig. 7a) considers two people who throughout the experiment are separated by a distance of approximately 1 m.
Experiment 'Informal Conversation' (Fig. 7b, c) considers two people who throughout the experiment are at a distance of approximately 0.4 m, resulting in an occlusion for the video tracker.

Results
Results are shown in Table 5a for 2 off-line cycles of the SRC-HE detection algorithm against the original SRC. Sound source localisation (SSL) accuracy changes by 4% when the extracted video height information is added. More interesting is the number of functional evaluations (FEs), which on average is reduced by 56% (56,281 FEs for SRC vs. 24,797 for SRC-HE), meaning that by narrowing down the search space our algorithm effectively speeds up the localisation task. In Table 5b the performance of the multimodal AVT against single-modality trackers is presented. Results averaged over both experiments and 100 Monte Carlo runs show that fusion of audio and video data improves on single-modality trackers when an occlusion occurs (see 'Informal Conversation' results). In particular, by fusing audio and video the AVT achieves a 53% higher MOTA, which is also reflected in a far higher MOTP (80%), with respect to the video-only solution, which tracks the correct person ID for just half of the experiment time. In fact, the video tracker on its own cannot resolve occlusions. Finally, note that, as expected, the DER of the multimodal AVT solution is 8% better than that of the audio-only system in the 'Formal' experiment, whereas in the 'Informal' one it is slightly worse. This is because the appearance-based video ID estimates are completely wrong and corrupt the multimodal decision. This motivated the further integration of the speaker voice features in the system. Hence, the next experiments demonstrate how this algorithm can more robustly maintain and recover tracking ID through occlusions by recognising people's voice signatures.
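The reported reduction in functional evaluations can be verified directly from the two averages quoted above, assuming that the larger count belongs to the original SRC and the smaller one to SRC-HE.

```python
fes_src = 56_281     # average FEs, original SRC (assumed to be the larger count)
fes_src_he = 24_797  # average FEs, SRC-HE

reduction = (fes_src - fes_src_he) / fes_src
ratio = fes_src / fes_src_he
print(f"{100 * reduction:.1f}% fewer FEs, i.e. {ratio:.2f} times fewer SRP evaluations")
```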
In the following experiments, every dataset is typically 2-5 min long and features people speaking in turns in a non-meeting scenario. The focus is on ID recognition results rather than on precision, which is obviously not high in such a challenging scenario if no further signal processing is used, as also stated in [51]. Note that we have deliberately recorded our own set of audiovisual data. This choice was made because classical AV datasets [51,52] are not suitable for our purposes: none provides recordings of speakers' voices for recognition purposes, as their principal aim is tracking people's ID by means of video cues only.
Experiment 'Single Speaker' (Fig. 8) considers a person speaking while walking twice around a rectangular trajectory, appearing and disappearing from behind an occlusion. His trajectory, as detected by the video tracker, is shown in the lower left corner of Fig. 8.
Experiment 'Abandoning' (Fig. 9) shows a person walking and talking along a rectangular trajectory, as in the previous experiment, disappearing behind an occlusion. Then a second person, who looks like the first one and who is also speaking, emerges from behind the occlusion and walks along the same trajectory until he disappears again.
Experiment 'Crossing' (Fig. 10) shows two people with very similar appearance walking while having a conversation. They meet along a diagonal, where they keep on walking past each other, causing an occlusion in the resulting image. Again, the trajectory is shown in the lower right corner of Fig. 10.
Results are now presented in Table 6. Here, it is worth noting that there is no real improvement between the AVT, AVT+SR and AVT+SR+CM tracking results, as the final decision on the speaker voiceprint influences the AVT performance only when their confidence ratio is large. In this case, as the audio tracker is quite confident in its estimation, this ratio is close to 1 for almost the whole trajectory; therefore the DER is the index which shows the benefits of adding speaker voice to the system. Results averaged over all the experiments and 100 Monte Carlo runs show that AVT+SR is better by 13% than AVT, whereas AVT+SR+CM outperforms the AVT by 27% because of the conversation smoothing prior. Furthermore, a very good result when detecting speaker ID from far-field microphones is achieved, i.e., almost 5% in the worst case. In fact, a microphone 2 m away normally shows a ≈20% DER [38]. In the presented experiments instead, an average 8% DER improvement of the AVT+SR+CM multimodal system over the SR-only results (very first column of Table 6) is measured.

An indoor surveillance scenario
In Section 2 it is highlighted that the generalised cross correlation audio localisation function is not useful in cross-talking situations such as general security and surveillance (e.g., see Section 2.4). Automatic speaker recognition fails when two people are talking at once [23]. In fact, as the CLEAR 2007 evaluation proved [51], temporal overlaps accounted for more than 70% of the error in the speaker ID recognition task. However, the grand aim of automatic surveillance applications is to correctly detect the "dominant" speaker in a large scenario where speech overlaps are highly probable. A speaker is dominant in that their speaking energy is higher than that of the other people who are talking, who can instead be considered as babble noise. This may be the case, for example, in a bank, where isolating individual sources is useful for safety reasons. Audio-wise, such a task would normally require filtering techniques, beam-forming, non-trivial data association and blind source separation [23,24]. The concept of dominance has multiple definitions in the literature, which are often used as equivalent [53]. Nevertheless, many studies do agree that speaker loudness or energy, speaking time and rate, as well as gesture-based cues are the fundamental features that define dominance [26,54]. In this section, a novel method to automatically detect and localise the actual (dominant) speaker in an enclosed and cluttered scenario is introduced. Specifically, one more video feature, i.e., optical flow velocity and acceleration, together with the audio Δ-MFCC, is added on top of the system presented in Section 2.7, and audio and video are finally combined across semantic data levels. The motivating insight is that gesturing means speaking: observing strong motion implies that an audio signal may be causally linked to that video signal. We therefore seek the correlation between the optical flow in a scene and its associated audio MFCC coefficients (see Section 2.5). Furthermore, the audio and video position estimates of the actual speaker given by the AVT+SR+CM (see Section 2.2) are combined with correlation cues at the feature level to narrow down the visual search space of the correlation algorithm, hence reducing the probability of inferring a wrong sound-to-pixel region association. Using this solution we further improve on ID recognition-at-a-distance in a surveillance scenario.

Feature extraction

Optical flow video features
The video features for the AV correlation computation are at first computed as the forward and backward dense optical flow of each image. Then, the velocity and acceleration of the motion between adjacent frames are calculated. If $U^{+}(\mathbf{p}, t)$ represents the optical flow $(u, v)$ at pixel position $\mathbf{p} = (i, j)$ at time $t$, calculated between frames $F_t$ and $F_{t+1}$, and analogously $U^{-}(\mathbf{p}, t)$ is the flow vector computed between $F_t$ and $F_{t-1}$, then the velocity and acceleration vectors, vel and acl, are defined from these two flow fields. Hence, we combine the RGB colour, velocity and acceleration of each pixel $\mathbf{p}$ in a frame into a single feature vector: $v_{ij} = (\mathrm{col}, \mathrm{vel}, \mathrm{acl})_{\mathbf{p}}$. Thus, we spatially segment every frame using the QuickShift [55] algorithm with γ = 0.25, σ = 1 and τ = 15, i.e., the same as in [17]. Furthermore, we compute across frames a K-means [56] spatio-temporal segmentation where $K = 30$. As a consequence, when the processing ends, every pixel in a frame can be ascribed to the spatio-temporal centre of mass of the k-th segment found by K-means. The $K$ final segments $S_k$ ($k = 1, \dots, K$) are described by the averaged normalised velocity and acceleration of the pixels they enclose, in addition to their mean RGB colour. Finally, the $m_1$ top segments for velocity and the $m_2$ top segments for acceleration are chosen to compose the final video feature vector v. In practice, v is an $m \times t$ matrix whose columns correspond to frames.

Organisation of MFCC audio features: To compute the AV correlation, audio features are now represented by the first $n/2$ MFCC coefficients (see Section 2.5), i.e., signal velocity, and their $n/2$ derivatives, i.e., signal acceleration. The audio feature vector a is an $n \times t$ matrix whose columns correspond to frames. Note that for the following experiments the derivatives of the audio signal MFCC (Δ-MFCC) have also been computed.
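The optical flow features described above can be illustrated with a minimal sketch. It assumes that the dense forward and backward flow fields are already available as arrays, and it assumes a finite-difference definition of the two motion vectors (velocity taken as the forward flow, acceleration as the sum of forward and backward flow); this definition, the function names and the per-frame normalisation are illustrative assumptions, not the exact formulation of the paper. The segment labels stand in for the output of the spatio-temporal segmentation.

```python
import numpy as np

def motion_features(flow_fwd, flow_bwd):
    """Per-pixel velocity and acceleration from dense optical flow.

    flow_fwd: (H, W, 2) flow from frame t to t+1.
    flow_bwd: (H, W, 2) flow from frame t to t-1.
    ASSUMPTION: vel = U+ and acl = U+ + U- (the backward flow points
    against the previous motion, so the sum is a finite difference).
    """
    vel = flow_fwd
    acl = flow_fwd + flow_bwd
    return vel, acl

def segment_descriptors(vel, acl, labels, n_segments):
    """Mean normalised velocity/acceleration magnitude of each segment.

    labels: (H, W) integer segment index per pixel (hypothetical output
    of the segmentation step).
    """
    eps = 1e-12  # avoids division by zero on static frames
    vel_mag = np.linalg.norm(vel, axis=-1)
    acl_mag = np.linalg.norm(acl, axis=-1)
    vel_mag = vel_mag / (vel_mag.max() + eps)
    acl_mag = acl_mag / (acl_mag.max() + eps)

    flat = labels.ravel()
    counts = np.bincount(flat, minlength=n_segments).astype(float)
    counts[counts == 0] = 1.0  # empty segments keep a zero descriptor
    mean_vel = np.bincount(flat, weights=vel_mag.ravel(),
                           minlength=n_segments) / counts
    mean_acl = np.bincount(flat, weights=acl_mag.ravel(),
                           minlength=n_segments) / counts
    return mean_vel, mean_acl
```

Stacking the descriptors of the selected segments frame by frame would then give a matrix with one column per frame, in the spirit of the video feature matrix v.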

Audio video correlation
Canonical correlation analysis (CCA) is used to seek the correlation between the audio and video feature vectors, hypothesising that a hidden correspondence exists between the image motion velocity and the audio MFCC, as well as between the image motion acceleration and the MFCC derivatives (Δ-MFCC) [57]. CCA computes a common coordinate system onto which a and v can be projected, and in which their maximised correlation is immediately known. This ensures that the retrieved video segment is the one which maximises the correlation between audio and video data, and hence is the one to be associated with the dominant audio source. Specifically, the CCA problem between two random variables has a closed form solution.

Table 6. Joint audio-video tracking aided by speaker recognition experiment results. Note that, as can be inferred from [51], which reports the results of the CLEAR 2007 evaluation in real-world interactive seminar scenarios, perfect tracking of multiple people in such challenging situations is still unrealistic. Moreover, this statement refers to meeting room scenarios equipped with large sensor networks, which are thus more constrained and more densely covered with sensors than scenes such as ours. Most significant here are the MOTA and the DER indices, which express the ability of the system to maintain the correct speaker ID.

The frame segments corresponding to the first video eigenvector $w_{v1}$ are said to be the ones where the sound is originating. Only the normalised elements of $w_{v1}$ larger than a predefined threshold are selected; those segments are thus identified by a binary confidence map $Q(S_k^{v_1})$ smoothed over space and time by a Gaussian kernel.
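As an illustration of the CCA step in this section, the sketch below solves the textbook CCA eigenproblem $C_{vv}^{-1} C_{va} C_{aa}^{-1} C_{av} w_v = \lambda^2 w_v$ with NumPy and thresholds the first video eigenvector. It is a generic formulation rather than the paper's implementation: the function names, the small ridge term `reg` and the threshold value are assumptions made for the example.

```python
import numpy as np

def first_canonical_pair(v, a, reg=1e-6):
    """First canonical basis vectors of video features v (m x t) and
    audio features a (n x t), whose columns are frames.

    ASSUMPTION: textbook CCA eigenproblem with a small ridge term
    `reg` added for numerical stability.
    """
    v = v - v.mean(axis=1, keepdims=True)
    a = a - a.mean(axis=1, keepdims=True)
    t = v.shape[1]
    c_vv = v @ v.T / (t - 1) + reg * np.eye(v.shape[0])
    c_aa = a @ a.T / (t - 1) + reg * np.eye(a.shape[0])
    c_va = v @ a.T / (t - 1)

    # C_vv^-1 C_va C_aa^-1 C_av, an (m x m) matrix
    m = np.linalg.solve(c_vv, c_va) @ np.linalg.solve(c_aa, c_va.T)
    eigvals, eigvecs = np.linalg.eig(m)
    top = int(np.argmax(eigvals.real))
    w_v = eigvecs[:, top].real
    # Matching audio basis vector (up to scale)
    w_a = np.linalg.solve(c_aa, c_va.T @ w_v)
    rho = float(np.sqrt(max(eigvals.real[top], 0.0)))
    return w_v, w_a, rho

def select_segments(w_v, threshold=0.5):
    """Boolean mask of video features whose normalised weight in the
    first eigenvector exceeds a (hypothetical) threshold."""
    weights = np.abs(w_v) / np.abs(w_v).max()
    return weights > threshold
```

In this sketch each row of v plays the role of one segment descriptor, so the returned mask marks the segments that contribute most to the maximal audio-video correlation.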

Fusion of audio-video correlation and audio-video tracking decisions
The integration of the speaker trajectory and the CCA result is carried out at the confidence map level. For every frame $F_t$ we project the actual audio source trajectory calculated by the AVT, $x_{av}$ (Section 2), onto the pixel domain $\mathbf{p}(t)$. Then, the trajectory points are associated to the k-th segmented region to which they belong, $S_{k_p}(i_p, j_p)$. Subsequently, a second confidence map $Q(S_{k_p})$ is set for $S_{k_p}(i_p, j_p)$, in addition to the ones already given by the first base eigenvector coefficients as described in Section 4.2. Furthermore, a smoothing Gaussian kernel on the segment $S_{k_p}$ is defined, which we denote with $G(i_p, j_p)$. By doing this, the AVT heatmap over the pixel positions $(i_p, j_p)$ is finally obtained; this has to be overlaid on the image according to the AVT estimation. That done, this map is treated as if it resulted from an extra first base eigenvector coefficient, adding its contribution to the CCA result at each pixel $(i, j)$ according to a sum decision rule (see Algorithm 4 and Fig. 1d).
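The fusion at confidence map level can be sketched as follows. This is a simplified illustration under stated assumptions: the Gaussian kernel is applied as a weighting centred on the projected speaker position, `sigma` is a hypothetical value, and the fused map is rescaled to a unit peak for display; none of these choices is taken from the paper.

```python
import numpy as np

def avt_heatmap(labels, speaker_px, sigma=10.0):
    """Confidence map of the segment that contains the projected AVT
    position, weighted by a Gaussian centred on that position.

    labels: (H, W) integer segment index per pixel.
    speaker_px: (i_p, j_p) pixel coordinates of the projected estimate.
    ASSUMPTION: multiplicative Gaussian weighting with a free sigma.
    """
    i_p, j_p = speaker_px
    k_p = labels[i_p, j_p]
    q = (labels == k_p).astype(float)      # second confidence map
    ii, jj = np.indices(labels.shape)
    g = np.exp(-((ii - i_p) ** 2 + (jj - j_p) ** 2) / (2.0 * sigma ** 2))
    return q * g

def fuse_sum_rule(cca_map, avt_map):
    """Sum decision rule: add both maps, then rescale to a unit peak."""
    fused = cca_map + avt_map
    peak = fused.max()
    return fused / peak if peak > 0 else fused
```

With this rule, a false CCA detection far from the tracked position keeps its own confidence, while the segment supported by both cues receives the highest score.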

Experiments and results
Results of dominant speaker detection on real data are now presented and evaluated against audio-only and video-only methods, as well as against the baseline method presented by Izadinia et al. [17]. Our experimental region is an indoor room where people can freely move and talk together.
Experiment 'Crossing' (Fig. 11) is used again (see Section 3.1), in this case to demonstrate that correlation techniques can be extended to scenarios where distracting motion and occlusion exist if more cues, such as speaker positions, are used.

Fig. 13. Precision-recall and hit ratio curves for the testing videos, averaged over the total number of frames. Results of the proposed method (PM) are compared first against audio-only and video-only results, then against the baseline method (BM) of [17]. In addition, the proposed method aided by the information about the speaker ID, i.e., PM+SR, is given, showing that the method is actually suitable for diarisation purposes in cross-talking scenarios whenever a record of the speaker ID over time is kept.
Experiment 'Surveillance' (Fig. 12) is a recording of several people having a conversation in groups, plus some passers-by. Speakers are at least 0.5 m away from the microphones. They either stand still or move around. The ground truth consists of the left and right people in the foreground, who are having a conversation. Meanwhile a third person, facing the camera in the foreground, is just listening to the conversation and producing some distracting fine motion by slightly moving his body to one side. Note that another group of speaking people is in the background. In total, at every moment, 4-5 people are speaking simultaneously, resulting in challenging speech interference. This experiment is designed to demonstrate the ability of the method to detect the loudest (dominant) source among a group of speaking people in a cluttered scenario.
First, a qualitative evaluation of the results against the baseline [17] is given. Fig. 11a shows the results of the baseline method applied to the first dataset at the moment of occlusion. The segments corresponding to the AVT tracked position of the actual speaker are given in Fig. 11b, whereas Fig. 11c shows the results of the proposed method. In Fig. 11d, knowing the information about the speaker ID, i.e., using the AVT+SR+CM output, the results are ascribed to the current speaker. Fig. 12a shows one frame of 'Surveillance' for the baseline method results. The actual speaker is about to raise his hand while the listener has been moving his body, resulting in false positive detections. This can only be mitigated by the segment corresponding to the AVT speaker position $x_{av}$ (see Section 2.2), shown in Fig. 12b, so that the fusion results, despite pointing out the correct speaker, still present false detection trails corresponding to the other people's movements (Fig. 12c). When it is possible to recognise the speaker ID from the AVT+SR+CM, these trails can be further filtered out, as shown in Fig. 12d.
To quantitatively measure the performance of the presented method against [17], the precision-recall measure given in their paper is calculated. Specifically, the moving pixel ground truth is manually defined by selecting those regions of the video which correlate with the dominant speaker's voice. In practice, as this method is ultimately meant to be used for recognition and tracking purposes, this is always represented by a bounding box including the pixels of the speaker's body. This region is denoted as $R_c$, whereas $R_d$ is the pixel region detected by the method. Hence, the two curves are defined as $Pr = |R_d \cap R_c| / |R_d|$ and $Re = |R_d \cap R_c| / |R_c|$. Note that, for detection and tracking purposes, the size of the ground truth regions cannot be restricted to just the physical (anatomical) joints of a person. Hence, by defining $R_d$ as the detected pixels which actually belong to the current speaker, the goodness of the method in recognising the dominant speaker among other potential speakers can now be evaluated using this metric rather than the DER, as the latter is more specific to diarisation systems. The precision-recall curve is obtained by varying a threshold between zero and one for every frame; we then present the average curve over all the video frames. Finally, to capture the temporal aspect of the methods' performance we calculate their hit-ratio curves; we assume that a hit occurs in a frame if $Pr > 0.5$. The precision-recall and hit-ratio curves are shown in Fig. 13. The precision of the proposed method with speaker recognition (PM+SR), i.e., CCA+AVT+SR+CM, is higher than that of both the proposed method (PM), i.e., CCA+AVT, and the baseline method (BM) over the entire range of recall, although all curves drop dramatically as the recall increases. However, this is largely expected, as the ground truth region is larger than the recovered segments, which decreases the accuracy of the methods by definition. On the other hand, the size of the segments cannot be increased, as clutter would take over the segmentation phase and the foreground region would be blended into the background. Nevertheless, the PM+SR solution improves on average on speaker ID recognition through occlusions and interference by 23% and 59% over audio-only and video-only systems respectively, and by 36% over the baseline method [17].
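The evaluation described above can be sketched with region masks. The snippet assumes the usual pixel-overlap definitions of precision and recall between the detected region $R_d$ and the ground truth region $R_c$; the function names and the convention of returning zero for an empty region are assumptions made for the example.

```python
import numpy as np

def precision_recall(detected, ground_truth):
    """Pixel-overlap precision and recall of two boolean masks.

    detected: R_d, pixels returned by the method.
    ground_truth: R_c, bounding-box pixels of the dominant speaker.
    ASSUMPTION: an empty region yields 0.0 instead of a division error.
    """
    inter = np.logical_and(detected, ground_truth).sum()
    n_det = detected.sum()
    n_gt = ground_truth.sum()
    precision = inter / n_det if n_det else 0.0
    recall = inter / n_gt if n_gt else 0.0
    return float(precision), float(recall)

def pr_curve(confidence, ground_truth, thresholds):
    """(precision, recall) pairs of one frame for a sweep of thresholds
    applied to a confidence map with values in [0, 1]."""
    return [precision_recall(confidence > th, ground_truth)
            for th in thresholds]

def hit_ratio(confidences, ground_truths, threshold, min_precision=0.5):
    """Fraction of frames counted as hits, i.e., with precision above
    min_precision at the given threshold."""
    hits = [precision_recall(c > threshold, g)[0] > min_precision
            for c, g in zip(confidences, ground_truths)]
    return sum(hits) / len(hits)
```

Averaging the per-frame output of `pr_curve` over all frames would give one curve per method, and evaluating `hit_ratio` over the same threshold sweep would give the corresponding hit-ratio curve.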

Conclusion
In this paper a hierarchical AV tracking and recognition system based on novel audio and video feature integration and fusion is introduced. Specifically, the system carries out a finer independence-based AV localisation and a coarser AV correlation-based scene analysis to robustly track the dominant speaker through general (babble) noise in an open room scenario using a small sensor network. This can be useful in a number of general contexts, which range from surveillance applications to the prototypical "cocktail party". Results show that we can rely on low complexity techniques even in unconstrained scenarios, without resorting to more cumbersome audio-only or video-only methods.

Future work
We highlight that the problem of detecting a speaker in a non-obtrusive fashion and in a natural environment is extremely challenging, and so a number of assumptions had to be made in order to make the problem tractable. We therefore suggest that the following future work could be undertaken: (a) defining a sounding calibration procedure independent of sensor movement, (b) learning gestures associated with specific person roles in a conversation, (c) speeding up the optical flow computation using variational methods or sparse techniques, (d) developing an improved speech overlap recognition system to further decrease the diarisation error rate and, consequently, the dominant speaker detection error, (e) developing a fully probabilistic scheme, i.e., a dynamic Bayes network (DBN) where Fig. 1a would represent one time slice of the system. The hidden variables would be the speaker position and identity, whilst the observables would be the audio and video detected speaker locations, spatio-temporal features and optical flow. With such a fully probabilistic model, any further features may be integrated.

Algorithm 2. Audio video tracking algorithm.
Input: audio $z_a$ and video $z_v$ measures
Output: position $x_{av}$ and identity $S_{AV}$ of the actual speaker
1: for every time step t do
2:

13: return $x_{av}$ and $S$

2.7. Fusion of audio-video tracking and speaker recognition scores

Fig. 3. The speaker verification confusion matrix for close and far microphone setups. This figure shows the ability of the implemented SR unit to verify the ID of people in the recorded pool of voices for (a) 2 close microphones (1 close field microphone pair); (b) 6 far microphones (3 far field microphone pairs); and (c) the total 8 channels (4 microphone pairs: 1 in the close field plus 3 in the far field). Results show that the best performance is obtained for close distance recording, i.e., (a), whereas in (b) and (c) speaker recognition is severely compromised due to the 3 pairs of far distance microphones.

Fig. 4. A high level schematic diagram of the overall system presented in this paper. The diagram depicts the combination of the detections of three AV "weak" classifiers into one robust AV speaker recognition process.

Fig. 5. (a) A picture of the room used for our set of experiments. (b) Its layout and sensor setups.

Fig. 6. Isolines, i.e., hyperbolic loci. An isoline indicates all points with a constant difference in distance from two points. 2D contour maps in the xy plane at z = 1.7 m for (a) a linear array of microphones, (b) 2 linear arrays of 4 microphones on two sides of the room, (c) 4 pairs of microphones on the 4 sides of the room. The last has the densest set of intersecting isolines, i.e., a higher number of possible solutions.

Fig. 7. 'Formal Conversation' and 'Informal Conversation' localisation results. In (a) a 'Formal Conversation' is shown: the video tracker as well as the multimodal AVT can detect and recognise that there are two targets speaking alternately, and their output is the same. (b) shows an 'Informal Conversation', where the targets are so close that the video tracker cannot distinguish them. In (c) the AVT instead correctly localises the actual speaker despite the occlusion. Note that (c), showing two speakers talking simultaneously, is for illustration purposes only, to highlight that the AVT can discriminate identities. In reality, as said, speakers talk in turns.

Fig. 8. 'Single Speaker' tracking results. In (a) the video-only tracker loses the speaker track when a long occlusion occurs. In turn, (b) shows the AVT correctly locating the speaker through the occlusion at the same time instant as (a). Finally, (c) shows the speaker track being recovered (the video tracker alone is not capable of achieving this result).

Fig. 9. 'Abandoning' tracking results. (a) shows the AVT locked onto Speaker1. In (b) the other person appears while Speaker1 has left the scene. The ID assessed by the AVT is still Speaker1, meaning that the AVT cannot make a distinction between IDs. In fact, the video tracker features for the two people, i.e., the histograms of colours at the bottom of (a, b), are very similar. In (c), instead, the AVT+SR+CM solution correctly determines that the person ID is Speaker2. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

covariance matrix and $w_v$ and $w_a$ are the canonical bases of v and a respectively. The largest CCA eigenvectors $w_{v1}$ and $w_{a1}$, which correspond to the largest eigenvalue $\lambda_1^2$, are the ones which give the largest contribution to the maximum audio and video correlation; hence they maximise the correlation between the canonical variates $v' = w_v^T v$ and $a' = w_a^T a$. If we assume that only a single dominant audio source exists, the first of these eigenvectors $w_{v1}$ is chosen and the corresponding frame segments

Fig. 11. 'Crossing' results. In (a) the results of the baseline method [17] are presented; (b) shows the tracking results projected onto the image plane. In (c) the output of the proposed method is displayed, whereas (d) presents the results when the information about the speaker ID is given. Ground truth is shown in red. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Fig. 12. 'Surveillance' results. This figure shows the Second Speaker talking while the other two people are listening without moving. In (a) the results of the baseline method [17] are given, whereas (b) shows the result of the tracking algorithm. (c) presents the output of the proposed method. Finally, (d) shows the results when the information about the speaker ID is given. Ground truth is shown in red. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this paper.)

Table 1
Test bed comparison to closest works.Abbreviation mic denotes microphones.Abbreviation cam denotes cameras.

Table 3
A tabular summary of the following experiments and their rationale.

Table 4
Simplified synoptic table of the symbols used to define the MOTP and MOTA indexes, first defined for the CHIL meetings [2] as a benchmark for the CLEAR 2007 datasets and others.

Table 5
Performance comparison for 'Formal Conversation' and 'Informal Conversation' experiments.In (a) we enumerate SRC vs SRC-HE raw speaker position detections.Results are shown in terms of sound source localisation (SSL) accuracy and number of functional evaluations (FEs) calculations.In (b) we present MOTP, MOTA and DER of the joint AVT against single modality trackers.