Comparison of Diarization Tools for Building Speaker Database

This paper compares open-source diarization toolkits (LIUM, DiarTK, ALIZE-Lia_Ral), which were designed to extract speaker identity from audio recordings without any prior information about the analysed data. The comparative study of these diarization tools was performed on three different types of data: short broadcast news (BN), speech of the main anchormen of daily BN, and a TV show. The corresponding values of the achieved DER measure are presented here. The automatic speaker diarization system developed by LIUM was able to identify speech segments belonging to individual speakers at a very good level. Its segmentation outputs can be used to build a speaker database.


Introduction
The current state of science and knowledge has a strong influence on our society. One highly attractive research topic is speech-to-text transcription, where the fusion of different theoretical knowledge, experiments and finally prototype realizations leads to a complex system.
Over recent years, speaker diarization has become an important key technology for many tasks such as navigation, retrieval or segmentation of audio data into homogeneous regions [1].
Audio diarization is the process of annotating an input audio stream with information that attributes segments of signal energy to their specific sources, such as speakers, music, background noise sources, and other signal source characteristics. Diarization can also be used in speech recognition, facilitating the searching and indexing of audio archives, increasing the richness of automatic transcriptions and enhancing their readability [2]. An effective diarization tool can also be used for automatic database creation from large amounts of acoustic data.
Speaker diarization, the "who spoke when" task, consists in annotating recordings with labels that represent speakers. This task is performed without any prior information: neither the number of speakers, nor their identities, nor samples of their voices are available [3].
There are two main kinds of clustering strategies that can be used in a diarization system. The first is called bottom-up, also known as agglomerative hierarchical clustering (AHC). The algorithm starts by splitting the full audio content into a succession of clusters and progressively tries to merge the redundant clusters in order to reach a situation where each cluster corresponds to a real speaker. The Bayesian Information Criterion (BIC), Kullback-Leibler (KL) divergence or a T_s-based metric can be applied as a stopping criterion [1].
The second clustering strategy is called top-down; it starts with one single cluster for all the audio content and tries to split it iteratively until reaching a number of clusters equal to the number of speakers. The previously mentioned stopping criteria can be applied to terminate the process, or it can continue until no unlabelled data remain. The bottom-up approach is the more popular of the two.
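The bottom-up strategy described above can be sketched as follows. This is an illustrative toy implementation, not the algorithm of any of the compared toolkits: the function name, the single-linkage matrix update and the threshold-based stopping criterion are simplifying assumptions standing in for a BIC/KL/T_s-style criterion.

```python
import numpy as np

def bottom_up_clustering(distance, stop_threshold=0.0):
    """Agglomerative (bottom-up) clustering sketch: start with one
    cluster per segment and repeatedly merge the closest pair until
    the smallest pairwise distance exceeds the stopping threshold."""
    clusters = [[i] for i in range(len(distance))]
    D = distance.astype(float).copy()
    np.fill_diagonal(D, np.inf)
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)
        if D[i, j] > stop_threshold:
            break  # stopping criterion reached: remaining clusters ~ speakers
        if i > j:
            i, j = j, i
        clusters[i] = clusters[i] + clusters[j]
        # single-linkage update of the distance matrix
        D[i, :] = np.minimum(D[i, :], D[j, :])
        D[:, i] = D[i, :]
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)
        del clusters[j]
    return clusters
```

A top-down system would instead start from this loop's end state (one cluster) and split, which is why the two strategies can share the same stopping criteria.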
This paper compares the open-source diarization toolkits LIUM_SpkDiarization, DiarTk and ALIZE-Lia_Ral. The speaker segmentation and clustering are based only on the audio information (i.e. without any additional information such as the number of speakers, etc.). The obtained speaker tags do not represent identities but abstract labels. Three different data types for which the mentioned toolkits have been used are reported in this paper: short broadcast news (BN), speech of the main anchormen of daily BN, and the SUS TV show.

Diarization System
Figure 1 shows the main modules of an overall diarization system. It is composed of a parametrization module, an optional VAD/SAD detector, a diarization module and an evaluation module. The audio processing starts with the parametrization module, which is responsible for feature extraction. The next module performs VAD (Voice Activity Detection) or, in the better case, SAD (Speech Activity Detection). Ideally, only speech segments are processed by the diarization module. These speech data are divided into clusters and finally, after the whole diarization process (depending on the diarization tool), only clusters corresponding to speakers are provided. The diarization output is then evaluated in the last module.

Parametrization
Popular features used in such systems are Mel-Frequency Cepstral Coefficients (MFCC), Perceptual Linear Prediction coefficients (PLP), Linear Frequency Cepstral Coefficients (LFCC), fundamental frequency, energy, etc. [15], with their temporal extensions such as delta and acceleration coefficients. Depending on the diarization system it is possible to omit some features (c_0 in LIUM) or to use a combination of different features (e.g. DiarTK [7]).
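The temporal extensions mentioned above can be illustrated with a short sketch. Assuming a (frames x coefficients) feature matrix has already been extracted, the standard regression formula over a few neighbouring frames yields the delta coefficients, and applying it again yields the acceleration coefficients; the function name and the window width are our own choices, not any toolkit's API.

```python
import numpy as np

def add_deltas(features, width=2):
    """Append delta and acceleration (delta-delta) coefficients to a
    (frames x coeffs) feature matrix using the regression formula
    over +/- `width` neighbouring frames (edges are clamped)."""
    def delta(x):
        padded = np.pad(x, ((width, width), (0, 0)), mode="edge")
        num = sum(n * (padded[width + n:len(x) + width + n]
                       - padded[width - n:len(x) + width - n])
                  for n in range(1, width + 1))
        return num / (2 * sum(n * n for n in range(1, width + 1)))
    d = delta(features)       # first-order temporal derivative
    dd = delta(d)             # second-order (acceleration)
    return np.hstack([features, d, dd])
```

For a linearly increasing coefficient the delta is simply the slope, which is a quick sanity check for such an implementation.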

Speech Activity Detection
Speech Activity Detection (SAD) plays a very important role in the whole diarization process for two reasons [1], [2]. The first is its impact on the speaker diarization performance metric, namely the Diarization Error Rate (DER): the presence of non-speech sounds increases the diarization error. The second follows from the fact that non-speech segments can disturb the acoustic modelling of speaker-dependent models and make them less discriminant. Initial approaches to diarization tried to solve SAD on the fly, i.e. non-speech clusters were a by-product of the diarization.
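A minimal sketch of the SAD idea, assuming a plain frame-energy threshold rather than the GMM-based detectors used by the compared toolkits (which are far more robust), could look like this; the function name and parameters are hypothetical.

```python
import numpy as np

def energy_sad(signal, frame_len=400, threshold_db=-30.0):
    """Toy energy-based speech activity detector: frames whose log
    energy lies within `threshold_db` dB of the loudest frame are
    marked as speech; quieter frames are treated as non-speech."""
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10 * np.log10(np.sum(frames ** 2, axis=1) + 1e-12)
    return energy_db > (energy_db.max() + threshold_db)
```

Frames flagged False would be excluded before clustering, which addresses both problems above: they neither count toward DER-relevant confusions nor contaminate the speaker models.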
The audio stream is segmented into speech and non-speech frames [7]. Speech segments are then processed by Hierarchical Agglomerative Clustering (HAC), in which segments belonging to the same speaker are clustered together. Viterbi decoding (re-segmentation) is performed to generate a new segmentation realigned on the speaker boundaries. The LIUM_SpkDiarization system finally performs another HAC using the Cross-Likelihood Ratio (CLR) (classical or normalized) or Integer Linear Programming (ILP) proposed in [12], where i-vectors are used to model and measure the similarity between clusters. The diarization output contains the time stamps of the segments that belong to each recognized speaker.

Evaluation
For computing the Diarization Error Rate (DER) on the speech segments [6], three error types have to be defined:
• The confusion error - the system-provided speaker tag and the reference do not match through the mapping.
• The miss error - speech is present in the reference but no speaker is present in the system hypothesis.
• The false alarm error - speech is incorrectly detected by the system.
The used speaker diarization systems were evaluated by the NIST evaluation procedure for computing DER using rttm files (perl script: md-eval-v21.pl):

DER = (confusion + miss + false alarm) / (total reference speech time). (1)
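Eq. (1) amounts to a single division over accumulated error times. As a sketch (the actual scoring is done by md-eval-v21.pl; this merely restates the formula, with DER reported as a percentage as in the tools' output):

```python
def diarization_error_rate(confusion, miss, false_alarm, total_speech):
    """DER per Eq. (1): the sum of confusion, missed-speech and
    false-alarm time divided by the total reference speech time,
    returned as a percentage."""
    return 100.0 * (confusion + miss + false_alarm) / total_speech
```

For example, 2 s of confusion, 1 s of misses and 1 s of false alarms over 100 s of reference speech gives a DER of 4 %.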

Used Diarization Tools
A brief theoretical description of the analysed diarization tools, namely LIUM_SpkDiarization, DiarTk and ALIZE-Lia_Ral, can be found below.

LIUM_SpkDiarization
The open-source toolkit LIUM_SpkDiarization [3], [5] was developed in C++ by LIUM from a previous speaker segmentation tool, mClust [16], for the French ESTER evaluation campaigns in 2005 and 2008. This toolkit was designed to process TV shows and radio broadcasts. It analyses the input audio stream, performs diarization and identifies homogeneous segments belonging to the same speaker without any prior information about the audio content, e.g. the number of speakers.
The diarization system provided by LIUM starts its processing with the computation of 13 MFCCs with c_0 using the Sphinx tools (http://cmusphinx.sourceforge.net/). This feature configuration is not used throughout the whole diarization: for example, in the Viterbi decoding phase, c_0 is removed and first-order derivatives are added to the feature vectors.
The two-phase speaker segmentation is based on the GLR (Generalized Likelihood Ratio) for the identification of instantaneous change points and on the BIC (Bayesian Information Criterion) distance metric for the fusion of consecutive segments belonging to the same speaker. BIC hierarchical clustering merges the two closest clusters until the BIC distance becomes positive. In the segmentation and clustering phase, speakers are modelled by a Gaussian distribution with a full covariance matrix.
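The BIC distance used for this merging decision can be sketched as follows for the full-covariance single-Gaussian case. This is a generic textbook formulation, not LIUM's implementation; the function name and the penalty weight `lam` are assumptions.

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """BIC distance between two clusters of feature vectors, each
    modelled by one full-covariance Gaussian. Negative values favour
    merging; clustering stops once all distances are positive."""
    n1, n2 = len(x1), len(x2)
    n, d = n1 + n2, x1.shape[1]
    x = np.vstack([x1, x2])
    logdet = lambda a: np.linalg.slogdet(np.cov(a, rowvar=False, bias=True))[1]
    # model-complexity penalty: mean (d) plus full covariance (d(d+1)/2)
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(x) - n1 * logdet(x1) - n2 * logdet(x2)) - penalty
```

Two clusters drawn from the same speaker model yield a negative distance (merge), while well-separated clusters yield a large positive one (keep apart), which is exactly the stopping behaviour described above.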
Viterbi decoding is performed to adjust the segment boundaries using GMMs as speaker models. The feature vectors are modified as described above. The speech/non-speech segmentation and the removal of music and jingle regions are done in this phase [5]. The decoding uses 8 GMMs corresponding to 2 silences (wide and narrow band), 3 types of wide-band speech (clean, over noise or over music), 1 narrow-band speech, 1 music and 1 jingle. The GMMs contain 64 diagonal Gaussians trained by EM-ML on ESTER data [4].
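The Viterbi re-segmentation step can be sketched generically: given per-frame log-likelihoods of each model (here, columns standing in for the 8 GMM classes), the decoder finds the best label sequence while penalizing model changes, which keeps the boundaries stable. The fixed `switch_penalty` is a simplifying assumption in place of proper transition probabilities.

```python
import numpy as np

def viterbi_segment(log_likes, switch_penalty=10.0):
    """Viterbi decoding sketch: `log_likes` is a (frames x models)
    matrix of per-frame log-likelihoods; returns the best-scoring
    label sequence with a fixed penalty on every model change."""
    T, M = log_likes.shape
    score = log_likes[0].copy()
    back = np.zeros((T, M), dtype=int)
    for t in range(1, T):
        # trans[i, j]: score of being in i at t-1 and moving to j
        trans = score[:, None] - switch_penalty * (1 - np.eye(M))
        back[t] = np.argmax(trans, axis=0)
        score = trans[back[t], np.arange(M)] + log_likes[t]
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

A single-frame likelihood "blip" toward another model is absorbed rather than producing a spurious boundary, which is the point of decoding instead of taking the per-frame argmax.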
The LIUM_SpkDiarization system finally performs another HAC using the normalized Cross-Likelihood Ratio (CLR) or the Integer Linear Programming (ILP) approach proposed in [12], where i-vectors are used to model and measure the similarity between clusters [5].
The diarization outputs were converted to the rttm file format to obtain the Diarization Error Rate DER (%) according to the NIST evaluation procedure.

DiarTk
The open-source toolkit DiarTk [7] was developed in C++ for multi-stream speaker diarization tasks and is distributed under the GPL licence.
The diarization process consists of three main operations. The first is a segmentation into homogeneous regions; then an agglomerative clustering is performed, where segments are grouped according to the speaker. Finally, a Viterbi realignment, in which the speaker segment boundaries are refined, produces the diarization output. DiarTk is able to handle multiple feature streams (MFCC, Time Delay of Arrival - TDOA, Modulation Spectrum - MS, Frequency Domain Linear Prediction features - FDLP) [13], [14]. The main difference from a conventional diarization system lies in the speaker modelling technique: DiarTk employs non-parametric clustering and realignment based on the agglomerative Information Bottleneck principle (it does not use GMM speaker modelling) [7].
The diarization algorithm can be briefly described for one feature stream as follows [7]: it starts with feature extraction, then the audio segmentation tool (IBfeat) performs speech/non-speech segmentation, elimination of non-speech frames, background GMM estimation and computation of the relevance variable distributions p(Y|X) as a weighted sum of individual distributions W_i p(Y|X_i), where X represents the speech segments and Y the relevance variable information.
The second tool, aIBclust, performs hierarchical agglomerative clustering, where the speech segments X associated with the relevance variable distributions p(Y|X) are grouped into clusters C.

The last module, IBrealign, is responsible for the Viterbi realignment of the speaker boundaries. This process likewise depends only on the relevance variable distributions p(Y|X).
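Since the whole IB pipeline operates on relevance distributions p(Y|X) rather than GMMs, the merge decision reduces to comparing distributions. A common choice in agglomerative IB clustering is a Jensen-Shannon-type divergence; the sketch below illustrates that idea only (the function names and equal cluster weights are our simplifying assumptions, not DiarTk's internals).

```python
import numpy as np

def js_divergence(p, q, w_p=0.5, w_q=0.5):
    """Weighted Jensen-Shannon divergence between two relevance
    distributions p(Y|c_i) and p(Y|c_j) over the same support."""
    m = w_p * p + w_q * q
    # KL(a || b) with the convention 0*log(0) = 0
    kl = lambda a, b: np.sum(a * np.log(np.where(a > 0, a / b, 1.0)))
    return w_p * kl(p, m) + w_q * kl(q, m)

def closest_pair(distributions):
    """Return the pair of cluster indices with minimal divergence;
    an aIB-style loop would merge this pair and repeat."""
    n = len(distributions)
    pairs = [(js_divergence(distributions[i], distributions[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    _, i, j = min(pairs)
    return i, j
```

Clusters whose relevance distributions nearly coincide (likely the same speaker) have divergence close to zero and are merged first.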
The diarization output is in the rttm format compatible to be scored by the NIST evaluation module.

ALIZE-LIA_RAL
This complete diarization toolkit, written in C++, is open-source and distributed under the GPL license. The ALIZE package [5], [10] provides the basic operations required for handling configuration files and features, matrix operations, error handling, etc., while the LIA_RAL package performs several tasks including language recognition, speaker recognition, diarization/segmentation, VAD, etc.
In the acoustic segmentation step, the VAD classifies the audio content into the following predefined classes: speech, music, music+speech, or telephone. It is realized by built-in GMMs, or a speech/non-speech GMM detection based only on the energy can be used. VAD models can be created or downloaded from the official webpage. The segmentation and speaker clustering are then performed using evolutive HMMs (e-HMM).
An additional segmentation is done in order to refine the previous speaker segmentation output and to remove irrelevant speakers (i.e. speakers with a low number of assigned frames). In this case the algorithm creates new HMM models generated from the previous segmentation output and then applies an iterative speaker model training/Viterbi decoding loop [10].
An optional purification step can be integrated within the LIA_RAL toolkit. The purification step is then performed between the segmentation and re-segmentation phases and eliminates spurious segments. The diarization output is in the rttm file format.

Database
In our experimental work all used data were in wav format, mono, with a 16 kHz sampling frequency. Three different types of acoustic data were diarized (see Fig. 2):
• Short broadcast news included jingles, speech (anchorman, redactors and respondents), phone speech and background music. The average duration of the short news was about 5 minutes.
• The second type of diarized data included the speech of two anchormen during the main daily broadcast news. It contained mainly the main anchorman (primary sound) and cross talk of the co-anchorman. The average news duration was about 1 hour.
• The last type of diarized data were the SUS TV shows ("Court hall"). They contained advertisements, jingles, several speakers, overlapped speech and background music. The average show duration was about 1 hour. This type of sound data is characterized by dynamic dialogues between the participants of the court hearing.

Results
The achieved results for the three different data types can be found in Tab. 1. For all analysed tools, the majority of errors in the short_BN diarization occurred during external contributions and during telephone or degraded speech. Generally, this data type yielded high DER values.
The daily_BNs were diarized very well by LIUM, but in the case of Lia_Ral high DER values were achieved (64.81 % on average). This was caused by the applied VAD algorithm, which classified the low-level speech of the co-anchorman as non-speech.
As mentioned before, the SUS data were very challenging, but LIUM_SpkDiarization was able to identify the primary speakers (judge, advocates and other main participants). The other tools (Lia_Ral and DiarTk) achieved worse, but comparable results.
Promising results were achieved mainly for the daily_BNs (3.19 % DER on average) with LIUM, where the correct identification of speakers was at a very high level. Some problems were identified in the case of simultaneous speech and during laughter. This kind of data can be processed very effectively by LIUM.

Conclusion
In this paper the diarization of three different types of acoustic data was performed and a comparison of the obtained DER values was presented. LIUM_SpkDiarization seems to be an effective tool for speaker segmentation and identification for all tested data types.
A speaker database can be created from the LIUM output by using only speakers that have a high occurrence in the diarization output. In this way the elimination of unreliable segments is achieved. An automatic transcription system (ATS) can be built from an effective diarization system and an automatic speech recognition system. Such an ATS should allow the identification of the audio document structure and automatic speech recognition with extraction of the speaker identity.
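The selection of high-occurrence speakers can be sketched directly on the rttm output. This is a hypothetical helper (function name and threshold are our own); it only assumes the standard rttm SPEAKER line layout, in which the duration is the fifth field and the speaker label the eighth.

```python
import collections

def frequent_speakers(rttm_lines, min_total_sec=60.0):
    """Select speakers for the database from diarization rttm output:
    keep only labels whose accumulated speech time reaches a
    threshold, discarding rare (likely spurious) clusters.
    rttm SPEAKER lines: type file chan onset duration ... speaker ..."""
    totals = collections.defaultdict(float)
    for line in rttm_lines:
        fields = line.split()
        if fields and fields[0] == "SPEAKER":
            totals[fields[7]] += float(fields[4])
    return {spk: sec for spk, sec in totals.items() if sec >= min_total_sec}
```

Speakers below the threshold are dropped, which realizes the elimination of unreliable segments described above.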
In the future we would like to perform diarization also with other types of data (for example meeting speech and lecture recordings).
Tab. 1: As mentioned, the rttm file format is required for the DER (%) computation. The obtained results are grouped according to the analysed data, and the average DER (%) was computed for each data type separately. In comparison with both BN data types, the SUS data were much more demanding.