Characterization of Temporal and Acoustic Parameters for Speaker Identification in Disguised Speech

Identification of a speaker from a disguised voice sample is not uncommon, but it poses an intricate problem in criminal case examination. Most of the time, the perpetrator is uncooperative for fear of detection and tries to hide his/her identity. In the present study, we have chosen two modes of disguised speech: freestyle and speaking with a handkerchief held over the mouth. Twenty speakers (10 males and 10 females, aged 25-45 years) were selected for recording of speech samples. A total of 3780 spoken words were subjected to spectrographic analysis and temporal measurement (speech rate, articulatory rate, phonation-time ratio). Acoustic parameters, both supralaryngeal [formant frequencies (F1, F2, F3)] and laryngeal [Open Quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1)], were studied for identification of the speaker from disguised voice samples. Temporal measures are very effective parameters when a speaker is trying to disguise his/her voice. It was found that the phonation-time ratio (P/T) and Open Quotient (OQ) are least affected by disguise as compared to formant frequencies (F1, F2, F3), degree of glottal opening (H1*-A1) and glottal leakage (B1).


Introduction
With the fast growth of technology and communication, criminal activities are also increasing rapidly. The latest technology is in use everywhere, for good or evil purposes, and the limitations of the step-by-step system are increasingly felt. Criminals are well aware of the latest technologies and are always ready to beat the surveillance system. They use telephones, mobiles and satellite phones to communicate or to make threatening, hoax and ransom calls, and they disguise their voices for fear of identification. In the forensic scenario, speaker identification has reached milestones, with remarkable opinions given in many cases of national and political interest. However, speaker identification in the case of disguised voice is still one of the grey areas in this progressive field. The research work carried out here is therefore concerned with the relatively immature area of speaker identification with disguised voice.
Speaker identification in the case of disguised voice is highly sought after in the forensic world, as vocal disguise can modify the acoustical characteristics of an individual voice. The nature of a speech sound depends mainly on three factors: dimensions of the vocal tract, mode of phonation and manner of articulation. The vocal tract organs vary with time, health factors and increasing age, which affects the formant values [1]. Although every individual has habitual and learned patterns of phonation and articulation, there is small variation in each utterance of the same word or text during normal speech, known as intra-speaker variation. This amount of variability in fundamental voice frequency F0 has little effect on the linguistic interpretation of an utterance and a greater effect on the prosodic features of each utterance [2]. However, a person can intentionally change the manner of articulation and mode of phonation during disguise. Reich et al. [3] investigated the effects of a few selected vocal disguises upon spectrographic speaker identification. The ability of naïve and sophisticated listeners to detect the presence of particular disguises with a high degree of accuracy and reliability was also examined [4].
The LTAS (Long-Term Average Spectrum) was used by Lindh to identify speakers from a closed set of disguised voices [5]. Hollien and Majewski carried out experiments on normal controlled speech, speech under stress and disguised speech using Long Term Speech Spectra (LTS). The results demonstrated a high level of correct speaker identification for normal speech samples, slightly reduced scores for speech during stress and markedly reduced correct identification for disguised speech samples [6]. Spectrograms are considered a reliable method for speaker identification in normal speech samples [7][8][9]. Previous studies [1,6,10] indicate that disguise indeed increases the intra-speaker variation seen in voiceprints. Many valuable studies have been carried out in this area, but the problem remains unresolved. Therefore, rather than relying on voiceprints alone, we investigated further parameters which are highly characteristic of an individual: temporal measurements (speech rate, articulatory rate, phonation-time (P/T) ratio), supralaryngeal measurements, i.e. formant frequencies (F1, F2, F3), and laryngeal parameters (Open Quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1)).
In the present study, we have taken voice samples in three different modes: first in normal speech, and then with speakers allowed to disguise their voices in freestyle mode and in handkerchief in front of mouth mode. For freestyle disguise, each speaker disguised the voice in whatever manner he/she felt would conceal his/her identity most effectively, for example by tapering the voice off or by using a creaky voice, a hoarse voice or falsetto. The third mode of recording was made with a handkerchief placed in front of the mouth (covering nose and mouth) while the speaker was speaking. It is hoped that the temporal and acoustic parameters characterized in the present study could be used by forensic experts as and when such a situation arises.

Experimental Procedure
The experimental procedure is divided into three parts: selection of speech material, selection of speakers, and recording procedure and environment.

Selection of Speech Material
Fifteen isolated words, which are frequently used in normal conversation and are important in forensic speaker identification, were selected. Using these isolated words, five contextual sentences were also framed (Table 1). Each speaker was asked to utter the isolated words and contextual sentences three times. These utterances were recorded in normal mode as well as in two modes of disguised speech (freestyle and handkerchief in front of mouth).

Selection of Speakers
Twenty speakers (10 males and 10 females, aged 25-45 years) having normal speaking habits were asked to read the selected speech material normally and, as far as possible, in the disguised modes (freestyle and handkerchief in front of mouth). It is pertinent to mention here that none of the speakers was a professional imitator.

Recording Procedure and Environment
The recordings were made in a sound-treated room of the Central Forensic Science Laboratory (CFSL), Chandigarh. The speech samples were recorded directly on a computer equipped with CUBASE software through a Sennheiser microphone, model HMD25-1 600Ω, at normal room temperature. The speech samples were recorded at a sampling rate of 44.1 kHz with 16-bit quantization. Hence, a total of 3780 word files in normal and disguised modes have been stored on the computer in the following format.

Acoustic and Temporal Analysis
All acoustic analyses were made using the Computerized Speech Lab (CSL) Model-4300B and Goldwave software. Spectrographic analyses (wideband and narrowband), pitch contours, energy contours, and supralaryngeal and laryngeal acoustic parameters were determined with the Computerized Speech Lab (CSL), and temporal measurements were carried out using Goldwave software. Spectrograms of normal and disguised modes were compared.

Results and Discussion
Spectrograms (SPG) are usually a reliable representation of relative vowel quality, strengthening and weakening of stops, frication and aspiration, but there is a great deal of individuality in the length and type of aspiration and frication. Spectrographic analysis was carried out for all the utterances in all three modes and the values of the first four formants were calculated. The rate of formant transition and the duration of stops vary from one individual to another. Figure 1 shows the spectrograms of the text 'Hello I am Fine' in all three modes, i.e. normal, freestyle and handkerchief in front of mouth, in windows A, B and C respectively. The horizontal intense bands made up of vertical striations represent the formants. The frequency of a particular formant varies from speaker to speaker. The third and fourth formants are not clearly visible in the freestyle and handkerchief in front of mouth modes, whereas they are clearly obtainable in normal speech. It can also be observed that aspiration and frication in the sentence /hello, I am fine/ are reduced in disguised speech. It is pertinent to mention that /f/ is a labiodental fricative. In the word /fine/, frication is followed by a good formant pattern formed by the vowel /i/ and the nasal sound /n/. The formant pattern of /fine/ does not alter in disguise mode in other utterances, but a change in the values of the second and third formant frequencies (F2 and F3) was observed in disguised mode as compared to normal mode utterances.

To study the co-articulation effect of the following vowel, the sound /h/ was compared in the words /HELLO/ and /HAAN/. It was noted that in the word 'Haan', where /h/ is followed by the long nasal vowel /a/, the vocal tract appears to be completely open (lips in completely open position) without any restriction from the articulators, and the vocal folds keep vibrating after the consonant closure. Thus the duration of aspiration of /h/ in 'Haan' is shorter (0.046 sec) as compared to the /h/ in 'hello' (0.064 sec), where /h/ is followed by /e/ (slight spreading of the lips) and a slight constriction of the vocal tract appears to increase the duration of aspiration (Figure 2). In the spectrographic study of the contextual sentence 'Where are you' in all three modes, i.e. normal (window A), freestyle (window B) and handkerchief in front of mouth (window C) (Figure 3), the formant pattern becomes distorted relative to the normal speech sample as the speaker tries to disguise the voice. The positions of the formants shift along both the frequency and time axes. The retroflex sound /r/ of the word 'are' is missing in freestyle mode (window B), but the transition from the word 'are' to 'you' remains unaffected. A similar pattern was observed in other utterances.
Although spectrogram patterns vary because of disguise, many factors still reflect the speaker's characteristics, as it was difficult for the speaker to alter all the parameters simultaneously. Pitch contours were plotted, and it was observed that it was difficult for the speaker to change the pitch throughout the utterances. There is a high probability that the speaker returns to his/her original fundamental frequency while speaking. Energy contours were also plotted, and it was observed that the energy values of the word 'you' were approximately similar in all the utterances.

Temporal Measurements
In the present experiment, temporal parameters such as speech rate, syllables per minute, articulation rate and phonation-time ratio were calculated for all the utterances of contextual speech text.
a) Speech Rate: Total number of syllables produced in a given speech sample divided by the total time required to produce the sample (including pause time), with time expressed in seconds. Speech rate has been calculated for all twenty speakers in normal, freestyle and handkerchief modes. The bar diagrams of averaged speech rate for each speaker in all three modes have been plotted (Figure 4). The speech rate varies from one speaker to another as well as between the modes of speaking within a speaker. Similar trends have been observed for male and female speakers (Table 2).
b) Syllables per Minute: Total number of syllables produced in a given speech sample divided by the total time required to produce the sample (including pause time), with time expressed in minutes. Figure 5 represents the syllable rate per minute averaged over three utterances of each mode for all twenty speakers. Syllable rate varies from speaker to speaker as well as between modes. It was found that a speaker can intentionally modify his/her syllable rate per minute and speech rate, and can also increase the pause time between syllables, which is in agreement with the findings of Künzel [11] (Table 2).
c) Articulation Rate (Per Minute): Total number of syllables produced in a given speech sample divided by the total time required to produce the sample (excluding pause time). Articulation rate was also calculated for all the contextual speech text for each speaker. It has been observed that the variation in articulation rate between utterances is less than that in speech rate and syllables per minute for all the speakers. It varies in the range of approximately 320 to 500 in normal mode and 220 to 550 in disguised mode (Table 2).
It has been noted that articulation rate gave a smaller intra-speaker difference between normal and disguised speech samples than speech rate and syllable rate per minute (Figure 6). It is important to mention that articulation rate can be considered a promising parameter for forensic speaker identification.
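The temporal measures defined above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Goldwave-based procedure used in the study: the `Utterance` class, its field names and the example timings are hypothetical, and the phonation-time ratio is assumed here to be the speaking time (total time minus pause time) divided by the total time, a definition the text above does not spell out.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hand-measured timings for one utterance (hypothetical field names)."""
    syllables: int        # number of syllables produced
    total_time_s: float   # total duration in seconds, pauses included
    pause_time_s: float   # summed duration of pauses in seconds


def speech_rate(u: Utterance) -> float:
    """Syllables per second, pause time included."""
    return u.syllables / u.total_time_s


def syllables_per_minute(u: Utterance) -> float:
    """Syllables per minute, pause time included."""
    return 60.0 * u.syllables / u.total_time_s


def articulation_rate(u: Utterance) -> float:
    """Syllables per minute, pause time excluded."""
    return 60.0 * u.syllables / (u.total_time_s - u.pause_time_s)


def phonation_time_ratio(u: Utterance) -> float:
    """ASSUMED definition: speaking time divided by total time."""
    return (u.total_time_s - u.pause_time_s) / u.total_time_s


# Hypothetical example: 12 syllables in 4.0 s, of which 1.0 s is pause.
u = Utterance(syllables=12, total_time_s=4.0, pause_time_s=1.0)
print(speech_rate(u), syllables_per_minute(u),
      articulation_rate(u), phonation_time_ratio(u))
```

Because articulation rate divides by speaking time only, lengthening the pauses lowers the speech rate and the syllables per minute but leaves the articulation rate unchanged.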

Supralaryngeal Acoustic Parameter
A speech sound created solely by vocal fold vibration as the sound source will have its phonetic content mainly determined by the first three formants (F1, F2, F3) and the relative distances between them. These are determined by the manner and position of articulation, so the structure of the formant frequencies varies with each sound. We have calculated the first four formant frequencies (F1, F2, F3 and F4) for the isolated and contextual text from the FFT and LPC spectra for all male and female speakers. In disguised speech samples, the higher formant frequencies (F3 and F4) show larger variation as compared to the lower formant frequencies (F1 and F2). It has also been observed that the second formant frequency (F2) is affected more than the first formant frequency (F1). Some speakers have a tendency to shift the second formant frequency closer to the third formant frequency (Figure 8). To study the deviation in the four formant frequency values, the values of the first four formants were averaged over all utterances of the selected text for each speaker, and the standard deviation was calculated for all twenty speakers in normal and disguised modes (Table 3). It has been observed that the fourth formant frequency is the most affected parameter in disguised speech when compared with normal speech, whereas the first formant frequency is the least affected, with the second and third formant frequencies lying in between. The order of variation was found to be F1<F2<F3<F4 (Table 3). These findings are similar to the results obtained in the study carried out by Endres et al. [1].
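The ranking of formants by their standard deviation across speaking modes can be illustrated as follows. This is a minimal sketch under stated assumptions: the formant values are made-up numbers for a single hypothetical speaker (they are not data from Table 3), the function names are illustrative, and the sample standard deviation is used because the text does not say which estimator was applied.

```python
from statistics import stdev

# HYPOTHETICAL formant values in Hz for one speaker and one word, one value
# per mode (normal, freestyle, handkerchief). Not data from this study.
formants_hz = {
    "F1": [510.0, 530.0, 495.0],
    "F2": [1480.0, 1560.0, 1420.0],
    "F3": [2500.0, 2650.0, 2380.0],
    "F4": [3500.0, 3720.0, 3310.0],
}


def formant_spread(values_by_formant):
    """Return {formant name: sample standard deviation in Hz} across modes."""
    return {name: stdev(vals) for name, vals in values_by_formant.items()}


def rank_by_stability(spread):
    """Formant names ordered from least to most affected (smallest SD first)."""
    return sorted(spread, key=spread.get)


spread = formant_spread(formants_hz)
order = rank_by_stability(spread)
print(" < ".join(order))
```

With real measurements, the same two functions would be applied to each speaker separately before the per-speaker orderings are compared.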

Laryngeal Acoustic Parameter
In speech production, the larynx acts as a phonatory mechanism, transforming the airflow from the respiratory system into waveforms. Phonation types refer to the activity of the larynx and can be considered as laryngeal voice quality. Voice quality consists of physically induced voice characteristics and vocal settings; both make use of similar acoustic parameters, and a speaker's physical characteristics are reflected in his/her voice quality. Acoustic measurement of laryngeal voice quality was first introduced by Fischer-Jørgensen [10,14].
In addition to supralaryngeal acoustic parameters, we have made an attempt to study laryngeal acoustic parameters, namely Open Quotient (H1*-H2*), degree of glottal opening (H1*-A1) and glottal leakage (B1), on the selected text. The calculated values of each parameter have been plotted to distinguish between different speakers even when they are trying to disguise their voices.
a) Open Quotient (H1*-H2*): Figures 9a and 9b represent the plot of open quotient for female and male speakers respectively. It is clear from the figures that every speaker has an individualistic pattern, which is independent of the type of disguise and the type of text spoken. The relationship between the OQ and the harmonics of a periodic signal is evident from Fourier analysis: as the OQ increases, the amplitudes of the higher harmonics decrease. Male speakers have a relatively shorter pulse than female speakers and hence a smaller OQ, such that the higher harmonics remain relatively strong, as opposed to the relatively weak higher harmonics for women [14,15]. Hence, OQ can be considered a significant parameter to differentiate the speakers.
b) Glottal leakage (B1): [...] can also increase or decrease depending upon the type of disguise. The first formant bandwidth varies with the type of disguise as well as the type of text spoken, but for some speakers the variation is very small, in the range 0-5 Hz, in both disguise and normal modes of speaking. Further, the value of glottal leakage ranges over 350-600 Hz for female speakers and 250-450 Hz for male speakers. The range of variation differs from speaker to speaker (Figure 10). It has been observed that the glottal leakage is quite similar for normal and disguise modes for some speakers, and this can be used to differentiate among the speakers.
c) Degree of glottal opening (H1*-A1): The degree of glottal opening (H1*-A1) primarily reflects the laryngeal voice quality.
As we know, vocal fold abduction is largely a function of posterior cricoarytenoid muscle action, and the opening of the glottis is usually greater in the voiceless mode than in any other mode used in speech. The degree of glottal opening (H1*-A1) is mainly characteristic of the type of disguise. Its values vary widely between disguise types and normal mode, but the variation among speakers is small. For example, whisper requires far greater constriction than the voiceless setting of the glottis, and in breathy voice normal vocal fold vibration is accompanied by some continuous turbulent airflow. In both cases glottal closure is incomplete [16][17][18]. It was found that the value of the degree of glottal opening is higher for female speakers than for male speakers (Figure 11).
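The harmonic-difference measure used above as a correlate of open quotient can be illustrated with a short sketch. It is not the CSL procedure used in this study: the function names are illustrative, the signal is synthetic, the fundamental frequency is assumed to be known in advance, and the result is the uncorrected difference H1-H2, since the formant correction denoted by the asterisks in H1*-H2* is not applied. Only the 44.1 kHz sampling rate is taken from the recording setup.

```python
import cmath
import math


def harmonic_level_db(signal, sample_rate, freq_hz):
    """Level in dB of the sinusoidal component at freq_hz (single-bin DFT)."""
    n = len(signal)
    acc = sum(x * cmath.exp(-2j * math.pi * freq_hz * i / sample_rate)
              for i, x in enumerate(signal))
    amplitude = 2.0 * abs(acc) / n  # peak amplitude of that component
    return 20.0 * math.log10(amplitude)


def h1_minus_h2(signal, sample_rate, f0_hz):
    """Uncorrected H1-H2 in dB: level at f0 minus level at 2*f0."""
    return (harmonic_level_db(signal, sample_rate, f0_hz)
            - harmonic_level_db(signal, sample_rate, 2.0 * f0_hz))


# Synthetic check: f0 = 210 Hz sampled at 44.1 kHz for 0.1 s, so the window
# holds a whole number of periods of both harmonics (assumed values).
sample_rate = 44100
f0 = 210.0
a1, a2 = 1.0, 0.5  # amplitudes of the first and second harmonics
signal = [a1 * math.sin(2 * math.pi * f0 * i / sample_rate)
          + a2 * math.sin(2 * math.pi * 2 * f0 * i / sample_rate)
          for i in range(4410)]

diff_db = h1_minus_h2(signal, sample_rate, f0)
print(round(diff_db, 2))
```

A second harmonic at half the amplitude of the first gives a difference of about 6 dB; a weaker second harmonic gives a larger value, which is the direction associated with a larger open quotient.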

Conclusion
The present study reveals that each of the parameters investigated carries speaker-specific information, but they can be ranked according to how well they show larger inter-speaker differences and smaller intra-speaker differences in normal as well as disguise modes. Spectrographic analysis reveals good information about the intonation and formant pattern in normal mode, but these can change in the disguised mode of speaking. Among all the temporal measures, the phonation-time ratio (P/T) was found to be a prolific parameter. The first four formant frequencies (F1, F2, F3 and F4) were also calculated for the isolated text extracted from contextual speech, and they are ranked statistically in the order F1<F2<F3<F4. In some cases of disguised speech samples, the second formant frequency (F2) shows more variation than the third formant frequency (F3) and the fourth formant frequency (F4), but the first formant frequency (F1) is the least affected. Besides, laryngeal acoustic parameters such as Open Quotient (OQ) and glottal leakage (B1) also reflect speaker specificity. These parameters can be considered for identifying speakers in normal as well as disguise modes.