Employing Second-Order Circular Suprasegmental Hidden Markov Models to Enhance Speaker Identification Performance in Shouted Talking Environments

Speaker identification performance is almost perfect in neutral talking environments; however, it deteriorates significantly in shouted talking environments. This work is devoted to proposing, implementing and evaluating new models called Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) to alleviate this degraded performance in shouted talking environments. The proposed models possess the characteristics of both Circular Suprasegmental Hidden Markov Models (CSPHMMs) and Second-Order Suprasegmental Hidden Markov Models (SPHMM2s). The results of this work show that CSPHMM2s outperform each of: First-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM1s), Second-Order Left-to-Right Suprasegmental Hidden Markov Models (LTRSPHMM2s) and First-Order Circular Suprasegmental Hidden Markov Models (CSPHMM1s) in shouted talking environments. In such talking environments and using our collected speech database, average speaker identification performance based on LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s and CSPHMM2s is 74.6%, 78.4%, 78.7% and 83.4%, respectively. Speaker identification performance obtained based on CSPHMM2s is close to that obtained based on subjective assessment by human listeners.


Introduction
Speaker recognition is the process of automatically recognizing who is speaking on the basis of individual information embedded in speech signals. Speaker recognition involves two applications: speaker identification and speaker verification (authentication). Speaker identification is the process of finding the identity of an unknown speaker by comparing his/her voice with the voices of the speakers registered in the database. The comparison results are measures of similarity, from which the identity with the maximal similarity is chosen. Speaker identification can be used in criminal investigations to determine the suspected persons who generated the voice recorded at the scene of the crime. Speaker identification can also be used in civil cases or for the media. These cases include calls to radio stations, local or other government authorities, insurance companies, monitoring people by their voices and many other applications.
Speaker verification is the process of determining whether the speaker's identity is who the person claims to be. In this type of speaker recognition, the voiceprint of the speaker to be verified is compared with his/her voice model registered in the speech data corpus. The result of the comparison is a measure of similarity, from which acceptance or rejection of the verified speaker follows. The applications of speaker verification include using the voice as a key to confirm the identity claim of a speaker. Such services include banking transactions over a telephone network, database access services, security control for confidential information areas, remote access to computers, tracking speakers in a conversation or broadcast and many other applications.

Speaker recognition is often classified into closed-set recognition and open-set recognition. The closed-set refers to cases in which the unknown voice must come from a set of known speakers, while the open-set refers to cases in which the unknown voice may come from unregistered speakers. Speaker recognition systems can also be divided according to the speech modality: text-dependent (fixed-text) recognition and text-independent (free-text) recognition. In text-dependent recognition, the text spoken by the speaker is known; however, in text-independent recognition, the system should be able to identify the unknown speaker from any text.

Motivation and Literature Review
Speaker recognition systems perform extremely well in neutral talking environments [1][2][3][4]; however, such systems perform poorly in stressful talking environments [5][6][7][8][9][10][11][12][13]. Neutral talking environments are defined as the talking environments in which speech is generated assuming that speakers are not suffering from any stressful or emotional talking conditions. Stressful talking environments are defined as the talking environments that cause speakers to vary their generation of speech from neutral talking condition to other stressful talking conditions such as shouted, loud and fast.
In the literature, there are many studies that focus on the speech recognition and speaker recognition fields in stressful talking environments [5][6][7][8][9][10][11][12][13]. However, these two fields have been investigated by very few researchers in shouted talking environments; therefore, the number of studies that focus on the two fields in such talking environments is limited [7][8][9][10][11]. Shouted talking environments are defined as the talking environments in which speakers shout with the aim of producing a very loud acoustic signal, either to increase its range of transmission or its ratio to background noise [8][9][10][11]. Speaker recognition systems in shouted talking environments can be used in criminal investigations to identify the suspected persons who uttered voice in a shouted talking condition, and in the applications of talking condition recognition systems. Talking condition recognition systems can be used in medical applications, telecommunications, law enforcement and military applications [12].
This paper aims at proposing, implementing and testing new models to enhance text-dependent speaker identification performance in shouted talking environments. The new proposed models are called Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s). This work is a continuation of the four previous studies in [8][9][10][11]. Specifically, the main goal of this work is to further improve speaker identification performance in such talking environments based on a combination of HMM2s, CHMM2s and SPHMMs. This combination is called CSPHMM2s. The rest of the paper is organized as follows. The next section overviews the fundamentals of SPHMMs. Section 4 summarizes LTRSPHMM1s, LTRSPHMM2s and CSPHMM1s. The details of CSPHMM2s are discussed in Section 5. Section 6 describes the collected speech data corpus adopted for the experiments. Section 7 is committed to discussing the speaker identification algorithm and the experiments based on each of LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s and CSPHMM2s. Section 8 discusses the results obtained in this work. Concluding remarks are given in Section 9.

Fundamentals of Suprasegmental Hidden Markov Models
SPHMMs have been developed, used and tested by Shahin in the fields of speaker recognition [10,11,14] and emotion recognition [15]. SPHMMs have been demonstrated to be superior to HMMs for speaker recognition in each of the shouted [10,11] and emotional talking environments [14]. SPHMMs have the ability to condense several states of HMMs into a new state called a suprasegmental state. A suprasegmental state has the capability to look at the observation sequence through a larger window. Such a state allows observations at rates appropriate for the situation being modeled. For example, prosodic information cannot be detected at the rate that is used for acoustic modeling.
Fundamental frequency, intensity and duration of speech signals are the main acoustic parameters that describe prosody [16]. Suprasegmental observations encompass information about the pitch of the speech signal, the intensity of the utterance and the duration of the relevant segment. These three parameters, in addition to the speaking style feature, have been adopted and used in the current work. Prosodic features of a unit of speech are called suprasegmental features since they affect all the segments of the unit. Therefore, prosodic events at the levels of phone, syllable, word and utterance are modeled using suprasegmental states; on the other hand, acoustic events are modeled using conventional states.
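As a concrete illustration of these parameters, the following minimal Python sketch (an illustration only, not the authors' feature extractor) computes frame-level intensity (RMS energy) and a crude autocorrelation-based pitch estimate on a synthetic 200 Hz tone; duration would simply be the length of the relevant segment.

```python
import numpy as np

sr = 8000                                   # sampling rate in Hz
t = np.arange(sr) / sr                      # one second of signal
signal = np.sin(2 * np.pi * 200.0 * t)      # synthetic 200 Hz "voiced" tone

frame = signal[:400]                        # a 50 ms analysis frame
intensity = np.sqrt(np.mean(frame ** 2))    # RMS intensity of the frame

# Autocorrelation pitch estimate: the lag of the strongest peak
# (searched over roughly the 40-400 Hz range) gives the pitch period.
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
lag = int(np.argmax(ac[20:200])) + 20       # candidate lags 20..199 samples
f0 = sr / lag                               # estimated fundamental frequency
```

For the synthetic tone the detected period is 40 samples, i.e. f0 = 200 Hz; real speech would of course require a more robust pitch tracker.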
Prosodic and acoustic information can be combined and integrated within HMMs as given by the following formula [17]:

log P(λv^SPHMM | O) = (1 − α) · log P(λv^HMM | O) + α · log P(λv^Suprasegmental | O)

where:
α: is a weighting factor (0 ≤ α ≤ 1). When α = 0, the model becomes purely acoustic; when α = 1, it becomes purely prosodic.
O: is the observation vector or sequence of an utterance.
P(λv^HMM | O): is the probability of the vth HMM speaker model given the observation vector O.
P(λv^Suprasegmental | O): is the probability of the vth SPHMM speaker model given the observation vector O.
The reader can obtain more details about suprasegmental hidden Markov models from the references [10,11,14,15].
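As a minimal sketch of this weighted fusion (the function name and toy log-probability values below are illustrative assumptions, not the authors' implementation):

```python
def combined_log_likelihood(log_p_acoustic, log_p_suprasegmental, alpha=0.5):
    """Convex combination of the acoustic (HMM) and prosodic (suprasegmental)
    log-probabilities. alpha = 0 reduces to a purely acoustic model;
    alpha = 1 to a purely prosodic one."""
    return (1.0 - alpha) * log_p_acoustic + alpha * log_p_suprasegmental

# Toy example: equal weighting (alpha = 0.5), as adopted later in this work.
score = combined_log_likelihood(-120.0, -90.0, alpha=0.5)   # -> -105.0
```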

First-order left-to-right suprasegmental hidden Markov models
First-Order Left-to-Right Suprasegmental Hidden Markov Models have been derived from acoustic First-Order Left-to-Right Hidden Markov Models (LTRHMM1s). LTRHMM1s have been adopted in many studies in the areas of speech, speaker and emotion recognition over the last three decades because phonemes strictly follow a left-to-right sequence [18][19][20]. Fig. 1 shows an example of a basic structure of LTRSPHMM1s that has been derived from LTRHMM1s. This figure shows six first-order acoustic hidden Markov states (q1, q2, …, q6) with left-to-right transitions. p1 is a first-order suprasegmental state consisting of q1, q2 and q3; p2 is a first-order suprasegmental state composed of q4, q5 and q6. The suprasegmental states p1 and p2 are arranged in a left-to-right form. p3 is a first-order suprasegmental state which is made up of p1 and p2. aij is the transition probability between the ith and jth acoustic hidden Markov states, while bij is the transition probability between the ith and jth suprasegmental states.
In LTRHMM1s, the state sequence is a first-order Markov chain where the stochastic process is expressed in a 2-D matrix of a priori transition probabilities (aij) between states si and sj, where aij are given as:

aij = Prob(qt = sj | qt-1 = si)

In these acoustic models, it is assumed that the state-transition probability at time t+1 depends only on the state of the Markov chain at time t. More information about acoustic first-order left-to-right hidden Markov models can be found in the references [21,22].

Second-order left-to-right suprasegmental hidden Markov models
In LTRHMM2s, the state sequence is a second-order Markov chain where the stochastic process is specified by a 3-D matrix (aijk). Therefore, the transition probabilities in LTRHMM2s are given as [23]:

aijk = Prob(qt = sk | qt-1 = sj, qt-2 = si)

with the constraints:

aijk ≥ 0,   Σk aijk = 1

The state-transition probability in LTRHMM2s at time t+1 depends on the states of the Markov chain at times t and t-1. The reader can find more information about acoustic second-order left-to-right hidden Markov models in the references [8,9,23].
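To make the dimensionality difference concrete, the following sketch (illustrative random values, not trained models) builds a 2-D first-order transition matrix and a 3-D second-order transition tensor and checks the stochastic constraints:

```python
import numpy as np

n_states = 6
rng = np.random.default_rng(0)

# First-order: a1[i, j] = Prob(q_t = s_j | q_{t-1} = s_i); each row sums to 1.
a1 = rng.random((n_states, n_states))
a1 /= a1.sum(axis=1, keepdims=True)

# Second-order: a2[i, j, k] = Prob(q_t = s_k | q_{t-1} = s_j, q_{t-2} = s_i);
# for every history pair (i, j), the distribution over k sums to 1.
a2 = rng.random((n_states, n_states, n_states))
a2 /= a2.sum(axis=2, keepdims=True)

assert np.allclose(a1.sum(axis=1), 1.0)
assert np.allclose(a2.sum(axis=2), 1.0)
```

The second-order tensor stores n³ probabilities instead of n², which is the source of both the extra modeling power and the extra training cost noted later in the paper.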

First-order circular suprasegmental hidden Markov models
First-Order Circular Suprasegmental Hidden Markov Models have been constructed from acoustic First-Order Circular Hidden Markov Models (CHMM1s). CHMM1s were proposed and used by Zheng and Yuan for speaker identification systems in neutral talking environments [24]. Shahin showed that these models outperform LTRHMM1s for speaker identification in shouted talking environments [9]. More details about CHMM1s can be obtained from the references [9,24]. Fig. 2 shows an example of a basic structure of CSPHMM1s that has been obtained from CHMM1s. This figure consists of six first-order acoustic hidden Markov states q1, q2, …, q6 arranged in a circular form. p1 is a first-order suprasegmental state consisting of q4, q5 and q6; p2 is a first-order suprasegmental state composed of q1, q2 and q3. The suprasegmental states p1 and p2 are arranged in a circular form. p3 is a first-order suprasegmental state which is made up of p1 and p2.
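The topological difference between a left-to-right chain and a circular chain can be sketched as follows (an illustration only; the exact connectivity of CHMM1s may differ): in a circular topology the last state can transition back to the first, closing the circle.

```python
import numpy as np

def circular_mask(n):
    """Boolean mask of allowed transitions in a simple circular topology:
    self-loops plus a forward move that wraps from state n-1 back to 0."""
    m = np.zeros((n, n), dtype=bool)
    for i in range(n):
        m[i, i] = True              # self-loop
        m[i, (i + 1) % n] = True    # next state, wrapping around
    return m

mask = circular_mask(6)
assert mask[5, 0]                   # the wrap-around edge closes the circle
```

A left-to-right topology would lack the `mask[5, 0]` edge, so the state sequence could never return to earlier states.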

Second-Order Circular Suprasegmental Hidden Markov Models
Second-Order Circular Suprasegmental Hidden Markov Models (CSPHMM2s) have been formed from acoustic Second-Order Circular Hidden Markov Models (CHMM2s). CHMM2s were proposed, used and examined by Shahin for speaker identification in each of the shouted and emotional talking environments [9,14].
CHMM2s have been shown to be superior to each of LTRHMM1s, LTRHMM2s and CHMM1s because CHMM2s contain the characteristics of both CHMMs and HMM2s [9].
As an example of CSPHMM2s, the six first-order acoustic circular hidden Markov states of Fig. 2 are replaced by six second-order acoustic circular hidden Markov states arranged in the same form. p1 and p2 become second-order suprasegmental states arranged in a circular form. p3 is a second-order suprasegmental state which is composed of p1 and p2.
Prosodic and acoustic information within CHMM2s can be merged into CSPHMM2s as given by the following formula:

log P(λv^CSPHMM2 | O) = (1 − α) · log P(λv^CHMM2 | O) + α · log P(λv^Suprasegmental | O)

CSPHMM2s are superior to each of LTRSPHMM1s, LTRSPHMM2s and CSPHMM1s for two reasons:
1. The stochastic process that is specified by a 3-D matrix yields higher speaker identification performance than that specified by a 2-D matrix.
2. The suprasegmental chain in CSPHMMs is more powerful and more efficient than that in LTRSPHMMs at modeling the changing statistical characteristics that are present in the actual observations of speech signals.

Collected Speech Data Corpus
The proposed models in the current work have been evaluated using the collected speech data corpus. In this corpus, eight sentences were generated under each of the neutral and shouted talking conditions. These sentences were:
1) He works five days a week.
2) The sun is shining.
3) The weather is fair.
4) The students study hard.

Speaker Identification Algorithm Based on LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s and CSPHMM2s and the Experiments
The training session of LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s and CSPHMM2s is performed separately for each model type. In the identification (test) session, the probability of generating the utterance of the unknown speaker is computed separately against the model of every registered speaker, and the speaker whose model yields the maximal probability is chosen:

a. In LTRSPHMM1s,
V* = arg maxv P(O | λv^LTRSPHMM1)
where O is the observation vector or sequence that belongs to the unknown speaker and λv^LTRSPHMM1 is the suprasegmental first-order left-to-right model of the vth speaker.

b. In LTRSPHMM2s,
V* = arg maxv P(O | λv^LTRSPHMM2)
where λv^LTRSPHMM2 is the suprasegmental second-order left-to-right model of the vth speaker.

c. In CSPHMM1s,
V* = arg maxv P(O | λv^CSPHMM1)
where λv^CSPHMM1 is the suprasegmental first-order circular model of the vth speaker.

d. In CSPHMM2s,
V* = arg maxv P(O | λv^CSPHMM2)
where λv^CSPHMM2 is the suprasegmental second-order circular model of the vth speaker.
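For all four model types, the identification decision reduces to a maximum-probability search over the registered speakers; a minimal sketch with hypothetical scores (not the authors' code):

```python
def identify_speaker(log_scores):
    """Return v* = argmax_v log P(O | lambda_v), i.e. the index of the
    registered speaker whose model best explains the test utterance O."""
    return max(range(len(log_scores)), key=lambda v: log_scores[v])

# Toy log-likelihoods of one test utterance against 3 speaker models.
log_scores = [-210.4, -198.7, -205.1]
winner = identify_speaker(log_scores)       # -> 1 (the second speaker)
```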

Results and Discussion
In the current work, CSPHMM2s have been proposed, implemented and evaluated for speaker identification systems in each of the neutral and shouted talking environments. To evaluate the proposed models, speaker identification performance based on such models is compared separately with that based on each of LTRSPHMM1s, LTRSPHMM2s and CSPHMM1s in the two talking environments. In this work, the weighting factor α has been chosen to be equal to 0.5 to avoid biasing towards either the acoustic or the prosodic model. Table 1 shows speaker identification performance in each of the neutral and shouted talking environments using the collected database based on each of LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s and CSPHMM2s. It is evident from this table that each of LTRSPHMM1s, LTRSPHMM2s, CSPHMM1s and CSPHMM2s performs almost perfectly in the neutral talking environments. This is because each of the acoustic models LTRHMM1s, LTRHMM2s, CHMM1s and CHMM2s yields high speaker identification performance in such talking environments, as shown in Table 2.
The statistical significance of these performance differences has been examined using the Student's t-distribution test:

t1,2 = (x̄model 1 − x̄model 2) / SDpooled

where:
x̄model 1: is the mean of the first sample (model 1) of size n.
x̄model 2: is the mean of the second sample (model 2) of size n.
SDpooled: is the pooled standard deviation of the two samples.
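The t statistic referred to here can be sketched in its standard pooled two-sample form (the exact pooling used in the paper, and the means/deviations below, are assumptions for illustration):

```python
import math

def t_statistic(mean1, mean2, sd1, sd2):
    """Two-sample t value for equal-size samples, using the pooled
    standard deviation sqrt((sd1^2 + sd2^2) / 2)."""
    sd_pooled = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2.0)
    return (mean1 - mean2) / sd_pooled

# Hypothetical performance means and standard deviations, illustration only.
t = t_statistic(mean1=83.4, mean2=78.7, sd1=2.0, sd2=2.0)
```

The resulting t value is then compared with the critical value of the Student's t-distribution at the chosen significance level.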
In one of his previous studies, Shahin showed that CHMM2s contain the characteristics of each of LTRHMM1s, LTRHMM2s and CHMM1s. Therefore, the enhanced speaker identification performance based on CHMM2s is the resultant of the speaker identification performance based on the combination of the three acoustic models, as shown in Table 2. Since CSPHMM2s are derived from CHMM2s, the improved speaker identification performance in shouted talking environments based on CSPHMM2s is the resultant of the enhanced speaker identification performance based on each of the three suprasegmental models, as shown in Table 1.

Table 3 gives the calculated t values in each of the neutral and shouted talking environments using the collected database between CSPHMM2s and each of LTRSPHMM1s, LTRSPHMM2s and CSPHMM1s. Table 2 yields speaker identification performance in each of the neutral and shouted talking environments based on each of the acoustic models LTRHMM1s, LTRHMM2s, CHMM1s and CHMM2s. Speaker identification performance achieved in this work in each of the neutral and shouted talking environments is consistent with that obtained in Ref. [9] using a different speech database (forty speakers uttering ten isolated words in each of the neutral and shouted talking environments).

The SUSAS database was designed originally for speech recognition under neutral and stressful talking conditions [30]. In the present work, isolated words recorded at an 8 kHz sampling rate were used under each of the neutral and angry talking conditions. The angry talking condition has been used as an alternative to the shouted talking condition since the shouted talking condition cannot be entirely separated from the angry talking condition in real life [8]. Thirty different utterances uttered by seven speakers (four males and three females) in each of the neutral and angry talking conditions have been chosen to assess the proposed models.
This number of speakers is very limited compared to the number of speakers used in the collected speech database [10,11]. In the angry talking condition using the same database, Shahin achieved speaker identification performance of 79.0% and 79.2% based on LTRSPHMM1s and a gender-dependent approach using LTRSPHMM1s, respectively [10,11]. Based on using the SUSAS database in each of the neutral and angry talking conditions, the results obtained in this experiment are consistent with those reported in some previous studies [10,11].
2. The new proposed models have been tested for different values of the weighting factor (α). Fig. 4 shows speaker identification performance in each of the neutral and shouted talking environments for different values of this factor.

4. An informal subjective assessment of the new proposed models using the collected speech database has been performed with ten nonprofessional listeners (human judges). A total of 800 utterances (fifty speakers, two talking environments and eight sentences) was used in this assessment.
During the evaluation, each listener was asked to identify the unknown speaker in each of the neutral and shouted talking environments (two completely separate talking environments) for every test utterance.
The average speaker identification performance in the neutral and shouted talking environments was 94.7% and 79.3%, respectively. These averages are very close to those achieved in the present work based on CSPHMM2s.

Based on the results of this work, CSPHMM2s outperform each of LTRSPHMM1s, LTRSPHMM2s and CSPHMM1s. This superiority is significant in the shouted/angry talking environments; however, it is less significant in the neutral talking environments.
This is because the conventional HMMs perform extremely well in the neutral talking environments. Using CSPHMM2s for speaker identification systems nonlinearly increases the computational cost and the training requirements compared to using each of LTRSPHMM1s and CSPHMM1s for the same systems.
For future work, we plan to apply the proposed models to speaker identification systems in emotional talking environments. These models can also be applied to speaker verification systems in each of the shouted and emotional talking environments and to multi-language speaker identification systems.