Intelligent acoustic data fusion technique for information security analysis

Tone is an essential component of word formation in all tonal languages, and it plays an important role in the transmission of information in speech communication. The study of tone characteristics can therefore be applied to the security analysis of acoustic signals, for example through language identification. In speech processing, researchers in speech synthesis often treat fundamental frequency (F0) as representing tone. However, regular F0 values may lead to low naturalness in synthesized speech. Moreover, F0 and tone are not equivalent linguistically; F0 is just one representation of a tone. Electroglottography (EGG) signals are therefore collected for a deeper study of tone characteristics. In this paper, focusing on the Northern Kam language, which has nine tonal contours and five level tone types, we first collected EGG and speech signals from six native male speakers of Northern Kam and then obtained the clustering distributions of the tone curves. After summarizing the main characteristics of the tones of Northern Kam, we analyzed the relationship between EGG and speech signal parameters, laying the foundation for further security analysis of acoustic signals.


Introduction
Among human languages, tone is an essential component of tone languages and is used to build words much as consonants and vowels are [1]. For instance, in Mandarin Chinese, the syllable /ti/ means "embankment" when pronounced with a level pitch pattern, "whistle" with a rising pattern, "bottom" with a falling-rising pattern, and "land" with a falling pattern. Semantic distinction in tonal languages thus depends not only on articulatory composition but also on tone patterns. Consequently, a growing number of researchers are studying the distinguishing characteristics of different tones. Continuing this line of work, we use physiological and acoustic signal information to describe the tones of a tone language.
From a linguistic or physical point of view, voice production from the larynx to the lips involves three components: the voice source (phonation), the acoustic resonance (articulation), and the lip radiation (loudspeaker) (Fig. 1) [2]. The vibration of the vocal cords generates the speech signal, and vowels and consonants are closely tied to differences in vocal cord vibration, namely the phonation types. Vocal cord vibration results from the coordinated action of different parts of the throat. During voice production, F0 is determined by the vibration frequency of the vocal cords [3]. The source signal produced by this vibration is then shaped by the vocal tract, the oral cavity and the nasal cavity into different sounds. This shaping is called resonance and can be seen as a modulation of the source signal. Finally, the sounds are radiated into the air at the mouth to form the speech signal. Speech production is thus a process of excitation and modulation.
Traditional linguistics and speech processing research works with the collected speech signal; many in-depth studies have been carried out, numerous acoustic models have been proposed, and the production of tones has been analyzed thoroughly. However, tone is much more than a physical quantity. As a linguistic term, tone is linguists' rational characterization of pitch and belongs to categorical perception. Although F0 is the physical quantity most frequently used in current speech research, F0 alone cannot fully represent the original vibrational state of the vocal cords, which is also influenced by phonation. An electroglottograph (EGG) signal, such as the one in Fig. 2, measures vocal fold activity directly and indexes the degree of contact between the vocal folds, so it has become a useful tool for measuring and describing phonation [5][6][7][8][9][10][11].
Childers and Lee used EGG measures to extract source-related features for modal, vocal fry, falsetto and breathy voice produced by subjects with and without vocal disorders [12]. Four factors were found to be the most important in distinguishing these four phonation types: glottal pulse width, pulse skewing, abruptness of glottal closure, and turbulent noise. For glottal pulse width, modal phonation was produced with a medium width, vocal fry with a short width, and falsetto and breathy voice with a long width. For pulse skewing, modal phonation was characterized by medium skewing, vocal fry by high skewing, and falsetto and breathy voice by low skewing. In terms of abruptness of closure, modal voice and vocal fry were characterized by abrupt closure, while falsetto and breathy phonation were characterized by progressive closure. Finally, breathy phonation was the only phonation type produced with high turbulent noise.
... under a clean environment from a 53-year-old man who is familiar with the Kam language and Mandarin. Speech signals are recorded with the FieldPhon software at a sampling frequency of 44,100 Hz, and the recordings are then strictly checked by listening, identification and adjustment. The corpus includes 770 monosyllabic words, covering all tones on the 9 unchecked syllables and 4 checked syllables of the Northern Kam language. Moreover, the universality and proportionality of the tone, consonant and vowel samples are ensured as far as possible. EGG signals are acquired synchronously with the speech recordings.
Data processing: Seven F0 parameters are derived from the basic acoustic parameters extracted by STRAIGHT [22]: 1) the amplitude difference between the starting point and the end point of the fundamental frequency curve (pitch variation), 2) the length of the fundamental frequency curve (duration), 3) the maximum of the fundamental frequency, 4) the minimum of the fundamental frequency, 5) the extreme points of the fundamental frequency curve, 6) the slope of the fundamental frequency curve, and 7) the slope variation of the fundamental frequency curve. Meanwhile, two EGG parameters, CQ and DECPA, are obtained from the parameters CQ, Peak Vel and Min Vel, which are extracted by EggWorks and processed with MATLAB.
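As an illustration of the seven F0 parameters listed above, the following Python sketch computes one plausible version of each from a voiced F0 contour. The function name, the dictionary keys and the frame period are hypothetical. Several definitions are assumptions rather than the paper's own: extreme points are counted as interior turning points, the slope is taken from a least-squares line fit, and the slope variation is taken as the range of the local slope.

```python
import numpy as np

def f0_contour_features(f0_hz, frame_period_s=0.005):
    """Seven descriptive parameters of one voiced F0 contour.

    Assumptions (not from the paper): frames are equally spaced by
    frame_period_s, and the contour contains voiced frames only.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    if f0.size < 3:
        raise ValueError("need at least three voiced frames")
    t = np.arange(f0.size) * frame_period_s

    # Local slope in Hz/s at every frame.
    local_slope = np.gradient(f0, t)

    # Interior turning points: the first difference changes sign.
    # Flat steps (difference of zero) are not counted.
    step_sign = np.sign(np.diff(f0))
    n_extrema = int(np.sum(step_sign[:-1] * step_sign[1:] < 0))

    return {
        "pitch_variation": float(f0[-1] - f0[0]),          # 1) end minus start
        "duration_s": float((f0.size - 1) * frame_period_s),  # 2) contour length
        "f0_max": float(f0.max()),                          # 3) maximum
        "f0_min": float(f0.min()),                          # 4) minimum
        "n_extrema": n_extrema,                             # 5) turning points
        "overall_slope": float(np.polyfit(t, f0, 1)[0]),    # 6) fitted slope
        "slope_variation": float(local_slope.max() - local_slope.min()),  # 7)
    }

# Synthetic falling-rising contour, for illustration only.
contour = [120.0, 110.0, 100.0, 110.0, 120.0]
features = f0_contour_features(contour, frame_period_s=0.01)
```

The line fit is used for the overall slope because it is less sensitive to a single noisy endpoint than the start-to-end difference, which is already reported separately as the pitch variation.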

Methods
Analysis methods: The rule of minimum opposition is adopted for analyzing the experimental data, so that the influence of EGG parameters and acoustic parameters on the acoustic characteristics can be determined accurately. Under this rule, the items being compared differ in exactly one feature (tone, consonant or vowel). For example, when the opposing feature is the consonant, the vowel and tone are held constant; when the opposing feature is the tone, the consonant and vowel are held constant. During analysis, pathological voice is ignored, and outliers and interference caused by erroneous recording are removed.
The method of minimum opposition, used in this paper for one speaker and one language system, can also be applied to multi-speaker and multi-language systems. In that case, the speaker and the language type should each be treated as a condition of opposition in the analysis and discussion.
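The rule of minimum opposition can be sketched as a search for pairs of syllables that differ in exactly one feature. The Python example below is a minimal illustration under that reading. The `Token` type, the function names and the four sample syllables are hypothetical and are not taken from the corpus.

```python
from itertools import combinations
from typing import NamedTuple, Optional

class Token(NamedTuple):
    """One recorded syllable, described by its three features."""
    consonant: str
    vowel: str
    tone: int

def opposing_feature(a: Token, b: Token) -> Optional[str]:
    """Return the single feature in which a and b differ, else None."""
    differing = [name for name in Token._fields
                 if getattr(a, name) != getattr(b, name)]
    return differing[0] if len(differing) == 1 else None

def minimal_pairs(tokens):
    """List (feature, a, b) for every pair differing in exactly one feature."""
    pairs = []
    for a, b in combinations(tokens, 2):
        feature = opposing_feature(a, b)
        if feature is not None:
            pairs.append((feature, a, b))
    return pairs

# Hypothetical syllables, for illustration only.
tokens = [
    Token("t", "a", 4),
    Token("t", "a", 9),
    Token("t", "u", 4),
    Token("p", "u", 9),
]
pairs = minimal_pairs(tokens)
```

Extending the comparison to several speakers or languages, as the method allows, would amount to adding `speaker` and `language` fields to the token so that they are held constant or opposed in the same way.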

Analysis of Tone Identification by EGG Signal
At present, the definition of tone is mainly confined to the value and pattern of pitch. In physiology, pitch value corresponds to the vibration rate of the vocal folds; in acoustics and information science, it corresponds to the value of the fundamental frequency. Tone, on the other hand, is a linguistic term, and can be regarded as the combination of phonation type and vibration frequency. Phonation type is a suprasegmental feature related to timbre, while vibration frequency determines the value of the fundamental frequency. As shown in Fig. 3, resonance, labial radiation and consonants often make it difficult to obtain F0 accurately from the speech signal. EGG, by contrast, captures glottal information at the larynx directly, without these disturbances, so F0 extracted from the EGG signal is more precise than F0 extracted from the speech signal. As shown in Table 1, F0 extracted from the EGG signal has a smaller variation range and is smoother than F0 extracted from the speech signal.
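The two properties compared above, variation range and smoothness, can be quantified for any pair of F0 tracks. The Python sketch below uses the max-minus-min range and the mean absolute frame-to-frame change as a roughness measure. Both the roughness definition and the sample values are assumptions made for illustration; they are not the measures or the data of Table 1.

```python
import numpy as np

def f0_track_stats(f0_hz):
    """Variation range and frame-to-frame roughness of an F0 track (Hz)."""
    f0 = np.asarray(f0_hz, dtype=float)
    if f0.size < 2:
        raise ValueError("need at least two frames")
    return {
        # Span between the highest and lowest F0 value.
        "range_hz": float(f0.max() - f0.min()),
        # Mean absolute change between consecutive frames; lower is smoother.
        "roughness_hz": float(np.mean(np.abs(np.diff(f0)))),
    }

# Made-up tracks for the same syllable, for illustration only.
f0_from_egg = [200.0, 202.0, 201.0, 203.0]
f0_from_speech = [198.0, 206.0, 199.0, 207.0]

egg_stats = f0_track_stats(f0_from_egg)
speech_stats = f0_track_stats(f0_from_speech)
```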

Analysis of Acoustic and EGG Parameters of Checked and Unchecked Syllables
In the Northern Kam language, as in most East Asian languages, the same tone value can occur on both checked and unchecked syllables. For example, as shown in Fig. 3, tone7 occurs on an unchecked syllable and tone10 on a checked syllable. They have the same tone value, 55, but differ in duration, in the ending of the vowel and in the variation range of F0. In Tai-Kadai languages, tones on checked and unchecked syllables differ in duration, and checked syllables end in consonants, which leads to a large difference in phonation type. Figure 4 and Table 2 show the parameters of checked and unchecked syllables with the same tone value. The DECPA of the checked syllable is higher than that of the unchecked syllable with the same tone value, although the DECPA trends of the two are similar. Furthermore, Table 2 shows that the DECPA values of checked and unchecked syllables differ by more than 100. The F0 slope of the checked syllable is also higher, which is related to its higher DECPA. Meanwhile, the CQ values of checked and unchecked syllables with the same tone value are approximately equal. In Fig. 4, the vowel is /a/ in the syllables with different tone values, and the CQ values are around 0.55.
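For readers unfamiliar with CQ (contact quotient), the Python sketch below shows one common way to estimate it for a single glottal cycle: the fraction of the cycle during which the EGG signal lies above a criterion level between its minimum and maximum. This is only an illustration. The paper obtains CQ from EggWorks and does not state the method or threshold used, so the 50% level, the polarity convention (larger values mean more vocal fold contact) and the function name are all assumptions.

```python
import numpy as np

def contact_quotient(egg_cycle, threshold=0.5):
    """Estimate CQ for one glottal cycle with a criterion-level method.

    Assumptions (not from the paper): the input covers exactly one cycle,
    larger EGG values mean more vocal fold contact, and the contact phase
    is the part of the cycle above min + threshold * (max - min).
    """
    x = np.asarray(egg_cycle, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:
        raise ValueError("flat cycle: no contact information")
    level = lo + threshold * (hi - lo)
    # Fraction of samples in the cycle that exceed the criterion level.
    return float(np.mean(x > level))

# Synthetic square-like cycle: 12 "contact" samples out of 20.
synthetic_cycle = [1.0] * 12 + [0.0] * 8
cq = contact_quotient(synthetic_cycle)
```

Because the result depends on the chosen criterion level, CQ values are only comparable when they are computed with the same method and threshold.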

Analysis of Similar Tones
Similar tones, for example tone4 and tone9 shown in Fig. 4, closely resemble each other in tone contour. Their curves are nearly parallel and their durations are approximately equal; only their F0 values differ. Because such tones differ so little in their tone characteristics, current tone recognition algorithms have difficulty distinguishing them. The parameters DECPA and CQ of the similar tones (tone4, tone9) were examined for /tu/ and /ta/. DECPA can distinguish similar tones that share the same vowel and consonant. For /tu/, the DECPA of tone4 is larger than that of tone9, and the DECPA values of the two tones are clearly distributed in two separate intervals. For /ta/, the DECPA of tone9 is larger than that of tone4. The DECPA distributions for /tu/ and /ta/ also differ, because the articulation positions of /a/ and /u/ are different. DECPA is therefore related not only to the energy change of pronunciation but also to the articulation position.
By contrast, CQ does not discriminate the similar tones well in /tu/ and /ta/. In addition, the CQ curves of different tones on the same vowel are approximately similar, whereas the CQ values of different vowels in the same tone are inconsistent, which may imply that CQ can to some extent distinguish vowels.
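The contrast drawn above, with DECPA values of two similar tones falling into separate intervals while their CQ values do not, can be checked with a simple range-overlap test. The Python sketch below is a minimal illustration. The function names are hypothetical, and every number is made up for the example rather than taken from the measurements.

```python
def value_interval(values):
    """Closed interval [min, max] spanned by a set of measurements."""
    return (min(values), max(values))

def intervals_disjoint(a, b):
    """True if the intervals spanned by a and b do not overlap."""
    a_lo, a_hi = value_interval(a)
    b_lo, b_hi = value_interval(b)
    return a_hi < b_lo or b_hi < a_lo

# Made-up per-token values for two similar tones on the same syllable.
decpa_tone4 = [520.0, 540.0, 555.0]
decpa_tone9 = [410.0, 430.0, 445.0]
cq_tone4 = [0.54, 0.56, 0.57]
cq_tone9 = [0.55, 0.56, 0.58]

decpa_separates = intervals_disjoint(decpa_tone4, decpa_tone9)
cq_separates = intervals_disjoint(cq_tone4, cq_tone9)
```

A range test is deliberately strict: a single outlier is enough to make two intervals overlap, so outliers should be removed first, as the analysis method already requires.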

Conclusion
Although EGG is a widely used instrument in physiology and medicine, it is still a relatively new means of measurement in the fields of linguistics and speech. EGG signals were initially used in linguistics to examine phonation types; the EGG parameters CQ and DECPA have mainly been used to measure the opening and closing states of the glottis, as well as velocity variations. Because Northern Kam is a language with a relatively rich tone system, studying its tones is very meaningful. In this paper, we collected EGG signals of