The application of fractional Mel cepstral coefficient in deceptive speech detection

Detecting deceptive information directly from the speech signal would avoid the inconvenient operation of EEG P300 or functional magnetic resonance imaging (fMRI). In this paper, the fractional Mel cepstral coefficient (FrCC) is proposed as the speech feature for deception detection. Different fractional orders can reveal different individual characteristics of the speakers. The linear discriminant analysis (LDA) model, which provides a globally optimal projection vector, is introduced, and the performance of FrCC and MFCC in deception detection is compared after all the data are mapped to a low-dimensional space. Then, the hidden Markov model (HMM) is introduced as a long-term signal analysis tool. Twenty-five male and 25 female participants are involved in the experiment. The results show that the clustering effect of the optimal fractional order FrCC is better than that of MFCC. Using FrCC under the LDA model, the average accuracy is 59.9% for male speakers and 56.2% for female speakers. When MFCC is used instead, the accuracy drops by 3.6 and 5.9 percentage points, respectively. The accuracy rises to 71.0% and 70.2% for male and female speakers when the HMM is used. Moreover, when FrCC is introduced, the accuracy for some individuals increases by more than 20 percentage points, and some individual accuracies exceed 85%. The results show that deceptive information is indeed hidden in the speech signal. Therefore, speech-based psychophysiological computing may be a valuable research field.


INTRODUCTION
Deception detection is an ancient and mysterious topic in the long history of human science, and many useful results have accumulated. The mechanism of the modern polygraph is based on changes in EEG signals, owing to the contribution of psychophysiology research. The event-related potential (ERP) P300 and functional magnetic resonance imaging of the brain (fMRI) are widely used in polygraph technology (Gao et al., special circumstances (Anton et al., 2012). These techniques have achieved some good results, but the complex measurement process and the need for the participant's cooperation limit their wider adoption. Some information can be derived from speech, including the speaker's gender, age, and emotional and mental state. Therefore, speech signals, as a carrier of deceptive information, may provide a basis for lie detection. Because the process from psychological activity to physiological reaction is complicated, voice-based polygraphs still face the following problems. First, there is a lack of physiological experimental results to support the theoretical basis. Second, the study of the hearing mechanism cannot provide clear conclusions either. Third, deception detection results are not as intuitive as those of speech recognition or speaker identification, because lying is a process. Psychological stress evaluators (PSE) (Eriksson & Lacerda, 2007) and voice stress analyzers (VSA) (Harnsberger et al., 2009) are used to measure voice tremors, which are considered a reflection of stress. Layered voice analysis (LVA) (Anolli & Ciceri, 1997) claims to have discovered a link between certain types of brain activity and lying. Although somewhat controversial, these methods are effective to some extent. The recognition accuracy of deception detection remains low. Bond & DePaulo (2006) reported that people achieve an average of only 54% correct lie-truth judgments. Hirschberg et al. (2005) report a classification accuracy of 66.4% versus a chance baseline of 60.2%. Graciarena et al. (2006) report an accuracy of 64.0% versus a chance baseline of 60.4%. Most studies have focused on screening traditional speech features, and few have analyzed the mechanism behind deceptive speech features. Christin & David (2013) report that short-time energy, pitch trace and the formants F1, F2 and F3 did not show a clear correlation with deceptive information. Despite the lack of robust feature analysis, the authors remain positive about the prospects of speech-based deception detection.
MFCC parameters are the main features in state-of-the-art speech recognition systems. The standard extraction process of MFCC makes it suitable for classifying standard pronunciation patterns. Since lying behavior varies from individual to individual, standard features cannot fully reflect the individuality of each speaker, and conventional Fourier analysis cannot fully reveal the deceptive information hidden in the voice. Most studies report the performance of GMM and SVM systems in lie detection, but few papers report how phonetic features (including MFCC) change after a linear transform, and very few linear classification systems have been used in such experiments. While lie detection research is still at a preliminary stage, it is important to evaluate the performance of linear algorithms in feature extraction and classification. The results can provide a mathematical foundation for the later use of more complex models.
In this paper, the fractional Fourier transform (FrFT) is introduced for deceptive speech feature extraction, and the linear discriminant analysis (LDA) model and the hidden Markov model (HMM) are used for classification. The fractional Fourier transform can be regarded as a rotation of the traditional Fourier analysis by an angle in the time-frequency plane. Many examples in the current literature report the use of the fractional Fourier transform for the analysis of linear frequency modulated waveforms (Pang, Liu & Shan, 2014; Zhu, Zhao & Tang, 2013; Zhu et al., 2014).
A linear frequency modulated signal can be transformed into an impulse under certain angles by FrFT. The application of FrFT in speech signal processing is gradually increasing, and the accuracy of speech recognition and speaker identification improves when FrFT is introduced in feature extraction (Yin, Xie & Kuang, 2012; Pawan & Raghunath, 2013). The fractional order linear canonical transform algorithm also obtains good results in speech signal reconstruction (Qiu, Li & Li, 2013). There are many successful applications of LDA and HMM in speech signal processing. Behzad et al. (2011) used LDA and a modified version of it to reduce the speech recognition error rate, and LDA has been used to analyze the phonetic features of snoring patients (Ana et al., 2014; Jordi et al., in press). Rabiner & Schafer (2007) and Matthias & John (2009) successfully used HMM in speech recognition.
In this paper, the Mel cepstral coefficient in the fractional domain (called the fractional Mel cepstral coefficient, FrCC) is extracted for voice spectrum analysis. Then, LDA and HMM models are used to distinguish between truth and deception. Spectrum analysis at different fractional orders in the Mel domain can further refine the pronunciation characteristics of each liar. By rotating the MFCC parameters in the Mel domain, the details of individual differences in the speech signal may be reinforced to a certain extent. The LDA algorithm helps to find the best rotation angle and obtain the best separation. The HMM-based time series model can capture the psychological and physiological changes that occur when lying, and increases the recognition accuracy. The FrCC parameter that yields the highest accuracy is called the optimal order FrCC. The average deception detection accuracy of the optimal order FrCC is higher than that of MFCC, so the acoustic characteristics of speech signals can provide some support for lie detection.
The rest of the paper is organized as follows. The first section reviews studies of the distinguishing features of deceptive speech and provides the basis for applying fractional Fourier analysis to this subject. The second section presents the calculation of fractional Mel cepstral coefficients. The LDA algorithm and the HMM are introduced in the third section. Experimental results and analysis are described in the fourth section. Finally, conclusions are drawn.

DECEPTIVE SPEECH FEATURE INTRODUCTION AND FRACTIONAL FOURIER TRANSFORM APPLICATION FOUNDATION
The introduction of speech signals based polygraph application
Lying is a process that moves from psychological activity to physiological execution. First, people consciously decide to lie; then they organize their language to cover the real content; finally, they control the vocal organs to produce the voice. Tommaso, Fabio & Massimo (2013) used pattern recognition to study deception detection among people with different personalities. They concluded that people with outgoing personalities are easier to identify, which also shows that lie detection is subject to individual differences. Lamb & Skillicorn (2013) reported that the frequency with which words of different parts of speech appear during a trial may also be a way to analyze the possibility of lying. In the field of natural language processing, Christie, David & Dursun (2008) showed that the result is influenced by lexical choice, syntax, sentence length and motivation. Furthermore, the organization of text and voice can be used in an anti-phishing detection system, preventing people from being cheated in instant messengers (Mohammed & Lakshmi, 2012). These studies discuss deceptive speech detection from the perspectives of psychology and natural language processing, including personality differences, pronunciation differences, language expression, and sentence organization. Sofia et al. (2013) investigated how speakers differ between a normal state and a tense state. Cheryl et al. (2013) summarized that a person under tension shows increased adrenaline, higher blood pressure, sympathetic excitement, bronchodilation, and cricothyroid muscle tension. Some people tend to show increased pitch frequency and voice trembling, but not everyone exhibits these traits. These physiological studies support the existence of a difference between the normal and the deceptive state, and provide a simple physiological basis for speech deception detection.
The above conclusions from psychology, linguistics and physiology are relatively consistent, but research in the acoustic field has produced conflicting results. Gopalan & Wenndt (2007) claim that the pitch trace and the first formant, processed by an AM-FM model and the Teager operator, show a definite difference between normal and deceptive speech. However, Christin & David (2013) reached the opposite conclusion. They carried out several experiments on a range of speech parameters, including fundamental frequency, overall amplitude and the mean vowel formants F1, F2 and F3, and could not establish a significant correlation between these parameters and deceptiveness. Pitch and formants are susceptible to the speaker's pronunciation habits, the language content, and coarticulation. These parameters also change with the extraction algorithm, so they are not a good choice of speech features for lie detection. There is little research that reveals the process from psychological activity to speech production. Muhammad & Kaliappan (2013) use the Bark spectrum as the speech feature and a neural network as the classification model to distinguish truthfulness from deceptiveness. Therefore, robust acoustic characteristics should be more reliable for deceptive speech identification, and there is still plenty of scope for progress.

The fractional Fourier transform application foundations
Current research has not investigated the feature difference between normal speech and the lie. Physiological studies also could not provide any explanation for whether there are specific changes of articulators or not when people are lying. The existing information is only obtained from speech features statistical research result or classification conclusion by traditional pattern recognition models. The speech features in the frequency domain are mostly achieved by power spectrum transformation. If the liar's psychological and physiological changes indeed impact the frequency of speech, the deceptive information can be extracted by short-time frequency analysis for voice signals. But it does not work, if only the speech phase is changed. Therefore, a speech feature which can express both the change of frequency and phase is needed. The fractional Fourier analysis is applicable to the task.
Taking the cosine signal as an example, we compare the traditional Fourier transform with the fractional Fourier transform. (Please refer to the next section for a detailed description of the FrFT.) Equations (1) and (2) show that the Fourier amplitude-frequency response cannot reflect the phase difference between the two signals. According to the trigonometric formula, Eq. (2) can be expanded and transformed by FrFT as in Eq. (3). Equation (6) then expresses the FrFT result of cos(ω_0 t + θ). The phase θ still exists in |X_α(u)|, so Eq. (6) preserves the phase information. The use of speech signals for lie detection is only in its preliminary stages. If the difference between truthfulness and deceptiveness is really expressed by the amplitude and phase of the speech spectrum, fractional Fourier analysis should be an effective way to reveal the distinction. To accommodate the diversity of speakers, a variety of FrFT orders should be involved. The difference between normal speech and lies can be enhanced at some orders of FrFT.
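The phase argument can be checked numerically. The following is a minimal sketch, not the authors' derivation: it assumes the textbook FrFT kernel, approximates the integral by a Riemann sum over a finite time grid, and uses hypothetical helper names (`dft_magnitude`, `frft_at`). It compares two cosines that differ only by a phase offset θ = π/2.

```python
import cmath
import math


def dft_magnitude(x):
    """Magnitude of the ordinary DFT (naive O(N^2) sum, fine for short signals)."""
    n = len(x)
    return [
        abs(sum(x[k] * cmath.exp(-2j * math.pi * m * k / n) for k in range(n)))
        for m in range(n)
    ]


def frft_at(x, t, u, alpha):
    """Riemann-sum approximation of the FrFT integral at a single point u.

    Assumes the textbook kernel (alpha not a multiple of pi):
    K(t, u) = sqrt((1 - j*cot(a)) / (2*pi))
              * exp(j * ((t^2 + u^2) / 2 * cot(a) - u * t * csc(a)))
    """
    dt = t[1] - t[0]
    cot = 1.0 / math.tan(alpha)
    csc = 1.0 / math.sin(alpha)
    amp = cmath.sqrt((1 - 1j * cot) / (2 * math.pi))
    total = sum(
        xk * cmath.exp(1j * ((tk * tk + u * u) / 2 * cot - u * tk * csc))
        for xk, tk in zip(x, t)
    )
    return amp * total * dt


theta = math.pi / 2  # phase offset between the two cosines

# Ordinary DFT: 8 full cycles in 64 samples, with and without the offset.
n = 64
c0 = [math.cos(2 * math.pi * 8 * k / n) for k in range(n)]
c1 = [math.cos(2 * math.pi * 8 * k / n + theta) for k in range(n)]
dft_gap = max(abs(a - b) for a, b in zip(dft_magnitude(c0), dft_magnitude(c1)))

# FrFT at alpha = pi/4, evaluated at u = 0 on a symmetric time grid.
t = [-20 + 0.01 * k for k in range(4001)]
w0 = 2.0
alpha = math.pi / 4
frft_mag0 = abs(frft_at([math.cos(w0 * tk) for tk in t], t, 0.0, alpha))
frft_mag1 = abs(frft_at([math.cos(w0 * tk + theta) for tk in t], t, 0.0, alpha))

print(f"largest DFT magnitude gap: {dft_gap:.2e}")
print(f"|X_alpha(0)| without offset: {frft_mag0:.3f}, with offset: {frft_mag1:.2e}")
```

In this toy setting the two DFT magnitude spectra coincide, while the FrFT magnitude at u = 0 is clearly non-zero for one cosine and vanishes for the other. Note that the effect relies on the two chirp components of the cosine overlapping in the fractional domain, so it weakens for short windows and for angles close to π/2.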

FRACTIONAL MEL CEPSTRAL COEFFICIENT (FrCC) EXTRACTION
The FrCC parameters are a modification of MFCC. The first step is short-time analysis; the time-domain samples of each frame are then transformed to the fractional domain by FrFT under a set of rotation angles. Next, the spectrum is divided into Mel frequency bands by triangular filters, and finally a DCT is applied. The whole process is shown in Fig. 1.
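The details of the FrCC calculation steps are as follows:
(A) The fractional Fourier transform of the speech signal $x(t)$ is shown in (7).

$$X_\alpha(u) = \int_{-\infty}^{+\infty} x(t)\, K_\alpha(t,u)\, dt \tag{7}$$

Here, $\alpha = p\pi/2$, where the real number $p$ is the order of the FrFT. $K_\alpha(t,u)$ is the kernel function of the FrFT, and its specific expression is presented in Eq. (8).

$$K_\alpha(t,u) = \begin{cases} \sqrt{\dfrac{1 - j\cot\alpha}{2\pi}}\, \exp\!\left(j\,\dfrac{t^2 + u^2}{2}\cot\alpha - j\,ut\csc\alpha\right), & \alpha \neq n\pi \\ \delta(t - u), & \alpha = 2n\pi \\ \delta(t + u), & \alpha = (2n + 1)\pi \end{cases} \tag{8}$$

According to the properties of (8), when $\alpha = \pi/2$ the fractional Fourier transform is equal to the traditional Fourier transform.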
(B) Equation (9) provides the spectrum mapping operator from fractional domain to Mel frequency domain.
The Mel frequency bands are based on human auditory characteristics, and the same requirement should be met in the fractional domain. Therefore, Eq. (10) presents the frequency projection operator.
The output of each triangular filter is $|S_\alpha^M(u)|$, where $M$ refers to the $M$-th Mel component. The FrCC parameters can be calculated according to the above equations, and FrCC is equal to MFCC when $\alpha = \pi/2$.
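The last two stages of the pipeline (triangular Mel filters followed by a DCT) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's Eqs. (9) and (10): it uses the common Mel formula 2595·log10(1 + f/700), treats the input as a magnitude spectrum over [0, fs/2] from whichever transform order was applied, and assumes 24 filters. All function names are hypothetical.

```python
import math


def hz_to_mel(f):
    """Common Mel-scale formula (an assumption; the paper's mapping may differ)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def triangular_filterbank(n_filters, n_bins, fs):
    """Triangular filters spaced evenly on the Mel axis over [0, fs/2].

    Bin k is taken to sit at k * (fs/2) / (n_bins - 1) Hz.
    """
    top = hz_to_mel(fs / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for k in range(n_bins):
            f = k * (fs / 2.0) / (n_bins - 1)
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising edge
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling edge
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values, n_coeffs):
    """Unnormalized DCT-II, returning coefficients 1..n_coeffs (c0 is skipped)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * c * (k + 0.5) / n) for k, v in enumerate(values))
        for c in range(1, n_coeffs + 1)
    ]


def cepstral_coefficients(magnitude, fs, n_filters=24, n_coeffs=12):
    """Filterbank energies -> log -> DCT, for one frame's magnitude spectrum."""
    bank = triangular_filterbank(n_filters, len(magnitude), fs)
    energies = [sum(w * m for w, m in zip(row, magnitude)) for row in bank]
    log_energies = [math.log(e + 1e-12) for e in energies]  # floor avoids log(0)
    return dct2(log_energies, n_coeffs)


# 81 bins correspond to a 160-sample frame (20 ms at 8 kHz).
coeffs = cepstral_coefficients([1.0] * 81, fs=8000)
print(len(coeffs))
```

Feeding this function the magnitude of the ordinary Fourier spectrum gives MFCC-like coefficients; feeding it a fractional-domain magnitude gives an FrCC-like vector under the assumptions above.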

LDA algorithm
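The main function of linear discriminant analysis (LDA) is to project high-dimensional samples onto a low-dimensional space. It aims to maximize the distance between classes and minimize the distance within each class. Therefore, LDA is suitable for two-class tasks, such as separating truthfulness from deceptiveness. Let $S = \{s_1, s_2, \ldots, s_n\}$ be the training voice set, where each $s_i$ belongs to $\omega_1$ or $\omega_2$, representing normal speech and lies respectively. A projection vector $w$, defined as the best vector, maps the feature vector $x_i$ of $s_i$ to a one-dimensional value $y_i$.

$$y_i = w^{T} x_i \tag{12}$$

Then, it is very easy to make a decision by a simple comparison with a threshold $y_0$.

$$x_i \in \begin{cases} \omega_1, & y_i > y_0 \\ \omega_2, & y_i \le y_0 \end{cases} \tag{13}$$

So the main task of LDA is to calculate the optimal mapping vector $w$.
The mean of each category, with $n_k$ samples in class $\omega_k$, can be expressed as

$$m_k = \frac{1}{n_k} \sum_{x \in \omega_k} x, \quad k = 1, 2 \tag{14}$$

The mean value is changed after projection.

$$\tilde{m}_k = \frac{1}{n_k} \sum_{x \in \omega_k} w^{T} x = w^{T} m_k \tag{15}$$

The distance between the two means is

$$|\tilde{m}_1 - \tilde{m}_2| = |w^{T}(m_1 - m_2)| \tag{16}$$

and the variance (scatter) after projection is

$$\tilde{s}_k^2 = \sum_{x \in \omega_k} (w^{T} x - \tilde{m}_k)^2 \tag{17}$$

Define the objective function $J(w)$ as the ratio of the squared distance between the two categories to the variance within the classes; the $w$ that maximizes this ratio is the best vector.

$$J(w) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2} \tag{18}$$

Then define

$$S_k = \sum_{x \in \omega_k} (x - m_k)(x - m_k)^{T}, \qquad S_w = S_1 + S_2$$

and we have

$$\tilde{s}_1^2 + \tilde{s}_2^2 = w^{T} S_w w$$

and

$$S_B = (m_1 - m_2)(m_1 - m_2)^{T}, \qquad (\tilde{m}_1 - \tilde{m}_2)^2 = w^{T} S_B w$$

Eq. (18) can then be written as

$$J(w) = \frac{w^{T} S_B w}{w^{T} S_w w}$$

The optimal vector is

$$w^{*} = S_w^{-1}(m_1 - m_2)$$

The test voice set can easily be divided into two groups by Eqs. (12) and (13) once $w$ is obtained.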
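The two-class procedure above can be sketched in a few lines. This is a minimal illustration on synthetic two-dimensional data, not the experimental setup: the threshold is assumed to be the midpoint of the projected class means (the source does not state how the comparison value is chosen), and all function names are hypothetical.

```python
import random


def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def scatter(vectors, m):
    """Within-class scatter matrix: sum of (x - m)(x - m)^T."""
    d = len(m)
    s = [[0.0] * d for _ in range(d)]
    for x in vectors:
        diff = [xi - mi for xi, mi in zip(x, m)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def project(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_fisher_lda(class1, class2):
    """Return (w, threshold) with w = S_w^{-1} (m1 - m2)."""
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(s1, s2)]
    w = solve(sw, [p - q for p, q in zip(m1, m2)])
    threshold = 0.5 * (project(w, m1) + project(w, m2))  # assumed midpoint rule
    return w, threshold


def classify(x, w, threshold):
    """Class 1 projects above the threshold because w points from m2 to m1."""
    return 1 if project(w, x) > threshold else 2


rng = random.Random(0)
normal = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
lies = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(200)]
w, threshold = fit_fisher_lda(normal, lies)
hits = sum(classify(x, w, threshold) == 1 for x in normal)
hits += sum(classify(x, w, threshold) == 2 for x in lies)
accuracy = hits / 400
print(f"toy accuracy: {accuracy:.3f}")
```

The toy clusters are deliberately well separated and the accuracy is measured on the training points, so the number says nothing about real deceptive speech; the sketch only shows the mechanics of computing the projection vector and applying the comparison.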

Hidden Markov model
The hidden Markov model (HMM) can be considered a generalization of a mixture model in which the hidden variables are related through a Markov process and each observation is controlled by the hidden state. The state is not directly visible in an HMM, but the output observation is visible and depends on the state. Each state has a probability distribution over the possible outputs. Therefore, the output sequence generated by an HMM carries some information about the sequence of hidden states. The random variable $x(t)$ represents the hidden state at time $t$ (with $x(t) \in \{x_1, x_2, x_3\}$). The random variable $y(t)$ is the speech observation at time $t$ (with $y(t) \in \{y_1, y_2, y_3, y_4\}$). According to the basic theory of Markov processes, the conditional probability distribution of the hidden variable $x(t)$ depends only on the value of $x(t-1)$; the values at time $t-2$ and before have no influence. The value of the speech observation $y(t)$ depends only on the value of $x(t)$.
Two types of parameters, called transition probabilities and output probabilities, are contained in an HMM. The hidden state at time $t$ is determined by the hidden state at time $t-1$ according to the transition probabilities. There is also a set of output probabilities that describes the distribution of the observed variable.
An HMM is characterized by the following parameters.
1. N, the number of states in the model.
2. M, the number of observation symbols per state.
3. The transition probability distribution $A = \{a_{ij}\}$.
$a_{ij} = P(x_{t+1} = s_j \mid x_t = s_i)$.
4. The output observation probability distribution $B = \{b_j(k)\}$.
$b_j(k) = P(y_t = o_k \mid x_t = s_j)$.
5. The initial state distribution $\pi = \{\pi_i\}$.
$\pi_i = P(x_1 = s_i)$.
The famous forward-backward algorithm, EM algorithm, and Viterbi algorithm can be used to train the models and solve the recognition task (Rabiner & Schafer, 2007;Matthias & John, 2009). Although the HMM is a traditional method used in recognition systems, it is still a suitable model in deceptive speech detection with time series speech signals.
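As an illustration of how these parameters are used, the sketch below implements the forward recursion (sequence likelihood) and Viterbi decoding (most likely state path) for a discrete HMM with N = 3 states and M = 4 symbols, matching the sizes used in the description above. The probability values are made up for the example and do not come from the experiments.

```python
def forward_likelihood(obs, pi, a, b):
    """P(observation sequence | model) via the forward recursion."""
    n = len(pi)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def viterbi(obs, pi, a, b):
    """Most likely hidden state sequence for the observations."""
    n = len(pi)
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    backpointers = []
    for o in obs[1:]:
        # Best predecessor of each state j at this step.
        prev = [
            max(range(n), key=lambda i, j=j: delta[i] * a[i][j])
            for j in range(n)
        ]
        delta = [delta[prev[j]] * a[prev[j]][j] * b[j][o] for j in range(n)]
        backpointers.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for prev in reversed(backpointers):
        state = prev[state]
        path.append(state)
    return path[::-1]


# Illustrative parameters only (rows sum to 1).
pi = [0.4, 0.3, 0.3]
a = [[0.8, 0.1, 0.1],
     [0.1, 0.8, 0.1],
     [0.1, 0.1, 0.8]]
b = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.7, 0.1, 0.1],
     [0.1, 0.1, 0.4, 0.4]]

observations = [0, 0, 1, 1, 3, 3]
likelihood = forward_likelihood(observations, pi, a, b)
path = viterbi(observations, pi, a, b)
print(likelihood, path)
```

For a truth/lie decision, one common design (assumed here, not stated in the source) is to train one model per class and assign a test sequence to the class whose model gives the higher likelihood. For long sequences the recursion is normally carried out with scaling or in the log domain to avoid numerical underflow.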

Speech database
A liar's behavior is directly related to individual personality, cultural background, conversation content and the cost of having the lie seen through. Therefore, the speech material should be collected in realistic circumstances. Following the concealed information test theory (Verschuere, Ben-shakhar & Meijer, 2011), we designed a game, and the speech database is selected from the game records. There are two groups in the game, called A-group and B-group. Every person in A-group tells a story, and the persons in B-group can ask all kinds of questions about the story. Because the people in A-group tell different stories, the questions and answers differ from one another. Since the persons in B-group do not know whether the storyteller actually experienced the story, they must decide whether it is true or false from the teller's answers. If the persons in B-group guess correctly, they win the game and obtain a reward; otherwise, the person in A-group wins. So if the story is a lie, the teller should try his/her best to keep the secret from everyone in order to win the game. People in B-group should ask as many questions as possible (generally more than 10) to make the liars nervous and cause them to make mistakes. We select all the fake stories and fake answers as the deceptive speech samples. The corresponding tellers then record a set of normal speech in a calm environment, on topics such as self-introduction, hobbies and daily life. These recordings should be long enough to cover as many Mandarin Chinese syllables as possible. In this way, the normal speech samples are collected.
In the end, we retained the usable records of 50 participants, 25 men and 25 women. Owing to some limitations, the participants are mainly 25 to 35 years old. The SNR of all samples is above 25 dB. All speech is mono, sampled at 8 kHz and quantized with 16 bits. For short-time analysis, the frame length is 20 ms and the overlap is 10 ms. The data set is divided into two parts, the training set and the test set. Because the length of each person's speech sample differs, the data set is divided under a unified standard: the training set contains 30% of the whole record, and the remaining data are regarded as the test set.
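The framing and split parameters translate directly into sample counts: at 8 kHz a 20 ms frame holds 160 samples, and a 10 ms overlap gives a hop of 80 samples. A minimal sketch follows; the function names are hypothetical, and taking the first 30% of frames as the training set is an assumption, since the text does not say how the 30% is chosen.

```python
def frame_signal(samples, fs=8000, frame_ms=20, overlap_ms=10):
    """Split a sample sequence into overlapping frames (trailing remainder dropped)."""
    frame_len = fs * frame_ms // 1000
    hop = frame_len - fs * overlap_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def split_train_test(frames, train_fraction=0.3):
    """Assumed rule: the first 30% of frames train, the rest test."""
    cut = int(len(frames) * train_fraction)
    return frames[:cut], frames[cut:]


one_second = [0.0] * 8000            # placeholder for 1 s of speech samples
frames = frame_signal(one_second)
train, test = split_train_test(frames)
print(len(frames), len(frames[0]), len(train), len(test))
```

One second of speech yields 99 frames under these settings, of which 29 go to training and 70 to testing with the assumed rule.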

Human Ethics
This research was approved by the Institutional Review Boards of Soochow University School of Electronics and Information Engineering, and Suzhou University of Science and Technology School of Electronics and Information Engineering. The speech recording was carried out as a game, and all participants gave verbal consent.

The experiment results for LDA model
The experiment steps are as follows:
(A) Divide the speech signals into short frames with a length of 20 ms and an overlap of 10 ms.
(B) Extract the FrCC parameters from every frame; 12 FrCC coefficients and 12 delta FrCC coefficients form the FrCC vector of one speech frame. The range of the angle is α ∈ (0, π), with a step of 0.01π, so there are 100 FrCC vector groups for every frame.
(C) Select 30% of the total data as the training set, and use the LDA algorithm to calculate the optimal vector w. Then use Eqs. (12) and (13) to make the decision, and obtain the statistical accuracy.
In this experiment, the recognition results of the MFCC parameters are taken as a benchmark for comparison with those of the FrCC parameters. The results are shown in Tables 1 and 2.
In order to further quantify the improvement of FrCC, the vector variance is introduced to compare the clustering performance of the two parameters. The vector variance is shown in Eqs. (29) and (30).
Here, normfrcc_i denotes the FrCC vector of normal speech, and normfrcc with an overbar denotes its mean vector. Likewise, normmfcc_i and its overbarred form denote the MFCC vector of normal speech and the MFCC mean vector, respectively. R_1 in Eq. (29) denotes the vector variance ratio of FrCC to MFCC for normal speech. The ratios are listed in Tables 3 and 4.
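The angle sweep and the variance comparison can be sketched as follows. Two assumptions are made and should be read as such: the grid keeps only the angles strictly inside (0, π), which gives 99 values for a 0.01π step (the count of 100 in the text would require one endpoint, where the kernel degenerates), and the vector variance is taken as the mean squared Euclidean distance to the mean vector, since Eqs. (29) and (30) are not reproduced here. `accuracy_of` stands for any hypothetical function that trains and tests a classifier at a given angle.

```python
import math


def alpha_grid(step=0.01):
    """Angles k * step * pi strictly inside (0, pi)."""
    count = round(1 / step)
    return [k * step * math.pi for k in range(1, count)]


def best_alpha(accuracy_of):
    """Return (alpha, accuracy) for the angle with the highest accuracy."""
    scored = ((alpha, accuracy_of(alpha)) for alpha in alpha_grid())
    return max(scored, key=lambda pair: pair[1])


def vector_variance(vectors):
    """Mean squared Euclidean distance of the vectors to their mean vector."""
    n = len(vectors)
    centre = [sum(col) / n for col in zip(*vectors)]
    return sum(
        sum((x - c) ** 2 for x, c in zip(v, centre)) for v in vectors
    ) / n


def variance_ratio(frcc_vectors, mfcc_vectors):
    """Ratio below 1 means the FrCC vectors cluster more tightly than MFCC."""
    return vector_variance(frcc_vectors) / vector_variance(mfcc_vectors)


# Toy stand-ins for real features and a real accuracy measurement.
tight = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
loose = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ratio = variance_ratio(tight, loose)
alpha_star, acc_star = best_alpha(lambda a: 1.0 - abs(a - 0.51 * math.pi))
print(ratio, alpha_star / math.pi, acc_star)
```

Because the best angle is picked by the accuracy it produces, selecting it on the same data used to report accuracy would bias the result upward; a held-out portion for the selection is one way to avoid this.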

The experiment results of HMM model
There are many mature tools for HMM training and testing, such as HTK or Matlab software packages. The speech signals are again divided into short frames. The speech features, such as FrCC and MFCC, are regarded as the observations, and the psychophysiological status is regarded as the hidden state. Thirty percent of the total data is used as the training set, and the remaining data is the test set. The speech features change from frame to frame, and the hidden Markov chain can represent the process of psychophysiological change. The experiment results are shown in Tables 5 and 6. The comparison between the men set and the women set is shown in Figs. 5 and 6.

Results analysis and discussion
In the sections 'The experiment results for LDA model' and 'The experiment results of HMM model,' the experiment results show that the identification accuracy of FrCC parameters under certain angles is higher than that of MFCC parameters. Introducing the FrCC coefficients into the LDA model makes the clustering performance much better. The accuracy increases further when the HMM is used to exploit the contextual information. The following paragraphs give a brief explanation of the experiment results.
(A) In the LDA recognition system, the men group's average accuracy is 59.9% for FrCC at the best angle and 56.3% for MFCC. The average best angle is α = 0.51π, and the variance of α is D(α) = 0.23π. The women group's average accuracy is 56.2% for FrCC at the best angle and 50.3% for MFCC. The average best angle is α = 0.59π, and the variance of α is D(α) = 0.22π. For men 10 and 22 and woman 8, the best angle is π/2; in these cases, the FrCC coefficients are equal to the MFCC coefficients. In the other cases, the identification performance of FrCC under the LDA model is better than that of MFCC. For the 16th man, the accuracy increases from 36.7% to 56.0% when FrCC is introduced, and the accuracy of many people increases by more than 10 percentage points. Owing to individual differences, the accuracy increases only slightly for some people. Overall, when FrCC coefficients are used, the average accuracy increases by 3.6 percentage points in the men group and 5.9 percentage points in the women group. Although the average gain is not very large, there is great progress for some individuals. The FrCC parameters can therefore improve deception detection performance.
(B) The performance of the FrCC parameters changes with the angle α of the FrFT. Owing to the diversity and non-stationarity of speech and the individual differences between speakers, it is impossible to determine the optimal α before the experiment. The best α is selected as the one that gives the highest accuracy, so the mechanism of the selection algorithm should be studied further.
(C) Most of the R_1 and R_2 values in Tables 3 and 4 are less than 1. This shows that the clustering performance of FrCC at certain angles is much better than that of MFCC: the FrCC data are concentrated around the cluster center. The existence of α enhanced the appearance of ...
... women group's average accuracy of FrCC with the best angle is 70.2%, and that of MFCC is 65.0%. When the HMM is used, the average accuracy increases by 11.1 and 14.0 percentage points, respectively, in the two groups. The highest accuracy of FrCC is 82.0% in the men set and 85.4% in the women set. The largest individual accuracy increase is 31.2 percentage points in the women set (from 39.9% to 71.1%, 13th woman) and 18.5 percentage points in the men set (from 45.5% to 64.0%, 12th man). The FrCC-based deception detection accuracies of LDA and HMM are compared in Figs. 9 and 10.
(G) The ROC curve (Fawcett, 2006) is commonly used to analyze the performance of an identification system. Here, deception detection is a binary classification problem, in which the outcomes are labeled either positive (p) or negative (n). A binary classifier has only four possible outcomes: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). Therefore, we select two parameters, the true positive rate (TPR, sensitivity) and the true negative rate (TNR, specificity), to evaluate the performance of the LDA and HMM models. Sensitivity is the proportion of positive samples that are correctly identified during the experiments. Specificity is the proportion of negative samples that are correctly identified. The definitions are given in Eqs. (31) and (32).

$$TPR = \frac{TP}{TP + FN} \tag{31}$$

$$TNR = \frac{TN}{TN + FP} \tag{32}$$

The statistical results are presented in Figs. 11-14. The difference between the sensitivity and the specificity of each participant is not large, so LDA and HMM are suitable tools for separating normal and deceptive speech.
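The two rates can be computed from paired label lists as in the sketch below. Treating deceptive speech as the positive class is an assumption made for the example (the text does not say which class is positive), and the labels are invented.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="lie"):
    """Return (TPR, TNR) for a binary task from two equal-length label lists."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1   # positive sample, correctly detected
            else:
                fn += 1   # positive sample, missed
        else:
            if predicted == positive:
                fp += 1   # negative sample, wrongly flagged
            else:
                tn += 1   # negative sample, correctly passed
    return tp / (tp + fn), tn / (tn + fp)


truth = ["lie", "lie", "lie", "truth", "truth"]
guess = ["lie", "lie", "truth", "truth", "lie"]
tpr, tnr = sensitivity_specificity(truth, guess)
print(tpr, tnr)
```

Reporting both rates matters when the two classes are unbalanced, because a classifier that always answers with the majority class can reach a high overall accuracy while one of the two rates is zero.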
In summary, the experimental results at least indicate that acoustic features are effective for lie detection. Faint differences between truthfulness and deceptiveness can be amplified by improved acoustic features such as FrCC, and these characteristics may play an important role in deceptive speech identification.

CONCLUSION AND PROSPECT
Lie detection based on speech signal analysis is affected by many factors, such as the psychological quality of the subjects, the way of speaking, environmental interference, and the cost of being exposed, so the development of this technology is relatively difficult. The lack of a psychological and physiological research basis also slows progress in this field. In this paper, the fractional Mel cepstral coefficient (FrCC) has been proposed as the speech feature, and the linear discriminant analysis (LDA) model and the hidden Markov model (HMM) have been introduced as classifiers. The experiment results show that the clustering effect of FrCC under optimal angles is better than that of MFCC, and that the truthfulness/deceptiveness identification accuracy of FrCC is higher than that of MFCC with either LDA or HMM. This application demonstrates that the FrCC parameter can be used in deceptive speech detection, and provides further experimental evidence in this field. Future work should focus mainly on the following aspects: first, establishing a unified search mechanism for the optimal angle and completing the FrCC extraction algorithm; second, mining related features in more depth, constructing a data fusion model, enhancing the useful properties, and suppressing redundant information and interference; third, exploring time series models in more depth and enhancing the contextual information for deceptive speech detection. Speech-based deception detection may be an important aid to traditional neuroimaging methods.