Nonlinear Dynamic Feature Extraction Based on Phase Space Reconstruction for the Classification of Speech and Emotion

Because linear feature parameters have shortcomings in describing speech signals, and existing time-domain and frequency-domain features cannot fully characterize the information carried by speech, this paper proposes a nonlinear feature extraction method based on phase space reconstruction (PSR) theory. First, the speech signal is analyzed using a nonlinear dynamic model. Then, the model is used to reconstruct the one-dimensional speech time series. Finally, nonlinear dynamic (NLD) features based on the reconstructed phase space are extracted as the new characteristic parameters. The performance of the NLD features is then verified by comparing recognition rates across feature sets (NLD features, prosodic features, and MFCC features). The Korean isolated words database, the Berlin emotional speech database, and the CASIA emotional speech database are used for validation, and the effectiveness of the NLD features is tested using a Support Vector Machine classifier. The results show that NLD features not only achieve a high recognition rate and excellent antinoise performance in speech recognition tasks but also can fully characterize the different emotions contained in speech signals.


Introduction
Language is the most effective medium of human communication. It contains not only interpretable text but also a large amount of paralinguistic information that reflects the emotional changes of a speaker. The interpretation of human spoken language through technologies such as speech recognition and affective computing has found a wide range of applications in diverse domains such as vehicle navigation, video surveillance, network video, and other human-computer interaction fields. Speech recognition refers to the ability of machines to convert spoken language into written text. To do this, a speech recognition system often needs to take into consideration the specific and nonspecific environment to recognize the content of speech accurately. Therefore, feature extraction and speech signal characterization are two important steps for accurate speech recognition. Currently, the most important feature extraction techniques used in speech recognition can be divided into (a) prosodic features [1], (b) phonetic features [2], (c) features based on the correlation characteristics of the spectrum [3,4], and (d) feature fusion [5]. The above features rely on the piecewise linearity of speech signals. However, studies have shown that speech signal generation is neither a linear process nor a stochastic process, but rather a nonlinear process [6]. Thus, using only the piecewise linearity of speech signals in the time and frequency domains to extract speech features loses some of the nonlinear characteristics of speech signals, making the extracted information incomplete.
With their recent development, nonlinear analysis methods have been successfully applied in various fields [7-12]. Zbancioc [7] applied the Lyapunov exponent to the extraction of spectral coefficients of MFCC and LPCC features and achieved an emotion recognition accuracy of 75%. Firooz et al. [8] evaluated nonlinear dynamic features obtained by phase space reconstruction of speech signals to improve the accuracy of automatic speech recognition. The Spanish researcher Karmele Lopez studied the chaotic characteristics of natural speech for the detection of Alzheimer's disease and showed that a speaker's lesions can be detected by extracting fractal dimension features from natural speech [9,10]. Xiang and Tan of Beijing Jiaotong University combined chaotic features of speech with other common features to detect fatigue among automobile drivers [11]. Although some researchers have studied the chaotic characteristics of speech signals, very few studies have focused on the nonlinear dynamics and geometric features of these chaotic characteristics.
Aerodynamic studies have shown that people generate vortices in the boundary layer of the vocal tract when they make sounds, and these vortices can eventually form turbulence [12]. The nature of this turbulence is chaotic. To verify the chaotic characteristics of speech signals, this paper explores the chaotic mechanism of speech signal generation from three analytical aspects: (a) power spectrum, (b) principal component analysis, and (c) phase space reconstruction. This research aims to provide a theoretical basis for extracting nonlinear dynamic features based on the chaotic characteristics of speech signals. By studying and analyzing the two main parameters of phase space reconstruction of speech signals, the minimum delay time and the embedding dimension, we realize the optimal phase space reconstruction. Then, we extract the nonlinear dynamic features from the phase space. By designing experiments that contrast the nonlinear dynamic features with MFCC features for speech recognition, we verify that the nonlinear dynamic features of speech signals not only provide high accuracy and excellent antinoise performance for speech recognition but also help in identifying emotional cues in speech.

Chaos Theory and Verification of Chaotic Characteristics in Speech

Chaos Theory.
Chaos is a seemingly irregular, random phenomenon that occurs in deterministic systems [13]. Although a chaotic system has no obvious cycle and its form of motion seems disorderly, its internal structure is ordered, and it is a new form of existence of nonlinear systems. Nonlinear dynamics is mainly used to describe a system or time series: the internal state of motion and the law of transformation of a nonlinear system or time series are analyzed qualitatively and quantitatively [6]. At present, nonlinear dynamic analysis of time series is maturing and has a relatively complete theoretical background, covering different nonlinear modeling techniques and nonlinear representations [14], such as fractal dimensions, the Lyapunov exponent, and Kolmogorov entropy, among others. These features can not only effectively distinguish signal sequences by their chaotic characteristics but also effectively describe the motion state and variation of the signal. These features, which are absent in traditional analysis methods, give an advantage to nonlinear modeling.

Verification of Chaotic Characteristics in Speech.
Two basic features are used to describe chaotic characteristics: the chaotic attractor reconstructed in a high-dimensional phase space has (a) a fractal dimension and (b) sensitive dependence on initial conditions, that is, the initial conditions have a great influence on the system [13]. If a time series has these two characteristics, we can say that the time series itself is chaotic. Based on this theory, this paper verifies the chaotic characteristics of speech signals from three aspects: (a) power-spectrum analysis [13], (b) principal component analysis [13], and (c) phase space reconstruction [13].

Power-Spectrum Analysis Method.
From the time-domain waveform, we cannot intuitively determine whether a time series is periodic or disordered. However, its power spectrum can be used to identify these regularities and can help determine whether the time series demonstrates chaotic characteristics. This analysis is based on two aspects: the number of peaks in the power spectrum and its broad-spectrum characteristics. If there are a finite number of peaks in the power spectrum, the time series is periodic. However, if there is no obvious peak in the spectrum and it demonstrates a "wide-spectrum" characteristic, we can say that the time series is turbulent or chaotic. Therefore, power-spectrum analysis serves as a theoretical basis for judging whether a signal has chaotic characteristics.
In this paper, we analyze the power spectrum of the speech signals of a single word in the Korean isolated words database [15]. The analysis is done for four cases: "15 dB," "20 dB," "25 dB," and "clean." From Figure 1, we can see that the speech signals at all four SNR levels have a wide spectrum and no distinct peak. Therefore, it can be verified that the isolated-word speech signals are chaotic.
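The peak-versus-broadband criterion can be illustrated with a small sketch. It is not the procedure used to produce Figure 1: the naive DFT periodogram, the spectral flatness score, and the function names `power_spectrum` and `spectral_flatness` are assumptions introduced here for illustration only.

```python
import cmath
import math


def power_spectrum(x):
    """One-sided periodogram |X[k]|^2 / N for k = 0..N//2, using a naive DFT."""
    n = len(x)
    mean = sum(x) / n
    centered = [v - mean for v in x]  # remove the DC offset before transforming
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(centered[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def spectral_flatness(spectrum, eps=1e-12):
    """Geometric mean over arithmetic mean of the non-DC bins (an assumed score).

    Values near 0 indicate a few sharp peaks (periodic signal); larger values
    indicate a broad spectrum without dominant peaks.
    """
    bins = [p + eps for p in spectrum[1:]]
    log_mean = sum(math.log(p) for p in bins) / len(bins)
    return math.exp(log_mean) / (sum(bins) / len(bins))


# A periodic signal: 8 cycles of a sine wave over 256 samples.
sine = [math.sin(2 * math.pi * 8 * t / 256) for t in range(256)]

# A chaotic signal: the logistic map x -> 4x(1 - x), used here as a stand-in
# for a chaotic time series (it is not speech data).
chaotic = [0.3]
for _ in range(255):
    chaotic.append(4 * chaotic[-1] * (1 - chaotic[-1]))

flat_sine = spectral_flatness(power_spectrum(sine))
flat_chaotic = spectral_flatness(power_spectrum(chaotic))
```

In this toy comparison the sine wave concentrates its power in one bin, while the chaotic sequence spreads it over all bins; a real speech frame would normally be windowed and transformed with an FFT instead of the naive DFT.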

Principal Component Analysis.
Principal component analysis (PCA) is an effective method to identify whether a time series has chaotic characteristics. The steps of the calculation are as follows.
Given a time series $[x(1), x(2), \ldots, x(N)]$, an appropriate embedding dimension $m$ is chosen to construct the trajectory matrix $X_{k \times m}$ with $k = N - (m - 1)$, which is represented as

$$X_{k \times m} = \begin{bmatrix} x(1) & x(2) & \cdots & x(m) \\ x(2) & x(3) & \cdots & x(m+1) \\ \vdots & \vdots & \ddots & \vdots \\ x(k) & x(k+1) & \cdots & x(N) \end{bmatrix}. \tag{1}$$

Then, the covariance matrix $A$ ($A \in R^{m \times m}$) of the trajectory is calculated as

$$A = \frac{1}{k} X^{T} X. \tag{2}$$

Then, the eigenvalues of the covariance matrix $A$ are solved to obtain $\lambda_i$ ($i = 1, 2, \ldots, m$). Next, we calculate the sum of all the eigenvalues, $\lambda = \sum_{i=1}^{m} \lambda_i$, and sort the eigenvalues $\lambda_i$ in descending order. We then plot the principal component spectrum, with the index $i$ on the horizontal axis and $\ln(\lambda_i / \lambda)$ on the vertical axis. If the principal component spectrum is a nearly straight line with a negative slope, this indicates that the signal has chaotic characteristics.
As shown in Figure 2, we carry out the principal component spectral analysis of four emotions ("happy," "sad," "neutral," and "anger") for speech signals with the same semantic content taken from the CASIA Chinese affective speech database [16]. As can be seen from Figure 2, the covariance matrix is used to calculate three eigenvalues ($i = 1, 2, 3$), and the resulting values $\ln(\lambda_i / \lambda)$ form a nearly straight line with a negative slope in the graph. Therefore, it can be shown that the emotional speech signals are chaotic.

Phase Space Reconstruction.
Phase space reconstruction (PSR) is the first step in analyzing nonlinear dynamics and is commonly based on the embedding theorem proposed by Takens [17]. The essence of this method is to construct an m-dimensional space vector $\{x(t), x(t + \tau), \ldots, x(t + (m - 1)\tau)\}$ from the one-dimensional time series $x(t)$ by selecting an appropriate delay time $\tau$ and embedding dimension $m$. The reconstructed high-dimensional space is equivalent to the original space. Given the time series of a one-dimensional emotional speech signal $x_i$, $i = 1, 2, 3, \ldots, N$, we select the appropriate time delay $\tau$ and embedding dimension $m$. The sequence after phase space reconstruction can be written as

$$\vec{x}_i = [x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau}], \quad i = 1, 2, \ldots, N - (m - 1)\tau. \tag{3}$$

The row vector $\vec{x}_i$ represents the location of a single attractor point in the reconstructed phase space. By the definition of nonlinear dynamical systems, these vectors are stacked to form a trajectory matrix, which gives the following PSR matrix:

$$X = \begin{bmatrix} x_1 & x_{1+\tau} & \cdots & x_{1+(m-1)\tau} \\ x_2 & x_{2+\tau} & \cdots & x_{2+(m-1)\tau} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N-(m-1)\tau} & x_{N-(m-2)\tau} & \cdots & x_N \end{bmatrix}. \tag{4}$$

The significance of a high-dimensional phase space is that the internal structure of the signal can be unfolded. The signal is projected onto a high-dimensional space, and its qualitative properties can be obtained by measuring and predicting the evolutionary trajectory in this space.
This paper reconstructs the phase space for different emotions with the same semantic content from the Berlin emotional speech database [18]. We study the overall structure and motion trajectory of the speech signals as a one-dimensional time series and in a three-dimensional reconstructed phase space for four emotional states: "happy," "sad," "neutral," and "angry." From Figure 3, we can see that the differences between the four kinds of emotional speech in the time-domain waveform are mainly reflected in features such as the number of peaks, the peak size, and the number of zero crossings. However, there are also significant differences in the overall structure and motion trajectory once the four kinds of emotional speech are reconstructed in a three-dimensional phase space. Therefore, a nonlinear dynamic model can be used to analyze the chaotic characteristics of speech signals.
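The delay-vector construction described above can be sketched in a few lines. This is a minimal illustration, assuming a 0-based Python list as input; the function name `delay_embed` is hypothetical and not taken from the paper.

```python
def delay_embed(x, m, tau):
    """Build the trajectory matrix of a delay embedding.

    Row i is [x[i], x[i + tau], ..., x[i + (m - 1) * tau]], so a series of
    length N yields N - (m - 1) * tau rows of dimension m.
    """
    if m < 1 or tau < 1:
        raise ValueError("m and tau must be positive integers")
    n_vectors = len(x) - (m - 1) * tau
    if n_vectors <= 0:
        raise ValueError("series too short for the requested m and tau")
    return [[x[i + j * tau] for j in range(m)] for i in range(n_vectors)]


# Embedding the toy series 0..9 with m = 3 and tau = 2 gives six 3-D points.
trajectory = delay_embed(list(range(10)), m=3, tau=2)
```

Each row of `trajectory` is one point of the reconstructed attractor; plotting the three columns against each other for a speech frame gives the kind of three-dimensional trajectory discussed for Figure 3.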

Nonlinear Dynamic Feature Extraction from Speech
Phase space reconstruction is one of the key techniques used to study time series with chaotic characteristics. Takens' embedding theorem [14] states that, as long as the time delay $\tau$ and the embedding dimension $m$ are appropriately selected, the one-dimensional emotional speech time series can be reconstructed into the m-dimensional vectors $X_i = [x_i, x_{i+\tau}, \ldots, x_{i+(m-1)\tau}]$. Here, $i = 1, 2, \ldots$, and this choice ensures that the reconstructed phase space retains the full information of the original one-dimensional speech signal. The emotional speech signals are analyzed in the reconstructed phase space, and then the following nonlinear dynamic (NLD) features are extracted. The algorithm flow is shown in Figure 4.

Preprocessing.
Since speech signals are nonstationary and time-varying but have short-time stationary characteristics, the following three steps are needed for the processing and analysis of speech signals: ① endpoint detection: the start and end points of the speech signal are identified based on energy and zero-crossing rate; ② preemphasis: a first-order digital filter is used to emphasize the high-frequency part of the speech signal; ③ windowing and framing: a Hamming window is used for frame processing, with a frame length of 256 and a frame shift of 128.
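Steps ② and ③ can be sketched as follows. The paper does not state the preemphasis coefficient or the unit of the frame length, so the coefficient 0.97 and the interpretation of 256 and 128 as sample counts are assumptions; endpoint detection is omitted, and the function names are hypothetical.

```python
import math


def preemphasize(x, alpha=0.97):
    """First-order high-pass filter y[n] = x[n] - alpha * x[n - 1].

    alpha = 0.97 is a common textbook value, assumed here.
    """
    if not x:
        return []
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping Hamming-windowed frames.

    Frame length and shift are assumed to be in samples; an incomplete
    trailing frame is dropped.
    """
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames


# Toy input: 1024 samples of a constant signal.
emphasized = preemphasize([1.0] * 1024)
frames = frame_signal([1.0] * 1024)
```

With a frame shift of half the frame length, consecutive frames overlap by 50%, so 1024 samples yield seven frames.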

C-C Algorithm.
The purpose of phase space reconstruction is to extend the one-dimensional speech signal into a high-dimensional space so as to fully reveal the information implicit in the time series. However, the delay time $\tau$, a key parameter of phase space reconstruction, is strongly correlated with the embedding dimension $m$. Therefore, this paper uses the C-C method [18] to calculate the delay time $\tau$ and the window delay time $\tau_w$, and from them obtains the embedding dimension $m$. Because geometric information in spatial coordinates is currently limited to two- or three-dimensional space, this paper improves the C-C method and extends the speech time series to two- and three-dimensional phase spaces to extract five nonlinear geometric features (NLD-2) from the structural trajectory contours. The specific calculations are performed in the following steps:

(1) As shown in equation (5), the time series $\{x_1, x_2, \ldots, x_N\}$ is divided into $t$ disjoint subsequences:

$$\begin{aligned} x^{(1)} &= \{x_1, x_{t+1}, x_{2t+1}, \ldots\}, \\ x^{(2)} &= \{x_2, x_{t+2}, x_{2t+2}, \ldots\}, \\ &\;\;\vdots \\ x^{(t)} &= \{x_t, x_{2t}, x_{3t}, \ldots\}, \end{aligned} \tag{5}$$

where the length of each subsequence is $l = N/t$.

(2) The correlation integral of the embedded time series is defined by the following function:

$$C(m, N, r, t) = \frac{2}{M(M-1)} \sum_{1 \le i < j \le M} \theta\left(r - \|X_i - X_j\|\right), \quad M = N - (m - 1)t, \tag{6}$$

where $\theta(\cdot)$ is the Heaviside step function and $X_i$ are the reconstructed phase points.

(3) The test statistic $S(m, N, r, t)$ of the subsequences is defined using the correlation integral $C(m, l, r, t)$:

$$S(m, N, r, t) = \frac{1}{t} \sum_{s=1}^{t} \left[ C_s(m, l, r, t) - C_s^{m}(1, l, r, t) \right]. \tag{7}$$

If the time series is independent and identically distributed, then for fixed $m$ and $t$, as $N \to \infty$, $S(m, r, t)$ is equal to zero for all $r$. However, an actual sequence is finite and its elements may be correlated, so the $S(m, r, t)$ we obtain is generally not equal to zero. The locally optimal delay can therefore be taken at a zero crossing of $S(m, r, t)$ or at the delay where $S(m, r, t)$ differs least across the radii $r$, since this implies that the reconstructed points are almost uniformly distributed. The maximum and minimum values over the radii are selected, and the difference $\Delta S(m, N, t)$ can be written as

$$\Delta S(m, N, t) = \max_{j} S(m, N, r_j, t) - \min_{j} S(m, N, r_j, t), \tag{8}$$

which measures the maximum deviation of $S(m, N, r, t)$ over the radius $r$.

(4) To calculate the time delay $\tau$ and the window delay time $\tau_w$, we must first calculate the following three quantities:

$$\bar{S}(t) = \frac{1}{n_m n_r} \sum_{m} \sum_{j} S(m, N, r_j, t), \quad \Delta\bar{S}(t) = \frac{1}{n_m} \sum_{m} \Delta S(m, N, t), \quad S_{cor}(t) = \Delta\bar{S}(t) + \left|\bar{S}(t)\right|, \tag{9}$$

where $n_m$ and $n_r$ are the numbers of embedding dimensions and radii tested, $r_j = j\sigma/2$, and $\sigma$ is the standard deviation of the time series. The delay time $\tau$ is the value of $t$ at the first zero of $\bar{S}(t)$ or at the first local minimum of $\Delta\bar{S}(t)$. The window delay time $\tau_w$ is the value of $t$ corresponding to the minimum of $S_{cor}(t)$.

(5) The embedding dimension $m$ is calculated from the relation $\tau_w = (m - 1)\tau$:

$$m = \frac{\tau_w}{\tau} + 1. \tag{10}$$
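The steps above can be sketched in plain Python. This follows the standard C-C formulation rather than the improved variant used in the paper, whose details are not given here. The maximum norm, the default dimensions 2 to 5, the four radii, the rounding of $m$, and all function names are assumptions made for this illustration.

```python
def correlation_integral(series, m, r):
    """Fraction of pairs of m-dimensional points (unit delay) closer than r (max norm)."""
    n_points = len(series) - (m - 1)
    if n_points < 2:
        return 0.0
    close = 0
    for i in range(n_points):
        for j in range(i + 1, n_points):
            dist = max(abs(series[i + k] - series[j + k]) for k in range(m))
            if dist < r:
                close += 1
    return 2.0 * close / (n_points * (n_points - 1))


def cc_statistic(x, m, r, t):
    """S(m, r, t): mean over the t disjoint subsequences of C(m) - C(1)**m."""
    total = 0.0
    for s in range(t):
        sub = x[s::t]  # x_s, x_{s+t}, x_{s+2t}, ...; unit delay here equals delay t in x
        total += correlation_integral(sub, m, r) - correlation_integral(sub, 1, r) ** m
    return total / t


def first_local_minimum(values):
    """Index of the first local minimum, falling back to the global minimum."""
    for i in range(1, len(values) - 1):
        if values[i] < values[i - 1] and values[i] <= values[i + 1]:
            return i
    return values.index(min(values))


def cc_parameters(x, t_max, dims=(2, 3, 4, 5), n_radii=4):
    """Return (tau, tau_w, m) estimated from the curves of step (4)."""
    mean = sum(x) / len(x)
    sigma = (sum((v - mean) ** 2 for v in x) / len(x)) ** 0.5
    radii = [j * sigma / 2 for j in range(1, n_radii + 1)]
    ds_bar, s_cor = [], []
    for t in range(1, t_max + 1):
        rows = [[cc_statistic(x, m, r, t) for r in radii] for m in dims]
        s_mean = sum(sum(row) for row in rows) / (len(dims) * len(radii))
        ds_mean = sum(max(row) - min(row) for row in rows) / len(dims)
        ds_bar.append(ds_mean)
        s_cor.append(ds_mean + abs(s_mean))
    tau = first_local_minimum(ds_bar) + 1      # delays are 1-based
    tau_w = s_cor.index(min(s_cor)) + 1
    m_embed = max(1, round(tau_w / tau)) + 1   # tau_w = (m - 1) * tau, rounded (assumption)
    return tau, tau_w, m_embed


# Toy chaotic series (logistic map), standing in for one speech frame.
series = [0.3]
for _ in range(119):
    series.append(4 * series[-1] * (1 - series[-1]))
tau, tau_w, m_embed = cc_parameters(series, t_max=6)
```

The pairwise distance loop is quadratic in the frame length, which is acceptable for short frames but would need a faster neighbour search for long recordings.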

Nonlinear Attribute Feature Extraction
(1) Minimum delay time: the known speech signal is represented as $[x(1), x(2), \ldots, x(N)]$. Here, we use the mutual information function to calculate the mutual information between the speech samples $x(i)$ and $x(j)$ at different time intervals. At the point where the mutual information of these two speech time series reaches a minimum, the correlation between the two variables is also minimal. The corresponding time interval is the minimum delay time $\tau$. As shown in equation (11), this paper uses the average mutual information (MI) [19] to calculate the minimum delay time $\tau$:

$$I(\tau) = \sum_{i,j} p_{i,j}(\tau) \ln \frac{p_{i,j}(\tau)}{p_i \, p_j}, \tag{11}$$

where $p_i$ and $p_j$ represent the probabilities of the sequence amplitude falling in the ith and jth segments, respectively, and $p_{i,j}$ denotes the joint probability of the amplitudes of two points of the sequence separated by the time interval $\tau$.
The mutual information quantifies the dependence between two discrete variables, and the minimum delay time corresponds to the first local minimum of the resulting mutual information curve.
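A histogram-based estimate of the average mutual information and its first local minimum might look like the sketch below. The number of amplitude bins (16), the search range, and the function names are assumptions; the paper does not specify how the probabilities are estimated.

```python
import math


def average_mutual_information(x, tau, n_bins=16):
    """Histogram estimate of I(tau) between x[i] and x[i + tau], in nats."""
    lo, hi = min(x), max(x)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero bin width for constant input

    def bin_of(v):
        return min(int((v - lo) / width), n_bins - 1)

    pairs = [(bin_of(x[i]), bin_of(x[i + tau])) for i in range(len(x) - tau)]
    n = len(pairs)
    joint, p_i, p_j = {}, {}, {}
    for a, b in pairs:
        joint[(a, b)] = joint.get((a, b), 0) + 1
        p_i[a] = p_i.get(a, 0) + 1
        p_j[b] = p_j.get(b, 0) + 1
    return sum((c / n) * math.log((c / n) / ((p_i[a] / n) * (p_j[b] / n)))
               for (a, b), c in joint.items())


def minimum_delay(x, max_tau=30, n_bins=16):
    """Delay at the first local minimum of the mutual information curve."""
    curve = [average_mutual_information(x, tau, n_bins) for tau in range(1, max_tau + 1)]
    for k in range(1, len(curve) - 1):
        if curve[k] < curve[k - 1] and curve[k] <= curve[k + 1]:
            return k + 1  # curve[0] corresponds to tau = 1
    return curve.index(min(curve)) + 1


# Toy signal: a sine wave with a period of 40 samples.
wave = [math.sin(2 * math.pi * t / 40) for t in range(400)]
delay = minimum_delay(wave)
```

For a sine wave the mutual information is high at small delays, where consecutive samples are nearly identical, and lower around a quarter period, where the two coordinates are least redundant.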
(2) Correlation dimension: the correlation dimension is a nonlinear measure of chaotic dynamics. It describes the dynamics and the self-similarity of the structure of the speech signal in the high-dimensional space and provides a quantitative measure of the complexity of that structure. The more complex the system structure is, the greater the correlation dimension will be. The correlation dimension is calculated using the G-P algorithm [20], the method proposed by Grassberger and Procaccia, as shown in equation (12):

$$D(m) = \lim_{r \to 0} \frac{\ln C(r, m)}{\ln r}, \tag{12}$$

where $D(m)$ is the correlation dimension and $C(r, m)$ is the correlation integral function. $C(r, m)$ is the proportion, among all pairs of phase points $(X_i, X_j)$ in the m-dimensional reconstructed space, of pairs whose distance is less than $r$, and is defined as

$$C(r, m) = \lim_{N \to \infty} \frac{1}{N^2} \sum_{i \ne j} H\left(r - \|X_i - X_j\|\right), \tag{13}$$

where $H(\cdot)$ is the Heaviside step function. From equation (13), the curve of $\ln C(r, m)$ versus $\ln r$ is obtained for the minimum embedding dimension $m$, and the correlation dimension is obtained by fitting the linear region of the curve.

(3) Kolmogorov entropy: it is a physical quantity used to describe the degree of disorder in a time-series distribution. Grassberger and Procaccia, who proposed the correlation dimension analysis method, demonstrated that the K entropy can be approximated using the $K_2$ entropy. The relationship between the $K_2$ entropy and the correlation integral function $C(r, m)$ can be expressed as

$$K_2 = \lim_{m \to \infty} \lim_{r \to 0} \frac{1}{\tau} \ln \frac{C(r, m)}{C(r, m + 1)}. \tag{14}$$

The entropy calculated in equation (14) is taken as the Kolmogorov entropy.

(4) Largest Lyapunov exponent: the Lyapunov exponent quantifies the average rate of local convergence or divergence of adjacent orbits in the phase space. The largest Lyapunov exponent $\lambda_1$ represents the degree of convergence or divergence of the orbits. When $\lambda_1 > 0$, the orbital divergence and the degree of chaos increase as the value of $\lambda_1$ increases. This paper uses the Wolf method [21] to obtain the largest Lyapunov exponent.
Here, we take an initial point $X_i$ in the phase space and find its nearest neighbor point $X_i'$. The distance between them is denoted $L_0$. This distance is tracked over time as the adjacent orbits in the phase space converge or diverge. When the distance $L_i'$ between the two points reaches the set value $\varepsilon$ after $n$ iterations of tracking, a new neighbor point at distance $L_i$ is retained and tracking continues from the next moment. After repeating the tracking $M$ times, we obtain the largest Lyapunov exponent using the following equation:

$$\lambda_1 = \frac{1}{t_M - t_0} \sum_{i=0}^{M} \ln \frac{L_i'}{L_i}. \tag{15}$$

Compared with other algorithms, this algorithm has the advantages of fast computation and robustness to the embedding dimension $m$, the delay time $\tau$, and noise.

(5) Hurst exponent: the Hurst exponent $H$ characterizes the long-term correlation of a time series. If $H > 0.5$, the time series displays long-term autocorrelation and is highly correlated. This paper uses the rescaled-range analysis method [22] to calculate the value of $H$. Rescaled-range analysis is a nonparametric statistical method that is not affected by the distribution of the time series. The method divides the one-dimensional emotional speech signal $[x(1), x(2), \ldots, x(N)]$ into $M$ adjacent subsequences of equal length $n$. For each subsequence $u$, we calculate the cumulative deviation $z_u$ and the standard deviation $S_u$, and then the rescaled range $R_u / S_u$, where $R_u = \max z_u - \min z_u$. The Hurst exponent is obtained from

$$\left(\frac{R}{S}\right)_n = b \, n^{H}, \tag{16}$$

where $(R/S)_n$ is the average of $R_u / S_u$ over the subsequences of length $n$ and $b$ is a constant. By taking the logarithm of both sides of equation (16), we can obtain the value of $H$, which is the Hurst exponent. For different emotional states contained in a speech signal, the changes in the value of $H$ are different. The Hurst exponent of the extracted emotional speech thus reflects the correlation between the emotion and these changes.
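Two of the attribute features above, the correlation dimension and the Hurst exponent, both reduce to fitting a slope on a log-log plot, which the following sketch illustrates. The Euclidean distance, the choice of radii and subsequence lengths, and the function names are assumptions; the limit $r \to 0$ is replaced by a least-squares fit over a few finite radii.

```python
import math


def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
            / sum((a - mx) ** 2 for a in xs))


def correlation_sum(points, r):
    """Fraction of distinct point pairs whose Euclidean distance is below r."""
    n = len(points)
    close = sum(1 for i in range(n) for j in range(i + 1, n)
                if math.dist(points[i], points[j]) < r)
    return 2.0 * close / (n * (n - 1))


def correlation_dimension(points, radii):
    """Slope of ln C(r) versus ln r over the given radii (G-P style estimate)."""
    xs, ys = [], []
    for r in radii:
        c = correlation_sum(points, r)
        if c > 0:
            xs.append(math.log(r))
            ys.append(math.log(c))
    if len(xs) < 2:
        raise ValueError("need at least two radii with a nonzero correlation sum")
    return fit_slope(xs, ys)


def rescaled_range(segment):
    """R/S of one subsequence, or None when its standard deviation is zero."""
    n = len(segment)
    mean = sum(segment) / n
    cumulative, z = 0.0, []
    for v in segment:
        cumulative += v - mean
        z.append(cumulative)
    s = math.sqrt(sum((v - mean) ** 2 for v in segment) / n)
    return (max(z) - min(z)) / s if s > 0 else None


def hurst_exponent(x, sizes=(8, 16, 32, 64)):
    """Slope of ln(mean R/S) versus ln(n) over the subsequence lengths in sizes."""
    xs, ys = [], []
    for n in sizes:
        ratios = [rescaled_range(x[k:k + n]) for k in range(0, len(x) - n + 1, n)]
        ratios = [v for v in ratios if v]
        if ratios:
            xs.append(math.log(n))
            ys.append(math.log(sum(ratios) / len(ratios)))
    return fit_slope(xs, ys)


# Points on a straight line should have a correlation dimension close to 1.
line = [(i / 299, i / 299) for i in range(300)]
dim_line = correlation_dimension(line, radii=[0.05, 0.1, 0.2])

# A steady trend is strongly persistent (H near 1); a strictly alternating
# series has a constant rescaled range (H near 0).
h_trend = hurst_exponent([float(v) for v in range(256)])
h_alternating = hurst_exponent([(-1.0) ** k for k in range(256)])
```

In practice, `points` would be the rows of the reconstructed trajectory matrix of a speech frame, and the radii would be restricted to the range where the log-log curve is approximately linear.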

Nonlinear Geometric Feature Extraction.
After the one-dimensional speech signal is mapped to a high-dimensional space using phase space reconstruction, the speech signal is analyzed in the high-dimensional space. Next, the geometric features of the phase space reconstruction, which are five trajectory-based descriptor contours, are extracted for different speech states. These five descriptors are detailed as follows: (1) The first contour: the distance from the attractor to the center is expressed as $a = [|\vec{a}_1|, |\vec{a}_2|, \ldots, |\vec{a}_N|]$.

Mathematical Problems in Engineering
Here, the attractor in the two-dimensional space is defined as $\vec{a}_i = (a_i, a_{i+\tau})$, and the attractor in the three-dimensional space is defined as $\vec{a}_i = (a_i, a_{i+\tau}, a_{i+2\tau})$.
(2) The second contour: the length of the continuous trajectory between the attractors is expressed as $l$.
(3) The third contour: the trajectory of the continuous path between the attractors.
(4) The fourth contour: the distance from the attractor to the marker line. For the time delay $\tau = 1$, when the original waveform $x(t)$ is lagged, there will be only a small difference between the two samples $x(t - 1)$ and $x(t - 2)$. This can be expressed as an identity relating the lagged samples [20]. From formula (20), we can observe that the identity will not hold when the three attractors are different. Since the dynamic factors of a chaotic system interact, the data points produced over time are also correlated [23]. Therefore, formula (21) represents the marker line. The differences between the attractors can be obtained by analyzing the distances between the attractors and the marker line.
(5) The fifth contour: the total length of the trajectory of the attractor is expressed as $S$.

Korean Isolated Words Database [15]
The isolated words database was used for speaker-independent, isolated word recognition from neutral (nonemotional) speech. The vocabulary sizes used in the experiments were 10, 20, 30, 40, and 50 words. The corpus consisted of ten digits and 40 command words, with 16 speakers repeating each word three times. For our experiment, we used the recordings of 9 speakers as the training set and the recordings of the remaining 7 speakers as the test set.

CASIA Database [16]
The CASIA database is a Chinese database developed at the Institute of Automation, Chinese Academy of Sciences. The recordings consist of six acted (simulated) emotions (Neutral, Anger, Fear, Happiness, Sadness, and Surprise) by four professional speakers (2 females and 2 males). Each emotion category consists of 300 identical texts and 100 different texts. Recordings of the same text read with different emotions are useful for comparing the acoustic and prosodic characteristics of different emotional states. For the other 100 texts, the content matched the emotion being expressed, which made it easier for the speakers to express their feelings. The recordings were made with a sampling rate of 16 kHz and a 16-bit resolution and were stored in PCM format.

Berlin Database [17]
The Berlin database is a German database recorded in an anechoic chamber at the Technical University Berlin. The database consists of recordings of 10 actors (5 females and 5 males) who simulated seven emotions (Neutral, Anger, Fear, Happiness, Sadness, Disgust, and Boredom). Each emotion category contains ten German sentences. The recordings were made with high-quality recording equipment at a sampling frequency of 48 kHz and later downsampled to 16 kHz. In our experiments, we use happy, sad, neutral, and angry as the four basic emotions from the Berlin speech database.
Taking into account the effect of the length of speech on the recognition results, this paper filters the databases to obtain 363 German sentences and 1000 Chinese sentences with an approximate speech length of five seconds. The division of the emotional speech into training and test sets is shown in Table 1.

Feature Extraction.
Previous studies have demonstrated that prosodic features [24] and MFCC features [24] are highly efficient for distinguishing between different emotional states. In this paper, we first perform a series of preprocessing operations on the speech signals. Then, we extract the prosodic features and MFCC features for each speech frame. We also extract the NLD-1 and NLD-2 features based on the phase space reconstruction method described earlier in this paper. Then, we calculate statistical functions of the above features. These statistical functions include the maximum and the minimum values, the mean, the variance, the median, the deviation, and the kurtosis. Finally, as shown in Table 2, we end up with a feature set of 150 dimensions. Normalization by a linear function transformation is used to eliminate the influence of the different scales of the various affective features, and then the performance is evaluated comprehensively.
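The statistics-then-normalize step can be sketched as follows. Interpreting the "linear function transformation" as min-max scaling to [0, 1] is an assumption, as are the population (biased) moment definitions and the function names.

```python
import statistics


def functionals(contour):
    """Summary statistics of one frame-level feature contour."""
    n = len(contour)
    mean = statistics.fmean(contour)
    variance = statistics.pvariance(contour, mu=mean)
    std = variance ** 0.5

    def central_moment(k):
        return sum((v - mean) ** k for v in contour) / n

    return {
        "max": max(contour),
        "min": min(contour),
        "mean": mean,
        "variance": variance,
        "median": statistics.median(contour),
        "skewness": central_moment(3) / std ** 3 if std > 0 else 0.0,
        "kurtosis": central_moment(4) / std ** 4 if std > 0 else 0.0,
    }


def min_max_normalize(column):
    """Linearly rescale one feature dimension to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0] * len(column)
    return [(v - lo) / (hi - lo) for v in column]


stats = functionals([1.0, 2.0, 3.0, 4.0, 5.0])
scaled = min_max_normalize([2.0, 4.0, 6.0])
```

Normalization is applied per feature dimension across utterances, so that features with large numeric ranges (for example, energy) do not dominate those with small ranges in the classifier.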

Prosodic Feature Extraction.
Prosodic features mainly describe the nonverbal information in the emotional speech signal, such as the pitch level, duration, speed, and stress of speech, as well as its fluency. The prosodic feature, also known as the "supersegmental feature," is widely recognized for its ability to discriminate emotions. Therefore, we use speech speed, average zero-crossing rate, energy, fundamental frequency, and formants as the prosodic features.

MFCC Feature Extraction.
The ability of the human ear to perceive sound intensity is related to the frequency of the sound. At low frequencies, the perception of the human ear is approximately linear in the sound frequency. At high frequencies, due to the masking effect, the perception of the human ear is nonlinear in the sound frequency, so Mel frequencies are introduced to simulate these auditory properties. This paper uses the expression $f_{mel} = 1125 \ln(1 + f/700)$ to convert the ordinary frequency $f$ (in Hz) to the Mel frequency, and the first 12 MFCC coefficients are extracted.
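The frequency conversion can be written directly from the expression above. The inverse mapping and the helper for equally spaced Mel points are additions for illustration and are not described in the paper; the function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Mel frequency for an ordinary frequency in Hz: 1125 * ln(1 + f / 700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)


def mel_to_hz(f_mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (math.exp(f_mel / 1125.0) - 1.0)


def mel_spaced_frequencies(f_low, f_high, count):
    """Frequencies in Hz that are equally spaced on the Mel scale (count >= 2)."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (count - 1)
    return [mel_to_hz(lo + k * step) for k in range(count)]


# Centre frequencies between 0 Hz and 8 kHz (the Nyquist frequency at 16 kHz).
centres = mel_spaced_frequencies(0.0, 8000.0, 10)
```

The resulting frequencies are densely packed at the low end and sparse at the high end, which mirrors the nearly linear low-frequency and compressed high-frequency perception described above.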

Classification.
Constructing a reasonable and efficient speech recognition model is the most important research challenge in the field of speech recognition technology. It requires learning from a large training corpus, which can be used to explore a variety of acoustic features for mapping the corresponding path of the speech signals to achieve the correct identification. Currently, for speech recognition tasks, both linear and nonlinear classifiers are used. The linear ones include the Naïve Bayes classifier, linear ANN (artificial neural network), and linear SVM (support vector machine). The nonlinear ones include decision trees, k-NN (k-nearest neighbor algorithm), and nonlinear ANN. Nonlinear classifiers also include SVMs, GMM (Gaussian mixture model), HMM (hidden Markov model), and sparse means classifiers, among others. Researchers have experimented with different classifiers for improving speech recognition. The most widely used classifiers for speech recognition are HMM [25,26], GMM [27,28], ANN [29,30], and SVM [31,32]. In this paper, to improve the separability of the data, the SVM classifier is used to generate a nonlinear mapping of the original features to a high-dimensional space; the kernel function chosen is the Radial Basis Function (RBF).

Experimental Setup and Analysis of Results
To verify the validity and robustness of the proposed NLD feature set, we design the following two experiments. The first experiment analyzes the influence of PSR parameter selection on the NLD feature set. The second experiment verifies the validity of NLD features for speech recognition by comparing them with traditional acoustic features.

Table 1: Division of emotional speech into training and test sets (Berlin: 363 sentences; CASIA: 1000 sentences).
Set        Berlin (five emotion classes)    CASIA (five emotion classes)
Training   47   42   53   55   46           132  132  132  132  132
Test       24   20   26   27   23           68   68   68   68   68
Sum        71   62   79   82   69           200  200  200  200  200

Table 2: Statistics of feature parameters extracted from speech.

Prosodic features (38)
  Speed
  Average zero-crossing rate
  Energy and its 1st-order maximum, minimum, and mean values
  Fundamental frequency and its 1st-order maximum, minimum, and mean values
  First formant and its 1st-order maximum, minimum, and mean values
  Second formant and its 1st-order maximum, minimum, and mean values
  Third formant and its 1st-order maximum, minimum, and mean values

MFCC (60)
  The skewness, kurtosis, mean, variance, and median of the first 12 MFCC coefficients

NLD-1 (nonlinear attribute features) (59)
  The maximum, minimum, mean, median, and variance of the Hurst exponent
  The maximum, minimum, mean, median, and variance of the minimum delay time
  The maximum, minimum, mean, median, and variance of the correlation dimension
  The maximum, minimum, mean, median, and variance of the Kolmogorov entropy
  The mean, median, and variance of the largest Lyapunov exponent

NLD-2 (nonlinear geometric features) (23)
  The maximum, minimum, mean, variance, standard deviation, skewness, and kurtosis of the first, second, and third contours
  The maximum, minimum, skewness, and kurtosis of the fourth contour
  The fifth contour

Influence of PSR Parameter Selection on NLD-2 Features.
We design two experiments to verify the validity of the two important parameters of phase space reconstruction and discuss the results under different parameters. Experiment 1: first, we generate the phase space reconstruction of speech signals using the delay time $\tau$ and the embedding dimension $m$ ($\tau = 1$, $m = 3$) as set in [20]. Next, the phase space reconstruction of speech signals is also carried out using the delay time $\tau$ and the embedding dimension $m$ obtained for each frame of the speech signal with the improved C-C method. Finally, we compare the results of the two settings. Experiment 2: in current research on spatial coordinates, the geometric information is limited to two- or three-dimensional space [13]. Therefore, we set the embedding dimension to $m = 3$ and the delay time to $\tau = 1, 2, \ldots, 5$ in order to compare the experimental results for different delay times and embedding dimensions.
We reconstruct the phase space based on the above two sets of experimental parameters. Next, we extract the five kinds of NLD-2 features from the corresponding phase space of the Berlin-DB for the recognition of five basic emotions. The experimental results are shown in Table 3 and Figure 5.
From Table 3, we can observe that, for the recognition of emotional speech, the delay time and embedding dimension obtained with our method yield a higher accuracy (75%) than the parameters reported in the literature [20]. Our system shows an increase of 33.3% for the happiness category, while the recognition rates for sadness, anger, and fear are relatively low. However, in terms of the average recognition rate, the NLD features extracted with the parameters of this paper achieve a recognition rate that is 2.5% higher.
According to the experimental results shown in Figure 5, the NLD-2 features based on our parameter-setting method do not achieve the optimal recognition rate for every emotional speech category. However, the overall recognition trend is smoother than that of the other settings, and the average recognition rate is the highest. This indicates that the five NLD-2 features extracted with the delay time $\tau$ and embedding dimension $m$ obtained by the improved C-C method are valid. It also shows that, compared with setting fixed values for the delay time $\tau$ and embedding dimension $m$, using the C-C method to set $\tau$ and $m$ for the phase space reconstruction of each frame of the speech signal yields better results for the recognition of emotional speech signals.

Validity and Verification of Robustness of the NLD Features.
In this paper, we used three methods to verify the validity of the extracted features.

Experimental Scheme 1: Speech Recognition of Isolated Words.
The ten types of NLD features based on the PSR theory are combined with the MFCC features to recognize isolated words. The experimental results are shown in Table 4 and Figure 6.
These results verify the validity and robustness of the NLD features based on phase space theory.
The experimental results show that, across different vocabulary sizes and different signal-to-noise ratios (SNRs), the recognition rate can be improved by combining NLD features with traditional linear acoustic features. Comparing the four types of feature combinations in Table 4, we can see that the complementary effect of NLD-2 features is better than that of NLD-1 features, and that the combination of the NLD features and the MFCC features yields the best results. The results also show that the recognition rate of the feature set fusing traditional linear acoustic features with NLD features increases with the vocabulary size, which can be attributed to the larger training set. Therefore, the effective information in speech signals can be better described by complementing the traditional linear features with NLD features. However, the overall recognition rate decreases as the number of words increases, because the fusion of the above features is not well suited to large vocabularies.
Therefore, new features must be considered to improve large-vocabulary speech recognition.
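As an illustration of how a robustness test at a given SNR can be set up, the following minimal sketch corrupts a signal with noise scaled to a target SNR in decibels. The noise type is an assumption (white Gaussian noise, which this section does not specify), and `add_noise_at_snr`, `mean_power`, and the sine test signal are hypothetical names and data.

```python
import math
import random


def mean_power(x):
    """Average power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(signal, snr_db, rng):
    """Return signal plus Gaussian noise scaled to the requested SNR in dB."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for the noise power.
    target_noise_power = mean_power(signal) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(signal, noise)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
clean = [math.sin(2.0 * math.pi * 5.0 * t / 200.0) for t in range(200)]
noisy = add_noise_at_snr(clean, snr_db=10.0, rng=rng)

# Check the SNR actually realized by the added noise.
residual = [n - c for n, c in zip(noisy, clean)]
measured = 10.0 * math.log10(mean_power(clean) / mean_power(residual))
print(round(measured, 3))
```

Because the scale factor is computed from the realized noise power rather than its expected value, the measured SNR matches the target up to floating-point error.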

Experimental Scheme 2: Single Language Emotion Recognition.
The prosodic features, MFCC features, NLD-1 features, and NLD-2 features are used to recognize single-language emotional speech from Berlin-DB and CASIA in two languages. The recognition results are shown in Tables 5 and 6. The confusion matrix for Chinese emotional speech recognition is provided in Table 5. We can see that, compared with MFCC, NLD-1, and NLD-2, prosodic features achieve the best recognition rate for the happy emotional state. In terms of misjudgment, the misclassification between happiness and anger is the lowest for prosodic features. This indicates that prosodic features can effectively distinguish between happy and angry emotional states. In terms of overall results, the recognition performance of MFCC is higher than that of the other three feature types, and its recognition results for the anger class are optimal. NLD-1 features are better at recognizing the neutral emotional voice, and NLD-2 features are better at recognizing sadness and fear. The recognition performance of NLD alone is not optimal. This can be explained by the fact that, for emotional speech, NLD features are effective only for recognizing some of the emotions. It also indicates that the nonlinear features can make up for the chaotic characteristics of speech that were neglected in previous studies.
In Table 6, we can see the confusion matrix for the Berlin German emotional speech corpus. The recognition effect of NLD-2 is better than that obtained using prosody, MFCC, and NLD-1. For happiness, NLD-2 correctly classifies 50 instances, which is higher than the number of instances recognized using the other feature sets. For fear, the recognition performances of NLD-1 and MFCC reach the optimal values. In terms of overall results, the recognition performance obtained using MFCC is superior to that of the other three feature types. This is because the MFCC features yield the best recognition results for sadness, neutral, anger, and fear. Comparing the results of emotion recognition in the two languages, we can see that the recognition result for emotional speech is related not only to the language of the speech database but also closely to the features. The same feature represents emotional information differently in different languages.
In Figure 7, we compare the results of single-language emotional speech recognition for German and Chinese. We can see that only prosodic features yield slightly better results in Chinese than in German. This is because, in Chinese, we obtain the highest recognition rate for happy emotional speech. Based on the recognition rates of the different features, the features can be ranked by recognition performance as follows: MFCC > NLD-2 > NLD-1 > prosodic. This is verified for both the Chinese and German emotional speech corpora. Therefore, we can state that the NLD-1 and NLD-2 features extracted in this paper can effectively characterize the emotional information in speech signals.
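The per-emotion and average recognition rates discussed above are derived from confusion matrices. A minimal sketch of that computation follows, assuming rows hold the true class and columns the predicted class; the function name `recognition_rates` and the counts are hypothetical and are not taken from Tables 5 or 6.

```python
def recognition_rates(confusion, labels):
    """Per-class rate = correctly classified / all instances of that true class."""
    rates = {}
    for i, label in enumerate(labels):
        total = sum(confusion[i])
        rates[label] = confusion[i][i] / total if total else 0.0
    return rates


labels = ["happiness", "sadness", "anger"]
# Hypothetical counts for illustration only (rows: true, columns: predicted).
toy = [
    [8, 0, 2],  # happiness: 2 instances misjudged as anger
    [1, 9, 0],  # sadness
    [3, 0, 7],  # anger: 3 instances misjudged as happiness
]
rates = recognition_rates(toy, labels)
average = sum(rates.values()) / len(rates)
print(rates, round(average, 3))
```

The off-diagonal entries show which emotions are confused with each other, which is how a statement such as "the misclassification between happiness and anger is the lowest" can be read directly from the matrix.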

Experimental Scheme 3: Speech Recognition of Mixed Language Emotion.
Prosodic features, MFCC features, NLD-1 features, and NLD-2 features are used to recognize cross-language emotional speech from Berlin-DB and CASIA in two languages.
The recognition results are shown in Table 7. This further validates the efficiency of the extracted features for recognizing emotional states from speech.
From Table 7, we can draw the following conclusions. Considering the average recognition results when each of the four feature types (prosodic features, MFCC, NLD-1, and NLD-2) is used alone, the average recognition rate is the highest for NLD-2 and the lowest for the prosodic features. We can conclude that prosodic features are better suited to recognizing emotional speech in a single language. Evaluating the results for each individual emotion, we observe that MFCC has better discriminative power for detecting sadness and that NLD-1 can better differentiate neutral emotions. However, NLD-2 provides better distinction between happiness, ... Therefore, we can state that NLD demonstrates a better distinction between different emotions of great intensity, such as sadness, neutral, happiness, and anger. From the perspective of feature fusion, we observe that adding NLD features effectively compensates for the chaotic characteristics of emotional speech signals that traditional linear acoustic features miss. In addition, we also observe that using NLD features alone gives a one-sided characterization of the emotional differences in speech signals.
This is because NLD features are obtained by treating the speech signals as a one-dimensional time series and completely ignoring the acoustic features of the emotional speech signals.
Therefore, when the NLD features and acoustic features are combined, the effective information in the emotional speech signals can be better described.
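One simple way to combine NLD and acoustic features is sketched below, under the assumption that fusion means feature-level concatenation of per-utterance vectors followed by per-column standardization. Neither step is specified in this section; the names `fuse` and `standardize` and all numeric values are hypothetical.

```python
import statistics


def fuse(*feature_sets):
    """Concatenate per-utterance vectors from several feature sets."""
    n = len(feature_sets[0])
    if any(len(fs) != n for fs in feature_sets):
        raise ValueError("all feature sets must cover the same utterances")
    return [sum((list(fs[i]) for fs in feature_sets), []) for i in range(n)]


def standardize(rows):
    """Z-score each column so that no feature set dominates by scale."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    # A constant column has zero deviation; divide by 1.0 to leave it at 0.
    stds = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        [(v - mu) / sd for v, mu, sd in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical values: two utterances, 2 "acoustic" and 1 "NLD" feature each.
acoustic = [[1.0, 200.0], [3.0, 400.0]]
nld = [[0.5], [1.5]]
fused = standardize(fuse(acoustic, nld))
print(fused)
```

The fused vectors could then be passed to a classifier such as the Support Vector Machine used in the paper; the classifier itself is outside the scope of this sketch.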

Conclusion and Further Study
In this paper, starting from the chaotic characteristics of the nonlinear generation mechanism of speech signals, we address the deficiency of linear feature parameters and the limitation of existing time-domain and frequency-domain attribute features in characterizing the integrity of speech information. We propose a nonlinear feature extraction method based on phase space reconstruction theory and verify the chaotic characteristics of speech signals from three aspects: power-spectrum analysis, principal component analysis, and phase space reconstruction. The nonlinear dynamic model is applied to the extraction of speech features, and the contribution of the NLD features extracted from speech signals is evaluated. The speech recognition experiments are designed to combine traditional linear acoustic features with the NLD features in order to verify whether this combination improves recognition performance. The experimental results for the recognition of isolated words show that adding nonlinear dynamic features effectively compensates for the chaotic features neglected by the traditional acoustic features. This indicates that merging NLD features with acoustic features can better describe the effective information contained in speech signals. From the recognition results for emotional speech, we observe that while nonlinear features alone perform well, better recognition rates can be obtained through feature fusion. For the experiments designed in this paper, the recognition network was developed by combining NLD features with acoustic features. Through our experiments, we demonstrate that while NLD features efficiently compensate for the chaotic characteristics of emotional speech signals, they give a one-sided representation of the differences in emotional speech when used alone.
In future research, we would like to explore how to integrate NLD and acoustic features to generate the strongest combination of features. Additionally, in view of the high efficiency of NLD features for emotion recognition in mixed languages, cross-database emotion recognition using NLD features is another research direction that needs to be further explored.
Data Availability
The databases used in the manuscript can be downloaded from the following two links: http://emodb.bilderbar.info/docu/#home and http://www.chineseldc.org/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.