Abstract

Air traffic controller fatigue has recently received considerable attention from researchers because it is one of the main causes of air traffic incidents. Numerous research studies have been conducted to extract speech features related to fatigue, and their practical utilization has achieved some positive detection results. However, the applied speech features are usually of high dimension, which leads to computational complexity and inefficient fatigue detection. It is therefore meaningful to reduce the dimensionality and select only a few efficient features. This paper addresses these problems by proposing a high-efficiency fatigued speech feature selection method based on improved compressed sensing. To adapt the method to the specific field of fatigued speech, we propose an improved compressed sensing construction algorithm to decrease the reconstruction error and achieve superior sparse coding. The proposed feature selection method is then applied to optimize the high-dimension fatigued speech features based on the fractal dimension. Finally, a support vector machine classifier is applied in a series of comparative experiments using the Civil Aviation Administration of China radiotelephony corpus to demonstrate that the proposed method provides a significant improvement in the precision of fatigue detection compared with current state-of-the-art approaches.

1. Introduction

IATA (the International Air Transport Association) has predicted that China will become the largest civil aviation market in the world by around 2025, with China’s civil aviation involving the flow of 1.6 billion passengers by around 2037 [1]. The rapid development of civil aviation represents a great challenge to air traffic control and contributes to increasing shortages of air traffic controllers (ATCs). The resulting high workloads can increase the fatigue experienced by ATCs, thus increasing the probability of human error and the associated dangerous consequences for aviation safety [2]. Research studies have demonstrated that greater fatigue is closely associated with higher risk [3]. As a result, researchers in the field of civil aviation have paid considerable attention to the accurate detection of fatigue in ATCs.

Fatigue in ATCs can be measured using a multitude of methods and tools, which can be grouped into two categories: subjective and objective methods [4]. Subjective self-rating scales and questionnaires have been the most important sources of data for assessing both ATC and pilot fatigue [5, 6]. Two renowned and validated subjective fatigue/sleepiness scales are the Karolinska sleepiness scale [7] and NASA’s task load index [8]. Although subjective methods are easy to implement, they perform poorly in detecting a fatigue state rapidly, particularly in real time. Therefore, objective methods have received a considerable amount of research interest. There are two categories of popular objective methods based on their different manifestations: (1) methods based on physiological parameters, including heart rate, blood pressure, breathing rate, electroencephalogram, and skin electricity [9–11], and (2) methods that directly record observable body actions, including voice strength, eye movement, blink times, yawning, and nodding frequency [12]. These objective methods are more accurate and can be used to formulate a reliable physiological fatigue index. The main disadvantage of these monitoring techniques is that their intrusiveness usually causes aversion and disturbance in the ATC, which reduces their accuracy.

The rapid developments in speech recognition have resulted in vocal feature-based methods recently emerging as the preferred avenue for research into fatigue in ATCs [13]. Vocal features are convenient to collect and analyse, given that the main job of ATCs involves communicating with pilots via radiotelephony, and regulations specify that all voice records must be preserved for a certain period of time. There are several analyses in the literature for the connection between vocal features and fatigue [14, 15]. In 2006, Greeley et al. demonstrated that voice features show strong correlations with fatigue in the sleep onset latency test [16]. Krajewski introduced a fatigue eigenvector composed of linear speech features such as the fundamental frequency, resonance peak, and mel-frequency cepstrum coefficient (MFCC) [17]. However, the reported average accuracy when using these features was 76.5%, which is inadequate for the work performed by ATCs.

It has been demonstrated that the detection accuracy of fatigued speech is greatly affected by feature extraction and efficient feature selection [15]. It has recently become convenient to extract common speech features such as pitch, energy, and MFCC using existing software tools (e.g., openSMILE) [18]. In addition, some state-of-the-art approaches utilizing nonlinear features based on wavelet decomposition and the fractal dimension [19] have shown more efficient results in detecting ATC fatigue. However, an excess of features results in a difficult trade-off between computational complexity and accuracy. Furthermore, the duplicated features obtained by different methods will confuse the subsequent recognition network, which consequently leads to inefficient results in detecting fatigue [20]. This situation indicates the need to achieve efficient feature selection and reduce the dimensionality of features.

Compressed sensing (CS) is a sub-Nyquist sampling technique that allows a sparse signal to be reconstructed reliably from a set of measurements to reduce the signal redundancy and reconstruction costs [21]. Many researchers have attempted to utilize this characteristic in exploring the performance of CS in dimension reduction and feature selection. For example, Haneche et al. proposed a novel speech enhancement approach based on the CS framework in 2019 [22], while Langari et al. extracted the best subset of features for speech emotion recognition by combining with CS in 2020 [23]. Although the technique of CS is beneficial for speech recognition, a considerable challenge is determining a well-designed measurement matrix that accurately represents the corresponding specific target speech signal. For this reason, the goal of this paper is to improve the conventional framework of CS to achieve the feature selection of speech, which will lead to a higher fatigue detection rate for ATCs using a popular machine learning training network, such as a support vector machine (SVM).

The rest of this paper is organized as follows. Section 2 briefly introduces the basic theory of CS. Section 3 proposes a fatigued speech detection network and describes an improved CS construction algorithm (ICSCA) in detail. Section 4 reports on the series of experiments performed to test our new method, and conclusions are drawn in Section 5. The terminology used in this paper is listed in Table 1.

2. Compressed Sensing

CS was proposed by Candes and Donoho, who constructed the initial theoretical framework consisting of signal sparse coding, measurement matrix construction, and a reconstruction algorithm. In brief, CS can sample the original signal at a rate that is much lower than the Nyquist rate and still reconstruct the original signal using only a small proportion of the sampled data. A detailed description is shown in Figure 1.

In Figure 1, $x \in \mathbb{R}^{N}$ denotes the original signal and $y \in \mathbb{R}^{M}$ is the final compressed signal, and M is usually smaller than N. In addition, $\Psi$ and $\Phi$ indicate the sparse matrix and measurement matrix, respectively.

2.1. Sparse Coding

CS theory is based on the assumption that the signal is sparse or highly compressible; in other words, most of the signal values are either zero or small enough to be ignored. Even though the signals under consideration often do not satisfy the sparse condition, it might be possible to find a basis matrix $\Psi$ that transforms the original signal linearly and ensures that the coefficient vector is sparse, in which case the original signal is also regarded as sparse. The formula for sparse coding is as follows:
$$x = \Psi s, \tag{1}$$
where $s$ represents the coefficient vector, and only K of its N entries are nonzero ($K \ll N$). The selection of the sparse matrix depends on the inherent characteristics of the signal. The common methods used in the sparse representation include the curvelet transform, wavelet transform, barren transform, discrete cosine transform, and discrete Fourier transform.

2.2. Selection of Measurement Matrix

Another major problem in CS is how to choose the measurement matrix $\Phi$. For a sparse one-dimensional signal, an $M \times N$ measurement matrix is constructed to compress the original signal and obtain a measurement signal, which can be expressed as follows:
$$y = \Phi x = \Phi \Psi s = \Theta s, \tag{2}$$
where $\Theta = \Phi \Psi$ is defined as the sensing matrix. Generally, the restricted isometry property (RIP) defined in Definition 1 is the property that the sensing matrix $\Theta$ needs to satisfy.
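As a minimal sketch of the sparse coding and measurement steps, the following example builds a K-sparse coefficient vector, synthesizes a signal, and compresses it. The orthonormal DCT basis, the Gaussian random measurement matrix, the sizes N = 256, M = 32, K = 4, and the names `dct_basis`, `Psi`, `Phi`, and `Theta` are all illustrative assumptions and are not the matrices constructed later in this paper.

```python
import numpy as np

def dct_basis(n):
    """Return an n x n orthonormal inverse-DCT basis Psi, so that x = Psi @ s."""
    k = np.arange(n)[None, :]            # frequency index (columns)
    t = np.arange(n)[:, None]            # sample index (rows)
    psi = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    psi[:, 0] /= np.sqrt(2.0)            # the DC column needs an extra 1/sqrt(2)
    return psi

rng = np.random.default_rng(0)
N, M, K = 256, 32, 4                     # illustrative sizes (assumption)

Psi = dct_basis(N)                       # sparse basis (assumed to be a DCT)
s = np.zeros(N)
s[rng.choice(N, size=K, replace=False)] = rng.standard_normal(K)  # K-sparse vector

x = Psi @ s                              # sparse coding: signal from coefficients
Phi = rng.standard_normal((M, N)) / np.sqrt(M)   # Gaussian measurement matrix
Theta = Phi @ Psi                        # sensing matrix
y = Phi @ x                              # compressed measurement of length M
```

The measurement `y` has only 32 entries, yet it equals `Theta @ s`, which is why recovering the sparse vector `s` from `y` is enough to recover `x`.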

Definition 1. For any K-sparse signal $x$ and measurement matrix $\Phi$, if there exists $\delta_K \in (0, 1)$ that is the minimum value satisfying equation (3), then $\delta_K$ is called the RIP constant of order K of $\Phi$:
$$(1 - \delta_K)\|x\|_2^2 \le \|\Phi x\|_2^2 \le (1 + \delta_K)\|x\|_2^2. \tag{3}$$
The purpose of the RIP is to ensure that the “redundant” information discarded in the process of compression measurement is controlled within an acceptable range and to prevent useful information from being discarded. The RIP has been proved to be a sufficient condition for the existence of a single feasible solution of equation (3) [24].

2.3. Reconstruction Algorithm

The process of signal reconstruction is the reverse solution of equation (1). Since M is less than N, it is an NP-hard problem for which it is difficult to obtain exact solutions. The signal reconstruction process is expressed as follows:
$$\hat{s} = \arg\min_{s} \|s\|_0 \quad \text{subject to} \quad y = \Phi \Psi s, \tag{4}$$
where $\|s\|_0$ denotes the number of nonzero elements of $s$. In order to reduce the computational complexity, many scholars have proposed replacing the $\ell_0$ norm with the $\ell_1$ norm in order to transform the problem from nonconvex to convex. Some other algorithms have also been proposed by researchers to solve this problem, such as orthogonal matching pursuit (OMP) [25], iterative hard thresholding [26], basis pursuit [27], and compressed sampling matching pursuit [28].
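To illustrate one of the greedy reconstruction algorithms named above, the following is a minimal sketch of OMP. It is a textbook version written for this illustration, not the implementation used in the experiments; the function name `omp`, the tolerance, and the demonstration sizes are assumptions.

```python
import numpy as np

def omp(theta, y, k, tol=1e-10):
    """Greedy recovery of a k-sparse vector s_hat with theta @ s_hat ~= y."""
    residual = np.asarray(y, dtype=float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the column most correlated with the current residual.
        corr = np.abs(theta.T @ residual)
        if support:
            corr[support] = 0.0          # never pick the same column twice
        support.append(int(np.argmax(corr)))
        # Re-fit all selected coefficients by least squares.
        coef, *_ = np.linalg.lstsq(theta[:, support], y, rcond=None)
        residual = y - theta[:, support] @ coef
        if np.linalg.norm(residual) < tol:
            break
    s_hat = np.zeros(theta.shape[1])
    s_hat[support] = coef
    return s_hat

# Small demonstration with a column-normalized Gaussian sensing matrix.
rng = np.random.default_rng(0)
theta = rng.standard_normal((64, 128))
theta /= np.linalg.norm(theta, axis=0)
s_true = np.zeros(128)
s_true[[5, 40, 99]] = [1.5, -2.0, 1.0]
s_rec = omp(theta, theta @ s_true, k=3)
```

Each iteration adds one atom to the support and re-solves a small least-squares problem, so the cost grows with the sparsity K rather than with the full signal length.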

In summary, when applying CS, it is necessary to ensure that the signal is sparse, which has led to some efficient reconstruction algorithms being proposed by researchers as CS theory has advanced. However, how to construct an efficient sensing or measurement dictionary for a particular type of input signal remains a challenge that needs to be overcome. Therefore, below, we propose an ICSCA that is suited to fatigued speech among ATCs.

3. Improved Fatigued Speech Feature Selection Method

3.1. Architecture of Fatigued Speech Detection

With the introduction of CS, a high-efficiency speech detection model based on the Civil Aviation Administration of China radiotelephony corpus is proposed. Signal preprocessing methods such as denoising, filtering, and preemphasis are first applied to reduce the impact of noise added during the collection process. Wavelet decomposition is then applied to the speech signal, and the detail coefficients of each signal layer are extracted. Inspired by a recently proposed nonlinear feature [29], the fractal dimension of the detail coefficients of each signal layer is calculated to extract the ATC fatigued speech features. Furthermore, an ICSCA is applied to remove the redundant information and perform the final selection of the ATC fatigued speech features. The accuracy of fatigue detection is calculated with the help of an SVM. Figure 2 shows the detailed architecture of the proposed model.

3.2. Preprocessing and Feature Extraction
3.2.1. Preprocessing

The energy of the speech signal is concentrated at low frequencies, and the high-frequency parts carry less energy. To compensate for this, signal preemphasis is used to boost the high-frequency part of the speech signal and thereby obtain the signal spectrum over the entire frequency band. Preemphasis is generally implemented by a first-order FIR high-pass digital filter, and the original signal $x(n)$ (the sample value at time n) is processed as follows:
$$x'(n) = x(n) - \alpha x(n-1), \tag{5}$$
where $x'(n)$ is the new signal and $\alpha$ represents the preemphasis coefficient, which is set as 0.95.
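The first-order preemphasis filter can be sketched in a few lines. The function name `preemphasis` and the handling of the first sample (passed through unchanged because it has no predecessor) are assumptions of this sketch; the coefficient 0.95 is the value given above.

```python
import numpy as np

def preemphasis(x, alpha=0.95):
    """First-order FIR high-pass filter: out[n] = x[n] - alpha * x[n - 1]."""
    x = np.asarray(x, dtype=float)
    # The first sample has no predecessor, so it is kept as is (a common convention).
    return np.concatenate(([x[0]], x[1:] - alpha * x[:-1]))

# A constant (zero-frequency) signal is almost entirely suppressed.
flat = preemphasis(np.ones(4))
```

Applied to a constant signal, the filter leaves only 5% of the amplitude after the first sample, which shows how strongly the low-frequency content is attenuated relative to rapid changes.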

The speech signal is a time-varying, nonstationary process whose characteristic parameters change randomly over time. Over a short-term range (generally 10–30 ms), however, the speech has relatively stable characteristics; that is, the speech signal has short-term stability. Therefore, if the speech signal is divided into short-term segments, each segment can be regarded as stable. Taking a 16 kHz sampling frequency as an example, 256 sampling points are used as a chunk, which lasts 16 ms. An overlapping segmentation method is used to ensure a smooth transition between adjacent chunks: the selected stride is 64, so 192 sample points overlap between two adjacent chunks.
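The chunking arithmetic can be checked with a short sketch. The function name `frame_signal` and the choice to drop a trailing partial chunk are assumptions; the chunk length of 256 and the stride of 64 are the values given above.

```python
import numpy as np

def frame_signal(x, frame_len=256, stride=64):
    """Split a 1-D signal into overlapping chunks (a trailing partial chunk is dropped)."""
    x = np.asarray(x, dtype=float)
    if len(x) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(x) - frame_len) // stride
    starts = stride * np.arange(n_frames)[:, None]
    return x[starts + np.arange(frame_len)[None, :]]

fs = 16000                                   # sampling frequency in Hz
signal = np.arange(fs, dtype=float)          # one second of placeholder samples
frames = frame_signal(signal)
chunk_ms = 1000.0 * 256 / fs                 # duration of one chunk in milliseconds
overlap = 256 - 64                           # samples shared by adjacent chunks
```

With these settings, one second of audio yields 247 chunks of 16 ms each, and every chunk shares its last 192 samples with the start of the next one.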

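The chunk signal is then windowed to reduce the discontinuity of the signal at the beginning and end of the chunk. This is achieved by using the Hamming window $w(n)$, and the final processed signal can be obtained as follows:
$$x_w(n) = x'(n)\, w(n), \qquad w(n) = 0.54 - 0.46 \cos\left(\frac{2\pi n}{L - 1}\right), \quad 0 \le n \le L - 1, \tag{6}$$
where L is the chunk length.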
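A minimal sketch of the windowing step follows. It assumes the standard Hamming definition, w(n) = 0.54 - 0.46 cos(2πn/(L - 1)), and chunks stored as the rows of a 2-D array; the function names are illustrative.

```python
import numpy as np

def hamming(length):
    """Standard Hamming window of the given length."""
    n = np.arange(length)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (length - 1))

def window_frames(frames):
    """Multiply every chunk (one per row) by the Hamming window."""
    frames = np.asarray(frames, dtype=float)
    return frames * hamming(frames.shape[1])

windowed = window_frames(np.ones((3, 256)))
```

The window tapers each chunk toward 0.08 at both ends, which reduces the spectral leakage caused by cutting the signal abruptly.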

Based on the above signal preprocessing, two typical and prevalent speech features (pH [30] and SWFF [31]) were selected to verify our proposed method; they are based on linear and nonlinear speech research theory, respectively. The basic signal processing of these two methods is introduced in the following sections.

3.2.2. pH Vocal Source Feature

The pH is a time-frequency feature used in speaker recognition and verification systems [30]. Research shows that this feature is closely related to the excitation source and consists of a vector of Hurst exponents [32]. The Hurst exponent H (0 < H < 1) expresses the time correlation or scaling degree of the speech signal. Its autocorrelation coefficient function (ACF) decays gradually in the following form:
$$\rho(k) \sim H(2H - 1)k^{2H - 2}, \quad k \to \infty, \tag{7}$$
where the value of H can be associated with the spectral characteristics of $x(n)$. The detailed extraction process is shown in Figure 3 [30].
Step 1: the discrete wavelet transform (DWT) is applied to decompose the speech signal into approximation coefficients $a(j, k)$ and detail coefficients $d(j, k)$, where j is the decomposition scale and k is the coefficient index of each scale.
Step 2: for each scale j, the variance $\sigma_j^2 = \frac{1}{N_j} \sum_k d(j, k)^2$ is derived from the detail coefficients, where $N_j$ is the number of available coefficients at that scale. The value of H is obtained from the slope $\theta$ of the linear regression of $\log_2 \sigma_j^2$ against j, as $H = (1 + \theta)/2$.
Step 3: the pH is composed of the resulting H values. Its first component is calculated from the original speech signal, and the other values are obtained by repeating Steps 1 and 2 for each detail coefficient sequence.
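The wavelet-based estimation of a single Hurst exponent can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a Haar wavelet implemented directly in NumPy rather than the wavelet of [30], an unweighted regression of the log2 variance against the scale, and the relation H = (1 + slope)/2; the function names and the number of levels are hypothetical.

```python
import numpy as np

def haar_details(x, levels):
    """Orthonormal Haar DWT: return the detail coefficients of scales 1..levels."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        if len(approx) % 2:                  # drop a trailing sample if the length is odd
            approx = approx[:-1]
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))
        approx = (even + odd) / np.sqrt(2.0)
    return details

def hurst_wavelet(x, levels=6):
    """Estimate H from the slope of log2(detail variance) against the scale."""
    details = haar_details(x, levels)
    scales = np.arange(1, levels + 1)
    log_var = np.array([np.log2(np.mean(d ** 2)) for d in details])
    slope = np.polyfit(scales, log_var, 1)[0]
    return 0.5 * (1.0 + slope)

rng = np.random.default_rng(1)
h_noise = hurst_wavelet(rng.standard_normal(2 ** 14))
```

For uncorrelated white noise the detail variance is the same at every scale, so the slope is close to zero and the estimate is close to H = 0.5, which serves as a sanity check of the estimator.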

3.2.3. Speech Wavelet Fractal Feature (SWFF)

The theories of fractal dimension (FD) and wavelet decomposition are applied in extracting the SWFF. A fractal is a complex system whose complexity can be described by a noninteger dimension called the fractal dimension. It can be defined from data and calculated approximately and experimentally. It is defined as follows [33]:
$$D = \lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log (1/\varepsilon)}, \tag{8}$$
where D represents the fractal dimension, $\varepsilon$ is the side length of a small cube, and $N(\varepsilon)$ is the number of small cubes needed to cover the measured geometry.

In the process of wavelet decomposition, inspired by [31], the Daubechies wavelet was chosen as the wavelet basis function because it is highly consistent with our requirements. The frequency distribution of speech signals on each scale after wavelet decomposition is shown in Figure 4, where the high-frequency coefficient is the detail coefficient.

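The detailed calculation of the FD is as follows:
Step 1: a time series $x(1), x(2), \ldots, x(N)$ with length N is set up. From it, k new time series are obtained by reconstructing the time series with a delay method: $x_m^k = \{x(m), x(m + k), x(m + 2k), \ldots, x(m + \lfloor (N - m)/k \rfloor k)\}$, $m = 1, 2, \ldots, k$.
Step 2: the curve length of each $x_m^k$ can be calculated using the following formula:
$$L_m(k) = \frac{1}{k} \left[ \left( \sum_{i=1}^{\lfloor (N - m)/k \rfloor} \left| x(m + ik) - x(m + (i - 1)k) \right| \right) \frac{N - 1}{\lfloor (N - m)/k \rfloor k} \right]. \tag{9}$$
Step 3: the length of the total sequence, $L(k)$, can be approximated as the average of the curve lengths $L_m(k)$ generated by the k delays. For different values of k, a set of curve data relating k and $L(k)$ can be obtained.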
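The three steps above match the Higuchi procedure, which can be sketched as follows. The final step, taking the FD as the slope of log L(k) against log(1/k), is the standard Higuchi convention and is an assumption here, as are the function name `higuchi_fd` and the default `k_max=10`.

```python
import numpy as np

def higuchi_fd(x, k_max=10):
    """Higuchi fractal dimension of a 1-D series (needs len(x) > k_max)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    ks = np.arange(1, k_max + 1)
    mean_lengths = []
    for k in ks:
        lengths = []
        for m in range(k):                       # Step 1: k delayed sub-series
            sub = x[m::k]
            n_seg = len(sub) - 1
            if n_seg < 1:
                continue
            norm = (n - 1) / (n_seg * k)         # Higuchi's normalization factor
            # Step 2: normalized curve length of this sub-series
            lengths.append(np.sum(np.abs(np.diff(sub))) * norm / k)
        mean_lengths.append(np.mean(lengths))    # Step 3: average over the k delays
    # L(k) behaves like k**(-D), so D is the slope of log L(k) against log(1/k).
    return np.polyfit(np.log(1.0 / ks), np.log(mean_lengths), 1)[0]

fd_line = higuchi_fd(np.arange(1000.0))
fd_noise = higuchi_fd(np.random.default_rng(0).standard_normal(4000))
```

A straight line gives a dimension of 1 and white noise gives a value close to 2, which brackets the range expected for speech detail coefficients.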

In the end, the detailed SWFF feature can be obtained from the following formula:where FD refers the FD calculation method and is set as 10. represents the FD of the detail coefficients of layer.

3.3. Improved CS Construction Algorithm

The sensing dictionary and measurement matrix are constructed based on the modified t-mean index. The inner product of each pair of corresponding atoms in the sensing dictionary and the measurement matrix is made equal to 1, and the t-mean coherence coefficient is defined as
$$\mu_t = \frac{\sum_{i \ne j} \mathbb{1}(|g_{ij}| \ge t)\, |g_{ij}|}{\sum_{i \ne j} \mathbb{1}(|g_{ij}| \ge t)}, \tag{11}$$
where $g_{ij}$ represents the element in row i and column j of the Gram matrix and $\mathbb{1}(\cdot)$ is the indicator function. In other words, the coherence coefficient is the average absolute value of all nondiagonal elements of the Gram matrix whose absolute values exceed a certain threshold t. A greedy algorithm is then used to make the Gram matrix closer to the ideal Gram matrix. Specifically, the nondiagonal elements are gradually reduced to near 0. Finally, the sensing dictionary and measurement matrix can be constructed when $\mu_t$ satisfies the threshold.
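The t-mean coherence described above, the average absolute value of the off-diagonal Gram entries that reach a threshold, can be sketched as follows. The function name `t_mean_coherence`, the convention of returning 0 when no entry reaches the threshold, and the small example matrix are assumptions of this sketch.

```python
import numpy as np

def t_mean_coherence(gram, t):
    """Mean absolute value of the off-diagonal Gram entries with |g_ij| >= t."""
    g = np.abs(np.asarray(gram, dtype=float))
    off_diagonal = ~np.eye(g.shape[0], dtype=bool)
    selected = g[off_diagonal & (g >= t)]
    # If nothing reaches the threshold, report zero coherence (assumed convention).
    return float(selected.mean()) if selected.size else 0.0

gram = np.array([[1.0, 0.5, 0.1],
                 [0.5, 1.0, -0.3],
                 [0.1, -0.3, 1.0]])
mu = t_mean_coherence(gram, t=0.2)
```

In this example the entries 0.5 and -0.3 pass the threshold while 0.1 does not, so the coefficient is the mean of 0.5, 0.5, 0.3, and 0.3.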

The above process can be described as follows:

The threshold can be set to a nonzero value to reduce the number of iterations because the Gram matrix cannot be completely iterated into the ideal Gram matrix; that is, its nondiagonal elements cannot all be made equal to zero. It has been proved that, for an $M \times N$ matrix, the minimum value of the nondiagonal elements in the ETF (equiangular tight frame) matrix is
$$\mu_E = \sqrt{\frac{N - M}{M(N - 1)}}. \tag{13}$$
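The ETF lower bound is the Welch bound, sqrt((N - M) / (M(N - 1))) for an M × N matrix, and it can be evaluated directly. The sketch below assumes that the feature dimensions used later in the experiments (M = 32 measurements from N = 256 features) are the relevant matrix size; the function name `welch_bound` is illustrative.

```python
import math

def welch_bound(m, n):
    """Smallest achievable maximum off-diagonal Gram magnitude for an m x n matrix."""
    return math.sqrt((n - m) / (m * (n - 1)))

mu_etf = welch_bound(32, 256)   # about 0.166 for a 32 x 256 matrix
```

Because the bound is strictly positive whenever M is smaller than N, driving the off-diagonal Gram entries all the way to zero is impossible, which is why the iteration needs a nonzero stopping threshold.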

The construction process and characteristics of are very similar to the ETF matrix. In this case, equation (12) can be modified aswhere , the diagonal element of matrix is equal to 1, and nondiagonal elements are equal to .

Solving equation (14) yields the measurement matrix and sensing dictionary. Equation (14) can be decomposed into the following two problems that are solved iteratively:

Evaluation and performance assessment are calculated iteratively by using OMP and equation (11). If the difference between the results of successive iterations is less than the threshold or the number of iterations exceeds the set maximum number of iterations, the algorithm is terminated.

The gradient method is used to solve . The values of the nondiagonal elements of the matrix can be reduced to reduce the coherence between different columns. The optimization process is described as follows:Step 1: define the cost function as .Step 2: calculate the gradient of the cost function:Simplify this toStep 3: the complete iteration equation iswhere is the number of iterations and is the step size, which is set as 0.001.Step 4: use OMP to evaluate the coherence coefficient of t and evaluate whether the difference between the results of two successive iterations is less than the threshold.

Two points need to be considered when solving : (i) ensuring the correlation between the sensing dictionary and measurement matrix throughout the process and (ii) ensuring the consistency between and , where should be as small as possible. For overcoming the former difficulty, we propose methods as follows.

Matrix is first constructed. Then, the tightening operator is used to shrink the nondiagonal elements in the matrix, so that the approximation degree is gradually reduced. Finally, a sensing dictionary and measurement matrix pair can be obtained by singular value decomposition.

The value range of the nondiagonal elements of the matrix is because matrix and matrix are initially column normalized. Applying the tighten operator further narrows this range to , where . A simple and easy-to-implement operator is proposed for mapping from to :

It can be seen that the above tightening operator can adjust the range of matrix nondiagonal elements in iterations with only one parameter, , which is set as 0.4.

Utilizing the SVD decomposition yields

The diagonal elements in matrix are nonnegative, and all diagonal elements are arranged from the upper-left corner to the lower-right corner. In order to be closer to , set the maximum elements in to be retained and then construct as follows:

At the same time, in order to ensure that the inner product of corresponding atoms is 1, it should be treated according to the following formula:
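The shrink, truncate, and rescale sequence can be sketched as follows. The exact tightening operator is not reproduced here, so `tighten` is a hypothetical stand-in that simply scales every off-diagonal Gram entry by a factor (0.4 is reused only as a placeholder). The M × N layout of both factors, the matrix sizes, and the function names are also assumptions.

```python
import numpy as np

def tighten(gram, gamma=0.4):
    """Hypothetical stand-in: scale each off-diagonal entry by gamma, keep the diagonal."""
    gram = np.asarray(gram, dtype=float)
    shrunk = gamma * gram
    np.fill_diagonal(shrunk, np.diag(gram))
    return shrunk

def factorize_rank_m(gram, m):
    """Rank-m factorization gram ~= psi.T @ phi with unit inner products per atom pair."""
    u, sv, vt = np.linalg.svd(gram)
    root = np.sqrt(sv[:m])                 # keep only the m largest singular values
    psi = (u[:, :m] * root).T              # m x N sensing dictionary (assumed layout)
    phi = root[:, None] * vt[:m, :]        # m x N measurement matrix (assumed layout)
    inner = np.sum(psi * phi, axis=0)      # inner product of each corresponding atom pair
    return psi / inner, phi                # rescale so that every inner product equals 1

rng = np.random.default_rng(0)
atoms = rng.standard_normal((8, 20))
atoms /= np.linalg.norm(atoms, axis=0)     # column-normalized starting matrix
gram0 = atoms.T @ atoms
psi, phi = factorize_rank_m(tighten(gram0), m=8)
```

After the rescaling, the diagonal of `psi.T @ phi` is exactly 1 while the off-diagonal entries stay small, which is the weak cross-correlation that the construction aims for.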

In summary, we construct a sensing dictionary and measurement matrix pair with weak cross-correlation.

3.4. SVM Settings

An SVM is a classification model whose mathematical strategy is to maximize the margin between different classes of data. An SVM can therefore be formalized as a convex quadratic programming problem. Here, a WLS-SVM (weighted-least-squares SVM) [34] is used for the classification process, which is formulated as

The weighting coefficient of is calculated aswhere represents the membership grade, . The WLS-SVM utilizes fuzzy c-means clustering methods to decide the rule number, which is based on the following formula:where denotes a fuzzy exponent, is the degree to which belongs to the rule, and is the cluster center. The advantage of a WLS-SVM is that general errors including noise in the input and output variables are considered as empirical errors.

Furthermore, in terms of kernel selection, we use the Gaussian radial basis function (RBF) because of its superior robustness to noise in the data. The RBF kernel in this research is the same as the activation function used by Mu et al. [35]. The mathematical model of the kernel function is as follows:
$$K(x_i, x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right),$$
where $\gamma$ is the parameter of the kernel function.
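A minimal sketch of a Gaussian RBF kernel is given below. It assumes the common parameterization exp(-gamma * ||a - b||^2), with gamma = 0.5 taken from the experimental settings; the function name `rbf_kernel` and the toy inputs are illustrative.

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian RBF kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    a = np.atleast_2d(a).astype(float)
    b = np.atleast_2d(b).astype(float)
    sq_dist = (np.sum(a ** 2, axis=1)[:, None]
               + np.sum(b ** 2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
kernel = rbf_kernel(points, points)
```

The kernel value is 1 for identical samples and decays with squared distance, so a larger gamma makes the classifier respond to more local structure in the feature space.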

4. Experimental Results

Experimental results were obtained on a Windows 10 personal computer equipped with a 64 bit Intel Core i5-9300H CPU running at 2.4 GHz and with 8 GB of RAM. All of the proposed methods were implemented using Python (version 3.7) and TensorFlow (version 1.14.0) software.

4.1. Datasets and Parameters

A fatigued speech dataset [31] consisting of 1606 speech samples from ATC radiotelephony was used in the experiment, as described in Table 2. Because the dataset contains fewer fatigued speech samples than normal speech samples, we selected a balanced subset of 824 speech samples (412 fatigued speech samples and 412 normal speech samples) to ensure the reliability of the experimental results.

The SWFF was then extracted as the original signal feature. The dimension of the SWFF was 256, and following the CS procedure, we set the final feature dimension to 32.

When setting up the SVM, the 824 speech samples were divided evenly into K = 6 groups. Each subset was used in turn as the verification set, and the remaining subsets were used as the training set, so that K models could be obtained. The average classification accuracy on the verification sets of these K models was used as the performance index of the classifier under this K-fold cross-validation (K-CV). The penalty factor was set to , and the gamma parameter was 0.5.
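The K-fold split can be sketched with NumPy alone. The shuffling seed and the generator name `k_fold_indices` are assumptions; training the classifier on each split is omitted.

```python
import numpy as np

def k_fold_indices(n_samples, k=6, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

splits = list(k_fold_indices(824, k=6))
val_sizes = [len(val) for _, val in splits]
```

Since 824 is not divisible by 6, two folds hold 138 samples and four hold 137; each sample appears in exactly one verification set, and the reported accuracy is the mean over the six models.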

4.2. Results and Analysis

In this section, the experiments were conducted using two types of prevalent fatigue features (pH and SWFF). The sparse autoencoder (SAE) [36] was used in place of the SVM classifier for comparison. Furthermore, the Gaussian random matrix and the uncompressed samples were selected for comparison with the ICSCA. The fatigue state detection results obtained by using these two different measurement matrix construction algorithms for feature sampling are shown in Figures 5–7 and Table 3.

Overall, it was clear that the SWFF achieved better detection performance than the pH under the same classification method. Comparing the classifiers, the SAE method consumed less time, but its average accuracy was far lower than that of the SVM.

In terms of the effect of different measurement matrices, compared with the detection results without feature sampling, the accuracy of ATC fatigue state detection with Gaussian random matrix feature sampling was reduced by about 2%, while the detection results with the proposed ICSCA were improved to 85.11% (pH) and 94.25% (SWFF), respectively. Finally, it can be seen that the proposed ICSCA method also has the fastest operation speed of 1.37 minutes (pH) and 1.21 minutes (SWFF), and features the highest accuracy rate of 97.11%, compared with 93.10% for DDL, 60.36% for pH, and 71.39% for SWFF. These findings demonstrate that the ICSCA proposed in this study improves both detection accuracy and operation time.

5. Conclusions

In order to detect the fatigue condition of ATCs quantitatively and rapidly, we proposed a CS-based framework for detecting fatigue from the speech of ATCs. An improved compressed sensing construction algorithm was proposed to decrease the reconstruction error and achieve superior sparse coding, and it was applied to fatigued speech feature selection to remove the redundant information in the original feature vector. Finally, pH and SWFF speech features were applied in a series of comparative experiments using the Civil Aviation Administration of China radiotelephony corpus to demonstrate that the proposed method provides a significant improvement in the precision of fatigue detection compared with current state-of-the-art approaches.

Data Availability

The radiotelephony corpus data sampled from Air Traffic Management Bureau, Civil Aviation Administration of China, used to support the findings of this study, are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors acknowledge the financial support from the National Natural Science Foundation of China (Grant no. 71874081), State Key Laboratory of Air Traffic Management System and Technology (Grant no. SKLATM202006), special financial grant from China Postdoctoral Science Foundation (Grant no. 2017T100366), and Innovation Project from the Air Traffic Management Bureau, Civil Aviation Administration of China.