SASEGAN-TCN: Speech enhancement algorithm based on self-attention generative adversarial network and temporal convolutional network

Abstract: Traditional unsupervised speech enhancement models often suffer from problems such as non-aggregation of input feature information, which introduces additional noise during training and thereby reduces the quality of the speech signal. To solve these problems, this paper analyzed the impact of non-aggregated input speech feature information on model performance. Moreover, this article introduced a temporal convolutional network and proposed the SASEGAN-TCN speech enhancement model, which captures local feature information and aggregates global feature information to improve the model effect and training stability. Simulation results showed that the model can achieve a perceptual evaluation of speech quality (PESQ) score of 2.1636 and a short-time objective intelligibility (STOI) of 92.78% on the Valentini dataset, and 1.8077 and 83.54%, respectively, on the THCHS30 dataset. In addition, this article fed the enhanced speech data to an acoustic model to verify recognition accuracy: the speech recognition error rate was reduced by 17.4%, a significant improvement over the baseline model.


Introduction
As a new technology of the internet of things, speech recognition plays an important role in various electronic products such as smart homes and vehicle-mounted equipment. However, interference from surrounding environmental noise can seriously degrade the quality and intelligibility of the speech signal.
In response to these problems, speech enhancement technology, aimed at improving the quality of the speech signal, reducing noise, and enhancing speech information, has emerged [1,2].
In the last century, owing to limited resources and immature technology, people relied mainly on traditional methods. Boll et al. [3] tried to obtain clean speech by subtracting the noise portion from the spectrum, but spectral subtraction does not work well for nonstationary noise. To address this issue, Ephraim et al. [4] reduced the impact of noise on the speech signal by calculating the average value of the samples within a window, and their experimental results show that the quality and intelligibility of the speech signal improved significantly compared to other models. To further improve performance on nonstationary noise, some researchers replaced the value at each sampling point with the median of the values within the window, which further improved denoising of nonstationary and sudden noise [5,6]. To overcome the limitations of median filtering, Widrow et al. [7] used adaptive filtering, which automatically adjusts its parameters according to the signal and noise, improves signal quality, effectively suppresses various noises, and is suitable for complex noise environments and real-time signal processing. Although traditional methods have made many achievements in the field of speech enhancement, their scope of use is still limited, for example, in recovering the detailed parts of the speech signal and in the environments where they can be applied. Deep learning methods are able to compensate for these deficiencies through data-driven feature learning, thereby achieving better noise suppression and speech enhancement [8,9].
To date, speech enhancement technology has completed the transformation from traditional signal processing methods to deep learning methods [10,11]. Grais et al. [12,13] used deep neural networks (DNN) to process speech signals, modeling the spectral or time-domain characteristics of the speech signal and uncovering the nonlinear relationship between the speech signal and the noise. Subsequently, as speech enhancement tasks became more complex, Strake et al. [14,15] introduced convolutional neural networks (CNN) into speech enhancement technology to solve complex enhancement problems. CNNs are favored by researchers for their efficient feature extraction and small number of parameters. Nonetheless, a CNN still cannot learn features directly from the original signal when processing speech, which limits its ability to model time series data [16]. To address this, Choi et al. [17,18] introduced recurrent neural networks (RNN) into speech enhancement models to improve the modeling of speech signals and noise. At the same time, Hsieh et al. [19,20] combined CNNs and RNNs to improve the model's handling of time series data while also speeding up training and prediction. In recent years, under the data-driven paradigm, autoencoders (AED) [21] and generative adversarial networks (GAN) [22] have begun to emerge. The AED model realizes unsupervised learning of low-dimensional representations of data and reduces the need for labels, making model training more flexible. The GAN model, consisting of a generator and a discriminator, is also an unsupervised learning method that achieves enhancement through adversarial training. Pascual et al. [23,24] demonstrated for the first time that its performance in the field of speech enhancement improves significantly compared to other models. However, the GAN model exhibits many problems in practical applications [25,26]. To further improve performance, Hao et al. [27] introduced deep learning techniques such as attention mechanisms into the GAN model, and experimental results showed that the resulting model can effectively capture local feature information and establish long-range dependencies in the data. To further enhance feature extraction and data generation, Pandey et al. [28] combined the AED and GAN models to implement a more flexible enhancement strategy.
This type of model performs well on speech signals. For example, the generator of a GAN can produce synthetic samples similar to real speech and improve generation quality through adversarial training. Additionally, a GAN can learn and process complex speech features, including speaking rate, pitch, and noise, bringing the model closer to the characteristics of real speech. Moreover, GAN training is unsupervised, requires no large amount of labeled speech data, and thus reduces the difficulty of data acquisition. Last but not least, the generator can simulate multiple types of noise, making the model robust in different environments and improving the effectiveness of speech enhancement. These features make GANs a powerful tool for speech enhancement tasks. Nevertheless, these models possess certain drawbacks, such as the absence of aggregated feature information. The specific reasons why the network structure may lead to discrete, non-aggregated feature information include mismatched hierarchical structures between the encoder and decoder, as well as the lack of an effective information transmission mechanism in the hierarchical design. An overly simple network structure is the main factor preventing the network from fully capturing and transmitting the correlations in complex data, which causes feature information to lose continuity and integrity during transmission. As a result, the above models still cannot obtain the best speech enhancement effect. Through investigation, this article finds that the above models ignore the impact of feature information aggregation between the encoder and decoder on performance. Therefore, this article focuses on the problem of non-aggregation of generator feature information in the GAN network.
Considering the above factors, this paper fully exploits the network advantages of the temporal convolutional network (TCN) [29]. By introducing modules such as multilayer convolutional layers, dilated causal convolutions, and residual connections, the TCN network aggregates and exchanges feature information effectively; the goal is to capture the feature information between the encoder and decoder and thereby improve the feature expression ability of the overall network. The main contributions of this article are summarized as follows:
• A novel speech enhancement model is proposed. We extend the Self-Attention Generative Adversarial Network for Speech Enhancement (SASEGAN) model [30]. By integrating the TCN network with the generator, the model captures both local and long-distance feature information to solve the problem of non-aggregation of feature information. Moreover, our model clearly improves speech signal quality and intelligibility.
• This article conducts experimental verification on Chinese and English datasets based on the SEGAN and SASEGAN models, respectively. The experimental results are strong, which validates the effectiveness and generalization of the model. During the training phase, the model has a relatively smooth and stable loss curve, which verifies that it is more stable and fits better than the other models.
The remainder of this paper is organized as follows. We introduce the two baseline models, SEGAN and SASEGAN, in Section 2. In Section 3, the SASEGAN-TCN model is proposed. In Section 4, we introduce the experimental configuration, and the results of multiple sets of experiments are analyzed and discussed in depth.

SEGAN and SASEGAN baseline models

Assume that the noisy speech signal input to the GAN model is X̃ = X + N, where X and N represent the clean speech signal and additive background noise, respectively. As shown in Figure 1, the goal of speech enhancement is to recover the clean signal X from the noisy signal X̃. The SEGAN method generates enhanced data X̂ = G(Z, X̃) using a generator G, where Z represents the latent variable passed from the encoder to the decoder. The task of the discriminator D is to distinguish the enhanced data from the real clean signal, learning to classify them as true or false. At the same time, the generator G learns to produce enhanced signals that the discriminator D classifies as true. SEGAN is trained through this adversarial scheme with least-squares loss functions. The least-squares objective functions of D and G can be expressed as:

min_D V(D) = (1/2) E_{X,X̃ ∼ p_data(X,X̃)} [(D(X, X̃) − 1)^2] + (1/2) E_{Z ∼ p_Z(Z), X̃ ∼ p_data(X̃)} [D(G(Z, X̃), X̃)^2]

min_G V(G) = (1/2) E_{Z ∼ p_Z(Z), X̃ ∼ p_data(X̃)} [(D(G(Z, X̃), X̃) − 1)^2] + λ ‖G(Z, X̃) − X‖_1

where p_data(X) and p_Z(Z) represent the probability density functions of the real data and of the latent variable, respectively, and E denotes the expected value with respect to the distribution specified in the subscript. When traditional GANs perform speech enhancement, they rely entirely on the layer-by-layer convolution operations of the CNN in the model, which may blur the event correlations of the entire sequence and fail to capture correlations between distant parts of the speech signal. The SASEGAN model combines a self-attention layer, which can adapt to nonlocal features, with the convolutional layers of the SEGAN model, and the effect is significantly improved.
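The least-squares objectives above can be illustrated with a minimal NumPy sketch. This is not code from the paper: `lsgan_losses` is a hypothetical helper, and the L1 weight λ = 100 is an assumed value.

```python
import numpy as np

def lsgan_losses(d_real, d_fake, g_out, clean, l1_weight=100.0):
    """Least-squares GAN losses in the SEGAN style (sketch).

    d_real: discriminator scores on (clean, noisy) pairs
    d_fake: discriminator scores on (enhanced, noisy) pairs
    g_out:  enhanced waveform produced by the generator
    clean:  clean reference waveform
    l1_weight: weight of the generator's L1 term (assumed value)
    """
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    g_out = np.asarray(g_out, dtype=float)
    clean = np.asarray(clean, dtype=float)
    # Discriminator: push real scores toward 1 and fake scores toward 0.
    d_loss = 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)
    # Generator: push fake scores toward 1, plus an L1 term toward the clean signal.
    g_loss = 0.5 * np.mean((d_fake - 1.0) ** 2) + l1_weight * np.mean(np.abs(g_out - clean))
    return d_loss, g_loss
```

In practice these losses would be minimized alternately, updating D on mixed real/fake batches and G on the adversarial-plus-L1 term.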
The structure of the self-attention layer is shown in Figure 2, where conv and pooling represent the convolutional layer and the max-pooling layer, respectively. Assume the input speech feature data is F ∈ R^{L×C}, where L and C represent the time dimension and the number of channels, respectively; one-dimensional convolutions are used to compute the one-dimensional feature data. The query vector (Q), key vector (K), and value vector (V) are derived as:

Q = F W^Q,  K = F_p W^K,  V = F_p W^V

where W^Q ∈ R^{C×C/k}, W^K ∈ R^{C×C/k}, and W^V ∈ R^{C×C/k} represent weight matrices whose values are determined by convolution layers with C/k channels and a (1 × 1) kernel, respectively. The optimization of the feature dimension is achieved by setting the variable k. At the same time, K and V of appropriate dimensions are obtained by introducing the variable p, which max-pools F along the time axis (yielding F_p ∈ R^{(L/p)×C}) and thus lowers the complexity of the attention computation. The attention map A and attention output O are:

A = softmax(Q K^T) ∈ R^{L×(L/p)},  O = A V

(Figure 2 illustrates the case k = 2, p = 3, C = 4, and L = 6.) A learnable scalar β is introduced to scale the attention output, and convolution and other nonlinear operations yield the final output O_out:

O_out = β conv(O) + F (2.6)
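The pooled-key/value attention step can be sketched in NumPy. This is an illustrative sketch, not the paper's TensorFlow code: the pooling factor `p`, the fixed `beta`, and the omission of the final (1 × 1) convolution are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_1d(F, Wq, Wk, Wv, p=2):
    """1-D self-attention with max-pooled keys/values (sketch).

    F: (L, C) input features; Wq, Wk, Wv: (C, C//k) projection matrices;
    p: max-pooling factor applied along time before the K and V projections,
       shrinking the attention map from (L, L) to (L, L//p).
    """
    L, C = F.shape
    # Max-pool F along the time axis by factor p.
    Fp = F[: (L // p) * p].reshape(L // p, p, C).max(axis=1)
    Q, K, V = F @ Wq, Fp @ Wk, Fp @ Wv
    A = softmax(Q @ K.T, axis=-1)  # (L, L//p) attention map, rows sum to 1
    O = A @ V                      # values aggregated over the pooled time axis
    return O, A
```

In the full layer, O would then pass through a (1 × 1) convolution, be scaled by the learnable β, and be added back to F as in Eq. (2.6).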

SASEGAN-TCN model
In the generator, existing techniques often ignore the aggregation of feature information between the encoder and the decoder, so the model cannot obtain long-distance feature dependencies. To enhance the feature representation ability between the encoder and the decoder, this paper proposes the SASEGAN-TCN model, whose generator structure is presented in Figure 3. In Figure 3, the speech signal is first converted into matrix data with a dimension of (8192 × 16) through feature extraction. Second, a downsampling operation is performed through multilayer CNNs to compress the feature information, and the self-attention layer is used to obtain the dependencies of long-distance feature information, until the latent variable Z between the encoder and the decoder is extracted. Finally, the obtained feature information is aggregated again through the TCN network layer. By virtue of the dilated causal convolutions and residual connection modules in the TCN network, the model not only avoids problems such as gradient vanishing and long-term dependence in traditional CNNs, but also achieves the aggregation of feature information between the encoder and the decoder.

Dilated causal convolution
Although the SASEGAN model generates feature vectors at each time step in the encoder, these features only describe the local information of the input sequence, and the output at each time step of the decoder is only related to the previous inputs. This situation leads to the problem of non-aggregation of the feature data in the variable Z. We choose the SASEGAN model with the self-attention mechanism at the 10th layer for research and analysis. When processing time series data, the traditional CNN has some limitations: with a fixed kernel size, the receptive field of the model is limited, so the model cannot capture time dependencies beyond a limited range. In consideration of these challenges, dilated causal convolution combines the characteristics of dilated convolution and causal convolution to enlarge the receptive field while improving parameter efficiency and parallelism. It handles long-term trends and periodic patterns well and achieves the aggregation of feature information. Its structure is shown in Figure 4. Assuming the input time series data is z, the output of the dilated causal convolution is computed as:

l[t] = Σ_{c=0}^{k−1} w[c] · z[t − d·c]

where t, d, k, l[t], z[t − d·c], and w[c] represent the time step, the dilation rate, the size of the convolution kernel, the output at time step t, the input at time step t − d·c, and the weight of the convolution kernel at kernel index c, respectively.
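The formula above can be checked with a direct NumPy sketch for a single channel; `dilated_causal_conv` is an illustrative helper, not the paper's implementation.

```python
import numpy as np

def dilated_causal_conv(z, w, d=1):
    """Single-channel dilated causal convolution (sketch).

    Implements l[t] = sum_{c=0}^{k-1} w[c] * z[t - d*c], with z[t'] = 0
    for t' < 0, so the output at time t never depends on future inputs.
    """
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(z)
    for t in range(len(z)):
        for c in range(len(w)):
            idx = t - d * c  # causal, dilated tap position
            if idx >= 0:
                out[t] += w[c] * z[idx]
    return out
```

Stacking such layers with dilation rates 1, 2, 4, ... grows the receptive field exponentially with depth, which is why the TCN can model long-range dependencies with few parameters.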

Residual module
This paper takes into account the gradient vanishing and gradient explosion problems of traditional recurrent neural networks when processing time series data. Therefore, the TCN network uses residual connections to let feature information bypass the convolution layers and pass the original input directly to the output. To alleviate the vanishing-gradient problem and improve information transfer in the network, assume the input is x and the output of the residual branch after its Rectified Linear Unit (ReLU) nonlinear operations is F; the final output o of the residual block is then:

o = F(x, W) + x

where F(x, W) and W represent the nonlinear operation of the residual branch and the network weights, respectively.
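The residual form o = F(x, W) + x amounts to adding the block input back onto the branch output, as in this minimal sketch (the ReLU branch stands in for the TCN's convolution stack):

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(x, 0)."""
    return np.maximum(x, 0.0)

def tcn_residual_block(x, branch_fn):
    """Residual connection o = F(x, W) + x (sketch).

    branch_fn plays the role of the block's convolution + ReLU path F(x, W);
    the identity skip path passes x through unchanged, so gradients can
    flow around the branch during backpropagation.
    """
    return branch_fn(x) + x
```

Because the skip path is an identity, the block can fall back to passing x through untouched when the branch contributes nothing, which is what stabilizes training in deep stacks.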

Mathematical Biosciences and Engineering, Volume 21, Issue 3, 3860-3875.

The residual connection module of the TCN network is shown in Figure 5. Through multilayer convolution layers, dilated causal convolutions, and residual connections, the TCN network aggregates feature information well and realizes the interaction of feature information, improving the overall performance and feature expression ability of the network. Accordingly, we effectively integrate the SASEGAN model with the TCN network and process the final output of the encoder in the generator (the latent variable Z) through a two-layer TCN network to aggregate feature information and improve the speech enhancement effect.

Experimental parameters
This article uses the Valentini English dataset [31] and the THCHS30 Chinese dataset [32], both with an audio sampling rate of 16 kHz. The Valentini dataset contains audio from 30 speakers in the Voice Bank corpus. The training set was recorded by 28 speakers and mixed with 10 different types of noise at signal-to-noise ratios of 15, 10, 5, and 0 dB. The test set was recorded by 2 speakers and, after recording, was mixed with 5 types of noise from the Demand audio library at signal-to-noise ratios of 17.5, 12.5, 7.5, and 2.5 dB. For THCHS30, we first adjust the sampling rate of 15 audio signals in NoiseX-92 and concatenate them into one long noisy audio stream. Second, we traverse the training and testing sets of the THCHS30 dataset, randomly select a long segment of the noisy audio, and mix it in at one of the four signal-to-noise ratios of 0, 5, 10, and 15 dB. Table 1 shows the output data dimensions of each layer of the generator used in this experiment.

Data processing
These experiments are conducted on a 2060 graphics card with 6 GB of memory under Windows, using Python 3.7 and TensorFlow 1.13. At training time, the raw audio segments in each batch are sampled from the training data with 50% overlap, followed by a high-frequency pre-emphasis filter with a coefficient of 0.95. Because the hardware configuration is limited, the TCN network used in this article has only two layers, with 32 and 16 channels, respectively. The models are trained for 10 epochs with a batch size of 10, and the learning rates of both the generator and the discriminator are 0.0002.
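The two preprocessing steps just described, 50%-overlap segmentation and 0.95 pre-emphasis, can be sketched in NumPy. These are illustrative helpers under stated assumptions (frame length 8192 to match the generator input; the paper's exact pipeline may differ):

```python
import numpy as np

def preemphasis(x, coeff=0.95):
    """High-frequency pre-emphasis filter: y[t] = x[t] - coeff * x[t-1]."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - coeff * x[:-1]))

def frame_with_overlap(x, frame_len=8192, hop=None):
    """Split a waveform into frames with 50% overlap (hop = frame_len // 2)."""
    x = np.asarray(x, dtype=float)
    hop = hop or frame_len // 2
    n = max(0, (len(x) - frame_len) // hop + 1)
    if n == 0:
        return np.empty((0, frame_len))
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])
```

Pre-emphasis boosts the high-frequency content that noise tends to mask, and the 50% overlap gives the model more training segments from the same amount of audio.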
To evaluate the effectiveness of the experiments, this article analyzes several indicators. PESQ is an objective measure of speech quality, typically ranging from -0.5 to 4.5; a higher PESQ score indicates better speech quality, and it is a pivotal metric for assessing speech encoding, decoding, and communication systems. CSIG is a mean opinion score prediction of signal distortion attending only to the speech signal; a higher CSIG score reflects less distortion of the speech itself. The mean opinion score prediction of the intrusiveness of background noise (CBAK) serves as a comprehensive indicator of background noise suppression and measures the extent of noise reduction in speech signals; a higher CBAK score signifies more effective background noise suppression. The mean opinion score prediction of the overall effect (COVL) offers a more thorough evaluation of overall quality across distortion and noise. Lastly, the segmental signal-to-noise ratio (SSNR) assesses the ratio between the speech signal and the residual noise, computed over short segments.
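Of these metrics, SSNR is simple enough to sketch directly. This is an illustrative implementation: the frame length and the common [-10, 35] dB clipping range are assumed conventions, not specified in this paper.

```python
import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, eps=1e-10, lo=-10.0, hi=35.0):
    """Segmental SNR (sketch): frame-wise SNR in dB, clipped and averaged.

    clean:    clean reference waveform
    enhanced: enhanced waveform to evaluate
    The clipping range [lo, hi] dB is a common convention (assumed here).
    """
    clean = np.asarray(clean, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    n = len(clean) // frame_len
    snrs = []
    for i in range(n):
        s = clean[i * frame_len : (i + 1) * frame_len]
        e = s - enhanced[i * frame_len : (i + 1) * frame_len]  # residual noise
        snr = 10.0 * np.log10((np.sum(s ** 2) + eps) / (np.sum(e ** 2) + eps))
        snrs.append(np.clip(snr, lo, hi))
    return float(np.mean(snrs)) if snrs else 0.0
```

Averaging per-frame SNRs (rather than one global SNR) is what makes SSNR sensitive to locally noisy segments, which matches how it is used in the tables below.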

Experimental analysis
In order to verify the effectiveness of this method, this paper first conducts experiments on the Valentini dataset. It can be seen from Table 2 that SEGAN-TCN improves on PESQ, STOI, SSNR, and other indicators compared with the SEGAN model. Specifically, PESQ, CBAK, COVL, and STOI reach 2.1476, 2.8472, 2.7079, and 92.61%, improvements of 9.0, 16.7, 3.0, and 0.5% over the noisy data; in addition, the SSNR increases by 5.3724 dB. However, CSIG is slightly reduced due to improper selection of the data processing method and insufficient model training, which will be elaborated later. During the training of the SEGAN and SEGAN-TCN models, the curves of the discriminator's fake-sample loss (d fk loss), the discriminator's real-sample loss (d rl loss), the generator's adversarial loss (g adv loss), and the generator's L1 loss (g l1 loss) are shown in Figure 6; this article records data every 100 steps and plots them. As can be seen from Figure 6, the loss curves of the SEGAN-TCN model decline more smoothly than those of the SEGAN model, and the training process is relatively stable. A decline in the d fk loss denotes the discriminator's increased proficiency in recognizing generated samples as counterfeit, while a reduction in the d rl loss indicates the discriminator's heightened ability to classify genuine samples as authentic. A diminishing g adv loss suggests the generator is succeeding in fooling the discriminator with realistic samples. Meanwhile, a decreasing g l1 loss signifies sample-level similarity between the generated and authentic signals. To further verify the generalization and effectiveness of the network, we continue with experiments based on the SASEGAN model. It can be seen from Table 3 that SASEGAN-TCN achieves 2.1636, 3.4132, 2.8272, 2.7631, and 92.78% on PESQ, CSIG, CBAK, COVL, and STOI on the Valentini dataset, improvements of 9.83, 1.9, 15.9, 5.1, and 0.7% over the noisy data, besides an SSNR improvement of 4.4907 dB. Data analysis reveals that the SASEGAN-TCN model performs well on the CSIG indicator, but processing the speech signal can reduce its quality, and the introduction of external noise leads to slight reductions in PESQ, CBAK, SSNR, and other indicators. To confront and resolve these issues, we continue with further experiments and analysis. As can be seen from Figure 7, during the training phase, the SASEGAN-TCN model not only successfully fits to the optimal state, but also exhibits more stable loss curves. To tackle the issue that the SASEGAN model reduces the quality of the speech signal and introduces external noise when processing the Valentini data, this article verifies the effectiveness and applicability of the network once again on the THCHS30 Chinese dataset. The experimental results are shown in Table 4.
PESQ, CSIG, CBAK, COVL, and STOI reach 1.8077, 2.9350, 2.4360, 2.3009, and 83.54%, and the SSNR increases to 4.6332 dB. Analysis of the experimental data shows that the SSNR of the SASEGAN model is higher while its PESQ and STOI are lower, which indicates that the SASEGAN model introduces additional noise during training and causes signal distortion. Nevertheless, the SASEGAN-TCN model proposed in this article not only ensures that the SSNR does not attenuate too much, but also effectively improves the PESQ and STOI levels. The training loss curves of the SASEGAN and SASEGAN-TCN models on the THCHS30 dataset are shown in Figure 8. The SASEGAN-TCN model remains very stable and achieves a better fit than the other models during training, which indicates that the model in this paper improves the discriminator's ability to distinguish between fake and real samples and also enhances the generator's ability to generate fake samples that closely resemble real ones. The experiments also reveal some issues worth noting. Specifically, the integration of the TCN module increases the number of model parameters, which in turn requires higher hardware costs. In addition, experiments show that the model presented in this paper performs well on long speech data, while it may perform poorly on short speech data.

Conclusions
To enhance the quality and intelligibility of speech signals effectively, this paper analyzed the characteristics of the TCN network and used modules such as multilayer convolution layers, dilated causal convolutions, and residual connections in the TCN network to effectively avoid problems like gradient vanishing. Moreover, the feature information between the encoder and decoder is aggregated, thereby improving the speech enhancement performance and feature expression ability of the network. Experimental results show that the proposed model achieves obvious improvements on the Valentini and THCHS30 datasets and exhibits stability during training. In addition, we used the enhanced speech data in speech recognition, and the word recognition error rate is reduced by 17.4% compared with the original noisy audio data. These results indicate that the SASEGAN-TCN model uses the characteristics of the TCN network to solve the non-aggregation problem, improves the model's speech enhancement performance and feature expression capability, and effectively elevates the quality and intelligibility of noisy speech data. Additionally, the speech recognition scheme proposed in this article maintains high recognition accuracy in noisy environments.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 2. The structure of the self-attention network.

Figure 4. The structure diagram of dilated causal convolution.

Figure 5. The structure of the TCN network residual module.

Figure 6. The loss curves of SEGAN and SEGAN-TCN on the Valentini dataset.

These curves are more stable compared to the SASEGAN model. This strongly confirms the higher stability and easier convergence of SASEGAN-TCN during the training process and further emphasizes the superiority of the model in processing the training data. The reduction in the discriminator losses (d fk loss, d rl loss) indicates improved recognition of fake and real samples. A lower g adv loss indicates the generator deceives the discriminator successfully, while a lower g l1 loss represents sample-level similarity between generated and real samples.

Figure 7. The loss curves of SASEGAN and SASEGAN-TCN on the Valentini dataset.

Figure 8. The loss curves of SASEGAN and SASEGAN-TCN on the THCHS30 dataset.

Table 1. Output dimensions of each convolutional layer in the generator.

Table 2. SEGAN and SEGAN-TCN experimental results on the Valentini dataset.

Table 3. SASEGAN and SASEGAN-TCN experimental results on the Valentini dataset.

Table 4. SASEGAN and SASEGAN-TCN experimental results on the THCHS30 dataset.