Applying a proven filtering method to adjust the training sample of neural networks

The article notes the complexity and duration of forming a training sample for a neural network, since the correctness of the sample can only be checked by assessing the quality of the network after training. It also notes the negative impact on network quality, when filtering noise from a speech signal, of the commonly used formal method of forming a training sample that does not take into account the physical processes of data and signal transformation in real devices. Methods and means for filtering noise from speech signals are described. To solve the filtering problem, the main stages of processing a speech signal containing noise are presented and described. The article proposes choosing a filtering method based on an analysis of the noise characteristics, distinguishing between homogeneous (monotonic) and dynamically changing (random) noise, for which the filtering methods differ. When choosing a filtering method, it is proposed to take into account the degree of overlap between the frequency ranges of the noise and the speech signal. As the main way to reduce noise, an approach is proposed based on an improved and proven method of filtering noise by subtracting the spectral components of the noise from the spectrum of the signal containing noise. This approach is proposed for forming and correcting the training set of a neural network designed to reduce noise in a speech signal. The results of the practical application of the proven filtering method confirmed the feasibility of its use. An important result of the work presented in the article is the possibility of evaluating the feasibility of specific corrective changes to the neural network training set by comparing them with the filtering results of the refined and tested method.


Introduction
A significant number of filtering approaches and techniques have been developed to remove noise from audio and speech signals. However, most of them have specific properties caused by imposed constraints or accepted assumptions, so they work effectively only under certain conditions. This peculiarity is most pronounced when methods are applied to remove noise from complex signals, which include speech signals. Thus, several assumptions are often used in algorithm design, such as the assumption that voice and noise signals are uncorrelated. Such algorithms consist of two parts: a noise detector and a noise removal system. Noise filtering procedures usually start with speech-activity detection to separate pauses from the informational speech signal. Correct separation of the speech and pause boundaries makes it possible to determine the features and characteristics of the signal in the pauses, which is usually background noise. Therefore, the results of the pause-signal analysis are used to determine the characteristics of the noise for the subsequent cleaning of the speech signal. However, correctly identifying areas of speech activity in a speech signal containing noise is one of the challenging tasks in noise filtering. Various methods for detecting speech activity have been proposed in the scientific literature, but most of them have drawbacks, usually due to assumptions introduced when formalising the speech-activity detection process. One proposed option is based on the assumption that the speech signal is contained in the low-frequency part of the overall spectrum of the signal containing noise, while the noise is contained in the higher-frequency region [1]. However, the nature of noise is quite variable, and the presence of noise spectrum components in the low-frequency domain will result in errors. In addition, the frequency spectrum of some noise is wide enough to fall into both the high- and low-frequency bands, which also gives rise to errors. In [2] a method is proposed based on the assumption that the speech signal is usually harmonic and that speech activity is determined by the formants of the harmonic signal. However, speech contains not only vocalised (vowel) sounds, so in some time sections it is difficult to identify formants and speech activity may not be detected. In these works, signal processing is performed using features generated from signal spectra. The use of neural networks is usually associated with the desire to solve two problems: to separate the speech signal and the noise, and to perform noise filtering under the assumption that the two signals are independent of each other, arise from two independent sources, and can appear simultaneously; i.e., the noise is additive in nature.
However, the currently popular artificial neural networks do not always solve the problem of speech-activity extraction effectively, which is usually due to the complexity of these networks, the need to process a significant number of parameters, the complexity of network training, and the complexity of forming the training sample. The various proposed implementations of the speech-signal processing sequence, network architecture, and content and number of processing layers do not allow us to identify a method that is optimal for all types of noise and speech signals, which is also usually due to the diversity of the input data and the complexity of correct network training. Changing the processing sequence and pre-filtering the noisy signal, with the voice signal isolated by the input layers of the network [3,4] and the cleaned voice signal processed afterwards, does not provide significant advantages, because implementing a noise-reduction approach based on pre-cleaning the audio signal before applying formal, automatic speech-recognition methods is fraught with additional difficulties. A filtering procedure must not introduce distortions into the information signal after cleaning, otherwise it will reduce the quality of automatic recognition. For example, as noted in [5], mask-based noise reduction systems do not improve the result of speech recognition but lead to the opposite result, changing the spectral characteristics of the final signal and increasing the unnatural sound of speech, so this approach is not recommended for cleaning a speech signal from noise.
Noise filtering often introduces distortions and unnatural sounds, resulting in additional artefacts. Complete noise reduction that is both effective and correct is difficult, so in practice it is limited by the task at hand. For music listening, for example, the goal is to maintain a natural sound after the noise is removed, but individual frequency components that are hard to hear may be significantly reduced or removed along with the noise. It is also possible to introduce small distortions into the filtered signal if they are not strongly perceptible. However, such an approach cannot be used, for example, in forensic applications. Historically, methods to improve the quality of recognition and speech intelligibility have been used in noise filtering, including methods based on an autoregressive model of the speech signal, on hidden Markov models, and on the estimation of noise parameters with minimisation of the RMS error, in addition to artificial neural networks. Noise reduction is also applied in digital image processing, and some of those methods are used for speech-signal processing [6,7], for example the well-known non-local averaging method. This approach is based on analysing the spectrogram of a sound signal as an image. One of the advantages of non-local averaging is the reduction of "musical noise", which occurs when using algorithms that implement the spectral subtraction method. It is known that a smoothing procedure is used during speech-signal processing. Smoothing is necessary to eliminate local spikes and excessive "indistinctness" of a speech signal and allows the level of "musical noise" to be reduced, but averaging, as a rule, leads to the loss of some informational components of the speech signal and negatively affects the formation of spectral coefficients during processing. The possibility of applying non-local averaging to the real and imaginary parts of the audio-signal spectrum for noise reduction is analysed in [8]. Based on the presence of areas of stationarity and periodicity in the speech signal, [9] provides a rationale for using a non-local averaging method to suppress noise in speech signals with a positive effect. In this case the main task is to find identical fragments in the speech signal. Special techniques are used to find them in approaches similar to [9], since the result of non-local averaging algorithms depends on the presence of such fragments. Although such patches do occur in vocalised sounds, they are less likely than in stationary images. In addition, the speech signal usually changes at a fairly high rate and has a more random nature than an image. Therefore, averaging methods are less efficient for speech signals than for images.
Based on the above-mentioned assumption that noise is contained in the high-frequency part of the signal spectrum, other works suggest using wavelet analysis to clean the speech signal from noise. In this case, noise elimination is performed by removing the high-frequency components of the speech-signal spectrum. When a wavelet transform is used, noise filtering is done by limiting the level of the detail coefficients. Noise here refers to short-term changes in the signal that produce detail coefficients, which are assumed to have high noise content. The noise level is reduced by zeroing out the detail coefficients below a selected threshold. The assumption that noise can only be found in the high-frequency part of the signal spectrum, however, as noted above, is not always correct, because noise quite often also appears in the low-frequency part of the spectrum. Additionally, removing the high-frequency part of the signal spectrum above an artificially assigned threshold may distort the cleaned speech signal. Some works describe rather successful application of the wavelet transform to remove white Gaussian noise using different wavelet families, under the assumption that band-limited noise with a uniform spectrum covers the whole frequency range of the speech signal, so that it is practically white noise. Unfortunately, such assumptions greatly limit the practical application of algorithms developed under these constraints.
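To make the thresholding idea concrete, the following sketch applies a one-level Haar wavelet transform and zeroes the detail coefficients whose magnitude falls below a threshold. It is a minimal illustrative example under our own assumptions (function names, threshold value, and signal parameters are ours), not the implementation used in the cited works.

```python
import numpy as np

def haar_dwt(x):
    """One-level Haar wavelet transform: approximation and detail coefficients."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse one-level Haar transform (exact reconstruction)."""
    out = np.empty(2 * len(approx))
    out[0::2] = (approx + detail) / np.sqrt(2.0)
    out[1::2] = (approx - detail) / np.sqrt(2.0)
    return out

def denoise_haar(x, threshold):
    """Hard thresholding: zero the detail coefficients below the threshold."""
    approx, detail = haar_dwt(x)
    detail = np.where(np.abs(detail) < threshold, 0.0, detail)
    return haar_idwt(approx, detail)

# A slowly varying tone plus weak wideband noise
t = np.linspace(0.0, 1.0, 256)
clean = np.sin(2 * np.pi * 2 * t)
rng = np.random.default_rng(0)
noisy = clean + 0.05 * rng.standard_normal(t.size)
restored = denoise_haar(noisy, threshold=0.2)
```

Note that the sketch also exhibits the limitation discussed above: any low-amplitude high-frequency content of the clean signal is removed together with the noise.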
The high popularity and positive results of neural networks have given rise to the opinion that they can be used to solve most high-complexity application tasks. The large number of layers in modern neural networks, the complexity of their structure and neurons, and the resulting significant computational capability of the networks have allowed developers to abandon the historically used approach of serial processing of input data and signals by methods based on physical transformations of data and signals in real devices, and to replace them with specially designed and trained neural networks that perform the filtering procedure formally. However, as practice has shown, a significant number of the difficulties encountered when implementing neural networks are associated with the complexity of forming a training sample. The formation of a training sample for a neural network is a labour-intensive and time-consuming process, because the result of any corrective change to the training sample can be obtained only after the quality of the neural network trained with the corrected sample has been checked. In practice, training-sample formation consists of multiple corrective modifications of the training sample, each followed by an evaluation of the achieved quality of the neural network. It is quite difficult to reduce the labour intensity and optimise this process, as it is difficult to predict the impact of specific corrective changes on the quality of the neural network as a whole. In addition, the failure to take into account the characteristics of physical processes in real devices, such as noise filtering, which is a consequence of the formal approach implemented by neural networks, does not allow network developers to properly form and adjust the training sample, that is, to adequately assess the changes made to it. This is largely because the results of sample correction can be obtained only after the neural network trained on a given sample has processed a significant amount of initial data or, when solving a filtering problem, signals. As one way to solve this problem, it is proposed to compare the results of processing input data or signals by a neural network with the results of processing the same input data or signals by a reliable, proven method based on physical transformations in real devices. However, the peculiarity of the described task requires a refinement of this method to improve the efficiency of its application. A simple and proven method based on subtracting the noise spectrum from the spectrum of the speech signal containing noise, i.e. filtering the so-called additive noise, has been used to solve the given filtering problem. The filtering results of the refined method confirmed the feasibility of its application in the proposed approach of forming a training sample based on comparing the quality of the filtering results obtained by the neural network and by this refined method.

Methods
Given today's intensive development of sound technologies, the relevance and practical value of the problem of filtering noise in a speech signal are beyond doubt. Indeed, in modern conditions communication almost always takes place amid noise, regardless of place: indoors (laboratory, classroom, production shop), in the street (in the city, in the countryside, in the field), or in transport (train, bus, car, aeroplane). In addition, in a number of special cases, such as the cockpit of a helicopter or an aircraft, the noise can be varied and its level significant. Usually, formal recognition of speech by computer focuses on methods that provide sufficiently high recognition quality under the assumption that the noise level is negligible or that there is no noise. Therefore, when the noise conditions are close to ideal, that is, when there is practically no noise, such methods provide high recognition quality, but when noise appears, the recognition reliability decreases significantly. This has led to the need to assess the degree of influence of noise on the recognition realised by each method, as well as to analyse the nature of the noise and its characteristics, and to distinguish noise from speech in a signal containing noise. Analysis of the nature and characteristics of noise sources allows their cause to be identified, localised and, in some cases, eliminated. This is the most radical way to reduce the impact of noise on the speech signal and on recognition. However, when noise sources cannot be eliminated, noise filtering is resorted to. In all cases noise sources vary in nature, noise-signal level, amplitude and frequency range. The noise characteristics derived from noise analysis are used for filtering; in particular, the frequency range of the noise signal is one of its most important characteristics. The main issue in this case is the determination of the frequency characteristics of the noise, which can be obtained in the pauses between words or at the beginning and end of a speech message, i.e. outside the speech activity. One of the long-standing and well-known methods of speech-activity extraction is the use of a voice activity detector (VAD). Unlike earlier versions of VAD based on linear prediction coefficients, the rate and number of zero crossings, and signal-energy analysis, modern implementations of VAD use cepstral coefficients, wavelet-transform coefficients and spectral entropy, so recent versions of VAD give better results. However, more efficient methods based on analysing the spectral energy of the audio signal are also being applied.
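A minimal sketch of an energy-based speech-activity detector of the kind mentioned above follows; the frame length, hop size and threshold ratio are illustrative assumptions, not values from any particular VAD standard.

```python
import numpy as np

def energy_vad(signal, frame_len=256, hop=128, threshold_ratio=2.0):
    """Mark frames as speech when their short-time energy exceeds
    threshold_ratio times an estimated noise floor."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(f ** 2) for f in frames])
    # Estimate the noise floor from the 10% lowest-energy frames,
    # assumed to correspond to pauses between words
    noise_floor = np.mean(np.sort(energy)[: max(1, len(energy) // 10)])
    return energy > threshold_ratio * noise_floor

# Silence (weak noise), then a burst of 'speech'
rng = np.random.default_rng(1)
sig = 0.01 * rng.standard_normal(4096)
sig[1500:2500] += np.sin(2 * np.pi * 200 * np.arange(1000) / 8000)
flags = energy_vad(sig)
```

The pause frames identified this way (the `False` entries of `flags`) are exactly the regions from which the noise characteristics are estimated in the approach described above.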
Real-time noise filtering is a highly topical and the most difficult noise-removal task. Due to its importance and urgency, the methods for solving this problem are constantly being improved. Thus, the latest versions of Microsoft's programs, the best known of which are Skype and Teams, implement noise-filtering methods based on recurrent neural networks with LSTM blocks at the final stages of processing, and also on convolutional neural networks [10,11], which have shown rather high efficiency in solving the filtering problem. Indeed, many works, including [12-14], note the effectiveness of convolutional neural networks for noise filtering, so they deserve detailed consideration: for example, the implementation of TasNet [12] using LSTM at the last stage, its modification Conv-TasNet using dilated convolutional layers, and the similar implementation WaveNet [13].
The basis for subsequent modifications was the TasNet architecture [12], built from a convolutional encoder and decoder, shown in Fig. 1. The encoder in Fig. 1 converts the selected mixed-signal segment into a multidimensional representation, and the decoder reconstructs the original signals in separated form using the mask generated by the separation module for each of the mixture sources. The TasNet modification based on convolutional neural networks, Conv-TasNet, proved to be the most efficient neural-network implementation of noise reduction with speech-signal enhancement.
Three main components can be identified in the Conv-TasNet architecture, shown in Fig. 2: an encoder, a separator and a decoder. The convolutional encoder models the waveforms. The separation module (TCN) uses the encoder output to generate the masks used to separate the sources of the input mixture. In the separation module (TCN), the different colours of the convolution blocks reflect different dilations. The main task of neural-network training here is to separate the signal sources given their known number and the total signal, i.e. their mixture. This rather complex task is solved in several steps after feature extraction using 1-D blocks, denoted 1-D Conv in the scheme. The scheme therefore uses several modules with similar blocks for separation and has a complex structure, as noted in [14]. The signal-separation approach is often used in music processing for music-source extraction. For example, DEMUCS [15] is used for music-source separation and noise filtering, but such implementations have their own specifics, which are reflected in their architecture, training, input signals and training samples. Moreover, most of the processing methods implemented by neural networks introduce distortions during filtering, which can be partially eliminated by smoothing and by correcting the training sample. Thus, as shown above, efficient neural networks have a complex architecture, and their application requires proper training and correct, careful composition of the training sample, which entails additional labour costs.
Much of the work on noise filtering has focused on traditional methods. Traditional filtering methods are based on determining the frequency spectra of the noise and the speech and their ranges of variation. Clearly, to use traditional methods it is necessary to correctly determine the noise signal or the pure information signal. The most effective method is based on subtracting the amplitudes of the spectral components of the noise from the amplitudes of the spectral components of the signal containing noise, or on calculating the difference of the power spectral densities of the signals. These methods are similar to Wiener filtering and differ in their implementation; for example, a filter may be chosen based on the criterion of minimum discrepancy between the pure signal without noise and the filtered signal, with the standard deviation chosen as the minimisation criterion. After noise filtering, smoothing is usually applied to remove "musical" distortion using a Gaussian filter [16,17], which has long been used effectively in image processing and is therefore often used here as well.
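The subtraction-plus-smoothing idea can be sketched as follows. This is a minimal sketch under our own assumptions: `spectral_subtract` is a hypothetical helper, and the kernel radius and sigma are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(radius=2, sigma=1.0):
    """1-D Gaussian kernel normalised to sum to one."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def smooth(values, radius=2, sigma=1.0):
    """Gaussian smoothing by convolution, keeping the original length."""
    return np.convolve(values, gaussian_kernel(radius, sigma), mode="same")

def spectral_subtract(noisy_mag, noise_mag):
    """Subtract the estimated noise magnitudes, clip at zero, then smooth
    the result to suppress isolated 'musical noise' spikes."""
    return smooth(np.maximum(noisy_mag - noise_mag, 0.0))

# A flat noise floor plus a single spectral peak at bin 10
noise_mag = np.full(64, 0.3)
signal_mag = np.zeros(64)
signal_mag[10] = 2.0
cleaned = spectral_subtract(signal_mag + noise_mag, noise_mag)
```

The clipping at zero is what produces the isolated residual peaks heard as "musical noise"; the Gaussian smoothing pass spreads them out, which is the motivation given above for applying a Gaussian filter after subtraction.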

Discussion
To separate the speech signal from the noise it is necessary not only to correctly identify the boundaries of the speech activity, but also, as in speech recognition, to distinguish words and sentences in the speech message. This ensures more accurate noise filtering and avoids errors during subsequent word and sentence recognition. Next, time and level (amplitude) scaling of the signal is applied to eliminate the influence of word rate (speech tempo) and amplitude on the results of word analysis during recognition. Since the main characteristic of the signal for recognition and filtering is the frequency spectrum, it is obtained after scaling, before the basic signal-processing operations for recognition and before filtering. In computer implementations, a discrete Fourier transform (DFT) with a windowing function to account for time, or a wavelet transform, is usually used to obtain the frequency spectrum of the signal. Since the speech signal of any speaker is usually not ideal and contains a significant number of accompanying sounds (wheezing, etc.), its frequency spectrum is heterogeneous, contains spikes and is rugged. Therefore, as a rule, a smoothing filter, e.g. a median filter, is used. The signal prepared in this way is filtered by subtracting the spectral noise components from the speech signal containing noise. After filtering, the frequency spectrum is converted back into a signal to form the cleaned signal. Consequently, the stages of speech-signal processing in noise filtering largely coincide with the processing stages in speech recognition, since the main criterion for extracting noise from the speech signal during filtering is the values of the components of the frequency spectrum of the noise.
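The core of the processing chain just described (windowed DFT, subtraction of noise components from the spectrum, inverse transform) can be sketched as follows. The frame length, hop size and Hann window are illustrative assumptions, and `denoise` is our own hypothetical helper, not the authors' implementation.

```python
import numpy as np

def stft(x, frame_len=256, hop=128):
    """Windowed DFT of overlapping frames of the signal."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len + 1, hop)]
    return np.array([np.fft.rfft(f) for f in frames])

def istft(spectra, frame_len=256, hop=128):
    """Overlap-add reconstruction of the time-domain signal."""
    window = np.hanning(frame_len)
    out = np.zeros(hop * (len(spectra) - 1) + frame_len)
    norm = np.zeros_like(out)
    for n, spec in enumerate(spectra):
        start = n * hop
        out[start:start + frame_len] += np.fft.irfft(spec, frame_len) * window
        norm[start:start + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-12)

def denoise(x, noise_mag):
    """Spectrum, subtraction of noise magnitudes, reconstruction."""
    spectra = stft(x)
    mag = np.abs(spectra)
    phase = np.angle(spectra)
    cleaned = np.maximum(mag - noise_mag, 0.0) * np.exp(1j * phase)
    return istft(cleaned)

x = np.sin(2 * np.pi * 440 * np.arange(1024) / 8000)
restored = denoise(x, noise_mag=0.0)  # zero noise estimate: near-exact round trip
```

With a zero noise estimate the chain reduces to an analysis-synthesis round trip, which is a convenient sanity check that the filtering stage itself, and not the transform, is responsible for any change to the signal.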
Thus, the general sequence of processing steps for a speech signal containing noise can be represented as follows:
1. Speech activity extraction.
2. Time and amplitude scaling of the signal.
3. Signal frequency spectrum generation.
4. Spectrum smoothing.
5. Subtraction of the spectral noise components from the speech signal containing noise.
6. Generation of a cleaned speech signal.
Ways to reduce the impact of noise on speech recognition usually come down to the selection of effective noise-filtering methods, which in practice are chosen depending on the results of analysing the noise characteristics, in particular the most important ones: the frequency range of the noise and the extent to which it overlaps the frequency range of the speech signal, as well as the nature of the noise variation, uniform or dynamically varying.
1. When the frequency range of the noise does not coincide with that of the speech signal, the simplest method, bandpass filtering, removes the noise components lying outside the speech frequency range.
2. When the noise is homogeneous (monotonic) and its frequency range coincides with that of the speech signal, the amplitudes of the frequency components of the noise spectrum are subtracted from the frequency components of the signal containing the noise. To realise the subtraction of the noise spectrum, its statistical features are determined: the arithmetic mean of the energy density for each frequency as a function of time (window number) when the signal is split into N Fourier-transform windows; then the standard deviation relative to the previously obtained arithmetic mean is calculated for each frequency. The threshold value for each frequency is obtained as the sum of the arithmetic mean of the energy density and the standard deviation.
Thus, the statistical features can be determined as follows:
- the arithmetic mean of the energy density for each frequency f as a function of time (window number) in the Fourier transform:
  m_f = (1/N) Σ_{i=1..N} A_{f,i},
  where m_f is the arithmetic mean of the energy density for frequency f, A_{f,i} is the amplitude of the frequency component in window i, and N is the number of windows;
- the standard deviation relative to the arithmetic mean for each frequency:
  σ_f = sqrt((1/N) Σ_{i=1..N} (A_{f,i} − m_f)²),
  where σ_f is the standard deviation for frequency f;
- the threshold value for each frequency, obtained as the sum of the arithmetic mean of the energy density and the standard deviation:
  T_f = m_f + σ_f,
  where T_f is the threshold value for frequency f.
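The statistics above translate directly into code. In this sketch, `noise_threshold` is a hypothetical helper name, and the input is assumed to be a matrix of spectral magnitudes with one row per Fourier window.

```python
import numpy as np

def noise_threshold(noise_spectra):
    """Per-frequency threshold T_f = m_f + sigma_f over N analysis windows.

    noise_spectra: array of shape (N_windows, N_freqs) holding the
    magnitudes (energy density) of the noise spectrum in each window.
    """
    mean_f = noise_spectra.mean(axis=0)  # arithmetic mean m_f per frequency
    std_f = noise_spectra.std(axis=0)    # standard deviation sigma_f per frequency
    return mean_f + std_f                # threshold T_f per frequency

# Constant noise spectra: the deviation is zero, so T_f equals the mean
spectra = np.ones((4, 3)) * 2.0
thresholds = noise_threshold(spectra)
```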
Despite some peculiarities and refinements to improve efficiency, this method is similar to "spectral subtraction", which calculates a threshold value for the frequencies of the spectrum at each point in time; that is, a binary matrix of threshold values is obtained for the frequency spectrum of the signal, with a one in a cell corresponding to energy below the threshold value and a zero corresponding to energy above the threshold. The resulting threshold matrix can be applied (as a filter) to the signal containing noise, reducing the frequency amplitude to the minimum signal amplitude at the unit values of the matrix. However, such drastic frequency suppression distorts the resulting signal after filtering. It is therefore better to additionally apply a smoothing filter to the matrix, which replaces each value with the weighted average of the adjacent values (its vicinity) in frequency and time, similar to blurring images with a Gaussian filter; that is, smoothing is performed by convolving the matrix values with a smoothing mask. The smoothed matrix is applied to the noisy signal, reducing the frequency amplitudes not to the minimum amplitude but to the amplitude multiplied by the smoothed matrix value at the given point. After the subtraction is performed, an inverse Fourier transform is applied to the signal spectrum and the speech signal cleaned from noise is obtained. The above approach of filtering by subtraction is, however, the simplest and most straightforward; more advanced methods of thresholding and matrix smoothing and their modifications can be applied to increase its efficiency. It should be noted that interest in simple methods persists because they do not require significant computation, allow the result to be obtained quickly and realise the signal processing practically in real time. The described refined method was applied without additional corrective transformations to remove noise from various speech signals containing noise and proved its effectiveness.
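One possible reading of the thresholding, mask smoothing and mask application described above is sketched below. The uniform neighbourhood average standing in for the Gaussian blur, the amplitude floor, and all function names are our own illustrative assumptions.

```python
import numpy as np

def threshold_mask(mag, threshold):
    """Binary matrix: 1 where the energy is below the per-frequency
    threshold (treated as noise), 0 where it is above (treated as speech)."""
    return (mag < threshold).astype(float)

def smooth_mask(mask, radius=1):
    """Replace each cell by the average of its (2r+1)x(2r+1) neighbourhood
    in frequency and time, a simple stand-in for Gaussian blurring."""
    padded = np.pad(mask, radius, mode="edge")
    out = np.zeros_like(mask)
    size = 2 * radius + 1
    for di in range(size):
        for dj in range(size):
            out += padded[di:di + mask.shape[0], dj:dj + mask.shape[1]]
    return out / size ** 2

def apply_mask(mag, mask, floor=1e-3):
    """Attenuate magnitudes in proportion to the smoothed noise mask,
    keeping a small floor rather than suppressing bins to zero."""
    return np.maximum(mag * (1.0 - mask), floor)
```

With the raw binary mask, noise bins are suppressed all the way to the floor; after smoothing, the attenuation at each time-frequency point is graded by its neighbourhood, which is what reduces the audible artefacts of hard suppression.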
3. The most complicated case is dynamically changing or random noise. The analysis time of the speech signal with noise should be maximised to accumulate information about the noise and then process it statistically. Once areas of speech activity have been identified and pauses detected using a VAD or another, more efficient method (taking dynamically varying noise into account), areas of periodicity and relative stationarity of the noise signal (if present) should be determined. If periodic bursts (spikes) appear in the noise signal, their energy should also be included in the calculations at the times of their arrival, taking into account that their spectrum is wide. For the selected areas, the basic characteristics of the noise signal should be determined, including the statistical features: the arithmetic mean of the energy density for each frequency as a function of time (window number) when the signal is split into N Fourier-transform windows, and the standard deviation relative to the obtained arithmetic mean for each frequency. Then the method of subtracting the amplitudes of the frequency components of the noise spectrum obtained from the noise analysis from the frequency components of the signal containing noise is applied, as described in step 2 for homogeneous noise coinciding in frequency with the speech signal.

Conclusion
Summarising the results of the practical processing of various speech signals, the proposed refined noise-filtering method has confirmed its effectiveness and the feasibility of its application, and can be recommended for assessing the quality of filtering when forming the training sample of a neural network within the proposed approach. Based on the analysis of experimental data on the filtering of speech signals by neural networks trained with corrected samples, we can conclude that an important result of this work is the ability to assess the feasibility of specific corrective changes to the training sample of a neural network by comparison with the reliable filtering results of the refined, proven method.