Research Article Automatic Evaluation of Internal Combustion Engine Noise Based on an Auditory Model



Introduction
The loudness and sound quality of internal combustion engines directly affect the operator's experience. Therefore, noise control and evaluation of internal combustion engines is a popular topic in the field of engineering. The objective evaluation methods of noise quality include linear and nonlinear evaluation and predictive models. In previous studies [1,2], multiple linear regression theory was used to establish a sound quality classification model, the results of which agreed closely with the measured values of subjective evaluation. Huang et al. [3] proposed the use of psychoacoustic parameters as inputs to a genetic algorithm (GA)-wavelet neural network and a back propagation (BP) neural network to predict sound quality, which was proven to be somewhat effective. In a study by Xu et al. [2], a nonlinear evaluation model based on an adaptive boosting (AdaBoost) algorithm was proposed. The predictive results of the model were compared with those of the GA-BP, GA-extreme learning machine (ELM), and GA-support vector machine (SVM) models, which showed that the proposed model improved the accuracy and precision. In the above models, the sound qualities were predicted using the objective psychoacoustic parameters of sound quality as inputs. The accuracy and precision of the predictions were the main focus of the evaluation model research.
Auditory models are widely used in target recognition, fault diagnosis, and speech recognition. An underwater target echo recognition method based on auditory spectrum features was proposed [4]. The underwater target single-frequency echo recognition experiment showed better robustness. Under the same test conditions, the recognition rate was about 3% higher than that of a perceptual linear prediction (PLP) model. In a study by Wu et al. [5], auditory spectrum feature extraction was applied to the fault diagnosis of broken teeth. A gammatone (GT) band-pass filter and phase adjustment were applied to signals to calculate the probability density of the amplitude at each extreme point. The results showed that the proposed method could accurately characterize and extract the fault features of broken teeth, and the extraction accuracy was high. Liang [6] proposed a binaural auditory model and applied it to the analysis and control of a car's interior noise quality. The results showed that the interior noise quality of the car was greatly improved. At present, there have been no studies on the application of auditory models in the automatic evaluation of the noise quality of internal combustion engines.
In this study, the noise samples of certain types of diesel engines were processed using a gammatone filter to establish an auditory model similar to human ears, and an automatic classification model of noise quality was constructed based on a convolutional neural network (CNN). We aimed to study the following: (1) time domain signal processing of noise samples, (2) auditory spectrum transformation of noise samples, and (3) applications of the auditory spectrum-based CNN in the classification of noise quality. The auditory model of the sound samples was taken as the input, and the subjective evaluation score was taken as the output label for model training and optimization. Compared to the model using objective sound quality psychoacoustic parameters as the input, the proposed model exhibited higher classification accuracy.

Auditory Model
The human auditory system consists of several parts of the ear and brain. The auditory model widely used in the field of signal analysis simulates the functions of the ear. Different locations of the basilar membrane inside the cochlea produce different traveling wave deflections when stimulated by the corresponding frequencies, so the membrane behaves like a bank of band-pass filters, and the nerve fibers used for transmission are called channels. Each channel corresponds to a specific point on the basilar membrane. In the human auditory system, each channel has an optimal frequency (center frequency), which defines the frequency of maximum excitation [7], as shown in Figure 1.
Gammatone (GT) band-pass filter [8] banks have been used to simulate the internal mechanism of the cochlea. The frequency channels of the GT filter banks cover the range of 80 Hz-8 kHz. Figure 2 shows the frequency response of the GT filter banks. The GT filter algorithm is represented as follows:

$$y(t, s) = x(t) *_t h(t, s), \qquad h(t, s) = t^{n-1} e^{-2\pi b t} \cos(2\pi f_s t + \varphi)\, u(t). \tag{1}$$

The sound sample x was decomposed into 64 different frequency channels using GT filter banks. Each frequency channel contained the relationship between the component harmonics and time t. Here, $y(t, s)$ denotes the auditory spectrum output of the basilar membrane, $x(t)$ denotes the sample signal, $*_t$ denotes time domain convolution, $h(t, s)$ denotes the time domain expression of the GT filters, $f_s$ denotes the center frequency, whose coverage was set to 50 Hz-20 kHz, $u(t)$ denotes the step function, and n denotes the order of the filters. Studies have shown that when n = 4, the characteristics of the human ear basilar membrane filter can be well simulated using GT filters. The phase $\varphi$ is 0, and b denotes the attenuation rate of the filter, which is set by the equivalent rectangular bandwidth (ERB) and calculated as follows:

$$b = b_1\, \mathrm{ERB}(f_s), \qquad \mathrm{ERB}(f_s) = 24.7 + 0.108 f_s. \tag{2}$$

The value of $b_1$ was 1.019, so that the physiological parameters of the basilar membrane could be better simulated.
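The filter bank described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the Glasberg-Moore form of the ERB, the logarithmic spacing of the 64 center frequencies, the peak normalization, and the reduced demo sampling rate of 16 kHz are all assumptions made here, and the function names are hypothetical.

```python
import math

def erb(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore form, assumed)."""
    return 24.7 + 0.108 * fc_hz

def gammatone_ir(fc_hz, fs_hz, duration_s=0.02, order=4, b1=1.019, phase=0.0):
    """Sampled h(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*fc*t + phase) for t >= 0."""
    b = b1 * erb(fc_hz)
    h = []
    for k in range(int(duration_s * fs_hz)):
        t = k / fs_hz
        h.append(t ** (order - 1) * math.exp(-2 * math.pi * b * t)
                 * math.cos(2 * math.pi * fc_hz * t + phase))
    peak = max(abs(v) for v in h)
    return [v / peak for v in h]  # peak-normalized for convenience (assumption)

def convolve(x, h):
    """Direct time-domain convolution y = x * h, truncated to len(x)."""
    y = []
    for i in range(len(x)):
        y.append(sum(h[j] * x[i - j] for j in range(min(i + 1, len(h)))))
    return y

def center_frequencies(f_lo, f_hi, channels):
    """Log-spaced center frequencies between f_lo and f_hi (assumed spacing)."""
    ratio = (f_hi / f_lo) ** (1.0 / (channels - 1))
    return [f_lo * ratio ** c for c in range(channels)]

def tone(freq_hz, fs_hz, n_samples):
    return [math.sin(2 * math.pi * freq_hz * k / fs_hz) for k in range(n_samples)]

fs = 16000  # demo sampling rate, chosen only to keep the example fast
cfs = center_frequencies(80.0, 8000.0, 64)
h_1k = gammatone_ir(1000.0, fs)

# The 1 kHz channel should pass a 1 kHz tone and strongly attenuate a 4 kHz tone.
energy_in_band = sum(v * v for v in convolve(tone(1000.0, fs, 800), h_1k))
energy_out_of_band = sum(v * v for v in convolve(tone(4000.0, fs, 800), h_1k))
```

Stacking the outputs of all 64 channels row by row gives a time-frequency array of the kind used as the auditory spectrum input.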

Automatic Evaluation Method
The idea of an unsupervised learning algorithm in machine learning was adopted for automatic evaluation. A CNN model was chosen to automatically extract the eigenvalues of the input noise samples, and the parameters that were learned through training were applied to the online automatic evaluation system to continuously optimize the model and improve the classification accuracy.

Evaluation Based on CNN.
A CNN [9] is a deep feedforward artificial neural network to which convolution and pooling layers are added. It can extract the features of training samples under unsupervised conditions, realize a sparse representation of sample features, and works on a principle similar to the way neurons in the animal visual cortex detect optical signals. CNNs have been highly successful in the fields of image and speech recognition. The convolution layer of a CNN involves sparse interactions and parameter sharing, so that each convolution kernel can extract the feature information on the time-frequency axis of the sample sound, which is directly related to sound quality perception.
The preprocessed sound samples and subjective evaluation results were divided into two parts: a training set and a test set. The training set data were taken as the original input to train the CNN model. A Softmax classifier was used to obtain the attribution probability of the current samples. The test set was used to validate the model after each training iteration, and the optimal model was selected and saved. Thus, the saved CNN classification model could be used to predict the attributes of new samples. The process is shown in Figure 3.
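As a small illustration of how a Softmax classifier turns the network's final-layer outputs into attribution probabilities over the nine grades, consider the sketch below. The logit values are made up for the example, and the function name is hypothetical.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical final-layer outputs for one sample, one value per grade 1..9.
logits = [0.2, 1.5, 0.1, 3.0, 0.7, -1.0, 0.0, 0.4, -0.3]
probs = softmax(logits)
predicted_grade = probs.index(max(probs)) + 1  # grades are 1-based
```

The predicted grade is simply the class with the largest probability.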

Definition of CNN Evaluation Model.
The input layer data sample is $V \in \mathbb{R}^{A \times B}$, where A denotes the number of frames contained in a noise sample and B denotes the dimension of each frame (it was set to 64), as the audio frames were signals obtained through the gammatone band-pass filter. The dimension ordinal number indicates that the frequencies of the feature points in each frame were arranged in ascending order. The vertical coordinates of the input data in the figure are frequency bands.
For the feature maps of the (l − 1)th layer, $v_b$ denotes the eigenvalue of the bth band. Convolution layer l consists of four convolution kernels, and the activation value $v_j^l$ of the convolution layer was calculated as follows:

$$v_j^l = f\Big(\sum_i v_i^{l-1} * k_{ij}^l + b_j\Big). \tag{3}$$

The left side denotes the jth feature map of layer l. On the right side, a convolution was performed between each feature map $v_i^{l-1}$ of layer l − 1 and the jth convolution kernel of layer l (written $k_{ij}^l$ for input map i), the results were summed, and the offset $b_j$ of the jth feature map was added. The result was then passed through an activation function f(x), which was a rectified linear unit (ReLU) function [10], shown as follows:

$$f(x) = \max(0, x). \tag{4}$$

The pooling layer [11] was the next operation after the convolution layer. The low-resolution representation of the feature map obtained by the convolution layer was calculated using a downsampling method. A max-pooling function is often used to calculate the maximum value of a feature map obtained by convolution (over continuous frequency bands), and its formula is as follows:

$$p_{j,m} = \max_{k=1}^{r} v_{j,(m-1)n+k}, \tag{5}$$

where $p_{j,m}$ denotes the output of the jth feature map in the mth pooling band, $v_{j,b}$ denotes the value of the jth feature map in band b, n denotes the downsampling factor, and r denotes the size of the pool, indicating how many frequency bands of data were pooled together. The parameter sharing of the convolution kernels and the max-pooling in the model provided invariance to small frequency shifts and reduced the number of training parameters, which suppressed overfitting. The convolution layer and the pooling layer appeared in pairs and could be stacked many times to obtain more abstract features. The final fully connected layers were used to combine the features of different frequency bands. The model is shown in Figure 4.
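The convolution, ReLU, and max-pooling steps can be illustrated on a single frame with a toy example. The sketch below works along the frequency-band axis only, uses a single input map and a single kernel, and computes the convolution in the cross-correlation form common in CNN libraries. The frame values, kernel, and pool sizes are invented for illustration.

```python
def relu(x):
    """Rectified linear unit: max(0, x)."""
    return x if x > 0.0 else 0.0

def conv_bands(v, kernel, bias):
    """Slide the kernel over the band axis (no padding), add the bias, apply ReLU."""
    width = len(kernel)
    out = []
    for start in range(len(v) - width + 1):
        s = sum(kernel[k] * v[start + k] for k in range(width)) + bias
        out.append(relu(s))
    return out

def max_pool(q, r, n):
    """Maximum over r consecutive bands, moving the window by n bands each time."""
    pooled = []
    m = 0
    while m * n + r <= len(q):
        pooled.append(max(q[m * n : m * n + r]))
        m += 1
    return pooled

# Hypothetical 8-band frame and a 3-tap kernel that responds to falling energy.
frame = [0.1, 0.5, 0.9, 0.4, 0.2, 0.0, 0.3, 0.8]
feature_map = conv_bands(frame, kernel=[1.0, 0.0, -1.0], bias=0.0)
pooled = max_pool(feature_map, r=2, n=2)
```

Because each pooled value is the maximum over neighbouring bands, shifting a spectral peak by one band within a pool leaves the pooled output unchanged, which is the shift tolerance mentioned in the text.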

Training of CNN Evaluation Model.
The training method used in the CNN model was basically the same as the back propagation training method used in the BP neural network [12]. The error of each layer was propagated layer by layer from back to front according to the chain rule.
The layer above convolution layer l was pooling layer l + 1, and the gradients of the error E were calculated using the following equations.
Bias term:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} (\delta_j^l)_{uv}. \tag{6}$$

Weight term:

$$\frac{\partial E}{\partial k_{ij}^l} = \sum_{u,v} (\delta_j^l)_{uv}\, (p_i^{l-1})_{uv}, \tag{7}$$

where $(p_i^{l-1})_{uv}$ denotes the area (u, v) of $v_i^{l-1}$ that was multiplied element by element by the convolution kernel $k_{ij}^l$ in the forward propagation convolution calculation.
The sensitivity term $\delta_j^l$ of layer l is as follows:

$$\delta_j^l = f'(u_j^l) \circ \mathrm{up}(\delta_j^{l+1}), \tag{8}$$

where $u_j^l$ denotes the input to the activation function of the jth feature map and $\circ$ denotes element-wise multiplication. According to the back propagation sensitivity algorithm of the neural network, the sensitivity $\delta_j^l$ of layer l is the product of the sensitivity of layer l + 1 and the derivative of the output activation function of layer l. However, in the CNN, layer l + 1 is the downsampling pooling layer, and its feature map elements do not have a one-to-one correspondence with those of layer l, so $\mathrm{up}(\delta_j^{l+1})$ must be used in place of $\delta_j^{l+1}$. The upsampling function is as follows:

$$\mathrm{up}(X) \equiv X \otimes 1_{n \times n}, \tag{9}$$

where up(·) denotes upsampling and $1_{n \times n}$ is the n × n matrix of ones. If the sampling factor n was used in the downsampling, the upsampling operation in the back propagation enlarged each feature map n times in the horizontal and vertical dimensions. Therefore, the Kronecker product [13] was used to complete the calculation.
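The upsampling step is easy to see on a small matrix. The following sketch implements the Kronecker product with an n × n block of ones in plain Python; the 2 × 2 sensitivity map is a made-up example and the helper names are hypothetical.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    rows_b, cols_b = len(b), len(b[0])
    return [[a[i // rows_b][j // cols_b] * b[i % rows_b][j % cols_b]
             for j in range(len(a[0]) * cols_b)]
            for i in range(len(a) * rows_b)]

def upsample(delta, n):
    """up(X): replicate every element of X into an n x n block."""
    ones = [[1.0] * n for _ in range(n)]
    return kron(delta, ones)

# Hypothetical sensitivity map of a pooling layer with downsampling factor 2.
delta_next = [[1.0, 2.0],
              [3.0, 4.0]]
delta_up = upsample(delta_next, 2)
```

Each entry of the pooled sensitivity map is copied to every position of the block it was pooled from, so the result has the same size as the convolution layer's feature map and can be multiplied element-wise with the activation derivative.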

Sound Quality Evaluation Based on Psychoacoustic Parameters
To test the predictive accuracy of the CNN evaluation method based on the auditory model, the widely used BP evaluation model based on psychoacoustic parameter input was selected as the control model.

Sample Collection and Preprocessing.
A HeadRec ASMR head recording binaural microphone was used as the front-end equipment for audio acquisition of the test samples, and 90 sound samples were collected as the test sample database. The sample database contained 30 groups of steady-state sound signals collected from three types of internal combustion engines, a Mitsubishi 4G6 MIVEC gasoline engine, a Toyota HR16DE gasoline engine, and a Hyundai D4BH diesel engine, with speeds ranging from 800 to 4500 rpm. The audio sampling frequency was 44 kHz. The frequency identification resolution was 1 Hz. The recording length of each sample was 15 s. The values of the sound quality in the sample database were based on subjective evaluations by an assessment group. In the experiment, 25 students with normal hearing and 5 teachers of related majors were selected to form the sound quality assessment group, and the sound samples were divided into nine grades. Grade 1 was the best, and Grade 9 was the worst. The manual evaluation module of the automatic evaluation system developed in this study was used for the evaluation of the internal combustion engine noise quality.
After the evaluation, a Spearman correlation analysis of the evaluators' rating results was carried out, and 8% of the data were removed as unreliable.
The distribution of evaluation results is shown in Figure 5. Most of the sample scores were within the range of 4-7 points, i.e., the grades ranged from "satisfactory" to "poor" in the subjective evaluation of sound quality.

Psychoacoustic Parameter Processing of Noise Samples.
The psychoacoustic parameters [14] of the noise samples mainly included eight parameters: the A-weighted sound pressure level (hereinafter referred to as A sound level), loudness, impulsiveness, sharpness, tonality, roughness, fluctuation strength, and articulation index (AI). To reduce the complexity and training time of the BP neural network, the Pearson correlation analysis method was used to obtain the correlation distribution between the psychoacoustic parameters and the evaluation score, as shown in Figures 6-9.
The correlation fitting curves show that the correlation coefficients of four parameters, i.e., A sound level, loudness, sharpness, and roughness, with the evaluation score were all greater than 0.70. Table 1 shows that the correlation coefficients of the fluctuation strength and impulsiveness with the evaluation score were 0.563 and 0.346, respectively. The correlation significance of <0.05 indicates that both the fluctuation strength and impulsiveness were linearly correlated with the evaluation score, but the correlation was weak.
The correlations of the tonality and AI with the evaluation score were less than 0.31, which indicated that the two parameters were essentially uncorrelated with the evaluation score. The main reason was that the sound quality of the internal combustion engine's vibration noise was insensitive to these two parameters, and the pitch values of internal combustion engines were not highly discriminated at different rotational speeds. Given the complexity of the BP model, the training time, and the correlations between the objective parameters and the evaluation score, four parameters, A sound level, loudness, sharpness, and roughness, were selected as the input variables of the BP neural network.
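The selection rule described above (keep the parameters whose Pearson correlation with the evaluation score exceeds 0.70) can be sketched as follows. All numbers in the example are invented for illustration and are not the study's measurements; only the 0.70 threshold comes from the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical subjective scores and two hypothetical parameter series.
scores = [3, 4, 4, 5, 6, 6, 7, 8]
params = {
    "loudness": [10.0, 12.0, 13.0, 15.0, 18.0, 17.0, 21.0, 24.0],
    "tonality": [0.30, 0.10, 0.25, 0.12, 0.28, 0.15, 0.22, 0.18],
}

threshold = 0.70
selected = [name for name, values in params.items()
            if abs(pearson(values, scores)) > threshold]
```

In this toy data the loudness series tracks the scores closely and is kept, while the tonality series does not and is dropped.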

BP Neural Network Structure.
The topological structure of the BP neural network in this experiment consisted of four layers: an input layer, hidden layer 1, hidden layer 2, and an output layer, as shown in Figure 10.
There were four nodes in the input layer, corresponding to the four input psychoacoustic parameters. The number of nodes in the hidden layers was determined using the dropout method. After many trials, eight nodes were used in hidden layer 1 and six nodes in hidden layer 2. The output layer had nine nodes, to match the rating grades. In the expressions for the inputs and outputs of the hidden layers, $hi_l(k)$ denotes the input of the kth sample in layer l, $w_{il}$ denotes the weight of the ith node in layer l, $ho_l(k)$ denotes the output of the kth sample in layer l, and f(x) denotes the activation function.
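A forward pass through a 4-8-6-9 network of this shape can be sketched as below. The sigmoid activation, the presence of bias terms, the uniform random weights, and the input values are all assumptions for illustration; the text does not specify them, and the names are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one row of W per node."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

rng = random.Random(0)
sizes = [4, 8, 6, 9]  # input, hidden layer 1, hidden layer 2, output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

def forward(x):
    h = x
    for weights, biases in layers:
        h = dense(h, weights, biases, sigmoid)
    return h

# Hypothetical normalized inputs: A sound level, loudness, sharpness, roughness.
x = [0.62, 0.48, 0.35, 0.71]
out = forward(x)
grade = out.index(max(out)) + 1  # output node with the largest activation
```

With untrained weights the predicted grade is arbitrary; the sketch only shows how the four inputs flow through the two hidden layers to the nine output nodes.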

BP Neural Network Training.
The data of the four psychoacoustic parameters of 80 sound samples were selected as training samples, and the remaining 10 were used as test samples. The algorithm for normalizing the variables before input is shown in equation (16) below. The average error $E_{avg}$ of the model classification and training was computed from $y_i$, the ith expected output value, and $y(x_i)$, the ith calculated output value, over the m samples.
The initial weights in the network model were generated randomly using the Nguyen-Widrow algorithm as follows:

$$w = 0.7\, n^{1/r} \cdot \mathrm{random}(n, r), \tag{14}$$

where n denotes the number of neurons in each layer, which are $n_1 = 4$, $n_2 = 8$, and $n_3 = 6$, and r denotes the dimension of the input vector, which was set to 4 in this experiment. To avoid the problem of local optimal solutions arising from the gradient descent method used in the training, the simulated annealing algorithm (SAA) [15] was used. In this algorithm, if the objective function value of the candidate state $x_1$ was less than that of the current state x, then $x_1$ was accepted as the optimal point. Otherwise, the acceptance probability $p = \exp((f(x) - f(x_1))/t)$ was calculated. If p > random(0, 1), then $x_1$ was accepted as the optimal point. The initial temperature was 1000 °C. The temperature attenuation rate was 0.7.
The learning rate in the training was 0.03. The training iteration was carried out 4500 times, and the target error was set to 0.008.
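The weight initialization and the simulated annealing acceptance rule can be sketched together. The sketch assumes that random(n, r) draws an n × r matrix uniformly from [-1, 1] and that the objective is minimized, so a candidate with a lower objective value is always accepted; both points are assumptions, as are the function names.

```python
import math
import random

def nw_scale(n, r):
    """Scale factor 0.7 * n^(1/r) from the initialization formula."""
    return 0.7 * n ** (1.0 / r)

def nguyen_widrow(n, r, rng):
    """n x r weight matrix: scale factor times uniform draws in [-1, 1] (assumed)."""
    scale = nw_scale(n, r)
    return [[scale * rng.uniform(-1.0, 1.0) for _ in range(r)] for _ in range(n)]

def accept(f_current, f_candidate, t, rng):
    """Annealing rule: always accept improvements, otherwise accept with
    probability exp((f_current - f_candidate) / t)."""
    if f_candidate < f_current:
        return True
    p = math.exp((f_current - f_candidate) / t)
    return p > rng.random()

def cooling_schedule(t0=1000.0, rate=0.7, steps=5):
    """Geometric cooling with the stated initial temperature and attenuation rate."""
    return [t0 * rate ** k for k in range(steps)]

rng = random.Random(42)
weights = nguyen_widrow(8, 4, rng)  # hidden layer 1: 8 neurons, 4 inputs
schedule = cooling_schedule()
```

At high temperature a worse candidate is accepted almost as often as a better one, which lets the search escape local optima; as the temperature falls, the rule approaches plain descent.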

Sound Quality Evaluation Based on the Auditory Model
The sound samples collected in Section 4.1 were used in the auditory model test. The input sound signals were processed into time domain and auditory spectrum [8] representations, respectively. The CNN model was used for comparison.

Time-Domain Signal Processing of Noise Samples.
The intensity of the noise signal in different time periods was described using the short-time average energy, which reflected the energy information of the sample signal. This could be used for determining the time domain eigenvalues of sound signals, e.g., for contrasting noise levels or distinguishing noise from silence. In this transformation, the values of each sampling point in a short time frame were weighted by a window function and squared, and the resulting time series of short-time energies was output. In this calculation, $E_m$ denotes the average energy of the mth short frame, x(n) denotes the value of the nth sampled signal in the mth short frame of the time-domain signal, and W(n) denotes the window function with length N. A Hamming window function was used, and α was set to 0.46. Sampling was carried out with a frame size of 100 ms and a frame offset of 20 ms. The short-time average energy fragments (frames 4500-5000) of the sample signals were calculated, as shown in Figure 11.
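A short-time average energy calculation of this kind might look like the sketch below. The normalization by the frame length, the reading of 100 ms as the frame length and 20 ms as the hop, the reduced demo sampling rate, and the synthetic quiet-then-loud test signal are assumptions made for illustration.

```python
import math

def hamming(N, alpha=0.46):
    """Hamming window W(n) = (1 - alpha) - alpha * cos(2*pi*n / (N - 1))."""
    return [(1.0 - alpha) - alpha * math.cos(2 * math.pi * n / (N - 1))
            for n in range(N)]

def short_time_energy(x, frame_len, hop):
    """Mean of the squared, windowed samples in each overlapping frame."""
    w = hamming(frame_len)
    energies = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len]
        energies.append(sum((s * wn) ** 2 for s, wn in zip(frame, w)) / frame_len)
    return energies

fs = 8000                      # demo sampling rate (assumption)
frame_len = round(0.100 * fs)  # 100 ms frame
hop = round(0.020 * fs)        # 20 ms frame offset

# Synthetic signal: half a second of quiet tone followed by a loud tone.
quiet = [0.01 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs // 2)]
loud = [0.50 * math.sin(2 * math.pi * 200 * n / fs) for n in range(fs // 2)]
energy = short_time_energy(quiet + loud, frame_len, hop)
```

The energy sequence stays low over the quiet half and rises sharply once frames begin to overlap the loud half, which is the kind of contrast the time-domain input gives the CNN.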

Auditory Spectrum Transformation of Noise Samples.
The proposed auditory model based on gammatone filter banks was used for the auditory spectrum transformation.
The overlapping segmentation method was used for signal processing, and the filter response y(t, s) was downsampled.
To ensure a fast convergence speed of the model training, the input data should be normalized before use; the noise samples were normalized using equation (16). The output label value of the CNN model is the grade vector representation of the subjective evaluation results on a scale of 1 to 9.
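The two preprocessing steps mentioned here, input normalization and conversion of a grade into a label vector, can be sketched as follows. Min-max scaling and one-hot encoding are assumed choices, since the text gives neither the normalization formula nor the exact form of the grade vector; the function names are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def grade_vector(grade, num_grades=9):
    """One-hot label: a 1 at the position of the grade (1-based), 0 elsewhere."""
    if not 1 <= grade <= num_grades:
        raise ValueError("grade must lie between 1 and num_grades")
    return [1 if g == grade else 0 for g in range(1, num_grades + 1)]

normalized = min_max_normalize([2.0, 4.0, 6.0, 10.0])
label = grade_vector(5)
```

A one-hot label has the same length as the Softmax output, so the two can be compared directly when computing the classification loss.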

Structure of Convolutional Neural Network.
In the CNN model, the first layer was the input layer, followed by three convolution layers alternating with three pooling layers. Next, there was the fully connected layer, followed by the Softmax classifier [16]. The detailed structure is shown in Table 2.

Training and Optimization.
To ensure the comparability of test results, in the evaluation based on the CNN model, Conv1-64 in Table 2 was used in the input signal experiment for the first layer convolution (pad 1, stride 1), and a 3 × 3 pooling layer (pad 1, 0, stride 2) was used for the max-pooling to obtain a 64 × 30 × 48 output feature map. In the second convolution layer, Conv2-128 (5 × 5, pad 1, stride 1) and the max-pooling window were used to generate a 128 × 14 × 23 feature map. In the third convolution layer, Conv3-256 (3 × 3, pad 1, stride 2) and the same pooling window were used to complete the convolution operation and output 256 × 3 × 6 eigenvalues. In the fourth layer, i.e., the fully connected layer, there were 4608 explicit nodes and 300 hidden nodes. Finally, the predictive probabilities for the nine kinds of noise quality were output by connecting the Softmax classifier.
The random initialization algorithm proposed in Section 4.4 was also used to initialize the weights in the CNN. In the training experiment, 10 samples were selected for each batch to estimate the gradient, and 60 iterations were performed for each batch. To accelerate the convergence, batch normalization was applied to the output of each layer.
The adaptive moment estimation (Adam) algorithm [17] was used for gradient descent optimization during training. The Adam algorithm calculates an adaptive learning rate for each parameter. It keeps not only an exponentially decaying average of the past squared gradients but also an exponentially decaying average of the past gradients M(t). Thus, it could be used to deal with the sparse gradient problem of convex functions quickly.
The learning rate was set to 0.01. To prevent overfitting, 50% of the nodes were randomly removed or 30% of the hidden layer nodes were randomly removed in each batch of training in the fourth layer.
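To make the Adam update concrete, the sketch below applies it to a toy quadratic objective. Only the learning rate of 0.01 comes from the text; the decay rates (0.9 and 0.999), the epsilon value, and the toy objective are the commonly used defaults and an illustrative choice, not settings reported in the study.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: decaying averages of the gradient (m) and squared
    gradient (v), bias correction, then a per-parameter scaled step."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

# Toy objective: f(theta) = sum((theta_i - target_i)^2), gradient 2 * (theta - target).
target = [1.0, -2.0]
theta = [0.0, 0.0]
m = [0.0, 0.0]
v = [0.0, 0.0]
for t in range(1, 2001):
    grad = [2.0 * (p - c) for p, c in zip(theta, target)]
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Because each step is divided by the running root-mean-square gradient of its own parameter, both coordinates move at a similar pace early on even though their gradients differ by a factor of two.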

Results and Analysis
Figure 13 shows the trends of the training and test losses of the auditory spectrum input in the CNN evaluation test. The loss tended to be stable after 5300 iterations, indicating that the model converged and no overfitting occurred. Noise was added randomly to the samples, and dropout was used to prevent overfitting, which increased the robustness of the classification model. Figure 14 shows the trends of the training and test losses in the BP evaluation test. After 4500 training iterations, the model converged without underfitting or overfitting.
In Figure 15, the x-axis represents the serial number of the noise sample and the y-axis represents the noise grade. Different types of points in the graph show the mean of multiple evaluation results for different models. The agreement between the evaluation average based on the CNN auditory spectrum input and the manual evaluation average was the highest, and the agreement between the evaluation average based on the CNN time and frequency domain inputs and the manual evaluation average was higher than that of the BP model. The overall accuracy and error are shown in Tables 3 and 4. Input variable A represents the A-weighted sound pressure level, L represents the loudness, and R and S represent the roughness and sharpness, respectively. The training accuracy of the CNN model with the auditory spectrum input was 97.31%, and the testing accuracy was 95.28%. Compared to the control model, i.e., the BP neural network evaluation model, the accuracy of the CNN model was up to 5.95% higher. In the CNN evaluation model, the accuracy of the auditory model input was 3.53% higher than that of the time domain input. The single-input results of the CNN model obtained using loudness, roughness, and sharpness showed that these inputs had no obvious advantage in the classification accuracy of the sound quality. The results also showed that the convolutional neural network possessed strong automatic feature extraction and expression abilities. It could accurately express the features of the samples when they were input with complete time-frequency information and achieve good classification. Furthermore, because the convolution operation yields a sparse representation of the sample signal, features such as parameter sharing effectively reduced the complexity of the training.

Conclusions
(1) In this study, an auditory model-based automatic evaluation method was proposed to evaluate the sound quality of internal combustion engines. The hierarchical structure of a CNN and the size and thickness of the convolution kernel were designed. The sound samples collected in the evaluation were used to obtain a time-frequency auditory spectrum simulating the auditory characteristics of the human ear basilar membrane through gammatone filter banks. This input was used to train the CNN model, which could accurately express the characteristics of the samples and achieve good evaluation and classification.
(2) The training accuracy of the auditory spectrum input in the CNN model was much higher than that of the BP neural network evaluation model, and the time domain input test results of the CNN evaluation model also indicated better performance than that of the BP neural network evaluation model. Therefore, the CNN evaluation model was a better evaluation model for the classification of internal combustion engine sound quality.
(3) The experiments on 20 groups of different types of newly collected internal combustion engine sound samples showed that the proposed model has good generalization abilities. It was compared with CNN experiments using frequency domain and time domain inputs. The results showed that the sound quality classification method based on an auditory model was effective.

Figure 6: Correlation between A sound level and evaluation score.

Figure 8: Correlation between sharpness and evaluation score.

Figure 9: Correlation between roughness and evaluation score.

Figure 10: Structure of BP neural network.

Table 1: Correlation between psychoacoustic parameters and evaluation score.

Table 2: Structure of convolutional neural network evaluation model.

Table 3: Comparison of test results of two evaluation models.

Table 4: Comparison of auditory spectrum and time-domain input test results.