Performance evaluation of lung sounds classification using deep learning under variable parameters


If abnormal lung sounds could be recognized automatically, it would be very helpful. With the development of computation and electronic technology, this is becoming a reality. First of all, respiratory sounds can be recorded by electronic stethoscopes and stored as audio files for further processing. We can collect many normal and abnormal lung sounds with such electronic instruments. After accumulating sufficient data, we can try to develop a model to classify normal/abnormal lung sounds, or even to diagnose lung diseases automatically.
Thanks to the development of machine learning, especially deep learning, computer-aided lung sound detection technology has made rapid progress. There have been many works on detecting abnormal lung sounds via machine learning or deep learning. Until 2015, machine learning methods such as the support vector machine (SVM) and principal component analysis (PCA) played major roles [3][4][5][6][7][8]. After that year, deep learning models, especially the convolutional neural network (CNN), were introduced to this field and shown to be superior to classical machine learning in accuracy and generalization ability [9][10][11][12][13][14]. For machine learning, so-called hand-crafted features, namely the peculiar signatures of some abnormal/normal lung sounds, must be extracted in advance as the input to the learning model. Various features, such as the skewness and kurtosis of the time signal or the spectral density in the frequency domain, are extracted from the sounds [8]. Machine learning does not require a large number of samples, but has the big drawback of limited generalization ability.
Deep learning is an end-to-end approach and does not need a feature extraction step; the raw samples are fed to deep learning models (DLMs) directly. In recent years, it has been applied successfully to speech recognition, object recognition, classification and other fields [14]. In the field of biomedicine, AlphaFold of DeepMind can accurately predict the structure of the human proteome (the collection of all proteins encoded by the human genome); the resulting dataset covers the structural positions of nearly 60% of the amino acids in the human proteome, and the predictions have a high degree of confidence [15]. Deep learning is also widely used in the field of diagnosis. As concluded by Fourcade et al. [16], DLMs "will contribute to optimize routine tasks and thus have a potential positive impact on our practice." Through an investigation in the corresponding author's hospital, we found that an electronic stethoscope able to make an initial classification of lung sounds would be very welcome among physicians. It could avoid some of the troubles of the traditional in-ear stethoscope by transferring the sounds to a computer, or even a mobile phone, and the ability to classify lung sounds could lighten the burden on physicians to a great extent. The issues of universal concern to physicians are accuracy and practicality.
Over the past two decades, there have been many works on lung sound classification using machine learning or deep learning. Many solutions with different parameters and performance levels have been presented. In many works, the parameters were selected and set empirically; few works discuss how the performance is affected by different parameters. In this paper, we focus on this topic and try to discover the relationship between the parameters and the performance.
The remainder of this paper is organized as follows. A detailed literature review is performed, in which some representative works on automatic classification of lung sounds are reviewed. Then the ICBHI 2017 dataset is introduced briefly, with emphasis on data preprocessing and augmentation. The CNN model is discussed in detail from the aspects of architecture, features and parameters. A comparative study of classification performance of the proposed work versus up-to-date ones is performed. A summary of the paper and an outlook on future work are presented in the last section.
A CNN model involves many parameters, both feature-related and model-related. Before training it, we have to select appropriate values for them. On the one hand, the parameters can be designed by trial and error; on the other hand, we can inherit existing parameters proved to be effective in other works. It would be very helpful for parameter selection if we clarified the relationship between the performance and the parameters of the CNN. This is the aim of this work, and it has attracted little attention from researchers so far. The length of the lung sound frame, the overlap percentage (OP) of successive frames and the feature type are picked as three typical parameters, and the relationship between these parameters and classification performance is explored in detail through experiments. This is the main contribution of this study.
It must be pointed out that the CNN model used in this work has been validated by other work [17], so the problem of tuning the hyperparameters of the network, such as the number of filters and layers or the choice of activation function, is out of the scope of this work.

Literature review
There have been many works on automatic lung sound classification via DLMs. Several respiratory sound datasets have been used to train and test DLMs. The signals were collected from patients and healthy volunteers using an electronic stethoscope or microphone. Some datasets are publicly available, while others are limited to personal use. To the best of the authors' knowledge, the most frequently used datasets are those shown in Table 1.
Among these datasets, RespiratoryDatabase@TR and ICBHI 2017 are two of the most popular. The former was created by Altan et al. [20] and includes not only sound recordings but also the chest X-ray films and pulmonary function test (PFT) measurements of the related subjects. RespiratoryDatabase@TR has been widely used to assess the severity of chronic obstructive pulmonary disease (COPD) [26,27]. The latter was originally compiled to support the scientific challenge organized at the Int. Conf. on Biomedical and Health Informatics and is freely available to everyone [17]. It has been used in many works to train and test DLMs and will be utilized in this work.
Based on these datasets, researchers have tried to distinguish between normal and abnormal lung sounds automatically via machine learning or deep learning. In recent years, deep learning models have played the major role. For comparison, we sort some works on lung sound classification according to the datasets and classification models used, as shown in Table 2.
From Table 2, it can be seen that spectrogram-like features are used most widely, including but not limited to the spectrogram, mel-spectrogram, log-spectrogram and scalogram. Some works fused the spectrogram and mel frequency cepstrum coefficients (MFCCs) as features for DLMs with the intention of improving classification performance. In addition, the number of deep learning-related works far exceeds that of machine learning-related ones.
When ICBHI 2017 was used as the dataset to train and test a DLM, Acharya et al. [28] achieved a score of 71.81% on four-class classification by re-training a deep CNN-RNN (recurrent neural network) model with patient-specific data. Chen et al. [29] trained a deep residual network (ResNet) for triple classification of respiratory sounds with accuracy, sensitivity and specificity up to 98.79%, 96.27% and 100%, respectively; it was reported that the proposed model outperformed a CNN. Shuvo et al. [30] used empirical mode decomposition (EMD) and the continuous wavelet transform (CWT) to train a lightweight CNN with accuracy scores of 98.92% for three-class chronic classification and 98.70% for six-class pathological classification, respectively, which outperformed some larger networks and other contemporary lightweight models. Petmezas et al. [12] made a four-class lung sound classification using a hybrid CNN-LSTM (long short-term memory) network with the spectrogram as the feature, achieving sensitivity 52.78%, specificity 84.26%, score 68.52% and accuracy 76.39%. Cinyo et al. [31] combined a CNN architecture with an SVM/softmax classifier so that various classifiers could be incorporated; the best classification accuracy reported was 83% with the VGG16-CNN-SVM model. Jayalakshmy et al. [32] employed conditional generative adversarial networks and made a four-class classification using a pre-trained CNN (ResNet-50) with the scalogram as the feature, achieving an accuracy of 92.5%. Asatani et al. [33] used an improved convolutional RNN as a quadruple classifier of lung sounds and obtained sensitivity 0.63, specificity 0.83 and score 0.72. Some works used MFCCs as the feature to train/test the deep learning network. Perna [34] used a deep CNN architecture with regularization to classify breathing cycles into three classes (healthy, chronic and non-chronic) and obtained accuracy 82%, precision 87%, recall 83% and F1_score 84%. Dhavala et al. [35] likewise employed MFCCs as features. In most cases, RespiratoryDatabase@TR was used to detect the severity of COPD. Roy et al. [27] generated a mel-spectrogram snippet representation as the input feature and compared the performance of two classifiers for COPD severity detection. Yu et al. [42] extracted the bispectrum of lung sounds as the feature of a CNN classifier to assist the diagnosis of COPD. Altan et al. [43] applied deep belief networks (DBN) to separate the lung sounds of different levels of COPD, extracting the 3D second-order difference plot in the time domain as the feature. In another of their works [44], statistical features of frequency modulations were extracted using the Hilbert-Huang transform and then fed to a DLM. Ahmet et al. [45] extracted statistical features using the empirical wavelet transform (EWT) algorithm and then applied them to SVM, AdaBoost, random forest and J48 DT classifiers, respectively, in aid of the diagnosis of COPD.
In some works, lung sounds were collected by physicians in the field to generate private datasets (see Table 1). It is inappropriate to compare the performance of these works because of the varied datasets, so they will not be reviewed in detail.

Data preparation and preprocessing
The ICBHI 2017 database consists of about 5.5 h of recordings sampled from 126 subjects in total and contains 6898 respiratory cycles (i.e., from the inspiratory to the expiratory phase), of which 3642 contain normal sounds, 1864 contain crackles, 886 contain wheezes, and 506 contain both crackles and wheezes, as shown in Table 3.
The audio samples in the dataset were sampled at frequencies of 4 kHz, 10 kHz and 44.1 kHz using different instruments, so there are differences in the amplitudes of the audio signals across instruments. Another challenge is noise, such as the rubbing sound of the stethoscope against the participant's clothes and ambient noise. In addition, lung sounds are always accompanied by heartbeat sounds.
The length of the respiratory cycles in the dataset varies over a wide range, from 0.3 to 12 s, while in theory a respiratory cycle takes 3-5 s. Such variety in the data makes it challenging to classify the lung sounds. Therefore, the raw signals in the dataset must be preprocessed before being used to train/test a DLM.

Resampling the signals
First of all, the audio signals are resampled at a frequency of 8 kHz to unify the sampling rate across recordings. Note that the maximum frequency of lung sounds is not greater than 3 kHz [46], so the resampling operation should not lead to any important loss of information.
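This resampling step can be sketched with SciPy's polyphase resampler (the function name here is illustrative; the paper does not specify the resampling method):

```python
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

def resample_to_8k(x, fs, target_fs=8000):
    """Resample a 1-D audio signal from fs to target_fs by a rational factor."""
    frac = Fraction(target_fs, int(fs))  # e.g. 8000/44100 reduces to 80/441
    return resample_poly(x, frac.numerator, frac.denominator)
```

For a 1 s recording at 44.1 kHz this yields exactly 8000 samples, since 44100 × 80/441 = 8000.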

Noise filtering
In order to mitigate the effect of ambient noise and heartbeat sounds, the lung sound samples are filtered with a 10th-order Butterworth band-pass filter [38]. The magnitude characteristic of the filter is shown in Fig. 1; the pass band of the filtered samples is about [25, 2500] Hz. Since the frequency of the heartbeat is far below 20 Hz, heart sounds can be filtered out completely. In general, the frequency of background or electronic noise is concentrated below 50 Hz, so it is also eliminated by this filter, and some ambient noise above 2500 Hz is filtered out as well.
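A 10th-order band-pass filter with this pass band can be sketched with scipy.signal (the exact design in [38] may differ; this is an illustrative second-order-sections, zero-phase implementation):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 8000  # sampling frequency after resampling (Hz)

# With btype="bandpass", butter doubles the order: N=5 yields a 10th-order filter.
SOS = butter(5, [25, 2500], btype="bandpass", fs=FS, output="sos")

def bandpass(x):
    """Zero-phase band-pass filtering of a 1-D lung sound signal."""
    return sosfiltfilt(SOS, x)
```

A 5 Hz component (heartbeat range) is almost completely removed, while a 1000 Hz component (inside the pass band) passes essentially unchanged.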

Normalization
After the noise is filtered out, all signals are normalized to the range [−1, 1] to standardize the data across different recording devices.

Data segmentation
In accordance with the annotated respiratory cycles, each audio signal is segmented into clips of 5 s duration. If the duration of an annotated respiratory cycle is under 5 s, the related audio clip is extended to 5 s by sample padding. According to Fraiwan [23], 5 s is an appropriate cycle time, for it covers both faster and slower breathing rates without adding extra complexity to the model.
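A minimal sketch of the fixed-length segmentation (zero padding is assumed here; the paper does not specify the padding scheme):

```python
import numpy as np

FS = 8000          # sampling frequency (Hz)
CLIP_LEN = 5 * FS  # 5 s clip length in samples

def to_fixed_length(cycle):
    """Pad (with zeros) or truncate one respiratory cycle to exactly 5 s."""
    cycle = np.asarray(cycle, dtype=np.float32)
    if len(cycle) >= CLIP_LEN:
        return cycle[:CLIP_LEN]
    out = np.zeros(CLIP_LEN, dtype=np.float32)
    out[:len(cycle)] = cycle
    return out
```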

Transformation of time series to spectrogram-like feature
It is not recommended to feed the lung sound samples to CNNs directly as features. On the one hand, there may be a significant difference between the waveforms of two time series with the same label, especially when the two series are disturbed by noise. On the other hand, a major disadvantage of CNNs on time series is the use of Euclidean kernels: a kernel considers only a contiguous, short subsequence of the time series at a time, while non-contiguous and longer sections of the series must be analyzed in order to extract more representative features. To overcome these drawbacks, the time series of lung sounds are transformed into spectrogram-like images as features for the CNNs.
A spectrogram is a two-dimensional image that shows how the sound amplitude changes with frequency and time, as shown in Fig. 2.

Fig. 2 An example of lung sound spectrogram
The vertical and horizontal axes represent frequency and time, respectively. Not only does the spectrogram match our understanding of sounds through frequency decomposition, but it also allows us to use two-dimensional analysis architectures. The spectrogram is widely considered to be among the best-suited representations of audio signals for analysis.

Data augmentation
As shown in Table 3, we have extracted 1864 cycles containing crackles, 886 containing wheezes, 506 containing both crackles and wheezes, and 3642 containing normal sounds. Obviously, the numbers of the different types of records are unbalanced, which may lead to overfitting during model training and poor generalization ability, so the imbalance must be corrected to even out the numbers of the four types of records. One straightforward solution is to randomly delete records from the types with more records until their numbers approach that of the type with the fewest records, but this wastes a lot of useful data. Another, more effective, solution is to expand the capacity of the data. This is very common in image processing, where the sample capacity is increased through image dithering, inversion, rotation, etc.
There are some popular approaches to audio data augmentation, such as time stretching, pitch shifting and background noise insertion [47]. By time stretching, we can slow down or speed up the audio samples while keeping the pitch unchanged; conversely, the pitch of the audio samples can be raised or lowered while keeping the duration unchanged. By mixing the audio samples with some background noise signal, we can get a new, augmented record [48]. We use these three approaches to augment the lung sounds for balance. First of all, white noise is selected and added to the lung sounds. White noise consists of random sound samples with similar amplitudes but various frequencies; the performance of speech emotion recognition has been shown to increase when white noise is added to the original sound [49]. Lung sounds are very weak, and the signal-to-noise ratios (SNRs) of many records are not high, so the SNR should be controlled not to become too low when noise is inserted. Three SNRs of 10 dB, 15 dB and 20 dB were determined by trial and error.
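Mixing white noise at a prescribed SNR can be sketched as follows, with the noise scaling derived from SNR = 10·log10(P_signal/P_noise) (a minimal version; the paper does not give its exact implementation):

```python
import numpy as np

def add_white_noise(x, snr_db, rng=None):
    """Return x plus white Gaussian noise scaled to the requested SNR (dB)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=np.float64)
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10.0 ** (snr_db / 10.0))  # invert the SNR formula
    noise = rng.normal(0.0, np.sqrt(p_noise), size=x.shape)
    return x + noise
```

Measuring the SNR of the returned signal against the original recovers the requested value to within a fraction of a decibel for clips of a few seconds.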
In order to balance the numbers of records across the four labels, the number of records labeled crackle should be roughly doubled, the number of wheeze records expanded about four times, and the number of records with both crackle and wheeze about seven times. For the latter two labels, this degree of augmentation would be a little exaggerated. To avoid this, we formulate the following augmentation procedure: (1) Each time the CNN model is trained, we first select 400 of the 506 records with both crackle and wheeze at random and add noise to these 400 records at SNRs of 10 dB, 15 dB and 20 dB, respectively; the 1200 resulting "new" records mean that the number of records with both crackle and wheeze is expanded to approximately 1700. (2) Each time the CNN model is trained, we select 750 of the records with wheeze at random and add noise to these 750 records at SNRs of 10 dB, 15 dB and 20 dB, respectively, so the number of records labeled wheeze is expanded to about 1600. (3) Each time the CNN model is trained, we select 180 crackle records at random and divide them evenly into three parts of 60 records each; the three parts are disturbed by noise at the three SNRs of 10 dB, 15 dB and 20 dB, respectively, so the number of crackle records is expanded to about 2000. (4) Each time the CNN model is trained, 2000 records are selected at random from the normal records as training and test data. Finally, the distribution of the augmented lung sound data is shown in Table 4. It can be seen that the imbalance among the four types of records has been corrected.
For the balanced dataset, time stretching and pitch shifting are performed successively: each record in the balanced dataset is stretched by the two factors {0.93, 1.07} and pitch-shifted by the values {−1, 1}. Finally, each type of lung sound in the augmented dataset has four times as many records as in Table 4.

Architecture of deep learning model
A relatively simple CNN is introduced, comprising, in sequence, an input layer, a convolutional layer, a batch normalization layer, a max pooling layer and fully connected layers. There are two fully connected layers; two dropout layers are inserted before and after the first one, respectively, and the second fully connected layer follows the second dropout layer and is connected to the output layer. This architecture, shown in Fig. 3, has been validated by Rocha et al. [50]. The softmax function is adopted in the output layer, as shown in Fig. 3.
The hyperparameters of the CNN are listed in Table 5. In order to ensure the consistency of the CNN architecture among different parameter settings, the hyperparameters are kept fixed and are not tuned during the training stage.

Fig. 3 The architecture of the CNN under one particular parameter setting
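The layer ordering described above can be sketched in PyTorch as follows. The filter count, kernel size, hidden width and dropout rates below are illustrative placeholders, not the values from Table 5 of the paper:

```python
import torch
import torch.nn as nn

class LungSoundCNN(nn.Module):
    """Illustrative CNN: conv -> batch norm -> max pool -> FC layers.

    Layer sizes are placeholder assumptions, not the paper's Table 5 values.
    """
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=7, padding=3),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),            # first dropout, before first FC layer
            nn.LazyLinear(128),         # first fully connected layer
            nn.ReLU(),
            nn.Dropout(0.5),            # second dropout, after first FC layer
            nn.Linear(128, n_classes),  # second FC layer, feeding the output
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        # softmax output layer (in training one would use logits + CrossEntropyLoss)
        return torch.softmax(self.classifier(x), dim=1)
```

For a batch of spectrogram-like inputs, the model produces one probability vector over the four classes per sample.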

Generation of features
Two types of features, the spectrogram and MFCCs, are considered in this paper. Both are obtained via the fast Fourier transform (FFT). First of all, the sound samples of a respiratory cycle must be windowed into successive frames, on which FFTs are performed. The window length (Lwin) of the FFT should in general not exceed 40 ms because of the non-stationarity of lung sounds; in the discrete domain, with the sampling frequency fixed at 8 kHz, the window size, or frame length, should not exceed 320 points. There should be some overlap between successive frames in order to keep continuity. The overlap percentage (OP) is the ratio between the number of overlapping points and the frame length. For example, in Fig. 4, the frame length is 4 and the overlap length is 3, so the overlap percentage is 3/4 = 75%.
The number of frames m of a respiratory cycle lasting 5 s can be calculated as m = ⌊(5fs − Lwin)/(Lwin(1 − OP))⌋ + 1, where the sampling frequency is fs = 8 kHz and Lwin(1 − OP) is the hop size between successive frames.
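The frame count can be checked numerically; for Lwin = 128 and OP = 75% at fs = 8 kHz this framing reproduces the m = 1247 frames quoted later in the paper (a sketch, with the hop size derived from OP):

```python
import numpy as np

def frame_signal(x, lwin, op):
    """Split a 1-D signal into overlapping frames of length lwin.

    op is the overlap percentage as a fraction (e.g. 0.75);
    the hop size is lwin * (1 - op). Returns shape (n_frames, lwin).
    """
    hop = int(lwin * (1 - op))
    n_frames = (len(x) - lwin) // hop + 1
    return np.stack([x[i * hop:i * hop + lwin] for i in range(n_frames)])
```

Applying np.fft.rfft along the frame axis then gives the Lwin/2 + 1 = 65 frequency bins per frame used to build the spectrogram.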
For each frame x(n), n = 0, 1, …, Lwin − 1, the FFT is performed according to

X(k) = Σ_{n=0}^{Lwin−1} x(n) e^{−j2πnk/Lwin}, k = 0, 1, …, Lwin − 1,

where k is the index of the frequency bin. The magnitude |X(k)| is used to construct the spectrogram. Considering the conjugate symmetry of the Fourier transform, it is only necessary to take the first half of the FFT magnitudes; that is, we keep only an n-dimensional vector, with n = Lwin/2 + 1, after each FFT. After all m frames have been transformed by the FFT, an n × m matrix, namely the spectrogram, is built, as shown in Fig. 2. The column vectors correspond to the FFT magnitudes of the successive frames of one respiratory cycle. Another popular spectrogram-like feature employed in respiratory sound classification is the MFCCs, which are also frequently applied in speech recognition. MFCCs were introduced to separate the speech spectrum S(z) into the source U(z) (the periodic signal generated by the opening and closing of the vocal folds, which generates the pitch) and the vocal tract filter H(z), which changes according to the word being spoken. The spectrum can be represented as

S(z) = U(z)H(z).

This equates to the convolution of the source with the vocal tract filter in the time domain:

s[n] = u[n] ∗ h[n],

where s[n], u[n] and h[n] are the speech, source and filter responses, respectively. MFCCs incorporate the fact that the human auditory system is more sensitive to changes at lower frequencies (linear below 1000 Hz) than at higher frequencies (logarithmic above 1000 Hz). To model human pitch perception, a series of triangular filter banks, spaced linearly below 1000 Hz and logarithmically above 1000 Hz according to the mel scale, is applied to the speech spectrum. The mel scale is given as

f_mel = 2595 log10(1 + f/700),

where f_mel is the frequency converted to the mel scale and f is the frequency in the linear domain.
MFCCs are calculated in four steps, as follows. Step 1: Window the lung sound signal, perform the STFT on each windowed frame, and obtain its power spectrum Y(k) = |X(k)|², k = 0, 1, …, N − 1, where N is the frame size or STFT length.
Step 2: Apply the mel-scaled filter bank to the spectrum, as shown in Fig. 5. In this bank, with equal filter heights, the number of filters is M = 20 and the sampling frequency is 8 kHz.
Step 3: Calculate the log of the summed filter bank energies:

E(m) = log( Σ_k Y(k) H_m(k) ), m = 1, 2, …, M,

where H_m(k) is the m-th triangular filter. Step 4: Calculate the discrete cosine transform (DCT) of the log values to give the coefficients:

c(l) = Σ_{m=1}^{M} E(m) cos( l(m − 0.5)π/M ), l = 1, 2, …, L,

where L is the order of the MFCCs and usually lies in the interval [2, 15]. We choose L = 15 considering the granularity of the MFCCs; M is the number of filters in the bank.
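The four steps above can be sketched from scratch with NumPy (in practice a library such as librosa would be used; the filter bank construction below is a simplified, equal-height version and only illustrates the pipeline):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=20, n_fft=256, fs=8000):
    """Triangular filters with equal heights, spaced uniformly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor(mel_to_hz(mel_pts) / (fs / 2.0) * (n_fft // 2)).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc_frame(frame, fb, order=15):
    """MFCCs of one frame: power spectrum -> mel energies -> log -> DCT."""
    y = np.abs(np.fft.rfft(frame)) ** 2                   # Step 1: Y(k) = |X(k)|^2
    e = np.log(fb @ y + 1e-12)                            # Steps 2-3: log mel energies
    m_idx = np.arange(1, len(e) + 1)
    return np.array([np.sum(e * np.cos(l * (m_idx - 0.5) * np.pi / len(e)))
                     for l in range(1, order + 1)])       # Step 4: DCT
```

For a 256-point frame at 8 kHz with M = 20 filters and L = 15, each frame yields a 15-dimensional coefficient vector.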
Following the procedure above, we can get an example MFCCs image for a segment of lung sound, shown in Fig. 6. It should be noted that the size of the MFCCs image depends on the signal duration and the order of the MFCCs.

Platform and parameters setting
In this paper, the CNN for lung sound classification is implemented with PyTorch in Python and tested on our workstation running 64-bit Windows 10 with an Intel i7-7800X 3.50 GHz processor and an Nvidia RTX 3060 graphics card. We attempt to discover the relationship between classification performance and the parameters of the CNN. Lwin, OP and the feature type (spectrogram or MFCCs) are selected as the comparative parameters, as shown in Table 6.
The parameters in Table 6 may take different value combinations; there are 18 combinations in total. As mentioned above, the image size m × n of the feature depends on the parameter setting. For example, if Lwin = 128, OP = 75% and the feature type is the spectrogram, we have m = 1247 and n = 65. If the feature type is MFCCs, m is still 1247, but n is fixed at 15 regardless of the value of Lwin.
Classification performance is evaluated with tenfold cross-validation: 90% of the data is used for training and 10% for validation to avoid overfitting [26], which is common practice. We partition the dataset by patients, not by lung sounds, so that no lung sound from the same patient appears in both the training and the validation set. The validation set in each fold contains at least one class for every possible recording location.
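Patient-wise partitioning can be sketched as follows (a simplified round-robin fold assignment by patient ID; the paper's additional stratification by recording location is not reproduced here):

```python
def patient_folds(patient_ids, n_folds=10):
    """Assign each recording to a fold so that all recordings of one
    patient land in the same fold (round-robin over unique patients)."""
    unique = sorted(set(patient_ids))
    fold_of_patient = {p: i % n_folds for i, p in enumerate(unique)}
    return [fold_of_patient[p] for p in patient_ids]
```

By construction, no patient's recordings can be split between the training and validation parts of any fold.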
At the training stage, the CNN uses the Adam optimizer with a learning rate of 2 × 10−4 and a gradient decay factor of 0.5.

Performance criteria
The following five typical performance criteria, namely accuracy, specificity, sensitivity, precision and F1_score, are selected to evaluate the CNN model. Let C_i and T_i, i = c, w, b, n, denote the number of correctly recognized instances of class i and the total number of instances of class i in the test (or validation) set, respectively, where the symbols c, w, b and n stand for the classes crackle, wheeze, both crackle and wheeze, and normal, respectively [41].
Sensitivity (the true-positive rate) refers to the probability of a positive test conditioned on truly being positive; specificity (the true-negative rate) refers to the probability of a negative test conditioned on truly being negative. Sensitivity indicates the ability to correctly identify those with disease, and specificity the ability to correctly identify those without disease. Accuracy measures how well a classification test correctly determines whether a patient is healthy or not. Precision reflects how reliable the model is when classifying samples as positive. F1_score is defined as the harmonic mean of precision and sensitivity; it combines the two metrics into a single one and works particularly well on imbalanced data. These metrics are the most widely used to characterize the performance of classifiers such as CNNs and SVMs.
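A sketch of these criteria computed from a confusion matrix, using the ICBHI-style convention suggested by the C_i/T_i notation above, with the three abnormal classes treated as "positive" and the normal class as "negative" (the paper's exact formulas are not reproduced in this excerpt, so treat these definitions as assumptions):

```python
import numpy as np

def criteria(conf):
    """Sensitivity, specificity, accuracy, precision and F1 from a 4x4
    confusion matrix ordered (crackle, wheeze, both, normal), where
    conf[i, j] is the number of class-i instances predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    correct = np.diag(conf)    # C_i
    totals = conf.sum(axis=1)  # T_i
    # Classes 0..2 are abnormal (positive), class 3 is normal (negative).
    sensitivity = correct[:3].sum() / totals[:3].sum()
    specificity = correct[3] / totals[3]
    accuracy = correct.sum() / totals.sum()
    precision = correct[:3].sum() / conf[:, :3].sum()  # correct / predicted abnormal
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, accuracy, precision, f1
```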
In the following, the training and testing process is repeated 100 times with random splits, and the mean values and 95% confidence intervals (CIs) of these metrics are derived.

Results
The performance criteria and confusion matrices of the CNN when the spectrogram is used as the feature are shown in Table 7 and Fig. 7, respectively. For simplicity, only five typical parameter combinations are presented; under these combinations, the criteria of sensitivity, specificity, accuracy, precision and F1_score are given. The confusion matrices are exhibited only for the parameter combination Lwin = 256, OP = 75%. The performance criteria and confusion matrices of the CNN when MFCCs are used as the feature are shown in Table 8 and Fig. 8, respectively, for the same parameter combinations as in Table 7.
The above results show that the larger Lwin and OP are, the better the performance of lung sound classification. The larger Lwin, the greater the frequency resolution; the larger OP, the greater the time resolution. This means that larger frequency and time resolutions are beneficial to the classification performance of the CNN. However, once the frequency resolution reaches a certain value, the improvement in classification performance is no longer significant: when OP is kept at 75% and the window length increases from 128 to 256, no significant improvement is found at either the training or the test stage, and some performance criteria even begin to decrease. In addition, the larger Lwin is, the higher the computation and storage requirements of the FFT. A compromise should therefore be made between performance and computation or storage capacity when selecting Lwin and OP. From Tables 7 and 8, it can be concluded that Lwin = 128 and OP = 75% form a relatively optimal parameter combination. Comparing Tables 7 and 8, and Figs. 7 and 8, it can be seen that under the same parameter combination, at both the training and the test stage, the performance criteria of the CNN with the spectrogram feature are significantly better than those of the CNN with the MFCCs feature. The reason may lie in the fact that the resolution of MFCCs is weaker than that of the spectrogram. When machine learning models such as an SVM with MFCCs features are used to classify lung sounds, they may show advantages over DLMs. However, as a hand-crafted feature, MFCCs may introduce more bias in the generation process, resulting in an incomplete feature representation; this might be one of the reasons for their slightly worse performance with DLMs.
The receiver operating characteristic (ROC) curves under four different parameter combinations are shown in Fig. 9. The best performance is achieved under the parameter combination of Lwin = 256, OP = 75% and the spectrogram feature. However, with the spectrogram feature and OP = 75%, there is no significant improvement in performance compared to Lwin = 128: the AUCs (areas under the ROC curves) are 0.93 and 0.95 under the two settings, showing almost no difference. In summary, with a sampling frequency of 8 kHz, the parameter combination of Lwin = 128, OP = 75% and the spectrogram feature achieves performance superior to the same combination with the MFCCs feature. With this combination, the time complexity of the CNN is O(1241 × 59 × 7 × 7 × 64) = O(229614784), without considering the activation layers.

Discussion
The results show that using the spectrogram feature as input achieves better performance than using the MFCCs feature, and that under the same feature the classification performance may be affected heavily by the feature-related parameters. Generally speaking, the longer the frame length and the larger the overlap percentage of two successive frames, the better the classification performance.
The spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, the spectrogram is sometimes called a voiceprint or voicegram; it has been used as an indicator of the speaker and applied to speaker recognition, and it has become a widely used feature for classifying normal/abnormal lung sounds. For a segment of lung sound, the larger the frame length, the higher the frequency resolution but the lower the time resolution, so a compromise must be made between the two resolutions. From the experimental results, it can be concluded that frequency resolution contributes more to the classification performance than time resolution does. The reason may lie in the fact that a lower frequency resolution leads to a coarser voiceprint to some extent, and it is natural that more accurate classification cannot be achieved from coarse voiceprints.
MFCCs are commonly used as features in speech recognition systems and work relatively well there. Unfortunately, they did not score as highly as expected in this work. Because the maximum order of MFCCs is fixed, the granularity of the MFCCs feature is limited, and the differences between the features of the four types of lung sounds may not be represented significantly, making it difficult for the CNN to classify lung sounds accurately.
We make a comparative study with similar up-to-date works. In order to ensure comparability, the comparative study is limited to similar works using spectrogram-like features and the ICBHI 2017 dataset. The number of classes varies between these classifiers: 2 types (healthy/non-healthy, normal/abnormal), 3 types (wheeze/crackle/normal, healthy/non-chronic/chronic diseases, crackle/rhonchi/normal), 4 types (normal/crackle/wheeze/crackle and wheeze) and 6 types (healthy/bronchiectasis/bronchiolitis/COPD/pneumonia/URTI). Among them, only the works performing four-class classification were selected for comparison. The performance metrics for these works, along with the proposed one, are provided in Table 9; among these works, our proposed classifier with the recommended parameters achieves relatively better performance. Sometimes, especially in clinical analysis, we focus on the responsible section of the image rather than on classification accuracy [51]. Unlike medical images such as X-rays and CT images, lung sounds are acoustic signals, and physicians are trained to make diagnoses by auscultation. In order to use CNNs for lung sound classification, we must convert the acoustic signals into spectrogram-like images. These images are intermediate results and cannot be shown to physicians directly; even if the responsible section is marked by some method such as Grad-CAM, it has little significance in guiding the diagnosis of lung diseases because of the difficulty of perception [25]. So this part is not included in this paper.

Conclusions and outlook
The performance of a deep learning model, namely a CNN, under different parameter combinations and two types of features has been investigated in detail by experiments. Combined with the two feature types, the two parameters of frame length and overlap percentage (OP) of successive frames are emphasized. The spectrogram and the MFCCs of lung sounds are used as features for the CNN, respectively. The training and test results show significant differences in performance under the various parameter combinations and features. From the results, we can see that OP is a performance-sensitive parameter: the higher the OP, the better the overall performance, but more computation and storage resources are needed, so OP is restricted to a maximum of 75% for practical purposes. We fix the sampling frequency at 8 kHz without loss of important characteristics of the lung sounds, because the maximum frequency of lung sounds is not above 3 kHz. When the frame size increases to 128 or more, the improvement in performance is slight; we can hardly see a significant difference between the performance metrics of the CNN with frame sizes of 128 and 256. However, when the frame size decreases from 128 to 64 or less, the performance of the CNN degrades rapidly. It can also be seen that the CNN with the spectrogram feature performs better than the one with the MFCCs feature under the same parameter combination. It is therefore concluded that frame size 128, OP 75% and spectrogram input is the optimal parameter setting, under which a compromise between performance and resource requirements can be reached.
In the future, on the one hand, we will evaluate the performance by considering more parameters or other deep learning models. For example, other features and data augmentation methods will be tried: background noise recorded in the hospital will be inserted into the audio samples instead of white noise, and the log-scale spectrogram is another candidate feature. In addition, we could compare the performance of the CNN with another deep learning model, such as an RNN. We will also try to combine more open respiratory databases, such as RespiratoryDatabase@TR [20], for CNN training and testing. On the other hand, the CNN presently runs on a GPU platform. For practical purposes, the model should be simplified so that it can be transferred to an embedded platform. The most ideal implementation is to train and test a lightweight CNN [30, 51] and run it on an electronic stethoscope, which would help the physician distinguish normal/abnormal lung conditions as quickly as possible.

Fig. 1 Magnitude characteristic of the band-pass filter

(2) Every time the CNN model is trained, we first select 750 of the records with wheeze at random and add noise to them at SNRs of 10 dB, 15 dB and 20 dB, respectively, expanding the number of wheeze-labeled records to about 1600.
(3) Every time the CNN model is trained, we select 180 crackle records at random and divide them evenly into three parts of 60 records each. The three parts are disturbed by noise at the three SNRs of 10 dB, 15 dB and 20 dB, respectively, expanding the number of crackle records to about 2000.
(4) Every time the CNN model is trained, 2000 records are selected at random from the normal records as training and test data. Finally, the distribution of the augmented lung sound data is shown in Table
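The SNR-controlled noise injection described above can be sketched as follows. This is a simplified illustration, not the authors' exact code: the noise is scaled so that the signal-to-noise ratio of the mixture matches the target value in dB, and the sine-wave record and white-noise source are placeholder assumptions.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Mix `noise` into `signal`, scaled so the result has the target SNR (dB)."""
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise ** 2)
    # Target: P_signal / P_scaled_noise = 10 ** (snr_db / 10)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(0)
sig = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # placeholder record
noise = rng.standard_normal(8000)                        # white noise
for snr in (10, 15, 20):                                 # SNRs used in the paper
    noisy = add_noise_at_snr(sig, noise, snr)
```

Because the scale factor is derived from the measured signal and noise powers, the resulting SNR matches the target exactly up to floating-point error.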

Fig. 4 The overlap points between successive frames

Fig. 7 The confusion matrices with spectrogram feature

Fig. 8 The confusion matrices with MFCCs feature

…0.93 and 0.95 under the two settings, showing almost no difference. In summary, it can be seen that with the sampling frequency of 8 kHz, the parameter combination of Lwin 128, OP 75% and the spectrogram feature achieves performance superior to the same combination with the MFCCs feature. With this combination, the time complexity of the CNN is O(1241 × 59 × 7 × 7 × 64) = O(229614784), without considering the activation layer.
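The multiply-accumulate count quoted above can be verified by direct arithmetic; this one-liner simply evaluates the product of the factors stated in the text.

```python
# Complexity of the stated convolution: output map of 1241 x 59 positions,
# each computed with a 7 x 7 kernel over 64 channels.
macs = 1241 * 59 * 7 * 7 * 64
print(macs)  # 229614784, matching O(229614784) in the text
```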

Fig. 9 ROC curves under four different parameter combinations

Table 1 Some frequently used lung sounds datasets

Table 2 Summary of studies conducted on lung sounds classification

Table 3 Distribution of classes in ICBHI 2017

Table 4 Distribution of classes in ICBHI 2017 after white noise addition

Table 6 Parameter and its possible values

Table 7 Performance criteria with SG feature

Table 8 Performance criteria with MFCCs feature. The optimum values of the performance criteria at the training and test stages are highlighted in bold and italics, respectively

Table 9 Performance comparison between the proposed work (Lwin = 128, OP = 75%, SG feature) and state-of-the-art works as quadruple classifiers based on ICBHI 2017. The optimum values of the performance criteria are highlighted in bold. Acc accuracy, Sen sensitivity or recall, Pre precision, Spe specificity