Learning Representations from Heart Sound: A Comparative Study on Shallow and Deep Models

Leveraging the power of artificial intelligence to facilitate automatic analysis and monitoring of heart sounds has attracted tremendous efforts in the past decade. Nevertheless, the lack of a standard open-access database made it difficult to maintain sustainable and comparable research before the first release of the PhysioNet CinC Challenge Dataset. However, inconsistent standards in data collection, annotation, and partitioning still restrain a fair and efficient comparison between different works. To this end, we introduced and benchmarked a first version of the Heart Sounds Shenzhen (HSS) corpus. Motivated and inspired by previous works based on HSS, we redefine the tasks and make a comprehensive investigation of shallow and deep models in this study. First, we segment the heart sound recordings into shorter clips (10 s), which is more similar to the human auscultation case. Second, we redefine the classification tasks. Besides the 3 class categories (normal, mild, and moderate/severe) adopted in HSS, we add a binary classification task, i.e., normal versus abnormal. We provide detailed benchmarks based on both classic machine learning and state-of-the-art deep learning technologies, which are reproducible using open-source toolkits. Last but not least, we analyze the feature contributions of the best benchmark performance to make the results more convincing and interpretable.


Introduction
Auscultation, as a simple, convenient, cheap, noninvasive, and efficient clinical method, has been used by physicians for over a century [1]. Accurately monitoring and understanding heart sounds can be essential for the early-stage diagnosis and proper management of cardiovascular diseases (CVDs), which rank as the leading cause of death globally [2][3][4]. However, training medical interns to make efficient use of their stethoscopes is not an easy task [5]. With the fast development of machine learning (ML) and its cutting-edge subset, deep learning (DL), automatic analysis and monitoring of heart sounds has attracted numerous efforts from both the medical and engineering communities [1,6]. Due to its noninvasive nature, heart sound classification can be a feasible and efficient way not only for early, cost-effective screening of CVDs but also for managing the progression of their condition [1]. Furthermore, this line of study can benefit Internet of Things-based assisted living [7], personalized healthcare [8], and smart home monitoring systems [9].
Nevertheless, it remained a challenge for a long time for researchers to train and validate their automatic heart sound classification algorithms due to the lack of high-quality, rigorously validated, and standardized open-access databases of heart sound recordings [6]. To address this challenge, Liu et al. released the 2016 PhysioNet/Computing in Cardiology (CinC) Challenge heart sound recordings as the first large open-access standard heart sound database [10]. Besides, the authors provided an open-source toolkit for implementing automatic segmentation and classification of heart sounds. On one hand, the CinC database dramatically promoted and encouraged the scientific community to research and develop algorithms for the heart sound classification task, including both classic ML methods requiring human hand-crafted features and state-of-the-art DL techniques that can learn higher representations from the raw signal itself [6,10]. On the other hand, some limitations still exist in the CinC database: First, multicenter data collection (from 8 different sources) makes CinC inconsistent in the data acquisition system, environment, and annotation process. This may raise issues when building an intelligent model free from external uncertainties or interference. Second, the CinC database ignored a subject-independent data partition, which may result in an overoptimistic evaluation of the final performances. Third, more reasonable evaluation metrics for imbalanced data sets, e.g., the unweighted average recall (UAR), were not used in the CinC Challenge. To this end, we proposed our first version of an open-access, standard, subject-independent heart sound database, i.e., the Heart Sounds Shenzhen (HSS) corpus [11]. In [11], a basic comparison between classic ML models (support vector machines [SVMs]) and DL models (long short-term memory [LSTM]/gated recurrent unit [GRU] RNNs) was investigated. However, human hand-crafted features were used in that study, which cannot give the audience a view of state-of-the-art techniques that learn high-level features automatically from the heart sound via DL. In fact, some recently published literature gives encouraging results showing the trend of learning heart sound features in an unsupervised learning paradigm. However, a comprehensive study of state-of-the-art representation learning paradigms for the heart sound classification task is lacking. To this end, we introduce this work, which includes transfer learning, sequence-to-sequence learning, and end-to-end learning approaches for heart sound classification. Because the inner workings of these models are usually not transparent, we utilize Shapley values [12,13] to evaluate the global feature contributions. To the best of our knowledge, this is the first comprehensive investigation of its kind on the heart sound classification task.
The main updates and contributions of this work are as follows: Firstly, we use a shorter duration (10 s) for the heart sound recordings, which is more similar to the real human auscultation scenario in clinical practice. Secondly, besides the same task as in HSS, we add another subtask for classifying normal/abnormal heart sounds. We think this binary classification can be important for fast early clinical screening or in-home monitoring of subjects who suffer from chronic CVDs. Thirdly, we present benchmarks of both classic ML models and state-of-the-art DL paradigms for the heart sound classification task. Moreover, interpretation experiments on the best benchmark are given to examine the feature importance. In addition, all the results are reproducible based on our open-source toolkits. Last but not least, we hope this study can not only benefit the study of automatic heart sound classification but also facilitate other domains using cutting-edge machine listening techniques for healthcare or social well-being applications.
The remainder of this work is organized as follows: the background and related work are introduced in Background and Related Work. Then, the proposed database and tasks are described in Database and Tasks. In Methods and Toolkits, we introduce the methods and toolkits used in this study in detail. The experimental results are presented in Results, followed by a discussion in Discussion. Finally, we conclude this article in Conclusion.
Background and Related Work

The heart sound, a.k.a. phonocardiogram (PCG), has been studied and investigated as a potential marker for both personal identification [14] and CVD diagnosis [1] in tremendous prior works. As a noninvasive and inexpensive method by nature (cf. [15]), automatic computer-assisted analysis of PCG signals can not only substantially improve the diagnosis accuracy of CVDs [16] but also avoid a cumbersome and expensive examination by echocardiography [17]. In early works [1], human hand-crafted features, e.g., Mel-frequency cepstral coefficients (MFCCs) [18], were used to help build classic ML models, e.g., the support vector machine (SVM) [19], mapping the heart sound signals to the target predictions, e.g., normal/abnormal clips. In particular, the whole heart sound recording should first be segmented into its fundamental components [20] (see Fig. 1). In this scenario, electrocardiography is usually used as an auxiliary signal to identify the locations of the fundamental components in the cardiac cycle [1]. Nevertheless, as indicated by Dwivedi et al. [1], this kind of method has several disadvantages: it requires a secondary signal, has difficulties in sensing and synchronization, depends on the pathological conditions of subjects, and is complex in computation and processing. Therefore, one recent trend in the relevant studies is to build a holistic automated system for the diagnosis of CVDs via PCG signals without any segmentation step [11].
Moreover, the ever-changing paradigms of ML, specifically its hottest subset, DL, can learn higher representations from the data itself. On one hand, we are encouraged to see that state-of-the-art approaches like sequence-to-sequence learning [21] and transfer learning [22] have been implemented successfully in the heart sound classification task, which can get rid of time-consuming and expensive human-designed features. On the other hand, the available standard open-access heart sound databases are still insufficient. This obstacle not only restrains the power of DL models to learn robust and efficient representations but also makes it difficult to conduct reproducible and sustainable research in the field. To this end, we proposed the first version of our open-access heart sound database, i.e., HSS [11,23]. To the best of our knowledge, HSS is currently the largest open-access heart sound database collected from a single medical center using consistent methods of data collection and annotation. A brief summary of the published literature based on HSS can be found in [11]. In this updated database, we use 10-s clips as the instances of heart sound, which are shorter than in HSS (30 s). This shorter duration makes it more challenging to learn sufficient information inherent in the heart sounds, for both machines and humans. In fact, in clinical practice, physicians usually perform the auscultation in one check lasting approximately 10 s [24]. Additionally, apart from the same 3-class classification task as in HSS [11], we add a binary classification task in this study. In this scenario, normal and abnormal heart sounds need to be classified, which can be crucial both for prescreening in clinical practice and for the in-home monitoring of subjects who suffer from long-term chronic CVDs [25].

Database and Tasks
In this section, we give information on the proposed database and define the tasks to be addressed. This study was approved by the ethics committee of the Shenzhen University General Hospital, Shenzhen, P. R. China. All participants agreed to the use of their data for research purposes.

Database
The database is based on the whole original data of HSS [11] but segmented into clips of shorter duration. All of the original audio recordings in HSS were segmented into 10-s clips with 5-s neighboring overlaps. In total, 170 subjects (female: 55, male: 115, age: 65.4 ± 13.2 years) participated in the data collection. The subjects had a variety of health conditions including hypertension, hyperthyroidism, arrhythmia, coronary heart disease, heart failure, valvular heart disease, and congenital heart disease, amongst others. All the heart sound audio recordings were recorded via an electronic stethoscope (Eko CORE, USA, Bluetooth 4.0, 4-kHz sampling rate) from 4 auscultatory locations on the body (see Fig. 2), i.e., the mitral, aortic, tricuspid, and pulmonary valve auscultation areas.
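The clip segmentation described above (10-s windows with 5-s neighboring overlaps) can be sketched as follows; this is a minimal illustration of the windowing scheme, not the preprocessing script used to build the corpus:

```python
import numpy as np

def segment_recording(signal, sr=4000, clip_s=10, overlap_s=5):
    """Split a recording into fixed-length clips with neighboring overlap."""
    clip_len = clip_s * sr
    hop = (clip_s - overlap_s) * sr
    clips = []
    start = 0
    while start + clip_len <= len(signal):
        clips.append(signal[start:start + clip_len])
        start += hop
    return np.stack(clips) if clips else np.empty((0, clip_len))

# A 30-s recording at 4 kHz yields five 10-s clips with 5-s overlaps.
clips = segment_recording(np.zeros(30 * 4000))
```

Under this scheme, each original 30-s HSS recording contributes several overlapping instances, which explains the growth in the total number of clips reported later.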

Tasks
In this work, we have 2 subtasks, i.e., Task 1: classification of normal, mild, and moderate/severe heart sounds (see Fig. 3); Task 2: classification of normal and abnormal (which subsumes the labels mild and moderate/severe) heart sounds. In this study, the data partition is the same as in HSS [11], which is under the subject-independent condition. The details of the data partition can be found in Table 1.

Methods and Toolkits
In this section, we give details of the methods used in this study.
Besides, the open-source toolkits will also be introduced to reproduce this research.

Classic ML models
In the classic ML paradigm, human hand-crafted features, designed as the front end with experts' domain knowledge, are crucial for further model building. In the first step, low-level descriptors (LLDs) are extracted from the audio signals (i.e., heart sounds in this study). Subsequently, supra-segmental features [26] summarizing statistical information can be obtained from the LLDs over a given period of the signals. In this study, we use the LLDs of the ComParE feature set (see Table 2), extracted by the popular open-source toolkit openSMILE. This standard feature set (including temporal and spectral acoustic properties) has been used successfully in the previous editions of the ComParE challenges since 2013 [27].
For the back end, we use the popular SVM model [19] for its stable and efficient performance in the previous study [11]. For the implementation of the SVM, we select the open-source toolkit LIBSVM [28].

Statistical functionals
The statistical functionals (func.), containing the mean, standard deviation, extremes, etc., are calculated over a given period of one audio clip (see Fig. 4). In this study, we use the default func. configuration of the ComParE feature set (see Table 3), which results in 6,373 features by applying the func. to the LLDs and their first delta values.
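The principle can be sketched with numpy; this toy version uses only 4 functionals (mean, standard deviation, min, max) applied to the LLDs and their deltas, whereas the actual ComParE configuration applies a much richer set of functionals to yield 6,373 features:

```python
import numpy as np

def functionals(llds):
    """Summarize frame-level LLDs (n_frames x n_llds) into a fixed-length
    vector of statistical functionals, independent of clip length."""
    funcs = [np.mean, np.std, np.min, np.max]
    feats = [f(llds, axis=0) for f in funcs]
    # Apply the same functionals to the first-order deltas of the LLDs.
    deltas = np.diff(llds, axis=0)
    feats += [f(deltas, axis=0) for f in funcs]
    return np.concatenate(feats)

# Two clips of different lengths map to vectors of identical dimension,
# assuming 65 LLDs per frame (as in the ComParE set).
v1 = functionals(np.random.randn(300, 65))
v2 = functionals(np.random.randn(500, 65))
```

This length independence is what allows a fixed-input classifier such as the SVM to consume clips of varying duration.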

Bag-of-audio-words approach
The bag-of-audio-words (BoAW) approach was derived from the concept of bag-of-words (BoW, cf. [29]), which has been successfully applied in the domains of natural language processing [30] and computer vision [31,32]. In our proposed BoAW approach (see Fig. 5), a codebook was generated from the acoustic LLDs/deltas via a seeded random sampling process following the initialization step of k-means++ clustering [33].
When calculating the histograms, each LLD/delta is assigned to the 10 audio words from the codebook having the lowest Euclidean distance.In this study, both BoAW representations from the LLDs and their deltas are concatenated.We use a logarithmic term frequency weighting to compress the numeric range of the resulting histograms.The LLDs and their deltas are extracted using the openSMILE toolkit [34] with the ComParE feature set.The BoAW approach is implemented by the openXBOW toolkit [35].For optimizing the codebook size N c , we investigate 125, 250, 500, 1,000, and 2,000 in this study.
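The histogram computation described above can be sketched in numpy as follows; this is an illustrative stand-in for openXBOW, using a randomly sampled codebook of 125 words and 10 assignments per frame (two of the settings named above), not the toolkit's actual implementation:

```python
import numpy as np

def boaw_histogram(llds, codebook, n_assign=10):
    """Quantize frame-level LLDs against a codebook and return a
    log-compressed term-frequency histogram (the BoAW representation)."""
    hist = np.zeros(len(codebook))
    for frame in llds:
        dists = np.linalg.norm(codebook - frame, axis=1)
        # Each frame is assigned to the n_assign closest audio words.
        for idx in np.argsort(dists)[:n_assign]:
            hist[idx] += 1
    return np.log1p(hist)  # logarithmic term-frequency weighting

rng = np.random.default_rng(0)
llds = rng.normal(size=(200, 65))
# Codebook by seeded random sampling of training frames.
codebook = llds[rng.choice(len(llds), size=125, replace=False)]
h = boaw_histogram(llds, codebook)
```

In the full pipeline, one such histogram is built from the LLDs and another from their deltas, and the two are concatenated into the final feature vector.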

DL models
DL [36], which can extract higher representations from the data by using a series of nonlinear transformations of the inputs, is dramatically changing the paradigms of ML. In particular, compared with classic ML methods (shallow models), DL models can learn more robust and generalized features when the data size becomes large. In this study, we investigate 3 typical DL methods, i.e., the deep spectrum transfer learning method (see Fig. 6) using pretrained DL models [22], recurrent sequence-to-sequence autoencoders (S2SAEs) [21], and end-to-end (E2E) learning models [37].

Deep spectrum transfer learning
In this method, heart sound signals are first transformed into Mel spectrograms (128 Mel frequency bands are computed) using a Hanning window with a 32-ms width and a 16-ms overlap. Then, the generated spectrograms are forwarded through a pretrained deep convolutional neural network (CNN) [38].

Fig. 4. The diagram of the statistical functionals approach. We calculate a series of statistical functionals (e.g., max., min., mean) from the frame-level LLDs (e.g., MFCCs) extracted from the heart sound signal, which makes the representation independent of the length of the instances.
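The spectrogram front end of the deep spectrum method can be sketched as below. This is a minimal numpy illustration of the windowed short-time analysis (32-ms Hanning window, 16-ms hop at the 4-kHz sampling rate); the mapping to 128 Mel bands, which in practice would be done with an audio library's Mel filterbank, is omitted here for brevity:

```python
import numpy as np

def log_spectrogram(signal, sr=4000, win_ms=32, hop_ms=16):
    """Compute a log-power spectrogram with a Hanning window
    (32-ms width, 16-ms hop at a 4-kHz sampling rate)."""
    win_len = int(sr * win_ms / 1000)   # 128 samples
    hop = int(sr * hop_ms / 1000)       # 64 samples
    window = np.hanning(win_len)
    frames = []
    for start in range(0, len(signal) - win_len + 1, hop):
        frame = signal[start:start + win_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2
        frames.append(10 * np.log10(power + 1e-10))
    return np.array(frames)  # shape: (n_frames, win_len // 2 + 1)

spec = log_spectrogram(np.random.randn(10 * 4000))  # one 10-s clip
```

The resulting time-frequency image is what gets rendered and forwarded through the pretrained CNN in the deep spectrum paradigm.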

Recurrent S2SAEs
In this approach, the first step is the same as in the aforementioned deep spectrum transfer learning method, namely, Mel-scale spectrograms are generated from the raw heart sound data. In addition, power levels in those spectrograms are clipped below certain predefined thresholds to eliminate some background noise (in this study, power levels are clipped below 4 different thresholds, i.e., −30, −45, −60, and −75 dB). Then, in an unsupervised scenario, i.e., without any labels, a distinct recurrent S2SAE is trained on each of those sets of spectrograms. Finally, the learnt representations of a spectrogram are concatenated to form the feature vector of the corresponding instance. We use the auDeep toolkit [43] to implement the S2SAE method (see Fig. 7) in this study.

End-to-end learning
The E2E model utilizes a series of CNNs [38] and/or recurrent neural networks (RNNs) [44] to extract higher representations directly from the raw heart sound audio waveforms (see Fig. 8). Previous studies have achieved success with E2E models in the analysis of music [45], speech emotion [46], and snore sounds [37]. In this study, we use the deepSELF toolkit [47] for the E2E model implementation. To overcome the vanishing gradient problem in RNN training [48], we use the LSTM [49] and the GRU [50] cells when building the deep RNN models.
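To illustrate what "directly from the raw waveform" means, the first stage of such a network can be sketched as a strided 1-D convolution over the audio samples. This is a toy numpy stand-in for one convolutional block (the filter count, width, and stride below are arbitrary illustration values, not the deepSELF configuration):

```python
import numpy as np

def conv1d_block(x, kernels, stride=4):
    """One convolutional block: strided 1-D convolution over the raw
    waveform followed by a ReLU nonlinearity."""
    width = kernels.shape[1]
    n_out = (len(x) - width) // stride + 1
    out = np.zeros((kernels.shape[0], n_out))
    for k, kern in enumerate(kernels):
        for i in range(n_out):
            seg = x[i * stride: i * stride + width]
            out[k, i] = max(0.0, float(seg @ kern))  # ReLU
    return out

rng = np.random.default_rng(0)
wave = rng.normal(size=4000)         # 1 s of audio at 4 kHz
kernels = rng.normal(size=(8, 64))   # 8 (normally learned) filters, width 64
features = conv1d_block(wave, kernels)
```

In a real E2E model, several such blocks are stacked (optionally followed by LSTM/GRU layers) and the filters are learned jointly with the classifier, so no hand-crafted feature extraction is needed.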

Evaluation metrics
We use the UAR as the main evaluation metric in this study, taking the imbalanced nature of the data into account.
Compared to the widely used weighted average recall (WAR), i.e., the accuracy, the UAR is more reasonable and rigorous for an imbalanced database [51]. It is defined as:

UAR = \frac{1}{N_{class}} \sum_{i=1}^{N_{class}} Recall_i,

where Recall_i and N_class are the recall of the i-th class and the number of classes, respectively. The WAR (accuracy) can be defined as:

WAR = \sum_{i=1}^{N_{class}} \lambda_i \cdot Recall_i, \quad \lambda_i = \frac{N_i}{N},

where λ_i is the weight for the i-th class, N_i is the total number of instances labeled as the i-th class, and N is the total number of instances.
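The two metrics, and why the UAR is stricter on imbalanced data, can be sketched in a few lines of numpy (a minimal illustration, not the evaluation code of the study):

```python
import numpy as np

def uar(y_true, y_pred):
    """Unweighted average recall: the mean of the per-class recalls."""
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return np.mean(recalls)

def war(y_true, y_pred):
    """Weighted average recall, i.e., the overall accuracy."""
    return np.mean(y_true == y_pred)

# An imbalanced example (90 vs. 10): a majority-class guesser reaches
# a high WAR but only chance-level UAR.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)
```

Here the trivial classifier scores a WAR of 0.9 but a UAR of only 0.5, which is exactly the chance level for 2 classes.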
In addition, when comparing 2 results, we use a significance test, i.e., the one-tailed z-test [52]. The difference can be regarded as significant when the P value is less than 0.05.

Explainable method SHAP
In many applications, understanding why a model makes a particular prediction can be just as important as the accuracy of the prediction itself. To interpret the contribution of each feature to the prediction, we utilize SHAP (SHapley Additive exPlanations) [12] to explain how the features affect the predictions. The explanation model is defined as follows:

g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i,

where g(z′) is the explanation model, ϕ_0 is a constant (usually the mean value of the target variable over all samples), ϕ_i is the Shapley value of feature i, z′_i ∈ {0, 1}, and M is the number of input features.

Fig. 7. The diagram of the recurrent-autoencoder-based S2SAE approach. In this approach, an unsupervised scenario is used to learn higher representations. The network is trained to minimize the root mean square error between the input sequence and its reconstruction. When training is complete, we regard the activations of the fully connected layer as the representations of the input sequence.
In contrast to traditional feature importance, a Shapley value can be "positive" or "negative". When ϕ_i > 0, the feature i improves the prediction of the model, i.e., has a positive effect. Conversely, ϕ_i < 0 indicates that the feature i leads to a decrease in the predicted value, i.e., has a negative effect. In addition, the greatest advantage of SHAP is that it is able to reflect the influence of the features in each individual sample.
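The additive property above can be verified on the one case where Shapley values have a closed form: for a linear model with independent features, ϕ_i = w_i (x_i − E[x_i]), and the values sum exactly back to the model output. A minimal numpy check (the weights and data below are arbitrary illustration values):

```python
import numpy as np

# For a linear model f(x) = w·x + b with independent features, the exact
# Shapley value of feature i is phi_i = w_i * (x_i - E[x_i]); the values
# then satisfy the additive property f(x) = phi_0 + sum_i phi_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
w, b = np.array([0.5, -1.0, 2.0, 0.0]), 0.3

x = X[0]                            # the sample to be explained
phi0 = (X @ w + b).mean()           # mean model output over all samples
phi = w * (x - X.mean(axis=0))      # per-feature Shapley values
reconstruction = phi0 + phi.sum()
```

For nonlinear models such as the SVM used here, SHAP estimates these values numerically, but the same additive decomposition of each prediction holds.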

Experimental Results
In this section, we first describe the experimental settings. Then, we show the results achieved in this work.

Setup
All the experiments in this study are run by Python-based scripts. For reproducibility, we use open-source toolkits to implement the 5 aforementioned methods. For the SVM model, we use the Python scikit-learn toolkit (a linear kernel is chosen), which is based on the LIBSVM toolkit [28]. [https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html] All the hyperparameters are tuned and optimized on the dev set and then applied to the test set. When presenting results, the dev results are those achieved by the optimal models, and the test results are those validated by the models with hyperparameters optimized on the dev set. To minimize the effects of the imbalanced data, we use an upsampling technique to replicate the instances belonging to the scarce class. All the features are standardized before being fed into the classifier by using the mean and standard deviation values from the train set.
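The last two preprocessing steps, upsampling of the scarce class and train-set-based standardization, can be sketched as follows (a minimal numpy illustration of the scheme, not the experiment scripts themselves):

```python
import numpy as np

def upsample(X, y, rng):
    """Replicate instances of minority classes until every class
    matches the size of the largest class."""
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        c_idx = np.where(y == c)[0]
        extra = rng.choice(c_idx, size=n_max - n, replace=True)
        idx.extend(c_idx)
        idx.extend(extra)
    return X[idx], y[idx]

def standardize(train, other):
    """Standardize both sets with mean/std computed on the train set only."""
    mu, sd = train.mean(axis=0), train.std(axis=0) + 1e-12
    return (train - mu) / sd, (other - mu) / sd

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([0] * 80 + [1] * 20)
Xb, yb = upsample(X, y, rng)
Xn, _ = standardize(Xb, Xb)
```

Computing the standardization statistics on the train set alone avoids leaking information from the dev and test partitions into the model.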

Results
The experimental results of the proposed 5 methods are listed in Table 4. We can see that, for both Task 1 and Task 2, the ComParE feature set-based models dominate the best performances. Consistent with our previous studies [11,23], well-designed human expert hand-crafted features are currently important for building an efficient and robust model for heart sound classification. For both Task 1 and Task 2, the best single models are achieved by the classic ML model, i.e., the ComParE func. plus SVM classifier. The corresponding best UARs for the 3-class and binary tasks are 48.8% (chance level: 33.3%) and 58.6% (chance level: 50.0%), respectively (chance level: the level that would be expected from random choices). These results are all significantly better than those of the E2E models in this study (P < 0.001, one-tailed z-test). A late fusion (by majority vote) of the best 4 models reaches a comparable performance. For Task 1, the best fusion model falls short of the best single model (UAR: 47.2% versus 48.8%). For Task 2, the best fusion model shows a very slight improvement over the best single model (UAR: 58.7% versus 58.6%).
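The late fusion by majority vote can be sketched as below; this toy version breaks ties toward the smallest class label (an assumption for illustration, since the tie-breaking rule is not specified above):

```python
import numpy as np

def majority_vote(predictions):
    """Late fusion: per-instance majority vote over model predictions
    (rows: models, columns: instances)."""
    fused = []
    for column in predictions.T:
        vals, counts = np.unique(column, return_counts=True)
        fused.append(vals[np.argmax(counts)])  # ties -> smallest label
    return np.array(fused)

# Three hypothetical models voting on five instances.
preds = np.array([[0, 1, 1, 2, 0],
                  [0, 1, 2, 2, 1],
                  [1, 1, 1, 0, 0]])
fused = majority_vote(preds)
```

Such a vote can only help when the fused models make complementary errors, which matches the observation that fusion barely moves the UAR here.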
The SHAP interpretation experiments are conducted on the whole test set (815 samples), and we analyze the feature contributions for the top 2 results of the baseline (i.e., ComParE 10^-4 and ComParE 10^-3). The SHAP explanation of the best performance achieved with the ComParE feature set is shown in Fig. 10. From the figure, it can be seen that the functional-transformed MFCC and RASTA-filtered auditory spectrum features have the highest feature contributions, and features based on the PeakMeanRel functional (relative mean value of the peaks of a specific feature) usually contribute more.
Tables 5 and 6 show the confusion matrices of the best 4 models and their late fusion results on the test set for Task 1 and Task 2, respectively. For Task 1, the recalls of the "Mild" and "Moderate/Severe" types of heart sounds are higher than that of the "Normal" type. The fusion of the best 3 models can slightly improve the recall of the "Moderate/Severe" type compared with the individual models, whereas the recalls of the other 2 types remain low. For Task 2, the recalls of the "Abnormal" type are much higher than those of the "Normal" type. In other words, the "Normal" instances tend to be recognized as "Abnormal". Fusion of the best 3 models leads to the highest recall of the "Abnormal" type.
Figure 9 illustrates the complementary metrics achieved by the best 4 models and their late fusion results on the test set for Task 2. It can be seen that most of the models show higher results in acc., sens., prec., and F1-score, and lower performances in spec. and G-mean. Fusion of the best 4 models cannot dramatically improve the single models' performances.

Discussion
In this section, we first summarize the current findings of this study. Then, we discuss the limitations of this work and give our perspectives on future work.

Current findings
In this study, the ComParE feature set-based methods (func. and/or BoAW + SVM) are superior to the other methods, in particular, for reaching the best results (see Table 4). We can see that, at the current stage, finding efficient acoustic representations of the heart sound is the prerequisite. With the SHAP interpretation, we find that the MFCC and RASTA-filtered auditory spectrum features perform better in heart sound classification.
In contrast, the performances of the 3 investigated DL-based methods are modest. We segmented the audio clips into a shorter duration (10 s) compared with HSS 1.0 (around 30 s), which increased the total number of instances from 845 in HSS 1.0 to 4,225 in this study. Nevertheless, the DL models could not generate better or more robust performances than the classic ML models. On one hand, we enjoy the benefits of DL methods, which get rid of the expensive, time-consuming, and inefficient human expert feature engineering process. On the other hand, we find it difficult to train a sufficiently robust and generalized DL model using the currently limited heart sound data resources.
Encouraging results can be found in acc., sens., prec., and F1-score. Specifically, the sensitivity (for detecting "Abnormal" heart sounds) can be higher than 70.0%, which can benefit the potential emergency care of subjects suffering from long-term chronic CVDs. However, the performances in spec. and G-mean need to be improved.

Limitations and perspectives
The extreme data imbalance is still the first challenge limiting the current performances of all the models. Even though the current best models can beat the chance level for both tasks, the overall UARs are modest. In future work, we will continuously collect more "Normal" heart sounds to enrich the heart sound database. Additionally, we will involve more advanced technologies like generative adversarial networks [53], which have been demonstrated to be efficient in the snore sound classification task [54].
Secondly, fundamental studies on heart sound feature analysis are lacking. The relationship between the acoustic properties and the anatomical changes in the heart under different kinds of CVDs is still unclear. Furthermore, more advanced signal processing methods should be investigated, such as the tunable-Q wavelet transform [55], scaled spectrograms, and tensor decomposition [56].
Thirdly, we should overcome the big gap between the performances on the dev and test sets. We think that the subject independency makes it difficult to reach high performances compared with other works based on subject-dependent data partitions (e.g., the PhysioNet CinC Challenge Dataset [10]). This overfitting challenge should be overcome by developing more generalized models in future work. Last but not least, more attention and contributions should be attracted to this field. As a noninvasive method, heart sound analysis via ML methods has promising potential not only in real clinical applications but also in in-home healthcare services. We will organize a series of open challenges and workshops in the future to facilitate this research.

Conclusion
In this study, we segmented the audio recordings in HSS into 10-s clips, which means that an accurate prediction of heart status is needed from a shorter duration of audio recording (around 30 s in HSS). In addition, we added a binary classification task (normal/abnormal detection) as a subtask in this work. Both classic ML and cutting-edge DL methods were investigated and compared using our open-source toolkits, by which the results can be easily reproduced. In this benchmark study, the best result for the 3-class classification task was a UAR of 48.8% (chance level: 33.3%). The best result for the binary classification task was a UAR of 58.7% (chance level: 50.0%). Through the SHAP interpretation experiments, we analyzed the contribution of each feature over the entire test set and found that the acoustic features MFCC and RASTA-filtered auditory spectrum performed best in heart sound classification. We hope this new database and its benchmarks can further the relevant studies in a broad scientific community.

Author contributions: W.J. and B.H. participated in the design and interpretation of the reported experiments or results. K.Q., Z.Z., F.D., Y.J., B.D., Z.D., and B.H. participated in the acquisition and analysis of data. K.Q., Z.B., F.D., M.S., J.S., and B.H. were primarily responsible for a particular, specialized role in the research. K.Q., B.H., and Y.Y. provided administrative, technical, or supervisory support.
Competing interests: The authors declare that they have no competing interests.

Fig. 1 .
Fig. 1. The normalized time waveform of the PCG (heart sound) signal. The fundamental components include S1 (caused by the closure of the mitral and tricuspid valves) and S2 (caused by the closure of the aortic and pulmonary valves). au, arbitrary unit.

Fig. 6 .
Fig. 6. The diagram of the deep spectrum transfer learning approach. In this paradigm, we use a pretrained deep CNN model (e.g., AlexNet) to extract higher representations from the spectrograms transformed from the heart sounds. Then, a classifier (e.g., SVM) makes predictions based on the extracted representations.

Fig. 8 .
Fig. 8. The diagram of the end-to-end learning approach. The higher representations are extracted directly from the raw heart sound time waveforms with a series of deep CNN and/or RNN models.

Table 2 .
LLDs of the ComParE feature set [26]. The details can be found in [26].

Table 1 .
The number [#] of instances in the data partition of this study.

Table 3 .
Functionals applied to the LLDs in the ComParE feature set [26]. Note that the LLDs listed in Table 2 may or may not use all of the functionals in this table, as described in detail in [26].
Fig. 5. The diagram of the processing chain of the BoAW approach. The term frequency histograms are regarded as the representations extracted from the LLDs for further ML models.

Table 4 .
Results of the benchmarks. C: complexity parameter of the SVM. N_c: codebook size of the bag-of-audio-words (BoAW), splitting the input into 2 codebooks (ComParE-LLDs/ComParE-LLD-Deltas), with 10 assignments per frame and an optimized complexity parameter of the SVM. X: power levels which are clipped below 4 given thresholds. N_E2E: number of layers in LSTM/GRU/