Transfer Learning Model to Indicate Heart Health Status Using Phonocardiogram

The early diagnosis of pre-existing coronary disorders helps to control complications such as pulmonary hypertension, irregular cardiac functioning, and heart failure. Machine learning on heart sounds is an efficient technology that can reduce the workload of manual auscultation by automatically identifying irregular cardiac sounds. Phonocardiogram (PCG) and electrocardiogram (ECG) waveforms provide the much-needed information for the diagnosis of these diseases. In this work, the researchers have converted the heart sound signal into its corresponding repeating pattern-based spectrogram. PhysioNet 2016 and PASCAL 2011 have been taken as the benchmark datasets for experimentation. The existing transfer learning models, viz. MobileNet, Xception, Visual Geometry Group (VGG16), ResNet, DenseNet, and InceptionV3, have been used for classifying the heart sound signals as normal or abnormal. For PhysioNet 2016, DenseNet has outperformed its peer models with an accuracy of 89.04 percent, whereas for PASCAL 2011, VGG has outperformed its peer approaches with an accuracy of 92.96 percent.


Introduction
IoT-enabled devices have opened many opportunities in the area of medical diagnosis and treatment, ensuring the safety and health of patients and empowering medical practitioners to deliver better care. Increased and more sophisticated interaction with patients is possible due to the enormous volume of personalized data generated by these devices. The analysis of this data helps doctors to diagnose patients more efficiently. Further, remote monitoring of patients' health enables medical practitioners to assess whether admission to hospital is needed, which minimizes the patient load, improves treatment outcomes, and significantly reduces healthcare costs. IoT enables proactive engagement of the patients by healthcare professionals [1]. Physicians can identify the best treatment plans and achieve the best results by having access to personalized data collected through these IoT devices [2]. Fig. 1 presents the architecture of an IoT system, where each stage captures or processes data to provide an input for the next stage. Integrating the values across the process yields insights and delivers dynamic business prospects. The steps are as follows: (I) Deployment of interconnected devices for data generation (e.g., sensors, actuators, monitors, detectors, camera systems, etc.). (II) Analog-to-digital conversion and aggregation of data. (III) Pre-processing, cleaning, and storage in the cloud. (IV) Advanced analytics for actionable business insights needed for effective decision-making.

IoT in Healthcare
Mobility support, low latency, location, and privacy awareness are some of the advantages of a fog-based smart healthcare system [3]. Patients may receive personalized care using IoT devices such as exercise bands and other wirelessly activated devices such as cardiovascular and metabolic rate monitoring cuffs, glucometers, and so on. IoT-connected intelligent systems also offer various opportunities to health insurers. Data collected by health tracking systems may be used by insurance providers for underwriting and claims processing. In underwriting, pricing, claims management, and risk evaluation, IoT devices provide clarity between insurers and consumers. A detailed review undertaken by Tuli et al. [4] has shown the utility of IoT devices and artificial intelligence-based techniques for heart disease prediction systems. The research community has emphasized the use of ECG [5][6][7][8] for assessing heart health with IoT and artificial intelligence techniques, which further incorporate machine learning and deep learning approaches. Fig. 2 describes the generic architecture followed for predicting the heart health status in real time. Here, specific sensors collect the heart-related vitals from the body and send them to the computing environment through a web-based dashboard or mobile application for further computation. Machine learning or deep learning-based models return the status of heart health, which can be seen in the mobile application or the web-based dashboard for planning the further course of action.

Figure 2: IoT and AI-based general architecture used for heart healthcare
Heart Failure (HF) and its allied ailments are the major causes of death globally. The initial step for analyzing a cardiovascular system is the examination of the sound emanating from the heart. A phonocardiogram (PCG) is the graphical representation of heart sound signals [9] which allows the detection of heart murmurs and aids in visual analysis of the cardiac sound signals. The PCG signal serves as a cheaper alternative to the popular methods such as ECG, and offers almost equivalent diagnostic capabilities [10]. The procedure allows self-diagnosis and early detection of pathological heart conditions. The magnitude and frequency signals in PCG can detect additional systolic and diastolic murmurs that help in identifying heart diseases [11].
In this research work, the authors have taken the PCG signal instead of ECG as the input. It is believed that this is an unexplored domain in the context of the architecture given in Fig. 2. Also, the input heart sound signal has first been converted into its respective spectrogram and then classified as normal or abnormal using a transfer learning-based model, which is also an unexplored sub-domain. Fig. 3 illustrates the generic flow for classifying a PCG signal as normal or abnormal using machine learning approaches, which includes three steps after data acquisition. Firstly, heart sound signals in .wav format are split into frames of size 5. For example, a signal of length 31 was divided by five and, by keeping only the whole-number part of the quotient and ignoring its fractional part, it resulted in 6 small signals that were taken as input for the next sub-phase. Feature extraction methods used for PCG signals can be grouped into five categories, viz. time, frequency, statistical, time-frequency, and image domain [12]. The image domain features can provide more information about the audio signal [13], and are thus considered to be better than the time, frequency, and time-frequency features. This research work uses spectrograms, two-dimensional depictions of an audio signal, where the x-axis and y-axis represent time and frequency respectively [13]. [14,15]. Reddy et al. [16] in their study proposed a hybrid OFBAT-rule based fuzzy logic heart disease diagnosis system. There have been seminal reviews contributed by Dwivedi et al. [17], Nabih-Ali et al. [18], and Clifford et al. [19] in the said domain. Besides, Convolutional Neural Networks (CNNs) have also been in practice since the last decade for solving classification problems. A CNN requires high computational power as it learns all features from scratch to provide the final results [20].
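The framing step described above (a signal of length 31 split into frames of size 5 gives 6 frames) can be illustrated with a minimal sketch. The function name `split_into_frames` is hypothetical, and the text does not state whether the frame size is measured in samples or seconds, so the example simply treats the signal as a generic sequence.

```python
def split_into_frames(signal, frame_size=5):
    """Split a sequence into consecutive, non-overlapping frames.

    Only complete frames are kept: the number of frames is the integer
    quotient len(signal) // frame_size, and the leftover tail is dropped.
    """
    n_frames = len(signal) // frame_size
    return [signal[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


# Toy signal of length 31, mirroring the example in the text (not PCG data).
toy_signal = list(range(31))
frames = split_into_frames(toy_signal, frame_size=5)
print(len(frames))  # 6 complete frames; the final leftover element is discarded
```

Dropping the incomplete tail keeps every frame the same length, which is convenient when each frame is later turned into a fixed-size image.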
It has been found that the concept of transfer learning emphasizes applying the knowledge gained while resolving one category of problems to a different, but allied, domain. This approach may be a better choice than training CNNs from scratch as it requires comparatively less computational power. However, transfer learning has only been sparsely reported and has not been given thorough consideration for categorizing PCG signals. Reddy et al. [21] in their study proposed a novel rough set-based method for feature selection and a fuzzy rule-based method for diagnosing heart disease.

Focus and Contribution of This Study
Through the survey of related papers that appeared during the period 1999-2019 [17][18][19], it was observed that repeating pattern-based spectrograms had not been used for classifying human heart sounds as normal or abnormal. In the domain of PCG signals, spectrograms can be used in many verticals such as transfer learning, repeating pattern-based spectrograms, textural-based feature extraction from the spectrograms, and amalgamation of spectrograms with either chromagrams or scalograms. Here, a spectrogram represents the signal power, or "loudness," of a signal over time at different frequencies in a specific waveform. Scalograms plot the absolute value of a signal's continuous wavelet transform (CWT) as a function of time and frequency. The chromagram is a time-frequency translation of a signal into a temporally changing precursor of pitch. The research in the domain of acoustics and spectrograms is contemporary and many customizations can be done as an add-on to achieve better results. Arora et al., in their work [22], used spectrograms to classify heart sound signals by applying machine learning models to textural-based features extracted from the spectrogram. In contrast, this research work has bypassed the extra step of feature extraction, and repeating pattern-based spectrograms have been used directly in transfer learning-based models. The pre-trained transfer learning models, viz. ResNet, VGG, Xception, MobileNet, DenseNet, and InceptionV3, have been taken for classifying the spectrograms. By deploying techniques such as data augmentation, overfitting in the classifier has been avoided. The proposed approach has been validated using the PhysioNet 2016 [23] and PASCAL 2011 [24] benchmark datasets. There were 2575 samples of the normal class and 665 samples of the abnormal class in the PhysioNet 2016 dataset, whereas there were 262 samples of the normal class and 92 samples of the abnormal class in the PASCAL 2011 dataset. For both benchmarks, 80% of the dataset has been used for training, while the remaining 20% has been used for testing the model's performance.
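As a rough illustration of the 80/20 holdout described above, the following sketch shuffles a list of labels sized like the PhysioNet 2016 class counts and splits it. The helper name `holdout_split`, the fixed seed, and the rounding rule are assumptions made for the example; the text does not specify how the random sampling was implemented.

```python
import random


def holdout_split(items, train_fraction=0.8, seed=0):
    """Shuffle a copy of items and split it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seed is an arbitrary choice
    n_train = int(round(train_fraction * len(shuffled)))
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder labels with the class counts quoted in the text
# (0 = normal, 1 = abnormal); these are not the actual recordings.
labels = [0] * 2575 + [1] * 665
train, test = holdout_split(labels)
print(len(train), len(test))        # 2592 648
print(sum(labels) / len(labels))    # share of the abnormal class, roughly 0.205
```

The second printed value makes the class imbalance explicit: only about one sample in five is abnormal, which is why metrics beyond accuracy are reported later.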

Transfer Learning and Work Related to PCG
Humans are able to acquire knowledge from one task and apply it to perform more complicated tasks (such as from walking to running). Similarly, transfer learning (or transferability) refers to training a machine learning model on one problem domain and applying it to solve problems in different but related areas. In this work, the pre-trained neural networks have been amended by replacing their last layer with two fully connected layers followed by a sigmoid function to classify the repeating pattern-based spectrograms. Transfer learning tends to improve on traditional machine learning/deep learning (CNNs) by transferring knowledge, in the form of features learned from training on one problem domain, and deploying it to another related domain(s) [25]. Fig. 4 illustrates the working of transfer learning. Some of the known pre-trained networks are ResNet, VGG, Xception, MobileNet, DenseNet, and InceptionV3. All these networks have been trained on the ImageNet database and designed to categorize images into 1000 classes [26]. Features such as edge detection and shape detection learned by the pre-trained source model on the ImageNet dataset have been transferred to the target model for further fine-tuning to classify the PCG signals. While the initial layers focus on lower-level feature extraction, viz. edge detection and colour blob detectors, the later layers have been customized to be more specific to the details of the dataset under consideration [27].
The authors of [32] applied a transfer learning approach to recognize cardiovascular diseases using PCG signals from the PASCAL 2011 benchmark dataset. Ren et al. [28] and Wolk et al. [30] explored the use of transfer learning for classifying PCG signals. Ren et al. [28] converted the PCG signals into scalograms by using a wavelet transformation. These scalograms were fed to a pre-trained VGG16 model for classification. Nonetheless, the mean accuracy achieved by Ren et al. [28] was only 56.20%, which was not sufficient for real-life applications. Wolk et al. [30] converted the PCG signals into spectrograms, which further underwent basic data augmentation and were finally fed into a pre-trained convolutional network for classification. The researchers reported a testing accuracy of 99.00%. However, they validated their model's performance on a validation dataset that was part of the training dataset, which resulted in overfitting of the model. Alaskar et al. [29] pre-processed the raw PCG signals to filter the noise and obtain segmented PCG signals. The PCG scalograms were obtained by applying the Continuous Wavelet Transform (CWT) to these signals. The researchers used a pre-trained AlexNet [36] to extract the features and passed them to SVM classifiers. The result was again fed to AlexNet as an end-to-end learning classifier. Tab. 1 presents the accuracy, recall, and precision values obtained from the various pre-trained models applied to the PhysioNet 2016 dataset for classifying heart sound signals as normal and abnormal.

Spectrograms with Repeating Patterns
The spectrogram is a visual depiction of sound. The horizontal axis of the spectrogram represents time, and the vertical axis shows frequency. A spectrogram is generated from a collection of overlapping windows obtained from the signal, each evaluated by computing the Fast Fourier Transform. The method of splitting the signal into small, fixed-size portions and then applying the Fourier transform to each of them individually is called the Short-time Fourier transform (STFT). The spectrogram is then computed as the (squared) complex magnitude of the STFT. Spectrograms fall within the domain of images and can be used to discern sounds by means of machine learning models. A distinctive type of spectrogram depicting PCG beats and noisy backgrounds was chosen in this analysis as the input for the transfer learning models [37]. The spectrogram exhibited in Fig. 5 shows the relationship between frequency and time for the heart sound signal [13,38]. Spectrograms can accommodate intermittent repeated components, including the quickly changing repeating patterns of the foreground/background of an audio signal [39]. Fig. 6a displays the beats derived from PCG, Fig. 6b shows the persistent noise background, and Fig. 6c indicates the incidence of heartbeats.
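The STFT-based construction described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `power_spectrogram`, the window length of 256 samples, the hop of 64 samples, and the synthetic 100 Hz tone are all choices made for the example and are not taken from the study, which reports using Librosa and SciPy.

```python
import numpy as np


def power_spectrogram(signal, n_window=256, hop=64):
    """Squared-magnitude STFT using overlapping Hamming windows.

    Returns an array of shape (n_window // 2 + 1, n_frames):
    rows are frequency bins, columns are time frames.
    """
    window = np.hamming(n_window)
    n_frames = 1 + (len(signal) - n_window) // hop
    # Cut the signal into overlapping, windowed segments (one per column).
    frames = np.stack(
        [signal[i * hop:i * hop + n_window] * window for i in range(n_frames)],
        axis=1,
    )
    stft = np.fft.rfft(frames, axis=0)   # FFT of every windowed segment
    return np.abs(stft) ** 2             # squared complex magnitude


# Synthetic test tone (not PCG data): 100 Hz sine sampled at 2000 Hz for 1 s.
sr = 2000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t)

spec = power_spectrogram(tone)
print(spec.shape)                         # (129, 28)
peak_hz = spec.mean(axis=1).argmax() * sr / 256
print(peak_hz)                            # frequency bin closest to 100 Hz
```

Each column of `spec` is one time frame and each row one frequency bin, so plotting the array as an image gives the time-frequency picture that the classifiers consume.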

Methodology
The steps taken for categorizing heart sound signals as normal or abnormal are highlighted in Fig. 7. PCG signals were taken from the PhysioNet 2016 and PASCAL 2011 datasets and converted into their respective spectrograms. The benchmark dataset PhysioNet 2016 holds the records used in the PhysioNet/CinC Challenge 2016. Heart sound samples were obtained from a variety of sources around the globe, in both clinical and nonclinical settings, from both healthy and pathological individuals. The PASCAL 2011 dataset was collected from two sources: (A) the general population using the iStethoscope Pro iPhone software (Dataset A), and (B) a clinical experiment in hospitals using the DigiScope wireless stethoscope (Dataset B).
In this research work, the proposed methodology has been implemented using Python (Version 3.7.6), Keras (Version 2.3.1), NumPy (Version 1.18.2), TensorFlow (Version 1.15.2), Librosa (Version 0.7.0), and SciPy (Version 1.4.1). The experimental work has been conducted using an Intel Core i7 processor, 8 GB RAM, and the Windows 10 Pro operating system. Repeating patterns were identified with a similarity matrix. These patterns were further filtered to generate spectrograms with patterns that denoted unrepeated noisy PCG signals. A soft mask was applied to the background spectrum to remove noise from the heart sound signal.

Modelling the Spectrogram Using Similarity Matrix and Soft-Mask
The similarity matrix captured the cosine similarity between the transposed spectrogram and the normalized spectrogram of the PCG signal [40]. The STFT of the PCG signal (Z) was computed using quarter-overlapping Hamming windows of N samples. Taking the absolute values of Z, a spectrogram (V) was derived from the STFT (where Z consisted of the elements $z_1, z_2, \ldots, z_n$ of the time series of a signal). The similarity matrix (S) between the transposed spectrogram and the normalized spectrogram was determined from each frequency channel over all time frames in V. By calculating the cosine similarity between any two frames, the similarity matrix showed the regions of higher and lower resemblance in the spectrogram at various time frames. The recurrent structure in the similarity matrix corresponded to the background of the PCG spectrogram. The repeating pattern-based spectrogram (W) was derived through three phases [22,39,40]:
• Using the cosine similarity measure, a similarity matrix S was computed from V.
• For each frame in S, the frequency of unique repetitive patterns was identified.
• The median of the frames in V was computed for each frequency channel over all the repeating frames to obtain W.
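The three phases listed above can be sketched as follows. This is one plausible reading, not necessarily the exact procedure used in the study: the function name `repeating_spectrogram`, the number of similar frames `n_similar`, and the final element-wise minimum with V are assumptions, and the input is a tiny synthetic array rather than a PCG spectrogram.

```python
import numpy as np


def repeating_spectrogram(V, n_similar=5):
    """Estimate the repeating background W of a magnitude spectrogram V.

    V has shape (n_freq, n_frames). For every time frame, the n_similar
    most cosine-similar frames are selected and their per-frequency
    median is taken as the repeating estimate for that frame.
    """
    # Phase 1: cosine similarity between all pairs of time frames.
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    Vn = V / np.maximum(norms, 1e-12)
    S = Vn.T @ Vn                              # shape (n_frames, n_frames)

    W = np.empty_like(V)
    for j in range(V.shape[1]):
        # Phase 2: indices of the frames most similar to frame j.
        nearest = np.argsort(S[j])[::-1][:n_similar]
        # Phase 3: median over those frames, per frequency channel.
        W[:, j] = np.median(V[:, nearest], axis=1)

    # Assumption: the repeating part should not exceed the observed mixture.
    return S, np.minimum(W, V)


# Synthetic example: a constant background plus one non-repeating event.
background = np.tile(np.array([[1.0], [2.0], [3.0], [4.0]]), (1, 12))
V = background.copy()
V[2, 7] += 10.0
S, W = repeating_spectrogram(V)
print(np.allclose(W, background))              # True: the isolated event is filtered out
```

Because the median ignores values that appear in only a minority of the similar frames, the isolated event in frame 7 does not leak into the background estimate.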
A sparse time-frequency (TF) depiction of the PCG signal was contrasted with the TF depiction of the background sound element. The median filter detected minor changes in the TF intervals and deleted the intervals with great variations (outliers) from the repetitive patterns. Soft time-frequency masking, a technique used to distinguish speech signals from noise-in-speech mixtures, was implemented. The spectrogram depicting heart sound with noise (V) and the spectrogram illustrating only noise (W) were initially taken to obtain the soft time-frequency mask. An element-wise division between W and V was subsequently carried out to get the TF mask [33]. To obtain a soft mask, these values were normalized between 0 and 1. Applying this mask to V yielded the recurrent background. To obtain the foreground, i.e., the PCG beats, the repetitive background was removed from the base spectrogram [41]. Reddy et al. [42], in their study, improved the quality of stroke data by improving the pre-processing techniques on a multimodal stroke dataset. The authors trained deep neural networks, and the optimal hyperparameters were identified using the Antlion optimization algorithm (ALO).
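The soft-masking step described above can be sketched in NumPy as an element-wise division of W by V, followed by limiting the result to the range 0 to 1. The function name `soft_mask_split`, the small constant `eps` that guards against division by zero, and the use of clipping as the normalization are assumptions; the arrays are toy values, not PCG spectrograms.

```python
import numpy as np


def soft_mask_split(V, W, eps=1e-12):
    """Split a spectrogram V into repeating background and foreground.

    V: spectrogram of heart sound plus noise.
    W: estimate of the repeating (noise) spectrogram, same shape as V.
    """
    mask = np.clip(W / (V + eps), 0.0, 1.0)   # soft mask with values in [0, 1]
    background = mask * V                     # repeating part of V
    foreground = V - background               # what remains: the beats
    return mask, background, foreground


# Toy values: the third frame of the first channel carries extra energy.
V = np.array([[1.0, 1.0, 5.0],
              [2.0, 2.0, 2.0]])
W = np.array([[1.0, 1.0, 1.0],
              [2.0, 2.0, 2.0]])
mask, background, foreground = soft_mask_split(V, W)
print(np.round(mask, 2))         # 0.2 where V exceeds W, 1.0 elsewhere
print(np.round(foreground, 2))   # 4.0 at the extra-energy cell, 0.0 elsewhere
```

Unlike a binary mask, the soft mask assigns each time-frequency cell partly to the background and partly to the foreground, which avoids hard edges in the resulting image.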
Two fully connected layers have been added to the convolutional bases. In this research work, these fully connected layers extract the domain-specific features from the repeating pattern-based spectrograms. The hyper-parameter optimization has been conducted on the original dataset using a 5-fold cross-validation technique. Various parameters such as batch size, layer size, and regularization have been experimented with over a range of values. The combination of 512 and 256 neurons at the first and second layer respectively was identified as the best among those tried, the others being (1024, 512) and (256, 128). These combinations were also subjected to L2-regularization. The ReLU activation function has been used for the fully connected layers, and the sigmoid activation function has been used for classification. The batch size of the network has also been tuned over three different values, i.e., 10, 12, and 16. The best training accuracy has been achieved at a batch size of 12.
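A hedged sketch of such a classifier is given below using the `tf.keras` API. Several details are assumptions rather than facts from the study: the DenseNet121 variant, global average pooling after the convolutional base, freezing the base, the L2 strength, and the Adam optimizer. The study itself reports standalone Keras 2.3.1 with TensorFlow 1.15.2, so the exact calls may have differed.

```python
import tensorflow as tf


def build_classifier(weights="imagenet", l2_strength=1e-4):
    """Pre-trained convolutional base plus two dense layers and a sigmoid output."""
    base = tf.keras.applications.DenseNet121(
        include_top=False,            # drop the original 1000-class layer
        weights=weights,              # "imagenet" downloads pre-trained weights
        input_shape=(224, 224, 3),
        pooling="avg",                # assumption: global average pooling
    )
    base.trainable = False            # assumption: keep the base frozen

    reg = tf.keras.regularizers.l2(l2_strength)   # strength is an assumed value
    x = tf.keras.layers.Dense(512, activation="relu", kernel_regularizer=reg)(base.output)
    x = tf.keras.layers.Dense(256, activation="relu", kernel_regularizer=reg)(x)
    output = tf.keras.layers.Dense(1, activation="sigmoid")(x)   # normal vs. abnormal

    model = tf.keras.Model(inputs=base.input, outputs=output)
    model.compile(optimizer="adam",               # optimizer is an assumption
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

With hypothetical arrays `train_images` and `train_labels`, training with the reported settings would look like `build_classifier().fit(train_images, train_labels, epochs=25, batch_size=12)`. Swapping `DenseNet121` for another entry in `tf.keras.applications` gives the corresponding variant for the other base networks.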

Data Augmentation
Data augmentation is done to produce fresh training samples artificially from the existing training data. Domain-specific techniques have been applied to existing examples from the training data to create novel training samples. The prediction accuracy of deep learning models depends on the amount and the diversity of data available during training. Data augmentation not only addresses both of these requirements, but can also be used to address the class imbalance problem in classification tasks. Image data augmentation is a well-documented method of enlarging a dataset in which transformed copies of an image are added to the training dataset alongside the original image, under the original class. Various transforms have been tried by researchers, mainly shifting, flipping, and zooming, among a range of image manipulation operations.
Data augmentation can be effectively used to train deep learning models in complex applications. Modern deep learning algorithms (e.g., CNNs) can learn location-invariant features in the image. Augmentation further helps the model learn features that are invariant to transforms such as ordering (left → right, top → bottom), intensity variations, and many more. Image data augmentation is usually applied to boost the training dataset, whereas for testing and validation the original dataset is always preferred. Data augmentation is distinct from data preparation steps such as image resizing and pixel scaling. These basic augmentation techniques have been widely used on small datasets to combat overfitting. Augmentations of the original images were generated using the Keras ImageDataGenerator function [43]. In each epoch, the ImageDataGenerator function applies a transformation to the images and uses the transformed images for training. During experimentation, the original images were subjected to changes in brightness, rotation, and image scaling for data augmentation using the ImageDataGenerator function. In this research work, only image scaling in the range of 1:255 has been used as a technique for data augmentation, and the repeating pattern-based spectrograms were resized to 224 × 224 × 3 to match the input requirements of the pre-trained model.
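The resizing and pixel scaling mentioned above can be illustrated with a small NumPy stand-in. It assumes that "image scaling in the range of 1:255" means multiplying 8-bit pixel values by 1/255, which is what the `rescale` argument of Keras's ImageDataGenerator does when set to that value; the nearest-neighbour resize and the function name `prepare_spectrogram_image` are choices made for this sketch only.

```python
import numpy as np


def prepare_spectrogram_image(image, size=224):
    """Resize an 8-bit image to size x size x 3 and rescale pixels to [0, 1]."""
    if image.ndim == 2:
        # Grayscale input: repeat the single channel three times.
        image = np.repeat(image[:, :, None], 3, axis=2)
    # Nearest-neighbour resize (assumption): pick source rows and columns.
    rows = np.arange(size) * image.shape[0] // size
    cols = np.arange(size) * image.shape[1] // size
    resized = image[rows][:, cols]
    return resized.astype(np.float32) / 255.0     # pixel rescaling by 1/255


# Random placeholder image (not a real spectrogram), 300 x 400 pixels.
rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(300, 400), dtype=np.uint8)
x = prepare_spectrogram_image(fake_image)
print(x.shape, x.dtype)                           # (224, 224, 3) float32
```

The output shape matches the 224 × 224 × 3 input expected by the ImageNet-trained networks, and the values lie between 0 and 1.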

Discussion on Experimentation Results
The PhysioNet 2016 and PASCAL 2011 benchmark datasets have been used here for experimentation with transfer learning models.

Parameters of Evaluation
Tab. 2 shows a confusion matrix, from which sensitivity, precision, recall, and fall-out are derived. The matrix reflects the performance of a machine learning model. Columns in this matrix represent actual class instances, while the rows represent instances of the predicted class. The proportion of actual positive cases predicted to be positive is referred to as sensitivity. The proportion of actual negatives predicted to be negative is the specificity. A false positive is a classification mistake in which a test outcome falsely suggests the existence of a disorder that does not in fact exist, whereas false negatives are negative outcomes that the model predicted incorrectly. In the context of cardiac arrhythmia, true positive indicates the count of subjects who suffer from some ailment and whom the model has correctly identified. True negative is the count where the model correctly identified the normal subjects. False positive is the count where the proposed model has predicted normal subjects as patients. False negative refers to the count of abnormal subjects that are predicted as normal.
For the purpose of this research, accuracy, precision, recall, and F1-score have been used as the evaluation criteria. Accuracy is the proportion of all correctly classified instances. Precision and recall are useful measures of predictive performance when datasets are highly imbalanced. The F1-score may be a better metric to use if a balance between precision and recall is sought, especially where there is an unequal class distribution. The F1-score has also been used where both false negatives and false positives are crucial.
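The evaluation measures named above follow directly from the four confusion matrix counts. The sketch below uses the standard definitions; the function name `classification_metrics` and the counts passed to it are made up for illustration and are not results from this study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard measures from confusion matrix counts.

    tp: abnormal subjects predicted abnormal, fp: normal predicted abnormal,
    fn: abnormal predicted normal,            tn: normal predicted normal.
    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For these counts the accuracy is 0.85 while the F1-score is about 0.842, a small example of how the two measures can differ once errors are spread unevenly across the classes.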

Comparison Among Transfer Learning Models
The transfer learning models have been trained using the holdout validation technique, taking 80% of the benchmark dataset selected through random sampling. The training has been conducted for 25 epochs with a batch size of 12 each. Four different statistics, viz., accuracy, precision, recall, and F1-score, have been used to compare the performance of each model taking repetition-based spectrograms with data augmentation and without data augmentation.
Tabs. 3 and 4 present the accuracy, recall, precision, and F1-scores obtained for the PhysioNet 2016 database with and without data augmentation respectively. The results for MobileNet, Xception, VGG16, ResNet, DenseNet, and InceptionV3 have been listed, computed on the 20% of the base dataset reserved for validation of the model. DenseNet has achieved the highest accuracy in comparison to its peer approaches. Fig. 8 presents the plots drawn for the accuracy, recall, precision, and F1-score obtained on the PhysioNet 2016 benchmark dataset.
Tabs. 5 and 6 exhibit the accuracy, recall, precision, and F1-scores obtained for the PASCAL 2011 database with and without data augmentation respectively. The results for MobileNet, Xception, VGG16, ResNet, DenseNet, and InceptionV3 have been listed, computed on the 20% of the base dataset reserved for validation of the model. VGG16 has achieved the highest accuracy in comparison to the other approaches. Fig. 9 shows the plots for the accuracy, recall, precision, and F1-scores obtained on the PASCAL 2011 benchmark dataset.
It is evident from Tabs. 3-6 that the results obtained with data augmentation are better than those obtained without it. Due to the skewed class distribution of the benchmark datasets, both accuracy and F1-score have been used for comparison [44,45].

Conclusion and Future Scope
A new type of spectrogram, i.e., the repetition-based spectrogram, has been used as an input for the transfer learning models. On the PhysioNet 2016 dataset, DenseNet has outperformed all its peers, whereas on the PASCAL 2011 dataset, VGG16 has performed best. The approach taken by the authors in this work has considered only one depiction of PCG signals, i.e., spectrograms, which are an image representation of sound signals in the time-frequency domain. The proposed approach can be enhanced further by using chromagrams, mel-spectrograms, and scalograms in conjunction with the spectrogram of a PCG sound signal, as these can depict the sound in different pitch classes. In future, the accuracy of the prediction model can be improved by using an ensemble of the models that performed best on the PhysioNet 2016 and PASCAL 2011 datasets (i.e., DenseNet and/or VGG) with a boosting algorithm such as XGBoost.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.