Federated Abnormal Heart Sound Detection with Weak to No Labels

Cardiovascular diseases are a prominent cause of mortality, emphasizing the need for early prevention and diagnosis. Utilizing artificial intelligence (AI) models, heart sound analysis emerges as a noninvasive and universally applicable approach for assessing cardiovascular health conditions. However, real-world medical data are dispersed across medical institutions, forming “data islands” due to data sharing limitations for security reasons. To this end, federated learning (FL) has been extensively employed in the medical field, as it can effectively model across multiple institutions. Additionally, conventional supervised classification methods require fully labeled data classes, e.g., binary classification requires labeling of positive and negative samples. Nevertheless, the process of labeling healthcare data is time-consuming and labor-intensive, leading to the possibility of mislabeling negative samples. In this study, we validate an FL framework with a naive positive-unlabeled (PU) learning strategy. The semisupervised FL model can directly learn from a limited set of positive samples and an extensive pool of unlabeled samples. Our emphasis is on vertical-FL to enhance collaboration across institutions with different medical record feature spaces. Additionally, our contribution extends to feature importance analysis, where we explore 6 methods and provide practical recommendations for detecting abnormal heart sounds. The study demonstrated an impressive accuracy of 84%, comparable to outcomes in supervised learning, thereby advancing the application of FL in abnormal heart sound detection.


Introduction
Cardiovascular diseases (CVDs) are the leading cause of death worldwide, surpassing other causes in annual fatalities [1,2]. The importance of early diagnosis and preventive measures in cardiovascular healthcare cannot be overstressed. Due to its universal and noninvasive nature, heart sound analysis offers a promising avenue in medical care for assessing an individual's cardiovascular status. Leveraging machine learning models for abnormal heart sound detection in digital healthcare provides a practical approach for early diagnosis and effective prevention of CVDs [3][4][5][6].
However, the issues of privacy protection and data silos seriously impede the exploration of medical data and the application of medical artificial intelligence (AI) models [7]. First, variations exist among medical institutions. Some institutions have limited resources and records that hinder effective medical machine learning modeling. Second, pertinent laws and regulations, including the Health Insurance Portability and Accountability Act (HIPAA) [8], restrict data exchange between medical institutions for security and privacy protection. Consequently, healthcare data become fragmented and scattered across medical institutions, causing the phenomenon of "data islands." Federated learning (FL) is a distributed machine learning paradigm that enables collaborative modeling among participants without sharing their private data [9][10][11]. It serves as a viable method to address the "data island" issue in the medical field through collaborative modeling across multiple centers. Consequently, it provides a certain degree of protection for data security and patient privacy. Our studies are based on SecureBoost [12], a federated ensemble learning framework embedded in FATE. [FATE (Federated AI Technology Enabler [13]) supports the FL architecture, as well as the secure computation and development of various machine learning algorithms; https://github.com/FederatedAI/FATE.] In this study, we practically applied the vertical-SecureBoost (Vertically Federated XGBoost) model on a multi-institutional heart sound database. [XGBoost (eXtreme Gradient Boosting [14]) provides an optimized distributed gradient boosting tree-based ensemble model designed to be highly efficient, flexible, and portable; https://xgboost.readthedocs.io.] We propose corresponding federated optimization strategies for the requirements of real-world healthcare scenarios with label scarcity.
In real-life medical scenarios, we consider 3 key issues: (a) Accurately labeling all heart sound records is resource-intensive, leading to only a fraction of the dataset being labeled [15][16][17]. Semisupervised FL is considered suitable, involving a few "positive" labeled samples and a large volume of "unlabeled" samples, which may contain both positive and negative samples. (b) The widely studied horizontal-FL, also known as sample-partitioned FL [18], requires data from institutions to have the same feature space and different sample spaces. Horizontal-FL is devised to facilitate collaboration among medical institutions with varied patient populations, given the inability to share data across institutions. Therefore, horizontal-FL data partitioning is recommended when developing models with limited sample size variability of FL participants. However, in real medical scenarios, the same patient may receive treatment at different hospitals, allowing for the use of records from multiple sources in diagnosis. Consequently, multiple healthcare institutions may serve the same patient population. Vertical-FL, akin to feature-partitioned FL [19], has recently garnered attention from researchers in cases where medical institutions participating in FL share the same user community but have different medical record feature spaces. This study centers on vertical-FL, aiming to model collaboration across multiple institutions with distinct medical record spaces to provide comprehensive insights into the same patient population. (c) Leveraging the high-dimensional features extracted from heart sound records, it is necessary to select an effective feature importance analysis scheme to retain the most influential feature set [20]. This enhances the efficiency of FL modeling and is anticipated to sustain comparable performance while achieving a reduction in feature dimensionality. Therefore, the contributions of our work can be summarized as follows: • Our study uniquely shifts from
traditional data-centric centralized learning to embrace the FL paradigm in the analysis of the PhysioNet/CinC heart sound database. (Classification of Normal/Abnormal Heart Sound Recordings [21,22]: the PhysioNet/Computing in Cardiology Challenge; https://physionet.org/content/challenge-2016/1.0.0.) We adopt a vertical data partitioning approach and leverage the vertical-SecureBoost FL framework for multi-medical-center collaboration modeling to address data islands and privacy concerns in healthcare.
• To meet the demands of real medical scenarios, we promote an FL framework with a naive positive-unlabeled (PU) semisupervised learning strategy. In specific medical contexts, semisupervised FL emphasizes the integration of positive and unlabeled training strategies. The approach achieves a remarkable 84% accuracy, comparable to the outcomes of supervised learning, representing an important exploration of FL in the realm of abnormal heart sound detection.
• In our study practice, we explore 6 distinct methods for feature importance analysis. Utilizing the ensemble learning paradigm based on XGBoost, we compare 5 methods, namely, "gain, total_gain, cover, total_cover, weight," with the SHAP method. [SHAP (SHapley Additive exPlanations [23]) is a game-theoretic method to explain the output of machine learning models. The method determines the importance of an individual by calculating the contribution of that individual in the cooperation; https://shap.readthedocs.io.] Based on comparative experiments, we provide practical recommendations for feature selection in the context of abnormal heart sound detection.
The rest of the paper is organized as follows: The "Related Works" section introduces the related work. The "Materials and Methods" section describes data preprocessing methods, experimental design, and evaluation metrics. The "Experiment and Results" section presents our comparative experiments and results. The "Discussion" section provides a detailed discussion. Finally, we conclude the paper in the "Conclusion" section.

Related Works
In the realm of healthcare, FL has emerged as a pivotal research area, addressing the challenges associated with collaborative modeling across diverse medical institutions. Recent studies emphasize its application in multicenter settings, enabling model training without raw data exchange, thus preserving privacy and adhering to data security regulations. Researchers have investigated federated approaches for tasks such as predictive modeling, disease diagnosis, and personalized treatment recommendations. Examples of noteworthy work include the following: • Privacy-preserving patient data sharing: Pioneering studies have focused on preserving patient privacy while enabling collaborative model training [24,25]. Techniques such as federated averaging and secure aggregation have been employed to facilitate model updates without raw data sharing. This ensures that FL complies with data protection regulations such as HIPAA.
• Decentralized disease prediction models: Some researchers have applied FL to construct disease prediction models using data across multiple healthcare institutions [26][27][28]. This approach allows each institution to contribute to the model without sharing patient-specific information, enabling the development of robust and generalizable models.
• Real-world federated systems: Emerging research involves the implementation of FL systems in real-world healthcare settings [29][30][31]. These systems consider challenges like data heterogeneity, communication efficiency, and model convergence across multiple institutions.
A practical concern often overlooked in healthcare is the limited availability of labeled data. We study the real-world setting of FL medical applications, where assuming fully labeled data in each FL client is less practical. Two related areas are federated unsupervised representation learning and federated semisupervised learning. In scenarios with limited labeled data, semisupervised FL becomes crucial [15][16][17]. This paradigm involves training models using a combination of labeled and unlabeled data, making it particularly relevant for medical applications with limited annotated datasets. In terms of semisupervised FL, some studies explore cross-institutional transfer learning strategies to transfer knowledge between institutions with varying degrees of labeled data [32]. Therefore, models can leverage labeled data from one institution to enhance performance on other institutions' datasets, contributing to better generalization. Additionally, some studies incorporate active learning techniques within FL frameworks to intelligently select and query instances for annotation [33]. This ensures efficient utilization of labeling resources and enhances model performance in scenarios with limited labeled samples. Few FL studies directly address federated PU learning. Study [34] proposes a novel framework called Federated Learning with Positive and Unlabeled Data (FedPU). FedPU considers that each client can label only a limited amount of data for some classes. The work [35] introduces the FedMatch algorithm, a state-of-the-art federated semisupervised model based on consistency regularization training. FedMatch addresses scenarios where clients have both labeled and unlabeled data. We study the problem of learning from positive and unlabeled (PU) data in the federated setting. In contrast to the previous scenario, we focus on situations where some clients exclusively have positive and unlabeled samples, while others have only unlabeled samples.
To sum up, FL in healthcare is developing rapidly, with a focus on preserving privacy and addressing data distribution challenges. The incorporation of semisupervised learning techniques further extends the applicability of federated approaches, especially in scenarios with imbalanced or limited labeled data. These developments set the stage for tackling complex tasks like abnormal heart sound detection across multiple federated care institutions.

Dataset description and preprocessing
In this work, heart sound data are obtained from the PhysioNet/CinC [21,22] challenge, a high-quality, authentic public database. As shown in Table 1, it comprises 6 subdatabases, each independently gathered by diverse institutions in clinical and nonclinical environments. Samples labeled as "normal" originate from healthy subjects, whereas "abnormal" samples are derived from patients with various conditions such as heart valve disease and coronary artery disease. We use openSMILE [36,37], a widely used open-source toolkit for audio-signal processing, to extract features. openSMILE provides features commonly used in traditional acoustic signal processing methods, including mel frequency cepstral coefficients (MFCCs), physiological acoustic features, and energy spectrum features. Initially, it extracts low-level descriptor (LLD) features from the audio signal and then re-extracts statistical features from these frame-based LLD features. We use the ComParE [38] feature set in openSMILE, extracting a total of 6,373-dimensional features, which include 65 acoustic LLD features and their associated statistical features. The data preprocessing procedure is summarized in Fig. 1, and the specific steps are outlined below.
Step 1: Because the original databases were collected independently by each institution, multiple sets of heart sound records may have been obtained from the same subjects. To ensure subject independence, the experiment combined the data from 5 medical institutions (Dataset {b−f}) as the training set, while Dataset a was designated separately as the public test set. Additionally, we implemented a downsampling strategy using the RandomUnderSampler function in Python to address the data imbalance problem. After balancing the samples, there are 665 positive samples and 665 negative samples. The training set to test set ratio is approximately 7:3. The validation set is derived from the officially provided "validation" dataset, comprising 150 positive and 150 negative samples.
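The balancing step above uses imblearn's RandomUnderSampler; the same idea can be sketched with the standard library alone. The data below are toy placeholders, and the function name is illustrative, not the paper's code:

```python
import random

def random_undersample(samples, labels, seed=0):
    """Balance a binary dataset by randomly dropping majority-class samples,
    mirroring what imblearn's RandomUnderSampler does."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    majority, minority = (pos, neg) if len(pos) > len(neg) else (neg, pos)
    kept = sorted(rng.sample(majority, len(minority)) + minority)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy example: 10 abnormal (1) vs. 4 normal (0) recordings.
X = [[float(i)] for i in range(14)]
y = [1] * 10 + [0] * 4
Xb, yb = random_undersample(X, y)   # 4 positives and 4 negatives remain
```

In the paper's setting, the same operation reduces the merged training pool to the reported 665 positive and 665 negative samples.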
Step 2: Further selecting the subset of features that have the most impact on the model benefits resource-constrained federated clients, as it is expected to improve model performance while reducing feature dimensionality. As the FL model in this paper is a novel privacy-preserving gradient tree boosting framework, it conducts FL by constructing boosting trees across multiple federated parties. Using the 6,373-dimensional ComParE feature set, we apply 5 tree-based feature importance analysis methods: gain, total_gain, cover, total_cover, and weight, along with a SHAP-based method, to assess their individual contributions to the model. Subsequently, the selected 165 features are used in the hyperparameter experiments of this study.
Step 3: Since accurate labels exist for all samples in the dataset, to assess the effectiveness of the semisupervised FL algorithm, we introduce the assumption that labels for some samples are absent. Following the PU scenario, we designate all negative samples as unlabeled, while also masking a portion of the positive samples as unlabeled. This approach, inspired by a previous study [39], involves randomly selecting 20% of the positive data as labeled positive examples and treating the rest of the data as unlabeled examples. The mask strategy is visually depicted in Fig. 4A, where the unmasked part represents positive samples, and the masked part is unlabeled. This masking strategy is applied to both the training and testing datasets.
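The masking in step 3 can be sketched as follows, assuming a simple list-of-labels representation; the function name and toy data are illustrative:

```python
import random

def mask_pu(labels, labeled_frac=0.2, seed=0):
    """Turn fully labeled binary data into a PU dataset: keep labeled_frac of
    the positives as labeled (s = 1); mask everything else as unlabeled (s = 0).
    Negatives are never labeled, matching p(s=1 | x, y=0) = 0."""
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    n_labeled = max(1, round(labeled_frac * len(pos_idx)))
    labeled = set(rng.sample(pos_idx, n_labeled))
    return [1 if i in labeled else 0 for i in range(len(labels))]

y_true = [1] * 10 + [0] * 10
s = mask_pu(y_true)   # s[i] = 1 only for the retained labeled positives
```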
Step 4: Following the completion of step 3, we vertically partition the preprocessed dataset, gearing up for the vertical-SecureBoost model with PU learning. In vertical-FL, datasets across institutions share the same sample space but exhibit different feature spaces. To adhere to this condition, vertical partitioning in this study involves vertically dividing the dataset. Let us consider a dataset D = (X, Y) consisting of a feature set X and a label set Y, partitioned into guest = (X1, Y) and host = (X2), where guest represents the federated participant with labels, host denotes the unlabeled participant, and X = X1 ∪ X2. The classifier's objective is to label the unlabeled samples within the masked segment and accurately classify the unmasked positive samples.
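The vertical partition of step 4 amounts to a column-wise split of the merged feature table. A stdlib sketch, assuming features are stored as a plain row-major table (the helper and its toy inputs are illustrative):

```python
def vertical_partition(rows, guest_cols, labels=None):
    """Split a feature table column-wise: guest keeps guest_cols (plus the
    labels Y), host keeps the remaining columns. Both parties retain the
    same sample (row) space, as vertical-FL requires."""
    n_cols = len(rows[0])
    host_cols = [c for c in range(n_cols) if c not in set(guest_cols)]
    guest = [[r[c] for c in guest_cols] for r in rows]
    host = [[r[c] for c in host_cols] for r in rows]
    return (guest, labels), host

# Two samples with 4 features each; guest holds features 0-1 and the labels.
X = [[1, 2, 3, 4], [5, 6, 7, 8]]
y = [1, 0]
(guest_X, guest_y), host_X = vertical_partition(X, guest_cols=[0, 1], labels=y)
```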

Vertically federated XGBoost (vertical-SecureBoost)
FL is an emerging machine learning paradigm that leverages decentralized data and distributed learning. It offers a novel solution for collaborative modeling across multiple healthcare institutions. In the traditional horizontal-FL approach, participating institutions initially train their models using local data. Subsequently, they transfer the parameters of these local models, such as the gradients of neural networks, to a central server for aggregation. This process enables the construction of robust global models without sharing raw data. Horizontal-FL requires alignment of feature spaces among participants, which is an ideal scenario. This paper considers medical institutions as federated participants and studies the same patient population with different medical record feature spaces, which is consistent with the vertical-FL scenario. Vertical-FL, also known as feature-partitioned FL, is suitable for scenarios where medical institutions share the same patient population. In other words, the data of these institutions have the same sample space but different feature spaces.
In this study, we employ a vertical-FL model named vertical-SecureBoost for semisupervised FL. In the vertical-SecureBoost setting, only one client has labels, while the other clients have only features. The client with labels is referred to as the guest party, and the others are termed host parties. The role of the guest party is analogous to that of the central server in horizontal-FL. In real medical scenarios, some FL participants have unlabeled data and serve only as feature providers. In response, the semisupervised FL approach in this paper aims to address the problem of missing labels in federated medical institutions.
The guest party, holding the class labels, is responsible for computing gradient values for all samples and transmitting them to all host parties. Additionally, the guest party is tasked with aggregating feature bins from host parties, decrypting gradient histograms, traversing them, and determining the optimal split point along with the corresponding feature. For host parties, the main function is to compute their own feature bins and local gradient histograms based on the encrypted gradient values of all samples transmitted by the guest party. Upon receiving the broadcast from the guest party regarding the optimal splitting feature, the host party holding that feature must determine the corresponding threshold value. The node-splitting mechanism of the tree model in vertical-SecureBoost is illustrated in Fig. 2.
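For a logistic loss, the per-sample quantities the guest computes are the standard XGBoost derivatives g_i = p_i − y_i and h_i = p_i(1 − p_i), where p_i is the sigmoid of the previous round's raw score. A minimal sketch of this guest-side step (the homomorphic encryption and transport to the hosts are omitted here):

```python
import math

def logistic_grad_hess(y_true, raw_scores):
    """First- and second-order derivatives of the logistic loss w.r.t. the
    previous round's raw predictions. In vertical-SecureBoost these are
    computed by the guest (label holder), encrypted, and sent to the hosts
    for gradient-histogram building."""
    grads, hess = [], []
    for y, f in zip(y_true, raw_scores):
        p = 1.0 / (1.0 + math.exp(-f))   # sigmoid of the raw score
        grads.append(p - y)              # g_i = p_i - y_i
        hess.append(p * (1.0 - p))       # h_i = p_i * (1 - p_i)
    return grads, hess

# With raw scores of 0.0, p = 0.5 for both samples.
g, h = logistic_grad_hess([1, 0], [0.0, 0.0])
```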

PU classification scenario
PU classification is prevalent in real-world applications such as healthcare and bioinformatics. The data consist of an incomplete set of positive samples and a set of unlabeled samples that may be either positive or negative.
Stated formally, let y ∈ {0, 1} be a binary label, x be the feature matrix, s = 1 if the sample is labeled, and s = 0 if the sample is not labeled. If s = 1, then y = 1; but if s = 0, y can be either 1 or 0. So, we have p(s = 1 | x, y = 0) = 0, which means that the probability that a negative sample x appears in the labeled set is zero.

Theoretical basis of the naive PU training strategy
In this study, we adopt a naive PU training strategy, modeling only from positive and unlabeled data. This strategy initially treats all unlabeled samples as negative samples and then trains the model accordingly. High-scoring samples from this initial model are relabeled as positive, while the rest are labeled negative. Subsequently, a second classifier is trained. This process is repeated until the labels of the unlabeled samples converge to the desired result.
The naive PU training strategy has been proven reasonable by the work [39]. It shows that a classifier trained on positive and unlabeled examples predicts probabilities that differ by only a constant factor from the true conditional probabilities of being positive. Let f(x) = p(y = 1 | x) and g(x) = p(s = 1 | x). f is a traditional probabilistic classifier, while g is a nontraditional one. It can be proved that f(x) = g(x) / p(s = 1 | y = 1); according to the definition of the PU scenario, p(s = 1 | y = 1) is a constant. It can be noticed that f is an increasing function of g. This means that if the classifier f is used only to rank examples x according to the chance that they belong to class y = 1, then classifier g can be used directly instead of f, which verifies the rationality of the naive PU training strategy. The description of relevant variables is shown in Table 2.
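The constant-factor relation can be checked numerically: scaling f(x) by any constant c leaves the ranking of samples unchanged, which is all the naive strategy needs. The values of c and f below are made-up illustrative numbers:

```python
# Numerical check of the lemma: if g(x) = p(s = 1 | x) = c * f(x) with
# c = p(s = 1 | y = 1) constant, then g ranks samples exactly as f does.
c = 0.2                              # labeling frequency (assumed constant)
f = [0.9, 0.1, 0.5, 0.7, 0.3]        # true p(y = 1 | x) for 5 toy samples
g = [c * v for v in f]               # non-traditional classifier output

rank_f = sorted(range(len(f)), key=lambda i: f[i])
rank_g = sorted(range(len(g)), key=lambda i: g[i])
# rank_f and rank_g are identical, so g can replace f for ranking.
```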

Workflow of PU vertical-SecureBoost
PU learning is applicable to classification tasks in the vertical-FL scenario. The constructed semisupervised FL model can be trained using positive samples and unlabeled samples, and the prediction of unlabeled samples is completed based on the trained model. As the labels change, the data distribution also undergoes alterations, requiring the model to rely on the updated data for continued training. The iterative process continues for multiple rounds until the labels in the dataset converge under predefined rules. Due to the absence of overlapping users among the medical institutions, we merged data from 5 institutions for building the vertical-FL model. Specifically, the multidimensional table data extracted after merging are partitioned into 2 segments based on the feature columns, representing the feature spaces for the federated participants, guest and host, respectively. The FL participants, guest and host, meet the requirement that the sample space is the same but the feature space is different, thus enabling vertical-FL modeling. In this study, we designate the medical institution data warehouse as the federated client and establish 2 federated parties for vertical-FL modeling: the guest party and the host party.

Fig. 2. The splitting mechanism for privacy preservation. Vertical-SecureBoost guarantees privacy and security when multiple parties jointly build the tree model. When the guest party figures out the best split feature, it notifies the party that holds the feature, denoted as the host. The host then searches for its threshold value, splits the local model, and obtains the left and right child nodes. After splitting the local model, the host transfers its party ID and the sample space of the left child node to the guest party, since the sample space of the right child can be inferred from the left child. The guest party then records the party ID in the current node and splits the local tree model. Next, the guest party sends the party ID and the sample space of the left child node to the remaining parties. In this way, although all parties share the same tree model, the recorded information at each node of each party's tree model may differ. Each party is authorized to see only its own data information.
Figure 3 illustrates the workflow of semisupervised vertical-SecureBoost with a naive PU training strategy, providing additional details on each component. As the guest participant in the FL, the guest holds 2 types of data: positive samples and unlabeled samples. In the data preprocessing stage, unlabeled samples are treated as negative samples, and the process incorporates the vertical-SecureBoost FL algorithm. The trained federated model is used to predict the unlabeled intersection data of the guest participant. Subsequently, these data are sorted based on their predicted probabilities, and those exceeding a predefined threshold are selected. Positive labels are then assigned to these selected high-probability unlabeled intersection data. Figure 4A illustrates the masking strategy used in our experiments with the selected dataset. Figure 4B illustrates the interactive learning process between the SecureBoost and PU components.
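One round of the sort-and-promote step described above can be sketched as follows. The scores stand in for the predictions of a trained federated model, and `proportion` plays the role of the "proportion in PU" hyperparameter; all names and numbers here are illustrative:

```python
def naive_pu_round(scores, unlabeled_idx, proportion):
    """One PU step: rank the unlabeled samples by the previous classifier's
    predicted probability and promote the top `proportion` fraction to
    positive labels for the next training round."""
    ranked = sorted(unlabeled_idx, key=lambda i: scores[i], reverse=True)
    n_promote = int(proportion * len(ranked))
    return set(ranked[:n_promote])

# Toy predicted probabilities from a hypothetical SecureBoost round.
scores = {0: 0.95, 1: 0.10, 2: 0.80, 3: 0.20, 4: 0.60}
unlabeled = [1, 2, 3, 4]          # sample 0 is a known labeled positive
promoted = naive_pu_round(scores, unlabeled, proportion=0.5)
```

The full workflow alternates this step with retraining until the labels converge under the predefined rules.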

Evaluation metrics
The multi-institutional heart sound database reflects imbalances in sample size and class distribution across institutions. This study uses the following evaluation metrics, in addition to traditional metrics such as accuracy (Acc), to measure model performance. Given are C classes, true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN).
We utilize the unweighted average recall (UAR) and the unweighted F1-score (UF1) to evaluate the performance of the diagnostic model. The importance of the UAR metric lies in its ability to give equal weight to the performance of each class. Therefore, UAR is especially valuable for evaluating models on datasets where some classes are underrepresented. For each class c, Recall_c = TP_c / (TP_c + FN_c) and Precision_c = TP_c / (TP_c + FP_c). UAR is calculated as the mean of the per-class recalls:

UAR = (1/C) · Σ_{c=1}^{C} Recall_c

and UF1 can be formulated as the mean of the per-class F1-scores:

UF1 = (1/C) · Σ_{c=1}^{C} (2 · Precision_c · Recall_c) / (Precision_c + Recall_c)
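A direct implementation of the two metrics, assuming integer class labels (the toy predictions are illustrative):

```python
def uar_uf1(y_true, y_pred, classes=(0, 1)):
    """Unweighted average recall and unweighted F1: per-class recall and F1
    averaged with equal weight per class, regardless of class size."""
    recalls, f1s = [], []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        recalls.append(recall)
        f1s.append(f1)
    return sum(recalls) / len(classes), sum(f1s) / len(classes)

uar, uf1 = uar_uf1([0, 0, 1, 1], [0, 1, 1, 1])
```

Because each class contributes 1/C of the score, a model that ignores a minority class is penalized even when overall accuracy looks high.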

Results
Our main objective is to investigate the hyperparameter configurations of the vertical-SecureBoost model with naive PU learning. Subsequently, we conduct comparative experiments to assess model performance using various feature importance analysis methods, aiming to provide valuable insights into abnormal heart sound detection. The experiments cover essential parameters within the interactive learning processes of both the SecureBoost and PU components, along with the selection of 165 features determined by feature importance analysis methods.

Model hyperparameter experiment
We explore crucial hyperparameter settings in semisupervised FL models through 2 sets of experiments. The first set involves configuring the proportion parameter in the PU component and determining the number of trees in the SecureBoost component. The second set focuses on establishing the optimal numbers of SecureBoost and PU components in semisupervised FL. The "proportion in PU" refers to the percentage of top-ranked samples considered positive when executing the current PU step, determined from the sorted scores of samples labeled by the preceding SecureBoost classifier.

Relationship between the first PU (PU1) proportion and model performance
The PU learning strategy enables the FL model to directly learn from a limited set of positive samples and a large pool of unlabeled samples.

Fig. 3. The illustration shows 2 federation participants, a host party and a guest party. On the guest side, ID 1 represents labeled samples and ID 2 represents unlabeled samples. Masked y refers to our treatment of unlabeled samples under the PU learning strategy; y represents the predictions of the samples from the previous round. The host side does not have labels and only provides features. In stage 1, the guest side calculates the first-order derivative (g_i) and the second-order derivative (h_i) of the loss function for each sample ID based on the real or masked labels and the predictions from the previous round, and sends this information to the host side. In stage 2, all parties calculate feature bins based on the information from g_i and h_i, and the relevant information is transmitted to the guest side. In stage 3, the guest side aggregates all the feature bin information from the participating parties and iteratively calculates the best split points for the tree. In stage 4, the algorithm ranks the samples based on the scoring values obtained using the PU learning strategy.

Fig. 4A legend (true label vs. masked label): 1 means a positive label, 0 means a negative label, and x means a masked sample. According to the naive PU training strategy, the classifier treats all x as 0, then iteratively trains a model and selects the top few samples with higher scores as positive samples. The mission of the classifier is to label the unlabeled samples, which belong to the masked part, and to classify the unmasked positive samples correctly.

Relationship between the number of PU and model performance
Figure 4 illustrates the interactive learning process between the SecureBoost and PU components in the FL model based on PU.
The number of SecureBoost and PU components determines the iterations, or rounds, of the learning process. In the control experiment, we varied the number of PU components from 1 to 3, with the corresponding number of SecureBoost components varying from 2 to 4. As indicated in Table 5, the semisupervised FL model achieves its optimal performance with 2 PU components and 3 SecureBoost components. Experimental results, in conjunction with the tree models, demonstrate that the semisupervised FL model can achieve higher classification performance with relatively low model complexity.

Comparative experiment on feature selection methods
This study has 2 main objectives for the semisupervised FL classification model. First, it should perform well, accurately predicting the output for given input features. Second, the model should be interpretable, providing an understanding of the relationship between input features and output. This is crucial when using auxiliary diagnostic models in the sensitive field of healthcare. For instance, in a cardiac auscultation model, it is vital both to predict the patient's diagnosis and to understand which features contribute to the result. Feature importance analysis is a widely used method for interpreting classification models. It quantifies the individual contributions of specific features to a given classifier. Thus, the importance of input data features is model-dependent. In this study, we compared the effects of various feature importance analysis methods on the classification performance of our model, utilizing a high-dimensional feature set extracted from the original heart sound recordings.
In the vertical-FL framework, each federated participant has a distinct feature space. Furthermore, we aim to identify which features contribute most to the performance of the semisupervised FL model in this study. Since vertical-SecureBoost is implemented based on the XGBoost model, we employed 5 tree-based feature selection methods: gain, total_gain, cover, total_cover, and weight. Additionally, we conducted comparative experiments using the SHAP method. Although they are technically related and partially overlap, there is a distinction between feature importance and feature selection. The experiments show that these methods consistently filter the same set of 165 contributing heart sound features (including LLD features and statistical features), differing only in the importance ranking of these features. The table in the Appendix presents the computed results (feature coefficients and importance values) for the 165 features, sorted by feature importance from the SHAP method. In the comparative experiments, selecting the top 165 features based on the SHAP method yielded optimal model performance (Acc: 84.36%, UAR: 84.33%, UF1: 84.35%). The model results for other feature selection methods under the same conditions are compared in Table 6. Moreover, our optimal model performance closely matches that of the supervised SecureBoost model when using 30 trees and a tree depth of 3.
Comparative experiments demonstrate that the semisupervised SecureBoost model efficiently identifies the heart sound features that contribute most, particularly when employing the SHAP feature importance analysis method. The advantage lies in selecting fewer features to achieve superior classification performance, providing clear benefits over other methods. To further demonstrate the superiority of the proposed method, we compare it with existing semisupervised FL algorithms. FedPU (https://github.com/littleSunlxy/FedPU-torch) [34] and FedMatch (https://github.com/wyjeong/FedMatch) [35] represent the semisupervised FL models most comparable to the optimization models in this paper and the state of the art. Table 6 presents the performance comparison among FedPU, FedMatch, and the proposed method. Given the limited data resources in this study, the proposed method achieves state-of-the-art performance on the multi-institutional heart sound database. This also demonstrates that our method outperforms other semisupervised FL methods under low-resource conditions.

Discussion
We will now discuss 3 aspects: the application of the semisupervised FL model in heart sound classification, the identification of the most important features, and whether the crucial features vary depending on the technique used. Application of the semisupervised FL model. We study the problem of learning from positive and unlabeled (PU) data in the federated setting. Specifically, we concentrate on scenarios where some clients have only positive and unlabeled samples, while others have only unlabeled samples. The semisupervised FL model can effectively learn from different institutions with a limited pool of positive samples and unlabeled samples. We validated the effectiveness of this framework on real-world heart sound recordings through a series of experiments. Additionally, this framework demonstrates the ability to achieve better classification performance with relatively low model complexity. When utilizing the SHAP feature importance analysis method, all metrics consistently reach above 84%. The semisupervised FL model can conduct multi-institutional federated modeling without sharing local medical institution data. This helps address the issue of medical data silos and partially safeguards patient privacy. However, it is worth noting that the limited data and the relatively simple PU strategy mean that the performance of the FL model in medical diagnosis still needs improvement. To this end, we are collaborating with multiple medical institutions to build a larger, high-quality multi-institutional heart sound database, such as https://www.vobbit.org, as part of our current work. In practical applications, assessing the performance of the proposed model necessitates considering the diverse environments of each medical institution. Future work should explore various factors in practical applications, such as the number of federated participants, communication costs, data distribution, and FL modeling based on multimodal data [40,41].
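The core relabeling step of a naive PU strategy, ranking samples by model score and pseudo-labeling the top fraction as positive, can be sketched as follows. The function name and toy scores are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def naive_pu_relabel(scores, proportion):
    """Rank unlabeled samples by model score and pseudo-label the
    top `proportion` fraction as positive (1), the rest as negative (0)."""
    order = np.argsort(scores)[::-1]            # highest score first
    n_pos = int(round(proportion * len(scores)))
    labels = np.zeros(len(scores), dtype=int)
    labels[order[:n_pos]] = 1
    return labels

scores = np.array([0.9, 0.1, 0.7, 0.4, 0.2])
print(naive_pu_relabel(scores, 0.4))  # → [1 0 1 0 0]
```

In the federated setting, this ranking happens on the guest side (stage 4 in Fig. 3), and the resulting pseudo-labels feed the next round of training.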
What features are the most important? To refine effective representations of heart sounds from the 6,373 features in the ComParE feature set, we employed various feature importance methods. These methods consistently identified the same 165 features contributing to the model, albeit with differences in importance ranking. The key statistical findings are as follows: the most influential features encompass 73 "audSpec"-related features, with 57 related to "audSpec_Rfilt" and 6 related to "audspecRasta." Additionally, there are 45 features associated with "mfcc_sma" and 36 features linked to "pcm_fftMag," including 30 features tied to "pcm_fftMag_spectral," and 5 features associated with "pcm_RMSenergy." This implies that distinct methods can identify the same effective features for the same classification model. Furthermore, the features extracted from the heart sound data exhibit high correlation, making the classification task straightforward. Thus, different feature importance analysis methods can enable the FL model to achieve better classification accuracy.
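Selecting the top-k features by mean absolute SHAP value (k = 165 in our experiments) can be sketched as follows; the toy SHAP matrix and function name are illustrative assumptions:

```python
import numpy as np

def top_k_features(shap_values, k):
    """Return indices of the k features with the largest mean |SHAP| value."""
    importance = np.abs(shap_values).mean(axis=0)   # per-feature importance
    return np.argsort(importance)[::-1][:k]         # descending order

# Toy SHAP matrix: 4 samples x 3 features
sv = np.array([[ 0.5, -0.1, 0.0],
               [ 0.4,  0.2, 0.1],
               [-0.6,  0.1, 0.0],
               [ 0.5, -0.2, 0.1]])
print(top_k_features(sv, 2))  # → [0 1]
```

With the real 6,373-dimensional ComParE matrix, the same ranking yields the 165 contributing features reported in the Appendix.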
Do the most important features differ depending on the technique? The most important features indeed depend on the method used. Our experiments indicate that the SHAP method provides better results, as the model's performance is optimal and stable when the first 165 SHAP features are selected. By selecting fewer features and achieving optimal performance in the analyzed cases, SHAP has a clear advantage over other methods. Ultimately, this study provides insights into screening one-dimensional acoustic signal features for abnormal heart sound examination. It is noteworthy that this framework, rooted in traditional machine learning, is designed for processing one-dimensional tabular data rather than phonocardiogram (PCG) images. Although model interpretability was not the primary focus, the feature importance analysis in this paper lays the foundation for future FL research on feature-based interpretability.

Conclusion
This study was motivated by 2 primary objectives. First, we assessed the classification performance of the semisupervised FL model using real-world heart sound recordings. Second, we investigated the influence of various feature importance methods on the model's classification performance. Utilizing the classical ComParE feature set, we identified 165 features contributing to the model. Notably, we observed superior performance in heart sound classification with the SHAP-based method, which selected fewer features in the analyzed cases while meeting the model's performance criteria.
The framework employed a naive PU learning strategy, one of the most basic semisupervised learning methods. In future work, we will explore more complex PU training strategies to enhance the performance of the FL model. Moreover, we intend to replicate the proposed analytical scheme on a larger scale, particularly aiming to implement the techniques utilized in neural network-based FL frameworks. The synergy of advanced nonlinear FL models and sophisticated PU learning strategies is expected to demonstrate significant potential for extensive PCG signals.
Fig. 4 provides a simple concrete example to illustrate the training process. Algorithm 1 describes the pseudocode detailing the basic principles and workflow of semisupervised vertical-SecureBoost with a naive PU training strategy.

Fig. 3 .
Fig. 3. Rough outline of the workflow of semisupervised vertical-SecureBoost. The illustration shows 2 federation participants, a host party and a guest party. On the guest side, ID 1 represents labeled samples and ID 2 represents unlabeled samples. Masked y refers to our treatment of unlabeled samples based on the PU learning strategy; ŷ represents the predictions of the samples from the previous round. The host side does not have labels and only provides features. In stage 1, the guest side calculates the first-order derivative (g_i) and the second-order derivative (h_i) of the loss function for each sample ID based on the real or masked labels and the predictions from the previous round, and sends this information to the host side. In stage 2, all parties calculate feature bins based on the information from g_i and h_i, and this relevant information is transmitted to the guest side. In stage 3, the guest side aggregates all the feature bin information from the participating parties and iteratively calculates the best split points for the tree. In stage 4, the algorithm ranks the samples based on the scoring values obtained using the PU learning strategy.
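The stage-1 computation described above, the first- and second-order derivatives of the loss for each sample given the previous round's predictions, can be sketched for the logistic loss as follows. This is a simplified local illustration (function name and inputs are assumptions), not the encrypted federated exchange:

```python
import numpy as np

def grad_hess(margin, y):
    """First- and second-order derivatives of the logistic loss w.r.t. the
    raw margin, as the guest side computes them per sample in stage 1."""
    p = 1.0 / (1.0 + np.exp(-margin))  # predicted probability from last round
    g = p - y                          # first-order derivative g_i
    h = p * (1.0 - p)                  # second-order derivative h_i
    return g, h

# One labeled positive (y=1, margin 0) and one masked-negative (y=0, margin 2)
g, h = grad_hess(np.array([0.0, 2.0]), np.array([1, 0]))
print(g, h)  # g ≈ [-0.5, 0.881], h ≈ [0.25, 0.105]
```

These per-sample (g_i, h_i) pairs are what the guest side sends onward so that all parties can compute feature-bin statistics for split finding.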

Fig. 4 .
Fig. 4. Rough outline of the workflow of semisupervised vertical-SecureBoost. (A) Mask strategy on the dataset. (B) Simple concrete example of the training process.

Table 1 .
Summary of the sub-databases used in the PhysioNet/CinC Challenge. MIT, Massachusetts Institute of Technology; AAD, Aalborg University; AUTH, Aristotle University of Thessaloniki; UHA, University of Haute Alsace; DLUT, Dalian University of Technology; SUA, Shiraz University.

Table 2 .
List of notations used in the semisupervised vertical-SecureBoost framework.

A control group experiment is conducted to analyze the impact of different proportions in PU 1 on FL global model performance while keeping other settings fixed. Since the preprocessed data class is balanced with an equal proportion of positive and negative samples, we set the final PU (PU 2) proportion to 0.5. This helps the model's predictions for the samples converge to an equal distribution of positive and negative outcomes. As depicted in Table 3, the model's performance improves with increasing proportions of PU 1. However, when the proportion exceeds 30%, the model metrics start to decline. The model achieves optimal values (Acc: 84.36%, UAR: 84.33%, UF1: 84.35%) when the proportion in PU 1 is 30%. Another control group experiment involves varying the number of tree models in the SecureBoost component. This pertains to the impact of SecureBoost model complexity on the performance of the semisupervised FL model. The experiment fixed 3 SecureBoost components, each with relevant parameters, and examined the performance variation of the FL model with 10, 20, 30, and 40 trees within each component. As indicated in Table 4, the semisupervised FL model's performance varies with the number of trees.

Table 3 .
Mean testing performance (in %) of 50 repetitions of the semisupervised FL model, exploring the relationship between the proportion in the first PU (PU 1) and model performance. Fixed parameters: the proportion in the second PU (PU 2) is 0.5; the number of trees in SecureBoost components {1, 2, 3} is 10, 20, and 30, respectively; and the depth of the trees is 3.

Table 5 .
Mean testing performance (in %) of 50 repetitions of the semisupervised FL model, exploring the impact of the number of SecureBoost and PU components on model performance. Fixed parameters: the proportion for PU 1 is 0.3, and for PU 2, it is 0.5; the number of trees in SecureBoost components {1, 2, 3} is 10, 20, and 30, respectively; and the depth of the trees is 3.

Table 6 .
Mean testing performance (in %) of 50 repetitions of semisupervised and supervised FL models: performance comparison of semisupervised FL models when utilizing different feature importance analysis methods. Fixed parameters: in semisupervised learning, the proportion for PU 1 is 0.3, and for PU 2, it is 0.5; the number of trees in SecureBoost components {1, 2, 3} is 10, 20, and 30, respectively. In supervised learning, the number of trees is 30 and the depth of the trees is 3.

Table 4 .
Mean testing performance (in %) of 50 repetitions of the semisupervised FL model, exploring the impact of the number of tree models in the SecureBoost component on the performance of semisupervised FL. Fixed parameters: proportion 0.3 in the first PU (PU 1), proportion 0.5 in the second PU (PU 2), and tree depth 3 in SecureBoost.

Table A1 .
Based on the ComParE feature set, we present the selected 165 heart sound features and their corresponding computational results