Encrypted federated learning for secure decentralized collaboration in cancer image analysis

Highlight • Federated learning with homomorphic encryption enables multiple parties to securely co-train artificial intelligence models in pathology and radiology, reaching state-of-the-art performance with privacy guarantees.


Introduction
Artificial intelligence (AI) and machine learning techniques are transforming cancer imaging and cancer research and will have a profound impact on the practice of medicine (Boehm et al., 2022; Echle et al., 2021; Elemento et al., 2021; Kleppe et al., 2021). They can automate manual tasks in medical image analysis and can be used to extract hidden information from routinely available clinical image data, beyond what is visible to the human eye (Kather and Calderaro, 2020; Lu et al., 2021). AI models have been used for the detection and diagnosis of cancer, subtype classification, and optimization of cancer treatments. In particular, deep neural networks have been trained to analyze radiology images and digitized pathology slides for numerous different cancer types. For example, AI models can now detect mammographic lesions with expert-level performance (Lotter et al., 2021). Similarly, AI models predict molecular biomarkers for treatment selection directly from routine pathology slides of solid tumors (Binder et al., 2021; Coudray et al., 2018; Fu et al., 2020; Kather et al., 2020, 2019; Loeffler et al., 2022).
However, the training of AI models is infamously data hungry and requires large amounts of annotated training data.While this data may already exist, in most cases it is scattered among multiple centers.Collecting this data at a central site is hindered by obstacles which are often insurmountable in practice, most notably issues with data privacy and data governance.The data governance problem has been addressed by collaborative learning protocols such as federated learning (FL) (Lu et al., 2022;McMahan et al., 2017) in which an AI model is trained on separate sites and in which not data, but only the learned model weights are shared.This facilitates collaboration between multiple parties, but still poses significant risks for breach of patient privacy.The weight updates communicated to the central FL server contain information about the data that can be extracted to reconstruct sensitive patient information (Kaissis et al., 2021).This can be exploited through privacy attacks such as model inversion (Kaissis et al., 2020;Usynin et al., 2021;Wang et al., 2019), in which a malicious server eavesdropper captures the weight updates and attempts to recover the private dataset used to train the model or reveal other private attributes.Thus, secure multi party computation (SMPC) (Canetti et al., 2002) methods are needed by the medical community.

Prior work on privacy-preserving federated learning
One measure to protect against privacy breaches is differential privacy (DP), in which deliberate noise is added to the training updates by each site (Dwork and Roth, 2013; Kaissis et al., 2020; Truex et al., 2019). However, while this paradigm protects private information, it comes at a utility tradeoff and can lead to less performant AI models, as demonstrated recently (Lu et al., 2022; Tayebi Arasteh et al., 2023). Another privacy-preserving technique which could be used for SMPC is homomorphic encryption (HE). HE can protect against a malicious server eavesdropper while maintaining AI model performance by encrypting the weight updates before sending them to the central server. One of the most common methods to implement HE in machine learning is so-called fully homomorphic encryption (FHE) (Gentry, 2009), where all the operations are done in an encrypted space. A successful implementation of FHE was first shown by Cheon et al. (2017) with the CKKS algorithm (named after its authors: Cheon, Kim, Kim, and Song), which supports computation for almost all algebraic operations. Further works (Froelicher et al., 2021; J. X. Ma et al., 2022; Sav et al., 2021; Stripelis et al., 2021; Zhang et al., 2020) built on top of CKKS by introducing other modules such as bootstrapping or new batching mechanisms to improve performance or to reduce computation time. Although they guarantee a high degree of privacy, a major downside of CKKS-based algorithms is their high computational cost (Taiello et al., 2022), which makes the SMPC training process very resource-intensive, in particular for high-dimensional data. Moreover, none of the above works employed large real-world medical datasets to support their methods, and their applicability in terms of utility and computational overhead in the medical image analysis domain is unclear. Somewhat homomorphic encryption (SHE) (Damgård et al., 2012) methods could save computational resources while still providing privacy guarantees for certain parts of the process. One of the most successful SHE protocols is the SPDZ algorithm (named after its authors: Damgård, Pastro, Smart, and Zakarias) (Damgård et al., 2012) and extensions thereof (Baum et al., 2020; Damgård et al., 2013; Keller, 2020), which is based on additive secret sharing and can provide low-latency SMPC because of its very fast online phase. Keller et al. (2018) showed that computational time could be drastically reduced while still preserving privacy by omitting the zero-knowledge proof of plaintext knowledge (Bendlin et al., 2011).
We propose to use an SPDZ-based algorithm, so-called somewhat-homomorphically-encrypted federated learning (SHEFL). In this setup, HE is employed only after each local training round of the participating sites. The central server performs the weight aggregation on the encrypted values, and the encrypted updated weights are sent back to the clients for decryption and incorporation into their models. Importantly, since the central server does not have access to the decryption key, it cannot infer any information about which calculations have been done at individual peer locations and thus cannot extract sensitive private information. In other words, all handling of the model parameters by the central server happens in the encrypted space, making homomorphic encryption a well-suited tool for low-trust environments and handling of personal health data.

System and threat models
In this study, we examined how SHEFL can be leveraged for training of competitive AI models for cancer diagnosis and detection of cancer biomarkers in radiology and pathology images. To this end, we assumed the following threat model: A mutually trusting confederation of data owners wishes to collaboratively train a model on their joint data, but none of them wants to relinquish data governance. For conducting the training, the confederation makes use of an untrusted aggregation server, which we assume to honestly participate in the protocol (i.e., to faithfully conduct the aggregation procedure), but to attempt to extract all available information from the weight updates sent to it by the other participants ("trusted-but-curious" threat model). We evaluated the training of AI models in three retrospective multicentric settings: 1) AI models are trained with local data only; 2) AI models are trained with conventional federated learning, whereby no additional measure of protection against privacy-centred attacks on the updates is undertaken; and 3) AI models are trained with SHEFL in a decentralized, secure and privacy-preserving manner, whereby the individual participants encrypt their weight updates before transmitting them to the server. We hypothesized that the collective and secure training of AI models reaches better accuracy than training of local models and is associated with minimal risk of privacy leakage as compared to conventional FL, while keeping the cost of additional training time low because HE according to the SPDZ algorithm is applied only immediately before weight aggregation. Furthermore, we hypothesized that dropping the zero-knowledge proof requirement (Keller et al., 2018) of the SPDZ algorithm could reduce the quadratic complexity to linear, which could substantially lower the computational time.

SHEFL guarantees data privacy compared to conventional federated learning in the untrusted central server setting
When multiple institutions collaborate in a conventional federated learning scheme, weight updates are calculated locally and are sent to a central server to be aggregated. When unencrypted weight updates are transmitted, we demonstrate that the untrusted central server can reconstruct the training images from the weight updates in a model inversion attack. In this setting we train a neural network for the detection of malignant lesions on brain MRI examinations from the brain tumor segmentation (BraTS) dataset (Bakas et al., 2018, 2017; Menze et al., 2015). We employ a realistic setting in which data is contributed by five different institutions and in which each institution performs separate weight updates only on its own data. We then perform a gradient inversion attack following the approach by Zhao et al. (2020). We demonstrate that the original training images can be reconstructed after only 120 iterations, notably before training of the underlying neural network objective has converged (see Fig. 1). This poses a serious threat and renders the whole concept of conventional federated learning vulnerable to privacy-focused attacks. To showcase that homomorphic encryption can be used to counter these attacks and to safeguard patient privacy, we repeat the training procedure, but employ homomorphic encryption in which the central server only has access to the encrypted weight updates and the key is kept private by the peers. Following the same attack approach, no identifiable information can be extracted from the weight updates, even after eavesdropping on the weight updates for 40,000 iterations.
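Why plain weight updates leak data can be seen in the simplest possible case. The sketch below is not the attack of Zhao et al. (which iteratively optimizes a dummy input against the gradients of a full network); it is a minimal illustration, under the assumption of a single fully connected layer with bias trained on one sample, in which the input can be read directly off the gradients. All variable names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "private" input (e.g. a flattened image patch) and one linear layer.
x = rng.normal(size=8)
W = rng.normal(size=(3, 8))
b = rng.normal(size=3)
target = np.array([1.0, 0.0, 0.0])

# Forward pass with squared-error loss L = 0.5 * ||W x + b - target||^2.
y = W @ x + b
dL_dy = y - target             # gradient of the loss w.r.t. the layer output

# Gradients that a site would transmit in unencrypted FL.
grad_W = np.outer(dL_dy, x)    # dL/dW = dL/dy * x^T
grad_b = dL_dy                 # dL/db = dL/dy

# A curious server divides one row of grad_W by the matching bias gradient.
row = int(np.argmax(np.abs(grad_b)))
x_reconstructed = grad_W[row] / grad_b[row]
```

In this toy setting the reconstruction is exact; for deep networks and larger batches the attacker has to resort to iterative optimization, which is the regime studied in this work.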

Secure training does not affect performance of oncological AI models
We trained AI models for tasks in oncology spanning both radiological and histopathological use-cases, see Fig. 2. Each model was trained in three settings: a) AI models are trained with local data only b) AI models are trained with conventional federated learning in a decentralized manner c) AI models are trained with SHEFL in a decentralized, secure and privacy-preserving manner.While approach a) is immune to privacy leaks, it results in training on only a limited subset of the possible data pool.Approach b) makes full use of the data but is prone to privacy leaks through the aforementioned attack by the untrusted aggregator.Only approach c) combines both training on full data and guarantees patient privacy.Moreover, as the HE scheme utilized in our study is endowed with a correctness guarantee (i.e., the values of the decrypted updates are guaranteed to be identical up to numerical precision to their plain-text counterparts), this setting does not suffer from an accuracy penalty compared to non-private training.We test the performance of each paradigm for AI models for the segmentation of glioblastoma on magnetic resonance images (MRIs) and for the detection of microsatellite instability in histopathological whole slide images (WSIs) of colorectal cancer patients.

Segmentation of glioblastoma on MRI
The BraTS training dataset comprises 369 MRI examinations of 369 patients which have been acquired at seventeen different clinical centers. We partitioned the data into five groups according to where the images had been acquired and trained a 3D U-Net (Çiçek et al., 2016; Ronneberger et al., 2015) architecture to segment the tumor volume. All models were tested on an external test set from a separate institution provided by the BraTS organizers (n = 125), and we employed the Dice similarity score as a measure of performance. All five locally trained AI models performed worse in terms of the Dice score than the models trained with FL and with SHEFL. Notably, no performance drop was seen in the model trained with SHEFL as compared to the model trained with conventional FL, cf. Table 1.
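For reference, the Dice similarity score between a predicted and a reference binary mask is twice the overlap divided by the sum of the two mask sizes. The following is a minimal sketch of that definition; the function name and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the study.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    intersection = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return float(2.0 * intersection / denom)

# Toy 2D masks; real BraTS masks are 3D volumes.
pred = [[1, 1, 0], [0, 1, 0]]
truth = [[1, 0, 0], [0, 1, 1]]
score = dice_score(pred, truth)  # 2 * 2 / (3 + 3)
```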

Prediction of genetic biomarkers in colorectal cancer patients from pathology images
In an analogous setting to the radiological use-case, we tested whether SHEFL performs equal to conventional FL and superior to locally trained models in the benchmarking task of predicting a molecular biomarker in colorectal cancer from pathology images: microsatellite instability (MSI)/mismatch repair deficiency (dMMR), which qualifies metastatic patients to receive cancer immunotherapy. We performed the evaluation on independent test sets never seen during training: the clinical trial cohort QUASAR (n = 1774 patients from the United Kingdom) and the population-based cohort YCR-BCIP (Yorkshire Cancer Research Bowel Cancer Improvement Programme, n = 889 patients). We trained three models on the Epi700 data (United Kingdom, n = 607), the DACHS data (Germany, n = 2039) and the TCGA data (USA, n = 426), respectively. Subsequently, we trained one model each in the federated learning setup including all three datasets, without and with homomorphic encryption. Training with SHEFL was superior to training just with local data and non-inferior to training with FL, both for testing on the YCR cohort and for testing on the QUASAR cohort. FL and SHEFL performed on the same level with no detectable difference, cf. Table 2.

Secure training is time-efficient
A notable drawback of homomorphic encryption is its computational overhead. In our study, we mitigated this drawback by encrypting not the entire training process, but only the privacy-critical weight aggregation step, which is performed by a (potentially untrusted) third party, thus enabling substantial computational savings. To determine the effect of our scheme on training time compared to FL without encryption, we conducted the following experiments on a typical hardware setup used in machine learning. As a side note, decryption and encryption as well as weight aggregation are usually conducted on the central processing unit (CPU), while backpropagation during training of the networks depends on the graphics processing unit (GPU).
We found that the time required for encryption was almost negligible compared to the time required to perform the backpropagation steps and the application of weight updates: for the radiological use-case described above, less than 1 % of computational time was spent on decryption, encryption and homomorphic aggregation of the weights (Fig. 3d). For the histopathological use-case, less than 5 % of time was used for decryption and encryption (which happen at the edge) and homomorphic aggregation of the weights (which happens at the central server, Fig. 3b). This difference is due to the different network architectures and different numbers of parameters: the histopathological use-case employs a fixed backbone feature extractor (Saldanha et al., 2022) and thus has fewer parameters to optimize. Encryption and decryption scale approximately linearly with the number of weights to be updated, while neural network training complexity scales more than linearly in our setup. Thus, more complex networks, such as the one used to segment brain tumors, invest more computational resources in the backpropagation algorithm relative to the encryption algorithm. This is encouraging, since the relationship between training time and aggregation time is in favor of more complex networks that are usually employed when working with big data.

Discussion
AI has an indisputable potential in the field of oncology (Bhinder et al., 2021) and AI models are currently reaching a stage in which they can improve patient care and render medical processes more efficient (Killock, 2020;McKinney et al., 2020).
However, this improvement critically depends on the availability of sufficiently large, curated, and representative training data (Willemink et al., 2020).Currently, most research groups and industry have limited and only local data access.To train useful and generalizable AI models, stakeholders need to be able to collaborate on a large scale without jeopardizing patient privacy (Bhinder et al., 2021).Only through such multi-institutional collaboration can robust AI models be trained that adequately capture the entire human population and that make the transition from bench to bedside (Bhinder et al., 2021).Federated learning was initially proposed as a technical solution for privacy-preserving distributed AI (Konečný et al., 2017).FL enables joint training of AI models by multiple partners who do not share their data with each other and has been demonstrated to facilitate the training of AI models on big data (Dayan et al., 2021).Similarly, swarm learning (SL) utilizes a network of nodes to jointly train a model on distributed data and to aggregate model weights without a central instance (Saldanha et al., 2022;Warnat-Herresthal et al., 2021).However, FL and SL have an important shortcoming: during training, weight updates must be shared and information about the underlying data can be extracted from these weight updates as shown in our study.Such techniques should thus not be considered privacy techniques, but techniques for preserving data governance (Ziller et al., 2022).Since medical data is highly sensitive and since data privacy laws forbid the use of data in such environments, where private data can be extracted, this critically limits the applicability of collaborative learning schemes and prevents the development of powerful AI models in cancer diagnosis and treatment.
This shortcoming can be remedied by employing techniques which guarantee privacy to data owners. The only technique to guarantee privacy in a data release process is differential privacy (Dwork and Roth, 2013). Hence, when sharing the model with untrusted third parties, such a technique would have to be employed to constrain the success of attacks against patient privacy. We operate under a slightly different threat model. As all participants of the federated learning workflow described above are mutually trusting, are not intending to publish the model to the outside world and all receive an identical copy of the final model, we need only protect against an attack by the (untrusted) aggregation server. Our homomorphic encryption scheme protects the weights during this critical aggregation step: local sites encrypt their weight updates before sending them out and keep the decryption key private. The entity which receives the weight updates from all sites and which performs the weight aggregation in the encrypted space thus has no access to the underlying data, and no sensitive data can be extracted by design. Our technique has two notable benefits. First, it sidesteps the computational overhead of having to train the entire model in the encrypted space using HE. In principle, it would also be possible to use HE on all levels of the training process, i.e., also during backpropagation. However, with currently available computational resources, this has proven to be prohibitively computationally expensive and is not yet in reach (Keller et al., 2018). Furthermore, as long as all data stays on site, as is the case in our FL setup, there is no need to encrypt the backpropagation procedure: potential eavesdroppers do not have access to that part of the training procedure as it is done behind secure firewalls. By restricting homomorphic encryption to the most critical part of FL, the weight aggregation, we show that the additional computational overhead is almost negligible. Second, our technique allows us to avoid the privacy-utility trade-offs of employing differential privacy for training, which can lead to less-performant AI models (Lu et al., 2022). We note that the utilization of differential privacy would be mandatory in threat models different from ours, especially if the final model was designed to be shared with untrusted third parties.
A similar scheme to ours was demonstrated by Kaissis et al. in a proof-of-concept study for classifying pneumonia on chest radiographs by using secure multi-party computation through additive secret sharing (Kaissis et al., 2021; Keller et al., 2018). With our study, we are the first to comprehensively assess homomorphic encryption in cancer diagnosis on large multi-centric databases spanning both radiology and histopathology.

Table 1
Performance of the five radiological AI models that were trained on local data only (sites 1-5) and of the AI models that were trained with federated learning (FL) and with additional homomorphic encryption (SHEFL). P-values are given for the comparison to SHEFL.

Figure
Training neural networks on single-site datasets results in inferior performance as compared to FL and SHEFL. A neural network was trained to detect MSI on data from the Epi700, the DACHS and the TCGA cohorts, respectively, as well as on all three datasets using FL and SHEFL. The resulting networks were then tested on the QUASAR (A) and the YCR-BCIP (B) cohorts, demonstrating superior performance of FL and SHEFL. Similarly, tumor segmentation in MRI data was trained on data from five different sites as well as on all data using FL and SHEFL. The resulting neural networks were then tested on an independent held-out test set and demonstrated improved performance (C). Computational overhead was almost negligible (red: overhead for FL, yellow: additional overhead for encryption) as compared to training time needed for backpropagation (blue) (D).

Our study demonstrates that AI models for oncological imageprocessing can be trained securely on multi-institutional data without compromising patient privacy.This will facilitate collaboration between researchers and industry alike, ultimately leading to the development of advanced and clinically useful AI models.We show that implementing the FL scheme together with homomorphic encryption comes with minimal additional code complexity and can be performed with our publicly available code.
A technical limitation of our study is that we performed all experiments within one institutional network.However, by distributing the datasets to different computing entities and keeping them strictly separate, we simulated the setting in which multiple institutions -each with their own network -perform FL realistically.We assumed a constant network communication cost in our experiments.In realistic settings, communication overhead can be unpredictable, as it depends on more factors than network size (such as concurrent traffic or physical distance of the sites).We thus chose to exclude this factor, believing it to only represent a minor limitation.We note that homomorphically encrypted weights cannot be efficiently reduced in size by compression, however this limitation is negligible compared to the requirement to encode them as 64-bit data types for transmission over the hypertext transfer protocol (HTTP).Moreover, as all parties are mutually trusting and receive an identical copy of the fully trained model at the end of training, we utilized the same key pair to encrypt the weights on all participating nodes, thus avoiding the technical challenge of key distribution.
Further improvements to the FL process are possible. As the number of peers participating in the FL setup grows, a bounded number of malicious participants may try to corrupt the training process by delivering adversarial weight updates, whereas we regarded all participants as either fully trusted or honest but curious. It has been shown that regular FL fails to converge in the presence of faulty and malicious clients (Blanchard et al., 2017). Measures to counter these attacks are available and can be integrated in federated learning schemes should the need arise (X. J. Ma et al., 2022).
In conclusion, our study provides a blueprint for the secure and privacy-preserving multi-institutional training of oncological AI models and solves an urgent need, since it is becoming increasingly clear that differences in race and gender affect disease risk among individuals and that existing datasets at local institutions are insufficient to account for these effects.

Ethics statement
This study was carried out in accordance with the Declaration of Helsinki. This study is a retrospective analysis of publicly available anonymized MRI examinations and of anonymized histopathological tissue samples from multiple cohorts of cancer patients. Collection and anonymization of patients in all cohorts took place in each contributing center. Approval by the local ethics committee at each contributing center was given if applicable (QUASAR: North East - York Research Ethics Committee; YCR: Ethical approval was not required, because screening was recommended in all patients diagnosed with CRC. Testing was considered part of the 'standard of care' clinical pathway; Epi700: Northern Ireland Biobank (NIB13/0069, NIB13/0087, NIB13/0088 and NIB15/0168); DACHS: Ethics committee of the Medical Faculty, University of Heidelberg). Approval of the ethics committee at the University Hospital of Aachen was given for the retrospective analysis of anonymized image data under reference number "Ethikkommission EK 028/19".

Patient cohorts
MRI data for the BraTS patient collective contains brain MRI scans of 341 patients collected from 17 imaging centers and an additional 28 patients for whom the imaging centers were not specified by the data provider. During federated learning we allocated the patients to five data clusters, simulating the situation in which a regional hospital's image database contains MRI data of multiple imaging centers. This situation is typical in real-world scenarios where patients are referred for surgery and bring image data that had previously been acquired at an external institution. The allocation of patients is detailed in supplemental Table S1. All MRI examinations contained pre- and postcontrast T1-weighted sequences, T2-weighted sequences and fluid-attenuated inversion recovery (FLAIR) sequences. All sequences were acquired in axial orientation. All imaging datasets have been segmented manually by one to four raters following the same annotation protocol, and their annotations were approved by experienced neuro-radiologists.
For the histopathological data we collected digital whole slide images (WSI) of H&E-stained slides of human colorectal cancer (CRC) from five patient cohorts, three of which were used as training cohorts and two of which were used as test cohorts following the division of data in a previous study (Saldanha et al., 2022).The training cohorts are representative of real-world clinical settings.First, the Northern Ireland Epi700 (n = 661) cohort study contained data of patients with stage II and III colon cancer.This data was provided by the Northern Ireland Biobank (Lewis et al., 2018;Loughrey et al., 2021) (application NIB20-0346).Second, the "Darmkrebs: Chancen der Verhütung durch Screening" study (DACHS, n = 2448) is a large population-based case-control study.This data includes samples of CRC patients at any disease stage.This data was collected from over 20 hospitals in Germany.Data collection was coordinated by the German Cancer Research Center (DKFZ, Heidelberg, Germany) (Brenner et al., 2006;Carr et al., 2020;Li et al., 2022) and supported by the NCT tissue bank at the National Center for Tumor Diseases and the Institute of Pathology at the University of Heidelberg.Third, "The Cancer Genome Atlas" (TCGA) CRC cohort (n = 632) is a large collection of tissue specimens from multiple populations across different countries, but largely from the United States of America (USA) ("GDC," n.d.).
We employed two separate test cohorts: The "Quick and Simple and Reliable" (QUASAR) cohort was derived from a clinical trial of adjuvant therapy containing 2206 WSI, which aimed to determine survival benefit from adjuvant chemotherapy in CRC patients from the United Kingdom (UK) (Hutchins et al., 2011;Quasar Collaborative Group et al., 2007).The second test cohort used data from the Yorkshire Cancer Research Bowel Cancer Improvement Programme (Taylor et al., 2019) (YCR-BCIP) cohort (n = 889).This was a population-based study collected in the Yorkshire Region in the UK.For all cohorts, microsatellite instability (MSI) / mismatch repair deficiency (dMMR) (Marks and West, 2020) data were acquired.
The distribution of tumor stages in TCGA, DACHS and YCR-BCIP is comparable, see supplemental Table S2.In QUASAR, stage III tumors are overrepresented due to the fact that adjuvant therapy is mainly performed in intermediate stage tumors.Therefore, following previous work (Saldanha et al., 2022), we used YCR-BCIP and QUASAR as test cohorts to investigate the robustness of the AI models both on a general population and on a clinical trial population.Importantly, neither in the MRI data nor in the histopathological data, there was any overlap between training and test cohorts.

Hardware
The hardware used in our experiments were Intel CPUs with 18 cores and 32 GB RAM and Nvidia RTX 6000 GPUs with 24 GB memory.

MRI data
All of the 3D volumes were cropped around the brain to lower the computational costs and standardize the field of view. As intensity distributions vary across magnetic resonance images, intensity normalization is crucial. Therefore, we clipped the intensity values above the 99th percentile of the image, then subtracted the minimum value of the result from the voxel values and divided the shifted image by the maximum value of the image. We performed data augmentation during training by applying random cropping of patches of 128 × 128 × 128 from each original volume around its center. Additionally, we applied mediolateral and cranio-caudal flipping with a probability of 0.4. Intensity was randomly rescaled according to a power law $I_{\mathrm{new}} = g \cdot I^{\gamma}$ (Cirillo et al., 2021), with gain g and exponent γ randomly selected between 0.8 and 1.2 from a uniform distribution. White Gaussian noise with zero mean and a standard deviation of 0.03 was added to each sequence of the multi-modal MRI data.
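The intensity steps above can be sketched as follows. This is an illustrative reading of the description, not the study's code: it assumes that the division uses the maximum of the shifted image (so that values fall in [0, 1]) and that both g and γ are drawn from the 0.8 to 1.2 range. Function names and the synthetic input volume are hypothetical; cropping and flipping are omitted.

```python
import numpy as np

def normalize_intensity(volume):
    """Clip above the 99th percentile, shift to zero minimum, scale to [0, 1]."""
    volume = np.asarray(volume, dtype=np.float64)
    clipped = np.minimum(volume, np.percentile(volume, 99))
    shifted = clipped - clipped.min()
    max_val = shifted.max()
    return shifted / max_val if max_val > 0 else shifted

def random_power_law(volume, rng, low=0.8, high=1.2):
    """Apply I_new = g * I**gamma with g and gamma drawn uniformly (assumed range)."""
    gain = rng.uniform(low, high)
    gamma = rng.uniform(low, high)
    return gain * np.power(volume, gamma)

def add_gaussian_noise(volume, rng, std=0.03):
    """Add zero-mean white Gaussian noise with the given standard deviation."""
    return volume + rng.normal(0.0, std, size=volume.shape)

rng = np.random.default_rng(42)
# Synthetic stand-in for one MRI sequence; real volumes are much larger.
raw = rng.gamma(shape=2.0, scale=100.0, size=(16, 16, 16))
norm = normalize_intensity(raw)
augmented = add_gaussian_noise(random_power_law(norm, rng), rng)
```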
A modified 4-level 3D U-Net (Çiçek et al., 2016;Ronneberger et al., 2015) was utilized for segmentation of brain tumors.In the contraction path, each layer contained two 3 × 3 × 3 convolutions, each followed by a rectified linear unit (ReLU) (Agarap, 2019), a batch normalization (BN) (Ioffe and Szegedy, 2015) and then a 2 × 2 × 2 max pooling with strides of two in each dimension.The output channel number was doubled after each level in the contraction path, and it was 48 at the end of level one.In the expansion path, each layer consisted of a nearest neighbor up-sampling of 2 × 2 × 2 in each dimension, followed by two 3 × 3 × 3 convolutions each followed by a ReLU and BN.The output channel number was halved after each level in the expansion path.In the last layer, a 1 × 1 × 1 convolution, which reduced the number of output channels to 3, followed by a SoftMax layer, was used for the per-voxel final classification.
The model was optimized using the Adam optimizer (Kingma and Ba, 2017) with a learning rate of $10^{-4}$. To be consistent across our comparison scenarios, all weight and bias parameters of all models were initialized using the He initialization scheme (He et al., 2015). As a loss function, we chose the Dice loss tailored to the needs of the BraTS data (Henry et al., 2021). To minimize the overhead and make maximum use of the graphics processing unit memory, we favored large input tiles over a large batch size and reduced the batch to a single 3D image (Ronneberger et al., 2015) with 4 channels, each channel being one of the MR modalities. Hence, the batch normalization acted like instance normalization in our implementation. The network contained a total of 5,670,579 trainable parameters.

Histopathological data
For prediction of molecular features from image data, we based our analysis on a well-established weakly-supervised end-to-end prediction pipeline, which was described and evaluated in a recent benchmark study (Ghaffari Laleh et al., 2022). As a preprocessing step, the original gigapixel WSIs were tessellated into patches of size (512 × 512 × 3) pixels and were color-normalized with the Macenko method (Macenko et al., 2009). Blurry patches and patches with no tissue were removed from the data set using Canny edge detection (Ghaffari Laleh et al., 2022). Following that approach, we obtained a normalized edge image using the "canny" method in Python's OpenCV (Culjak et al., 2012) package and then removed all tiles with a mean value below a threshold of 4. A pre-trained ResNet18 was used to extract a (512 × 1) feature vector from 150 randomly selected patches for each patient. Before training, the number of tiles in each class was equalized by random undersampling until all classes had the same number of tiles, as described before (Kather et al., 2020, 2019). Feature vectors served as input to a fully connected classification network, and the patient-wise MSI label was used to label every single tile derived from that patient. The fully connected classifier network comprised four layers with (512 × 256), (256 × 256), (256 × 128) and (128 × 2) connections with a ReLU activation function, and the network contained a total of 492,930 trainable parameters. The model was optimized using the Adam optimizer (Kingma and Ba, 2017) with a learning rate of $4 \times 10^{-5}$, and the He initialization scheme (He et al., 2015) was employed. Cross-entropy was chosen as the loss function, and the model was trained in batches of size 124 for 100 epochs with 5-fold cross-validation.
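The class-balancing step (random undersampling until all classes have the same number of tiles) can be sketched as below. This is a minimal illustration using only the standard library; the function name, the fixed seed, the tile identifiers and the "non-MSI" label are placeholders and not taken from the study's pipeline.

```python
import random
from collections import defaultdict

def undersample(tiles, labels, seed=0):
    """Randomly drop tiles until every class matches the size of the rarest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for tile, label in zip(tiles, labels):
        by_class[label].append(tile)
    n_keep = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        for tile in rng.sample(items, n_keep):
            balanced.append((tile, label))
    rng.shuffle(balanced)  # avoid class-ordered batches
    return balanced

# Toy example: 7 tiles of one class and 3 of the other.
tiles = [f"tile_{i}" for i in range(10)]
labels = ["non-MSI"] * 7 + ["MSI"] * 3
balanced = undersample(tiles, labels)
```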

The collaborative learning procedure
Every participating site performed a complete local training round on its own data in a conventional, non-privacy-preserving machine learning manner, leading to the calculation of local gradient updates of the network parameters; in our case, each round equaled one epoch. Afterward, the local sites applied a homomorphic encryption setup using a public key on their gradient updates according to the SPDZ algorithm (Damgård et al., 2012) while ignoring the zero-knowledge proof of plaintext knowledge (Bendlin et al., 2011) requirement. The encrypted network parameters were aggregated by the central server in the encrypted space according to the FedAvg (McMahan et al., 2017) algorithm, leading to one set of global network parameters (which were still in the encrypted space). A copy of the global encrypted parameters was transferred back to the local sites by the central server. Using the public key, each site decrypted the global model and started another local training round with these new model parameters. This iterative process continued until the convergence of the global model.
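The aggregation step can be illustrated in plaintext as follows. This sketch only shows the averaging arithmetic and performs no encryption; the function name `fed_avg`, the flat-list representation of the parameters, and the optional weighting by local sample counts are assumptions made here for illustration.

```python
def fed_avg(site_weights, site_sizes=None):
    """Average the parameter vectors received from several sites.

    site_weights -- list of equally long lists of floats, one list per site
    site_sizes   -- optional number of training samples per site; if omitted,
                    all sites contribute equally (an assumption of this sketch)
    """
    n_sites = len(site_weights)
    if site_sizes is None:
        site_sizes = [1] * n_sites
    total = float(sum(site_sizes))
    n_params = len(site_weights[0])
    # The average only needs additions and multiplications by public scalars,
    # which are the operations an additively homomorphic scheme supports.
    return [
        sum(w[i] * s for w, s in zip(site_weights, site_sizes)) / total
        for i in range(n_params)
    ]
```

For two sites with parameters `[1.0, 2.0]` and `[3.0, 4.0]`, the unweighted result is `[2.0, 3.0]`; with sample counts 1 and 3 it is `[2.5, 3.5]`.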

Details of the homomorphic encryption method: the SPDZ algorithm
The algorithm utilizes an additive secret sharing strategy, where a message x is encrypted by distributing it as different shares to the participants. Assuming a trusted-but-curious aggregation server, it requires only one crypto provider for dividing the shares between local sites. Particularly, assuming there are N sites indexed by n ∈ {1, 2, 3, …, N}, each site gets assigned a random integer in the range (0, Q) as its secret share x_n, except for site N, which gets a share as follows:
$$x_N = \left(x - \sum_{n=1}^{N-1} x_n\right) \bmod Q \qquad (1)$$
The public key Q is a large prime number generated by the crypto provider. Consequently, the secret x could be decrypted according to Eq. (2):
$$x = \left(\sum_{n=1}^{N} x_n\right) \bmod Q \qquad (2)$$
Although all the sites have access to the public key Q, none of them would know the actual secret x, as it is shared additively among them. Importantly, since the central server does not have access to Q, it cannot infer any information about the secret x. Moreover, the scheme has a homomorphic property. Thus, a certain number of operations, such as addition and multiplication, could be performed in the encrypted space without any information loss. This method particularly suits our goals, as we intended to use the HE solely during the weight aggregation, which requires only two types of operations, namely addition and multiplication, i.e., there is no need for expensive operations such as convolution, pooling, and derivation.
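A minimal sketch of additive secret sharing modulo Q, assuming integer secrets, is given below for illustration. It is not the PySyft or SPDZ implementation: the modulus value, the function names, and the use of Python's `secrets` module as the source of randomness are assumptions, and multiplication of two shared values, which requires additional machinery in SPDZ, is not shown.

```python
import secrets

Q = 2**61 - 1  # assumed modulus for illustration only (a Mersenne prime)


def share(x, n_sites, q=Q):
    """Split the integer secret x into n_sites additive shares modulo q."""
    # The first N-1 shares are drawn uniformly at random from [0, q).
    shares = [secrets.randbelow(q) for _ in range(n_sites - 1)]
    # The last share is chosen so that all shares sum to x modulo q.
    shares.append((x - sum(shares)) % q)
    return shares


def reconstruct(shares, q=Q):
    """Recover the secret by summing all shares modulo q."""
    return sum(shares) % q


def add_shared(a, b, q=Q):
    """Add two shared secrets share by share, without reconstructing either."""
    return [(ai + bi) % q for ai, bi in zip(a, b)]


def scale_shared(a, c, q=Q):
    """Multiply a shared secret by a public integer constant c."""
    return [(ai * c) % q for ai in a]
```

For instance, `reconstruct(add_shared(share(5, 3), share(7, 3)))` returns 12 although no party ever holds 5 or 7 in the clear.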
Of note, this additive secret sharing algorithm assumes all numbers to be integers, which is in conflict with the neural network weights and biases, which are usually floating-point numbers. Consequently, an important step before the encryption process is encoding the secret x into an integer value, namely using fixed-point arithmetic (Catrina and de Hoogh, 2010; Costache et al., 2017). Subsequently, a conversion from fixed-point to the original floating-point precision happens before the decryption process. Depending on the chosen precision, this conversion could be either lossy or lossless. For instance, the fractional value of 2.9874 will be represented by 2987 when a precision of 3 is selected. In our implementation, we observed that a precision > 13 results in almost lossless computations for cancer image analysis when using 32-bit memory for storing the image values.
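The fixed-point encoding can be sketched as follows, assuming a decimal base and truncation toward zero, which reproduces the 2.9874 → 2987 example above. The function names, the modulus, and the convention of mapping negative values to the upper half of the field are assumptions of this sketch rather than details taken from the implementation.

```python
Q = 2**61 - 1  # assumed modulus for illustration only


def encode(value, precision=3, q=Q):
    """Map a float to an integer modulo q, keeping `precision` decimal digits."""
    scaled = int(value * 10**precision)  # int() truncates toward zero
    # Negative numbers wrap around and land in the upper half of [0, q).
    return scaled % q


def decode(encoded, precision=3, q=Q):
    """Map an encoded integer back to a float."""
    # Values above q // 2 are interpreted as negative numbers.
    signed = encoded - q if encoded > q // 2 else encoded
    return signed / 10**precision
```

With a precision of 3, `encode(2.9874)` yields 2987 and decoding returns 2.987, so the last digit is lost; a higher precision reduces this loss at the cost of larger integers.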

Evaluation metrics and statistical analysis

MRI data
The Dice similarity score was employed as a measure of segmentation performance for MRI data. Statistical spread was determined for 125 points. All mean values were accompanied by a standard deviation. For determining statistical significance, a two-tailed paired t-test or a Wilcoxon signed-rank test was employed, depending on normality, which was tested using the Shapiro-Wilk test (Shapiro and Wilk, 1965). A P-value ≤ 0.05 was considered significant.

Histopathological data
Area under the receiver operating characteristic curve (AUROC) was employed as the main classification evaluation metric. Bootstrapping was utilized with 1000 redraws for each measure to determine the statistical significance and spread (Konietschke and Pauly, 2014). All the mean values were accompanied by a standard deviation. A P-value ≤ 0.05 was considered significant.
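As an illustration of how the spread of the AUROC could be estimated by bootstrapping, a minimal sketch is given below. It covers only the mean and standard deviation, not the significance testing; the function names, the pairwise AUROC computation, the fixed seed, and the choice to redraw samples that contain a single class are assumptions of this sketch.

```python
import random
import statistics


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_redraws=1000, seed=0):
    """Return mean and standard deviation of the AUROC over bootstrap redraws."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_redraws:
        # Resample cases with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if len(set(sample_labels)) < 2:
            continue  # a single-class sample has no defined AUROC; redraw
        values.append(auroc(sample_labels, [scores[i] for i in idx]))
    return statistics.mean(values), statistics.stdev(values)
```

The pairwise computation is quadratic in the number of cases, which is acceptable for a sketch; a rank-based formulation would be preferable for large cohorts.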

Code availability
Our source code for secure federated learning using homomorphic encryption is publicly available at https://github.com/tayebiarasteh/federated_HE. All source code for training and evaluation of the deep neural networks, MR image analysis and preprocessing, 3D data augmentation, and gradient inversion attack is available at https://github.com/tayebiarasteh/federated_HE. All source code for the histological image analysis is available at https://github.com/KatherLab/HIA and all source code for histological image preprocessing is available at https://github.com/KatherLab/preProcessing. All code for the experiments was developed in Python v3.8 using the PyTorch v1.4 framework. The secure federated learning process including homomorphic encryption was developed using PySyft (Ziller et al., 2021) v0.2.9.

Fig. 1. Schematic of FL and SHEFL and associated information extraction attacks. (A) In FL, each site trains on their own data and weight updates are transmitted to the central server for aggregation. (B) In SHEFL, the weight updates are encrypted and the server only has access to the encrypted values. While FL allows the server to extract patient sensitive information by reconstructing the images from the weights through gradient inversion attacks and eavesdropping on the weight updates (C), this information remains protected in SHEFL and images cannot be reconstructed (D). Experiments were performed on 2D slices including native T1-weighted sequences in the top row, post-contrast T1-weighted sequences in the second row, T2-weighted sequences in the third row and fluid attenuated inversion recovery sequences in the bottom row.

Fig. 2. Schematic of the deep learning workflow. (A) Histology images are first tessellated. Features are then extracted by a feature extractor network (fixed) and a multi-layer perceptron is trained to predict MSI status. (B) The MRI examination is normalized and rescaled to a standard resolution of 128 × 128 × 128. All four three-dimensional sequences are then fed into a 3D U-Net architecture that is trained to predict tumor segmentation outlines.

Fig. 3. Results of training on local data only vs. training using FL and SHEFL. Training neural networks on single-site datasets results in inferior performance as compared to FL and SHEFL. A neural network was trained to detect MSI on data from the Epi700, the DACHS and the TCGA cohorts, respectively, as well as on all three datasets using FL and SHEFL. The resulting networks were then tested on the QUASAR (A) and the YCR-BCIP (B) cohorts, demonstrating superior performance of FL and SHEFL. Similarly, tumor segmentation in MRI data was trained on data from five different sites as well as on all data using FL and SHEFL. The resulting neural networks were then tested on an independent held-out test set and demonstrated improved performance (C). Computational overhead was almost negligible (red: overhead for FL, yellow: additional overhead for encryption) as compared to training time needed for backpropagation (blue) (D).