Protocol for training MERGE: A federated multi-input neural network for COVID-19 prognosis

Summary Federated learning is a cooperative learning approach that has emerged as an effective way to address privacy concerns. Here, we present a protocol for training MERGE: a federated multi-input neural network (NN) for COVID-19 prognosis. We describe steps for collecting and preprocessing datasets. We then detail the process of training a multi-input NN. This protocol can be adapted for use with datasets containing both image- and table-based input sources. For complete details on the use and execution of this protocol, please refer to Casella et al.1


Highlights
Train a multi-input neural network exploiting different types of input data
Steps to train in both a typical machine learning setting and a federated scenario
The combined use of different types of data is beneficial to the federated model

This protocol shows how to train a federated multi-input neural network with OpenFL
Installing the OpenFL framework

Timing: 15 min

OpenFL is a framework-agnostic Python library for FL that enables organizations to collaboratively train a model without sharing sensitive information. OpenFL is a community-supported project, originally developed by Intel Labs and the Intel Internet of Things Group. The steps required to install OpenFL are listed below.
1. Open a new terminal window and create a new Virtualenv environment for the project. The recommended version of Python is 3.8 (>= 3.6, < 3.9).
2. Activate the virtual environment.
3. Install the OpenFL package from source.
a. Clone the OpenFL repository.
b. Install the build tools before installing OpenFL.
If everything was done correctly, running the fx command in the virtual environment will confirm that OpenFL is installed.
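The interpreter constraint from step 1 and the presence of the fx command can also be checked from Python. The sketch below is an illustrative helper and not part of the original protocol: the function names `python_version_supported` and `fx_available` are hypothetical, and the second check only tests whether an executable named fx is on the PATH of the active environment.

```python
import shutil
import sys


def python_version_supported(version_info=None):
    """Return True if the version lies in the protocol's range (>= 3.6, < 3.9)."""
    major, minor = (version_info or sys.version_info)[:2]
    return (3, 6) <= (major, minor) < (3, 9)


def fx_available():
    """Return True if an executable named 'fx' is found on the PATH."""
    return shutil.which("fx") is not None


print("Python version in supported range:", python_version_supported())
print("fx command found:", fx_available())
```

Running this inside the activated virtual environment gives a quick indication of whether the installation steps above succeeded; it does not replace running fx itself.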
4. Request credentials for accessing the data by clicking the "Request Credentials" button at the bottom left of the webpage.
a. Insert your name, surname, institutional affiliation, and email, and accept the privacy conditions, the data user policy, and the Centro Diagnostico Italiano citation.
Note: Your credentials will be sent by email in a few minutes.
5. Click on the "Access the data" button at https://aiforcovid.radiomica.it/ and insert the credentials you received.
6. After logging in, download the data by clicking on the "Download all the data" button at the top right of the webpage (Figures 2 and 3).
At the end of the download process, you will have two Excel files containing the clinical parameters (e.g., age, sex, previous disease, ...) and a zip file containing the chest X-ray scans. This dataset consists of data gathered across six different Italian hospitals during the first outbreak of COVID-19, for a total of 1589 patients, divided into 1103 for the training set and 486 for the test set. For each subject, the dataset provides sixteen clinical parameters (.xls format) and a chest X-ray image (.JPEG format). The data preprocessing stages are described below.
7. Install the required packages ("requirements.yml" is provided in the GitHub repository of MERGE, see key resources table) in your virtualenv. Then, create a new Python script "preprocessing.py" and import the required libraries.
8. Put the datasets in the same folder as the Python script and import them.
9. Replace the blank values with NaN. This step is necessary to make the dataset compatible with the next operations.
10. Remove all the columns containing more than 500 missing values.
11. Repeat steps 8-10 for the test set.
12. In the test set, some columns contain only NaN values because they were hidden for a data competition held on this dataset. These clinical parameters, i.e., the oxygen percentage, cardiovascular disease, ischemic heart disease, atrial fibrillation, heart failure, ictus, and position, were removed because they are highly predictive features. The same features are present in the train set, but they will not be considered for the NN training.
13. Save the resulting datasets.
14. Open a terminal tab and navigate to the folder containing the Python script. Execute the script.
15. Merge the train and test clinical parameters in a single file named "trainANDtest.xls". The train data encompass patients coming from six different hospitals, labeled from A to F, while the test data are collected from only one of those hospitals, hospital F. In this way, the test set will contain patients coming from all six hospitals. This step is fundamental for two reasons:
a. In a centralized setting, it avoids overfitting the data coming from hospital F.
b. In a federated setting, it respects the main FL principle: data never leave the local institution. Indeed, each client will train and test the model on data coming from a single hospital.
24. Put all the resized images from both the train and test set into a new folder "DATASET".
The preprocessing steps are finished. At the end of this process, you should have a directory containing the "trainANDtest.xls" file and the "DATASET" folder containing the related chest X-ray scans.
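The cleaning and merging stages described above (blank values to NaN, dropping columns with too many missing values, and concatenating train and test tables) can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the authors' "preprocessing.py": the function name `clean_clinical_table`, the column names, and the toy values are hypothetical, and the tables are built in memory instead of being read from the downloaded Excel files.

```python
import numpy as np
import pandas as pd


def clean_clinical_table(df, max_missing=500):
    """Replace blank cells with NaN and drop columns with too many missing values."""
    # Empty or whitespace-only strings become NaN.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    # Keep only the columns whose NaN count does not exceed the threshold.
    missing_per_column = cleaned.isna().sum()
    kept = [col for col in cleaned.columns if missing_per_column[col] <= max_missing]
    return cleaned[kept]


# Toy stand-ins for the train and test spreadsheets (hypothetical values).
train = pd.DataFrame({
    "Hospital": ["A", "B", "C", "D"],
    "Age": [71, 64, 58, 80],
    "Sex": ["M", "", "F", " "],
    "Position": ["", " ", "", ""],
})
test = pd.DataFrame({"Hospital": ["F", "F"], "Age": [55, 67], "Sex": ["F", "M"]})

# A tiny threshold is used here only because the toy table has four rows;
# the protocol uses 500 on the real data.
train_clean = clean_clinical_table(train, max_missing=2)
merged = pd.concat([train_clean, test], ignore_index=True)
print(merged)
```

On the real files, the tables would be loaded with `pd.read_excel` before cleaning; writing the merged table back to disk is left out of the sketch because the available Excel writers depend on the installed pandas version.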

KEY RESOURCES TABLE

STEP-BY-STEP METHOD DETAILS
Herein, we describe step-by-step methods for training MERGE, a federated multi-input NN for COVID-19 prognosis. Before showing the steps for training a federated model, we describe the stages for a centralized scenario in which the data are collected in a single data lake. To illustrate these steps, we use, as an example, the training of a multi-input NN in both a centralized and a federated version.
Note: The ImageDataset class returns the image, the tabular data, and the label associated with a patient. To use only images or only tabular data, remove the other one from the return statement.
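The behavior described in the Note can be illustrated with a framework-free sketch of a map-style dataset. This is not the authors' ImageDataset: the class name `MultiInputDataset`, the in-memory arrays, and the `mode` argument are assumptions made for illustration. In a PyTorch pipeline, such a class would normally subclass `torch.utils.data.Dataset` and load the images from disk.

```python
import numpy as np


class MultiInputDataset:
    """Map-style dataset yielding (image, tabular, label) for one patient."""

    def __init__(self, images, tabular, labels, mode="both"):
        if not (len(images) == len(tabular) == len(labels)):
            raise ValueError("images, tabular and labels must have equal length")
        if mode not in ("both", "image", "tabular"):
            raise ValueError("mode must be 'both', 'image' or 'tabular'")
        self.images, self.tabular, self.labels = images, tabular, labels
        self.mode = mode

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image, row, label = self.images[idx], self.tabular[idx], self.labels[idx]
        # Single-input variants drop one source from the returned tuple.
        if self.mode == "image":
            return image, label
        if self.mode == "tabular":
            return row, label
        return image, row, label


# Toy data: three 8x8 "images" and four clinical features per patient.
images = np.zeros((3, 8, 8), dtype=np.float32)
tabular = np.arange(12, dtype=np.float32).reshape(3, 4)
labels = np.array([0, 1, 0])

both = MultiInputDataset(images, tabular, labels)
image, row, label = both[1]
print(image.shape, row, label)
```

Switching `mode` to "image" or "tabular" mirrors the effect of removing one element from the return statement, which is how the protocol obtains the single-input baselines.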

Centralized training
>class ImageDataset(Dataset):
>    """Tabular and Image dataset."""
>    def __init__(self, indices, image_dir, transform=None):
6. Define data augmentation stages and split the data into train and test sets. The train set will later be divided into train and validation sets.
Note: this step is not mandatory, but it is helpful for tracking metrics. Metrics tracking is also possible thanks to the training function, which reports all the metrics for every epoch. Moreover, at the end of training, two plots tracking losses and accuracies for each train/validation/test split will be generated.
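The train/validation/test split mentioned in step 6 can be sketched as a reproducible shuffle of patient indices. The helper name `split_indices` and the fractions (20% test, 10% of the remainder for validation) are assumptions chosen for illustration; the source does not state the proportions used by the authors.

```python
import numpy as np


def split_indices(n_samples, test_fraction=0.2, val_fraction=0.1, seed=0):
    """Shuffle indices reproducibly and split them into train, validation, test."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # First carve out the test set.
    n_test = int(round(n_samples * test_fraction))
    test_idx, train_val_idx = shuffled[:n_test], shuffled[n_test:]
    # Then divide the remaining samples into validation and train.
    n_val = int(round(len(train_val_idx) * val_fraction))
    val_idx, train_idx = train_val_idx[:n_val], train_val_idx[n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = split_indices(100, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index arrays can be passed to a dataset class such as ImageDataset through its `indices` argument, so that each split sees a disjoint set of patients.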

Timing: 12 h
As in the previous scenario, the time required for federated training depends heavily on GPU availability.
The following steps describe how to train MERGE, a model for multi-input biomedical federated learning. The federation will encompass six Collaborators, i.e., clients (hospitals) holding local data, and one Aggregator, i.e., the server that aggregates the models. Each Collaborator will train a local model on data coming from only one of the six hospitals (A to F).
18. In the same directory containing the data, create three folders named "director", "envoy", and "workspace", respectively.
19. Navigate to the "director" folder. The Director is responsible for the creation and management of an Aggregator. Create a human-readable data serialization file ("director_config.yaml") and a bash script ("start_director.sh").
a. "director_config.yaml" is a configuration file describing the listening host (localhost), the listening port, and the sample and target shapes of the Director.
iv. The CovidShardDescriptor Class holds the following properties.
v. The CovidShardDescriptor Class implements the "download_data" function, which is responsible for splitting and returning the data in the correct format. This function calls the ImageDataset Class, which is the same as previously described for the centralized scenario (bullet point 5).
21. The "envoy_configX.yaml" is a configuration file describing the specifics of the Envoy. Assign to X the value of each hospital (1-6).
22. "start_envoyX.sh" will help us run the Envoys. For this example, we are considering a simulated federation scenario. If you want a real federation, replace "localhost" with the Fully Qualified Domain Name (FQDN) of your devices.
a. To find your FQDN, run:
23. Create and connect to the federation by running the bash scripts just created. Start with the director script and, only once it is active, connect the envoys.
24. Navigate to the "workspace" directory and create a new Python script "federated_multi_input.py" in the same directory containing the datasets. Import the required libraries. In the following snippet of code, the variable "myseed" is set for reproducibility purposes. MERGE 1 ran this protocol five times, changing this variable with values from 0 to 4 and averaging the results.
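The repetition scheme described for "myseed" (five runs with seeds 0 to 4, then averaging) can be sketched as follows. The function `run_experiment` is a hypothetical stand-in for one complete training run: the number it returns is a random placeholder, not a result from the paper. Only the Python and NumPy generators are seeded here; the protocol's own snippet also seeds a `torch.Generator`.

```python
import random
import statistics

import numpy as np


def set_seed(seed):
    """Seed the Python and NumPy random generators."""
    random.seed(seed)
    np.random.seed(seed)


def run_experiment(seed):
    """Hypothetical stand-in for one training run; returns a placeholder score."""
    set_seed(seed)
    # Placeholder value only: a real run would train the model and
    # return the accuracy measured on the test set.
    return float(np.random.uniform(0.6, 0.8))


seeds = range(5)  # the values 0 to 4 assigned to myseed
scores = [run_experiment(seed) for seed in seeds]
mean_score = statistics.mean(scores)
print(f"mean over {len(scores)} runs: {mean_score:.3f}")
```

Because every run starts from a fixed seed, repeating the loop yields the same five scores, which makes the averaged figure reproducible.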

Connect to the federation.
Optional: Request info about sample and target shapes and double-check that all the clients are connected to the federation.

EXPECTED OUTCOMES
MERGE introduces an FL setting with the advantage of leveraging multiple input sources for solving classification tasks in the bio-medical environment in a privacy-compliant way. The basic assumption for this approach is that each federation participant has both data types, images and tabular data, locally available and accessible (Figure 4). The effectiveness of this protocol has been demonstrated by running several tests based on images combined with tabular data from the COVID-19 chest X-ray dataset. This protocol has also been tested for Alzheimer's disease detection by training on the ADNI study. MERGE has been compared with models trained only on images or only on tabular data. Results show that enabling multi-input architectures in the FL framework improves both accuracy and F1-score with respect to non-federated models while complying with data protection practices. If the steps presented in this protocol are executed successfully, the results will be the same as those of MERGE (Figures 5 and 6).1
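For readers who want to recompute the two reported metrics from their own predictions, a minimal sketch is given below. It assumes a binary task with the positive class encoded as 1 and computes the standard binary F1-score; the source does not state how the F1-score was averaged, and the toy labels are hypothetical.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all.
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Toy labels (hypothetical): 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
accuracy, f1 = accuracy_and_f1(y_true, y_pred)
print(accuracy, f1)
```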

LIMITATIONS
The main objective of MERGE 1 was to demonstrate the feasibility of a horizontal federated multi-input architecture suitable for the bio-medical field. Consequently, optimizing the performance in the non-federated conditions was not targeted, and improvements over the state of the art in this respect could not be demonstrated. However, making a federated architecture available enables the exploitation of multiple sources of unshared data, which allows building on top of current cutting-edge single-institution solutions, overcoming the low data numerosity issue while improving the generalization ability of the overall system and naturally enabling multicentric studies. The proposed approach does not consider the problem of missing views, which also affects clinical data processing. However, we are confident that the openness and flexibility of the proposed approach will foster research in the field, marking a step in data sharing and distributed processing. Finally, a typical limitation of FL experiments is the need for large amounts of memory. This problem is exacerbated when dealing with a simulated federation (i.e., all the clients are spawned on the same device). Indeed, each client owns a copy of the neural network, and multiple copies of the same model can be problematic for a single machine to handle.
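The memory concern in a simulated federation can be made concrete with a back-of-the-envelope estimate. The sketch below counts only the model weights held by each copy; the function name and the parameter count are hypothetical, and gradients, optimizer state, activations, and data batches would add to this lower bound.

```python
def model_copies_memory_gib(n_parameters, n_copies, bytes_per_parameter=4):
    """Lower bound (in GiB) on memory needed to hold n_copies of the weights."""
    return n_copies * n_parameters * bytes_per_parameter / 2**30


# Hypothetical network with 25 million float32 parameters.
n_parameters = 25_000_000
single = model_copies_memory_gib(n_parameters, n_copies=1)
six_clients = model_copies_memory_gib(n_parameters, n_copies=6)
print(f"one copy: {single:.3f} GiB, six clients: {six_clients:.3f} GiB")
```

The estimate scales linearly with the number of clients hosted on the same machine, which is why a simulated federation with six envoys can exhaust memory that a single centralized run would not.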

TROUBLESHOOTING
The most common problem when running a real federation with OpenFL is the creation of the federation (protocol step 22).

Problem 1
The scripts for running the envoys do not contain the right FQDN of the director machine (Figure 7).

Potential solution
Double-check the FQDN of the director device.
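One way to perform this check from Python is sketched below, using only the standard library. The helper name `resolves` is hypothetical. The sketch reports the FQDN that the local machine advertises and tests whether a given host name can be resolved; on an envoy machine, the director's FQDN would be passed in place of "localhost".

```python
import socket


def resolves(hostname):
    """Return True if the host name resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


# FQDN reported by the operating system for this machine.
local_fqdn = socket.getfqdn()
print("FQDN reported for this machine:", local_fqdn)
print("localhost resolves:", resolves("localhost"))
```

If the name written in the envoy scripts does not resolve from the envoy machine, the envoy cannot reach the director regardless of the OpenFL configuration.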

Problem 2
The envoy script has been executed before the director was alive (Figure 8).
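One possible way to avoid this situation, which is not taken from the original protocol, is to wait until the director's port accepts TCP connections before launching the envoys. The sketch below uses only the standard library; the helper name `wait_for_port` is hypothetical, and a throwaway local listener stands in for the director. A successful connection only shows that something is listening on the port, not that it is an OpenFL Director.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, poll_interval_s=1.0):
    """Poll until a TCP connection to (host, port) succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with socket.create_connection((host, port), timeout=poll_interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval_s)


# Demo: a temporary local listener plays the role of the director.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
reachable = wait_for_port("127.0.0.1", demo_port, timeout_s=2.0, poll_interval_s=0.1)
listener.close()
print("director port reachable:", reachable)
```

In the simulated federation of this protocol, the call would target "localhost" and port 50051 before each envoy script is started.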

Figure 4. Federated learning with multi-input neural networks
to 1, ..., 6 if you want to use only data coming from one of
>np.random.seed(myseed)
>generator = torch.Generator()
>generator.manual_seed(myseed)
>%matplotlib inline
>dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
>torch.backends.cudnn.deterministic = True
>torch.backends.cudnn.benchmark = False
>globalTEST = pd.read_excel("globalTEST.xls")
>del globalTEST["Row_number"]
>del globalTEST["Unnamed: 0
sending tasks and data to the envoys. The code below starts a Director entity without encrypting the communication network (by disabling Transport Layer Security; for more information about TLS, check the official OpenFL documentation) and according to the parameters of "director_config.yaml". In particular, it will execute a Director service on localhost, on port 50051, and it will accept envoys compliant with the expected sample and target shapes.
20. Navigate to the "envoy" folder. Create one Python script ("covid_shard_descriptor.py") and, for each client of the federation, a YAML file ("envoy_configX.yaml") and a bash script ("start_envoyX.sh"), where X represents the number associated with that client. Considering that we have six clients, this directory will contain 13 files: one shard descriptor, six configuration files, and six bash scripts.
a. The "covid_shard_descriptor.py" script is responsible for sharding the dataset. In particular, working in synergy with the YAML configuration files, it will assign the right data to the various clients.
i. First of all, import the required libraries.
ii. Create the CovidShardDataset Class.
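The expected content of the "envoy" folder described in step 20 can be enumerated with a short sketch. The helper name `envoy_folder_files` is hypothetical; the file names follow the pattern given in the text, with X replaced by the client number.

```python
def envoy_folder_files(n_clients=6):
    """List the files expected in the envoy folder for n_clients clients."""
    # One shared shard descriptor for all clients.
    files = ["covid_shard_descriptor.py"]
    # One configuration file and one launch script per client.
    for x in range(1, n_clients + 1):
        files.append(f"envoy_config{x}.yaml")
        files.append(f"start_envoy{x}.sh")
    return files


files = envoy_folder_files()
for name in files:
    print(name)
print("total:", len(files))
```

Comparing this list with the actual directory listing is a quick way to spot a missing configuration file or launch script before starting the federation.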