Learning joint segmentation of tissues and brain lesions from task-specific hetero-modal domain-shifted datasets

Brain tissue segmentation from multimodal MRI is a key building block of many neuroimaging analysis pipelines. Established tissue segmentation approaches have, however, not been developed to cope with large anatomical changes resulting from pathology, such as white matter lesions or tumours, and often fail in these cases. In the meantime, with the advent of deep neural networks (DNNs), segmentation of brain lesions has matured significantly. However, few existing approaches allow for the joint segmentation of normal tissue and brain lesions. Developing a DNN for such a joint task is currently hampered by the fact that annotated datasets typically address only one specific task and rely on task-specific imaging protocols including a task-specific set of imaging modalities. In this work, we propose a novel approach to build a joint tissue and lesion segmentation model from aggregated task-specific hetero-modal domain-shifted and partially-annotated datasets. Starting from a variational formulation of the joint problem, we show how the expected risk can be decomposed and optimised empirically. We exploit an upper bound of the risk to deal with heterogeneous imaging modalities across datasets. To deal with potential domain shift, we integrated and tested three conventional techniques based on data augmentation, adversarial learning and pseudo-healthy generation. For each individual task, our joint approach reaches comparable performance to task-specific and fully-supervised models. The proposed framework is assessed on two different types of brain lesions: White matter lesions and gliomas. In the latter case, lacking a joint ground-truth for quantitative assessment purposes, we propose and use a novel clinically-relevant qualitative assessment methodology.


Introduction
Traditional approaches for tissue segmentation used in brain segmentation / parcellation software packages such as FSL , SPM (Ashburner and Friston, 2000) or NiftySeg (Cardoso et al., 2015) are based on subject-specific optimisation. FSL and SPM fit a Gaussian Mixture Model to the MR intensities using either a Markov Random Field (MRF) or tissue prior probability maps as regularisation. Alternatively, multi-atlas methods rely on label propagation and fusion from multiple fully-annotated images, i.e. atlases, to the target image (Cardoso et al., 2015;Iglesias and Sabuncu, 2015). These methods typically require extensive pre-processing, e.g. skull stripping, correction of bias field and registration. They are also often time-consuming and are inherently only adapted for brains devoid of large anatomical changes induced by pathology, such as white matter lesions and brain tumours. Indeed, it has been shown that the presence of lesions can significantly distort any registration output (Sdika and Pelletier, 2009). Similarly, lesions introduce a bias in the MRF. This leads to a performance degradation in the presence of lesions for brain volume measurement (Battaglini et al., 2012) and any subsequent analysis.
While quantitative analysis is expected to play a key role in improving the diagnosis and follow-up evaluations of patients with brain lesions, current tools mostly focus on quantification of the lesions themselves and effectively discard contextual tissue information. Existing quantitative neuroimaging approaches allow the extraction of imaging biomarkers such as the largest diameter, volume, and count of the lesions. Such automatic segmentation of the lesions promises to speed up and improve the clinical decision-making process but more refined analysis would be feasible from tissue classification and region parcellation. In particular, brain atrophy at a global level (Popescu et al., 2013;Giorgio and De Stefano, 2013), at a cerebral level (Bermel and Bakshi, 2006), and, even more specifically, at the grey matter level (Geurts et al., 2012) have been correlated with the speed of disease progression and with physical disability (Fisniku et al., 2008). Consequently, atrophied tissue volumes in the presence of lesions are clinically relevant imaging markers (Dwyer et al., 2018). We believe that, although very few work have addressed this problem yet, a joint model for lesion and tissue segmentation is expected to bring significant clinical benefits. As representative exemplars of the technical challenges involved to build such joint models, we focus, in this work, on patients with white matter lesions or brain tumours.
Deep Neural Networks (DNNs) have become the state-of-the-art for most segmentation tasks (Simpson et al., 2019) and one would now expect these to be able to jointly perform brain tissue and pathology segmentation. However, annotated databases required to train DNNs are usually dedicated to a single task (either brain tissue segmentation or pathology delineation). In addition, the information required for brain tissue or pathology segmentation may come from different scans, leading to hetero-modal (i.e. more than one set of input imaging sequences) datasets. While T 1 -weighted images provide the best grey/white matter model trained on a smaller fully-annotated dataset for white matter lesions. To assess the performance of the joint model for tissue and glioma segmentation, for which no ground-truth is available, we propose a new qualitative evaluation protocol based on the ASPECTS score (Barber et al., 2000). Higher accuracy is obtained compared to time-consuming pipelines that require to mask the lesions using manual annotations.

6.
Experiments show that generating pseudo-healthy annotated scans outperforms the other DA techniques, even with very few pseudo-healthy annotated scans.
This work is a substantial extension of our conference paper (Dorent et al., 2019b). Improvements include: 1) Additional mathematical proofs; 2) integration and validation of three different domain adaptation techniques to cope with domain-shifted datasets; 3) new experiments on joint brain tissue and glioma segmentation; and 4) a new quantitative evaluation protocol for assessing tissue segmentation in the absence of ground-truth.

Related work
Multi-Task Learning (MTL) aims to perform several tasks simultaneously, on a single dataset, by extracting some form of common knowledge or representation and introducing a task-specific back-end. When relying on DNN for MTL, the first layers of the network are typically shared, while the last layers are trained for the different tasks (Ruder, 2017). MTL has been successfully applied to medical imaging for segmentation (Bragman et al., 2018;Moeskops et al., 2016) combined with other tasks such as detection (Saha et al., 2019) or classification (Chen et al., 2019;Le et al., 2019). The global loss function is a weighted sum of task-specific loss functions. Recently, Kendall and Gal (2017) proposed a Bayesian parameter-free method to estimate the MTL loss weights and Bragman et al. (2018) extended it to spatially adaptive task weighting and applied it to medical imaging. Although the aforementioned approaches generate different outputs from the same features, no direct interaction between the task-specific outputs is modelled in these techniques. While a joint tissue and lesion segmentation can be pursued in practice, a strong underpinning assumption is that the two outputs are conditionally independent. Consequently, these approaches do not address the problem of aggregating these outputs to generate a joint segmentation. Moreover, MTL approaches, such as (Moeskops et al., 2018;Roulet et al., 2019), do not provide any mechanism for dealing with heteromodal datasets or changes in imaging characteristics across task-specific databases.
Domain Adaptation (DA) is a solution for dealing with domain-shifted datasets, i.e. datasets acquired with different settings. A classical strategy consists in learning a domain-invariant feature representation of the data. Csurka (2017) proposed an extensive review of these methods in deep learning. Some DA approaches have been developed to tackle a specific and identified shift. For example, data augmentation has been used for shifts caused by different MR bias fields (Sudre et al., 2017) or the presence of motion artefacts (Shaw et al., 2019); Havaei et al. (2016) and Dorent et al. (2019a) proposed network architectures for dealing with missing modalities that encodes each modality into a shared modality-agnostic latent space. Recent studies have proposed to learn a mapping between healthy and decease scans, using Cy-cleGANs (Xia et al., 2019;Sun et al., 2018) or Variational Autoencoders (Chen and Konukoglu, 2018). Although these techniques have shown promising results, they are inherently limited to a specific type of shift. Combining causes of shift, for instance the presence/absence of lesions with different protocols of acquisition, remains an unsolved problem. In contrast, general DA approaches do not make assumptions about the nature of the shift. These methods aim to directly minimise the discrepancy between the feature distributions across the domains. Distribution dissimilarity can be assessed using correlation distances (Sun et al., 2016) or maximum mean discrepancy (Pan et al., 2011;Long et al., 2015). However, more recent techniques are mostly focused on adversarial methods, achieving promising results in medical imaging (Kamnitsas et al., 2017;Dou et al., 2018;Orbes-Arteaga et al., 2019). However, these methods are usually focus on solving a single task across domain.
Weakly-supervised Learning (WSL) deals with missing, inaccurate, or inexact annotations. Our problem is a particular case of learning with missing labels since each task-specific dataset provides a set of labels where the two sets are disjoint. Li and Hoiem (2017) proposed a method to learn a new task from a model trained on another task. This method combines DA through transfer learning and MTL. In the end, two models are created: One for the first task and one for the second one. Kim et al. (2018) extended this approach by using a knowledge distillation loss in order to create a unique multi-task model. This aims to alternatively learn one task without forgetting the other one. The WSL problem was thus decomposed into an MTL problem with aforementioned limitations for our specific use case.
This work proposes a new framework to perform a joint segmentation while dealing with task-specific, domain-shifted and hetero-modal datasets.

Tissue and lesion segmentation learning from hetero-modal and taskspecific datasets: Problem definition
In order to develop a joint model, we first propose a mathematical variational formulation of the problem and introduce a network architecture to leverage existing hetero-modal and task-specific datasets for tissue and lesion segmentation.

Formal problem statement
Let x = (x 1 ,., x M ) ϵ X = ℝ N×M be a vectorized multimodal image and y ϵ Y = {0,., C} N its associated tissue and lesion segmentation map. Ν, M and C are respectively the number of voxels, modalities and classes. Note that images modalities are assumed to be co-registered and resampled in the same coordinate space containing N voxels. Our goal is to determine a predictive function, parametrised by the weights θ ϵ Θ, h θ : X ↦ Y that minimises the discrepancy between the ground truth label vector y and the prediction h θ (x). Let L be a loss function that estimates this discrepancy. Following the formalism used by Bottou et al. (2018), given a probability distribution D over (X, Y) and random variables under this distribution, we want to find θ* such that: As is the norm in data-driven learning, we do not have access to the true joint probability D.
In supervised learning, the common method is to estimate the expected risk using training samples. Given a set of n ε ℕ independently drawn multimodal scans with their associated tissue and lesion segmentation map x k , y k k = 1 n , we want to find θ* that minimises the empirical risk: θ* = argmin θ ∑ k=1 n L ℎ θ x k , y k However, in our multitask scenario, we cannot directly estimate the empirical risk since we do not have access to a fully annotated dataset for the joint task. Instead, we propose to leverage task-specific and hetero-modal datasets.

Task-specific and hetero-modal datasets
Let us assume that we have access to two datasets with either the tissue annotations y T or the lesion annotations y L (task-specificity). Let S control = x k 1 , . , x k M T , y k T k = 1 n T S lesion = x k 1 , ..., x k M L , y k L k = 1 n L denote these two training sets, where M T , M L , n T and n L are respectively the number of modalities in the control and the lesion sets and the size of these sets. Note that although we use the term control for convenience, we may expect to observe pathology with "diffuse" anatomical impact, e.g. from dementia. In addition, for the clarity of presentation, we highlight that the considered lesions in this work are either White Matter Hyperintensities (WMH) or gliomas.
Since such datasets are typically developed in the scope of either tissue or lesion segmentation (but not both), the set of observed modalities may vary from one dataset to another (hetero-modality). Importantly, in this work, we consider that only T 1 scans are provided in the control dataset, while the lesion set contains either 1) the T 1 and the FLAIR scans for WMH segmentation, or 2) the T 1 , contrast-enhanced T 1 (T 1 c), T 2 and FLAIR scans for glioma segmentation. The full set of modalities is consequently given by the modalities in the lesion set, while the control dataset will have missing modalities. In our specific use cases, the T 1 modality is a shared modality across the different datasets. It will nonetheless be apparent that our method can be trivially adapted for other shared modalities.

On the distribution D in the context of heterogeneous databases
As we expect different distributions across heterogeneous databases, two probability distributions of (X Y) over (X, Y) can be distinguished: • under D control , (X Y) corresponds to a multimodal scan and joint segmentation map of a patient without lesions (Y effectively being a tissue segmentation map). • under D lesion , (X Y) corresponds to a multimodal scan and joint lesion and tissue segmentation map of a patient with lesions.
Since traditional tissue segmentation methods are not adapted in the presence of lesions, the most important and challenging distribution D to address is the one for patients with lesions, D lesion . In the remainder of this work, we thus assume that:

Hetero-modal network architecture
In order to learn from hetero-modal datasets, we need a network architecture that allows for missing modalities. Specifically, the input modalities are either a T 1 scan or a full set of modalities. To deal with missing modalities, arithmetic operations are employed, as originally proposed in HeMIS (Havaei et al., 2016). The network architecture is based on a U-Net (Çiçek et al., 2016b), as shown in Fig. 2. Note that, while the proposed method requires a hetero-modal network, any specific architecture can be used. The proposed network is composed of two input branches, one for the T 1 scan and one for the full set of modalities. Although HeMIS originally proposed to encode each modality independently, i.e one branch per modality, we experimentally found higher performance with these two branches. In the presence of the full set of modalities, features extracted from the T 1 scan and all the modalities are averaged. Consequently, the network allows for missing modalities, i.e. is hetero-modal. This hetero-modal network with weights θ is used to capture the predictive function h θ that can accept either T 1 or the full set of modalities as input.

Optimising tissue and lesion segmentation as a joint task
Given the mathematical formulation of the problem and the hetero-modal network architecture, we propose a method to empirically optimise the joint problem of tissue and lesion segmentation.

Loss decomposition
Let C T , C L and 0 be respectively the set of tissue classes and lesion classes and the value of the background class in the segmentation masks. Since C T and C L are disjoint, the segmentation map y can be decomposed into two segmentation maps y =y L +y T with y T ∈ C T ∪ 0 , y L ∈ C L ∪ 0 .
Let us assume that the loss function L can also be decomposed into a tissue loss function L T and a lesion loss function L L . This is common for multi-class segmentation loss functions in particular for those with one-versus-all strategies (e.g. Dice loss, Jaccard loss, Generalized Cross-Entropy). Then, the joint and multi-class segmentation problem can be formulated as a multi-task problem: L ℎ θ x , y = L T ℎ θ x , y T + L L ℎ θ x , y L

(H 2 )
In combination with (H 1 ), (1) can be rewrittern as: While the second expected risk R L can be estimated using the full set of modalities and the lesion annotations provided in the lesion dataset, the first expected risk R T appears to be intractable due to the missing tissue annotations in the lesion dataset. In the next sections, we first propose an upper bound of the expected tissue risk R T and then a means to estimate this upper bound using the control dataset.

Upper bound of the expected tissue risk R T
Although, thanks to its hetero-modal architecture, h θ may handle inputs with varying number of modalities, the current decomposition (3) assumes that all the modalities of x are available for evaluating the loss. In our scenario, the control set of scans with tissue annotations only contains the T 1 scans. Consequently, as we do not have all the modalities with tissue annotations, and as naively evaluating a loss with missing modalities would lead to a bias, estimating R T is not straightforward.
Let us assume that the tissue loss function L T satisfies the triangle inequality: Although not all losses satisfy (H 3 ), it is known that the binary Jaccard is a distance (Späth, 1981;Kosub, 2018) and thus satisfies the triangle inequality.

Definition 4.1. (Binary Jaccard distance)
The binary Jaccard distance J bin is defined such that: However, network outputs are typically pseudo-probabilities, and the soft version of (4) does not satisfy the triangle inequality. To satisfy (H 3 ), we extend the binary Jaccard distance to a multi-class probabilistic formulation that coincides with the binary Jaccard for binary inputs but preserves the metric property for probabilistic inputs.

Definition 4.2. (Probabilistic multi-class Jaccard distance)
Let C be the number of classes in C, N be the number of voxels and P ⊂ 0, 1 C × N denote the set of probability vector maps such that for any p = p c, i c ∈ C, i ∈ 0; N ∈ P :

Europe PMC Funders Author Manuscripts
The probabilistic multi-class Jaccard distance is defined for any (u, v) ∈ P 2 as: where ω c are the class weights such that with ∑ c ∈ C ω C = 1 As shown in A.1, the binary and probabilistic Jaccard distance coincide on the set of binary vectors {0, 1} N . Furthermore, (5) satisfies (H 3 ).
Lemma 4.1. The probabilistic multi-class Jaccard distance is a distance and thus satisfies the triangle inequality.
Proof. The proof, detailed in A.2, follows from the Steinhaus transform (Späth, 1981) applied to the metric space ([0, 1] N , d1) where d1 is the distance induced by the L1 norm. D Under (H 3 ), L T satisfies the following inequality: where x T 1 denotes the T 1 scan associated to x. Consequently, we find an upper bound of the expected tissue risk: Minimising ℜ T 1 T enforces the network to generate accurate tissue segmentation using only T 1 as input. Minimising ℜ T 1 F ull T encourages consistency between the outputs when given only T 1 or the full set of modalities as input. This latter term allows for transferring the knowledge learnt on the T 1 scan to the full set of modalities.
An empirical estimation of R T TFull can be obtained by comparing the network outputs using either T 1 or the full set of modalities as input. In contrast, R T T requires comparison of inference done, under D lesion , from T 1 inputs with ground truth tissue maps y T . While this provides a step towards a practical evaluation of R T , we still face the challenge of not having tissue annotations y T under D lesion.

Estimating ℜ T 1 T using the control dataset
To estimate ℜ T 1 T , we assume that the neural network h θ is invariant to a potential domain shift between the T 1 scans of the control and lesion datasets on the non-lesion regions.
Specifically, we assume that the restriction of the feature distributions (rather than the image intensity distributions) over D lesion and D control to the non-lesion parts of the brain (i.e. the voxels i such that y i ∈ C T ) are comparable, i.e.: This means that the neural network h θ generates similar outputs on the non-lesion parts of the brain between the two datasets, leading to: Consequently, under (H 4 ), R T T can be estimated using the T 1 scans and their tissue annotations in the control dataset. Section 5 presents means of ensuring that assumption (H 4 ) is satisfied even in the presence of domain shift in the image intensity distributions.

Summary of the expected risk decomposition
Bringing all the terms together, given (3), (7) and (8), we seek the parameters θ * that optimise the tractable upper bound R s eg of the (intractable) expected risk:

Matching feature distributions across datasets
In this section, we explore different approaches that ensure the feature distributions extracted from the control and lesion T 1 scans are comparable, i.e. we want to satisfy (H 4 ) even in the presence of domain shift.

Similar acquisition protocols for the control and lesion datasets
Let's first assume that the acquisition protocols are similar for the control and lesion datasets, i.e. they are not domain-shifted. Specifically, we assume that the T 1 images have been acquired with similar sequences, spacial resolution and field strength. In this case, similar to Chen and Konukoglu (2018), the restriction of the distributions D lesion and D control to the non-lesion parts of the brain can be assumed to be the same on the T 1 scans, i.e.: In the absence of domain shift, we can reasonably assume that the network produces similar outputs on the non-lesion parts of the brain for the two distributions, i.e. that (H 4 ) is satisfied. No specific additional action thus needs to be implemented.

scans
Let's now consider the presence of a domain shift between the T 1 control and lesion scans due to different acquisition protocols. In this section, we propose to synthesise pseudohealthy scans from domain-shifted T 1 lesion scans in order to extend the control dataset with control scans associated to the protocol of acquisition of the lesion dataset. Since the control and lesion datasets are domain-shifted, existing lesion removal approaches, based either on CycleGANs (Xia et al., 2019;Sun et al., 2018) or Variational Autoencoders (Chen and Konukoglu, 2018), are not adapted as they require training data with no domain shift beyond the presence of absence of pathology.
To tackle the presence of an acquisition-related domain shift, we propose to generate pseudo-healthy scans and their annotations using traditional image computing techniques that are inherently robust to different acquisition protocols. For example, for white matter lesions, lesion filling methods allow for transforming scans with lesions into pseudo-healthy scans (Valverde et al., 2014;Prados et al., 2016). For large and unilateral pathology, we propose to synthesise pseudo-healthy T 1 scans by symmetrising the "healthy" hemisphere of the patients with lesions located in one hemisphere only. The inter-hemispheric symmetry plane is estimated via the technique described in Prima et al. (2002). Finally, the "healthy" hemisphere of those patients is mirrored in order to create a symmetric pseudo-healthy brain. Having generated pseudo-healthy images, traditional methods, designed for control scans, such as the GIF framework (Cardoso et al., 2015), can then be employed to generate the corresponding bronze standard tissue annotations.
With this set of scans S pseudo T 1 , we have access to a pseudo-control dataset acquired with a similar protocol as in the lesion dataset and similar on the non-lesion part of the brain, and thus are in the scenario described in 5.1. Let denote D pseudo T 1 the distribution of those scans.
The expected tissue risk ℜ T 1 T is then equal to the expect tissue risk under D pseudo T 1 : To take advantage of the manual annotations in the control dataset, we resort to averaging the two formulations (8,11): Consequently, the expected tissue risk R T T can be estimated using the control and pseudocontrol T 1 scans.

Alternative unsupervised DA techniques
In order to satisfy (H 4 ), the feature representation of the non-lesion parts of the brain has to be invariant to the changes induced by the different protocols. A direct way to align the feature distributions restricted to the non-lesion parts would be to match the representations of pairs of scans acquired with the different settings. However, in our scenario, we do not have access to such pairs of domain-shifted scans.
In contrast, unsupervised DA allows to perform domain adaptation using unpaired and nonannotated domain-shifted scans. Unsupervised DA techniques commonly introduce an additional term (R DA ) that encourages the network to be invariant to the domain shift. Then, the total expected risk reads: where λ is a hyper-parameter that allows for balancing the segmentation risk R seg ( (9)) with the DA regularisation R DA .
The definition of the DA term depends on the DA technique. In this work, two common unsupervised DA methods are considered, based either on data augmentation or adversarial learning.

Unsupervised DA via physically-inspired data augmentation-Since
T 1 scans play a key role for structure analysis, we expect high-resolution T 1 scans for datasets developed in the scope of tissue segmentation, such as the control dataset. Conversely, T 1 scans are often less critical for lesion segmentation and T 1 scans may have been acquired with a lower resolution.
Let's assume that, less effort has been done to acquire high-resolution T 1 lesion scans, explaining the differences in acquisition protocols. Specifically, we assume that the domain shift is caused by the presence of T 1 lesion scans with artefacts (e.g. related to the MR bias field or the presence of motion artefacts) and a lower acquisition resolution. We additionally assume that differences of scanner characteristics (manufacturer, field strength) are excluded.
Then, physically-informed augmentation such as random bias field (Sudre et al., 2017) and motion artefacts (Shaw et al., 2019) and spacial smoothing can be employed to generate scans that are similar to the T 1 lesion scans. Let denote Tψ the composition of these transformations parametrised by the parameters ψ ∼ Dψ. For any T 1 control scan x C T 1 , we can thus generate an augmented version T ϕ x C T 1 , i.e. getting access to pairs of domainshifted T 1 scans. This allows to minimise the discrepancy between the feature representations learnt by the neural network across the two domains by enforcing consistency across outputs from paired domain-shifted inputs, i.e.: ℜ DA = E D ψ E D control L T ℎ θ x c T 1 , ℎ θ ∘ T ϕ x c An empirical estimation of the DA regularisation term R DA is obtained by comparing the network outputs using the T 1 control scans and their augmented versions as input.
Consequently, if the domain shift is due to different spatial resolutions and the presence of the aforementioned artefacts, the network can be trained to be invariant to the domain shift, i.e. to satisfy (H 4 ).

5.3.2
Unsupervised DA via adversarial learning-Let's now assume that the domain shift cannot easily be simulated. In this case, we can use adversarial learning. Adversarial approaches for domain adaptation can be seen as a two-player game: A discriminator Dφ, parametrised by the weights φ ∈ Φ, is trained to distinguish the source domain features from the target domain features, while the segmentation network h θ is simultaneously trained to confuse the domain discriminator.
The discriminator aims to predict the probability that extracted features are part of the lesion feature distributions. The discriminator accuracy can thus be seen as a measurement of the discrepancy between the lesion and control feature distributions and used as a DA regularisation term: This DA term can be estimated by using features extracted from T 1 control and lesion scans as input of the discriminator.
The following proposition shows that the discriminator accuracy is a principled measurement of the feature distribution discrepancy: Proposition 1. Let assume that L satisfies the triangle inequality and is bounded. Let us also assume that the family of domain discriminators H Φ = D ϕ ϕ ∈ Φ is rich enough. Then there is a constant K such that: where ε(Θ) is independent of the network parameters θ and corresponds to the accuracy of the best (and unknown) segmenter in the family of functions parametrised in Θ.
Proof. The proof uses (3), (7), is based on Ben-David et al. (2010) and Long et al. (2018) and detailed in Appendix B. D (16) shows that the intractable expected loss is bounded by a weighted sum of the tractable segmentation risk (9) and the accuracy of the best discriminator, up to a constant w.r.t the network parameters.
Moreover, the alternative optimisation strategy can be seen as a way to estimate the best discriminator while minimising the upper bound defined in (16).

Implementation of the joint model optimisation
Given the formulation of the joint model and our proposed computationally tractable decomposition, we present in this section the implementation of our framework.

Stochastic optimisation of the joint model
We use a stochastic gradient descent approach to minimise the expected risk decomposition (9) and to enforce the network to be invariant to a potential domain shift between the datasets. The total loss function reads: where λ is a hyper-parameter that allows for balancing the segmentation loss Lseg (associated to R s eg) with the domain adaptation loss L DA (associated to R DA ). Fig. 3 shows the training procedure without DA. The weights of the segmentation loss are given by the decomposition of the problem. The domain adaptation parameter λ is a hyper-parameter that is experimentally chosen.
At each training iteration, we draw pairs of samples x l , y l L and x C T 1 , y C T from S lesion and S control and compute in each mini-batch the following loss functions and associated gradient.
Note that there is no natural pairing between x l , y l L and x C T 1 , y C T . Our paired sampling procedure thus exploits random pairing.
As presented in Section 5, different scenarios are considered.

Similar acquisition protocols
If the datasets are not domain-shifted, no DA is required (λ = 0 ), and the segmentation loss is: L seg = L L ℎ θ x 1 , y l L + L L ℎ θ x 1 , ℎ θ x l T 1 +L L ℎ θ x c T 1 , y c T We experimentally found that the inter-modality tissue loss ℜ T 1 F ull T has to be skipped for few epochs (50 in our experiments). L seg = L L ℎ θ x 1 , y l L + L L ℎ θ x 1 , ℎ θ x l DA via augmentation If we assume that the differences of protocols can be simulated (random bias field, motion artefacts and spatial smoothing), the domain invariance (14) is learnt by minimising the inter-domain feature discrepancy defined as:

Pseudo-healthy generation Given a pseudo-healthy annotated set of scans
where Tφ corresponds to a composition of theses transformations. The segmentation loss L s eg is the same as in (18).

DA via adversarial learning If adversarial learning is employed, a discriminator D is
trained to discriminate scans from the two domains by maximising the domain classification accuracy. For computational stability, the L1 distance defined in (15) has been replaced by the cross-entropy. Conversely, the segmenter h θ is train to minimise this domain classification accuracy, i.e.: L DA = log D ψ ℎ θ x c T 1 + log 1 − D ψ ℎ θ x l T 1 As in Kamnitsas et al. (2017); Orbes-Arteaga et al. (2019), the DA loss is skipped for few epochs (20 in our experiments) in order to initialise the discriminator. The segmentation loss L s eg is the same as in (18).
Convolutional layers are initialised such as proposed in He et al. (2015). The scaling and shifting parameters in the batch normalisation layers were initialised to 1 and 0 respectively. As suggested by Ulyanov et al. (2016), we used instance normalisation. We used the same discriminator as in Orbes-Arteaga et al. (2019).
We performed a 3-fold cross validation. For each fold, we randomly split the data into 70% for training, 10% for validation and 20% for testing. We used a batch of 2 lesion scans, and 2 control scans. Note that, for the DA approach based on data augmentation, the batch of 2 control scans consists in a pair of non-augmented/augmented control scans. As a data augmentation, a rotation with a random angle in [-10•, 10•] and a random Gaussian noise are employed. The network was trained using Adam optimiser (Kingma and Ba, 2015) the learning rates l R , β1, β 2 were initially respectively set up to 5.10 -4 , 0.9 and 0.999. l R was progressively reduced by a factor of 2 every 10,000 iterations. We employed the training strategy used for the nnU-Net (Isensee et al., 2019): The learning rate is reduced by a factor

Europe PMC Funders Author Manuscripts
Europe PMC Funders Author Manuscripts 2 after 15 epochs without reduction of the exponential moving average of the loss on the validation split.
We used the probabilistic version of the multi-class Jaccard distance (5) as the segmentation loss function. In order to give the same weight to the lesion segmentation and the tissue segmentation, we choose ω such that ∑ c ∈ C tissue ω c = ∑ c′ ∈ C lesion ω c′ = 1 2 7 Experiments and results 7.1 Joint white matter lesion and tissue segmentation 7.1.1 Task and datasets-In this first set of experiments, we focus on the segmentation of white matter lesions and six tissue classes (white matter, grey matter, basal ganglia, ventricles, cerebellum, brainstem), as well as the background. As detailed in Table 1, we used 2 control datasets and 2 lesion datasets: • Lesion data S lesion : The White Matter Hyperintensities (WMH) training database (Kuijf et al., 2019) consists of 60 sets of brain MR images (T 1 and FLAIR, M = 2) with manual annotations of WMHs. The data comes from three different institutes. Note that images modalities are be co-registered and resampled in the FLAIR coordinate space.
• Tissue data S control : Consists of 35 T 1 scans (M' = 1 ) from the OASIS project (Marcus et al., 2007) with annotations of 143 structures of the brain provided by Neuromorphometrics, Inc. (http://Neuromorphometrics.com/) under academic subscription. From the 143 structures, we deduct the 6 tissue classes. In order to have balanced training datasets between the two datasets, to include data acquired at the same field strength (3T) as the lesion data, and similar to Li et al.
• Fully annotated data S fully : MRBrainS18 (http://mrbrains18.isi.uu.nl/) is composed of 30 sets of brain MR images with tissue and lesion manual annotations. 7 scans are publicly available for training and validation. Although the cerebrospinal fluid (CSF) has been annotated in MRBrainS18, it was considered as background to have the same set of tissue classes as in S contro l where the CSF was not labelled. Note that image modalities are be co-registered and resampled in the FLAIR coordinate space.

Similar acquisition protocol for the T 1 scans-Despite the differences in
scanners, all T 1 acquisitions across the datasets followed very similar protocols (MP-RAGE) (see Table 1). Therefore, they were considered as following a similar distribution and the data was only pre-processed as follows. • Skull stripping: All the scans were skull-stripped using ROBEX (Iglesias et al., 2011).
• Resampling: All the scans in S control are resampled into the transversal direction with slices of 3 mm thickness to obtain a similar spacing 1×1×3 mm 3 in the datasets.

• Intensity normalisation:
We used a zero-mean unit-variance normalisation in order to match the intensity distributions.

Description of the compared models-
We considered three different models in our experiments.

• Pipeline model (Pipeline):
This model corresponds to the com bination of two task-specific models: A Tissue segmentation model that only performs tissue segmentation and is trained on the T 1 scans from the dataset with tissue annotations S control .
A Lesion model that only performs lesion segmentation and is trained using the T 1 and FLAIR scans from the dataset with lesion annotations S lesion .
The two models are combined such that the predicted lesion mask has the priority over the predicted tissue mask. Consequently, the background of the Lesion output is replaced by the Tissue output. Each model used the architecture presented in Fig. 2. Consequently, the Pipeline model has twice as many parameters as the other models.
In this set of experiments, the skull-stripped images are first cropped to remove the blank spaces and then padded to size of (144,192,48).

Method for assessing the models-
The performance of the three models was evaluated on the three datasets using Dice Score and 95% Hausdorff distance. On the control data (OASIS1+ADNI2) and WMH, scores were computed on the testing splits, while on MRBrainS18, models were submitted to the challenge MRBrainS18.
For the control data (OASIS1+ADNI2) and MRBrainS18, the full set of annotations allows a direct assessment of the tissue and the lesion segmentation performance. For WMH, only the lesion annotations are provided. In order to assess both the tissue and lesion segmentation on WMH, the lesions are filled as normal-appearing white matter on T 1 images using the method described in Prados et al. (2016) and implemented in NiftySeg (Cardoso et al., Dorent et al. Page 17 Med Image Anal. Author manuscript; available in PMC 2021 March 04.

Europe PMC Funders Author Manuscripts
Europe PMC Funders Author Manuscripts 2015). Then, GIF framework (Cardoso et al., 2015) was performed on the modified T 1 scans to obtain bronze standard tissue annotations. The tissue mask and lesion annotations were then merged by completing the non-lesion parts with the tissue mask. Finally, the model outputs are compared to the merged tissue and lesion masks. In the end, for each model and each dataset, we can assess the performance of tissue and lesion segmentation.
Given that participants to the MRBrainS18 challenge do not have access to the held-out evaluation data set and that the Jac-card score is not provided by the challenge organisers, only the Dice Similarity Coefficient (DSC) and 95th-percentile Hausdorff distance are reported for each class.

Results-
The main results are shown in Table 2 for the Dice Similarly Coefficient and in Table 3 for the 95th-percentile Hausdorff distance.
Firstly, our proposed method (jSTABL) achieves comparable performance to the single-task models on the control data (Tissue) and on WMH (priority of Lesion in Pipeline). This suggests that learning from hetero-modal datasets via our method does not degrade the performance on the tasks characterising the task-specific datasets.
Secondly, jSTABL slightly outperforms Pipeline on segmenting the tissues in WMH for the two sets of metrics. This shows that the tissue knowledge learnt from T 1 scans has been well generalised to multi-modal scans. Although we could have expected that the presence of lesions would create perturbations for the Tissue model, this latter model in fact ignores the lesions and mostly classifies them as white matter. Given that the white matter lesions are usually surrounded by white matter, the Pipeline predictions are consequently not too degraded. However, some artefacts around the lesions in the Pipeline outputs can be observed, in particular in the ventricles for patients with large lesions surrounding them. Fig.  4(a) shows an example for which parts of the ventricles are classified as background. In contrast, we did not observe such artefacts with jSTABL predictions.
Thirdly, jSTABL outperforms the fully-supervised model (Fully-Sup) on the control data (OASIS1+ADNI2) and WMH, while reaching comparable performance on MRBrainS18. This demonstrates the two main advantages of our method. First, without using any fullyannotated data, our model performs as well as a fully-supervised model that could be considered as an upper bound for our method, especially when the testing and training splits are from the same dataset (MRBrainS18). Secondly, our method takes advantage of large task-specific datasets: Unlike the fully-supervised model (Fully-Sup), jSTABL generalises well on unseen data (MRBrainS18). While the fully-supervised model (Fully-Sup) fails to segment scans from OASIS1, ADNI2 and WMH, the jSTABL model obtains relatively good performance on all the datasets we use for tissue and lesion segmentation. In particular, jSTABL outperforms SPM on 6 of the 7 classes. In fact, the only class that is significantly underperformed compared to the fully-supervised model (Fully-Sup) is the brain stem. This is due to observed differences in the annotation protocol across the control and MRBrainS datasets. Fig. 5 shows these differences and the consequences on the prediction.

Glioma and tissue segmentation
7.2.1 Task and datasets-Additionally, we assess our framework on another main types of brain lesions: Gliomas. Our goal is to segment the 6 tissue classes and three tumour classes (whole tumour, core tumour, enhancing tumour). In this case, domain adaptation was required and its evaluation is the focus of this section. We used two sets of data in these experiments.
• Tissue data S control : again we used OASIS1 data and the same 25 T 1 control scans from ADNI2 with tissue annotations as pre sented in section 7.1.1.

• Lesion data S lesion :
We evaluate our method on the training set of BraTS18 (Menze et al., 2015;Bakas et al., 0000) which con tains the scans of 285 patients, 210 with high grade glioma and 75 with low grade glioma. 129 patients have a tumour located in one hemisphere only. Four scans (T 1 , T 1 c, T 2 and FLAIR) have been acquired for each patient and pre-processed by the or ganisers: Co-registration, skull-stripping and re-sampling to an isotropic 1mm resolution. Manual annotations include three tu mour labels: 1) Necrotic core and non-enhancing tumour; 2) oedema; and 3) enhancing core.
The acquisition protocols of the T 1 scans in the two datasets are inconsistent. Specifically, MP-RAGE was used for the tissue data S control , while we observed other protocols such as fast spin echo (SE) for S lesion . Note that the detailled acquisition settings for S lesion are not publicly available.

Description of the compared models-
In order to evaluate our framework with and without the domain adaptation (DA) component, different models are considered, as presented in 5.

• Pipeline model (Pipeline):
This model corresponds to the com bination of two task-specific models: A Tissue segmentation model that only performs tissue segmentation and is trained on the T 1 scans from the dataset with tissue annotations S control .
A Lesion model that only performs lesion segmentation and is trained using the T 1 , T 1 c, T 2 and FLAIR scans from the dataset with lesion annotations S lesion .

• Proposed joint model without DA (jSTABL):
Our joint Segmen tation Tissue And Brain Lesion model is trained using our train ing procedure without domain adaptation, tissue segmentation is learned from the T 1 scans in S control .
Note that the pseudo-healthy scans used for training were generated from the training lesion scans to avoid introducing bias at testing stage.
In this set of experiments, the skull-stripped images are first cropped to remove the blank spaces and then random patches of size (112,112,112) are fed to the network.

7.2.3
Method for assessing the models-While the evaluation of the tumour segmentation on BraTS18 and the tissue segmentation on the control data (OASIS1+ADNI2) is straightforward using the manual annotations, the tissue segmentation performance cannot be assessed on BraTS18 due to the missing tissue annotations. For this reason we propose two methods to assess quantitatively and qualitatively the tissue segmentation on BraTS18.
Quantitative assessment using the symmetrised data Firstly, we propose to use the 129 patients from BraTS18 with a tumour located in one side to generate 129 pseudo-healthy symmetrised data with the bronze standard tissue annotations from GIF as ground truth. Examples are shown in Fig. C.9. By computing the Dice Score Coefficient between the predictions on the symmetrised BraTS18 data and the bronze standard ground truth, we quantitatively evaluate our model on the pseudo-healthy hemisphere of BraTS18 samples.
Qualitative assessment on anatomical landmarks Secondly, in order to assess the accuracy of the models on the tissues surrounding the tumour, we propose a new qualitative protocol. This protocol is based on the Alberta Stroke Program Early CT Score (ASPECTS) which was originally proposed to assess early ischaemic cerebral changes on CT or MRI scans (Barber et al., 2000). The stroke scores are obtained by assessing the integrity of 10 anatomical landmarks as shown on 8. Scores and associated template are commonly used in clinical practice.
The landmarks were chosen because they are easily identifiable, reliable amongst readers and capture a large cerebral coverage. The landmarks include or delineate our tissue classes of interest: Grey matter; white matter; basal ganglia; and ventricles. Instead of evaluating loss of clarity of landmarks due to ischemia we evaluated loss of clarity of landmarks due to incorrect tissue predictions. Unlike ASPECTS, which excludes infratentorial structures which are difficult to evaluate on CT, we added the brainstem and the cerebellum as two additional landmarks for a total of 12 anatomical landmarks. We named our assessment method Anatomy ASPECTS+. For each landmark, 3 scores are possible: 0 = anatomy inaccurate; 0.5 = anatomy mostly accurate; and 1 = anatomy highly accurate. Anatomical landmarks that were infiltrated with substantial tumour were excluded.
For our experiments, we randomly drew 20 patients from the testing sets of BraTS18 and two senior neuro-radiologists independently evaluated the quality of the predictions using Anatomy ASPECTS+ for 4 methods: 1) Les. Ann. + GIF pipeline that uses the tumour annotations and tissue segmentation obtained by GIF while the tumour is masked; 2) Pipeline; 2) jSTABL; 3) jSTABL+5; 4) jSTABL+10. The neuro-radiologists assessed blindly the 4 methods, in a randomised order, for the 20 patients. Table 4 shows the DSC for the 6 tissue classes and the three tumour classes on BraTs18.

Results-
Firstly, jSTABL model outperforms Pipeline on tissue segmentation. We observed that the presence of a large tumour creates major perturbations for the Tissue model. For example, we found samples for which the tumour and the surrounding tissues were partially classified as cerebellum, even though the tumour was far from the cerebellum, as shown in Fig. 6. In contrast, such artefacts were not observed for jSTABL model, demonstrating again advantages of our method compared to a simpler Pipeline approach.
Secondly, while obtaining relatively good performance on most of the tissue classes, jSTABL model fails to segment correctly grey matter and basal ganglial. This highlights the needs for domain adaptation.
Thirdly, learning from pseudo-healthy annotated scans (jSTABL+5 and jSTABL+90) outperforms the other unsupervised DA strategies based either on data augmentation (jSTABL+Augm) and adversarial learning (jSTABL+Adv). This demonstrates the benefits of using a supervised approach for our problem. Moreover, only 5 pseudo-healthy annotated scans are required to obtain an accuracy similar to the one on the control data (see Table 2), i.e. to bridge the domain gap. Fig. 6 shows that learning from pseudo-healthy annotated scans allows the network to be robust to variations in resolution, contrast or glioma grade, even with few samples used for domain adaptation.
Finally, Anatomy ASPECTS+ is employed to provide a quantitative assessment of the segmentation of the tissues surrounding the tumour. Four models are compared by two neuro-radiologists: Pipeline, jSTABL, jSTABL+5 and the time-consuming Les. Ann. + GIF pipeline that requires manual annotations of the lesions. Results are presented in Fig. 7. First, jSTABL is more often "mostly accurate" than the Pipeline (mean score -75% vs 66%). Again, this highlights the strength of our joint model compared to a pipeline approach. Secondly, jSTABL+5 is more often "highly accurate" than the Les. Ann. + GIF pipeline (mean score -46% vs 24%). This shows that our fast and fully automatic method can be considered as a new state-of-the-art for performing joint tissue and lesion segmentation.

Discussion
In this section, we discuss some of the limitations of the different methods.
Firstly, a common modality across the task-specific dataset is required to transfer the knowledge learn between the task-specific sets of modality. Without this common modality, the upper bound is not tractable anymore and our method cannot be applied.
Secondly, our approach relies on a simple hetero-modal architecture that aims to encode modalities in a common shared feature space. Ye t, averaging the feature maps doesn't enforce the network to learn a shared feature representation. To tackle this problem, a hetero-modal variational auto-encoder architecture has been recently introduced Dorent et al. (2019a). Based on a principled formulation of the problem, the induced loss function is the cross-entropy, which does not satisfy the triangle inequality. Consequently, without further research, this approach cannot be directly integrated in the formulation of our problem.
Thirdly, we found that the presence of lesions does not always perturbed a network trained on control data, especially for small lesions. Consequently, our method didn't always show large improvements compared to a simpler Pipeline approach on WMH. However, we observed that Pipeline can be perturbed by larger pathology and thus is less robust.

Conclusion
This work addresses the challenge of learning a joint brain tissue and lesion segmentation with disjoint heterogeneous annotations. Our novel approach is mathematically grounded, conceptually simple, and relies on reasonable assumptions.
The main contribution of this work is to overcome the challenge of the lack of fullyannotated data for joint problems. We demonstrate that a model trained on databases providing either the tissue or the lesion annotations and with different modalities can achieve similar performance to a model trained on a fully-annotated joint dataset. Our work also shows that the knowledge learnt from one modality can be preserved when more modalities are used as input. Finally, domain adaptation for image segmentation can be performed with a small set of data related to the target distribution.
In the future, we will evaluate our approach on new datasets with other lesions. Furthermore, we would like to extend our method to include the full parcellation of the brain (143 structures). Finally, we plan to integrate uncertainty measures in our framework as a future work. As one of the first work to methodologically address the problem of joint learning from hetero-modal and domain-shifted datasets, we believe that our approach will help DNN make further impact in clinical scenarios.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.       Table 4 Evaluation of our framework (jSTABL) on patients with gliomas in comparison to baseline methods. We report means and standard deviations for Dice scores. Metrics were computed on the BraTS 2018 validation dataset. Models w/o DA w/ DA w/ pseudo-healthy gen.