Unified Focal loss: Generalising Dice and cross entropy-based losses to handle class imbalanced medical image segmentation

Automatic segmentation methods are an important advancement in medical image analysis. Machine learning techniques, and deep neural networks in particular, are the state-of-the-art for most medical image segmentation tasks. Issues with class imbalance pose a significant challenge in medical datasets, with lesions often occupying a considerably smaller volume relative to the background. Loss functions used in the training of deep learning algorithms differ in their robustness to class imbalance, with direct consequences for model convergence. The most commonly used loss functions for segmentation are based on either the cross entropy loss, Dice loss or a combination of the two. We propose the Unified Focal loss, a new hierarchical framework that generalises Dice and cross entropy-based losses for handling class imbalance. We evaluate our proposed loss function on five publicly available, class imbalanced medical imaging datasets: CVC-ClinicDB, Digital Retinal Images for Vessel Extraction (DRIVE), Breast Ultrasound 2017 (BUS2017), Brain Tumour Segmentation 2020 (BraTS20) and Kidney Tumour Segmentation 2019 (KiTS19). We compare our loss function performance against six Dice or cross entropy-based loss functions, across 2D binary, 3D binary and 3D multiclass segmentation tasks, demonstrating that our proposed loss function is robust to class imbalance and consistently outperforms the other loss functions. Source code is available at: https://github.com/mlyg/unified-focal-loss.


Introduction
Image segmentation involves partitioning an image into meaningful regions, based on the regional pixel characteristics, from which objects of interest are identified (Pal and Pal, 1993). This is a fundamental task in computer vision and has been applied widely in face recognition, autonomous driving, as well as medical image processing. In particular, automatic segmentation methods are an important advancement in medical image analysis, capable of demarcating structures across a range of imaging modalities including ultrasound (US), computed tomography (CT) and magnetic resonance imaging (MRI).
The most well-known architecture in image segmentation, the U-Net (Ronneberger et al., 2015), is a modification of the convolutional neural network (CNN) architecture into an encoder-decoder network, similar to SegNet (Badrinarayanan et al., 2017), which enables end-to-end feature extraction and pixel classification. Since its inception, many variants based on the U-Net architecture have been proposed (Y. L. Liu et al., 2020; Rundo et al., 2019a), including the 3D U-Net (Cicek et al., 2016), Attention U-Net (Schlemper et al., 2019) and V-Net (Milletari et al., 2016), and the U-Net has also been integrated into conditional Generative Adversarial Networks (Kessler et al., 2020; Armanious et al., 2020).
To train deep neural networks, backpropagation updates model parameters in accordance with the optimisation goal defined by the loss function. The cross entropy loss is typically the most widely used loss function in classification problems (L.  and is applied in the U-Net (Ronneberger et al., 2015), 3D U-Net (Cicek et al., 2016) and SegNet (Badrinarayanan et al., 2017). In contrast, the Attention U-Net (Schlemper et al., 2019) and V-Net (Milletari et al., 2016) leverage the Dice loss, which is based on the most commonly used metric for evaluating segmentation performance, and therefore represents a form of direct loss minimisation. Broadly, loss functions used in image segmentation may be classified into distribution-based losses (such as the cross entropy loss), region-based losses (such as Dice loss), boundary-based losses (such as the boundary loss) (Kervadec et al., 2019), and more recently compound losses. Compound losses combine multiple, independent loss functions, such as the Combo loss, which is the sum of the Dice and cross entropy loss (Taghanaki et al., 2019).
A dominant issue in medical image segmentation is handling class imbalance, which refers to an unequal distribution of foreground and background elements. For example, automatic organ segmentation often involves organ sizes that are an order of magnitude smaller than the scan itself, resulting in a skewed distribution favouring background elements (Roth et al., 2015). This issue is even more prevalent in oncology, where tumour sizes are themselves often significantly smaller than the associated organ of origin. Taghanaki et al. (2019) distinguish between input and output imbalance, the former as aforementioned, and the latter referring to classification biases arising during inference. These include false positives and false negatives, which respectively describe background pixels incorrectly classified as foreground objects, and foreground objects incorrectly classified as background. Both are particularly important in the context of medical image segmentation; in the case of image-guided interventions, false positives may result in a larger radiation field or excessive surgical margins, and conversely false negatives may lead to inadequate radiation delivery or incomplete surgical resection. Therefore, it is important to design a loss function that can be optimised to handle both input and output imbalances.
Despite its significance, careful selection of the loss function is not widespread practice, and often suboptimal loss functions are chosen with performance repercussions. To inform loss function choice, it is important to perform large-scale loss function comparisons. Seven loss functions were compared on the CVC-EndoSceneStill (gastrointestinal polyp segmentation) dataset, with the best performance seen with region-based losses and conversely the worst performance with the cross entropy loss (Sánchez-Peralta et al., 2020). Similarly, a comparison of fifteen loss functions using the NBFS Skull-stripping dataset (brain CT segmentation), which also introduced the log-cosh Dice loss, concluded that the Focal Tversky loss and Tversky loss, both region-based losses, are generally optimal (Jadon, 2020). This is further supported by the most comprehensive loss function comparison to date, with twenty loss functions compared across four datasets (liver, liver tumour, pancreas and multi-organ segmentation), which observed the best performance with compound losses, where the most consistent performance was observed with the DiceTopK and DiceFocal loss (Ma et al., 2021). It is apparent from these studies that region-based or compound losses are associated with consistently better performance than distribution-based losses. Less clear, however, is which of the region-based or compound losses to choose, with no agreement among the aforementioned studies. One major confounding factor is the degree of class imbalance in the datasets, with low class imbalance seen in the NBFS Skull-stripping dataset, moderate class imbalance in the CVC-EndoSceneStill dataset, and a combination of both low and high class imbalanced datasets present in Ma et al. (2021).
Among medical imaging datasets, those involving tumour segmentation are associated with high degrees of class imbalance. Manual tumour delineation is both time-consuming and operator-dependent. Automatic methods of tumour delineation aim to address these issues, and public datasets, such as the Breast Ultrasound 2017 (BUS2017) dataset for breast tumours (Yap et al., 2017), Kidney Tumour Segmentation 2019 (KiTS19) dataset for kidney tumours (Heller et al., 2019) and Brain Tumour Segmentation 2020 (BraTS20) for brain tumours (Menze et al., 2014), have accelerated progress towards this goal. In fact, there have been recent developments for translating the BraTS20 dataset into clinical and scientific practice (Kofler et al., 2020).
Current state-of-the-art models for the BUS2017 dataset incorporate attention gates, which may provide benefits in class imbalanced situations by using contextual information from the gating signal to refine skip connections, highlighting the regions of interest (Abraham and Khan, 2019). In addition to attention gates, the RDAU-NET combines residual units and dilated convolutions to enhance information transfer and increase the receptive field, respectively, and was trained using the Dice loss (Zhuang et al., 2019). The multi-input Attention U-Net combines attention gates with deep supervision, and introduces the Focal Tversky loss, a region-based loss function designed to handle class imbalance (Abraham and Khan, 2019).
For the BraTS20 dataset, a popular approach is to use a multi-scale architecture where different receptive field sizes allow for the independent processing of both local and global contextual information (Kamnitsas et al., 2017;Havaei et al., 2017). Kamnitsas et al. (2017) used a two-phase training process involving initial upsampling of under-represented classes, followed by a second-stage where the output layer is retrained on a more representative sample. Similarly, Havaei et al. (2017) used a sampling rule to impose equal probability of foreground or background pixels at the centre of a patch, and used the cross entropy loss for optimisation.
For the KiTS19 dataset, the current state-of-the-art is the "no-new-Net" (nnU-Net) (Isensee et al., 2018), an automatically configurable deep learning-based segmentation method involving the ensemble of 2D, 3D and cascaded 3D U-Nets. This framework was optimised using the Dice and cross entropy loss. Recently, an ensemble-based method obtained comparable results to nnU-Net, and involved initial independent processing of kidney organ and kidney tumour segmentation by 2D U-Nets trained using the Dice loss, followed by suppression of false positive predictions of the kidney tumour segmentation using the network trained for kidney organ segmentation (Fatemeh et al., 2020). When the dataset size is small, results from an active learning-based method using CNN-corrected labelling, also trained using the Dice loss, showed a higher segmentation accuracy over nnU-Net (Kim et al., 2020).
It is apparent that for all three datasets, class imbalance is largely handled by altering either the training or input data sampling process, and rarely by adapting the loss function. However, popular methods, such as upsampling the underrepresented class, are inherently associated with an increase in false positive predictions, and more complicated, often multi-stage training processes require more computational resources.
State-of-the-art solutions typically use unmodified versions of either the Dice loss, cross entropy loss or a combination of the two, and even when using available loss functions for handling class imbalance, such as the Focal Tversky loss, consistently improved performance has not been observed (Ma et al., 2021). Deciding which loss function to use is difficult because there is not only a significant number of loss functions available to choose from, but it is also unclear how each loss function relates to one another. Understanding the relationship between loss functions is the key for providing heuristics to inform loss function choice in class imbalanced situations.
In this paper, we make the following contributions:
(a) We summarise and extend the knowledge provided by previous studies that compare loss functions to address the context of class imbalance, by using five class imbalanced datasets with varying degrees of class imbalance, including 2D binary, 3D binary and 3D multi-class segmentation, across multiple imaging modalities.
(b) We define a hierarchical classification of Dice and cross entropy-based loss functions, and use this to derive the Unified Focal loss, which generalises Dice-based and cross entropy-based loss functions for handling class imbalanced datasets.
(c) Our proposed loss function consistently improves segmentation quality over six other related loss functions, is associated with a better recall-precision balance, and is robust to class imbalance.
The manuscript is organised as follows. Section 2 provides a summary of the loss functions used, including the proposed Unified Focal loss. Section 3 describes the chosen medical imaging datasets and defines the segmentation evaluation metrics used. Section 4 presents and discusses the experimental results. Finally, Section 5 provides conclusive remarks and future directions.

Background
The loss function defines the optimisation problem, and directly affects model convergence during training. This paper focuses on semantic segmentation, a sub-field of image segmentation where pixel-level classification is performed directly, in contrast to instance segmentation where an additional object detection stage is required. We describe seven loss functions that belong to either distribution-based, region-based or compound losses based on a combination of the two. A graphical overview of loss functions in these categories, and how all are derivable from the Unified Focal loss, is provided in Fig. 1. First, the distribution-based functions are introduced, followed by region-based loss functions, and finally concluding with compound loss functions.

Cross entropy loss
The cross entropy loss is one of the most widely used loss functions in deep learning. With origins in information theory, cross entropy measures the difference between two probability distributions for a given random variable or set of events. As a loss function, it is superficially equivalent to the negative log likelihood loss and, for binary classification, the binary cross entropy loss ($L_{BCE}$) is defined as the following:
$$L_{BCE}(y, \hat{y}) = -\left(y \log(\hat{y}) + (1 - y)\log(1 - \hat{y})\right). \qquad (1)$$
Here, $y, \hat{y} \in \{0, 1\}^N$, where $\hat{y}$ refers to the predicted value and $y$ refers to the ground truth label. This can be extended to multi-class problems, and the categorical cross entropy loss ($L_{CCE}$) is computed as:
$$L_{CCE}(y, p) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c} \log(p_{i,c}), \qquad (2)$$
where $y_{i,c}$ uses a one-hot encoding scheme of ground truth labels, $p_{i,c}$ is a matrix of predicted values for each class, and where indices $c$ and $i$ iterate over all classes and pixels, respectively. Cross entropy loss is based on minimising pixel-wise error, which, in class imbalanced situations, leads to over-representation of larger objects in the loss, resulting in poorer quality segmentation of smaller objects.
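The two cross entropy definitions above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the averaging of the binary loss over pixels, and the `eps` clipping used to avoid `log(0)` are assumptions made here.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy over all pixels (eps clipping is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_pixel = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(per_pixel))

def categorical_cross_entropy(y_onehot, p, eps=1e-7):
    """Categorical cross entropy; y_onehot and p have shape (N pixels, C classes)."""
    y_onehot = np.asarray(y_onehot, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    # Sum over classes, then average over pixels.
    return float(-np.mean(np.sum(y_onehot * np.log(p), axis=1)))
```

With two classes and complementary probabilities, the categorical form reduces to the binary form, which is a convenient sanity check.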

Focal loss
The Focal loss is a variant of the binary cross entropy loss that addresses the class imbalance issue of the standard cross entropy loss by down-weighting the contribution of easy examples, enabling the learning of harder examples (Lin et al., 2017). To derive the Focal loss function, we first simplify the loss in Eq. 1 as:
$$L_{BCE}(y, \hat{y}) = \begin{cases} -\log(\hat{y}), & \text{if } y = 1 \\ -\log(1 - \hat{y}), & \text{if } y = 0. \end{cases} \qquad (3)$$
Next, we define the probability of predicting the ground truth class, $p_t$, as:
$$p_t = \begin{cases} \hat{y}, & \text{if } y = 1 \\ 1 - \hat{y}, & \text{if } y = 0. \end{cases} \qquad (4)$$
The binary cross entropy loss ($L_{BCE}$) can therefore be rewritten as:
$$L_{BCE}(p, p_t) = -\log(p_t). \qquad (5)$$
The Focal loss ($L_F$) adds a modulating factor to the binary cross entropy loss:
$$L_F(p_t) = \alpha (1 - p_t)^{\gamma} \cdot L_{BCE}(p, p_t). \qquad (6)$$
The Focal loss is parameterised by α and γ, which control the class weights and degree of down-weighting of easy-to-classify pixels, respectively (Fig. 2). When γ = 0, the Focal loss simplifies to the binary cross entropy loss.
For multi-class segmentation, we define the categorical Focal loss ($L_{CF}$):
$$L_{CF} = \alpha (1 - p_{t,c})^{\gamma} \cdot L_{CCE}, \qquad (7)$$
where α is now a vector of class weights, $p_{t,c}$ is a matrix of ground truth probabilities for each class, and $L_{CCE}$ is the categorical cross entropy loss as defined in Eq. 2.
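A minimal NumPy sketch of the binary Focal loss follows, assuming the form in which a single weight α and the modulating factor (1 − p_t)^γ multiply the per-pixel cross entropy. The function name and the default values of `alpha` and `gamma` are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0, eps=1e-7):
    """Per-pixel binary Focal loss: alpha * (1 - p_t)**gamma * (-log p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    # p_t is the probability assigned to the ground truth class of each pixel.
    p_t = np.where(y_true == 1, y_pred, 1.0 - y_pred)
    return alpha * (1.0 - p_t) ** gamma * -np.log(p_t)
```

With `gamma=0` and `alpha=1` the function returns the plain per-pixel cross entropy, while with `gamma=2` a confidently correct pixel (p_t = 0.9) contributes only about 1% of its cross entropy value.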

Dice loss
The Sørensen-Dice index, known as the Dice similarity coefficient (DSC) when applied to Boolean data, is the most commonly used metric for evaluating segmentation accuracy. We can define DSC in terms of the per voxel classification of true positives (TP), false positives (FP) and false negatives (FN):
$$DSC = \frac{2TP}{2TP + FP + FN}. \qquad (8)$$
The Dice loss ($L_{DSC}$) can therefore be defined as:
$$L_{DSC} = 1 - DSC. \qquad (9)$$
Other variants of the Dice loss include the Generalised Dice loss (Crum et al., 2006; Sudre et al., 2017), where the class weights are corrected by the inverse of their volume, and the Generalised Wasserstein Dice loss (Fidon et al., 2017), which combines the Wasserstein metric with the Dice loss and is adapted for dealing with hierarchical data, such as the BraTS20 dataset (Menze et al., 2014).
Even in its most simple formulation, the Dice loss is somewhat adapted to handle class imbalance. However, the Dice loss gradient is inherently unstable, most evident with highly class imbalanced data where gradient calculations involve small denominators (Wong et al., 2018;Bertels et al., 2019).
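As an illustration, the Dice loss can be written with "soft" counts of TP, FP and FN, so that it accepts predicted probabilities as well as binary masks. The soft counting and the small `eps` smoothing term (which guards against the small denominators mentioned above) are assumptions of this sketch.

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    """Dice loss, 1 - DSC, computed from soft TP/FP/FN counts."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    tp = np.sum(y_true * y_pred)          # foreground predicted as foreground
    fp = np.sum((1.0 - y_true) * y_pred)  # background predicted as foreground
    fn = np.sum(y_true * (1.0 - y_pred))  # foreground predicted as background
    dsc = (2.0 * tp + eps) / (2.0 * tp + fp + fn + eps)
    return 1.0 - dsc
```

For example, a prediction with one true positive, one false positive and one false negative gives DSC = 2/4 and therefore a loss of about 0.5.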

Tversky loss
The Tversky index (Salehi et al., 2017) is closely related to the DSC, but enables optimisation for output imbalance by assigning weights α and β to false positives and false negatives, respectively:
$$TI = \frac{\sum_{i=1}^{N} p_{0i} g_{0i}}{\sum_{i=1}^{N} p_{0i} g_{0i} + \alpha \sum_{i=1}^{N} p_{0i} g_{1i} + \beta \sum_{i=1}^{N} p_{1i} g_{0i}}, \qquad (10)$$
where $p_{0i}$ is the probability of pixel $i$ belonging to the foreground class and $p_{1i}$ is the probability of pixel $i$ belonging to the background class. $g_{0i}$ is 1 for foreground and 0 for background, and conversely $g_{1i}$ takes values of 1 for background and 0 for foreground. Using the Tversky index, we define the Tversky loss ($L_T$) for $C$ classes as:
$$L_T = \sum_{c=1}^{C} (1 - TI). \qquad (11)$$
When the Dice loss function is applied to class imbalanced problems, the resulting segmentation often exhibits high precision but low recall scores (Salehi et al., 2017). Assigning a greater weight to false negatives improves recall and results in a better balance of precision and recall.
The asymmetric similarity loss is derived from the Tversky loss, but uses the $F_\beta$ score and replaces α with $\frac{1}{1+\beta^2}$ and β with $\frac{\beta^2}{1+\beta^2}$, adding the constraint that α and β must sum to 1 (Hashemi et al., 2018). In practice, α and β values for the Tversky loss are chosen such that they sum to 1, making both loss functions functionally equivalent.
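The Tversky index for a single foreground class can be sketched as below, with α weighting false positives and β weighting false negatives as described above. The function name, the `eps` smoothing and the default weights (0.3 and 0.7) are illustrative assumptions.

```python
import numpy as np

def tversky_index(y_true, y_pred, alpha=0.3, beta=0.7, eps=1e-7):
    """Tversky index for the foreground class from soft TP/FP/FN counts."""
    g = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    tp = np.sum(p * g)
    fp = np.sum(p * (1.0 - g))  # weighted by alpha
    fn = np.sum((1.0 - p) * g)  # weighted by beta
    return (tp + eps) / (tp + alpha * fp + beta * fn + eps)

def tversky_loss(y_true, y_pred, alpha=0.3, beta=0.7):
    """Single-class Tversky loss, 1 - TI."""
    return 1.0 - tversky_index(y_true, y_pred, alpha, beta)
```

Setting `alpha = beta = 0.5` recovers the DSC, since TP / (TP + 0.5 FP + 0.5 FN) = 2TP / (2TP + FP + FN).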

Focal Tversky loss
Inspired by the Focal loss adaptation of the cross entropy loss, the Focal Tversky loss (Abraham and Khan, 2019) adapts the Tversky loss by applying a focal parameter.
Using the definition of TI from Eq. 10, the Focal Tversky loss ($L_{FT}$) is defined as:
$$L_{FT} = \sum_{c=1}^{C} (1 - TI)^{1/\gamma}, \qquad (12)$$
where γ < 1 increases the degree of focusing on harder examples. The Focal Tversky loss simplifies to the Tversky loss when γ = 1. However, contrary to the Focal loss, the optimal value reported was γ = 4/3, which enhances rather than suppresses the loss of easy examples. Indeed, near the end of training, where the majority of the examples are more confidently classified and the Tversky index approaches 1, enhancing the loss in this region maintains a higher loss, which may prevent premature convergence to a suboptimal solution.
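The effect of the focal parameter can be checked numerically. The sketch below assumes the loss for one class takes the form (1 − TI)^(1/γ), which matches the behaviour described above: it reduces to the Tversky loss at γ = 1, and γ = 4/3 raises the loss of well-segmented (easy) examples. The function name is hypothetical.

```python
import numpy as np

def focal_tversky_from_index(ti, gamma=4.0 / 3.0):
    """Focal Tversky loss for one class, given its Tversky index ti."""
    return (1.0 - np.asarray(ti, dtype=float)) ** (1.0 / gamma)

# An easy example with TI = 0.9 has a Tversky loss of 0.1.
easy = 0.9
enhanced = focal_tversky_from_index(easy, gamma=4.0 / 3.0)  # exponent 3/4
suppressed = focal_tversky_from_index(easy, gamma=0.5)      # exponent 2
```

Here `enhanced` is larger than the unmodified loss of 0.1 and `suppressed` is smaller, which is the contrast with the Focal loss drawn in the text.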

Combo loss
The Combo loss (Taghanaki et al., 2019) belongs to the class of compound losses, where multiple loss functions are minimised in unison. The Combo loss ($L_{combo}$) is defined as a weighted sum of the DSC in Eq. 8 and a modified form of the cross entropy loss ($L_{mCE}$):
$$L_{combo} = \alpha \, L_{mCE} - (1 - \alpha) \cdot DSC, \qquad (13)$$
where:
$$L_{mCE} = -\frac{1}{N}\sum_{i=1}^{N} \left[\beta \, y_i \log(p_i) + (1 - \beta)(1 - y_i)\log(1 - p_i)\right], \qquad (14)$$
and α ∈ [0,1] controls the relative contribution of the Dice and cross entropy terms to the loss, and β controls the relative weights assigned to false positives and negatives. A value of β > 0.5 penalises false negative predictions more than false positives. Confusingly, the term "Dice and cross entropy loss" has been used to refer to both the sum of cross entropy loss and DSC (Taghanaki et al., 2019; Isensee et al., 2018), as well as the sum of the cross entropy loss and Dice loss, such as in the DiceFocal loss and Dice and weighted cross entropy loss (Zhu et al., 2019b; Chen et al., 2019). Here, we decide to use the former definition, which is consistent with both Combo loss and the loss function used in the state-of-the-art for the KiTS19 dataset (Isensee et al., 2018).
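A minimal sketch of the Combo loss for the binary case is given below, assuming the "cross entropy minus DSC" convention adopted in the text: α weights the modified cross entropy term, (1 − α) weights the DSC, and β weights the foreground term of the cross entropy. The function name, `eps` handling and defaults are assumptions.

```python
import numpy as np

def combo_loss(y_true, y_pred, alpha=0.5, beta=0.5, eps=1e-7):
    """Combo loss: alpha * modified cross entropy - (1 - alpha) * DSC."""
    y = np.asarray(y_true, dtype=float).ravel()
    p = np.clip(np.asarray(y_pred, dtype=float).ravel(), eps, 1.0 - eps)
    # Modified cross entropy: beta weights the foreground (false negative) term.
    mce = -np.mean(beta * y * np.log(p) + (1.0 - beta) * (1.0 - y) * np.log(1.0 - p))
    tp = np.sum(y * p)
    fp = np.sum((1.0 - y) * p)
    fn = np.sum(y * (1.0 - p))
    dsc = (2.0 * tp + eps) / (2.0 * tp + fp + fn + eps)
    return float(alpha * mce - (1.0 - alpha) * dsc)
```

Because the DSC is subtracted rather than converted into a Dice loss, a perfect prediction yields a loss close to −(1 − α) rather than 0.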

Hybrid Focal loss
The Combo loss (Taghanaki et al., 2019) and DiceFocal loss (Zhu et al., 2019b) are two compound loss functions that inherit benefits from both Dice and cross entropy-based loss functions. However, neither exploits the full benefits in the context of class imbalance. Both the Combo loss and the DiceFocal loss, with a tunable β and α parameter respectively in the cross entropy component losses, are partially robust to output imbalance. However, both lack an equivalent for the Dice component loss, where positive and negative examples remain equally weighted. Similarly, the Dice component of both losses is not adapted to handle input imbalance, although the DiceFocal loss is better adapted with its focal parameter in the Focal loss component.
To overcome this, we previously proposed the Hybrid Focal loss function, which incorporates tunable parameters to handle output imbalance, as well as focal parameters to handle input imbalance, for both the Dice and cross entropy-based component losses (Yeung et al., 2021). By replacing the Dice loss with the Focal Tversky loss, and the cross entropy loss with the Focal loss, the Hybrid Focal loss ($L_{HF}$) is defined as:
$$L_{HF} = \lambda L_F + (1 - \lambda) L_{FT}, \qquad (15)$$
where λ ∈ [0,1] determines the relative weighting of the two component loss functions.

Unified Focal loss
The Hybrid Focal loss adapts both the Dice and cross entropy based losses to handle class imbalance. However, there are two main issues associated with using the Hybrid Focal loss in practice. Firstly, there are six hyperparameters to tune: α and γ from the Focal loss, α / β and γ from the Focal Tversky loss, and λ to control the relative weighting of the two component losses. While this allows a greater degree of flexibility, this comes at the cost of a significantly larger hyperparameter search space. The second issue is common to all focal loss functions, where the enhancing or suppressing effect introduced by the focal parameter is applied to all classes, which may affect the convergence towards the end of training.
The Unified Focal loss addresses both issues, by grouping functionally equivalent hyperparameters together and exploiting asymmetry to focus the suppressive and enhancing effects of the focal parameters in the modified Focal loss and Focal Tversky loss components, respectively.
Firstly, we replace α in the Focal loss and α and β in the Tversky index with a common δ parameter to control output imbalance, and reformulate γ to enable simultaneous Focal loss suppression and Focal Tversky loss enhancement, naming these the modified Focal loss ($L_{mF}$) and modified Focal Tversky loss ($L_{mFT}$), respectively:
$$L_{mF} = \delta (1 - p_t)^{\gamma} \cdot L_{BCE}, \qquad (16)$$
$$L_{mFT} = \sum_{c=1}^{C} (1 - mTI)^{1 - \gamma}, \qquad (17)$$
where
$$mTI = \frac{\sum_{i=1}^{N} p_{0i} g_{0i}}{\sum_{i=1}^{N} p_{0i} g_{0i} + \delta \sum_{i=1}^{N} p_{1i} g_{0i} + (1 - \delta) \sum_{i=1}^{N} p_{0i} g_{1i}}. \qquad (18)$$
The symmetric variant of the Unified Focal loss ($L_{sUF}$) is therefore defined as:
$$L_{sUF} = \lambda L_{mF} + (1 - \lambda) L_{mFT}, \qquad (19)$$
where λ ∈ [0,1] determines the relative weighting of the two losses. By grouping functionally equivalent hyperparameters, the six hyperparameters associated with the Hybrid Focal loss are reduced to three, with δ controlling the relative weighting of positive and negative examples, γ controlling both suppression of the background class and enhancement of the rare class, and finally λ determining the weights of the two component losses.
Although the Focal loss achieves suppression of the background class, the focal parameter is applied to all classes and therefore the loss contributed by the rare class is also suppressed. Asymmetry enables selective enhancement or suppression using the focal parameter by assigning different losses to each class, and this overcomes both the harmful suppression of the rare class and enhancement of the background class. The modified asymmetric Focal loss ($L_{maF}$) removes the focal parameter for the component of the loss relating to the rare class $r$, while retaining suppression of the background elements:
$$L_{maF} = -\frac{\delta}{N} y_{i:r} \log(p_{t,r}) - \frac{1 - \delta}{N} \sum_{c \neq r} (1 - p_{t,c})^{\gamma} \log(p_{t,c}). \qquad (20)$$
In contrast, for the modified Focal Tversky loss, we remove the focal parameter for the component of the loss relating to the background, retaining enhancement of the rare class $r$, and define the modified asymmetric Focal Tversky loss ($L_{maFT}$) as:
$$L_{maFT} = \sum_{c \neq r} (1 - mTI) + \sum_{c = r} (1 - mTI)^{1 - \gamma}. \qquad (21)$$
The asymmetric variant of the Unified Focal loss ($L_{aUF}$) is therefore defined as:
$$L_{aUF} = \lambda L_{maF} + (1 - \lambda) L_{maFT}. \qquad (22)$$
The issue of loss suppression associated with the Focal loss is mitigated by complementary pairing with the Focal Tversky loss, with the asymmetry enabling simultaneous background loss suppression and foreground loss enhancement, analogous to increasing the signal to noise ratio (Fig. 2).
By incorporating ideas from previous loss functions, the Unified Focal loss generalises Dice-based and cross entropy-based loss functions into a single framework. In fact, it can be shown that all Dice and cross entropy based loss functions described so far are special cases of the Unified Focal loss (Fig. 1). For example, by setting γ = 0 and δ = 0.5, the Dice loss and the cross entropy loss are recovered when λ is set to 0 and 1 respectively. By clarifying the relationship between the loss functions, the Unified Focal loss is much easier to optimise than separately trialling the different loss functions, and it is also more powerful because it is robust to both input and output imbalances. Importantly, given that the Dice loss and cross entropy loss both are efficient operations, and applying the focal parameter adds negligible time complexity, the Unified Focal loss is not expected to significantly increase training time over its component loss functions.
In practice, optimisation of the Unified Focal loss can be further simplified to a single hyperparameter. Given the different effect of the focal parameter on each component loss, the role of λ is partially redundant, and therefore we recommend setting λ = 0.5, which assigns equal weight to each component loss and is supported by empirical evidence (Taghanaki et al., 2019). Furthermore, we recommend setting δ = 0.6, to correct the Dice loss tendency to produce high precision, low recall segmentations with class imbalance. This is less than δ = 0.7 in the Tversky loss, to account for the effect from the cross entropy-based component. This heuristic reduction of the hyperparameter search space to the single γ parameter makes the Unified Focal loss both powerful and easy to optimise. We provide further empirical evidence behind these heuristics for the Unified Focal loss in the Supplementary Materials.
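To make the asymmetric variant concrete, the following NumPy sketch implements it for the binary case, treating the foreground as the rare class. It is a reading of the description above under stated assumptions, not the authors' released code: δ is taken to weight the foreground cross entropy term and false negatives, the focal factor (1 − p_t)^γ is applied to background pixels only, the exponent 1 − γ is applied to the foreground Tversky term only, and the two class-wise Tversky terms are summed. The function names, the `eps` smoothing and the default `gamma=0.5` are illustrative.

```python
import numpy as np

def _modified_tversky_index(g, p, delta, eps):
    """Tversky index with false negatives weighted by delta, false positives by 1 - delta."""
    tp = np.sum(g * p)
    fn = np.sum(g * (1.0 - p))
    fp = np.sum((1.0 - g) * p)
    return (tp + eps) / (tp + delta * fn + (1.0 - delta) * fp + eps)

def asymmetric_unified_focal_loss(y_true, y_pred, lam=0.5, delta=0.6, gamma=0.5, eps=1e-7):
    """Binary asymmetric Unified Focal loss (sketch; foreground is the rare class)."""
    y = np.asarray(y_true, dtype=float).ravel()
    p = np.clip(np.asarray(y_pred, dtype=float).ravel(), eps, 1.0 - eps)

    # Asymmetric Focal loss: no focal factor on the rare class,
    # suppression (1 - p_t)**gamma = p**gamma on background pixels.
    fore_ce = -delta * y * np.log(p)
    back_ce = -(1.0 - delta) * (1.0 - y) * p ** gamma * np.log(1.0 - p)
    l_maf = np.mean(fore_ce + back_ce)

    # Asymmetric Focal Tversky loss: plain Tversky loss for the background,
    # enhancement via the exponent (1 - gamma) for the rare class.
    back_term = 1.0 - _modified_tversky_index(1.0 - y, 1.0 - p, delta, eps)
    fore_term = (1.0 - _modified_tversky_index(y, p, delta, eps)) ** (1.0 - gamma)
    l_maft = back_term + fore_term

    return float(lam * l_maf + (1.0 - lam) * l_maft)
```

Under these assumptions, setting `gamma=0` and `delta=0.5` gives a scaled cross entropy when `lam=1` and the sum of class-wise Dice losses when `lam=0`, in line with the special cases described in the text.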

Dataset descriptions and evaluation metrics
We select five class imbalanced medical imaging datasets for our experiments: CVC-ClinicDB, DRIVE, BUS2017, KiTS19 and BraTS20. To assess the degree of class imbalance, the percentage of foreground pixels/voxels was calculated per image and averaged over the entire dataset (Table 1).
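The class imbalance measure described above can be sketched as follows, assuming each ground truth mask is an array in which non-zero entries mark foreground. The function name is hypothetical.

```python
import numpy as np

def mean_foreground_percentage(masks):
    """Percentage of foreground pixels/voxels per mask, averaged over the dataset."""
    per_image = [100.0 * np.count_nonzero(m) / m.size for m in map(np.asarray, masks)]
    return float(np.mean(per_image))

# Two toy masks: 5 of 100 pixels (5%) and 4 of 16 pixels (25%) are foreground.
mask_a = np.zeros((10, 10), dtype=int)
mask_a[0, :5] = 1
mask_b = np.zeros((4, 4), dtype=int)
mask_b[0, :] = 1
imbalance = mean_foreground_percentage([mask_a, mask_b])  # average of 5% and 25%
```

Averaging per image, rather than pooling all pixels, means each image contributes equally regardless of its size.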

CVC-ClinicDB dataset
Colonoscopy is the gold-standard screening tool for colorectal cancer, but is associated with significant polyp miss rates, presenting an opportunity to leverage computer-aided systems to support clinicians in reducing the number of polyps missed (Kim et al., 2017). We use the CVC-ClinicDB dataset, which consists of 612 frames containing polyps with image resolution 288 × 384 pixels, generated from 23 video sequences from 13 different patients using standard colonoscopy interventions with white light (Bernal et al., 2015).

DRIVE dataset
Degenerative retinal diseases display characteristic features on fundoscopy that may be used to aid diagnosis. In particular, retinal vessel abnormalities such as changes in tortuosity or neovascularisation provide important clues for staging and treatment planning. We select the DRIVE dataset (Staal et al., 2004), which consists of 40 coloured fundus photographs obtained from diabetic retinopathy screening in the Netherlands, captured using 8 bits per colour plane of resolution 768 × 584. 33 photographs display no signs of diabetic retinopathy, while 7 photographs show signs of mild diabetic retinopathy.

BUS2017 dataset
The most commonly used screening tool for breast cancer assessment is digital mammography. However, dense breast tissue, often seen in younger patients, is poorly visualised on mammography. An important alternative is US imaging, which is an operator-dependent procedure requiring skilled radiologists, but has the advantage of no radiation exposure unlike mammography. BUS2017 dataset B consists of 163 ultrasound images and associated ground truth segmentations with mean image size of 760 × 570 pixels collected from the UDIAT Diagnostic Centre of the Parc Taulí Corporation, Sabadell, Spain. 110 images are benign lesions, consisting of 65 unspecified cysts, 39 fibroadenomas and 6 from other benign types. The other 53 images depict cancerous masses, with the majority invasive ductal carcinomas.

BraTS20 dataset
The BraTS20 dataset is currently the largest, publicly available and fully-annotated dataset for medical image segmentation (Nazir et al., 2021), and comprises 494 multimodal scans of patients with either low-grade glioma or high-grade glioblastoma (Menze et al., 2014; Bakas et al., 2017, 2018). The BraTS20 dataset provides images for the following MRI sequences: T1-weighted (T1), T1-weighted contrast-enhanced using gadolinium contrast agents (T1-CE), T2-weighted (T2) and fluid attenuated inverse recovery (FLAIR) sequence. Images were manually annotated, with regions associated with the tumour labelled as: necrotic and non-enhancing tumour core, peritumoural oedema or gadolinium-enhancing tumour. From the 494 scans provided, 125 scans are used for validation with reference segmentation masks withheld from public access, and therefore are excluded. To define a binary segmentation task, we further exclude T1, T2 and FLAIR sequences to focus on gadolinium-enhancing tumour segmentation using the T1-CE sequence (Rundo et al., 2019b; Han et al., 2019), which not only appears to be the most difficult class to segment (Henry et al., 2020), but is also the most clinically relevant for radiation therapy (Rundo et al., 2017, 2018). We further exclude another 27 scans without enhancing tumour regions, leaving 342 scans, with image resolution 240 × 240 × 155 voxels, for use.

KiTS19 dataset
Kidney tumour segmentation is a challenging task due to the widespread presence of hypodense tissue, as well as the highly heterogeneous appearance of tumours on CT (Linguraru et al., 2009; Rundo et al., 2020a). To evaluate our loss functions, we select the KiTS19 dataset (Heller et al., 2019), a highly class imbalanced, multi-class classification problem. Briefly, this dataset consists of 300 arterial phase abdominal CT scans from patients who underwent partial removal of the tumour and surrounding kidney or complete removal of the kidney including the tumour at the University of Minnesota Medical Center, USA. The image size is 512 × 512 pixels in the axial plane, with an average of 216 slices in coronal plane. Kidney and tumour boundaries were manually delineated by two students, with class labels of either kidney, tumour or background assigned to each voxel, resulting in a semantic segmentation task (Heller et al., 2019). 210 scans and their associated segmentations are provided for training, with the segmentation masks for the other 90 scans withheld from public access for testing. We therefore exclude the 90 scans without segmentation masks, and further exclude another 6 scans (cases 15, 23, 37, 68, 125 and 133) due to concern over ground truth quality (Heller et al., 2021), leaving 204 scans for use.

Evaluation metrics
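To assess segmentation accuracy, we use four commonly used metrics: DSC, Intersection over Union (IoU), recall and precision. DSC is defined in Eq. 8, and IoU, recall and precision are similarly defined per pixel/voxel and according to Eqs. 23, 24 and 25, respectively:
$$IoU = \frac{TP}{TP + FP + FN}, \qquad (23)$$
$$Recall = \frac{TP}{TP + FN}, \qquad (24)$$
$$Precision = \frac{TP}{TP + FP}. \qquad (25)$$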
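The four metrics can be computed from a pair of binary masks as in the sketch below. The function name is hypothetical, and the sketch assumes that both the ground truth and the prediction contain at least one foreground element, so that no denominator is zero.

```python
import numpy as np

def segmentation_metrics(y_true, y_pred):
    """DSC, IoU, recall and precision from binary masks (no empty-mask handling)."""
    t = np.asarray(y_true).astype(bool).ravel()
    p = np.asarray(y_pred).astype(bool).ravel()
    tp = int(np.sum(t & p))
    fp = int(np.sum(~t & p))
    fn = int(np.sum(t & ~p))
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

# Toy example: TP = 2, FP = 1, FN = 1.
scores = segmentation_metrics([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

In this toy example the DSC is 4/6, the IoU is 2/4, and recall and precision are both 2/3.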

Implementation details
All experiments are programmed using Keras with TensorFlow (Müller and Kramer, 2019). Images from the CVC-ClinicDB, DRIVE and BUS2017 datasets are provided in anonymised tiff, jpeg and png file formats, respectively. For both the KiTS19 and BraTS20 datasets, images and ground truth segmentation masks are provided in an anonymised NIfTI file format. For all datasets, except for the DRIVE dataset which is originally partitioned into 20 training images and 20 testing images, we randomly partitioned each dataset into an 80% development set and a 20% test set, and further divided the development set into an 80% training set and a 20% validation set. All images were normalised to [0,1] using the z-score. We made use of the 'batchgenerators' library to apply on-the-fly data augmentation with probability 0.15, including: scaling (0.85× to 1.25×), rotation (−15° to +15°), mirroring (vertical and horizontal axes), elastic deformation (α ∈ [0, 900] and σ ∈ [9.0, 13.0]) and brightness (0.5× to 2×).
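The nested partitioning described above yields roughly 64% training, 16% validation and 20% test data. A minimal sketch is shown below; the function name, the fixed seed and the rounding of the split sizes are assumptions, since these details are not specified in the text.

```python
import numpy as np

def split_indices(n_items, seed=0):
    """Return (train, val, test) indices: 80/20 development/test, then 80/20 train/val."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_test = int(round(0.2 * n_items))
    test, dev = order[:n_test], order[n_test:]
    n_val = int(round(0.2 * len(dev)))
    val, train = dev[:n_val], dev[n_val:]
    return train, val, test

train_idx, val_idx, test_idx = split_indices(100)
```

Splitting on a single random permutation guarantees that the three subsets are disjoint and together cover every item.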
For 2D binary segmentation, we used the CVC-ClinicDB, DRIVE and BUS2017 datasets and performed full-image analysis, with images resized as described in Table 1. For 3D binary segmentation, we used the BraTS20 dataset. Here, images were pre-processed, with the skull stripped and images interpolated to the same isotropic resolution of 1 mm³, and we performed patch-wise analysis using random patches of size 96 × 96 × 96 voxels for training, with a patch-wise overlap of 48 × 48 × 48 voxels for inference. For 3D multiclass segmentation, we used the KiTS19 dataset. Hounsfield units (HU) were clipped to [−79, …, 304] HU and voxel spacing was resampled to 3.22 × 1.62 × 1.62 mm³ (Müller and Kramer, 2019). We performed patch-wise analysis using random patches of size 80 × 160 × 160 voxels for training, with a patch-wise overlap of 40 × 80 × 80 voxels for inference.
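The intensity clipping and random patch sampling used for KiTS19 can be sketched as follows. This is a minimal NumPy illustration, not the pipeline used in the experiments: the function names, the (slice, row, column) axis order and the uniform sampling of patch origins are assumptions.

```python
import numpy as np


def clip_hu(volume, lo=-79, hi=304):
    """Clip CT intensities to the Hounsfield unit window [lo, hi]."""
    return np.clip(volume, lo, hi)


def random_patch(volume, patch_shape=(80, 160, 160), rng=None):
    """Extract one randomly positioned patch from a 3D volume.

    Patch origins are drawn uniformly so that the patch lies fully
    inside the volume (a hypothetical sampling scheme).
    """
    rng = np.random.default_rng() if rng is None else rng
    if any(dim < p for dim, p in zip(volume.shape, patch_shape)):
        raise ValueError("volume is smaller than the requested patch")

    starts = [int(rng.integers(0, dim - p + 1))
              for dim, p in zip(volume.shape, patch_shape)]
    window = tuple(slice(s, s + p) for s, p in zip(starts, patch_shape))
    return volume[window]


# Toy volume spanning -500 to 500 HU, with a small patch for illustration.
toy = np.linspace(-500, 500, 10 * 20 * 20).reshape(10, 20, 20)
clipped = clip_hu(toy)
patch = random_patch(clipped, patch_shape=(4, 8, 8),
                     rng=np.random.default_rng(0))
```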
For the 2D segmentation tasks, we used the original 2D U-Net architecture (Ronneberger et al., 2015), and for the 3D segmentation tasks, we used the 3D U-Net (Cicek et al., 2016). Model parameters were initialised using Xavier initialisation (Glorot and Bengio, 2010), and we added instance normalisation and a final softmax activation layer (Zhou and Yang, 2019). We trained using the stochastic gradient descent optimiser with a batch size of 2 and an initial learning rate of 0.1. For convergence criteria, we used ReduceLROnPlateau to reduce the learning rate by a factor of 0.1 if the validation loss did not improve after 10 epochs, and the EarlyStopping callback to terminate training if the validation loss did not improve after 20 epochs. Validation loss was evaluated after each epoch, and the model with the lowest validation loss was selected as the final model.
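To make the interaction between the two convergence criteria concrete, the following plain-Python sketch simulates the schedule on a list of validation losses. It is a simplified stand-in for the Keras callbacks rather than their actual implementation: the function name `simulate_schedule` and the exact way the two patience counters are reset are assumptions.

```python
def simulate_schedule(val_losses, lr=0.1, factor=0.1,
                      lr_patience=10, stop_patience=20):
    """Simulate learning-rate reduction and early stopping.

    Returns the epoch with the lowest validation loss and the learning
    rate in effect after each completed epoch.
    """
    best = float("inf")
    best_epoch = None
    lr_wait = 0      # epochs without improvement since last LR change
    stop_wait = 0    # epochs without improvement since the best epoch
    lr_history = []

    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
            lr_wait = stop_wait = 0
        else:
            lr_wait += 1
            stop_wait += 1
            if lr_wait >= lr_patience:
                lr *= factor     # reduce the learning rate tenfold
                lr_wait = 0
        lr_history.append(lr)
        if stop_wait >= stop_patience:
            break                # terminate training
    return best_epoch, lr_history


# Loss improves for three epochs and then plateaus.
best_epoch, lr_history = simulate_schedule([1.0, 0.9, 0.8] + [0.85] * 30)
```

Under these assumptions, a model that stops improving has its learning rate reduced after 10 stagnant epochs, and training ends after 20 stagnant epochs, with the weights from `best_epoch` retained.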
To test for statistical significance, we used the Wilcoxon rank-sum test. A statistically significant difference was defined as p < 0.05.
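For illustration, the two-sided Wilcoxon rank-sum test can be sketched using only the Python standard library. This sketch uses the normal approximation with average ranks for ties and no tie correction, which is an assumption about the variant of the test; in practice a library routine such as `scipy.stats.ranksums` would typically be used. The function names and the example scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    w = sum(ranks[:n1])                          # rank sum of sample x
    mean_w = n1 * (n1 + n2 + 1) / 2              # expectation under H0
    sd_w = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)    # standard deviation under H0
    z = (w - mean_w) / sd_w
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical per-case DSC scores for two loss functions.
z, p = rank_sum_test([0.91, 0.92, 0.93, 0.94, 0.95],
                     [0.81, 0.82, 0.83, 0.84, 0.85])
significant = p < 0.05
```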

Experimental results
In this section, we first describe the results from the 2D binary segmentation using the CVC-ClinicDB, DRIVE and BUS2017 datasets, followed by 3D binary segmentation using the BraTS20 dataset, and conclude with 3D multiclass segmentation with the KiTS19 dataset.

2D binary segmentation
The results for the 2D binary segmentation experiments are shown in Tables 2, 3, and 4. Across all three datasets, the best performance was consistently observed with the asymmetric variant of the Unified Focal loss, achieving a DSC of 0.909 ± 0.023, 0.803 ± 0.006 and 0.824 ± 0.063 on the CVC-ClinicDB, DRIVE and BUS2017 datasets, respectively. This was followed by the symmetric variant of the Unified Focal loss, which achieved the best IoU score of 0.852 ± 0.028 on the CVC-ClinicDB dataset, and comparable DSC scores to the asymmetric variant, with a DSC of 0.909 ± 0.024, 0.801 ± 0.006 and 0.814 ± 0.063 on the CVC-ClinicDB, DRIVE and BUS2017 datasets, respectively. No statistically significant difference in performance was observed between the two variants of the Unified Focal loss on these datasets. Generally, the worst performance was observed with cross entropy-based loss functions, with the Focal loss performing significantly worse than the cross entropy loss on the CVC-ClinicDB (p = 0.04) and BUS2017 (p = 0.004) datasets, and significantly worse than the asymmetric variant of the Unified Focal loss across the three datasets (CVC-ClinicDB: p = 2×10⁻⁶, DRIVE: p = 1×10⁻⁴ and BUS2017: p = 5×10⁻⁵). No significant differences were observed between the Dice-based losses.
To evaluate the performance stability of the γ hyperparameter, we display the DSC performance for each value of γ ∈ [0.1, 0.9] for the three datasets in Fig. 3.
For both the symmetric and asymmetric variants, the Unified Focal loss displays consistently strong performance across the range of γ ∈ [0.1, 0.9]. This is most evident with the CVC-ClinicDB dataset, where improved performance over the other loss functions is observed across the entire range of hyperparameter values. The worst performance occurred at high values such as γ = 0.9, while middle values, such as γ = 0.5, provided robust performance benefits across datasets.
To enable a qualitative comparison, example segmentations are shown in Fig. 4.
There is a clear visual difference between the segmentations generated using different loss functions. The segmentations from cross entropy-based loss functions are associated with a greater proportion of false negative predictions compared to the Dice-based loss functions. The highest quality segmentations were produced by the compound loss functions, with the best segmentations produced using the Unified Focal loss. This is particularly clear with the asymmetric variant of the Unified Focal loss in the CVC-ClinicDB example.

3D binary segmentation
The results for the 3D binary segmentation experiments are shown in Table 5.
The best performance was observed with the Unified Focal loss, specifically the asymmetric variant with a DSC of 0.787 ± 0.049, IoU of 0.683 ± 0.050, precision of 0.795 ± 0.048 and recall of 0.800 ± 0.056. This was followed by the symmetric variant of the Unified Focal loss, with no significant difference between the two loss functions. In contrast, the asymmetric Unified Focal loss displayed significantly improved performance compared to all the other loss functions (cross entropy loss: p = 0.02, Focal loss: p = 0.03, Dice loss: p = 6×10⁻¹⁰, Tversky loss: p = 5×10⁻¹¹, Focal Tversky loss: p = 0.02 and Combo loss: p = 1×10⁻⁴).
Axial slices taken from an example segmentation are shown in Fig. 5. From the results, there is a clear recall bias on this dataset, which is reflected in the proportion of false positive predictions in each segmentation. The compound loss functions displayed the best recall-precision balance, as is evident from the significantly reduced false positive predictions visible in the segmentations produced using these loss functions.

3D multiclass segmentation
The results for the 3D multiclass segmentation experiments are shown in Table 6.
The Unified Focal loss achieves the best performance, with DSC of Tversky loss (p = 0.03). The worst performance for kidney segmentation was observed using Dice-based losses, with the Tversky loss followed by the Focal Tversky loss. In contrast, the worst performance for kidney tumour segmentation was observed using cross entropy-based losses, with significantly better DSC performance using the Dice loss compared to the cross entropy loss (p = 0.01). For kidney tumour segmentation, the asymmetric variant of the Unified Focal loss achieves significantly better DSC performance compared to the cross entropy loss (p = 6×10⁻⁵), Focal loss (p = 1×10⁻⁴), Dice loss (p < 0.05) and Tversky loss (p = 4×10⁻⁴). Axial slices taken from an example segmentation are shown in Fig. 6.
While the kidneys are generally well segmented with only subtle differences between the loss functions, the tumour segmentations vary considerably in quality. The low tumour recall scores with the cross entropy-based loss functions are reflected in the segmentations, where the boundary between the tumour and kidney is shifted in favour of kidney prediction. The highest quality segmentation is observed with the Unified Focal loss, which visibly produces the most accurate contour of the tumour.

Discussion and conclusions
In this study, we proposed a new hierarchical framework to encompass various Dice and cross entropy-based loss functions, and used this to derive the Unified Focal loss, which generalises Dice and cross entropy-based loss functions for handling class imbalance. We compared the Unified Focal loss against six other loss functions on five class imbalanced datasets with varying degrees of class imbalance (CVC-ClinicDB, DRIVE, BUS2017, BraTS20 and KiTS19) involving 2D binary, 3D binary and 3D multiclass segmentation. The Unified Focal loss consistently achieved the highest DSC and IoU scores across the five datasets, with slightly better performance observed using the asymmetric variant over the symmetric variant. We demonstrated that the optimisation of the Unified Focal loss can be simplified to tuning a single γ hyperparameter, which we observed is stable and therefore easy to optimise (Fig. 3).
The significant difference in model performance using different loss functions highlights the importance of the loss function choice in class imbalanced image segmentation tasks. Most noticeable is the poor performance using distribution-based loss functions with the segmentation of the kidney tumour class on the highly class imbalanced KiTS19 dataset (Table 6). This susceptibility to class imbalance is expected given the greater representation of classes occupying a larger region in cross entropy-based losses. Generally, the Dice-based and compound loss functions performed better with class imbalanced data, but one notable exception was the BraTS20 dataset, where the Dice loss and Tversky loss performed significantly worse than the other loss functions. This likely reflects the unstable gradient issue associated with the Dice loss, leading to suboptimal convergence and poor performance. Compound loss functions such as the Combo loss and Unified Focal loss performed consistently well across datasets, benefiting from the increased gradient stability with the cross entropy-based component, and the robustness to class imbalance from the Dice-based component. The qualitative assessment correlates with the performance metrics, with the highest quality segmentations observed using the Unified Focal loss (Figs. 4-6). As expected, no difference in training time was observed between any of the loss functions used in these experiments.
There are several limitations associated with our study. Firstly, we have restricted our framework and comparisons to include only a subset of the most popular variants of the Dice-based and cross entropy-based loss functions. However, it should be noted that the Unified Focal loss also generalises other loss functions that were not included, such as the DiceFocal loss (Zhu et al., 2019b) and Asymmetric similarity loss (Hashemi et al., 2018). One major class of loss functions that was not included is boundary-based loss functions (Kervadec et al., 2019; Zhu et al., 2019a), which use distance-based metrics to optimise contours, rather than the distributions or regions used by cross entropy and Dice-based losses, respectively. Secondly, it is not immediately clear how to optimise the γ hyperparameter in multiclass segmentation tasks. In our experiments, we treated both the kidney and the kidney tumour as the rare class and assigned γ = 0.5. Better performance may be observed by assigning different γ values to each class, given that, for example, the kidney class in the KiTS19 dataset is four times more prevalent than the tumour class. However, we still achieved improved performance using the Unified Focal loss over the other loss functions even with this simplification.
We conclude by highlighting several areas for future research. To inform the loss function choice for class imbalanced segmentation, it is important to compare a greater number and variety of loss functions, especially from other loss function classes and with different class imbalanced datasets. We used the original U-Net architecture both for simplicity and to highlight the importance of loss functions on performance, but it would be useful to assess whether the performance gains generalise to state-of-the-art deep learning methods, such as the nnU-Net (Isensee et al., 2021), and whether the Unified Focal loss is able to complement or even replace alternatives, such as training or sampling-based methods for handling class imbalance.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
necessarily those of the NHS, the NIHR, or the Department of Health and Social Care.
CBS in addition acknowledges support from the Leverhulme Trust