Focus U-Net: A novel dual attention-gated CNN for polyp segmentation during colonoscopy

Background: Colonoscopy remains the gold-standard screening for colorectal cancer. However, significant miss rates for polyps have been reported, particularly when there are multiple small adenomas. This presents an opportunity to leverage computer-aided systems to support clinicians and reduce the number of polyps missed.
Method: In this work we introduce the Focus U-Net, a novel dual attention-gated deep neural network, which combines efficient spatial and channel-based attention into a single Focus Gate module to encourage selective learning of polyp features. The Focus U-Net incorporates several further architectural modifications, including the addition of short-range skip connections and deep supervision. Furthermore, we introduce the Hybrid Focal loss, a new compound loss function based on the Focal loss and Focal Tversky loss, designed to handle class-imbalanced image segmentation. For our experiments, we selected five public datasets containing images of polyps obtained during optical colonoscopy: CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, ETIS-Larib PolypDB and EndoScene test set. We first perform a series of ablation studies and then evaluate the Focus U-Net on the CVC-ClinicDB and Kvasir-SEG datasets separately, and on a combined dataset of all five public datasets. To evaluate model performance, we use the Dice similarity coefficient (DSC) and Intersection over Union (IoU) metrics.
Results: Our model achieves state-of-the-art results for both CVC-ClinicDB and Kvasir-SEG, with a mean DSC of 0.941 and 0.910, respectively. When evaluated on a combination of five public polyp datasets, our model similarly achieves state-of-the-art results with a mean DSC of 0.878 and mean IoU of 0.809, a 14% and 15% improvement over the previous state-of-the-art results of 0.768 and 0.702, respectively.
Conclusions: This study shows the potential for deep learning to provide fast and accurate polyp segmentation results for use during colonoscopy. The Focus U-Net may be adapted for future use in newer non-invasive colorectal cancer screening and more broadly to other biomedical image segmentation tasks similarly involving class imbalance and requiring efficiency.


Introduction
Globally, colorectal cancer (CRC) ranks third in terms of incidence, and second only to lung cancer as a leading cause of cancer death [1]. The absence of specific symptoms in the early stages of disease often results in delays in diagnosis and treatment, with the stage of disease at diagnosis strongly linked to prognosis. In the United States, the 5-year relative survival rate for Stage I colon cancer is 92%, decreasing to 12% in those with Stage IV [2].
In 1988, Vogelstein proposed the adenoma-carcinoma sequence model for CRC carcinogenesis, describing the transition from benign adenoma to adenocarcinoma with associated well-defined histology at each stage [3]. Importantly, there is a prolonged, identifiable and treatable preclinical phase lasting years prior to malignant transformation [4,5]. As a result, CRC is highly suitable for population-level screening, which has been shown to be effective at reducing overall mortality [6,7]. Non-invasive CRC screening tests include stool-based tests, such as the faecal occult blood test, and more recent blood-based tests, such as Epi proColon® (Epigenomics AG, Berlin, Germany). Capsule colon endoscopy and CT colonography are newer, non-invasive radiological investigations useful for screening high-risk individuals unsuitable for colonoscopy. Invasive options include flexible sigmoidoscopy and colonoscopy, offering direct visualisation and the ability to obtain biopsy specimens for histological analysis. Sigmoidoscopy is limited to detecting cancer in the rectum, sigmoid and descending colon, and colonoscopy remains the gold-standard screening tool for CRC with the highest sensitivity and specificity [8]. However, colonoscopy is associated with significant miss rates for polyp detection, attributable to both patient and polyp-related factors [9][10][11]. The risk of missing polyps increases significantly in patients with two or more polyps, and miss rates are higher for flat or sessile polyps than for pedunculated or sub-pedunculated polyps. Miss rates also vary with size, from 2% for adenomas ≥ 10 mm to 26% for adenomas < 5 mm.
The difficulty in detecting polyps during colonoscopy presents an opportunity to incorporate computer-aided systems to reduce polyp miss rates [12]. Polyps may remain hidden from the field of view, for which a real-time Artificial Intelligence (AI) model has been developed to assess the quality of colonoscopy [13]. Alternatively, polyps may enter the field of view but remain undetected by the operator. In this case, polyp segmentation approaches not only aim to detect polyps, but to also accurately delineate the polyp border from surrounding mucosa. Early automated methods to segment polyps relied on hand-crafted feature extraction, using either shape-based [14][15][16] or texture and colour-based analysis [17,18]. While considerable advancements were made, the accuracy of polyp segmentation remained low with hand-crafted features unable to capture the scale of polyp heterogeneity [19].
This paper is structured as follows. Section 2 outlines the state-of-the-art of polyp segmentation on colonoscopy images. Section 3 describes the architecture of the proposed Focus U-Net. Section 4 describes the analysed datasets and the evaluation metrics used in this study. Section 5 presents the experimental results. Finally, Section 6 provides a discussion and concluding remarks.

Related work
In recent years, significant improvements have been achieved by adopting automatic methods based on deep learning. The introduction of Fully Convolutional Networks (FCN) enabled Convolutional Neural Network (CNN) architectures to tackle semantic image segmentation tasks [20]. The application of FCNs to polyp segmentation has yielded impressive results [21,22]. Currently, the state-of-the-art approaches are largely based on the U-Net, a modified FCN architecture developed for biomedical image segmentation [23]. The U-Net consists of an encoding network used to capture the image context, followed by a symmetrical decoding network enabling localisation of salient regions. UNet++ extends the U-Net by incorporating a series of nested skip connections, reducing the semantic gap between the feature maps of the encoder and decoder networks prior to fusion [24][25][26]. The ResUNet++ combines residual units with the spatial attention-based Atrous Spatial Pyramidal Pooling (ASPP) and channel attention-based squeeze-and-excitation block [25,26]. Similarly, both attention components are incorporated into the DoubleU-Net, which further leverages transfer learning from the first U-Net to generate features as input into the second network [27]. Despite excellent segmentation results with these models, their large memory requirements and associated long inference times limit use in clinical practice, where real-time polyp segmentation is required. Recently, several efficient models with significantly faster inference times, in addition to greater accuracy, have been proposed. Prioritising efficiency over performance, PolypSegNet introduces the depth dilated inception module, enabling efficient feature extraction across a range of receptive field sizes [28]. Similarly, ColonSegNet is a light-weight network that includes residual connections and channel attention to achieve real-time polyp segmentation [29].
PraNet uses a two-step process that involves initial localisation of the polyp area, followed by progressive refining of the polyp boundary, resembling the method by which humans identify polyps [30]. HarDNet-MSEG uses a low memory latency HarDNet68 backbone [31], together with a Cascaded Partial Decoder [32] for fast and accurate polyp segmentation. The progressively normalised self-attention network introduces a self-attention module that incorporates channel-split, query-dependent and normalisation rules to improve computational efficiency [33]. The feedback attention network (FANet) uses a form of hard attention based on an iterative refinement method using Otsu thresholding [34].
In this paper, we introduce a novel attention-gated U-Net architecture, named the Focus U-Net, which uses a new attention module known as the Focus Gate (FG), incorporating both spatial and channel-based attention with a focal parameter to control the degree of background suppression. Using this architecture, we achieve state-of-the-art results across five public polyp segmentation datasets. With an efficient and accurate polyp segmentation algorithm, we provide the latest advancement towards using AI in colonoscopy practice, with the aim of assisting clinicians by increasing polyp detection rates.

The proposed Focus U-Net architecture
In this section, we introduce the techniques used in the Focus U-Net, beginning with the FG and associated channel and spatial attention modules, followed by explanations of deep supervision and loss function optimisation.

Overview of the Focus U-Net
The architecture of the Focus U-Net is shown in Fig. 1. Similar to the U-Net, the Focus U-Net begins with an encoding network, capturing features relevant to polyps such as edges, texture and colour. The deepest layer of the network contains the richest information relating to image features, at the cost of spatial resolution, and forms the gating signal used as input into the FG. The FG uses the gating signal to refine incoming signals from the encoding network in the form of long-range skip connections, by highlighting specific image features and regions that are integrated into the decoding network. Successive upsampling in the decoding network enables polyp localisation at progressively higher resolution, with the final output producing the segmentation map defining, if present, the precise shape and location of the polyp. Short-range skip connections and deep supervision create additional pathways for information transfer, diversifying the features extracted and providing shortcuts for the loss to propagate backwards to the deeper layers when updating parameters.

Attention Gates and the Focus Gate
The concept of attention mechanisms in neural networks is inspired by cognitive attention, where relevant stimuli in the visual field are identified and selectively processed. In the context of neural networks, distinctions are made between hard and soft attention, as well as global and local attention [35,36]. Hard attention calculates attention scores for each region of the image to select the regions to attend. This requires a stochastic sampling process, which is a non-differentiable calculation relying on reinforcement learning to update parameters [37]. In contrast, soft attention is deterministic and assigns regions of interest (ROIs) with higher weight, with the benefit that this process is differentiable and therefore trainable by standard backpropagation [38,39]. The distinction between global and local attention refers to whether the whole input or only a subset of the input is attended [40]. For training of neural networks, a combination of soft and local attention is often favoured [41].
Attention Gates (AGs) provide neural networks with the capacity to selectively attend to inputs. The use of AGs first originated in the context of machine translation as part of Natural Language Processing (NLP) [40][42][43][44], but has also more recently shown success in Computer Vision, with particular interest in the Attention U-Net for medical image segmentation [41,45].
The structure of the additive AG is illustrated in Fig. 2 [41]. This AG receives two inputs, the gating signal and associated skip connection generated at that level. The gating signal originates from the deepest layer of the neural network, where feature representation is the greatest at the cost of significant down-sampling. In contrast, skip connections arise in more superficial layers, where feature representation is coarser, but image resolution is relatively spared. The AG uses contextual information from the gating signal to prune the skip connection, highlighting ROIs and therefore reducing false positive predictions. To accomplish this, the initial stage involves simultaneous upsampling of the gating signal and downsampling of the skip connection to produce equivalent image dimensions enabling element-wise addition. Although computationally more expensive, additive attention has been shown to achieve higher accuracy than multiplicative attention [40].
The resulting matrix is passed through a ReLU activation, followed by global average pooling along the channel axis and final sigmoid activation, generating a matrix of attention weights, also known as the attention coefficients. The final step is an element-wise multiplication of the upsampled attention coefficients with the original skip connection input, providing spatial context to the skip connection prior to fusion with outputs from the decoder network.
Fig. 2. Top: schematic of the additive attention gate. The gating signal and skip connection are first resized and then combined to form attention coefficients. Multiplication of the original skip connection with the attention coefficients provides spatial context highlighting ROIs. Bottom: schematic of the Focus Gate. The gating signal and skip connection are first resized and then combined prior to spatial and object-related feature extraction. The attention coefficients pass through an additional focal filter controlling the degree of background suppression. Finally, multiplying the original skip connection with the attention coefficients provides both spatial and feature context highlighting regions and features of interest.
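The data flow through the additive AG can be illustrated with a small NumPy sketch. This is only an illustration under simplifying assumptions: the learned convolutions of the Attention U-Net are omitted, strided slicing and nearest-neighbour repetition stand in for the resampling operations, and the gating signal is assumed to sit at a quarter of the skip connection resolution. The function name `additive_attention_gate` is hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_attention_gate(skip, gate):
    """Simplified additive attention gate (no learned weights).

    skip: (H, W, C) feature map from the encoder (skip connection).
    gate: (H // 4, W // 4, C) gating signal from the deepest layer (assumed scale).
    """
    # Resample both inputs to a common (H // 2, W // 2) grid.
    skip_down = skip[::2, ::2, :]                       # stand-in for a strided convolution
    gate_up = gate.repeat(2, axis=0).repeat(2, axis=1)  # nearest-neighbour upsampling
    # Element-wise addition followed by ReLU.
    combined = np.maximum(skip_down + gate_up, 0.0)
    # Average along the channel axis, then squash to (0, 1) with a sigmoid.
    coeffs = sigmoid(combined.mean(axis=-1, keepdims=True))
    # Upsample the attention coefficients to the skip resolution and apply them.
    coeffs_up = coeffs.repeat(2, axis=0).repeat(2, axis=1)
    return skip * coeffs_up

rng = np.random.default_rng(0)
skip = rng.normal(size=(8, 8, 4))
gate = rng.normal(size=(2, 2, 4))
gated = additive_attention_gate(skip, gate)
print(gated.shape)
```

Because every attention coefficient lies strictly between 0 and 1, the gated skip connection can only be attenuated, never amplified, which is how the gate prunes regions outside the ROIs.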
Before describing the FG, illustrated in the bottom of Fig. 2, we first describe two of its main components, namely the channel attention module and the spatial attention module.

Channel attention module
The global average pooling operation in the additive AG extracts the spatial context to localise the ROIs. However, by pooling across the channel axis, information conveyed by the channels relating to object features, such as edges and colour, is lost. On the contrary, by assigning weights along the channel axis, channel interdependencies may be explicitly modelled, enabling networks to better recalibrate the features used for segmentation [46][47][48][49]. Squeeze-and-excitation (SE) blocks achieve this by initial feature aggregation using global average pooling along the spatial axis, known as the 'squeeze' operation, followed by two fully connected layers with ReLU and sigmoid activations producing the 'excitation' operation [46,50]. The two fully connected layers involve dimensionality reduction to control model complexity, with implications for computation and performance. Efficient Channel Attention (ECA) [51] avoids dimensionality reduction by modelling cross-channel interaction with an adaptive kernel size k, defined by:
$$k = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{odd}$$
where C is the channel dimension, $|t|_{odd}$ denotes the nearest odd number to t, while b and γ are set to 2 and 1, respectively. A separate insight incorporated into the Convolutional Block Attention Module (CBAM) for channel attention involves using a global max pooling operation in addition to global average pooling, providing two complementary spatial contexts prior to feature recalibration [48].
The channel module used in the FG is illustrated in Fig. 3. We extend the ideas provided by ECA and CBAM by using initial global average and global max pooling to generate two separate spatial contexts, followed by feature recalibration using an adaptive convolutional kernel size avoiding dimensionality reduction. Finally, a sigmoid activation redistributes the values between [0, 1], generating attention coefficients along the channel axis.
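As a rough illustration of this module, the sketch below computes the ECA-style adaptive kernel size with the values of b and γ quoted above, and applies a 1D convolution across channels to the two pooled contexts. Several details are assumptions made for the example rather than taken from the paper: the same kernel is shared by both contexts, the two convolved contexts are summed before the sigmoid, and the function names are hypothetical.

```python
import math
import numpy as np

def eca_kernel_size(channels, b=2, gamma=1):
    """Adaptive kernel size: |log2(C)/gamma + b/gamma| mapped to an odd integer."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1

def channel_attention(x, weights):
    """x: (H, W, C) feature map; weights: 1D kernel of odd length k."""
    avg_ctx = x.mean(axis=(0, 1))  # global average pooling over space -> (C,)
    max_ctx = x.max(axis=(0, 1))   # global max pooling over space -> (C,)
    # 1D convolution across neighbouring channels, no dimensionality reduction.
    z = (np.convolve(avg_ctx, weights, mode="same")
         + np.convolve(max_ctx, weights, mode="same"))
    return 1.0 / (1.0 + np.exp(-z))  # channel attention coefficients in (0, 1)

channels = 32
k = eca_kernel_size(channels)            # 7 for C = 32 with b = 2, gamma = 1
rng = np.random.default_rng(0)
features = rng.normal(size=(16, 16, channels))
kernel = rng.normal(size=k) / k          # placeholder for learned weights
attn = channel_attention(features, kernel)
print(k, attn.shape)
```

With these settings the kernel grows slowly with channel count, for example from 7 at 32 channels to 9 at 64 channels, so wider layers model interaction across more neighbouring channels.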

Spatial attention module
Complementary to the channel attention module, spatial attention modules involve feature aggregation along the channel axis [47,48,52]. While dimensional reduction is not an issue for spatial attention modules, the replacement of fully connected layers with a convolutional layer requires an additional kernel size parameter. Larger kernel sizes provide a larger receptive field, with better performance but at the cost of computational efficiency [52]. The spatial attention module used in the FG is illustrated in Fig. 3. Again, we extend the ideas provided by ECA and CBAM by using initial global average and global max pooling along the channel axis, generating two separate channel contexts, followed by spatial recalibration with an adaptive convolutional kernel size. In contrast to ECA, the spatial dimension is inversely proportional to the channel dimension, and therefore we modify the original equation so that the kernel size k for the spatial attention module is determined from $C_{max}$, $C_0$ and $C$, where $C_{max}$ is the maximum channel dimension of the network, $C_0$ is the channel dimension of the first layer, and $C$ is the channel dimension of the current layer. The parameters b and γ are set to 2 and 1, respectively [51]. This provides an efficient compromise by scaling the kernel size in proportion to the input dimension, with larger kernel sizes reserved for larger inputs.
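A minimal NumPy sketch of this kind of spatial attention is given below. The kernel size is passed in explicitly rather than computed adaptively, and summing the two convolved contexts before the sigmoid is an assumption made for the example. The helper `conv2d_same` is a naive zero-padded 2D correlation written only for illustration.

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive 2D correlation with zero padding; output has the same size as img."""
    k = kernel.shape[0]              # assumes a square kernel of odd size
    pad = k // 2
    padded = np.pad(img, pad)
    out = np.zeros(img.shape, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def spatial_attention(x, kernel_avg, kernel_max):
    """x: (H, W, C) feature map; returns (H, W) attention coefficients."""
    avg_ctx = x.mean(axis=-1)        # average pooling along the channel axis
    max_ctx = x.max(axis=-1)         # max pooling along the channel axis
    z = conv2d_same(avg_ctx, kernel_avg) + conv2d_same(max_ctx, kernel_max)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 6, 3))
k = 3                                # illustrative kernel size, not the adaptive value
spatial_attn = spatial_attention(features,
                                 rng.normal(size=(k, k)),
                                 rng.normal(size=(k, k)))
print(spatial_attn.shape)
```

The output has one coefficient per pixel, complementing the one-coefficient-per-channel output of the channel attention module.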

Focus gate
Having introduced both spatial and channel attention modules, in this section we describe the structure of the FG (Fig. 2).
Similar to the attention gate, the gating signal is generated from the deepest layer of the network. The upsampling operation is replaced with a learnable kernel weight using a transposed convolution, but otherwise the skip connection and gating signal are resampled to matching dimensions. Following element-wise addition and non-linear activation, spatial and channel attention coefficients are processed in parallel, analogous to processing of the dorsal "where" and ventral "what" pathways, respectively, of the two-streams hypothesis for visual processing [53]. The spatial and channel attention coefficients are combined with element-wise multiplication, and passed through a tunable filter involving element-wise exponential parameterised by the focal parameter prior to resampling.
The concept of a focal parameter originates from work on loss function optimisation, where the contributions of easy examples are down-weighted, enabling the learning of harder examples [54,55]. Here, we apply the focal parameter to the matrix of attention coefficients, enhancing the contrast between foreground and background objects by controlling the degree of background suppression. Following sigmoid activation, all attention coefficient values lie in the range [0, 1]. This enables higher values of the focal parameter to significantly reduce the weights of irrelevant regions and features, while salient regions and important features are relatively spared. The effect of altering the focal parameter is illustrated in Fig. 4.
Careful tuning of the focal parameter is required, to suppress background regions while preserving attention for borders between foreground and background where attention coefficients take middle values.
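The effect of the focal parameter can be seen in a few lines of NumPy. The sketch assumes the focal filter raises each attention coefficient to the power γ, which is one reading of the "element-wise exponential" described above, and the function names are hypothetical.

```python
import numpy as np

def focal_filter(coeffs, gamma):
    """Raise attention coefficients in [0, 1] to the power gamma (assumed form)."""
    return np.power(coeffs, gamma)

def focus_gate_coefficients(spatial, channel, gamma):
    """Combine spatial (H, W, 1) and channel (1, 1, C) coefficients, then filter."""
    return focal_filter(spatial * channel, gamma)  # broadcasts to (H, W, C)

# Background-like (0.1), border-like (0.5) and foreground-like (0.9) coefficients.
coeffs = np.array([0.1, 0.5, 0.9])
filtered = {gamma: focal_filter(coeffs, gamma) for gamma in (1, 2, 3)}

spatial = np.full((4, 4, 1), 0.9)
channel = np.full((1, 1, 8), 0.5)
combined = focus_gate_coefficients(spatial, channel, gamma=2)
print(filtered[2], combined.shape)
```

With γ = 2, a background coefficient of 0.1 falls tenfold to 0.01, while a foreground coefficient of 0.9 only falls to 0.81. The border value of 0.5 drops to 0.25, which illustrates why large values of γ risk suppressing attention at polyp boundaries.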

Deep supervision
The vanishing and exploding gradients problems are well-recognised issues with training deep CNNs [56,57]. The Focus U-Net incorporates two separate, complementary mechanisms to overcome this. Firstly, short-range skip connections, in addition to the long-range skip connections characteristic of the U-Net, allow the error signal to propagate to earlier layers more directly. However, this comes at the cost of computational efficiency, and therefore for the Focus U-Net, we leverage filter factorisation introduced by the Inception network and incorporated into the MultiResUNet, providing an efficient implementation while maintaining performance gains [58,59].
In contrast, deep supervision encourages semantic discrimination of intermediate feature maps at each level by assigning a loss to outputs at multiple layers [60,61]. Equal weighting of outputs produces sub-optimal results, because training converges to solutions that favour the performance of deeper layers at the cost of the performance of the final layer. To overcome this, more complicated solutions have been developed, such as multi-scale training [55], or fine-tuning using a fully connected layer [41]. To preserve efficiency, here we assign weights w to different output layers according to Eq. (3), where the stride length and width refer to the final transposed convolution stride dimensions required to resample the feature map to the original image dimension. Intuitively, higher weights are therefore assigned to the layers requiring a smaller degree of upsampling, with the greatest weight assigned to the final output, followed by an exponential decrease in weighting with increasing depth of the network.
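The weighting scheme can be sketched as follows. The example assumes that each weight is the reciprocal of the product of the stride length and stride width, which matches the description of an exponential decrease with depth but is an assumption about the exact form of Eq. (3). The function names and the per-output loss values are hypothetical.

```python
def deep_supervision_weights(strides):
    """strides: list of (stride_length, stride_width) needed to reach full resolution.

    Assumed weighting: w = 1 / (stride_length * stride_width).
    """
    return [1.0 / (length * width) for length, width in strides]

def weighted_total_loss(losses, weights):
    """Weighted sum of the losses computed at each supervised output."""
    return sum(w * l for w, l in zip(weights, losses))

# Final output needs no upsampling; deeper outputs need strides of 2, 4 and 8.
strides = [(1, 1), (2, 2), (4, 4), (8, 8)]
weights = deep_supervision_weights(strides)

example_losses = [0.20, 0.30, 0.40, 0.50]  # hypothetical per-output losses
total = weighted_total_loss(example_losses, weights)
print(weights, total)
```

Under this assumption the final output dominates the objective, and each deeper output contributes a quarter of the weight of the one above it.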

Hybrid Focal loss
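The training of neural networks is based on solving the optimisation problem defined by the loss function. For semantic segmentation tasks, a popular choice of loss function is the sum of the Dice similarity coefficient (DSC) loss and cross entropy (CE) loss:
$$\mathcal{L}_{DSC+CE} = \mathcal{L}_{DSC} + \mathcal{L}_{CE}$$
with:
$$DSC = \frac{2TP}{2TP + FP + FN} \quad (5)$$
$$\mathcal{L}_{DSC} = 1 - DSC$$
$$\mathcal{L}_{CE}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$$
where TP, FP and FN refer to true positives, false positives and false negatives respectively, and $y, \hat{y} \in \{0, 1\}^N$, where $\hat{y}$ refers to the predicted value and $y$ refers to the ground truth label. However, with class imbalanced tasks such as polyp segmentation, the resulting segmentation using the Dice loss often leads to high precision but low recall rate [62]. By weighting false negative predictions more heavily, the Tversky loss improves recall-precision balance:
$$\mathcal{L}_{TL} = 1 - TI$$
where the Tversky index (TI) is defined as:
$$TI = \frac{\sum_{i=1}^{N} p_{0i}\, g_{0i}}{\sum_{i=1}^{N} p_{0i}\, g_{0i} + \alpha \sum_{i=1}^{N} p_{0i}\, g_{1i} + \beta \sum_{i=1}^{N} p_{1i}\, g_{0i}}$$
Here $p_{0i}$ is the probability of pixel i belonging to the foreground class and $p_{1i}$ is the probability of pixel i belonging to the background class; $g_{0i}$ is 1 for foreground and 0 for background, and conversely $g_{1i}$ takes values of 1 for background and 0 for foreground. The parameters α and β weight the false positive and false negative terms, respectively. Complementary to weighting the positive and negative examples, applying focal parameters to both the Tversky and cross entropy loss enables the downweighting of background objects in favour of foreground object segmentation, and produces the Focal Tversky loss and Focal loss, respectively [54,55]:
$$\mathcal{L}_{FTL} = (1 - TI)^{\gamma}$$
$$\mathcal{L}_{FL} = -\alpha\,(1 - p_t)^{\gamma} \log(p_t)$$
where α controls the class weights, γ is the focal parameter of the respective loss and $p_t$ is the predicted probability of the ground truth class.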
Finally, we define the Hybrid Focal loss (HFL) as the sum of the Focal Tversky loss and Focal loss:
$$\mathcal{L}_{HFL} = \mathcal{L}_{FTL} + \mathcal{L}_{FL}$$
To mitigate suppression of the loss near convergence, we supervise the last layer without the focal parameters [55].
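For a binary foreground/background task, the Hybrid Focal loss can be sketched in NumPy as below. This is an illustrative sketch, not the released implementation: it assumes the standard forms of the Tversky index and Focal loss from the cited literature, applies γ directly as the exponent of the Focal Tversky term, averages the Focal loss over pixels, and uses the hyperparameter values reported later in the experimental setup as defaults.

```python
import numpy as np

EPS = 1e-7  # numerical stability constant (an implementation choice)

def tversky_index(y_true, y_pred, alpha=0.3, beta=0.7):
    """y_true in {0, 1}, y_pred in [0, 1]; foreground is class 1."""
    tp = np.sum(y_pred * y_true)
    fp = np.sum(y_pred * (1 - y_true))
    fn = np.sum((1 - y_pred) * y_true)
    return (tp + EPS) / (tp + alpha * fp + beta * fn + EPS)

def focal_tversky_loss(y_true, y_pred, gamma=0.75):
    ti = tversky_index(y_true, y_pred)
    return max(0.0, 1.0 - ti) ** gamma  # assumed exponent convention

def focal_loss(y_true, y_pred, alpha=0.25, gamma=2.0):
    y_pred = np.clip(y_pred, EPS, 1 - EPS)
    p_t = np.where(y_true == 1, y_pred, 1 - y_pred)     # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)   # class weighting
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

def hybrid_focal_loss(y_true, y_pred):
    return focal_tversky_loss(y_true, y_pred) + focal_loss(y_true, y_pred)

y_true = np.array([1, 1, 0, 0, 0, 0])
good_pred = np.array([0.9, 0.8, 0.1, 0.1, 0.0, 0.1])
poor_pred = np.array([0.2, 0.3, 0.6, 0.1, 0.4, 0.2])
loss_good = hybrid_focal_loss(y_true, good_pred)
loss_poor = hybrid_focal_loss(y_true, poor_pred)
print(loss_good, loss_poor)
```

A prediction that misses most of the foreground, such as `poor_pred`, is penalised by both terms: the Tversky term through the heavily weighted false negatives, and the Focal term through the low probability assigned to the true class.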

Datasets
The CVC-ClinicDB database consists of 612 frames containing polyps with image resolution 288 × 368 pixels, generated from 23 video sequences from 13 different patients using standard optical colonoscopy interventions. The Kvasir-SEG database consists of 1000 polyp images collected and verified by experienced gastroenterologists. Images vary in size from 332 × 487 to 1920 × 1072 pixels. CVC-ColonDB consists of 300 images of resolution 500 × 574 pixels obtained from 15 video sequences with a random sample of 20 frames per sequence. ETIS-Larib PolypDB similarly consists of 300 images, with image resolution 1225 × 966 pixels. Lastly, CVC-T, the EndoScene test set, consists of 182 frames containing polyps from 8 patients derived from either the CVC-ClinicDB or CVC-ColonDB datasets, with image resolutions of 288 × 384 or 500 × 574 pixels.

Experimental setup and implementation details
For our experiments, we use the Medical Imaging Segmentation with Convolutional Neural Networks (MIScnn) open-source Python library [68]. For all datasets, images and associated ground truth masks are provided in the png file format. For the Kvasir-SEG dataset, we resize all images to 512 × 512 pixels following pre-processing methods used in previous models [30,31], but otherwise resize images to 288 × 384 pixels for all other datasets. Pixel values are normalised to [0, 1] using the z-score. We perform full-image analysis with a batch size of 16, except for the Kvasir-SEG dataset where the large image sizes required the batch size to be reduced to 8. We use the Focus U-Net architecture as described previously with a final softmax activation layer.
For the ablation studies, we use the CVC-ClinicDB dataset, with five-fold cross validation using random assignment. We evaluate the baseline performance of the U-Net [23] and Attention U-Net [41], and sequentially assess the performance with subsequent additions of the Focus Gate, Hybrid Focal loss, short-range skip connections and deep supervision. Similar to hyperparameter selection of Focal loss functions [54,55,69], we perform a grid-search, selecting values for the focal parameter γ ∈ [1,3]. Model parameters are initialised with Xavier initialisation, and each model is trained for 100 epochs using Stochastic Gradient Descent with Nesterov momentum (μ = 0.99). We set the initial learning rate at 0.01, and follow a polynomial learning rate decay schedule [70]. For fairer comparison, we do not apply any data augmentation techniques at this stage.
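A polynomial decay schedule of this kind can be sketched as follows. The initial learning rate of 0.01 and the 100 epochs come from the setup above, whereas the exponent of 0.9 is an assumption (a commonly used value for this type of schedule) and the function name is hypothetical.

```python
def poly_learning_rate(epoch, max_epochs, initial_lr=0.01, power=0.9):
    """Polynomial decay: initial_lr * (1 - epoch / max_epochs) ** power.

    power=0.9 is an assumed value, not one stated in the text.
    """
    return initial_lr * (1 - epoch / max_epochs) ** power

max_epochs = 100
schedule = [poly_learning_rate(epoch, max_epochs) for epoch in range(max_epochs)]
print(schedule[0], schedule[50], schedule[-1])
```

The learning rate starts at the initial value and decreases monotonically towards zero as training approaches the final epoch.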
In contrast, when aiming for state-of-the-art results on the CVC-ClinicDB dataset, we train our final model using five-fold cross validation for 500 epochs and use the following data augmentation techniques: scaling, rotation, elastic deformation, mirror and gamma transformations. For both evaluation on the Kvasir-SEG dataset and evaluation on the combination of all five public datasets, we follow the single train-test split used in Refs. [30,31], and train each model for 1000 epochs with the same data augmentation settings for result reproducibility and comparisons.
For the Hybrid Focal loss, we follow the optimal hyperparameters reported in the original studies. We set α = 0.3 and β = 0.7 for the Tversky index, α = 0.25, γ = 2 for the Focal loss and α = 0.3, β = 0.7 and γ = 3/4 for the Focal Tversky loss [54,55,62]. For all cases, the validation loss is evaluated at the end of each epoch, and the model with the lowest validation loss is selected as the final model. All experiments are programmed using Keras with TensorFlow backend and trained with NVIDIA P100 GPUs, with CUDA version 10.2 and cuDNN version 7.6.5. Source code is available at: https://github.com/mlyg/Focus-U-Net.

Evaluation metrics
To assess segmentation accuracy, we follow recommendations from Ref. [64], and use DSC and intersection over union (IoU) as the two main metrics. DSC is previously defined in Eq. (5), and IoU is defined as:
$$IoU = \frac{TP}{TP + FP + FN}$$
The IoU metric penalises single instances of poor pixel classification more heavily than DSC, providing similar but complementary perspectives on assessing segmentation accuracy. We further assess recall and precision:
$$Recall = \frac{TP}{TP + FN}$$
$$Precision = \frac{TP}{TP + FP}$$
In the context of polyp segmentation, recall, also known as sensitivity, measures the proportion of the pixels corresponding to the polyp that are correctly identified. In contrast, precision, also known as the positive predictive value, measures the proportion of pixels labelled as polyp that truly correspond to the polyp. While both are accounted for in the DSC metric, measuring recall and precision provides additional information regarding the false positive and false negative rates.
Finally, we evaluate model efficiency by calculating the frames per second (FPS) using the mean inference time:
$$FPS = \frac{1}{\text{mean inference time per image (s)}}$$
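These metrics can be computed directly from a pair of binary masks, as in the sketch below. The function names and the toy masks are hypothetical, and FPS is assumed to be the reciprocal of the mean per-image inference time in seconds.

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives for binary masks."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return tp, fp, fn

def segmentation_metrics(y_true, y_pred):
    tp, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "iou": tp / (tp + fp + fn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

def frames_per_second(mean_inference_time_s):
    """Assumed definition: FPS = 1 / mean inference time per image (seconds)."""
    return 1.0 / mean_inference_time_s

# Toy 2 x 3 masks: TP = 2, FP = 1, FN = 1.
mask_true = np.array([[0, 1, 1], [0, 1, 0]])
mask_pred = np.array([[0, 1, 0], [1, 1, 0]])
metrics = segmentation_metrics(mask_true, mask_pred)
fps = frames_per_second(0.04)
print(metrics, fps)
```

For these toy masks the DSC is 4/6 ≈ 0.667 while the IoU is 2/4 = 0.5, showing how the same errors are penalised more heavily by IoU than by DSC.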

Experimental results
We first perform a series of ablation studies to evaluate individual components of the Focus U-Net, followed by separate evaluations on the CVC-ClinicDB and Kvasir-SEG datasets, and finally evaluate against a test set combining five public polyp datasets.
The results from the ablation study are shown in Table 1. Performance gains are observed with successive addition of each component, and with all components present there is a significant improvement with a DSC score of 0.875 ± 0.016 compared to the U-Net (0.828 ± 0.021) and Attention U-Net (0.801 ± 0.019). The Focus U-Net achieves similar FPS performance to the Attention U-Net and comparable FPS performance to the U-Net, so the improvement in accuracy comes with minimal loss of efficiency.
The results for the CVC-ClinicDB dataset are shown in Table 2.
The Focus U-Net achieves state-of-the-art results with a mDSC score of 0.941 and a mIoU score of 0.893, outperforming the ResUNet++ with Conditional Random Field (CRF) and DoubleU-Net. Focus U-Net also has the best recall-precision balance, while PolypSegNet achieves the highest precision at the cost of recall, and conversely the ResUNet++ with CRF achieves high recall at the cost of precision.
Next, we evaluate our model on the Kvasir-SEG dataset. The results are shown in Table 3. The Focus U-Net achieves state-of-the-art results with a mDSC score of 0.910 and mIoU score of 0.845. The highest mIoU is achieved by HarDNet-MSEG [31].
Finally, Table 4 shows the results for the evaluation on five public polyp datasets. It is worth noting that, for a fair comparison, the evaluation metrics are shown only for the approaches that focused on segmentation performance over computational efficiency.
The accuracy of segmentations obtained from the intermediate layer highlights the ability for the deepest layers to localise the polyp effectively. The Focus U-Net generalises well with consistently accurate segmentations across all datasets. For the images corresponding to the poorest segmentation quality, these are either objectively challenging polyps to identify, or in many cases poor-quality images such as in the CVC-ColonDB example.

Discussion and conclusion
In this paper, we introduce a novel dual attention-gated U-Net architecture, named the Focus U-Net, which uses a Focus Gate to encourage learning of salient regions combined with a focal parameter controlling suppression of irrelevant background regions. Moreover, with the additions of short-range skip connections and deep supervision, as well as optimisation based on the Hybrid Focal loss, the Focus U-Net outperforms the state-of-the-art results across five public polyp datasets. Importantly, the proposed architecture performs consistently well across all datasets, demonstrating an ability to generalise to unseen data from different datasets. Visualising the resulting polyp segmentations confirms the segmentation quality, with poorer segmentations associated with a combination of either poorer image quality or objectively more challenging polyps to identify.
The proposed Focus U-Net is the latest addition to lightweight, yet accurate, polyp segmentation models, achieving state-of-the-art results with a mDSC of 0.878 and mIoU score of 0.809 when evaluated on the combination of five public datasets, a 14% and 15% improvement over the previous state-of-the-art results from HarDNet-MSEG with a mDSC of 0.768 and mIoU of 0.702, respectively.
While these results are promising, it is important to determine whether such a model may be applied in clinical practice. Given that colonoscopy involves recordings of live video, a model with a fast inference time is required to process images in real-time. Accordingly, the Focus U-Net architecture is efficiently designed, with both efficient channel and spatial attention mechanisms, as well as a lightweight U-Net backbone, with comparable FPS performance to the standard U-Net [79]. With polyp miss rates as high as 26% reported for small adenomas [9], the primary advantage of AI-assisted colonoscopy is to aid clinicians in reducing polyp miss rates. However, a secondary advantage of segmentation-based computer-aided detection is providing an accurate and operator-independent estimate of the polyp size, an important factor in guiding biopsy decisions that may be required during colonoscopy.
There are several limitations associated with our current study. Firstly, the datasets used to train our model consist of images all containing polyps, whereas in practice the majority of live video data would not contain a polyp. However, in terms of model training, it has been observed that training with images in the absence of polyps results in poorer generalisation [80]. In terms of model performance, we would expect a higher false positive rate. This is not as undesirable as the converse of a high false negative rate, because the purpose of the computer-aided system is to focus the operator's attention on highlighted regions that may contain missed polyps [81].
While colonoscopy remains the gold-standard for investigating suspected CRC, CT virtual colonography is a relatively newer method for bowel cancer screening that offers non-invasive visualisation of the colon [82]. Our model is not restricted to polyps imaged in visible light and is equally applicable to polyp detection using CT colonography. However, these newer modalities present additional challenges, such as fluid submersion obscuring polyps [83]. In fact, the scope for using the Focus U-Net architecture is not limited to colorectal polyps, and extends to any image segmentation problem where there is the issue of class imbalance and a requirement for efficiency.

Declaration of competing interest
The authors of this paper declare that they have no conflict of interest.