Vessel-CAPTCHA: an efficient learning framework for vessel annotation and segmentation

Deep learning techniques for 3D brain vessel image segmentation have not been as successful as in the segmentation of other organs and tissues. This can be explained by two factors. First, deep learning techniques tend to perform poorly on objects that are small relative to the full image. Second, due to the complexity of vascular trees and the small size of vessels, it is challenging to obtain the amount of annotated training data typically needed by deep learning methods. To address these problems, we propose a novel annotation-efficient deep learning vessel segmentation framework. The framework avoids pixel-wise annotations, only requiring weak patch-level labels to discriminate between vessel and non-vessel 2D patches in the training set, in a setup similar to the CAPTCHAs used to differentiate humans from bots in web applications. The user-provided weak annotations are used for two tasks: 1) to synthesize pixel-wise pseudo-labels for vessels and background in each patch, which are used to train a segmentation network, and 2) to train a classifier network. The classifier network allows additional weak patch labels to be generated, further reducing the annotation burden, and it acts as a noise filter for poor quality images. We use this framework for the segmentation of the cerebrovascular tree in Time-of-Flight angiography (TOF) and Susceptibility-Weighted Images (SWI). The results show that the framework achieves state-of-the-art accuracy, while reducing the annotation time by ~77% w.r.t. learning-based segmentation methods using pixel-wise labels for training.


Introduction
The segmentation of the 3D brain vessel tree is crucial for the diagnosis, management, treatment and intervention of a wide range of conditions with a vast population-level impact (World Health Organization, 2020). Due to the high complexity of the cerebrovascular tree, its automatic extraction is a challenging task. Despite decades of research (Lesage et al., 2009; Moccia et al., 2018), the problem remains open.
With the advent of machine learning and, more precisely, deep learning techniques over the last decade (Litjens et al., 2017; Lundervold and Lundervold, 2019), image segmentation of organs, organ substructures, and lesions has reached state-of-the-art performance. This progress, however, has not been as fast in 3D brain vessel segmentation. Differently from the segmentation of other organs, no consolidated deep learning method has reached human performance (Bernier et al., 2018; Li et al., 2014). This work presents a novel framework to address the challenges faced by deep learning-based 3D vessel segmentation.

Taking inspiration from Completely Automated Public Turing
Test To Tell Computers and Humans Apart, better known as CAPTCHA (von Ahn and Dabbish, 2004), we initially divide the image volume into 2D image patches and we subsequently request the user to identify the patches containing a vessel or part of it. This task is common on websites to differentiate humans from bots, using image CAPTCHAs (von Ahn and Dabbish, 2004; Elson et al., 2007) of natural images. This procedure, which we denote Vessel-CAPTCHA, simplifies the annotation process by requiring 2D patch tags indicating the presence of a vessel (a part of it, or multiple vessels) and, thus, avoiding pixel-wise annotations. The user-provided patch tags are subsequently used to synthesize a pixel-wise pseudo-labeled training set in a self-supervised manner using a clustering technique. The patch-tag set and the pseudo-labeled set are then used to train the framework.
The proposed framework is composed of two networks: a segmentation network and a classification network. The segmentation network extracts vessels on a patch basis to tackle the limitations of deep nets in the segmentation of small objects.
The final volumetric segmentation is obtained by concatenating the 2D segmented patches. The classification network is used for two tasks. First, it allows the labeled data to be enlarged without the need for further user-provided annotations. Second, it may act as a second opinion (Leibig et al., 2017; Vrugt and Robinson, 2007) that provides a measure of uncertainty in low quality or complex images. We evaluate the role of the classification network as an expert opinion, where only the segmentations from patches identified as vessel patches are kept and those classified as non-vessel patches are masked out.

3D Brain Vessel Segmentation
A comprehensive collection of methods and techniques for general vascular image segmentation is reviewed in (Lesage et al., 2009; Moccia et al., 2018), where different segmentation frameworks are classified according to their characteristic strategies. Classical approaches typically rely on hand-crafted features, with image intensity-derived (Taher et al., 2020), and first (Law and Chung, 2008), second (Frangi et al., 1998; Sato et al., 1997) or higher order (Cetin and Unal, 2015) tensor-derived features among the most common. Feature extraction is followed by a vessel extraction scheme, which performs the final segmentation. Notable extraction schemes include deformable models (Klepaczko et al., 2016; Zhao et al., 2015), voting (Zuluaga et al., 2014b), tracking algorithms (Rempfler et al., 2015; Robben et al., 2016) and statistical approaches (Hassouna et al., 2006). These methods have two main drawbacks. First, they rely on hand-crafted features that need to be tuned, requiring high expertise to find a good set of parameters. Second, the extraction schemes are not fully automatic: many need manual initialization, and the final results typically call for manual correction, especially when images are noisy.
Deep learning techniques have emerged as an alternative to circumvent the difficulties of classical approaches. Existing methods have tried to explicitly address the complexity of the brain vessel tree by designing shallow convolutional neural network (CNN) architectures to avoid possible over-fitting (Phellan et al., 2017), or by partitioning the input image volume, while still relying on deeper and more powerful architectures (Ronneberger et al., 2015). Different partitioning strategies include anatomical regions (Kandil et al., 2018), 2D slices (Ni et al., 2020), 3D patches (Phellan et al., 2017; Tetteh et al., 2020) and 2D patches (Livne et al., 2019). Despite achieving accuracies similar to those of classical approaches, the main limitation towards the broader use of deep learning techniques remains the burden linked to pixel-wise data annotation, including multi-plane annotations (Phellan et al., 2017) or further pre-processing (Phellan et al., 2017; Kandil et al., 2018; Livne et al., 2019).
Patch-based approaches (Livne et al., 2019;Tetteh et al., 2020) not only aim at reducing the vessel tree's complexity, but they also try to mitigate the limitations of neural nets in the segmentation of objects occupying small portions of an image. Our work adopts a similar strategy and it builds upon the advantages of 2D patch-based approaches (Livne et al., 2019), thus making vessels cover a significant portion of the patch, while avoiding pixel-wise annotations.

Limited Supervision for Image Segmentation
Different strategies have been explored as an alternative to pixel-wise annotation (Cheplygina et al., 2019;Ørting et al., 2020;Tajbakhsh et al., 2020), a tedious and time consuming task requiring a high level of expertise. These strategies can be roughly classified, according to the type of labels they use, as partial pixel-wise labels, which include incomplete, sparse or noisy pixel-wise labels (Tajbakhsh et al., 2020); or as weak labels, which refer to high-level labels and drawing primitives (Cheplygina et al., 2019).
Partial pixel-wise labels refer to annotations where only a fraction of the pixels of the object of interest are provided (Bai et al., 2018; Çiçek et al., 2016; Liang et al., 2019; Ke et al., 2020). These labels can be provided by the user or generated by simpler methods to produce rough segmentation masks. Semi-supervised methods follow different strategies to exploit partially labeled data under the assumption that it is enough to train a segmentation model. Bai et al. (2018) used image registration to propagate user-provided labels over some image slices containing the aorta. Çiçek et al. (2016) designed the 3D-Unet to account for sparse and incomplete pixel-wise labels. Other methods resort to iterative stages of refinement (Liang et al., 2019; Ke et al., 2020). Although these methods have reported good performances in medical image segmentation (Cheplygina et al., 2019), the complexity of the 3D brain vessel tree makes pixel-wise annotation, even if partial, highly time consuming. As one of our aims is to minimize the annotation effort, our work focuses on the use of weak labels.

Weakly Supervised Learning
Weak Labels for Medical Image Segmentation. We consider two forms of weak labels for medical image segmentation tasks: image-level labels and drawing primitives. Image-level labels (Feng et al., 2017; Jia et al., 2017; Raza et al., 2019; Schlegl et al., 2015; Xu et al., 2019; Zhao et al., 2019) assign a tag or rating to an image under the assumption that images contain cluttered scenes with enough information from which a model can learn (Qi et al., 2017). In medical tasks, they have been mainly used with 2D images/slices to segment pathologies, e.g. lung nodules (Feng et al., 2017), damaged retinal tissue (Schlegl et al., 2015), brain tumors (Izadyyazdanabadi et al., 2018) or cancerous tissue (Jia et al., 2017; Kraus et al., 2016; Lerousseau et al., 2020; Xu et al., 2014, 2019). To a lesser extent, they have been used for the segmentation of organ structures, e.g. the optic disc (Zhao et al., 2019). Despite the good reported performances and the annotation time savings they represent, image tags have not been used for 3D vessel segmentation.
Drawing primitives include bounding boxes and contouring shapes (Cheplygina et al., 2016; Gao et al., 2012; Dai et al., 2015; Li et al., 2018; Rajchl et al., 2017; Wang et al., 2018), scribbles and lines (Can et al., 2018; Lin et al., 2016; Matuszewski and Sintorn, 2018; Wang et al., 2015) and clicks (Bruggemann et al., 2018). In 3D vessel segmentation, bounding boxes have been used for aortic segmentation, under the assumption that the aorta is a compact structure which can be enclosed within a bounding box (Pepe et al., 2020). This assumption does not hold for highly sparse bifurcated trees, such as the brain vascular tree, where a 3D bounding box would nearly cover the full brain. Moreover, if an image is analyzed in 2D, the vessel tree appears as a series of disconnected blobs or elongated structures, which challenges the use of 2D contouring shapes. Koziński et al. (2020) address this limitation by using 2D annotations in Maximum Intensity Projections of 3D vascular images. To some extent, these can be considered 2D image scribbles of varying density for the original 3D volume.
The framework, however, requires full 2D pixel-wise annotations. Although the scheme significantly reduces the labeling time, more than four hours are needed to generate sufficiently dense 2D annotations that do not compromise performance. Finally, clicks are common in classical 3D vessel segmentation approaches (Benmansour and Cohen, 2009;Moriconi et al., 2019) to provide seed-points, but no works yet integrate them in a weakly supervised learning framework. This may be due to the complexity of the 3D brain vessel tree, where a single click might not carry sufficient information to train a model.
Our work relies on image tags. To cope with the granularity and sparse appearance of vessels, we use 2D patch-level tags, in the form of clicks over a grid. A click selects the patches containing at least one vessel or a part of it. We denote this annotation scheme the Vessel-CAPTCHA.
Weakly Supervised Learning with Image Tags. Our weakly supervised vessel segmentation framework using image tags can be cast as a multi-instance learning (MIL) problem (Dietterich et al., 1997;Maron and Lozano-Pérez, 1997;Cheplygina et al., 2019), where a bag corresponds to an image patch and the instances are the image pixels. A bag is considered positive (a vessel patch) if at least one instance within the bag is positive (a vessel pixel). The goal is then to infer the key instances (Liu et al., 2012), i.e. the vessel pixels, that activate the bag label.
Standard MIL segmentation approaches, which have been less studied than their classification counterpart (Campanella et al., 2019; Hou et al., 2016; Quellec et al., 2012), follow a multi-stage strategy. In a first stage common to MIL segmentation and classification, they train a model to learn instance-level probabilities of belonging to the positive class. At a second stage, these probabilities are used to obtain pixel-wise labels, which can be considered as the segmentation output (Xu et al., 2014; Kraus et al., 2016) or as pseudo-labels to train a segmentation model in a supervised way (Lerousseau et al., 2020; Xu et al., 2019). A main limitation is that the instance-level probabilities are not originally conceived to generate segmentations, but to serve as inputs for bag classification. Therefore, the segmentation results may be poor. Mitigation strategies rely on area constraints (Jia et al., 2017; Lerousseau et al., 2020); robust instance selection operations (Kraus et al., 2016; Xu et al., 2019); post-processing (Kraus et al., 2016); or enriched information, such as supplementary instance-level inputs (Shin et al., 2019) or image landmarks (Schlegl et al., 2015). However, these strategies often come at the cost of further required user inputs (Jia et al., 2017; Schlegl et al., 2015; Shin et al., 2019).
Attention-based MIL (Ilse et al., 2018), an alternative to standard MIL, uses attention mechanisms (Niu et al., 2021), such as class activation maps (CAM) (Zhou et al., 2016), under the assumption that the discriminative regions identified by a network correspond to the key instances, i.e. the pixels to segment (Ahn and Kwak, 2018;Feng et al., 2017;Hong et al., 2017;Izadyyazdanabadi et al., 2018;Ouyang et al., 2019;Shen et al., 2021;Zhao et al., 2019). Since attention mechanisms focus on the localization of the most discriminative regions, they suffer from the same limitations as standard MIL, which lead to inaccurate segmentation masks. For instance, some works (Shen et al., 2021) consider the resulting mask as a localization/detection mask and not as a segmentation one. Others have attempted to refine the attention maps through pixel similarity propagation (Ahn and Kwak, 2018;Zhao et al., 2019), feature assembling (Izadyyazdanabadi et al., 2018) and post-processing stages (Krähenbühl and Koltun, 2011), which all lead to increased model complexity. To avoid the increased complexity, other works propose manual intervention (Feng et al., 2017) or the use of some pixel-wise annotated data (Ouyang et al., 2019;Zou et al., 2021), leading to more user-required inputs.
A last set of methods favors the use of simpler techniques to generate an initial pseudo-labeled set that can be then refined using a learning-based approach. Luo et al. (2020) relied on traditional saliency methods along with a quality control step for object detection from videos. Hou et al. (2016) used a mixture of Gaussians in cancer tissue classification. Lu et al. (2021) used a simple threshold to segment tissue regions, which are refined with a CAM to classify cancerous tissue.
While the cerebrovascular tree is a highly complex structure, the typical dataset size available for training a model to segment it is relatively small. Therefore, avoiding high model complexity is critical in 3D brain vessel segmentation (Phellan et al., 2017). Our work favors model simplicity and minimal user interaction. Thus, similarly to (Hou et al., 2016; Luo et al., 2020; Lu et al., 2021), we use a simple self-supervised technique, K-means, to generate pixel-wise pseudo-labels. As in other weakly supervised approaches (Feng et al., 2017; Lerousseau et al., 2020; Luo et al., 2020; Xu et al., 2019), we use the pseudo-labeled set as the input to a supervised training phase that learns to segment the brain vessel tree, without the need for any additional user inputs.

Biomedical Image Classification
Our work explores the use of the Unet (Ronneberger et al., 2015) and the Pnet, two networks originally conceived for medical image segmentation, for the classification tasks of our framework. Their adaptation to a classification task can be considered as a MIL formulation, where instance-level information, i.e. pixels, is used to predict a bag label, i.e. the patch tag. As in most biomedical classification tasks, previous MIL-based biomedical image classification works (Campanella et al., 2019; Qi et al., 2017) rely on customized versions of VGG-16 (Simonyan and Zisserman, 2015) and ResNet, the most popular architectures for natural image classification. Others (Hou et al., 2016) use task-specific architectures adapted from general purpose networks, such as end-to-end CNNs. However, no major performance differences are currently found among them (Lundervold and Lundervold, 2019).

Contributions
The contributions of this work are four-fold: 1. we introduce an annotation and segmentation scheme, the Vessel-CAPTCHA, to reduce the labeling burden of 3D brain vascular images, consisting of two phases: a first phase where the user provides tags at the 2D image patch level, and a second phase where pixel-wise pseudo-labels are obtained, in a self-supervised fashion, using only the user-provided patch tags as input.
2. We propose a weakly supervised learning framework on 2D image patches to achieve 3D brain vessel segmentation. To circumvent the problems faced by deep neural networks when segmenting small objects, the framework uses a 2D patch-based segmentation network trained with 2D pixel-wise pseudo-labeled patches synthesized by the Vessel-CAPTCHA annotation scheme using the weak user-provided patch tags as input.
3. We investigate the use of network architectures specifically designed for medical imaging tasks to classify 2D image patches (vessel vs. non-vessel). The classifier network is used to pseudo-label a potential training set without further user effort, and it may act as a second opinion for segmentation masks obtained from low quality images.
4. Using two different image modalities, we demonstrate that the proposed framework achieves state-of-the-art performance for 3D brain vessel segmentation, while significantly reducing the annotation burden by ∼77% compared to the annotation time required by other deep learning-based methods.
To foster reproducibility and encourage other researchers to build upon our results, the source code of our framework is made publicly available in a GitHub repository.

Method
The proposed Vessel-CAPTCHA framework for 3D vessel segmentation is depicted in Fig. 1. In the following, we introduce the Vessel-CAPTCHA annotation scheme and we describe how pixel-wise pseudo-labels are synthesized from the user-provided weak patch labels in a self-supervised way (Sec. 2.1). In Sec. 2.2, we present the two networks composing the proposed framework: a classifier network and a segmentation network. Sec. 2.3 explains how the classifier network can be used to enlarge the set of weak pixel-wise annotations, providing a larger set to train the 2D-WnetSeg. Finally, Sec. 2.4 briefly explains how to segment unseen images using the proposed framework.

The Vessel-CAPTCHA Annotation Scheme
We consider a dataset I of training images. Given an image I ∈ I, the volume is divided into 2D patches {X_k}, k = 1, ..., P_s, each defined over a pixel domain D_k with pixel coordinates (i, j) ∈ D_k, and the user annotations over a patch are collected in a binary map U_k. The set of annotations for a given patch is summarized by an indicator function f : U_k → {0, 1}, which takes value 1 if at least one pixel in the patch was labeled with 1:

f(U_k) = 1 if \max_{(i,j) \in D_k} U_k(i, j) = 1, and 0 otherwise.   (1)

The training set of patch-level labels for the image I is defined by the set

T_P^I = \{(X_k, f(U_k))\}_{k=1}^{P_s}.   (2)

This set is therefore composed of patches and associated indicators/tags of the presence of a vessel according to the user's annotation. Based on the training set T_P^I, we estimate approximate vessel masks via a model fitting procedure. For every patch we define a function M_k : D_k → {0, 1}, which assigns a label to each pixel coordinate according to the following scheme:

M_k(i, j) = K_M(X_k(i, j)),

where K_M is a K-means predictor trained on the intensity values of the patch {X_k(i, j), (i, j) ∈ D_k}. By specifying K = 2 clusters we therefore obtain a rough estimate of the low-high intensity partitioning of the patch. The ensemble of estimated partitions across patches is denoted as M_s = \{M_k\}_{k=1}^{P_s}, and we define the pixel-wise labeled training set for the image I as T_M^I = \{(X_k, M_k)\}_{k=1}^{P_s}. Finally, the patch- and pixel-level training sets across the image dataset are denoted by

T_P = \bigcup_{I \in I} T_P^I   (3)

and

T_M = \bigcup_{I \in I} T_M^I,   (4)

respectively.

Fig. 1. The Vessel-CAPTCHA framework. At Stage 1, an image grid with patch size 32×32 covering the brain tissue is presented to the user for annotation. The user selects the patches which contain at least one vessel or a part of it. The process, which we denote the Vessel-CAPTCHA annotation scheme, is done for every axial slice in an image volume. This weakly annotated set T_P is used to synthesize pixel-wise pseudo-labels for every patch using the K-means algorithm. The resulting pseudo-labeled set is denoted T_M. At Stage 2, T_P is used to train a classification network (2D-PnetCl) and T_M is used to train a segmentation network (2D-WnetSeg). In the segmentation network training, it is possible to enlarge the set of pseudo-labeled data through an optional data augmentation step. For an unseen image, the final volumetric segmentation is obtained by concatenating the 2D segmentations obtained from 2D-WnetSeg. Optionally, the classification network can be used as a second opinion to refine the segmentation results. In that case, only 2D segmentations from patches classified as vessel ones are considered in the final volume segmentation.
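For illustration, the per-patch mask estimation M_k can be sketched as follows with scikit-learn's KMeans; this is a minimal sketch, in which the vessel_is_bright flag encodes the modality-dependent cluster-to-class rule described later (Sec. 3.3), and vessel_patches is a hypothetical list of user-tagged 32×32 patches.

import numpy as np
from sklearn.cluster import KMeans

def pseudo_label_patch(patch, vessel_is_bright=True):
    """Synthesize a pixel-wise pseudo-label mask for one vessel patch.

    patch: 2D numpy array of intensities (e.g. 32x32).
    vessel_is_bright: True for TOF (vessels hyper-intense), False for SWI.
    """
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    labels = km.fit_predict(patch.reshape(-1, 1)).reshape(patch.shape)
    # Identify the cluster whose mean intensity corresponds to vessels.
    means = [patch[labels == c].mean() for c in (0, 1)]
    vessel_cluster = int(np.argmax(means)) if vessel_is_bright else int(np.argmin(means))
    return (labels == vessel_cluster).astype(np.uint8)

# Example: build the pseudo-labeled set T_M from user-tagged vessel patches
# (`vessel_patches` is a hypothetical list of 32x32 arrays tagged as vessel patches).
# T_M = [(p, pseudo_label_patch(p, vessel_is_bright=True)) for p in vessel_patches]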

Segmentation Network
The segmentation network learns from the input training set T_M how to segment 2D image patches. It is trained with a loss based on the Dice similarity coefficient, as proposed by Milletari et al. (2016), which is specifically tailored to segmentation tasks in medical images.
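A minimal TensorFlow/Keras sketch of such a soft Dice loss, in the spirit of Milletari et al. (2016), is given below; it is a generic formulation for illustration, not necessarily the exact one used in our implementation.

import tensorflow as tf

def soft_dice_loss(y_true, y_pred, smooth=1.0):
    """Soft Dice loss over a batch of 2D probability maps."""
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(y_pred, [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    denom = tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f)
    # Dice = 2|A∩B| / (|A|+|B|); the loss is its complement.
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)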
The segmented 2D patches are concatenated to reconstruct the original segmented 3D image volume. For this task, we use a segmentation network connecting two 2D-Unets in cascade (Dias et al., 2019), which we denote 2D-WnetSeg (Fig. 3). The network is trained on T_M, the set of 2D image patches with pixel-wise pseudo-labels, to tackle the limitations of neural networks in the segmentation of objects with a small object-to-image ratio. The human cerebrovascular system has an intricate shape, with large and smaller blood vessels that mainly differ in spatial scale but share similar shapes. The selected self-supervised method, K-means, favors the over-segmentation of larger vessels. Thanks to a set of max-pooling layers, the first 2D-Unet learns spatial scaling features from the input training data. Thus, it can recover rough-mask labels for smaller vessels not initially extracted by K-means. This means that the first Unet acts as a refinement module that corrects the initial masks by inferring missing vessels based on the structural redundancy of the cerebrovascular tree. The second Unet, which has an architecture similar to the first one, receives as input the output of the first Unet with the recovered labels for small vessels. As a result, the 2D-WnetSeg is able to learn vessels even from a pseudo-labeled training set with imperfect or noisy labels.
The smaller vessels in the brain vessel tree may disappear in very deep networks due to the subsampling layers. To tackle this, the 2D-WnetSeg has 14 blocks of convolutional layers structured into 4 levels. In this, it differs from previously proposed cascaded networks (Dias et al., 2019) and from the Unet-based vessel segmentation of (Livne et al., 2019). This also contributes to reducing the number of trainable parameters: the network in (Livne et al., 2019) has about 3.1×10^7 trainable parameters, whereas the 2D-WnetSeg has only about 1.6×10^7.
In our architecture, the first 7 blocks form the first Unet and the remaining 7 blocks form the second one. Each block consists of 2 convolutional layers with a 3×3 kernel, each followed by a rectified linear unit (ReLU) and using zero-padding so that the output has the same shape as the input. A drop-out layer is applied between them. As the input proceeds through the levels of the contracting path, its resolution is halved through a 2×2 max-pooling operation with stride 2, applied at every level except the bottom one, and the number of feature channels is doubled at each level. The right portion of each half-network (Unet), i.e. the expansive path, consists of blocks with up-sampling and concatenation at each level; it expands the spatial support of the lower resolution feature maps to assemble the necessary information and recover the original input size. Finally, we employ skip-connections from the shallow layers of the first 2D-Unet to the deeper layers of the second one, at the same levels, to ease the training of the network.
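The Keras sketch below illustrates one way to assemble such a 14-block, 4-level cascade; the base number of filters and the exact wiring of the inter-Unet skip connections are assumptions made for illustration, not the exact published configuration.

from tensorflow.keras import layers, Model, Input

def conv_block(x, filters, dropout=0.1):
    # Two 3x3 convolutions with ReLU, zero-padding ('same') and a dropout layer in between.
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)
    return x

def unet_half(x, base_filters=16, extra_skips=None):
    # One 7-block Unet: 3 encoder blocks + bottom block + 3 decoder blocks (4 levels).
    skips, f = [], base_filters
    for level in range(3):                          # contracting path
        if extra_skips is not None:                 # inter-Unet skips at the same level (assumed wiring)
            x = layers.Concatenate()([x, extra_skips[level]])
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2, strides=2)(x)    # halve the resolution
        f *= 2                                      # double the feature channels
    x = conv_block(x, f)                            # bottom block
    for level in reversed(range(3)):                # expansive path
        f //= 2
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skips[level]])
        x = conv_block(x, f)
    return x, skips

def build_2d_wnetseg(patch_size=96, base_filters=16):
    inp = Input((patch_size, patch_size, 1))
    x1, skips1 = unet_half(inp, base_filters)                    # first Unet: mask refinement
    x2, _ = unet_half(layers.Concatenate()([inp, x1]),           # second Unet sees image + refined features
                      base_filters, extra_skips=skips1)
    out = layers.Conv2D(1, 1, activation='sigmoid')(x2)          # pixel-wise vessel probability
    return Model(inp, out)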

Networks for vessel vs. non-vessel patch classification
The classification network is trained on T_P to discriminate between vessel and non-vessel patches in unseen data. This discrimination serves two purposes: 1) to synthesize patch tags without the need for user interventions, and 2) to act as a second opinion for segmentations. In the latter case, the segmentation network serves as a first expert predicting pixel-wise labels, whereas the classifier network provides a second, per-patch prediction. This can be considered an ensemble approach to uncertainty (Vrugt and Robinson, 2007), where a disagreement between the two networks/opinions indicates uncertainty in the predictions for a given patch.
Most works in the literature rely on customized versions of VGG-16 (Simonyan and Zisserman, 2015) and ResNet, the most popular architectures for natural image classification, or on task-specific architectures adapted from general purpose networks (Chen et al., 2016; Setio et al., 2016).
In this work, we investigate the use of networks specifically designed for medical imaging applications for our classification task: the Unet (Ronneberger et al., 2015) and the Pnet. As these two networks were designed for image segmentation, we hereby describe how they have been modified to achieve classification.
We denote the modified 2D Pnet architecture 2D-PnetCl and the modified 2D Unet architecture 2D-UnetCl.
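A common way to turn a fully convolutional segmentation backbone into a patch classifier is to pool its final feature map globally and attach a sigmoid unit; the sketch below follows that assumption with a placeholder backbone, and is not necessarily the exact 2D-PnetCl/2D-UnetCl modification.

from tensorflow.keras import layers, Model, Input

def build_patch_classifier(backbone_features, patch_size=32):
    """Wrap a segmentation-style convolutional backbone into a vessel/non-vessel classifier.

    backbone_features: callable mapping an input tensor to a 2D feature map
    (e.g. the convolutional trunk of a Pnet or Unet). This wiring is an
    illustrative assumption; the published classifier heads may differ.
    """
    inp = Input((patch_size, patch_size, 1))
    feats = backbone_features(inp)                  # (H, W, C) feature map
    x = layers.GlobalAveragePooling2D()(feats)      # aggregate pixel-level evidence (MIL-style pooling)
    x = layers.Dropout(0.5)(x)                      # dropout of 0.5, as used for the classifiers
    out = layers.Dense(1, activation='sigmoid')(x)  # patch-level vessel probability
    return Model(inp, out)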

Data Augmentation for Segmentation Network Training
The set T M consisting of pseudo-labels is used to train the 2D-WnetSeg. To augment its size without increasing the annotation burden, we make use of the classification network to generate a larger set with pixel-wise pseudo-labels. The procedure is depicted in Fig. 4.
Assuming that there is an initial set of unlabeled images I* that can be used for training, we consider the joint image dataset of labeled and unlabeled images I_ALL = I ∪ I*. The subset I of these images is used to generate Vessel-CAPTCHAs, which are presented to the user for annotation. This results in the training set T_P (Eq. 3), which is used both to train the classification network and to synthesize the pixel-wise pseudo-labeled set T_M (Eq. 4).
A set of patches {X_s*} is extracted from the remaining set of images I*. Rather than presenting another Vessel-CAPTCHA to the user for annotation, the {X_s*} are fed to the trained classification network to estimate patch labels {Y_s*}. The paired set of patches and estimated labels forms a new set T_P* = {T_P^I}_{I ∈ I*}. The set T_P* is used to synthesize pixel-wise pseudo-label masks M* following the same procedure applied to T_P (Sec. 2.1). This leads to a new pseudo-labeled set T_M*. The extended set of pixel-wise pseudo-labels is formed by the union of the two sets, T_M^ALL = T_M ∪ T_M*, and is subsequently used to train the 2D-WnetSeg architecture.
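In pseudocode, this expansion of the pseudo-labeled set could look as follows; classifier, extract_patches and pseudo_label_patch are the previously sketched (or hypothetical) components, and the 0.5 decision threshold is an assumption.

import numpy as np

def expand_pseudo_labels(unlabeled_images, classifier, extract_patches,
                         pseudo_label_patch, threshold=0.5):
    """Label patches of unseen images with the trained classifier and
    synthesize pixel-wise pseudo-labels for the predicted vessel patches."""
    t_m_star = []
    for volume in unlabeled_images:
        patches = extract_patches(volume)                     # 32x32 patches over the brain grid
        probs = classifier.predict(patches[..., np.newaxis])  # patch-level vessel probabilities
        for patch, p in zip(patches, probs.ravel()):
            if p >= threshold:                                # predicted vessel patch
                t_m_star.append((patch, pseudo_label_patch(patch)))
    return t_m_star

# T_M_ALL = T_M + expand_pseudo_labels(I_star, pnet_cl, extract_patches, pseudo_label_patch)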

Inference Phase
Unseen 3D images are segmented by extracting 2D image patches that are then segmented by the 2D-WnetSeg and concatenated to build back the original volume (Fig. 1). In low quality or noisy images, the resulting segmentation can often present a large set of pixels erroneously segmented as vessels. To avoid this problem, the trained classifier network may act as an expert providing a second opinion to the results from the segmentation network. In such case, only those patches which have been classified as vessels are taken into account to reconstruct the final volume. All the pixels of the remaining patches are set to zero.
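A sketch of this inference loop is shown below; extract_grid_patches and reassemble are hypothetical helpers for the 96×96 segmentation grid and the 32×32 classifier grid, and the optional second-opinion masking zeroes out patches classified as non-vessel.

import numpy as np

def segment_volume(volume, wnet_seg, pnet_cl=None,
                   extract_grid_patches=None, reassemble=None):
    """Segment an unseen 3D volume slice by slice and, optionally,
    mask the result with the patch classifier's second opinion."""
    seg = np.zeros_like(volume, dtype=np.uint8)
    for z in range(volume.shape[0]):                        # axial slices
        patches, coords = extract_grid_patches(volume[z], size=96)
        preds = wnet_seg.predict(patches[..., np.newaxis])[..., 0] > 0.5
        seg[z] = reassemble(preds, coords, volume[z].shape)
        if pnet_cl is not None:                             # second opinion on 32x32 patches
            cl_patches, cl_coords = extract_grid_patches(volume[z], size=32)
            keep = pnet_cl.predict(cl_patches[..., np.newaxis]).ravel() > 0.5
            for (r, c), k in zip(cl_coords, keep):
                if not k:
                    seg[z, r:r + 32, c:c + 32] = 0          # mask out non-vessel patches
    return seg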

Implementation Details
We used the Keras library to implement 2D-PnetCl, 2D-UnetCl and 2D-WnetSeg. The networks were trained on a GPU workstation with a 4-core Intel(R) Xeon(R) CPU @ 2.30GHz, an NVIDIA Tesla P100-PCIE-16GB, and 25GB of memory. For both 2D-UnetCl and 2D-PnetCl we optimized the binary cross-entropy loss function with mini-batch stochastic gradient descent, a conservative learning rate of 0.01 and a momentum of 0.9. The weights of the 2D-WnetSeg were optimized using an Adam optimizer with learning rate lr = 1e−4, β1 = 0.9, and β2 = 0.999. All networks were trained from scratch using mini-batches of 64 patches. All input patches were normalized by the mean and standard deviation of the whole training data. A dropout of 0.5 for 2D-PnetCl and 2D-UnetCl, and of 0.1 for 2D-WnetSeg, was added to prevent overfitting during training. The image input sizes of 2D-PnetCl and 2D-WnetSeg were 32×32 and 96×96, respectively. We implemented a zero-padding technique to preserve the output size at each convolution layer in both networks. Therefore, the feature map size at each level in the 2D-PnetCl is 32×32.
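Translated into Keras calls, the optimization setup above could look as follows; pnet_cl, wnet_seg and the training arrays are placeholder names, and the number of epochs is left unspecified.

from tensorflow.keras.optimizers import SGD, Adam

# Classification networks (2D-PnetCl / 2D-UnetCl): SGD with lr=0.01, momentum=0.9,
# binary cross-entropy loss, mini-batches of 64 patches.
pnet_cl.compile(optimizer=SGD(learning_rate=0.01, momentum=0.9),
                loss='binary_crossentropy', metrics=['accuracy'])
pnet_cl.fit(x_patches_32, y_tags, batch_size=64)

# Segmentation network (2D-WnetSeg): Adam with lr=1e-4, beta_1=0.9, beta_2=0.999,
# Dice loss, mini-batches of 64 pseudo-labeled 96x96 patches.
wnet_seg.compile(optimizer=Adam(learning_rate=1e-4, beta_1=0.9, beta_2=0.999),
                 loss=soft_dice_loss)
wnet_seg.fit(x_patches_96, y_pseudo_masks, batch_size=64)

# Inputs are normalized by the mean and standard deviation of the training data:
# x = (x - train_mean) / train_std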

Experimental Setup
In this section, we describe the experimental setup. First, we present the datasets used in our experiments (Sec. 3.1) and the baselines used for comparison (Sec. 3.2). Then, we describe the training setup (Sec. 3.3). Finally, we present the performance evaluation metrics used in our experiments (Sec. 3.4).

Data
Three different types of data were used in this study: synthetic, Time-of-Flight (TOF) angiography and Susceptibility-Weighted Images (SWI). The latter two correspond to two magnetic resonance imaging (MRI) sequences commonly used to image and assess the cerebrovascular tree (Radbruch et al., 2013), although blood vessels present different appearances in each modality. In TOF, vessels are hyper-intense structures, whereas they are hypo-intense in SWI.

Setup
Pre-processing and Annotation. We used the available ground truth from the synthetic images to generate Vessel-CAPTCHA annotations. Since the in-plane dimensions of the images are not a multiple of the patch size (Table 1), we overlap the last two rows/columns of patches. Both TOF and SWI were skull-stripped using a standard tool and we generated the Vessel-CAPTCHA annotation grid only over the brain tissue (Fig. 2). Where the minimum-sized rectangle mask covering the brain tissue was not a multiple of the patch size in a given dimension, we dilated the mask in that dimension until the condition was met and generated the annotation grid.
If the minimum-sized rectangle mask touched the image slice borders and the in-plane dimensions of the images were not a multiple of the patch size, we generated the annotation grid by overlapping the last two rows or columns of patches. Three users annotated the images using the Vessel-CAPTCHA annotation scheme: a trainee, an experienced rater and a neurologist.
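The grid construction described above can be sketched as follows; brain_mask is assumed to be the skull-stripping output for one axial slice, the overlap of the last row/column handles non-multiple dimensions, and the dilation variant is omitted for brevity.

import numpy as np

def annotation_grid(slice_brain_mask, patch=32):
    """Return top-left corners of the 32x32 annotation grid covering the brain tissue.

    The grid spans the bounding box of the brain mask; if a dimension is not a
    multiple of the patch size, the last row/column of patches overlaps the
    previous one so that the whole bounding box is covered."""
    rows, cols = np.where(slice_brain_mask > 0)
    if rows.size == 0:
        return []
    r0, r1 = rows.min(), rows.max() + 1
    c0, c1 = cols.min(), cols.max() + 1
    r_starts = list(range(r0, max(r0 + 1, r1 - patch + 1), patch))
    c_starts = list(range(c0, max(c0 + 1, c1 - patch + 1), patch))
    if r_starts[-1] + patch < r1:
        r_starts.append(r1 - patch)     # overlapping last row of patches
    if c_starts[-1] + patch < c1:
        c_starts.append(c1 - patch)     # overlapping last column of patches
    return [(r, c) for r in r_starts for c in c_starts]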
In addition to this, TOF data was pixel-wise annotated. Finally, no pixel-wise labels were obtained for SWI, since it is difficult to obtain a sufficiently robust ground truth. All annotation times were recorded.
For the Vessel 2D-Unet, further data pre-processing for synthetic and TOF data was performed as described in (Livne et al., 2019). All datasets were normalized (within modality). For TOF, where two different sources were used, we followed the intensity and spacing normalization strategy from (Full et al., 2021).
Training Setup. Two different rules are used to synthesize pseudo-labels for the annotated training set T_M with the K-means algorithm. In synthetic data and TOF, vessels are associated with the cluster with the highest mean intensity, whereas in SWI the vessel class is associated with the cluster with the lowest mean intensity. The training sets, T_P and T_M, are used to separately train a classification and a segmentation network per modality.

Evaluation Metrics
Vessel Segmentation. We estimate the Dice Similarity Coefficient (DSC), the Hausdorff Distance (HD), the 95% Hausdorff Distance (95HD) and the mean surface distance error (µD) between the segmentation and the annotated ground truth to quantitatively assess the segmentation accuracy in TOF and the synthetic dataset. We measure HD, 95HD and µD in voxels.
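For reference, the DSC and the surface-distance measures can be computed in voxel units as sketched below; this is a generic scipy-based implementation and not necessarily the exact evaluation code used in our experiments.

import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice(seg, gt):
    seg, gt = seg.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(seg, gt).sum() / (seg.sum() + gt.sum())

def surface_distances(seg, gt):
    """One-directional distances (in voxels) from the surface of `seg` to the surface of `gt`."""
    seg, gt = seg.astype(bool), gt.astype(bool)
    seg_surface = seg ^ binary_erosion(seg)
    gt_surface = gt ^ binary_erosion(gt)
    dist_to_gt = distance_transform_edt(~gt_surface)   # distance map to the ground-truth surface
    return dist_to_gt[seg_surface]

def distance_metrics(seg, gt):
    # Assumes both masks are non-empty.
    d_sg = surface_distances(seg, gt)
    d_gs = surface_distances(gt, seg)
    all_d = np.concatenate([d_sg, d_gs])
    hd = max(d_sg.max(), d_gs.max())                   # Hausdorff distance (HD)
    hd95 = np.percentile(all_d, 95)                    # 95% Hausdorff distance (95HD)
    mu_d = all_d.mean()                                # mean surface distance error (µD)
    return hd, hd95, mu_d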
In SWI, the segmentations are assessed qualitatively. Based on a visual inspection by two raters (an expert rater and a neurologist), the segmented images are classified as good (3), average (2) or low quality (1). A segmented image is considered good if it segments the large and medium vessels, and avoids the segmentation of sulci and of noisy regions with an elongated, vessel-like appearance; it might miss some small vessels. A segmented image is considered of average quality if it segments large and medium vessels, misses small ones, may segment noisy areas in a small proportion (less than 50%), especially in the anterior part of the brain, and often segments sulci. All other cases are considered low quality. We use Cohen's Kappa coefficient (κ) to measure the level of agreement among raters.
Patch Classification. We measured precision (P), recall (R) and the F-score (F 1 ), using a vessel patch as the positive class to assess the quality of the classification results obtained by the classifier networks.

Experiments and Results
We assess the performance of the Vessel-CAPTCHA in terms of vessel segmentation accuracy and required annotation time (Sec. 4.1). In Section 4.2, we compare our weak learning strategy with other limited supervision techniques. Section 4.3 studies the proposed classification networks and their performance as a data augmentation strategy. Next, we perform an ablation study to understand how the different components of the framework contribute to performance (Sec. 4.4) and we present a brief summary of all the obtained results in Section 4.5.

3D Brain Vessel Segmentation Performance
We evaluate the performance of the Vessel-CAPTCHA framework in terms of segmentation accuracy and required annotation time using all available datasets. We compare it against state-of-the-art 3D brain vessel segmentation approaches, i.e. deep learning vessel segmentation frameworks and classical techniques.
Synthetic Data. We use the synthetic data to provide a controlled setup, where the ground truth is fully reliable, to assess the learning-based vessel segmentation strategies. In addition to the required fully supervised training, Vessel 2D-Unet and DeepVesselNet are trained using weak labels from the Vessel-CAPTCHA annotation scheme. We denote them Vessel 2D-Unet-W and DeepVesselNet-W. For classical methods requiring manual post-processing operations, we record the time required to obtain a visually satisfactory segmentation.

The better distance-based measures suggest that the differences in DSC might come from the ground truth annotation protocol, in which our data might include more distal, hence thinner, vessels that are more prone to remain unsegmented. This is confirmed by DeepVesselNet's DSC on synthetic data.
In the controlled setup, the reported results are comparable to those of (Tetteh et al., 2020).
Susceptibility-Weighted Images (SWI). We study the capacity of the Vessel-CAPTCHA to segment different image modalities by qualitatively assessing the segmentation results obtained in SWI. The framework was trained and visually assessed on the validation set. The model visually judged as best was used to segment the test set. Figure 8 illustrates some segmentation results. Overall, SWI is more complex than TOF, thus more errors are observed. As a general pattern, the SWI segmentations tend to miss small vessels, while there is also a high incidence of false positives due to erroneously segmented sulci and noise. Nevertheless, the raters judged more than 50% of the segmentations as good, and only one image was considered poor by one of them. Their visual judgment yielded an average rating score of 2.57 with an agreement of κ=0.75.
SWI Vessel-CAPTCHA annotation (94.5±11.5 min) requires 38% more time than in TOF. This is expected given the increased complexity of SWI scans: small vessels require more effort to be identified and vessels often present an appearance similar to sulci (Fig. 8). These factors directly affect the time needed by a rater to discriminate vessel from non-vessel patches. Nevertheless, SWI Vessel-CAPTCHA annotation requires 71% less time than the pixel-wise annotation baseline (327.5±20.5 min, see Fig. 7).

Alternative Limited Supervision Strategies
Using the TOF dataset, we perform a separate comparison of the Vessel-CAPTCHA against other limited supervision strategies, excluding fully supervised 3D brain vessel segmentation approaches. As there are no limited supervision works addressing 3D brain vessel segmentation, a direct comparison between the two families of methods would be biased in favor of the fully supervised techniques.
Partial Labeling Techniques. Table 4 compares our framework with the partial labeling techniques, 3D-Unet and Pseudo-labeling. The 3D-Unet is trained with the pixel-wise annotations, under the assumption that these are highly prone to error, given the difficulties that the brain vessel tree poses for annotation. Pseudo-labeling uses, as training labels, rough segmentation masks obtained by applying the Sato filter (Sato et al., 1997) to the image volumes.
Table 4. Comparison with partial labeling methods using TOF. The bold font denotes the best value. Our framework uses 2D patches, Pseudo-labeling image slices and 3D-Unet image volumes as input.
Weakly Supervised Strategies. In our experiments, we were not able to achieve sufficiently good results with WS-MIL and AffinityNet to allow a quantitative comparison with the other baselines. In this section, we perform a qualitative analysis of the obtained results to better understand the limitations of standard MIL- and CAM-based segmentation techniques for brain vessel tree segmentation.
We adapt WS-MIL (Lerousseau et al., 2020) to address 3D brain vessel segmentation by using the Vessel-CAPTCHA patches as input rather than an image slice. WS-MIL splits its input into sub-patches and ranks them according to their predicted probability of containing a vessel. We consider two sub-patch sizes, 16×16 and 8×8. The final sub-patch labeling is achieved by using the ranked patches along with two hyper-parameters, α and β, which control the minimum number of pixels belonging to the foreground (α) and the background (β) class (Table 2). We observe two limitations in the obtained results (Fig. 9). First, the resulting masks correspond to vessel localization masks, not segmentations, due to the granularity of the patches. The original WS-MIL formulation (Lerousseau et al., 2020) was conceived for super-resolution histology images, where the resulting labeled sub-patches can be considered a segmentation mask. Standard brain images have a much lower resolution.
Therefore, the final result lacks the necessary specificity to be considered a segmentation. Second, we observe that it is difficult to set values for α and β that work well for all the slices in an image volume. As shown in Figure 9, while a low α value works well in image slices with larger vessels, the same value fails to detect smaller vessels; hence it is necessary to train a new model with different α, β values.
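For context, the α/β labeling rule, as we adapted it, can be sketched roughly as below: sub-patches are ranked by their predicted vessel probability, the bottom β fraction is forced to the background and, for vessel patches, the top α fraction to the foreground. This is only a coarse approximation of the original WS-MIL formulation.

import numpy as np

def mil_pseudo_labels(probs, is_vessel_patch, alpha=0.05, beta=0.5):
    """Rank-based pseudo-labeling of the sub-patches within one input patch.

    probs: flat array of predicted vessel probabilities for the sub-patches.
    Returns an array with 1 (foreground), 0 (background) or -1 (ignored)."""
    labels = np.full(probs.shape, -1, dtype=np.int8)
    order = np.argsort(probs)                          # ascending probability
    n = probs.size
    labels[order[:max(1, int(beta * n))]] = 0          # bottom-beta fraction -> background
    if is_vessel_patch:
        labels[order[-max(1, int(alpha * n)):]] = 1    # top-alpha fraction -> vessel
    return labels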
The architecture of AffinityNet does not allow images below a certain size to be fed into it. Therefore, we had to enlarge the patch used from 32×32 to 96×96, similar to the one we use as input of 2D-WnetSeg. The larger patches were obtained by grouping 32×32 patches. A vessel label was assigned if at least one sub-patch was originally labeled as a vessel patch. Otherwise, the patch was labeled as non-vessel.
Despite the larger field of view of the new input patches, our experiments did not achieve good results with AffinityNet. A visual inspection of the CAMs showed that, although they activate consistently with the class associated with the patch, they do not contain discriminative information about vessels (Fig. 10). Let us recall that AffinityNet (Ahn and Kwak, 2018) uses the input image and the CAMs (Zhou et al., 2016) to synthesize pseudo-labels, which are then used to train a segmentation model. However, CAMs are rough approximations of the object of interest (Ahn and Kwak, 2018; Bae et al., 2020; Zou et al., 2021). In the past, CAM-based methods have been used to segment relatively large objects in natural scenes (Ahn and Kwak, 2018; Hong et al., 2017; Zou et al., 2021), damaged tissue (Izadyyazdanabadi et al., 2018) or blob-like structures occupying an important part of the image, such as the optic disc (Zhao et al., 2019). In our case, as vessels are relatively small objects, it seems that the network needs to use much more information from the scene to discriminate between vessel and non-vessel patches, as reflected by the CAMs (Fig. 10). That information, however, is too broad to locate the vessels, and thus AffinityNet fails.

Classification Networks
Classification Networks Performance. We study the performance of the two classification networks, 2D-UnetCl and 2D-PnetCl, to determine if they are well-suited as discriminators within our framework. Table 5 compares the classification performance of 2D-UnetCl and 2D-PnetCl in TOF and SWI images with VGG-16 and the ResNet. For each network, two models were trained, one for TOF and one for SWI. Results are reported on the best performing model in the validation set.
The two proposed networks, derived from medical imaging task-specific networks, present a higher overall performance (F-score) than VGG-16 and the ResNet, suggesting that networks specifically designed for medical imaging tasks can contribute to an increased performance. All methods report a drop in performance from TOF to SWI, which is expected given that SWIs are more challenging to classify and segment due to several factors. First, vessels in SWI are hypo-intense, being similar in appearance to the image background. As such, vessels close to the brain surface are prone to misclassification. Second, SWI is capable of imaging very small vessels that can be difficult to identify within a patch, as they can have an appearance similar to that of brain tissue inhomogeneities or sulci, thus leading to misclassification.
Among the proposed networks, 2D-PnetCl presents the highest performance in both modalities. This reflects a good balance in the network's capability to discriminate between vessel and non-vessel patches, which is key for its use within the Vessel-CAPTCHA framework. In the remainder, we rely on 2D-PnetCl as the classification network.
Classification Network as a Weak Pseudo-label Generator. We use a percentage (25%, 50% and 100%) of the weakly annotated training set T_M. Where applicable, we enlarge it with a fixed set of 10 images automatically labeled through the data augmentation process, i.e. |T_M*|=10 (Fig. 4). Figure 11 reports the DSC in the different scenarios. The results show that the data augmentation step improves performance w.r.t. using the same annotated training set with no augmentation, while reaching a performance comparable to that of using a dataset entirely annotated by the user. The comparable performances result from the high classification accuracy of the 2D-PnetCl (F-score=94.71%), which is close to the performance of a human rater.
Classification Network as a Second Opinion. The results obtained by post-processed classical methods (Table 3) suggest that a revision of the segmentation results and their refinement through post-processing can lead to a significant improvement in performance. We investigate if the classification network can act as an expert providing a second opinion on the segmentation results obtained by the 2D-WnetSeg, on a per-patch basis.
If the classification network labels a patch as a vessel patch, the segmented pixels in the patch are preserved. Instead, if the classification network classifies the patch as a non-vessel one, any segmented pixels are masked out. To this end, we calibrate the 2D-PnetCl output by choosing the classification threshold of the final prediction layer which maximizes the DSC (Fig. 12). Figure 13 reports the vessel segmentation DSC, using Set 1 of the TOF images, in the following scenarios: 1) on all the testing set (ALL); 2) on 4 images identified as of low quality (LQ); 3) using a second opinion on the full testing set (Cl(ALL)); 4) using a second opinion on the low quality data (Cl(LQ)); and 5) on all the testing set with a second opinion only on the low quality data (ALL+Cl(LQ)). The results suggest that using the classifier network as a second opinion has a significant impact.

We follow the same procedure using the SWI segmentations and present the revised segmentation masks to the raters for visual judgment. The average rating score achieved was 2.30 with an agreement of κ=0.57. The lower rating score is explained by the fact that using the classification network as an expert opinion allowed segmentations containing large regions of false positives, caused by noise in the image and located mostly at the boundaries of the brain tissue, to be corrected at the cost of removing some true positives (Fig. 14). One rater considered this as less critical than the other, which explains the lower agreement between them. The results suggest that the classifier network should not be considered as an expert, i.e. acting as a hard mask, but as a second opinion providing a heuristic measure of uncertainty on patches where the two networks disagree. The mismatching and uncertain regions should thus be validated by an external user.

Fig. 13. Classification network as a second opinion in TOF. Vessel segmentation DSC for all the test set (ALL), low quality test images (LQ), full test set after second opinion (Cl(ALL)), low quality images after second opinion (Cl(LQ)) and full test set with only the low quality images subject to a second opinion (ALL+Cl(LQ)), using 2D-WnetSeg trained on the original training set 1.
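The threshold calibration mentioned above can be performed with a simple sweep over a validation set, keeping the threshold that maximizes the DSC of the masked segmentations; apply_second_opinion and dice below are hypothetical/previously sketched helpers.

import numpy as np

def calibrate_threshold(val_cases, wnet_seg, pnet_cl, apply_second_opinion, dice,
                        thresholds=np.linspace(0.05, 0.95, 19)):
    """Choose the classifier threshold that maximizes the mean DSC on validation data."""
    best_t, best_dsc = 0.5, -1.0
    for t in thresholds:
        scores = []
        for volume, gt in val_cases:
            seg = apply_second_opinion(volume, wnet_seg, pnet_cl, threshold=t)
            scores.append(dice(seg, gt))
        if np.mean(scores) > best_dsc:
            best_t, best_dsc = float(t), float(np.mean(scores))
    return best_t, best_dsc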

Ablation Study
We study the properties of the different components of the proposed annotation and segmentation framework through a set of ablation studies. We investigate the incidence of K-means as a pseudo-label generation strategy and the role of the 2D-WnetSeg network.

K-means as a Pseudo-label Generation Strategy
We study how the pixel-wise pseudo-labeled dataset T M synthesized from user-provided weak patch tags affects the framework's performance in TOF. We achieve this in two ways.
First, we investigate whether the pixel-wise pseudo-labels synthesized by K-means represent a good rough approximation of pixel-wise user-annotated labels. Second, we assess how the size of the patches used as input to the segmentation network influences the latter's performance. In our experiments, we compare with Gaussian mixture models (GMM), an alternative self-supervised approach to obtain pixel-wise pseudo-labels from image tags (Luo et al., 2020). Two components (vessel and background) are used for the GMM to be comparable with K-means. For both cases, patches with more than 30% of pixels marked as vessel are fully masked out and considered as non-vessel; these correspond to highly noisy patches containing only brain tissue.
The role of the self-supervised method, i.e. K-means in our case, is to synthesize pixel-wise pseudo-label masks {M_s}_{s=1}^S which are sufficiently good to train the segmentation network. In other words, the pseudo-labels should be as close as possible to hypothetical pixel-wise annotations provided by a user. We thus measure the similarity, with the DSC, between the pixel-wise pseudo-labeled masks {M_s}_{s=1}^S and the available pixel-wise annotations of the TOF training set. The K-means (and GMM) are applied to different input sizes, namely directly on the full image volume, or on subsets of it that are then concatenated. The performance of both methods is inversely related to the size of the input sample. As would be expected, when applied to large extents of the image volume, i.e. the full image volume (FV) or on a per image slice basis (IS), the DSC is very low (< 40%), with GMM reporting slightly higher values.
As the extent of the input sample decreases, i.e. using patches, K-means performs better, which could be justified by the fact that smaller regions tend to be more homogeneous. Two aspects should be highlighted from the obtained results. Firstly, we observe that GMMs lead to thinner vessel masks than those synthesized by K-means (Fig. 16), which is consistent with the higher DSC obtained by K-means, as over-segmentations tend to be less penalized than mis-segmentations. Given the way that the 2D-WnetSeg learns, it is better to have overestimated masks from K-means than finer ones. However, since K-means is a simpler algorithm, the patch size used as input plays an important role.
Our results suggest that smaller patch sizes lead to better results.
Secondly, we shall recall that both self-supervised methods are only applied to vessel patches. This is a necessary condition to obtain pseudo-labels of a minimum quality using these two algorithms. The condition is guaranteed by the patch tags discriminating vessel from non-vessel patches, which are obtained through the Vessel-CAPTCHA. Based on these results, for the remaining experiments we set the patch size input to the K-means to 32 × 32, which corresponds to the same value used in the Vessel-CAPTCHA.
Larger Patches are Best for Segmentation. Figure 15 (bottom) shows the 2D-WnetSeg accuracy for varying input patch sizes over the validation set. The patches are obtained by rebuilding the rough mask volume from the 32×32 patches and re-cropping the volume into different patch sizes. It should be noted that the segmentation network's input patch size does not have to match that of the Vessel-CAPTCHA. Coherently with the previous results showing that the K-means pseudo-labels are more similar to true annotations, their use consistently leads to higher DSCs. The Vessel-CAPTCHA patch size, 32×32, seems too small for the 2D-WnetSeg to capture the features that allow vessel pixels to be discriminated from non-vessel ones. Instead, larger patches lead to higher DSCs. However, we avoid the use of very large patch sizes to prevent vessels from becoming a small portion of the full image/patch, which leads to drops in performance. We therefore set the segmentation network's input patch size to 96×96.

We perform an ablation study to explore the effectiveness of the 2D-WnetSeg. Figure 17 compares the performance of 2D-WnetSeg with its ablated version, consisting of only its first Unet (2D-Unet), while varying the size of the training set. The 2D-WnetSeg reports a higher DSC across datasets. The better performance of the 2D-WnetSeg is explained by the fact that the networks are trained on rough segmentation masks. The first Unet works as a refinement module that corrects the mask by inferring potentially missing vessels based on the structural redundancy of the cerebrovascular tree. The second Unet can then learn from the raw brain image and the previously improved segmentation mask, leading to an increased segmentation performance. The single Unet, instead, is faced directly with the rough masks. We further investigate this behavior using the synthetic dataset, which provides a controlled setup for comparison (Table 6). The higher reported DSC of 2D-WnetSeg indicates it is better at detecting vessel pixels. Moreover, the lower 95HD and µD are a sign of the more refined results that the 2D-WnetSeg can achieve w.r.t. its ablated version. Overall, the Vessel-CAPTCHA has a performance comparable to the best fully supervised methods (Livne et al., 2019), while avoiding post-processing steps and providing an important speed-up for training data annotation.

Discussion and Conclusions
Context and Proposed Solution. Deep convolutional networks have achieved state-of-the-art performance in many medical image segmentation tasks. However, their success has not been as wide for 3D brain vessel segmentation. This can be explained by two factors. First, deep learning techniques are less performant when the object of interest occupies a small portion of the image, as is the case for brain vessels (Livne et al., 2019). Second, manual pixel-wise annotation of vessels is highly time consuming and complex (Moccia et al., 2018). In this work, we introduced the Vessel-CAPTCHA, an efficient learning framework for vessel annotation and segmentation. The framework formulates the Vessel-CAPTCHA annotation scheme, which allows users to annotate a dataset through simple clicks on patches containing vessels, similarly to the commonly used image-CAPTCHAs of web applications (von Ahn and Dabbish, 2004). As such, our work can be considered a multi-instance learning problem where a bag corresponds to an image patch and the instances are the image pixels to be segmented. User-provided patch-level tags are used to synthesize pixel-wise pseudo-labels that serve as input to train a 2D patch-based segmentation network. In particular, we use the K-means algorithm to synthesize the pixel-wise pseudo-labels along with the proposed 2D-WnetSeg network, concatenating two 2D-Unets, as backbone architecture. The use of a 2D patch-based segmentation network, instead of more complex end-to-end 3D or hybrid architectures, is motivated by the need to increase the object-of-interest to image size ratio, as a way to mitigate the reduced performance of deep learning-based methods when the object of interest does not occupy an important portion of the input image. Furthermore, this simplifies the learning process: at a larger scale, the complexity and uniqueness of each brain vessel tree make it difficult to learn common underlying patterns (Moriconi et al., 2019), whereas, at a local scale, the characteristic patterns of vessels are similar to each other, allowing the network to learn them. Reducing the input size is a common strategy in learning-based vessel segmentation, beyond brain vessel tree segmentation (Kitrungrotsakul et al., 2019; Koziński et al., 2020). The lower results obtained by 3D networks validate our choice of a 2D patch-based segmentation network.
To further ease the annotation process, our framework includes a classification network that can label training data without further user effort. This network is trained using the same user-provided patch tags, and it allows image patches from unseen images to be classified and used to enlarge the original training set without the need for further user annotations.
Framework Evaluation. We evaluated the proposed framework in terms of its accuracy and required annotation time, using a synthetic dataset and two image modalities, TOF and SWI (Fig. 18). Our framework achieved performances comparable to those of current state-of-the-art deep learning approaches for brain vessel segmentation (Livne et al., 2019; Tetteh et al., 2020), while reducing the annotation burden by 77% on average. Moreover, when compared to other limited supervision approaches, our simple yet effective framework demonstrated its superiority. Our promising results, with competitive accuracies and a significant reduction of the user-required effort, should enable the wider use of deep learning techniques for vessel segmentation.
Our results show that the classifier network not only allows the training dataset to be enlarged, but can also act as a second opinion to assess the segmentations. This concept could be further extended to guide a user in the manual correction of a segmentation mask. In this work, we used the classification network as an expert. However, the disagreements between the segmentation and classification networks (i.e. 2D-WnetSeg segments a vessel in a patch classified as non-vessel, or vice versa) could be used as a measure of uncertainty. Since the WnetSeg and PnetCl architectures are significantly different, they extract low-level and high-level features differently. As such, they are complementary to each other: if both agree on a prediction over a patch, the prediction can be considered one of high confidence, whereas when there is a disagreement the patch can be suggested to the rater for revision.
Limitations and Perspectives. Although our work focuses on the brain vessel tree, we consider that the proposed framework is general enough that it can be easily extended to other vascular structures (Aughwane et al., 2019), other tubular structures with complex networks to annotate (Zuluaga et al., 2014a), or different image modalities. However, for some modalities the K-means algorithm used to obtain pixel-wise pseudo-labels can be limited. As an example, the coronary vessel tree imaged with computed tomography angiography is likely to present calcified or lipid plaques that appear as hyper and hypo-intense objects, respectively (Zuluaga et al., 2011). In the current setup, they would be segmented as a vessel (calcified plaques) or the background (lipid plaques). A natural extension of this work would be to develop novel self-supervised methods, beyond those studied in this work, which can cope with the characteristics of different vessel/tubular trees and image modalities.
Our main effort in this work has been directed towards a simplified annotation process and the development of mechanisms that can mitigate the negative effects of 'simpler' annotations, so as to achieve performances comparable to the state-of-the-art. Nevertheless, we consider that there are different ways to achieve higher segmentation performance that could be explored. For instance, similarly to what has been proposed by (Koziński et al., 2020; Phellan et al., 2017), the annotations could be performed in different image planes; currently, they are done in the axial plane only. In addition, the Vessel-CAPTCHA allows for flexible annotations as, for some users, it is simpler to label vessels by following their trajectory. At present, all this information is discarded (see Fig. 2(e) and (g)), when in some cases it may have relevant content. The challenge here would be to identify when the patch annotations contain relevant information beyond the mere identification of the patch. Finally, one last limitation of the current framework is related to the selection of the patch grid scheme. While it is convenient to present non-overlapping patches to the user, in some cases this may degrade the framework's performance. This is particularly true when the grid partition splits vessels, in particular the smaller ones, across two or more patches, causing them to lose their characteristic shape. The use of overlapping patches is a straightforward extension of this work that could reduce the number of misclassified vessels.