Prototypical few-shot segmentation for cross-institution male pelvic structures with spatial registration

The efficient use of labelled support data is what makes few-shot learning desirable in medical image analysis: a few labelled images suffice to classify or segment new classes, a task that otherwise requires substantially more training images and expert annotations. This work describes a fully 3D prototypical few-shot segmentation algorithm, such that the trained networks can be effectively adapted to clinically interesting structures that are absent in training, using only a few labelled images from a different institute. First, to compensate for the widely recognised spatial variability between institutions in episodic adaptation to novel classes, a novel spatial registration mechanism is integrated into prototypical learning, consisting of a segmentation head and a spatial alignment module. Second, to assist training under the observed imperfect alignment, a support mask conditioning module is proposed to further utilise the annotation available from the support images. Extensive experiments are presented for an application of segmenting eight anatomical structures important for interventional planning, using a data set of 589 pelvic T2-weighted MR images acquired at seven institutes. The results demonstrate the efficacy of each of the 3D formulation, the spatial registration, and the support mask conditioning, all of which made positive contributions independently or collectively. Compared with the previously proposed 2D alternatives, the few-shot segmentation performance was improved with statistical significance, regardless of whether the support data came from the same or a different institute.


Introduction
Multi-structure segmentation is one of the fundamental computing tasks in medical imaging applications, found in diagnosis, treatment, and monitoring, and remains a research interest. Diagnosis of a variety of diseases can be assisted by quantifying the morphology, or its change, of multiple structures. For example, brain disorders, including Alzheimer's disease (Petrella et al., 2003) and Parkinson's disease (Hutchinson and Raff, 2000), are associated with abnormal volumes or shapes of neurological regions. Identifying brain structures is the key to many quantitative studies, such as functional activation mapping and brain development analysis (Han and Fischl, 2007). Minimally invasive treatments often benefit from careful planning of both interventional instruments and guidance imaging, with respect to segmented patient-specific anatomical structures. In endoscopic pancreatobiliary procedures, as previously reported, image guidance that displays registered anatomical models outside the endoscopic field of view helps the surgeon during targeting and navigation (Howe and Matsuoka, 1999).
This work is primarily concerned with segmenting multiple organs and urologically interesting structures on T2-weighted MR images from prostate cancer patients, to plan targeted biopsy, focal therapy, and, increasingly, other therapeutic procedures such as radiotherapy. Accurate segmentation of these structures helps with targeting suspected cancerous regions found in multiparametric MR imaging with respect to the prostate gland, and with avoiding vulnerable surrounding structures, such as the rectum, bladder and neurovascular bundles, to minimise the risks of infection, impotence and other potential injuries and complications (De la Rosette et al., 2010).
The data-driven representation learning enabled by deep neural networks has led to promising results in multi-structure segmentation tasks, for example, in neuroimaging (Henschel et al., 2020) and abdominal organs (Weston et al., 2019). Automating this task reduces the current requirement of manual segmentation, which is often associated with costs in expertise and intra- and inter-observer variations (Fiorino et al., 1998). However, recent supervised segmentation methods mostly rely on large data sets with full annotations, subject to similar limitations in labelling, albeit only during training. On the other hand, few-shot image learning aims to classify unseen classes using only a few labelled examples (Snell et al., 2017; Sung et al., 2018). For medical image segmentation, such novel classes may represent new types of organs or anatomical regions, whose expert annotations are not available in large training data sets. Using our intended interventional planning application as an example, different biopsy or therapy procedures may require different anatomical structures and pathological regions to be annotated during the planning stage. Prostate zonal structures, if not routinely segmented for MR planning of radiotherapy, may provide more precise target localisation in registration-assisted ultrasound-guided focal ablation or a new therapy.
The now well-recognised performance loss in cross-institute generalisation of deep learning models (Gibson et al., 2018) has motivated a body of research such as domain adaptation (Ren et al., 2018; Meyer et al., 2021) and federated learning (Li et al., 2021a). This work focuses on few-shot segmentation in the practically important cross-institution context (as shown in Fig. 1), which aims to segment a novel class from a query image, given only a few support images and their binary masks of the novel class, from a novel institution where only limited labelled data are available. In other words, the model should be able to simultaneously adapt to both novel classes and novel institutions.
To improve the inter-class and inter-institute generalisation, this work first examines the manifestation of the performance-reducing inter-institute variability in prototypical few-shot image segmentation algorithms. In particular, we investigate the impact on the few-shot segmentation accuracy of spatial alignment, or the lack thereof, between support and query data from different institutes, a key component in such a prototypical learning paradigm. First, addressing one of the previously identified challenges of spatial inconsistency (Tian et al., 2020), also found in medical image applications (Guo et al., 2021; Sun et al., 2022), we develop a spatial registration mechanism to align the support and query images prior to the comparison between the two. This spatial registration mechanism consists of a segmentation head and a spatial alignment module, trained end-to-end, and is motivated by medical-image-specific observations of the difference between intra- and inter-institution data characteristics, due to different scanners and local imaging protocols, discussed further in Section 4.4.
Second, we propose an additional support mask conditioning module, also trained end-to-end, to enforce the conditioning on the available novel class labels. The conditioning module is empirically designed to work together with the spatial registration mechanism, to maximise utilisation of the few and prized support masks.
One widely identified problem in both developing and evaluating multi-structure segmentation algorithms is the lack of a sizable labelled data set. To the date of submission of this paper, there were no multi-structure annotations publicly available for pelvic MR images. Through this work, all of our manual labels from open data sets have been made available at https://zenodo.org/record/7013610, to aid the reproducibility of this work and, potentially, other urologic or radiologic tasks concerning multiple pelvic anatomical structures.
Our preliminary results were recently presented (Li et al., 2022); the contributions from this paper include 1) a more detailed description and discussion of the spatial registration mechanism, 2) a new support mask conditioning module, 3) a substantially larger and fully labelled data set, and 4) more ablation comparison experiments. These are also summarised as follows.
• We introduced the cross-institution few-shot segmentation task to address the data scarcity problem specifically faced in medical applications.
• We proposed, to the best of our knowledge for the first time, to extend the prototypical network to 3D for few-shot multi-class segmentation, requiring fewer parameters while achieving performance similar to its 2D counterpart.
• We developed a spatial registration mechanism and a support mask conditioning module, directly addressing the observed limitations in medical image few-shot segmentation, for improving generalisation across data from different institutions.
• We presented extensive ablation studies to investigate the impact of the proposed individual components, the increasing number of support data, the varying size of the training set, and the permutations in the available institutes.
• We published all expert annotations based on public image data sets at https://zenodo.org/record/7013610, which includes full segmentation of eight distinctive male lower pelvic structures on 589 3D MR images (including 178 3D MR images from our preliminary work (Li et al., 2022)).
• The code implementing the proposed algorithms has also been made publicly available at https://github.com/kate-sann5100/CrossInstitutionFewShotSegmentation.

Few-shot segmentation
The few-shot segmentation task was first introduced in computer vision applications (Shaban et al., 2017), where the goal is to segment a novel class in a query image given a few support images in which the same class is labelled. Using episodic training, which takes both query and support images as input, Shaban et al. (2017) demonstrated better performance than the common fine-tuning methods, which fine-tune the models on the support images per novel class. Dong and Xing (2018) proposed prototypical episodic learning, which represents the novel class in the support image with a single prototype vector and compares it with query features to perform segmentation. This strategy was later adopted in many further research works (Zhang et al., 2019; Liu et al., 2020; Li et al., 2021b).
Due to the common challenges in data collection, few-shot segmentation has also been adapted to different medical imaging modalities, including CT (Roy et al., 2020), MRI (Mondal et al., 2018) and ultrasound (Guo et al., 2021). The early methods applied a fine-tuning strategy and addressed over-fitting on the support images with multi-tasking (Mondal et al., 2018; Cui et al., 2020) and data augmentation (Zhao et al., 2019; He et al., 2020; Wang et al., 2021). Roy et al. (2020) were among the first to adopt prototypical learning in medical imaging, reporting promising performance. Ouyang et al. (2020) and Yu et al. (2021) extracted multiple prototype vectors and performed a location-guided comparison, under the assumption of similar spatial layouts between the query and support images. However, in cross-institution scenarios, regions of interest may be located differently between queries and supports, as shown in Fig. 9. In this work, we propose an integrated spatial registration mechanism to address such inconsistency.
In addition to data scarcity, the higher-dimensional data in medical imaging often pose practical challenges in neural network training. Roy et al. (2020) proposed to use 2D neural networks pre-trained on large data sets such as ImageNet, performing slice-by-slice inference when applied to 3D medical images. This strategy was also adopted by most of the follow-up prototypical methods (Feyjie et al., 2020; Ouyang et al., 2020; Abdel-Basset et al., 2021; Guo et al., 2021; Tang et al., 2021; Yu et al., 2021; Sun et al., 2022). Kim et al. (2021) integrated a bidirectional gated recurrent unit to process 2D features extracted from adjacent slices. Zhou et al. (2021) proposed 3D pyramid reasoning modules (PRMs) to model the anatomical correlation between query features at each location and all support features at neighbouring corresponding locations; to reduce computational cost, a relatively small number of channels was used for each convolutional kernel. The proposed method, in contrast, reduces the number of comparisons by extracting a single prototype vector for each spatial window.
To the best of our knowledge, there has been no prior work that successfully deployed 3D neural networks that receive and output 3D image volumes directly, using prototypical training for few-shot segmentation in medical imaging applications. Investigating the 3D formulation is not only technically interesting, but may also lead to potentially superior performance and/or efficiency in this inherently 3D segmentation task.

Cross-institution learning
The proposed cross-institution few-shot segmentation task aims to segment a novel class in images from a novel institution, with support images and labels from the same novel or other non-novel institutions. Although the objective is also to generalise to novel data, it differs from federated learning and domain adaptation because of their constraints on data privacy, accessibility, and availability. However, an optimal gain in the efficient use of labelled data could be achieved by combining these methods with few-shot learning.
Federated learning is a learning paradigm that targets the problem of data governance and privacy by training algorithms collaboratively without the need to physically exchange the data themselves, sometimes requiring compliance with varying access policies (Rieke et al., 2020). It has been used in different medical imaging applications (Li et al., 2021a), but the focus is mainly on improving performance on trained classes, without the need to generalise to a novel class.
There exist three types of domain adaptation methods: supervised, unsupervised, and semi-supervised. While sharing the same overarching objective, i.e. generalising to new domains such as novel classes, supervised domain adaptation (Hosseini-Asl et al., 2016) does not explicitly focus on data scarcity as few-shot learning does. Unsupervised methods (Perone et al., 2019), on the other hand, often assume a large-scale but unlabelled data set from the novel institution, thus paying a different attention from that of cross-institution few-shot learning. Combining both techniques, semi-supervised methods (Fu et al., 2019; Xia et al., 2020) can leverage labelled and unlabelled data sets more efficiently.
Despite the distinct focuses, techniques and methodologies from federated learning and domain adaptation have indeed been considered in developing our cross-institution few-shot segmentation approach. For example, spatial normalisation and the divergence between features have been adopted in federated learning (Li et al., 2020) and feature-level domain adaptation (Tomar et al., 2021), respectively.

The Cross-Institution Few-Shot Segmentation Task
Similarly, test images and the corresponding labels of novel classes from all base institutions, together with the images and novel-class labels from the novel institutions, form a novel data set focusing on novel classes, $D_{novel}$. The cross-institution segmentation task thus aims to train a model on the base data set $D_{base}$ and generalise to the novel data set $D_{novel}$, which contains novel classes from both base and novel institutions. Specifically, following the few-shot setting described in Roy et al. (2020) and Ouyang et al. (2020), the model is tasked to segment a novel class $c \in C_{novel}$ in a query image $I^q_u$ acquired from a novel institution $u \in U_{novel}$, given only a few support image-label pairs from the same or different institutions. The predicted query mask $\hat{M}(I^q_u, c)$ is compared with the label $M(I^q_u, c)$ using a segmentation metric such as the Dice score (Sudre et al., 2017). Such an evaluation procedure is named a $K$-shot episode, which is also detailed in Algorithm 1.

Episodic Few-shot Training
This work adopts the common episodic training paradigm (detailed in Algorithm 2) that simulates the few-shot task during training. Each episode consists of a query pair $(I^q, M(I^q, c))$ and $K$ support image-label pairs $\{(I^s_k, M(I^s_k, c))\}_{k=1}^{K}$, for a base class $c$ sampled from $C_{base}$. In this work, $K = 1$ during training: the model is trained to predict the query mask $M^q(c)$ given the query image $I^q$ and one support image-label pair, denoted as $(I^s, M^s(c))$.
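As a concrete illustration, the episode sampling step can be sketched as follows; `sample_episode`, the tensor shapes and the class names are illustrative assumptions, not the authors' released implementation.

```python
import random
import torch

torch.manual_seed(0)
random.seed(0)

def sample_episode(dataset, base_classes):
    """Sample one training episode: a base class c, a query pair and
    one support pair (K = 1), ensuring the support differs from the query."""
    c = random.choice(base_classes)
    i_q, masks_q = random.choice(dataset)
    while True:
        i_s, masks_s = random.choice(dataset)
        if i_s is not i_q:  # the support image must not be the query image
            break
    return (i_q, masks_q[c]), (i_s, masks_s[c]), c

# toy data set: 3 volumes, each with binary masks for 2 base classes
data = [(torch.randn(1, 8, 8, 4),
         {cls: (torch.rand(8, 8, 4) > 0.5).float()
          for cls in ("bladder", "rectum")}) for _ in range(3)]
(q_img, q_mask), (s_img, s_mask), cls = sample_episode(data, ["bladder", "rectum"])
```

In a full training loop, the sampled trio would be passed to the network to predict the query mask and compute the Dice loss, as in Algorithm 2.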

Prototypical Network
A prototypical network (Dong and Xing, 2018) first defines a prototype feature vector per class, extracted from the embedded features of the support-image voxels labelled as that class. The prototype vector is then compared voxel-wise with the query image in the embedded feature space to predict the segmentation.
Specifically, the query and support images, denoted by $I^q$ and $I^s$ respectively, are encoded by a shared feature extractor into query and support feature maps of the same shape, $F^q$ and $F^s$. The class prototype $h_c$ and the background prototype $h_0$ are then derived by averaging $F^s$ over voxels of the class $c$ (where the label $M^s(c)$ equals 1) and the background (where $M^s(c)$ equals 0), respectively:

$$h_c = \frac{\sum_{(x,y,z)} F^s_{(x,y,z)} M^s(c)_{(x,y,z)}}{\sum_{(x,y,z)} M^s(c)_{(x,y,z)}}, \quad h_0 = \frac{\sum_{(x,y,z)} F^s_{(x,y,z)} \left(1 - M^s(c)_{(x,y,z)}\right)}{\sum_{(x,y,z)} \left(1 - M^s(c)_{(x,y,z)}\right)},$$

with $(x, y, z)$ iterating over all voxels in $F^s$ along the x, y and z axes. The similarity between each query voxel and the class (or background) is calculated as the cosine similarity between the query feature vector $F^q_{(x,y,z)}$ and the class prototype vector $h_c$ (or $h_0$ for the background):

$$sim(\star)_{(x,y,z)} = \frac{F^q_{(x,y,z)} \cdot h_\star}{\|F^q_{(x,y,z)}\| \, \|h_\star\|},$$

where $\star \in \{c, 0\}$ represents the class or background, and $\cdot$ represents the dot product between vectors.
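The masked averaging and voxel-wise cosine similarity above can be sketched as below; this is a minimal PyTorch sketch with illustrative shapes, not the authors' code.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def masked_average_prototype(feat, mask):
    """Average feature vectors over voxels where mask == 1.
    feat: (C, W, H, D) support feature map F^s; mask: (W, H, D) binary mask.
    Returns the (C,) prototype vector h_c (or h_0 for 1 - mask)."""
    mask = mask.float()
    num = (feat * mask.unsqueeze(0)).sum(dim=(1, 2, 3))
    den = mask.sum().clamp(min=1e-6)  # guard against an empty mask
    return num / den

def cosine_similarity_map(feat_q, proto):
    """Voxel-wise cosine similarity between query features and a prototype.
    feat_q: (C, W, H, D) query feature map F^q; proto: (C,) prototype."""
    return F.cosine_similarity(feat_q, proto.view(-1, 1, 1, 1), dim=0)

# toy example with 8 feature channels on a 4 x 4 x 2 grid
feat_s = torch.randn(8, 4, 4, 2)
mask_c = (torch.rand(4, 4, 2) > 0.5).float()
h_c = masked_average_prototype(feat_s, mask_c)
h_0 = masked_average_prototype(feat_s, 1 - mask_c)
sim_c = cosine_similarity_map(torch.randn(8, 4, 4, 2), h_c)
```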

Local Prototypical Network
To extract location-sensitive local prototypes (Yu et al., 2021), images of spatial size $W \times H \times D$ are partitioned into overlapping windows $g \in G$ of size $\alpha_w W \times \alpha_h H \times \alpha_d D$, with the equidistant spacing between window centres being half of the window size. As shown in Fig. 2, for each window $g \in G$, two local prototype feature vectors $h^g_c$ and $h^g_0$ are calculated via Eq. (5) by iterating $(x, y, z)$ over the voxels inside the window $g$. For each query voxel $(x, y, z)$, $G_{(x,y,z)} = \{g \mid (x, y, z) \in g\}$ denotes the set of all windows containing the voxel. The local similarity between this voxel and the class (or background) is then calculated as the maximum cosine similarity over the windows $g \in G_{(x,y,z)}$, with the corresponding local prototype feature vectors $\{h^g_c\}_{g \in G_{(x,y,z)}}$ and $\{h^g_0\}_{g \in G_{(x,y,z)}}$:

$$sim(\star)_{(x,y,z)} = \max_{g \in G_{(x,y,z)}} \frac{F^q_{(x,y,z)} \cdot h^g_\star}{\|F^q_{(x,y,z)}\| \, \|h^g_\star\|},$$

with $\star \in \{c, 0\}$ representing the class or background. The foreground/background probability map is derived by normalising the two similarity maps, e.g. with a softmax:

$$\hat{M}^q(\star)_{(x,y,z)} = \frac{\exp\left(sim(\star)_{(x,y,z)}\right)}{\exp\left(sim(c)_{(x,y,z)}\right) + \exp\left(sim(0)_{(x,y,z)}\right)},$$

with $\star \in \{c, 0\}$ representing the class and background. The model is trained to minimise the Dice loss between the predicted and ground-truth masks, where $M^q(0) = 1 - M^q(c)$.

Algorithm 2: Episodic training.
  Sample $c \sim C_{base}$
  Sample $I^q, I^s \in \bigcup_{u \in U_{base}} I^{train}_u$, such that $I^s \neq I^q$
  Denote the mask of $c$ in $I^s$ as $M^s(c)$ and the mask of $c$ in $I^q$ as $M^q(c)$
  Predict $\hat{M}^q(c) = \phi_\theta(I^q, I^s, M^s(c))$
  Compute the loss $\mathcal{L}$
  Update the model parameters: $\theta \leftarrow \theta - \alpha \nabla_\theta \mathcal{L}$
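A naive loop-based sketch of the windowed local-prototype comparison for one class is given below; the shapes, window sizes and function name are illustrative, and an efficient implementation would vectorise the window extraction.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def local_prototype_similarity(feat_q, feat_s, mask_s, win, stride):
    """Max cosine similarity over overlapping local windows (one class).
    feat_q, feat_s: (C, W, H, D); mask_s: (W, H, D) binary support mask;
    stride is half the window size, as in the paper."""
    C, W, H, D = feat_s.shape
    sim = torch.full((W, H, D), -1.0)
    for x in range(0, W - win[0] + 1, stride[0]):
        for y in range(0, H - win[1] + 1, stride[1]):
            for z in range(0, D - win[2] + 1, stride[2]):
                sl = (slice(x, x + win[0]), slice(y, y + win[1]),
                      slice(z, z + win[2]))
                m = mask_s[sl]
                if m.sum() == 0:
                    continue  # no labelled voxels: no prototype for this window
                fs = feat_s[:, sl[0], sl[1], sl[2]]
                proto = (fs * m).sum(dim=(1, 2, 3)) / m.sum()  # local h_c^g
                s = F.cosine_similarity(feat_q[:, sl[0], sl[1], sl[2]],
                                        proto.view(-1, 1, 1, 1), dim=0)
                # each voxel keeps the max similarity over windows covering it
                sim[sl] = torch.maximum(sim[sl], s)
    return sim

feat_q, feat_s = torch.randn(8, 8, 8, 4), torch.randn(8, 8, 8, 4)
mask_s = (torch.rand(8, 8, 4) > 0.5).float()
sim_c = local_prototype_similarity(feat_q, feat_s, mask_s,
                                   win=(4, 4, 2), stride=(2, 2, 1))
```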

Spatial Registration Mechanism
As discussed in Section 1, the differences between intra- and inter-institution variations pose challenges to the local prototypical network, due to the varying image sizes, orientations, and voxel dimensions at the acquiring institutions. The target structure in the query and support images can be distant from each other (as in Fig. 9). They may therefore not fall inside the same or even adjacent windows, which results in erroneous comparisons between the query voxels of the structure and irrelevant prototype vectors. However, while the absolute locations of target structures vary among images, the relative positions between different structures remain consistent. This observation motivated spatial alignment of the query and support images, $I^q$ and $I^s$, illustrated in Fig. 2, before extracting the local prototype feature vectors. This spatial registration process is conjectured to alleviate the discrepancy between different institutions and therefore reduce the amount of cross-institution training data required for generalisation.
In this work, we consider an affine transformation to account for the above-discussed spatial differences, with potential uncertainties due to variable imaging positioning, signal sampling and scanner calibration, although higher-order transformations will also be of interest. Furthermore, to avoid repeated feature map extraction, we propose to apply the transformation directly to the feature maps ($F^q$ and $F^s$), rather than to the images ($I^q$ and $I^s$). The affine transformation prediction consists of two stages. First, a shared base-class segmentation head segments all base classes from the query and support feature maps, $F^q$ and $F^s$, respectively. The multi-class predictions are denoted $\hat{M}^q_{base}$ and $\hat{M}^s_{base}$, for the query and support images, respectively. During evaluation, these predictions are concatenated and passed into the spatial alignment module, illustrated in Fig. 2, which predicts an affine transformation matrix $\tau \in \mathbb{R}^{3 \times 4}$ with 12 degrees of freedom. During training, as illustrated in Fig. 3, the ground-truth base-class segmentation masks $M^q_{base}$ and $M^s_{base}$ are used for alignment prediction. Second, the alignment $\tau$ is applied to the support feature map $F^s$ and the label $M^s_{base}$, to obtain the aligned support feature map $F^s_\tau = \tau \circ F^s$ and the aligned support label for all base classes $M^s_{\tau,base} = \tau \circ M^s_{base}$, respectively. These aligned feature maps are then used to generate the local prototypes (as detailed in Section 4.3) for segmentation.
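Applying a predicted 3 x 4 affine matrix directly to a feature map (or a mask) can be sketched with PyTorch's resampling utilities; the tensor layout and the identity-transform example below are illustrative, and the normalised-coordinate convention is that of `torch.nn.functional.affine_grid`.

```python
import torch
import torch.nn.functional as F

def apply_affine(feat, theta):
    """Resample a feature map (or soft mask) with an affine matrix tau.
    feat: (N, C, D, H, W); theta: (N, 3, 4) in normalised coordinates."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

# sanity check: the identity transform leaves the feature map unchanged
feat_s = torch.randn(1, 8, 4, 8, 8)
identity = torch.tensor([[[1., 0., 0., 0.],
                          [0., 1., 0., 0.],
                          [0., 0., 1., 0.]]])
feat_tau = apply_affine(feat_s, identity)
```

Because the transformation is applied to the (already extracted) feature maps, no second pass through the feature extractor is needed, which is the efficiency argument made above.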
Two losses are defined for training the spatial registration mechanism. First, a Dice loss is defined for the multi-class segmentation of the query and support feature maps:

$$\mathcal{L}_{seg} = L_{dice}(M^q_{base}, \hat{M}^q_{base}) + L_{dice}(M^s_{base}, \hat{M}^s_{base}),$$

where the multi-class Dice loss is defined as:

$$L_{dice}(M, \hat{M}) = 1 - \frac{1}{|\mathcal{C}|} \sum_{\star \in \mathcal{C}} \frac{2 \sum_{(x,y,z)} M(\star)_{(x,y,z)} \hat{M}(\star)_{(x,y,z)}}{\sum_{(x,y,z)} M(\star)_{(x,y,z)} + \sum_{(x,y,z)} \hat{M}(\star)_{(x,y,z)}},$$

with $M(\star)_{(x,y,z)}$ and $\hat{M}(\star)_{(x,y,z)}$ representing the ground truth and the predicted probability of a base class or background $\star$ at $(x, y, z)$.
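The multi-class Dice loss can be sketched as follows; this is a standard soft-Dice formulation, and details such as the smoothing constant are illustrative choices rather than the paper's exact values.

```python
import torch

def multiclass_dice_loss(pred, target, eps=1e-6):
    """Mean soft Dice loss over classes (background included as a channel).
    pred, target: (num_classes, W, H, D); pred holds per-class probabilities."""
    dims = (1, 2, 3)
    inter = (pred * target).sum(dims)
    denom = pred.sum(dims) + target.sum(dims)
    dice = (2 * inter + eps) / (denom + eps)  # per-class soft Dice
    return 1 - dice.mean()

# a perfect prediction yields (near-)zero loss
t = torch.zeros(3, 4, 4, 2)
t[0, :2] = 1  # class 0 occupies the first half
t[1, 2:] = 1  # class 1 occupies the second half
loss = multiclass_dice_loss(t.clone(), t)
```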
The second loss optimises the alignment by minimising the Dice loss between the query label $M^q_{base}$ and the aligned support label $M^s_{\tau,base}$ over all base classes:

$$\mathcal{L}_{align} = L_{dice}(M^q_{base}, M^s_{\tau,base}).$$

In theory, the transformation could be applied the other way around, i.e. by applying the inverse alignment $\tau^{-1}$ to the query feature map $F^q$, resulting in the aligned query feature map $F^q_{\tau^{-1}} = \tau^{-1} \circ F^q$, to achieve spatial alignment between query and support features. A cycle-consistent two-way registration may also apply. However, once the aligned query mask $\hat{M}^q_{\tau^{-1}}$ is predicted, it needs to be inverted to obtain the mask of the original query image, $\hat{M}^q = \tau \circ \hat{M}^q_{\tau^{-1}}$. The support-to-query mechanism was therefore adopted in this work for its computational efficiency, requiring no additional resampling in practice.

Support Mask Conditioning Module
During prototype feature vector extraction using the novel-class mask of the support image, the voxel-wise information may be made invariant to the transformation by the aggregation in Eq. (5) and (7); this invariance is designed for "normalising" cross-institution data, but may also result in large spatial variability in the prediction. Therefore, a simple yet effective support mask conditioning module is proposed. The module takes as input a concatenation of the class similarity $sim(c)$, the background similarity $sim(0)$ and the aligned mask of the class in the support image $M^s_\tau(c)$, for the final prediction $\hat{M}^q(c)$. Unlike the multiplication of the support mask with the support features for prototype calculation (in Eq. (5) and (7)), the direct use of the support mask here provides a more direct route for the novel-class information to reach the final segmentation task, similar to commonly designed shortcut (skip) connections.
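A minimal sketch of such a conditioning module is shown below, assuming PyTorch; the channel width and kernel sizes are illustrative choices, since the paper specifies only that two convolutional layers are used (Fig. 6).

```python
import torch
import torch.nn as nn

class SupportMaskConditioning(nn.Module):
    """Predict the query mask from [sim(c), sim(0), aligned support mask].
    Two 3D convolutional layers, mirroring the description in the text;
    the hidden width of 16 is an illustrative assumption."""
    def __init__(self, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, 1, kernel_size=3, padding=1),
        )

    def forward(self, sim_c, sim_0, mask_s_tau):
        # stack the two similarity maps and the aligned support mask as channels
        x = torch.stack([sim_c, sim_0, mask_s_tau], dim=1)  # (N, 3, W, H, D)
        return torch.sigmoid(self.net(x)).squeeze(1)        # (N, W, H, D)

module = SupportMaskConditioning()
pred = module(torch.randn(1, 8, 8, 4), torch.randn(1, 8, 8, 4),
              torch.rand(1, 8, 8, 4))
```

Feeding the aligned support mask in directly, rather than only through the prototype averaging, is what realises the "shortcut" for the novel-class information described above.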
Different from 3D medical segmentation algorithms that use a mask predicted from a downsampled image to provide context for high-resolution patches, the proposed method conditions the segmentation of the query image on the location and shape information of the support mask.

Loss
Both the spatial registration mechanism and the support mask conditioning module are trained with the original local prototypical network, with an overall training loss combining the few-shot segmentation Dice loss with the two registration losses:

$$\mathcal{L} = L_{dice}(M^q(c), \hat{M}^q(c)) + \mathcal{L}_{seg} + \mathcal{L}_{align},$$

where $\mathcal{L}_{seg}$ and $\mathcal{L}_{align}$ denote the multi-class segmentation loss and the alignment loss of the spatial registration mechanism, respectively.

Multiple-shot Evaluation
Due to memory limitations, training was carried out in the one-shot paradigm, i.e. $K = 1$. During evaluation, for the query image $I^q$ and each of the $K$ support images $I^s_k$, the base-class segmentations are predicted by the base-class segmentation head, denoted by $\hat{M}^q_{base}$ and $\hat{M}^s_{base,k}$, respectively. Among the $K$ support images, only the support image most similar to the query image, in terms of the cosine similarity between the base-class segmentation predictions, is chosen to calculate the local prototypes in Eq. (7). Precisely,

$$k^* = \arg\max_k \frac{\hat{M}^q_{base} \cdot \hat{M}^s_{base,k}}{\|\hat{M}^q_{base}\| \, \|\hat{M}^s_{base,k}\|},$$

where the dot product and norms are calculated on the flattened predictions. The local prototypes are then calculated from the selected support example $(I^s_{k^*}, M^s_{k^*}(c))$ for each window $g$.
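The support selection rule can be sketched as below; the function name and toy tensors are illustrative.

```python
import torch

torch.manual_seed(0)

def select_support(pred_q, preds_s):
    """Pick the index of the support whose base-class prediction is most
    similar to the query's, by cosine similarity on the flattened maps.
    pred_q: query prediction; preds_s: list of K support predictions."""
    q = pred_q.flatten()
    sims = torch.stack([
        torch.dot(q, p.flatten()) / (q.norm() * p.flatten().norm() + 1e-8)
        for p in preds_s
    ])
    return int(sims.argmax())

# toy check: a support identical to the query should be the one selected
q = torch.rand(2, 4, 4)
k_star = select_support(q, [torch.rand(2, 4, 4), q.clone(), torch.rand(2, 4, 4)])
```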
These images were divided into seven subsets based on the acquiring institution. The number of images acquired from each institution is anonymously summarised in Table 1. The cross-institution imaging protocols involve multiple scanners (two manufacturers, with mixed 1.5T and 3T field strengths), varying fields-of-view and anisotropic voxels, with in-plane voxel dimensions ranging between 0.3 and 1.0 mm and out-of-plane spacing between 1.8 and 5.4 mm.
For each image, eight anatomical structures of planning interest, including the bladder, bone, central gland, neurovascular bundle, obturator internus, rectum, seminal vesicle and transition zone, were labelled (as shown in Fig. 4). All segmentations were manually annotated by eight biomedical imaging researchers, with experience ranging from 2 to 10 years in the annotation of medical image data, each annotating a mixed-institution subset sampled with institution stratification. Each annotation has been reviewed at least once.
The full segmentation masks and the derived intensity arrays from T2-weighted sequences, after pre-processing, used to produce the results in this study, are available at https: //zenodo.org/record/7013610.
The eight lower-pelvic structures were randomly divided into four folds, as shown in Table 2. In a cross-validation experiment, the classes contained in each fold were in turn considered novel classes, with the other three folds providing the base classes. The images of each institution $u$ were then randomly sampled into training and testing subsets in a 3:1 ratio. A further validation set was formed with 12 images, 2 from each institution, randomly chosen from the novel data set $D_{novel}$; those images were excluded from the testing. Unless otherwise specified, the same data partitioning was used for all the results presented. All statistical conclusions are reported using paired Student's t-tests at the significance level of $\alpha = 0.05$.

Implementation Details
All images were normalised, resampled and centre-cropped to a size of 256 × 256 × 48 voxels, with a voxel dimension of 0.75 × 0.75 × 2.5 mm, during pre-processing. Random rotation, translation and scaling were adopted for data augmentation during training.
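The normalisation and centre-crop (or pad) steps can be sketched as follows; this is a minimal NumPy sketch with small illustrative shapes in place of 256 × 256 × 48, and the resampling to the 0.75 × 0.75 × 2.5 mm grid is omitted.

```python
import numpy as np

np.random.seed(0)

def zscore(vol):
    """Intensity-normalise a volume to zero mean and unit variance."""
    return (vol - vol.mean()) / (vol.std() + 1e-8)

def centre_crop_pad(vol, out_shape):
    """Centre-crop (or zero-pad) each axis to the target shape."""
    out = np.zeros(out_shape, dtype=vol.dtype)
    src, dst = [], []
    for s, o in zip(vol.shape, out_shape):
        if s >= o:  # axis larger than target: crop around the centre
            a = (s - o) // 2
            src.append(slice(a, a + o))
            dst.append(slice(0, o))
        else:       # axis smaller than target: pad symmetrically
            a = (o - s) // 2
            src.append(slice(0, s))
            dst.append(slice(a, a + s))
    out[tuple(dst)] = vol[tuple(src)]
    return out

vol = np.random.rand(10, 6, 4).astype(np.float32)
pre = centre_crop_pad(zscore(vol), (8, 8, 4))
```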
The best training episode was chosen based on the performance on the validation sets. During evaluation, all images from the novel institution were considered query images for each of the fold-specified novel classes. A binary Dice score for the novel class was calculated for each query image, based on a sampled support image from each of the seven institutions, excluding the query. As described in Algorithm 1, results were reported with support images from 1) all institutions, 2) base institutions, and 3) novel institutions. The institution from which the support image comes is denoted 'support ins'. A 3D UNet was adopted as the feature extractor, whose architecture is detailed in Fig. 5. $\alpha_w = \alpha_h = 1/8$ and $\alpha_d = 1$ were selected for the local prototypical comparison detailed in Section 4.3. As shown in Fig. 6, the support mask conditioning module was made up of two convolutional layers. For the spatial registration mechanism, the base-class segmentation head was a single convolutional layer, and the spatial alignment module was a GlobalNet (Hu et al., 2018a), as specified in Fig. 7. The models were trained using an Adam optimiser starting at a learning rate of $10^{-4}$ with a minibatch size of 1. The implementation code has been released at https://github.com/kate-sann5100/CrossInstitutionFewShotSegmentation.

Compared Baseline Networks
For comparison, we report the results of the following baseline networks.
1. The '3d finetune' network: this baseline implemented the same UNet as the feature extractor. It was pre-trained on the base data set to segment all base classes for 100 epochs, using an Adam optimiser starting at a learning rate of $10^{-4}$. During evaluation, for each query, the pre-trained model was fine-tuned on the support images for 10 iterations before testing. This baseline provides a reference "lower-bound" performance, using a simple transfer learning strategy.
2. The '2d' network: LSNet (Yu et al., 2021) was adopted as the 2D episodic baseline. It adopts the same local-prototype comparison approach as detailed in Section 4.3, but uses a 2D backbone based on ResNet-50 pre-trained on ImageNet, instead of 3D networks. To the best of our knowledge, this is the prototypical network proposed for medical image segmentation that is closest to our work.

3. BiGRU (Kim et al., 2021): another recent few-shot medical segmentation method, with a UNet-like network for 2D slice prediction and a bidirectional gated recurrent unit (GRU) for adjacent-slice consistency.

4. The 'localnet (unsupervised)' network for multi-atlas segmentation: a non-rigid registration network, LocalNet (Hu et al., 2018b), trained on the base data set images with no organ label supervision. Given a support-query pair, the model is trained to predict a dense displacement field that warps the support towards the query. For n-shot evaluation, n dense displacement fields are predicted, each registering one support example towards the query; n query mask predictions are derived by warping each support mask with the corresponding predicted displacement field, and the final prediction is made by majority voting over the n predictions. The implementation was based on the open-source repository MONAI (Cardoso et al., 2022).
5. The 'localnet (supervised)' network for multi-atlas segmentation: a non-rigid registration network, LocalNet (Hu et al., 2018b), trained on the base data set with masks from all classes, including novel classes. As with the 'localnet (unsupervised)' network, the query mask is the support mask warped by the registration-predicted support-to-query dense displacement field. The implementation was based on the open-source repository MONAI (Cardoso et al., 2022).

6. The '3d supervised' network: a fully supervised 3D model trained on the base data set images with masks from all classes. The results on the novel institution images are reported as an "upper-bound" performance.

Ablation Studies
Ablation on different modules. To assess the effectiveness of the different modules in the proposed method, we report the results of the following variants.
1. The '3d' network variant: the proposed 3D local prototypical network, detailed in Section 4.3 and Fig. 8(a), without the support mask conditioning or spatial registration.

2. The '3d con' network variant: the '3d' version with the support mask conditioning but without the spatial registration, described in Section 4.5 and Fig. 8(b).

3. The '3d align' network variant: the '3d' version with the spatial registration mechanism (detailed in Fig. 8(c)) but without the support mask conditioning.

4. The '3d con align' network: the "complete" version of the proposed network with both the support mask conditioning module and the spatial registration mechanism, as shown in Fig. 2.
Ablation on the number of shots. We report model performance when different numbers of support examples are available in each episode.
Ablation on varying training data availability. To investigate the dependency of the proposed method on the training set, we report the performance of the proposed model ('3d con align') trained on training sets of varying sizes. The 'half' and 'quarter' experiments include 1/2 and 1/4, respectively, of the training subset $D^{train}_u$ for each base institution $u \in U_{base}$, in order to test the impact of training set size on the few-shot segmentation performance. The same ratio between institutions was maintained in these experiments. Additionally, the 'half single ins' experiment uses the same number of images as the 'half' experiment, but with all images sampled from the same institution (Institution 1).
To assess and compare the intra- and inter-institution generalisation, results are also reported when all institutions, only the base institutions, and only the novel institutions were used as support institutions, denoted 'all', 'base' and 'novel', respectively.
Table 3. Dice score (%) and 95% Hausdorff distance achieved when Institution 3 was adopted as the novel institution. 'support ins' refers to the institution from which the support images were sampled. ∆ refers to the percentage difference between predictions made with support images from the base and novel institutions.

The support mask conditioning led to 3.76%, 3.25% and 6.76% absolute increases in Dice score compared with '3d', when the support images came from all, base and novel institutions, respectively. Qualitatively, it predicted more "compact" segmentations with smoother boundaries, as shown in Fig. 9. However, it also resulted in a higher ∆: a greater improvement was achieved when support and query came from the same institution, possibly because of its sensitivity to support-query (mis)alignment. This was mitigated by the spatial registration mechanism, which not only further improved the Dice score by 6.81% but also reduced ∆ by 9.00%. Fig. 10 shows an example where the target structures of the query and support were misaligned, resulting in segmentation failure; by aligning the support towards the query, the spatial registration mechanism considerably improved the performance.
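For reference, the Dice overlap (%) reported throughout these tables can be computed from a pair of binary masks as below. This is a generic sketch of the standard metric, not the paper's evaluation code.

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice overlap (%) between two binary segmentation masks.

    Dice = 2 * |pred ∩ target| / (|pred| + |target|), scaled to percent.
    eps avoids division by zero when both masks are empty.
    """
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 100.0 * 2.0 * inter / (pred.sum() + target.sum() + eps)
```

Identical masks score (approximately) 100; disjoint masks score 0, so the absolute percentage differences quoted above are directly comparable across structures.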
Interestingly, when the spatial alignment mechanism was available, the support mask conditioning module led to a further improvement: 9.36%, 9.35% and 9.41% absolute increases in Dice score from '3d align' to '3d con align' when the support images came from all, base and novel institutions, respectively.
It is also important to report that the proposed 3D network '3d', without the support mask conditioning module and the spatial registration mechanism, contained 5.7 million parameters. It achieved performance comparable to the '2d' baseline with 23.5 million parameters, a reduction of around 75% in the number of parameters. For reference, the complete version of the proposed method, '3d con align', contained 27.3 million parameters (16% more than '2d') and achieved a 28.9% relative Dice improvement over '2d'.
Furthermore, Table 6 and Table 7 report the mean Dice scores achieved by '2d' and '3d con align' for different support-query institution combinations, when Institution 3 was the novel institution. Better performance was often achieved by both methods when the support images came from the query institution. A two-tailed paired t-test was performed per query institution, between the Dice scores where the support institution equals the query institution and the maximum Dice scores achieved when the support institution differs from the query institution. This observation is consistent with the hypothesis that the domain shift is smaller between support-query pairs from the same institution, leading to better performance.
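Such a per-query-institution comparison can be reproduced with SciPy's paired t-test; the per-case Dice scores below are illustrative placeholders, not the paper's data.

```python
import numpy as np
from scipy import stats

# Illustrative per-case Dice scores over the same query cases (placeholders):
# one array for support institution == query institution, one for the best
# performing differing support institution.
same_ins = np.array([0.81, 0.78, 0.84, 0.80, 0.79, 0.83])
diff_ins = np.array([0.74, 0.72, 0.79, 0.75, 0.73, 0.78])

# Two-tailed paired t-test, as the scores are paired per query case.
t_stat, p_value = stats.ttest_rel(same_ins, diff_ins)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A paired (rather than independent-samples) test is appropriate here because both score sets are evaluated on the identical set of query images.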
Table 5 reports the performance of the proposed model ('3d con align') and the baseline methods on novel classes, when both query and support came from base institutions while Institution 3 was used as the novel institution. We report results when the support came from the same institution as the query, from a different institution, and over all base institutions. The proposed method outperformed all baseline methods in the few-shot setting, including two state-of-the-art methods: BiGRU (Kim et al., 2021) by 17.01%, 16.88% and 16.9%, and LSNet (Yu et al., 2021) (adopted as our '2d' baseline) by 9.39%, 13.37% and 12.88% absolute Dice, when the support came from the same institution as the query, a different institution, and all base institutions, respectively, demonstrating its efficacy even when no novel institution is involved during evaluation. Moreover, better performance was achieved when query and support came from the same institution than from different institutions, with the performance gap reported as ∆. The proposed method achieved a smaller mean ∆ compared with the baseline few-shot methods, suggesting its ability to mitigate domain shifts between cross-institution query and support data.
Table 8 reports the performance achieved by the proposed method ('3d con align') as the training set varies. The performance dropped as the number of images in the training set was reduced. Notably, the model trained on 'half single ins' performed worse than the model trained on 'quarter', which had only half the size of 'half single ins'. This suggests that the cross-institution few-shot task can be sensitive to the number of institutions available in the training set; quantifying this sensitivity is an interesting future research question.
Table 9 reports the performance of the proposed method ('3d con align') and the '3d finetune' baseline, using 1 to 4 support examples (denoted as # shot). Performance improved for both methods as the number of support examples increased. While '3d finetune' was more sensitive to the number of support examples, the proposed method still outperformed it across 1 to 4 support examples.

Discussion
While both the spatial registration mechanism and the support mask conditioning were motivated by observations on the multiple structure types found in the multi-institution data set used in this study, they may also benefit wider image types and anatomical regions, as similar challenges have been reported in the few-shot segmentation of other types of non-medical data.
The labels used in this study were annotated by a mixture of clinicians and experienced medical imaging researchers. The estimated time for completing this task was more than one thousand observer-hours, a practically challenging undertaking for most local hospitals if supervised learning were instead adopted to develop or validate a segmentation tool. This further supports the clinical relevance of the proposed few-shot segmentation approach.
It is noteworthy that the reported segmentation performance was based on as few as 1-4 labelled examples of the regions of interest, which had not been labelled during the model training stage.
Though challenging, cross-institution few-shot segmentation could benefit situations where only a limited number of annotated data are available. Potential applications, although not investigated in this study due to data availability, include detection and segmentation of specific pathologies with rare instances not previously observed in the training institutions, and longitudinal analysis with available within-subject data from individual patients.
Research questions remaining for future work include the achievability of, and the conditions for, closing the gap to the upper-bound performance of supervised learning, for example, whether other types of data variance, such as scanner, imaging protocol and intensity, in addition to the spatial variation studied in this work, should be addressed.

Table 6. Mean Dice score (%) achieved at different (s ins, q ins) combinations by '2d' when Institution 3 was adopted as the novel institution, where s ins and q ins respectively refer to the institutions from which the support and query were sampled. The best performance for each query institution is bolded. p-values were derived from a paired t-test performed between the Dice scores where the support institution equals the query institution and the maximum Dice scores achieved when the support institution differs from the query institution.

Conclusion
This paper described the first 3D prototypical learning algorithm for medical image segmentation, applied to multiple structures in pelvic MR images from different institutes. Substantial validation was based on clinical data from 589 patients, with full segmentations of eight anatomical classes made available to the scientific community. The demonstrated novelty, efficacy, and clinical applicability of the proposed algorithm suggest an interesting direction for addressing the cost of expert labelling and the cross-institute generalisation of current deep learning-based segmentation applications.
Table 9. Dice score achieved by '3d con align' with various numbers of shots when Institution 3 was adopted as the novel institution. 'support ins' refers to the institution from which the support images were sampled.

Fig. 1. Visualisation of the proposed cross-institution few-shot segmentation task. As shown in (a), the task aims to train a model on the base data set, comprising images from base institutions with corresponding masks for base classes, and to generalise to novel institutions as well as novel classes. The model is evaluated on segmenting novel classes from query images acquired by novel institutions, while the support images come from either base or novel institutions, as shown in (b) and (c), respectively.
Consider a set of classes and a set of institutions, C and U, respectively. For each institution u ∈ U, I_u denotes the set of all images. All classes have been segmented for each image: given a class c ∈ C, M(I_u, c) represents the corresponding mask for the image I_u ∈ I_u from institution u. The classes C and the institutions U are split into disjoint sets (C_base, C_novel) and (U_base, U_novel), respectively. The images I_u of each institution u are also separated into disjoint training and test subsets, I_train_u and I_test_u. A base data set D_base is formed with the training images and the corresponding labels of the base classes from all the base institutions:

D_base = ⋃_{u ∈ U_base} ⋃_{I ∈ I_train_u} ⋃_{c ∈ C_base} {(I, M(I, c))}.
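The set construction above translates directly into code. The following sketch uses hypothetical dictionary structures to hold the per-institution training images and the per-(image, class) masks; all identifiers are illustrative.

```python
def build_base_dataset(train_images, masks, base_institutions, base_classes):
    """D_base: union over u in U_base, I in I_train_u, c in C_base
    of the (image, mask) pairs (I, M(I, c)).

    train_images: dict mapping institution -> list of image ids.
    masks: dict mapping (image id, class) -> mask.
    """
    d_base = []
    for u in base_institutions:          # u ∈ U_base
        for image in train_images[u]:    # I ∈ I_train_u
            for c in base_classes:       # c ∈ C_base
                d_base.append((image, masks[(image, c)]))
    return d_base
```

The resulting list has |C_base| entries per training image, one per base class, matching the per-class episodic sampling described for training.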

Fig. 2. Overview of the proposed method. A shared feature extractor outputs query/support features from the query/support images. The spatial registration mechanism spatially registers the support feature/mask towards the query. Foreground/background similarity maps are derived through local prototypical comparison, concatenated with the aligned support mask, and processed by the support mask conditioning module to make the final prediction.
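The prototypical comparison summarised in the caption can be sketched as below. For brevity this uses a single global masked-average prototype per region, whereas the paper's local variant pools prototypes over spatial windows; all names here are illustrative assumptions.

```python
import numpy as np

def similarity_maps(query_feat, support_feat, support_mask, eps=1e-8):
    """Foreground/background cosine-similarity maps via prototypes.

    query_feat, support_feat: (C, D, H, W) feature volumes.
    support_mask: (D, H, W) binary foreground mask (aligned to support).
    Returns [foreground_map, background_map], each (D, H, W) in [-1, 1].
    """
    maps = []
    for m in (support_mask, 1 - support_mask):  # foreground, then background
        # Masked average pooling over support features yields the prototype.
        proto = (support_feat * m).sum(axis=(1, 2, 3)) / (m.sum() + eps)
        # Cosine similarity between the prototype and every query voxel.
        q = query_feat / (np.linalg.norm(query_feat, axis=0, keepdims=True) + eps)
        p = proto / (np.linalg.norm(proto) + eps)
        maps.append(np.tensordot(p, q, axes=([0], [0])))
    return maps
```

The two maps, together with the aligned support mask, would then be concatenated and passed to the conditioning module for the final prediction.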

Fig. 3. The training procedure of the proposed method. The model is trained to minimise the sum of the few-shot loss (Eq. 10), the base class segmentation loss (Eq. 11) and the alignment loss (Eq. 13).
Algorithm 1: Evaluation Procedure
Input: Neural network ϕ_θ with parameters θ; images I_test_u for all base institutions u ∈ U_base; images I_u for all novel institutions u ∈ U_novel; masks for all novel classes in C_novel.
Output: Dice scores for all, novel, and base institutions: ALL DICE, NOVEL DICE, and BASE DICE.

Episodic Training Procedure
Input: Neural network ϕ_θ with parameters θ; learning rate α; images I_train_u for all base institutions u ∈ U_base; masks for all base classes in C_base.
Output: Trained model ϕ_θ.
while Training do

Table 1. The number of images acquired from each institution.

Table 2. The eight structures are randomly divided into 4 folds.

Table 4. Dice score (%) and 95% Hausdorff distance achieved when Institution 4 was adopted as the novel institution. 'support ins' refers to the institution from which the support images were sampled. ∆ refers to the percentage difference between predictions made with support images from the base and novel institutions.

Table 5. Dice score (%) achieved when query and support both came from base institutions, while Institution 3 was adopted as the novel institution. 'support ins' denotes whether the support and query came from the same institution, from different institutions, or from all base institutions.

Table 8. Dice score (%) and 95% Hausdorff distance achieved by '3d con align' with varying training data availability when Institution 3 was adopted as the novel institution. 'support ins' refers to the institution from which the support was sampled. ∆ refers to the percentage difference between predictions made with support images from the base and novel institutions.