Robust deep learning-based semantic organ segmentation in hyperspectral images

Semantic image segmentation is an important prerequisite for context-awareness and autonomous robotics in surgery. The state of the art has focused on conventional RGB video data acquired during minimally invasive surgery, but full-scene semantic segmentation based on spectral imaging data and obtained during open surgery has received almost no attention to date. To address this gap in the literature, we are investigating the following research questions based on hyperspectral imaging (HSI) data of pigs acquired in an open surgery setting: (1) What is an adequate representation of HSI data for neural network-based fully automated organ segmentation, especially with respect to the spatial granularity of the data (pixels vs. superpixels vs. patches vs. full images)? (2) Is there a benefit of using HSI data compared to other modalities, namely RGB data and processed HSI data (e.g. tissue parameters like oxygenation), when performing semantic organ segmentation? According to a comprehensive validation study based on 506 HSI images from 20 pigs, annotated with a total of 19 classes, deep learning-based segmentation performance increases, consistently across modalities, with the spatial context of the input data. Unprocessed HSI data offers an advantage over RGB data or processed data from the camera provider, with the advantage increasing with decreasing size of the input to the neural network. Maximum performance (HSI applied to whole images) yielded a mean DSC of 0.90 ((standard deviation (SD)) 0.04), which is in the range of the inter-rater variability (DSC of 0.89 ((standard deviation (SD)) 0.07)). We conclude that HSI could become a powerful image modality for fully-automatic surgical scene understanding with many advantages over traditional imaging, including the ability to recover additional functional tissue information. Code and pre-trained models: https://github.com/IMSY-DKFZ/htc.


Introduction
Surgical data science is the discipline of capturing, organizing, analysing and modelling surgical data in order to improve the quality of interventional healthcare (Maier-Hein et al., 2017, In this paper, we address the challenge of semantic scene segmentation based on hyperspectral imaging (HSI) data.We put a particular focus on the optimal representation of the data, both in the spectral domain (here referred to as input modality) and in the spatial domain (here referred to as spatial granularity).Modalities comprise hyperspectral imaging (HSI) (c = 100 channels) as well as corresponding RGB data (c = 3) and tissue parameter images (TPI) (c = 4).The latter includes tissue oxygen saturation (StO 2 ), near-infrared perfusion index (NPI), tissue water index (TWI) and tissue hemoglobin index (THI).Each combination (input modality, spatial granularity) yields an organ segmentation map for the whole image after potential aggregation of image parts.The corresponding neural network architectures are illustrated in Figure 3.
2021a).Semantic scene segmentation is an important prerequisite for various tasks in surgical data science, including contextaware assistance and surgical robotics.So far, the scientific literature has focused on binary segmentation tasks (e.g.medical instruments (Roß et al., 2021)) and conventional RGB video data (see e.g.(Scheikl et al., 2020;Grammatikopoulou et al., 2021)).
Recently, spectral imaging (Clancy et al., 2020) has evolved as a promising technique for advanced optical imaging in the operating room.While conventional RGB imaging is limited by imitating the human eye, spectral imaging systems capture the reflectance spectrum of the tissue over the entire field of view, thereby generating a datacube consisting of two spatial and one spectral dimension.Since the underlying tissue optical properties determine the measured tissue reflectance spectrum, spectral imaging has the potential to extract biological information such as tissue type and pathologies while being non-invasive (Studier-Fischer et al., 2021).Spectral imaging with up to tens of spectral bands is generally referred to as multispectral imaging (MSI), whereas spectral imaging with up to hundreds of spectral bands is named hyperspectral imaging (HSI) (Clancy et al., 2020).Early work on semantic scene segmentation with MSI data indicated that more detailed spectral information may improve the segmentation performance, but the number of classes was relatively low with six and three classes, respectively (Moccia et al., 2018;Garifullin et al., 2018).It has not been determined yet whether there is a benefit in using HSI data over processed HSI data (e.g. in the form of parameter images, cf. Figure 1) and RGB data.Furthermore, we are not aware of any prior work that investigated what is an adequate representation of medical HSI data for neural network-based fully automated scene segmentation, especially with respect to the spatial granularity of the data.The spatial variability of images acquired during open surgery is large due to the less controlled imaging conditions (e.g.imaging distance), complex three-dimensional relationships between multiple soft tissues as well as anatomical differences between individuals.In addition, acquisition and annotation of large data sets are difficult and time-consuming.In order to determine the ideal spatial granularity of the HSI input data, the generalization capability towards a new surgery (e.g.unseen individual or experimental conditions) and the required amount of training data thus also need to be considered.
To address this gap in the literature, this paper investigates the following research questions (cf. Figure 1 for an overview): 1. What is an adequate representation of HSI data for neural network-based fully automated organ segmentation?Specifically, what is the optimal granularity of the data (pixels vs. superpixels vs. patches vs. full images) with respect to segmentation quality, the required number of training cases and the capability to generalize towards new surgeries?2. Is there a benefit of using HSI data compared to RGB data and processed HSI data (e.g.tissue parameter estimations) when performing semantic organ segmentation?
The remainder of this paper is structured as follows: After presenting the related work in Section 2, we describe our HSI data set and approach to semantic organ segmentation in Section 3. The performed experiments and results are presented in Section 4 and summarized and discussed in Section 5.

Related work
As only very limited prior work on automatic organ segmentation in MSI/HSI exists, we first summarize related work on organ segmentation based on RGB data in surgery and then present a brief overview on the state of the art in segmentation with MSI/HSI data, both within and outside the field of biomedical image analysis.

Deep learning-based organ segmentation on RGB data
During the past years, deep learning-based segmentation of RGB data has found several applications in surgery, especially in minimally invasive surgery (Rivas-Blanco et al., 2021) such as cataract (Grammatikopoulou et al., 2021) or colorectal surgery (Maier-Hein et al., 2021b).However, work has mainly focused on medical instrument segmentation, driven by various challenges in this area (e.g.CATARACTS challenge on automatic tool segmentation in cataract surgery (Al Hajj et al., 2019), Robust Medical Instrument Segmentation challenge in laparoscopic surgery (Maier-Hein et al., 2021b)).Only few recent works have tackled organ segmentation, either restricted to organ classes (e.g.(Fu et al., 2019;Gibson et al., 2017)) or, more often, in the context of full scene segmentation (e.g.(Allan et al., 2020;Kadkhodamohammadi et al., 2019;Madad Zadeh et al., 2020;Scheikl et al., 2020)).The data sets used differ highly in terms of annotation sparsity (e.g.full scene vs. specific organ segmentation) and the number of considered classes.Input to the models are video frames of varying size (e.g.960 × 540 in (Scheikl et al., 2020), 512 × 512 in (Laves et al., 2019)).Relatively few works have tackled organ segmentation in open surgery, where, compared to minimally invasive surgery, image acquisition is often more difficult to realize and challenges arise from the even larger complexity and variability of the surgical scene (Gong et al., 2021).We are aware of only a single investigation of deep learning-based organ segmentation on RGB images in open surgery: Gong et al. analysed segmentation performance under different imaging conditions such as lightning changes or varying distances based on RGB images of 130 patients and found that these factors have a high influence on the image scores.
Overall, prior work has identified several major challenges for automated organ segmentation on RGB data such as a high variability in tissue appearance across patients (e.g.(Gibson et al., 2017;Collins et al., 2015)) and across images (e.g.due to occlusions or deformations (Moccia et al., 2018)) as well as the variability in the image acquisition.Including further spectral information could be key for addressing those challenges as MSI/HSI may be less reliant on the spatial context and encodes additional clinical information such as tissue perfusion (Fei, 2020).

Segmentation with MSI/HSI data
Within biomedical image analysis.Only a small number of papers address a biomedical segmentation problem based on MSI/HSI data with deep learning (Khan et al., 2021).Even without restricting the search to deep learning-based approaches, we could only identify nine related publications (with only four of them using deep learning techniques): Trajanovski et al. segmented healthy and tumorous tongue tissue in histopathological HSI images on an in-house data set consisting of 14 patients (one image per patient) (Trajanovski et al., 2021).Expanding on their earlier work (Trajanovski et al., 2019), they compared several pixel-based networks, networks based on patches of size 256 × 256 and hybrid networks, taking a combination of entire pixel spectra and patches with a reduced number of channels as input.They found a U-Net architecture (Ronneberger et al., 2015) based on patches to perform best in their specific segmentation task.However, as the performance analysis was conducted on the validation data set (on which hyperparameters were tuned), a subsequent evaluation on an independent test set remains to be performed.Garifullin et al. analysed 55 retinal MSI images and segmented three tissue types (vessels, optic disc and macula) (Garifullin et al., 2018).They used SegNet (Badrinarayanan et al., 2016) and Dense-FCN (Jégou et al., 2017) models and compared MSI with RGB data but their results did not reveal a clear winner (neither from the model nor the modality perspective).
Cervantes-Sanchez et al. analysed 18 HSI images from seven hepatic surgery patients and 21 HSI images from seven thyroid surgery patients.They created sparse annotations of circular shape for four organs (liver, bile duct, artery, portal vein) in the case of hepatic surgeries and three organs (thyroid, parathyroid, muscle) in the case of thyroid surgeries (Cervantes-Sanchez et al., 2021).They compared the performance of several machine learning methods (logistic regression (Cox, 1958), support vector machine (SVM) (Boser et al., 1992), multilayer perceptron (Haykin, 1994) and U-Net) based on single pixels or small patches of an 8 × 8 shape for automatic segmentation of the annotated organ classes.However, evaluation was only performed on the sparse annotations and on the validation data set (on which hyperparameters were tuned).Therefore, a subsequent evaluation on full semantic annotations and an independent test set remains to be performed.
In the project HELICoiD (Fabelo et al., 2016), the potential of HSI for the segmentation of tumorous and healthy brain tissue from patients undergoing neurosurgery was studied.The entire data set comprised 36 images from 22 patients and was made publicly available (Fabelo et al., 2019b).Sparse annotations of four classes (tumor tissue, normal brain tissue, blood vessels and background) were created by combining manual expert segmentations based on pathological findings with k-means clustering (k = 15).Pixels belonging to small clusters were removed as they were suspected to be annotation errors.Ravì et al. trained a Semantic Texton Forest (Shotton et al., 2008) on a subset of the HELICoiD data set, consisting of 33 HSI brain images from 18 patients that were embedded with an adapted version of the t-distributed stochastic neighbour approach (t-SNE) (Maaten and Hinton, 2008) to segment tumorous and healthy brain tis-sue (Ravì et al., 2017).However, the performance analysis was conducted on the validation data set (on which hyperparameters were tuned).Fabelo et al. proposed a multi-class semantic segmentation concept based on fusing a segmentation prediction from a supervised pixel-based SVM classifier that was spatially homogenized through k-nearest neighbours filtering with a segmentation prediction obtained through unsupervised clustering (Fabelo et al., 2018).Due to the sparsity of the annotations, a quantitative validation of the segmentations could only be performed for the SVM classifier but the separation between train, validation and test data is unclear.Fabelo et al. used 26 HSI brain images from 16 patients (6 patients with grade IV glioblastoma and 10 patients with normal brain tissue) to compare baseline SVM-based methods to a pixel-based deep neural network and a two dimensional convolutional neural network classifier on small patches of an 11 × 11 shape (Fabelo et al., 2019a).They found that both deep learning-based methods yielded similar performance and outperformed the SVM-based methods.However, the performance analysis was conducted on the validation data set (on which hyperparameter tuning was performed).Given the sparsity of the annotations and unclear or missing separation between train, validation and test data, a subsequent performance assessment on an independent test set with full semantic annotations remains to be performed for all three studies.Moccia et al. acquired MSI data of seven pigs (57 images) in the setting of hepatic laparoscopic surgery (Moccia et al., 2018).They turned the actual organ segmentation problem into a classification problem in the following way: Based on manually extracted textural and spectral features from automatically segmented superpixels, they trained a SVM to classify six organs (liver, gallbladder, spleen, diaphragm, intestine, and abdominal wall).They showed that the classification accuracy for their MSI data was superior to the classification accuracy for a selection of only three channels.However, the selected channels were too narrow to represent an RGB image.
Akbari et al. acquired seven HSI images of abdominal organs for a single pig.Five organs (spleen, colon, small intestine, bladder, peritoneum) were annotated for these images and pixelbased organ classification was performed while learning vector quantization (Kohonen, 1995) of compressed spectra (Akbari et al., 2008).Given the small size of the data set consisting only of a single individual and the unclear separation between train and test data, a subsequent evaluation on an independent test set comprising a larger number of individuals remains to be performed.
Outside biomedical image analysis.MSI/HSI is applied in various fields such as biochemistry, agriculture, archaeology and especially remote sensing (Prasad and Chanussot, 2020).However, investigation of deep learning-based semantic scene segmentation in those fields is rare (Khan et al., 2018).The validity of existing works is very limited due to small data sets composed of only one to two images and training and testing being performed on the same data (e.g.(Alam et al., 2016;Mughees and Tao, 2016;Nalepa et al., 2020;Paul and Bhoumik, 2021)).Generally, the application of deep learning-based semantic scene segmentation is hampered by limitations in the available annotations (Vali et al., 2020): training data is sparse and often only several discrete pixels instead of entire images are labelled, while due to the high dimensionality of the data, large data sets would be required to avoid overfitting (Zhu et al., 2021).Due to these limitations, most segmentation tasks in these fields are addressed via pixel-based classification (e.g.(Nalepa et al., 2020)) and the few existing patch-based or image-based segmentation approaches are at high risk of train-test-leakage (Nalepa et al., 2019).
In summary, prior work on semantic scene segmentation in open surgery is extremely sparse and even non-existent in medical HSI.Furthermore, the data sets used so far are rather small, and the high complexity and variability in surgical scenes due to non-standardized image acquisition, inter-subject variability and complex three-dimensional relationships between multiple soft tissues (e.g.overlapping tissue, shadowing, deformations) (Gong et al., 2021) remain to be addressed.Among the related work on organ segmentation from MSI/HSI data, models have been based on superpixels (Moccia et al., 2018), patches (Trajanovski et al., 2021;Garifullin et al., 2018;Fabelo et al., 2019a;Cervantes-Sanchez et al., 2021) and pixels (Akbari et al., 2008;Fabelo et al., 2019a).However, the optimal granularity of the data with respect to segmentation quality, the required number of training cases, and the capability to generalize towards new surgeries given the large variability in the surgical scene, has not been determined up to the present date.Furthermore, no prior work could show a clear benefit of MSI/HSI data over RGB data for deep learning-based organ segmentation.We address these gaps in the literature based on a semantically annotated HSI data set of unprecedented size and number of classes (506 images from 20 pigs semantically annotated with 19 classes): The data sets in the related work are composed of a maximum of 22 individuals (Fabelo et al., 2019b) and annotated with a maximum of six classes (Moccia et al., 2018).To the best of our knowledge, the largest medical HSI data sets outside the field of segmentation are composed of 316 images from 30 patients, of which 215 images were annotated with 35 classes (Hyttinen et al., 2020) and 9059 images from 46 pigs annotated with 20 classes (Studier-Fischer et al., 2021).Despite the data sets provided by (Fabelo et al., 2019b;Hyttinen et al., 2020;Studier-Fischer et al., 2021) surpassing our data set in the number of individuals, none of these larger data sets provides semantic segmentations.

Materials and methods
The following sections describe the hardware and data set that served as a foundation for this work (Section 3.1) and the individual components of our image processing pipeline (Section 3.2).Our implementation and the pre-trained models can be found in our GitHub repository1 (Sellner and Seidlitz, 2022).

Image acquisition and data set
The HSI data was acquired at the Heidelberg University Hospital after approval by the Committee on Animal Experimentation of the regional council Baden-Württemberg in Karlsruhe, Germany (G-161/18 and G-262/19).HSI images were taken for 20 pigs that were managed according to the German laws for animal use and care and in agreement with the directives of the European Community Council (2010/63/EU).Details on the animals, the performed anaesthesia and surgical procedures are available in (Studier-Fischer et al., 2021).

HSI camera system
The HSI camera system Tivita ® Tissue (Diaspective Vision GmbH, Am Salzhaff, Germany) was used to acquire the HSI data.In a push-broom fashion, it captures hyperspectral images with a spectral resolution of approximately 5 nm in the spectral range between 500 nm and 1000 nm, resulting in datacubes with a size of 640 × 480 × 100 (width w × height h × number of spectral channels c).The camera system images an area of approximately 20 × 30 cm.An imaging distance of about 50 cm is ensured through an integrated distance calibration system composed of two light marks that overlap if the distance is correct.The image acquisition takes approximately seven seconds.In addition to the HSI datacubes, the camera system estimates tissue parameter images (TPI) that include oxygenation (tissue oxygen saturation (StO 2 )), perfusion (near-infrared perfusion index (NPI)), water content (tissue water index (TWI)) and hemoglobin content (tissue hemoglobin index (THI)) from the HSI datacubes.Furthermore, RGB images are reconstructed from the HSI data by aggregating spectral channels capturing red, green and blue light, respectively.The underlying calculations are described in (Holmer et al., 2018).Figure 1 shows the reconstructed RGB image and TPI associated with an exemplary HSI datacube.More technical details on the hardware are available in (Holmer et al., 2018;Kulcke et al., 2018).

Data acquisition
In order to prevent distortion of the spectra from stray light, light sources other than the integrated halogen lighting unit were shut off during image acquisition and window blinds were closed.Motion artefacts were reduced in the following ways: (1) The camera was mounted on a swivel arm and the entire camera system was left untouched during image acquisition, thereby preventing camera motion.(2) Images were taken from still scenes without any movements of objects induced by the operating surgeon.Therefore, motion artefacts could only originate from natural sources such as respiration and heartbeat and are thus not very strong and limited to images of thoracic organs (cf.example images in Figure 8).Acquiring a fixed number of images with a fixed set of camera perspectives from a fixed set of situses per pig is infeasible in real-world surgery as no two surgeries are exactly the same.Tissues could be affected by a variety of complications that alter their state, such as inflammatory change or tissue trauma.Non-physiological tissues were excluded from the HSI acquisition, leading to variations in situses and number of images across pigs.The camera perspectives were chosen to allow for a good view of all organs of interest in the scene.
An overview of the data set is given in Figure 2. In total, 506 images from 20 pigs were acquired.For each organ, between 32 and 405 images were acquired from 5 to 20 individuals.To provide insights on characteristic organ spectra, an interactive figure is available in the supplementary materials.By selecting a pixel in an example image, the pixel spectrum as well as the mean spectrum and standard deviation across pixels of the corresponding organ class can be explored.

Image annotation
19 different classes were annotated, including two thoracic organs (heart, lung), eight abdominal organs (stomach, small intestine, colon, liver, gallbladder, pancreas, kidney, spleen) and one pelvic organ (bladder).In the case of kidney, images before and after removal of Gerota's fascia were taken and labelled kidney with Gerota's fascia and kidney, respectively.Furthermore, subcutaneous fat, skin and muscle tissue as well as omentum, peritoneum and major veins were annotated.Pixels that belong to any inorganic object (e.g.cloth, compresses, foil, tubes, metallic objects and gloves) were labelled background.This label is present on every image and the annotated areas cover on average 47 % (standard deviation (SD) 24 %) of an image.Additionally, pixels were labelled as ignore if it could not be decided to which organ they belong or if they belonged to an organic object other than the 18 organ classes.221 of 506 images contain this label and, on average, the annotated areas cover 2 % (SD 3 %) of the pixels in the 221 images.The ignore pixels were later excluded from our analysis.
The semantic annotations were performed by two different annotators using vector annotation tools provided on the annotation platform SuperAnnotate (SuperAnnotate, Sunnyvale, USA)2 .To ensure consistent labelling, all annotations were then revised by the same medical expert.
Imbalances in the number of images per class arose since some organs naturally occur more often in the field of view of other organs.For example on images of the gallbladder, the surrounding liver is always present, whereas not on all liver images the gallbladder is visible.Heterogeneity in the number of animals per organ arose from differences in the surgical procedure performed.For example, opening of the thorax, which is a highly invasive and demanding surgical procedure and thus associated with a significant mortality and prolongation of the surgery, was only performed for eight out of 20 pigs, making HSI data from heart and lung unavailable for the remaining 12 pigs.

Deep learning-based full semantic scene segmentation
Our approach to deep learning-based semantic HSI image segmentation is summarized in Figure 3.The following sections present our image processing pipeline in detail, including an overview of our input modalities (Section 3.2.1),our preprocessing of the HSI data (Section 3.2.2), the architectures of our deep-learning models (Section 3.2.3),our training setup (Section 3.2.4) and our approach to increase randomness in the data loading (Section 3.2.5).In general, our design choices with respect to model architectures and training setup were motivated by our comparative study.We aimed to have common model and training parameters across the different spatial models and modalities whenever possible (e.g.same hyperparameters and data splits).We explicitly avoided individual parameter tuning for each model to ensure a fair comparison and reduce computational costs.

Model input modalities
A primary purpose of this study was to investigate whether there is a benefit in using HSI data compared to RGB and TPI data for neural network-based fully automated organ segmentation.For simplicity, we will refer to these different input data types as input modalities although they were in practice all computed with the same camera.This reflects the fact that a future application leveraging semantic scene segmentation could be based on RGB images from a conventional camera, on the preprocessed HSI images of an HSI camera provider or on raw HSI spectra.As illustrated in Figure 3, we trained neural networks separately on all three input modalities for all studied levels of data granularity.RGB data reconstructed from the HSI data was available through the camera system.To study the organ segmentation performance on processed HSI data, the associated StO 2 , NPI, TWI and THI images were stacked, yielding a 640 × 480 × 4 (w × h × c) TPI cube that served as model input.

HSI data pre-processing
In order to remove the influence of sensor noise and convert the acquired HSI data from radiance to reflectance, the raw HSI datacubes were automatically corrected with a pre-recorded white and dark reference cube by the camera system as described in (Holmer et al., 2018).After exporting the HSI cubes from the camera system, the 1 -norm was applied to each pixel spectrum in order to account for multiplicative illumination changes that arise, for example, from fluctuations in the measurement distance.

Deep learning models
Our approach to semantic scene segmentation is presented in Figure 3 for the exemplary case of HSI input data.Implementation details are provided in the following paragraphs.
Pixel-based segmentation.The smallest possible spatial granularity of the input data is to use single pixel spectra, resulting in input feature vectors of length c = 100 in the case of HSI input data, c = 3 in the case of RGB input data and c = 4 in the case of TPI input data.Following our previous work in (Studier-Fischer et al., 2021), the deep learning architecture for the HSI input data is composed of three one-dimensional convolutional layers with 64 filters in the first, 32 in the second, and 16 in the third layer.Each convolution is performed with a kernel size of 5 and operates on a single pixel spectrum (first layer) or across the spectral features from all filters of the previous layers (second and third layer).Each convolutional layer is followed by an average pooling layer with a kernel size of 2 which operates on the spectral features separately for each filter.The output of the last convolutional layer is stacked and serves as input for two fully connected layers, with 100 neurons in the first and 50 in the second layer.For TPI and RGB input data, no convolutional operations along channels are feasible due to the small channel size.Instead, the network consists of three fully connected layers with 200 neurons in the first, 100 neurons in the second and 50 neurons in the third layer.The exponential linear unit (ELU) (Clevert et al., 2016) is used as an activation function and batch normalization is applied to the outputs of all layers except pooling layers.The class logits are calculated by a final linear layer, resulting in one organ label for each input pixel spectrum.The cross-entropy (CE) loss function is used to optimize the model during training.
The architecture was designed such that local information from neighbouring spectral bands and global information across the entire spectrum is aggregated while keeping a small network size of only 34,275 weights for HSI, 27,819 weights for TPI and 27,619 weights for RGB input data and thus being computationally efficient.The local information aggregation is achieved through the convolutional layers: a relatively small kernel size For superpixel-based classification, superpixel boundaries are segmented based on the reconstructed RGB images and each superpixel is then extracted from the pre-processed HSI cubes.A minimum enclosing bounding box is computed and areas not belonging to the superpixel are replaced with zeros.After reshaping, the superpixel cube is input to the U-Net encoder and the output is further processed by a classification head.For patch-based classification, patches of a fixed shape are extracted from the pre-processed HSI cubes and processed by the U-Net.Pixel-wise, superpixel-wise and patch-wise predictions belonging to the same image are aggregated to yield an image segmentation map.For image-based classification, the input and output to the U-Net are a whole pre-processed HSI cube and an image segmentation map, respectively.
was chosen to focus on local structures and by stacking three convolutional layers, a compromise between increasing the receptive field of the network and keeping the number of learnable weights small was achieved (Szegedy et al., 2016).The global context was learned by the fully connected layers.
To retrieve a segmentation map for an image, we predict a class label for each pixel in the image and then map the resulting labels back to the image positions.
Superpixel-based segmentation.Superpixels are regions of low spatial granularity that adhere to local boundaries which enclose pixels with similar features.As for the pixel-wise organ segmentation, the unsupervised clustering of superpixels turns the actual organ segmentation task into a superpixel-wise organ classification task.This is justified by the assumption that all pixels within a superpixel belong to the same organ class since superpixels are supposed to lie within the local boundaries of an organ.Superpixels are generated by the use of the simple linear iterative clustering (SLIC) algorithm on the reconstructed RGB data (Achanta et al., 2012).Prior to clustering, the image is smoothed with a Gaussian kernel of width 3 and then 1000 segments are computed in ten iterations while adaptively changing the per-superpixel compactness parameter (SLICO mode).For each superpixel, a minimum enclosing bounding box is computed and areas not belonging to the superpixel are replaced with zeros.To ensure one common input shape, superpixels are resized via bilinear interpolation to the shape 32 × 32 × 100 (w × h × c) for HSI, 32 × 32 × 4 for TPI and 32 × 32 × 3 for RGB input data.
The resized superpixel cubes are passed to an efficientnet-b5 encoder (Tan and Le, 2019) pre-trained on the ImageNet data set (Deng et al., 2009) using the library of Yakubovskiy (Yakubovskiy, 2020) 3 .We chose this encoder as it yields good performance while at the same time being economical in terms of the number of parameters leading to a low memory footprint and fast computation times.The output of the encoder network is passed on to a classification head consisting of a fully connected layer with 19 neurons for calculating the class logits.This way, the superpixel network shares the same architecture as the segmentation networks for the image and patch-based models with only minor modifications.
It is possible that not all pixels within one superpixel belong to the same organ class, for example due to inconsistencies at the border between organs.To account for this, we introduced the concept of fuzzy labels where we assigned a label vector of length O to each superpixel (e.g.O = 19 classes in our case).The fuzzy label vector stores the relative frequency of each class label considering all enclosed pixels inside the superpixel.The Kullback-Leibler divergence (Kullback and Leibler, 1951) between fuzzy labels and the softmax output is used as a loss function during training.
During inference, the argmax of the class logits is computed and the resulting label is assigned to every pixel position of the superpixel.Predictions of all pixels from all superpixels are combined to yield a segmentation map for an image.
Patch-based segmentation.Patches are regions of low spatial granularity that are extracted from images according to a fixed shape.They are generally more easily generated and more straightforward to use in neural networks than superpixels, for example because their rectangular shape matches with the rectangular kernel shapes of convolutional neural networks.In order to capture different degrees of granularity, patches of two different shapes are extracted: 32 × 32 × c and 64 × 64 × c.These sizes serve as intermediate steps between the superpixel and the image model in terms of spatial granularity (cf.Table 1).We use patch sizes which are a power of two so as to easily integrate them with encoder architectures which halve the input shape multiple times.The number of generated patches per image corresponds to the number of patches that could have been generated via a grid-based tiling.
The patches are passed to a U-Net with an efficientnet-b5 encoder pre-trained on the ImageNet data set (like the superpixel network).Dice loss and CE loss are calculated during training based on all pixels in the batch and equally weighted to compute the loss function.While each misclassified pixel contributes equally in the computation of the CE loss, misclassified pixels belonging to an organ class of smaller image area contribute more to the dice loss than misclassified pixels belonging to a dominant class (e.g.background).By computing a weighted sum of both loss terms, the network training can benefit from the respective advantages.
During inference, images are divided into a grid of nonoverlapping patches of the corresponding patch size.In cases where an image dimension is not an integer multiple of the patch dimension, missing image regions are zero-padded.For each patch, the network yields a segmentation map.The segmentation maps of all patches of one image are combined to yield an image segmentation map.Segmentations belonging to previously zero-padded image regions are cropped.
Image-based segmentation.Entire images offer a maximum of spatial granularity and we use them directly without any further adaptations of the image dimensions, i.e. the input tensors have a shape of 480 × 640 × 100 (w × h × c) for HSI, 480 × 640 × 4 for TPI and 480 × 640 × 3 for RGB.Equivalent to the patch-based segmentation, the images are passed to an efficientnet-b5 U-Net pre-trained on the ImageNet data set.Again, both dice and CE loss are equally weighted to compute the loss function.

Training setup
To prevent biases due to differences in the training setup when comparing the organ segmentation performance on input data for different levels of spatial granularity and modalities, the training setup was made as comparable as possible for all models.The following paragraphs describe the data augmentation, optimization and regularization methods that were consistently applied for all the different segmentation models.
Data augmentation.In order to increase the size and diversity of the available training data and thereby improve convergence, generalization and robustness on out-of-distribution samples, data augmentation is commonly applied in computer vision (Buslaev et al., 2020).For all models, the training data is augmented using the Albumentation library 4 with default settings (Buslaev et al., 2020) on a per-image basis before extracting pixels, superpixels or patches: images are shifted (shift factor limit: 0.0625), scaled (scaling factor limit: 0.1), rotated (rotation angle limit: 45°) and 4 https://github.com/albumentations-team/albumentationsflipped with a probability of p.All transformations are applied with a probability of p = 0.5 to reduce the computational data loading costs that are induced by extensive data augmentations.
Optimization.For all models, the Adam (Kingma and Ba, 2017) optimization algorithm and an exponential learning rate scheme are used (initial learning rate: 0.001, decay rate γ: 0.99, Adam decay rates β 1 : 0.9 and β 2 : 0.999).Training is performed for 100 epochs and stochastic weight averaging (Izmailov et al., 2019) is applied.To achieve a comparable training procedure across models, the available training budget should be equal for each model.To this end, the size of one epoch is defined as seeing 500 training images for image-based segmentation.For pixel-, superpixel-and patch-based segmentation, it was ensured that the total number of extracted pixels for each approach matches roughly the total number of pixels of 500 images.Such matching can only be roughly approximated because the epoch size has to be divisible by the batch size so that every worker in the data loader can contribute equally to every batch, cf.Section 3.2.5.
Recommendations for the optimal batch size are mixed (e.g.(Smith et al., 2018;Kandel and Castelli, 2020)).While smaller batch sizes can, for example, speed up the learning process (Montavon et al., 2012), larger batches are a better representation of the real population, potentially leading to more stable gradients and better batch statistics (Ioffe and Szegedy, 2015).As a compromise, we decided to maximize the batch size while choosing a large number of epochs to rule out a potentially slower learning process.In practice, the batch size is constrained by the available graphics processing unit (GPU) memory and depends on the network and input size.We determined the maximum batch size thus per model, resulting in larger batch sizes for smaller input spatial granularities (cf.Section 5.1 for further discussion).Table 1 gives an overview of the resulting epoch and batch sizes.During training, each model was evaluated after each epoch on the validation set of the respective training fold.The dice similarity coefficient (DSC) was calculated while respecting the hierarchical structure of the data, as described in more detail in Section 4.1.The final validation score was obtained by averaging the DSC values of three validation pigs, and this score was also used to determine the best model across all epochs.
Regularization.To avoid overfitting, dropout regularization is applied for the fully connected layers in the pixel models and in the superpixel classification head while using a dropout probability of 0.1.

Variability of results.
Training of neural networks is always subject to various sources of variation, some easier to control than others (e.g.seeding vs. hardware influences) (Pham et al., 2020).Especially in our study, in which we aim for a fair comparison between different spatial granularities and modalities, reduction of such sources of variation is important.While obtaining exactly reproducible results is only achievable at the cost of longer training times (e.g. by restricting oneself to slow deterministic operations or a single homogeneous hardware infrastructure) (Pham et al., 2020), we took several measures to reduce the variation.We controlled the weight initialization of the networks, the initialization of the workers responsible for data loading (which also affects the data augmentation) and the ordering in which training samples are presented to the network for the modalities (e.g.corresponding spatial models for different modalities receive patches from the same spatial locations and in identical order).To this end, we set a seed for the network initialization and the data loaders and fixed the number of workers on each data loader to 12 across all experiments.Since we had to run many training runs efficiently for this study, we did not enforce deterministic operations and we used our in-house cluster infrastructure which consists of an inhomogeneous hardware infrastructure with many different GPUs (e.g.GeForce GTX™ 2080 Ti or DGX™ A100 (Nvidia Corporation, Santa Clara, USA)).

Multi-image contribution in data loading
The models that work on part of an image such as the patch or pixel models feature an additional challenge when it comes to data loading.Due to the size of the data set, it is not feasible to load all images at the beginning of the training process as the HSI data easily exceeds memory limitations.Pre-computing a set of image parts is also not feasible as the number of instances can be very high (e.g. each image has 480 • 640 = 307,200 pixel instances).To overcome these issues, we used a custom data loading strategy outlined in Figure 4 where all training images are distributed in unique sets to each worker.Each worker processes its own set of images as an endless stream but only ever loads one image at a time to extract the relevant parts (pixel, patch or superpixel in our case).The image parts are directly stored in a shared memory location by the workers so that the main process only needs to move the data to the GPU without any further processing.The shared memory implements a ring buffer which gets filled by the workers in a cyclic manner where the next batch is filled as soon as the batch is moved to the GPU by the main process.Every worker contributes equally to the batch, meaning that the same number of image parts are added to the batch dimension by every worker.This increases randomness across images in one batch so that the batch distribution more closely resembles the data distribution, which is usually desired when training a deep learning network.Once a worker has finished extracting parts from an image, the next image is loaded without keeping the old image in the memory.This leads to a constant memory footprint for data loading during training that is only determined by the number of used workers.The ordering of the images is reshuffled after each training epoch.4: Data loading strategy designed to work with large data sets which ensures that batches are filled with image parts from multiple images.In this example, three workers generate random patches from their own set of images (mutually exclusive across workers), collect multiple patches in one batch part (boxes on the arrows) and store the result in the next free location in the ring buffer (non-transparent block) which is located in shared memory.The main process takes one of the next ready batches (transparent boxes) from the buffer and moves it to the graphics processing unit (GPU).I i refers to the i-th image in the data set, b j is the number of samples the j-th worker contributes to the batch and h, w, c denote the height, width and number of channels of an image, respectively.The assignment of images I i to the workers changes randomly from one epoch t to the next t + 1.

Experiments and results
The purpose of our experiments was to investigate what constitutes an adequate representation of HSI data for neural network-based organ segmentation (cf.Section 4.2) and determine whether there is a benefit in using HSI data over other modalities like RGB and TPI data (cf.Section 4.3).The underlying experimental setup is described in Section 4.1.

Experimental setup
We validated our framework on the data set described in Section 3.1.This section describes our train-test-split, validation metrics as well as how we assessed the quality of our reference annotations and how much training data was required.
Train-test-split.We split the data set comprising 20 pigs (506 images) into a train data set consisting of 15 pigs (340 images) and a disjoint test set consisting of five pigs (166 images).The five test pigs were randomly selected such that in both train and test set each organ class was represented by at least one pig.The test set remained untouched throughout model development.k-fold cross-validation was performed on the train data set with k = 5.The folds were generated such that the number of organ classes was maximized across validation folds.We refer to this traditional validation set obtained for each fold, which consists of three pigs that are not seen during training, as V unknown .Additionally, for each of the 12 pigs included in each of the five training sets, one random image was excluded from the training and used to compose a second validation set consisting of 12 unseen images per fold from pigs known from the training, referred to as V known .By comparing the model's performance on the two validation sets, its generalization capabilities towards new surgeries could be estimated.Such new surgeries comprise changes due to inter-pig variability as well as imaging-related changes (e.g.different imaging perspectives) and context-related changes (e.g.variations in the performed surgery, surgical phase during acquisition and visible instruments).The latter two sources of variation may also be present across different images of the same surgery, but may be more prominent across different surgeries.Once model architectures and parameters were finalized based on the performance on the validation set, we evaluated the segmentation performance on the hold-out test data set by ensembling the network predictions for each fold via averaging the softmax values.
Validation metrics.Since individual validation metrics do not reflect all clinical requirements (Yeghiazaryan and Voiculescu, 2018;Reinke et al., 2021), we combined several validation metrics with different strengths and weaknesses to obtain a more informative quantification of the model performance.The most commonly used validation metric for biomedical segmentation tasks is the DSC (Dice, 1945;Maier-Hein et al., 2018).It measures the overlap between a predicted object segmentation and the corresponding reference segmentation and is highly sensitive to the object size and insensitive to the object shape (Reinke et al., 2021).DSC values are available per class and are always in the range [0; 1] with 0 indicating that either there is no overlap between predicted segmentation and reference segmentation for the class or the class is present on the image but has not been predicted.A value of 1 for a class is obtained if the prediction overlaps perfectly with the reference segmentation.It should be noted that any binary segmentation task can be regarded as a pixel-wise classification tasks and that the DSC is identical to the F1-score, which is typically represented as a function of true/false positives/negatives.Furthermore, the DSC is very closely related to the Intersection over Union (IoU), which is commonly used as primary validation metric in the machine learning community.For the computation of the DSC, we used the MONAI framework 5 (Consortium, 2020).
Boundary-distance-based methods measure the dissimilarity between the predicted segmentation and reference segmentation in terms of distances between boundaries.In contrast to overlapbased metrics, boundary-distance-based metrics are insensitive to the object size and sensitive to the object shape.An example is the average surface distance (ASD) (Heimann et al., 2009) which for an organ o is the average of all distances between pixels on the predicted object segmentation border B o ML and its nearest neighbour on the reference segmentation border The ASD has the disadvantage that it is unbounded, yielding values in the range [0; ∞) with 0 in cases where boundaries of objects match exactly.Therefore, ASD values are generally harder to interpret.Furthermore, special attention must be placed on the case of missed classes (classes that are present in the reference annotations but were not predicted), as no natural limit is available (Reinke et al., 2021).Here, we decided to set the ASD value for a missed class to the maximum ASD obtained for the other classes on the same image.This introduced a potentially high and image-dependent penalty when a class could not be predicted in an image (cf.discussion on this point in Section 5.3).
The normalized surface dice (NSD) (Nikolov et al., 2021) estimates which fraction of a segmentation boundary is correctly predicted with an additional threshold τ related to the clinically acceptable amount of deviation in pixels.It is thus a measure of what fraction of a segmentation boundary would have to be redrawn to correct for segmentation errors.Instead of one common threshold τ, we used a class-specific threshold τ o for each organ class o since the difficulty of annotating varies between organs (e.g. an organ with a clear boundary, such as liver, is easier to precisely annotate than an organ with a fuzzy boundary, such as omentum).We adapted the NSD, which was originally invented for 3D segmentation maps, to our 2D segmentation maps in the following way: Instead of considering 3D segmentation surfaces, we considered 2D segmentation boundaries.For all pixels of the predicted segmentation boundary of organ The entire procedure was symmetrically repeated for For each class that appears in both the prediction and the reference segmentation the NSD o was then computed as: The NSD is bounded in the range [0; 1] with 0 indicating that either the boundary is completely off, with all distances being larger than the acceptable deviation τ o , or that the class is present on the image, but has not been predicted.1 is obtained for cases where no redrawing of the segmentation boundary is necessary since every deviation is below the acceptable threshold τ o .
One of the major challenges of the NSD is the determination of the organ-specific thresholds τ o (cf.discussion in Section 5.3).To this end, 20 randomly selected images (one image per pig such that all organ classes are represented by at least two images) were re-annotated by a second medical expert.Equivalent to the ASD, we computed distances between the boundaries of the original and the re-annotation for each organ o in each image i and averaged the results to obtain the image-and organ-specific threshold τ o i .If an organ was annotated in only one of the two corresponding images, no distances could be computed and the corresponding structure was ignored (cf.discussion in Section 5.3).We computed the class-specific distance threshold τ o by averaging the τ o i for the set of images I o where the organ o is present and τ o i could be computed: For each image, metric values were first separately computed for all classes annotated in the reference segmentation.These class-wise metric values were then averaged to obtain a perimage metric value.To account for the hierarchical nature of the data (following (Holland-Letz and Kopp-Schneider, 2020)), all metric values were first averaged for all images of one pig and these per-pig scores served as a basis for visualizations and model rankings.
We investigated the model ranking and its stability with respect to two sources of variability, namely the sampling variability and variability due to choice of validation metric.The ranking analyses were performed using the challengeR6 toolkit (Wiesenfarth et al., 2021).
The model ranking was determined in the following way: For each model, the average of the five per-pig metric values was computed and models were ranked according to these mean metric values.To assess the stability of the ranking with respect to different validation metrics, the ranking was performed separately for each metric and compared.To assess the stability of the model ranking concerning the sampling variability, rankings were performed repeatedly on 1000 bootstrap samples.Each bootstrap sample consisted of 5 cases randomly drawn with replacement from the set of five test cases (one metric value for each pig in the test set).Metric values were first averaged across all five test cases and then the rank of each model was determined based on this aggregate, resulting in 1000 ranks for each model (one rank per bootstrap sample, meanThenRank option in challengeR).
Quality of reference annotations.Crucial for any deep learning algorithm is the quality of the data, including the available reference annotations.It is known from previous studies that the variability between different human raters can be large (Joskowicz et al., 2019).To quantify the quality of our reference annotations, we used the same set of re-annotated images as described above and the annotations of the second medical expert for inter-rater estimations.Additionally, the medical expert who originally annotated our data set re-annotated the same set of 20 images for an estimation of the intra-rater variability.In both cases, we compared the re-annotations with the original annotation per image and calculated our three evaluation metrics.In contrast to the comparison of model predictions, we did not use pixels which the new annotator assigned to the ignore class (e.g.pixels for which an annotator was not sure).That is, the union of ignored pixels in reference and re-annotated segmentation map was computed and those pixels were ignored.

Required amount of training data.
To study the effect of the amount of training data on the performance of the different models, we randomly sampled n pigs from the set of 15 training pigs without replacement and varied n from 1 to 14. Training of the models was only performed on the images of the n sampled pigs without k-folds, while the performance was evaluated on the same test set described in Section 4.1 but only for eight organ classes and without ensembling.These eight organ classes (background, stomach, small intestine, colon, liver, spleen, skin and peritoneum) are the set of organs of which images are available for all 15 training pigs.This design choice avoids the problem of having a pig sampled during training that does not contain any of the target classes, and hence would not give indicative results for this experiment.To increase the stability of the results towards pig sampling variability, we repeated the experiment five times with different random pig selections.
Misclassifications.Misclassifications were analysed by computing the confusion matrix for the model of interest.To account for the hierarchical structure of the data, confusion matrices were first computed per pig and row-wise 1 -normalized by dividing every value by the total number of pixels of the respective organ in the reference segmentation.Thereupon, the row-normalized confusion matrices of all test set pigs were averaged to yield the model-wise confusion matrix.

Representation of HSI data
A primary purpose of our study was to investigate what is an adequate representation of HSI data both with respect to the segmentation performance, the required amount of training data and the capability to generalize towards new surgeries.The experiments performed to answer these research questions are outlined together with their results in the following paragraphs.
Model performance.Figure 5 shows the performance of the pixel-based (referred to as pixel), superpixel-based (referred to as superpixel), patch-based (referred to as patch_32 for an input shape of 32×32×c and patch_64 for an input shape of 64×64×c) and image-based (referred to as image) segmentation models for our three validation metrics DSC, ASD and NSD.Despite the fact that performance gaps for different spatial granularities are less prominent for HSI data than for RGB and TPI data, a consistent trend is visible for all levels of spectral granularity and validation metrics: the larger the spatial granularity of the input data, the better the segmentation performance.
Notably, the image-based HSI organ segmentation model performance is, on average, consistently comparable to predictions of a second human expert for all validation metrics.For the inter-rater variability, we obtained a DSC of 0.89 (SD 0.07), an ASD of 4.88 (SD 5.33) and an NSD of 0.80 (SD 0.08).The intrarater variability is better on all three metrics with a DSC of 0.91 (SD 0.05), an ASD of 4.74 (SD 5.04) and an NSD of 0.82 (SD 0.06).Across all 20 images, it occurred 8 times in the inter-rater and 6 times in the intra-rater agreement evaluation that organ classes that were not annotated in the reference segmentation map were newly assigned to an image.7 times in the inter-rater and 4 times in the intra-rater agreement evaluation, organ classes that were annotated in the reference segmentation map were missing in the re-annotations.Differences in the ignore class occurred for 14 and 14 out of the 20 images in the inter-rater and intra-rater comparison, respectively.In total, for 34,063 px in the inter-rater and 37,397 px in the intra-rater comparison, the label ignore was assigned in the re-annotation, but an organ label had been assigned in the reference annotation or vice-versa.
We described our strategy for the reduction of software-and hardware-related variability of our results in Section 3.2.4.To get an estimate of the controlled source of variation, we ran the image-based HSI model 5 times with different seeds and found the DSC to be in the range [min; max] = [0.893;0.900], the ASD in [5.542; 6.948] and the NSD in [0.796; 0.806] on the test set.The controlled software-and hardware-related variability between pigs in the test set is thus at least one magnitude smaller than the inter-pig variability.
Model ranking.In Figure 6, we illustrate the ranking stability with respect to the sampling variability for the DSC (results for the ASD and NSD can be found in Figure A.2 and Figure A.3,respectively).The bootstrapped ranking is relatively stable with at least the first and last two ranks being very clear (more than 90 % of bootstraps resulting in the same rank) for all metrics.For the boundary-distance-based metrics, the number of models with a clear ranking is even larger and for the NSD, all ranks vary by a maximum of ±1 rank around the median.Fig. 6: Ranking stability of the different models and modalities based on bootstrap sampling on the test set with the dice similarity coefficient (DSC).The area of each blob at position (A i , rank j) is proportional to the relative frequency of algorithm A i (model#modality combination) achieving rank j across 1000 bootstrap samples with one bootstrap sample consisting of 5 pig-level metric values.The median rank for each algorithm is indicated by a black cross and the black lines indicate the 95 % quantile of the bootstrap results.This plot was generated with the challengeR toolkit (Wiesenfarth et al., 2021).Ranking results for the average surface distance (ASD) and normalized surface dice (NSD) can be found in To assess the stability of the ranking with respect to different validation metrics, rankings for the three metrics were compared as depicted in Figure 7. Across all modalities and metrics, the ranking of the spatial models is always (from best to worst): image, patch_64, patch_32, superpixel and pixel.Hence, more context always improves the segmentation performance irrespective of the modality and metric.Generally, rankings for the different metrics are in close agreement: The image-based segmentation of HSI data always ranks first, while the last five ranks are always taken by superpixel#TPI, superpixel#RGB, pixel#HSI, pixel#TPI and pixel#RGB (from best to worst).The largest difference in ranking across metrics occurs for the superpixel#HSI model, which achieves rank six for the ASD compared to rank nine and ten for the DSC and NSD, respectively.Each line visualizes how the rank of the corresponding model#modality changes when evaluating with different metrics.This plot is based on ranking data from the challengeR toolkit (Wiesenfarth et al., 2021).

% of bootstraps
Visual segmentation quality assessment.Example predictions for the five spatial models on the HSI data are shown in Figure 8.Based on the image DSC averaged over all five models, the images corresponding to the 5 % quantile, 50 % quantile and 95 % quantile were selected, representing examples for overall bad, intermediate and good segmentation performances, respectively.For pixel-based segmentation predictions, boundaries are more incomplete and scattered than for the other models.In some of the patch-based segmentation examples, sharp vertical and horizontal edges are visible where adjacent patches connect, whereas, for the superpixel-based segmentation, edges appear wiggly due to misclassified superpixels in the proximity of organ boundaries.

Required amount of training data.
A potential benefit when using input data of smaller spatial granularity is that more training samples are available, e.g. one image corresponds to 307,200 training samples in the case of pixel-based segmentation, but only to a single training sample in the case of image-based segmentation (Table 1).Figure 9 shows the development of the DSC over the number of training pigs for the different models.Results for ASD and NSD can be found in Generalization capability.We wanted to study how well our neural network-based organ segmentation generalizes towards new surgeries.To obtain an initial estimate for the generalization capabilities, we compared the segmentation performance achieved on the validation set containing only unseen pigs V unknown to those obtained on the validation set composed of unseen images of seen pigs V known (cf.Section 4.1).Figure 10 (a) shows the average DSC on V unknown and V known , respecting the hierarchical structure of the data, for the 5 different levels of spatial granularity over all 100 epochs throughout training.Performance is generally better for V known .For all modalities, the gap in performance between V known and V unknown is smallest for the pixel-based segmentation.

Model input modalities
We wanted to investigate whether there is a benefit in using HSI data compared to RGB and TPI data.
Segmentation performance and ranking.Figure 5 shows that for all metrics and all models, the average segmentation performance on HSI data is consistently better than the performance on RGB and TPI data.While the performance gap is largest in the case of pixel-based segmentation, it decreases with an increased level of spatial granularity and is minimized in the case of image-based segmentation.Nevertheless, a model based on HSI data ranks better than the same model based on TPI and RGB data regarding sampling and metric stability (Figure 6 and Figure 7, respectively).In most cases, a model based on TPI data ranks better than the same model based on RGB data.However, compared to the gap in performance/ranking between HSI-based and non-HSI-based segmentation, the difference in performance/ranking between TPI-and RGB-based segmentation is usually smaller.
Generalization capability.When comparing the generalization capabilities towards new surgeries for different input modalities in Figure 10, it can be observed that across all modalities, performance gaps between V unknown and V known are considerably smaller for the pixel-based segmentation.Overall, the RGB pixel model yields the best generalization performance.Regarding the progress over training time, there are striking differences across modalities: while the DSC changed relatively smoothly for TPI and RGB data, the training was noisier for HSI data.This holds especially true for the pixel model whereas the training was only slightly noisier for the image model.Additionally, training converged faster for the TPI and RGB modalities whereas HSI benefited more from longer training times.
Misclassifications. Figure 11 shows the confusion matrix of the best model (image model on the HSI modality) for the test set described in Section 4.1.For 8 out of the 19 classes, on average more than 95 % of the organ pixels were correctly identified.For major vein, the recall was lowest with only 57.1 % of the organ pixels being correctly identified.Confusion matrices for the image model on the TPI and RGB modalities are shown in

Discussion
With the present study, we addressed two important and so far unanswered research questions in deep learning-based organ segmentation on HSI data, namely: (1) What is the optimal spatial granularity of the input data, both with respect to segmentation performance, required amount of training data and capability to generalize towards new surgeries?(2) Is there a benefit in using HSI data over other modalities such as TPI and RGB data?The key insights are: 1. Performance: Consistently across all our validation metrics and modalities, the segmentation performance improved with larger spatial granularity of the input data.For imagebased segmentation on HSI data, the performance of our model was comparable to that of a second medical expert.2. Required amount of training data: Across all studied numbers of training pigs, the image-based segmentation on HSI data performs comparable or better than the other models.3. Generalization capability: Pixel-based segmentation generalizes better towards new surgeries than image-based segmentation.For larger spatial context, the generalization capabilities are comparable across modalities.4. Benefit of HSI data over TPI and RGB data: For all models and metrics, segmentation performance on HSI data is better than on TPI and RGB data.However, the gap in performance decreases with increased spatial context from an improvement in the DSC of 80.4 % for the pixel, 11.9 % for the superpixel, 8.1 % for the patch_32, 5.5 % for the patch_64 to only about 2.8 % for the image model, compared to RGB.
The following sections provide a detailed discussion of the results of our experiments (Section 5.1), design choices for our models and validation (Section 5.2) as well as limitations of our study (Section 5.3).We further discuss the impact of our work (Section 5.4), give an outlook on the future (Section 5.5) and close with our conclusions (Section 5.6).

Interpretation of results
Segmentation performance across models.We found that, consistently across all modalities and metrics, the segmentation performance improved with larger spatial granularity of the input data.This raises the question whether there are practical use cases to justify using input data of small spatial context despite the apparent reduction in performance.
A potential benefit could be an improved generalization to out-of-distribution data with respect to scene geometry for a smaller input spatial granularity.We found in Figure 10 that the performance to generalize towards new surgeries was by far best for the pixel-based models.This finding may be explained in the following way: While a pixel-based model only needs to generalize for variation in the spectral dimension, the spatial models need to additionally account for variations of the spatial context.With higher spatial context of the input data, segmentation models may thus be more sensitive to out-of-distribution scene geometries.Such changes in scene geometry may for   example result from partially or completely resected organs or changes in the acquisition hardware (e.g. using a laparoscopic camera instead of an open system).Since radical changes in the scene geometry were not covered by our data set, future work should study this aspect (cf.Section 5.5).
Another advantage of reduced input spatial granularities is indicated by Table 1: Smaller input sizes generally enable larger batch sizes.Besides the advantages mentioned in Section 3.2.4(smoother gradients and better batch statistics), this enables model improvements that benefit from a batch distribution close to the distribution of the entire data set.For example, confounders in HSI data potentially lead to an overestimation of machine learning performance (Dietrich et al., 2021).However, some promising techniques to achieve confounder-invariant representations (e.g.metadata normalization (Lu et al., 2021)) require large batch sizes and would thus benefit from a pixel-based instead of an image-based model.
Ranking stability.Across metrics, the rankings were generally in close agreement, with only one exception visible in Figure 5 (b) and Figure 7: Despite the smaller spatial context, superpixel-based segmentation of HSI data ranks better than patch-based segmentation of RGB and TPI data for the ASD metric.This may be attributed to the boundary-sensitive nature of the ASD metric.While the superpixel boundaries match the reference segmentation very well with an average lower bound for the ASD of 2.91 (SD 0.74) if all superpixels were correctly classified (cf.Section 5.3), we see from the example predictions in Figure 8 that sharp vertical and horizontal edges can be present in the patch-based predictions.These are due to our choice of aggregation scheme in which an image segmentation prediction is assembled from non-overlapping patches (cf.Section 5.3).The resulting incomplete and scattered boundaries are especially penalized by boundary-distance metrics such as the ASD, whereas well matching boundaries are rewarded (cf. Figure 7).Generalization capability.We found in Figure 10 that the organ segmentation performance is generally better for V known , which makes sense since the gap between different images of the same pig can be expected to be smaller than the domain gap between different images of different pigs.The improved generalization capability for pixel-based models was discussed above (cf.Section 5.1).We further saw in Figure 10 that there are striking differences in the noisiness of the training: While the DSC is changing relatively smoothly for TPI and RGB data, the HSI training curve is much noisier.This holds especially true for the pixel model whereas the training was only slightly noisier for the image model.This may be attributed to the larger input feature dimension of HSI data compared to the other modalities.
Segmentation performance across modalities.We observed that increasing the input spatial context of the RGB and TPI modalities led to the gap in segmentation performance between using HSI data and TPI/RGB data decreasing, potentially since the model can use additional information from the spatial context, which compensates for the lack of detail in the spectral dimension.However, the smaller performance gap for the increased spatial context may also be attributed to the quality of the provided expert annotations since the performance of our HSI models approaches the inter-rater variability.Therefore, future work should re-evaluate the performance for different modalities on improved annotations (cf.Section 5.5).
Given the possibly only small boost in segmentation performance on HSI data compared to RGB images, further advantages and disadvantages of HSI camera systems should be considered when choosing the optimal imaging modality for scene segmentation.Apart from the ability to differentiate tissue classes, the detailed spectral information captured by HSI systems creates additional opportunities in surgical guidance, e.g. by revealing functional tissue information such as perfusion state or diagnosing diseased tissue (Fei, 2020;Zhang et al., 2020).Disadvantages of the HSI system used in this study are the, compared to RGB imaging, long acquisition time of approximately seven seconds, high cost and limited availability.However, HSI is an emerging field and improvements can be expected in future realisations (e.g.video-rate intraoperative MSI, which was unavailable a decade ago, is nowadays possible (Ayala et al., 2021)).
Regarding the comparison between TPI and RGB data, we saw in Figure 5 that in most cases, a model based on TPI data ranks better than the same model based on RGB data, indicating that the manually derived TPI data contains relevant information for the segmentation task.
Misclassifications.The confusion matrix in Figure 11 revealed that the largest part of the confusion (32.2 %) is between peritoneum and major vein, which can be explained by the neighbouring relation of the two organs and the very limited amount of training data available for major vein since it could only be imaged for 32 images (cf. Figure 2) and visible parts are generally of small size with on average 4192 px (SD 3621 px).Other more frequently misclassified classes are either classes that are already difficult to annotate due to fuzzy boundaries (e.g.omentum, peritoneum, subcutaneous fat) or unclear distinction (e.g.kidney with Gerota's fascia and peritoneum).Generally, most of the misclassifications in the confusion matrix occur between classes that are neighbours in the images (e.g.heart instead of lung and vice versa, stomach instead of omentum and vice versa, liver instead of gallbladder, background instead of skin, etc.) which may be attributed to errors in the predicted segmentation boundaries.This assumption is supported by the segmentation examples in Figure 8.

Design choices
Class imbalances.Our data set described in Section 3.1 is highly imbalanced in terms of number of images (Figure 2) as well as number of pixels per class.We therefore tested different countermeasures against imbalanced data sets during model development.More precisely, we tried weighted loss functions and oversampling strategies.
For the loss functions, we first calculated a weight ω o for each class o based on the number of pixels for this class across all images in the respective fold of the training set.We chose ω o in such that majority classes received a low and minority classes a high weight.We then computed the loss value for the whole batch as a weighted average of the individual class losses.Even though we tried several approaches for calculating the class weights (e.g.inverse proportional weights), none of them could consistently improve the results on our validation data compared to having no weights at all.
In a similar manner, we evaluated oversampling strategies by sampling instances from minority classes more often compared to instances of majority classes based on the same weights as used for the loss function.We tried this for the pixel-and patchbased methods, but the resulting DSC performance was always worse compared to the default sampling on the validation data.
Metrics.We followed recommendations of segmentation challenges and used more than one metric to evaluate our results and to perform the ranking (Roß et al., 2021;Antonelli et al., 2021).We used an overlap-based measure (DSC), a distancebased measure (ASD) and a measure for quantifying annotation uncertainty (NSD).Each metric analyses specific properties of the predicted segmentation map and a model may be biased towards a metric, e.g. as discussed in Section 5.1, the pixel model performs best under the DSC metric (cf. Figure 7), whereas the superpixel model performs best under the ASD.Hence, only a combination of multiple metrics can give insights into the overall performance of a model.
One particular design choice which affects all metrics is the case of missing classes in the prediction.Common evaluation frameworks such as MONAI usually return nan or inf values in these cases, leaving the aggregation to the user.While being less problematic for the DSC and NSD metric as they are bounded, the manner of handling missing classes is a crucial design choice for the ASD which is unbounded.There are several options such as ignoring these cases completely or imposing a fixed penalty which e.g.depends on the image diagonal.We set missing classes to the maximum distance of the other classes, which introduces a penalty without producing outliers.However, it has the disadvantage that the value for the missing class depends on the prediction of the other classes in the image.
Using the NSD requires setting a (class-specific) threshold and for this, re-annotations of a subset of the images by at least one additional human rater are needed (Nikolov et al., 2021).This subset is usually small compared to the data set size (e.g.20 of 506 images in our case) since annotations of more images or even annotations from multiple annotators are often not feasible.Hence, the thresholds depend mainly on this subset and errors in these annotations have a high influence on the results.Missing classes in the re-annotation are also problematic since corresponding distances cannot be computed so that this part of an image does not contribute to the threshold.In the original formulation, this problem did not occur since an organ annotation was created separately for each known organ class (Nikolov et al., 2021).In our case, however, the annotators did not know which organ classes were present in the image.
An additional problem is the choice of the aggregation function of the class distances per organ.For each image pair which both experts annotated, we computed the distances between the two annotations for each organ, applied an aggregation function and finally averaged the aggregated values across pigs and organs (respecting the hierarchical structure).In Figure 12, we show several thresholds τ o resulting from different aggregation functions.First, we see that there is a high variation across organ classes with e.g.large differences between the two annotations for peritoneum and low deviations between annotations of bladder as well as high variations across pigs (e.g. the standard deviation for the mean aggregation in case of skin is 2.5 times higher than the mean itself).This underpins that not for every organ it is equally difficult to determine its boundary.In general, the agreement of our two expert annotations is rather low indicating that even for medical experts the decision of which pixel belongs to which organ is neither easy nor unambiguous.Second, the choice of the aggregation function is crucial for the thresholds.In their original work, Nikolov & Blackwell & Zverovitch et al. used the 95 % quantile of the distances (Nikolov et al., 2021), but this led to very high thresholds even above 80 px in our case, so we decided to use the mean instead which results in more moderate distances always below 20 px.However, it is important to note that other aggregation methods like the median or another quantile would also have been possible.
Hyperparameters.In our study, we chose default parameters whenever possible and fixed all parameters across algorithms (e.g.learning rate) or selected them based on the same criteria (e.g.memory consumption in case of the batch size).However, hyperparameters can impact the model performance and due to the different input sizes and network architectures, it is possible that our hyperparameter settings are not optimal for all algorithms.To find the optimal hyperparameter set for each algorithm, many different training runs would be required.Since training all our 15 algorithms (five models and three modalities) for all five folds already required about 292 h GPU training time (about 32 kg of CO 2 equivalent if training on a GeForce GTX™ 2080 Ti (Nvidia Corporation, Santa Clara, USA) (Lacoste et al.,  4).

2019) 7
) just for a single hyperparameter setting, an extensive hyperparameter tuning would come at extremely high resource costs.
To demonstrate the potential impact of our design choice on the algorithm ranking, we exemplarily evaluated a small hyperparameter search for the learning rate.We selected the learning rate since it is considered to be one of the most important hyperparameters (Montavon et al., 2012).For each algorithm, we trained two additional networks: One with a lower learning rate of η −1 = 0.0001 and one with a higher learning rate of η +1 = 0.01, compared to the default learning rate of η 0 = 0.001.For each algorithm individually, we determined its optimal learning rate, that is, the one learning rate among η −1 , η +1 and η 0 that yielded the largest average DSC on the validation data.We then repeated the overall ranking across algorithms (Figure 6) on the test data, but instead of using a single fixed learning rate, the network corresponding to the optimal learning rate was selected for each algorithm.For all algorithms apart from the pixel-based models, the optimal learning rate was identical to the default learning rate.However, even for the pixel-based models, the average improvement in the DSC was only minor (< 0.007 for all pixel models), resulting in an overall identical to that in Figure 6 with only minor differences in the ranking across the different bootstrap samples as shown in Figure 13.This supports the validity of our study results even without an extensive algorithm-specific hyperparameter tuning.Fig. 13: Differences in ranking across bootstrap samples when optimizing the learning rate.For each of the 1000 bootstrap samples, the ranking based on the default learning rate η 0 is compared to the ranking based on an optimal learning rate per algorithm η which was selected among the default (η 0 = 0.001), a higher (η +1 = 0.01) and a lower (η −1 = 0.0001) learning rate according to the highest average dice similarity coefficient (DSC) on the validation set.The algorithms on the x-axis are sorted according to the overall ranking (from left to right: best to worst, cf. Figure 6), which was not affected by the different choice of learning rate.The difference in the rank ∆ rank was computed for each bootstrap and algorithm.Each value in the matrix depicts the fraction of bootstraps with a certain ∆ rank for a certain algorithm.
Fraction values of 0 are not shown to preserve clarity.

Limitations
Superpixels.There are two main assumptions behind our superpixel classification approach: (1) superpixels consist of homogeneous regions with every pixel inside a superpixel belonging to the same organ class and (2) superpixel borders follow the borders of the organs (instead of crossing them).We evaluated these assumptions and created a performance limit for our superpixel model by taking the modal value of all pixel labels inside a superpixel and using this modal value as a superpixel label (Figure 14 (a)).This directly incorporates the annotation labels and hence serves as a performance limit for our model.
In Figure 14 (b), we show the results of the performance limit for the different metrics on our data set.The performance limit comes closest to a perfect segmentation for the ASD with an average of 2.91 (SD 0.74), followed by the DSC and NSD with average values of 0.92 (SD 0.03) and 0.74 (SD 0.04), respectively.The ASD is very low and has a small standard deviation because the distances between the annotation and performance limit are constrained by the superpixel size and since each superpixel contains roughly 300 px, large distances are unlikely (with √ 300 ≈ 17.32 px).For DSC and NSD the gap to a perfect segmentation is larger, which indicates that the superpixels are not in perfect agreement with the annotations; this can mainly be attributed to the borders between organs.It is possible that either the superpixels or the annotations (or both) are not located at the "true organ border" and any deviation leads to a reduced overlap (DSC) or the necessity to redraw some superpixel borders to be in alignment with the annotation (NSD).The NSD is lower than the DSC because the acceptable threshold τ o is very low for some organs (cf. Figure 12) so that pixels with minor deviations already influence the NSD.
There is a gap between the performance limit and our model predictions for all metrics with average scores of 0.82 (SD 0.06), 16.51 (SD 9.23) and 0.61 (SD 0.09) for the DSC, ASD and NSD, respectively.Assuming that the features of the superpixels are discriminative enough, this indicates that our superpixel model could be improved.This is amplified by our design choices of making the models and modalities as comparable as possible without specific optimizations for a single model (e.g. when we introduced augmentations, we added them to all models).However, the image HSI model is not too far away from the performance limit (with average values of 0.90 (SD 0.04), 6.19 (SD 3.20), 0.80 (SD 0.07) for the DSC, ASD and NSD, respectively) so that the gain in investing further in the superpixel model development is low.Data set.Non-standardized image acquisition is an important challenge in open surgery.Instead of following a standardized imaging protocol, we thus decided to acquire images in a fashion reflecting intraoperative reality (i.e.number of images and situses were decided on individually for every pig to mimic natural variations in the surgery performed).However, we are aware that this does not cover other, additional sources of variation present in a real-world open surgery setting, such as perfusion changes of certain tissue areas, presence of tissue pathologies (e.g.fatty liver), image parts being covered by body fluids (e.g.blood) and interventions being performed.Furthermore, we found that for all studied imaging modalities, image-based models performed similarly to a second human rater.Additional performance improvements of our models may thus be limited by the quality of our reference annotations.
Model comparison.We designed our models to be as comparable as possible by, for example, using the same U-Net architecture or a comparable epoch size.This was important for our study as we wanted only the size of the input and the modality to be the main sources of variation instead of model-specific design choices.
For the same reason, we also did not apply any postprocessing to the network output even though models like the pixel model could benefit from this, for example via morphological operations.For inference, we restricted the spatial context of each model to its defined input size.For example, the patches which the patch_32 model sees during inference are explicitly non-overlapping as this would increase the spatial context beyond a resolution of 32 × 32, making a comparison across different spatial resolutions (e.g.superpixel vs. patch_32) harder.However, we also see in Figure 8 that this design choice produces visible artefacts, for example at patch borders in the patch-based segmentation.
Similarly, we did not introduce any HSI-specific model variations or comparisons with existing task-specific networks to avoid biasing the results towards certain inputs.Instead, we used, for example, the same U-Net for the RGB and HSI modalities.We see in the benchmarking of Figure 5 that the basic U-Net was able to leverage the additional spectral information, but the performance gap is more dominant in the patch-based and less in the image-based segmentation.This may be attributed to the U-Net leveraging the spectral information more poorly if more spatial context is available, but could also be influenced by our limitations in the annotation quality.

Impact
Many prior reports on organ segmentation in surgery have focused largely on conventional RGB imaging and minimally invasive surgery.While previous approaches to segmenting organs on HSI data have been based on a variety of input data sizes including pixels, superpixels and patches, neither the optimal granularity of the input data has been determined nor has a clear benefit in using HSI data over RGB data been shown.Furthermore, leveraging existing approaches to segmenting organs in an open surgery setting (in contrast to minimally invasive surgery) poses several challenges such as the large complexity and variability in the surgical scene due to non-standardized image acquisition, inter-subject variability and complex threedimensional relationships between multiple soft tissues.
We addressed these gaps in the literature in a validation study of unprecedented size, both in number of pigs and number of classes.Despite the challenges imposed by the open surgery setting, we could segment organs with an average DSC of 0.90 (SD 0.04) in an image-based approach that is close to the inter-rater variability with an average DSC of 0.89 (SD 0.07).Consistently over all input sizes, the segmentation based on HSI outperforms the segmentation based on RGB data.We are the first to clearly show a benefit in using HSI data compared to RGB data for organ segmentation, thus offering a foundation that further segmentation studies on HSI data can build upon.We found that the segmentation performance decreases with less spatial context of the input data, whereas the generalization capability towards new surgeries increases.These findings deliver valuable guidance for future work.With the inference of our image HSI network taking only about 122 ms per image on a GeForce RTX™ 3090 (Nvidia Corporation, Santa Clara, USA), organ segmentation maps can be obtained in real-time.

Outlook
The following aspects should be addressed in future work: • Due to our non-standardized image acquisition, some typical variations regarding situses, perspectives and number of images were covered in our data set.However, we could not cover all sources of variation that are present in real-world open surgery.Mainly, we imaged healthy and surgically unaltered pig organs and thus did not account for perfusion changes of certain tissue areas, presence of tissue pathologies (e.g.fatty liver), image parts being covered by body fluids (e.g.blood) and interventions being performed.Such out-of-distribution situations may alter the performance and resulting rankings of our algorithms.For example, changes in the scene geometry (e.g.partially or completely resected organs or changes in the acquisition hardware) may affect models with larger spatial granularity of the input data stronger than pixel-based segmentation models.These aspects should be studied in the future.Since the ultimate goal is real-world open surgery in humans, the generalization of our findings from pigs to humans should be verified.• In the present study, we focused on organ segmentation.
A natural next step would be to expand our work to full scene segmentation by introducing additional instruments and other non-biological object classes as well as medical classes (e.g.sutures).• Our study on the intra-and inter-rater variability of our data set showed that organ segmentation is not an easy task, even for medical experts.The quality of our reference annotations may impose an upper limit to the performance of our models with our image-based models approaching the inter-rater variability.Future work may thus also focus on improving the reference annotations, e.g. by having the data annotated by multiple medical experts and seeking for consensus agreement in cases of differing annotations.• We studied three different levels of spectral granularity of the input and found that the segmentation performance on HSI data with c = 100 channels was better than on two aggregations thereof, namely RGB images with c = 3 and TPI data with c = 4 channels.A natural next step would be to study whether the segmentation performance could further be improved by a selection of 4 < c < 100 input channels.Some spectral channels may be more informative than others and several studies on channel selection from HSI data reported a performance boost in the downstream task (Trajanovski et al., 2021;Ayala et al., 2022).Furthermore, knowledge of the optimal spectral channels for organ segmentation could guide the development of MSI systems that often provide higher image acquisition speed and/or spatial resolution.

Conclusion
In a comprehensive validation study, we showed that unprocessed HSI data offers a huge benefit compared to RGB data or processed data of the camera provider for organ segmentation with the image-based HSI model approaching inter-rater performance.We conclude that HSI could become a powerful imaging modality for fully-automatic surgical scene understanding with many advantages over traditional imaging including the ability to recover additional functional tissue information.The performance of five spatial models and three modalities (RGB, tissue parameter images (TPI) and hyperspectral imaging (HSI)) is shown for three different metrics on the validation set.The dotted line and the shaded area represent the mean and standard deviation range, respectively, of the inter-rater results for the corresponding metric.Each boxplot shows the quartiles of the metric value distribution with the whiskers extending up to 1.5 times the interquartile range, and the median and mean as solid and dashed line, respectively.Each dot represents one test pig.Results on the test set can be found in Figure 5.

Fig. A.2:
Ranking stability of the different models and modalities based on bootstrap sampling on the test set with the average surface distance (ASD).The area of each blob at position (A i , rank j) is proportional to the relative frequency of algorithm A i (model#modality combination) achieving rank j across 1000 bootstrap samples with one bootstrap sample consisting of 5 pig-level metric values.The median rank for each algorithm is indicated by a black cross and the black lines indicate the 95 % quantile of the bootstrap results.This plot was generated with the challengeR toolkit (Wiesenfarth et al., 2021).Ranking results for the dice similarity coefficient (DSC) and normalized surface dice (NSD) can be found in Figure 6

Fig. A.3:
Ranking stability of the different models and modalities based on bootstrap sampling on the test set with the normalized surface dice (NSD).The area of each blob at position (A i , rank j) is proportional to the relative frequency of algorithm A i (model#modality combination) achieving rank j across 1000 bootstrap samples with one bootstrap sample consisting of 5 pig-level metric values.The median rank for each algorithm is indicated by a black cross and the black lines indicate the 95 % quantile of the bootstrap results.This plot was generated with the challengeR toolkit (Wiesenfarth et al., 2021).Ranking results for the dice similarity coefficient (DSC) and average surface distance (ASD) can be found in Figure 6 Fig.1: In this paper, we address the challenge of semantic scene segmentation based on hyperspectral imaging (HSI) data.We put a particular focus on the optimal representation of the data, both in the spectral domain (here referred to as input modality) and in the spatial domain (here referred to as spatial granularity).Modalities comprise hyperspectral imaging (HSI) (c = 100 channels) as well as corresponding RGB data (c = 3) and tissue parameter images (TPI) (c = 4).The latter includes tissue oxygen saturation (StO 2 ), near-infrared perfusion index (NPI), tissue water index (TWI) and tissue hemoglobin index (THI).Each combination (input modality, spatial granularity) yields an organ segmentation map for the whole image after potential aggregation of image parts.The corresponding neural network architectures are illustrated in Figure3.

Fig. 2 :
Fig.2: Overview of the data set.506 images from 20 pigs have fully semantic annotations for 18 different organ classes and background.Each pig is represented by a unique color and denoted by PXX, where XX represents a unique pig identifier.Since background is naturally present on each image, the overview focuses on the number of images and pigs per organ.

Fig. 3 :
Fig.3: Overview of our deep learning pipeline for different spatial granularities based on hyperspectral imaging (HSI) input data.For pixel-based classification, pixel spectra are extracted from the pre-processed HSI cubes and input to the spectrum network, yielding a pixel-wise class prediction.For superpixel-based classification, superpixel boundaries are segmented based on the reconstructed RGB images and each superpixel is then extracted from the pre-processed HSI cubes.A minimum enclosing bounding box is computed and areas not belonging to the superpixel are replaced with zeros.After reshaping, the superpixel cube is input to the U-Net encoder and the output is further processed by a classification head.For patch-based classification, patches of a fixed shape are extracted from the pre-processed HSI cubes and processed by the U-Net.Pixel-wise, superpixel-wise and patch-wise predictions belonging to the same image are aggregated to yield an image segmentation map.For image-based classification, the input and output to the U-Net are a whole pre-processed HSI cube and an image segmentation map, respectively.
Fig.4: Data loading strategy designed to work with large data sets which ensures that batches are filled with image parts from multiple images.In this example, three workers generate random patches from their own set of images (mutually exclusive across workers), collect multiple patches in one batch part (boxes on the arrows) and store the result in the next free location in the ring buffer (non-transparent block) which is located in shared memory.The main process takes one of the next ready batches (transparent boxes) from the buffer and moves it to the graphics processing unit (GPU).I i refers to the i-th image in the data set, b j is the number of samples the j-th worker contributes to the batch and h, w, c denote the height, width and number of channels of an image, respectively.The assignment of images I i to the workers changes randomly from one epoch t to the next t + 1.
B o REF .Here, we used the symmetric version of the ASD which repeats the computation of the set of nearest neighbour distances D o ML with the role of the predicted segmentation and reference reversed, yielding D o REF .All obtained distance values are averaged, yielding 5 https://monai.ioan average distance value ASD o for each class:

Fig. 5 :
Fig. 5: Segmentation performance of the spatio-spectral models.The performance of five spatial models and three modalities (RGB, tissue parameter images (TPI) and hyperspectral imaging (HSI)) is shown for three different metrics on the test set.The dotted line and the shaded area represent the mean and standard deviation range, respectively, of the inter-rater results for the corresponding metric.Each boxplot shows the quartiles of the metric value distribution with the whiskers extending up to 1.5 times the interquartile range, and the median and mean as solid and dashed line, respectively.Each dot represents one test pig.The results on the validation data set reveal a similar pattern and are shown in Figure A.1.
Fig.6: Ranking stability of the different models and modalities based on bootstrap sampling on the test set with the dice similarity coefficient (DSC).The area of each blob at position (A i , rank j) is proportional to the relative frequency of algorithm A i (model#modality combination) achieving rank j across 1000 bootstrap samples with one bootstrap sample consisting of 5 pig-level metric values.The median rank for each algorithm is indicated by a black cross and the black lines indicate the 95 % quantile of the bootstrap results.This plot was generated with the challengeR toolkit(Wiesenfarth et al., 2021).Ranking results for the average surface distance (ASD) and normalized surface dice (NSD) can be found in FigureA.2 and Figure A.3, respectively.

Fig. 7 :
Fig.7: Ranking stability of the different models and modalities (RGB, tissue parameter images (TPI) and hyperspectral imaging (HSI)) for three different metrics (dice similarity coefficient (DSC), average surface distance (ASD) and normalized surface dice (NSD)) on the test set.Each line visualizes how the rank of the corresponding model#modality changes when evaluating with different metrics.This plot is based on ranking data from the challengeR toolkit(Wiesenfarth et al., 2021).
Figure A.4 and  Figure A.5, respectively.For all studied numbers of training pigs, the performance of image-based segmentation on HSI is comparable or better than the performance of other models.
Figure A.6 and Figure A.7.To enable for a direct comparison of the organ-specific performance between the three modalities for the image model, Figure A.8 shows the recall stratified by organ and modality.Image-based segmentation of HSI data performs better or comparable to TPI and RGB data for all organ classes except from pancreas and major vein.

Fig. 8 :
Fig.8: Example predictions for the different spatial models on the hyperspectral imaging (HSI) modality.Images are selected based on the q % quantile of the dice similarity coefficient averaged across all five models (DSC q ).ignore summarizes instances which are excluded in our models (e.g.pixels with unsure label).For each prediction, metric values for the DSC, average surface distance (ASD) and normalized surface dice (NSD) are shown.

Fig. 9 :
Fig. 9: Dice similarity coefficient (DSC) performance on the test set as a function of the number of individuals n in the training set for different models.The solid line represents the average performance and the shaded area 0.5 standard deviations across 5 runs with a different random selection of individuals in the training set.Only results for the hyperspectral imaging (HSI) modality are shown.Results for the average surface distance (ASD) and normalized surface dice (NSD) are shown in Figure A.4 and Figure A.5, respectively.

Fig. 10 :
Fig. 10: Generalization error over training time by comparing the two validation data sets V unknown (solid lines) with V known (dotted lines).The shown values are obtained by first averaging dice similarity coefficient (DSC) values all images of one pig and then averaging the mean DSC values of the different pigs in V unknown and V known .See Section 4.1 for more details on the validation data set splits.The error curves for the superpixel and patch_64 models are not shown for clarity and can be found in the interactive version of this plot in the supplemental material.
ta 's fa sc ia b la d d e r su b c u ta n e o u s fa t sk in m u sc le o m e n tu m p e ri to n e u m m a jo r v e in

Fig. 11 :
Fig. 11: Confusion matrix of the image model on the hyperspectral imaging (HSI) modality for the test set.The matrix is based on the rownormalized pixel classification results from all images of one pig, i.e. the (i, j)-th entry depicts the percentage of pixels of the true class i which were classified into the j-th class.This normalization was done on a porcine level yielding a confusion matrix per pig.These matrices were then averaged across pigs while ignoring non-existent entries (e.g.due to missing organs).The number in brackets indicates the standard deviation across pigs.Values < 0.1 % are not shown for brevity.Confusion matrices for the tissue parameter images (TPI) and RGB modality are shown in Figure A.6 and Figure A.7, respectively.
Required amount of training data.The reduced standard deviation range of the metrics with an increasing number of training pigs in Figure 9, Figure A.4 and Figure A.5 should be interpreted with care since pigs were always sampled without replacement and this inevitably increases the overlap of selected pigs across random selections on runs with a higher number of training pigs due to the limited number of 15 available training pigs.For example, when randomly selecting two different sets of pigs (each of size 14) out of the 15 training pigs without replacement, then these two sets differ only by one pig.

Fig. 12 :
Fig. 12: Distances which could be used as possible organ-specific thresholds τ o for different aggregation functions (median, mean and the 95 % quantile) based on the comparison of the two expert annotations.The error bars indicate 0.25 standard deviations across pigs of the aggregated values.The thresholds for the mean aggregation correspond to the thresholds we used in our analysis (see Equation4).

Fig. 14 :
Fig. 14: Evaluation of the superpixel approach by taking into account the annotations for each superpixel which serves as performance limit.(a) sketch of the superpixel label assignment highlighting also the superpixels in the image (the same colorbar as in Figure 8 applies here).(b) comparison of the prediction by either the annotation or the model prediction for three different metrics (dice similarity coefficient (DSC), average surface distance (ASD) and normalized surface dice (NSD)).
Fig. A.1: Segmentation performance of the spatio-spectral models.The performance of five spatial models and three modalities (RGB, tissue parameter images (TPI) and hyperspectral imaging (HSI)) is shown for three different metrics on the validation set.The dotted line and the shaded area represent the mean and standard deviation range, respectively, of the inter-rater results for the corresponding metric.Each boxplot shows the quartiles of the metric value distribution with the whiskers extending up to 1.5 times the interquartile range, and the median and mean as solid and dashed line, respectively.Each dot represents one test pig.Results on the test set can be found in Figure5.
Fig. A.3:Ranking stability of the different models and modalities based on bootstrap sampling on the test set with the normalized surface dice (NSD).The area of each blob at position (A i , rank j) is proportional to the relative frequency of algorithm A i (model#modality combination) achieving rank j across 1000 bootstrap samples with one bootstrap sample consisting of 5 pig-level metric values.The median rank for each algorithm is indicated by a black cross and the black lines indicate the 95 % quantile of the bootstrap results.This plot was generated with the challengeR toolkit(Wiesenfarth et al., 2021).Ranking results for the dice similarity coefficient (DSC) and average surface distance (ASD) can be found in Figure6and FigureA.2, respectively.

Fig. A. 4 :
Fig. A.4: Average surface distance (ASD) performance on the test set as a function of the number of individuals n in the training set for different models.The solid line represents the average performance and the shaded area 0.5 standard deviations across 5 runs with a different random selection of individuals in the training set.Only results for the hyperspectral imaging (HSI) modality are shown.Results for the dice similarity coefficient (DSC) and normalized surface dice (NSD) are shown in Figure 9 and Figure A.5, respectively.

Fig. A. 5 :
Fig. A.5: Normalized surface dice (NSD) performance on the test set as a function of the number of individuals n in the training set for different models.The solid line represents the average performance and the shaded area 0.5 standard deviations across 5 runs with a different random selection of individuals in the training set.Only results for the hyperspectral imaging (HSI) modality are shown.Results for the dice similarity coefficient (DSC) and average surface distance (ASD) are shown in Figure 9 and Figure A.4, respectively.
Fig. A.6: Confusion matrix of the image model on the tissue parameter images (TPI) modality for the test set.The matrix is based on the row-normalized pixel classification results from all images of one pig, i.e. the (i, j)-th entry depicts the percentage of pixels of the true class i which were classified into the j-th class.This normalization was done on a porcine level yielding a confusion matrix per pig.These matrices were then averaged across pigs while ignoring non-existent entries (e.g.due to missing organs).The number in brackets indicates the standard deviation across pigs.Values < 0.1 % are not shown for brevity.Confusion matrices for the hyperspectral imaging (HSI) and RGB modality are shown in Figure 11 and Figure A.7, respectively.
Fig. A.7: Confusion matrix of the image model on the RGB modality for the test set.The matrix is based on the row-normalized pixel classification results from all images of one pig, i.e. the (i, j)-th entry depicts the percentage of pixels of the true class i which were classified into the j-th class.This normalization was done on a porcine level yielding a confusion matrix per pig.These matrices were then averaged across pigs while ignoring non-existent entries (e.g.due to missing organs).The number in brackets indicates the standard deviation across pigs.Values < 0.1 % are not shown for brevity.Confusion matrices for the hyperspectral imaging (HSI) and tissue parameter images (TPI) modality are shown in Figure 11 and Figure A.6, respectively.
Fig. A.8: Recall stratified by organ and modality.The recall of the image model on the test set for three modalities (RGB, tissue parameter images (TPI) and hyperspectral imaging (HSI)) is shown, stratified by organ.Each boxplot shows the quartiles of the metric value distribution with the whiskers extending up to 1.5 times the interquartile range, and the median and mean as solid and dashed line, respectively.Each dot represents one test pig.

Table 1 :
Epoch and batch sizes for each of the different models.# pixels refers to the number of pixels of one input sample of a model.The model names patch_64 and patch_32 refer to models with the input shapes 32 × 32 × c and 64 × 64 × c, respectively.
and Figure A.3, respectively.