CellViT: Vision Transformers for Precise Cell Segmentation and Classification

Nuclei detection and segmentation in hematoxylin and eosin-stained (H&E) tissue images are important clinical tasks and crucial for a wide range of applications. However, the task is challenging due to nuclei variability in staining and size, overlapping boundaries, and nuclei clustering. While convolutional neural networks have been extensively used for this task, we explore the potential of Transformer-based networks in this domain. Therefore, we introduce a new method for automated instance segmentation of cell nuclei in digitized tissue samples using a deep learning architecture based on Vision Transformers, called CellViT. CellViT is trained and evaluated on the PanNuke dataset, one of the most challenging nuclei instance segmentation datasets, consisting of nearly 200,000 nuclei annotated with 5 clinically important classes across 19 tissue types. We demonstrate the superiority of large-scale in-domain and out-of-domain pre-trained Vision Transformers by leveraging the recently published Segment Anything Model and a ViT encoder pre-trained on 104 million histological image patches, achieving state-of-the-art nuclei detection and instance segmentation performance on the PanNuke dataset with a mean panoptic quality of 0.50 and an F1 detection score of 0.83. The code is publicly available at https://github.com/TIO-IKIM/CellViT.


Introduction
Cancer is a severe disease burden worldwide, with millions of new cases yearly, and ranks as the second leading cause of death after cardiovascular diseases [1]. Despite novel and powerful non-invasive radiological imaging modalities, collecting tissue samples and evaluating them with a microscope remains a standard procedure for diagnostic evaluation. By identifying abnormalities within the tissue, a pathologist can draw conclusions about potential therapeutic approaches or use them as a starting point for further investigations. One crucial component is the analysis of the cells and their distribution within the tissue, such as detecting tumor-infiltrating lymphocytes [2] or inflammatory cells in the tumor microenvironment [3,4]. However, large-scale analysis on the cell level is time-consuming and suffers from high intra- and inter-observer variability. Due to the development of high-throughput scanners for pathology, it is now possible to create digitized tissue samples (whole-slide images, WSI), enabling the application of computer vision (CV) algorithms. CV facilitates automated slide analysis, for example, to create tissue segmentations [5], detect tumors [6], evaluate therapy response [7], and support the computer-aided detection and segmentation of cells [8,9]. In addition to the clinical applications mentioned above, cell instance segmentation can be leveraged for downstream deep learning tasks, as each WSI contains numerous nuclei of diverse types, fostering systematic analysis and predictive insights [10]. Sirinukunwattana et al. [11] showed that cell analysis supports the creation of high-level tissue segmentations based on cell composition. Corredor et al.
[12] used hand-crafted features extracted from cells to detect tumor regions in a slide. Existing algorithms for analyzing WSI [6,13,7] are often based on Convolutional Neural Networks (CNNs) used as feature extractors for image regions. Despite achieving clinical-grade performance [13], these algorithms face limitations in interpretability, which in turn poses challenges in defining novel human-interpretable biomarkers. Accurate cell analysis within these slides, however, presents an opportunity to construct explainable pipelines that incorporate human-interpretable features effectively in downstream tasks [10,14]. Since WSI analysis models [6,13,7] rely on abstract entity embeddings, features must be extracted from the detected cells. One approach is to generate hand-crafted features, such as morphological attributes, from the segmentation [15]. In the radiology setting, this is referred to as Radiomics [16]. Alternatively, deep learning features can be derived by employing a CNN on image sections of single cells. While hand-crafted features may have limited representative power, running a CNN for each cell is computationally expensive. Thus, the need for automated and reliable detection and segmentation of cells in conjunction with cell-feature extraction in WSI is evident.
We developed CellViT, a novel deep learning architecture based on Vision Transformers for automated instance segmentation of cell nuclei in digitized tissue samples. Our approach eliminates the need for additional computational effort to derive cell features by extracting them in parallel during runtime. The CellViT model proves to be highly effective in collecting nuclei information within patient cohorts and could serve as a reliable nucleus feature extractor for downstream algorithms. Our solution demonstrates exceptional performance on the PanNuke [17] dataset by leveraging transfer learning and pre-trained models [18,19]. The PanNuke dataset contains 189,744 segmented nuclei and includes 19 different tissue types. Among these tissues, there are five clinically important nuclei classes: neoplastic, inflammatory, epithelial, dead, and connective/soft-tissue cells. In addition to the high number of tissue classes and nuclei types, the dataset is highly imbalanced, creating additional complexity. Beyond class imbalance, segmenting cell nuclei is itself a difficult task: nuclei may overlap and show a high level of heterogeneity with inter- or intra-instance variability in shape, size, and staining [9]. Sophisticated training methods such as transfer learning, data augmentation, and specific sampling strategies, next to postprocessing algorithms, are necessary to achieve satisfactory results.
The proposed network architecture is based on a U-Net-shaped encoder-decoder architecture similar to HoVer-Net [8], one of the leading models for nuclei segmentation. Notably, we replace the traditional CNN-based encoder network with a Vision Transformer, inspired by the UNETR architecture [20]. This approach is depicted in Figure 1. Vision Transformers are token-based neural networks that use the attention mechanism to capture both local and global context information. This ability enables ViTs to understand relationships among all cells in an image, leveraging long-range dependencies and substantially improving segmentation. Moreover, with the common token size of 16 pixels (px) and pixel resolutions such as 0.25 µm/px (commonly ×40 magnification) or 0.50 µm/px (commonly ×20 magnification), the token size of ViTs is approximately equivalent to that of a cell, enabling a direct association between a detected cell and its corresponding token embedding from the ViT encoder. As a result, unlike with CNN networks, we directly obtain a localizable feature vector during cell detection that can be extracted simultaneously within one forward pass. Given the limited amount of available data in the medical domain, pre-trained models are an essential requirement, as ViTs have increased data requirements compared to CNNs. Chen et al. [18] recently published a ViT pre-trained on 104 million histological images (ViT 256). Their network outperformed current state-of-the-art (SOTA) cancer subtyping and survival prediction methods. Another important contribution is the Segment Anything Model (SAM), proposed by Kirillov et al.
[19]. They developed a generic segmentation network for various image types whose zero-shot performance is almost equivalent to many supervised networks. In our work, we compare the performance of pre-trained ViT 256 [18] and SAM [19] models as building blocks of our architecture for nuclei segmentation and classification. We demonstrate superior performance over existing nuclei instance segmentation models. We summarize our contributions as follows: 1. We present a novel U-Net-shaped encoder-decoder network for nuclei instance segmentation, leveraging Vision Transformers as encoder networks. Our approach surpasses existing methods for nuclei detection by a substantial margin and achieves segmentation results competitive with other state-of-the-art methods on the PanNuke dataset. We demonstrate the generalizability of CellViT by applying it to the MoNuSeg dataset without fine-tuning.
2. We are the first to employ Vision Transformer networks for nuclei instance segmentation on the PanNuke dataset, demonstrating their effectiveness in this domain. The proposed approach combines pre-trained ViT encoders with a decoder network connected by skip connections.
3. We provide a framework that enables fast inference on gigapixel WSI by using a large inference patch size of 1024 × 1024 px in contrast to conventional 256 px-sized patches. Compared to HoVer-Net, our inference pipeline runs 1.85 times faster.
Related Work

Instance Segmentation of Nuclei
Numerous methods have been developed to solve the challenging task of cell nuclei instance segmentation in WSIs. Previous works have explored diverse approaches, ranging from traditional image processing techniques to deep learning (DL) methods. Commonly used image processing techniques involve the design and extraction of domain-specific features. These features encompass characteristics such as intensity, texture, shape, and morphological properties of the nuclei. The primary challenge is separating overlapping nuclei, and different techniques have been devised to do this [21,22,23,24,25,26,27,28].
For instance, the works of Cheng and Rajapakse [24], Veta et al. [25], and Ali and Madabhushi [26] rely on a predefined nuclei geometry and the watershed algorithm to separate clustered nuclei, while Wienert et al. [27] used morphological operations without watershed and Liao et al. [28] utilized ellipse fitting for cluster separation. A common drawback of these techniques is their dependency on hand-crafted features, which require expert-level domain knowledge, have limited representative power, and are sensitive to hyperparameter selection [8,29].
The complexity of extracting meaningful features increases when cell nuclei classification is added to the segmentation task. Consequently, the performance of these techniques is insufficient for our needs to classify and segment nuclei in various tissue types [29].
To overcome the limitations of traditional image processing techniques, DL has emerged as a powerful approach for nuclei instance segmentation. An inherent advantage of DL networks is their automatic extraction of relevant features for the given task, surpassing the need for expert-level domain knowledge to generate hand-crafted features. DL algorithms, particularly convolutional neural networks (CNNs) [30,31], have shown remarkable success in various computer vision tasks [32]. Especially the invention of the U-Net architecture by Ronneberger et al. [33] has significantly impacted medical image analysis by enabling accurate and efficient segmentation of complex structures, contributing to advancements in various medical domains such as radiology [34,35] and digital pathology [36]. It consists of a U-shaped encoder-decoder structure with skip connections at multiple network depths to preserve fine-grained details in the decoder. However, the original U-Net implementation is not able to separate clustered nuclei [8]. Therefore, specialized network architectures are necessary to separate clustered and overlapping cell nuclei.
In the current literature, DL algorithms for nuclei instance segmentation are further divided into two-stage and one-stage methods [9]. Two-stage methods incorporate a cell detection network in the first stage to localize cell nuclei within an image, generating bounding-box predictions of nuclei. These detected nuclei are then passed on to a subsequent segmentation stage to retrieve a fine-grained nucleus segmentation. Mask-RCNN [37] is one of the leading two-stage models, built on top of the object detection model Fast-RCNN [38]. Koohbanani et al. [39] utilized Mask-RCNN networks for nuclei instance segmentation. Based on the nuclei detections proposed in the first stage, the model incorporates a segmentation branch for fine-grained nucleus segmentation in the second stage.
A rectangular image section of each detected nucleus is used as input for the segmentation stage, which causes the problem that overlapping neighboring nuclei may be segmented as well and need to be cleaned up by an additional postprocessing algorithm. Another two-stage method for nuclei segmentation is BRP-Net [40], which first creates nuclei proposals, then refines their boundaries, and finally derives a segmentation from them. However, this network structure is computationally complex and not designed for end-to-end training due to its three independent stages. Additionally, the network requires a considerable 12 minutes to segment a 1360 × 1024 px image, making its practical application nearly impossible [40].
While two-stage systems offer advantages in localizing cells and improving individual nucleus detection, they often require additional postprocessing for segmentation and suffer from time and computational complexity.
In comparison, one-stage methods combine a single DL network with postprocessing operations. Micro-Net [41] extends the U-Net by using multiple input image resolutions to be invariant against nuclei of varying sizes. The DIST model by Naylor et al. [42] adds an additional decoder branch next to the segmentation branch to detect nuclei markers for a watershed postprocessing algorithm. For this, they predict distance maps from the nucleus boundary to the center of mass of the nuclei. Distance maps are regression maps indicating the distance of a pixel to a reference point, e.g., from a nucleus pixel to the center of mass. HoVer-Net [8], one of the current SOTA methods for automatic nuclei instance segmentation, uses horizontal and vertical distances of nuclei pixels to their center of mass and separates the nuclei by using the gradients of the horizontal and vertical distance maps as input to an edge detection filter (Sobel operator). The STARDIST model [43,44] and its extension CPP-Net [29] generate polygons defining the nuclei boundaries from a set of predicted distances. For this, STARDIST utilizes a star-convex polygon representation to approximate the shape of nuclei. Whereas in STARDIST the polygons are derived solely from features of the centroid pixel, CPP-Net uses context information from sampled points within a nucleus and proposes a shape-aware perceptual loss to constrain the polygon shape. STARDIST demonstrates segmentation performance comparable to HoVer-Net, while CPP-Net exhibits slightly superior results.
In contrast, boundary-based methods such as DCAN [45] and TSFD-Net [9] adopt a different approach, where instead of using distance maps, watershed markers, or polygon predictions, they directly predict the nuclear contour using a prediction map.
While DCAN is based on the U-Net architecture, TSFD-Net utilizes a Feature Pyramid Network (FPN) [46] to leverage features at multiple scales. Additionally, the authors of TSFD-Net introduce a tissue-classifier branch to learn tissue-specific features and guide the learning process. To address the class imbalance across nuclei and tissue types, they employ the focal loss [47], a modified cross-entropy loss with dynamic scaling, for the tissue detection branch and the Focal Tversky loss [48], which enlarges the contribution of challenging regions, for the segmentation branch. While TSFD-Net shows promising results, its comparability to other methods is limited due to the lack of a standardized evaluation procedure.

Vision Transformer
All promising DL models [37,40,41,42,43,8,29,45,9] for nuclei instance segmentation mentioned previously are based on CNNs. Even though CNN models have demonstrated their effectiveness in image processing, they are bound to local receptive fields and may struggle to capture long-range spatial relationships [5]. Inspired by the Transformer architecture in NLP [49], Vision Transformers [50] have recently emerged as an alternative to CNNs for CV [51]. Their architecture is based on the self-attention mechanism [49], allowing the model to attend to any region within an image to capture long-range dependencies. Unlike CNNs, they are also not bound to fixed input sizes and can process images of arbitrary size, depending on computational capacity. Vision Transformers have shown promising results not only in image classification [50,51,52], but also in other vision tasks such as object detection [53] and semantic segmentation [20,5].

Methods
Our architecture is inspired by the UNETR model [20] for 3D volumetric images, but we adapt it for processing 2D images, as shown in Fig. 2. Unlike traditional segmentation networks that employ a single decoder branch for computing the segmentation map, our network employs three distinct multitask output branches, inspired by the approach of HoVer-Net [8]. The first branch predicts the binary segmentation map of all nuclei (nuclei prediction, NP), capturing their boundaries and shapes. The second branch generates horizontal and vertical distance maps (horizontal-vertical prediction, HV), providing crucial spatial information for precise localization and delineation. Lastly, the third branch predicts the nuclei type map (NT), enabling the classification of different nucleus types. In summary, our network has the following multi-task branches for instance segmentation:
• NP-branch: Predicts the binary nuclei map
• HV-branch: Predicts the horizontal and vertical distances of nuclear pixels to their center of mass, normalized between -1 and 1 for each nucleus

• NT-branch: Predicts the nuclei types as instance segmentation maps
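Schematically, the three branches amount to parallel prediction heads on top of decoder features. The sketch below is our own illustration (the module name `MultiTaskHead`, channel sizes, and 1 × 1 output layers are assumptions, not the authors' code); it only shows the output layout of the NP-, HV-, and NT-branches:

```python
import torch
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Illustrative three-branch output head (NP / HV / NT)."""

    def __init__(self, in_ch: int, num_types: int):
        super().__init__()
        self.np_head = nn.Conv2d(in_ch, 2, kernel_size=1)              # binary nuclei map (bg/fg)
        self.hv_head = nn.Conv2d(in_ch, 2, kernel_size=1)              # horizontal + vertical distances
        self.nt_head = nn.Conv2d(in_ch, num_types + 1, kernel_size=1)  # nuclei types + background

    def forward(self, feats: torch.Tensor) -> dict:
        return {
            "np": self.np_head(feats),
            "hv": torch.tanh(self.hv_head(feats)),  # distances normalized to [-1, 1]
            "nt": self.nt_head(feats),
        }
```

In the actual architecture each branch additionally has its own isolated upsampling decoder; only the shared-feature, multi-output idea is captured here.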
To integrate these outputs, we utilize additional postprocessing steps. These involve merging the information from the different branches, separating overlapping nuclei to ensure accurate individual segmentation, and determining the nuclei class based on the nuclei type map.
In our experiments, we also evaluated the effectiveness of the STARDIST decoder method and its extension, CPP-Net. We integrate their techniques into the proposed UNETR-HoVer-Net architecture with modifications. Instead of the NP-branch, an object probability branch (PD) is used to predict whether a pixel belongs to an object by predicting the Euclidean distance to the nearest background pixel. The HV-branch is replaced by a branch (RD) that predicts the radial distances of an object pixel to the boundary of the nucleus (star-convex representation) [43].
The NT-branch remains unchanged. For the CPP-Net decoder, an additional refinement step is added for the radial distances [29].

Network Structure
In our network, we integrate a Vision Transformer as an image encoder connected to an upsampling decoder network via skip connections. This architecture allows us to leverage the strengths of a Vision Transformer as an image encoder for instance segmentation without losing fine-grained information.
Even though many other adaptations of the U-Net structure for Vision Transformers have been proposed (e.g., SwinUNETR [56]), it was important for us to choose a network structure that incorporates the original ViT structure by Dosovitskiy et al. [50] without modifications, such that we can make use of the large-scale pre-trained ViTs, namely ViT 256 and SAM. As in NLP [49], Vision Transformers take as input a 1D sequence of token embeddings [50,49]. Therefore, we need to divide an input image x ∈ R^(H×W×C) with height H, width W, and C input channels into a sequence of flattened tokens x_p ∈ R^(N×(P²·C)). Each token is a square image section with dimension P × P.
The number of tokens N can be calculated via N = HW/P², which is the effective input sequence length [20]. Accordingly, a linear projection E ∈ R^((P²·C)×D) is used to map the flattened tokens x_p into a D-dimensional latent space. The latent vector size D remains constant through all of the Transformer layers.
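The tokenization step can be sketched in a few lines of NumPy; the helper name `patchify` is our own, and the example image size is illustrative:

```python
import numpy as np

def patchify(x: np.ndarray, P: int = 16) -> np.ndarray:
    """Split an H x W x C image into N = HW/P^2 flattened P x P tokens."""
    H, W, C = x.shape
    assert H % P == 0 and W % P == 0, "image size must be divisible by the token size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    tokens = x.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return tokens.reshape(-1, P * P * C)

img = np.random.rand(256, 256, 3)
tokens = patchify(img)
print(tokens.shape)  # (256, 768): N = 256*256/16^2 = 256 tokens of dimension 16*16*3
```

Each row of the result corresponds to one flattened token x_p, which the projection then maps into the D-dimensional latent space.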
In contrast to the UNETR network, we incorporate a learnable class token x_class [50], which we can use for classification tasks, and append it to the token sequence. Unlike CNNs, which inherently capture spatial relationships through their local receptive fields, Transformers are permutation invariant and therefore cannot capture spatial relationships on their own. Thus, a learnable 1D positional embedding E_pos ∈ R^((N+1)×D) is added to the projected token embeddings to preserve spatial context [20]. In summary, the final input sequence z_0 for the Transformer encoder is

z_0 = [x_class; x_p^1 E; x_p^2 E; ...; x_p^N E] + E_pos.

The Transformer encoder comprises alternating layers of multi-headed self-attention (MHA) [50] and multilayer perceptrons (MLP), assembled in one Transformer block. A ViT is composed of several stacked Transformer blocks, such that the latent tokens z_i are calculated by

z'_i = MHA(Norm(z_{i-1})) + z_{i-1},
z_i = MLP(Norm(z'_i)) + z'_i,    i = 1, ..., L,

with L denoting the number of Transformer blocks, Norm(·) denoting layer normalization, and i the intermediate block identifier [20]. Inspired by the U-Net and UNETR architectures, we add skip connections to leverage information at multiple encoder depths in the decoder. In total, we use five skip connections. The first skip connection takes x as input and processes it by two convolution layers (3 × 3 kernel size) with batch normalization and ReLU activation functions. For the remaining four skip connections, the intermediate and bottleneck latent tokens z_j, j ∈ {L/4, 2L/4, 3L/4, L}, are extracted without the class token and reshaped to a 2D tensor Z_j ∈ R^(H/P × W/P × D). This is only valid if 4 | L holds, which is satisfied for common ViT implementations [50,18,19]. Each of the feature maps Z_j is transformed by a combination of deconvolutional layers that double the resolution in both directions and convolutions that adjust the latent dimension. Subsequently, the transformed feature maps are successively processed in each decoder, beginning with Z_L, and fused with the corresponding skip connection at each stage. This iterative fusion ensures the effective incorporation of multi-scale information, enhancing the overall performance of the decoder. Our network is designed in such a way that the output resolution of the segmentation results exactly matches the input image resolution. As denoted in Fig. 2, our three segmentation branches (NP, HV, NT) share the same image encoder with the same skip connections and their transformations. The only difference lies in the isolated upsampling pathways of the decoders specific to each branch. To leverage the additional tissue type information available in the PanNuke dataset, we introduce a tissue classification branch (TC) to guide the learning process of the encoder. For this, we use the class token z_{L,class} as input to a linear layer with softmax activation function to predict the tissue class.
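Because the token grid aligns with pixel coordinates, a detected nucleus can be linked back to its encoder token by simple index arithmetic. The following sketch (function name and toy values are ours) maps a nucleus centroid to the row-major index of the containing token:

```python
import numpy as np

def token_index(cx: float, cy: float, W: int, P: int = 16) -> int:
    """Map a nucleus centroid (cx, cy) in pixels to the index of the
    ViT token containing it (row-major token grid, W = image width)."""
    col, row = int(cx) // P, int(cy) // P
    return row * (W // P) + col

# toy example: 1024 px wide image, 16 px tokens -> 64 tokens per row
tokens = np.random.rand(64 * 64, 768)  # (N, D) latent tokens, class token removed
emb = tokens[token_index(cx=100, cy=37, W=1024)]
print(emb.shape)  # (768,)
```

This direct lookup is what lets a feature vector be read off for every detected cell in the same forward pass, with no extra per-cell inference.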

Target and Losses
For faster training and better convergence of the network, we employ a combination of different loss functions for each network branch. The total loss is

L_total = L_NP + L_HV + L_NT + L_TC,    (4)

where L_NP denotes the loss for the NP-branch, L_HV the loss for the HV-branch, L_NT the loss for the NT-branch, and L_TC the loss for the TC-branch. Overall, the individual branch losses are composed of weighted combinations of the loss functions below,

L_branch = Σ_i λ_i L_i,    (5)

with the individual segmentation losses

L_CE = -(1/N_px) Σ_{i=1}^{N_px} Σ_{c=1}^{C} y_ic log(ŷ_ic),    (6)

L_DICE = 1 - (2 Σ_{i=1}^{N_px} Σ_{c=1}^{C} y_ic ŷ_ic + ε) / (Σ_{i=1}^{N_px} Σ_{c=1}^{C} y_ic + Σ_{i=1}^{N_px} Σ_{c=1}^{C} ŷ_ic + ε),    (7)

L_FT = Σ_{c=1}^{C} (1 - TI_c)^{1/γ_FT},  with  TI_c = (Σ_i y_ic ŷ_ic + ε) / (Σ_i y_ic ŷ_ic + α_FT Σ_i (1-y_ic) ŷ_ic + β_FT Σ_i y_ic (1-ŷ_ic) + ε),    (8)

and the cross-entropy as tissue classification loss. The contribution of each branch loss (5) to the total loss (4) is controlled by the i-th hyperparameter λ_i. L_MSE denotes the mean squared error of the horizontal and vertical distance maps and L_MSGE the mean squared error of the gradients of the horizontal and vertical distance maps, each summarized for both directions separately. In the segmentation losses (6)-(8), y_ic is the ground truth and ŷ_ic the prediction probability of the i-th pixel belonging to class c, C is the total number of nuclei classes, N_px the total number of pixels, ε a smoothness factor, and α_FT, β_FT, and γ_FT are hyperparameters of the Focal Tversky loss L_FT. The cross-entropy loss (6) and Dice loss (7) are commonly used in semantic segmentation. To address the challenge of underrepresented instance classes, the Focal Tversky loss (8), a generalization of the Tversky loss, is used. The Focal Tversky loss places greater emphasis on accurately classifying underrepresented instances by assigning higher weights to those samples. This weighting enhances the model's capacity to handle class imbalance and focuses its learning on the more challenging regions of the segmentation task.
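A minimal NumPy sketch of the Focal Tversky loss is given below, using one common convention (α weighting false positives, β false negatives, and a focal exponent 1/γ as in Abraham and Khan). The default parameter values are illustrative, not the paper's:

```python
import numpy as np

def focal_tversky_loss(y_true, y_pred, alpha=0.7, beta=0.3, gamma=4 / 3, eps=1e-6):
    """Focal Tversky loss sketch.
    y_true: (N_px, C) one-hot ground truth; y_pred: (N_px, C) class probabilities.
    alpha weighs false positives, beta false negatives (convention varies)."""
    tp = (y_true * y_pred).sum(axis=0)        # per-class true positives
    fp = ((1 - y_true) * y_pred).sum(axis=0)  # per-class false positives
    fn = (y_true * (1 - y_pred)).sum(axis=0)  # per-class false negatives
    tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return float(((1 - tversky) ** (1 / gamma)).sum())

# a perfect prediction drives the loss towards zero
y = np.eye(3)[np.repeat(np.arange(3), 20)]  # one-hot ground truth, 3 classes
print(focal_tversky_loss(y, y))
```

With γ > 1, the exponent 1/γ < 1 inflates the loss contribution of classes with a poor Tversky index, which is the mechanism that emphasizes underrepresented classes.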

Postprocessing
As the network does not directly provide an instance segmentation with separated nuclei, postprocessing is necessary to obtain accurate results. This involves several steps, including merging the information from the different branches, separating overlapping nuclei to ensure accurate individual segmentation, and determining the nuclei class based on the nuclei type map. Moreover, when performing inference on whole gigapixel WSI, a fusion mechanism is necessary. Due to the significant size of WSIs, inference needs to be performed on image patches extracted from them using a sliding-window approach. The segmentation results obtained from these patches must be assembled to generate a segmentation map of the entire WSI. The postprocessing methods are therefore explained in the following two paragraphs, starting with the segmentation of a single patch, followed by its composition into a segmentation output for the entire WSI.
Nuclei Separation and Classification To separate adjacent and overlapping nuclei from each other, we utilize HoVer-Net's validated postprocessing pipeline. This involves computing the gradients of the horizontal and vertical distance maps to capture transitions between nuclei boundaries and between nuclei and the background. At these transition points, significant value changes occur in the gradients. The Sobel operator (an edge detection filter) is then applied to identify regions with substantial differences between neighboring pixels in the distance maps. Finally, a marker-controlled watershed algorithm is employed to generate the final boundaries. To calculate the nuclei class, the output of the separated nuclei is merged with the nuclei type predictions. For this purpose, majority voting is performed over the nuclei region in the NT prediction map, with the majority class assigned to all nuclei pixels [8]. The STARDIST and CPP-Net decoder methods, on the other hand, use non-maximum suppression (NMS) to prune redundant polygons that likely represent the same object [43,44]. We use this approach when testing CellViT with STARDIST and CPP-Net decoders. In contrast to STARDIST, the CPP-Net approach uses the refined radial distances as input for the NMS. The nuclei classes are then again assigned to the resulting binary polygons via majority voting.
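The majority-voting step can be sketched as follows; the helper name `assign_nuclei_classes` and the toy maps are our own illustration, not the authors' implementation:

```python
import numpy as np

def assign_nuclei_classes(inst_map: np.ndarray, nt_map: np.ndarray) -> dict:
    """Assign one class per nucleus by majority vote over the NT prediction map.
    inst_map: (H, W) integer instance labels, 0 = background.
    nt_map:   (H, W) integer per-pixel class predictions, 0 = background."""
    classes = {}
    for inst_id in np.unique(inst_map):
        if inst_id == 0:
            continue
        votes = nt_map[inst_map == inst_id]
        votes = votes[votes > 0]  # ignore background votes inside the nucleus
        if votes.size == 0:
            continue
        classes[int(inst_id)] = int(np.bincount(votes).argmax())
    return classes

# toy example: one 2x2 nucleus whose pixels vote 2, 2, 2, 1 -> class 2
inst = np.zeros((4, 4), dtype=int); inst[:2, :2] = 1
nt = np.zeros((4, 4), dtype=int);   nt[:2, :2] = [[2, 2], [2, 1]]
print(assign_nuclei_classes(inst, nt))  # {1: 2}
```

Voting over the whole instance region makes the class assignment robust to scattered per-pixel misclassifications at the nucleus border.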
Inference The ViT encoder offers a significant advantage over CNN-based U-Nets when performing inference on gigapixel WSI. Its capability to process input sequences of arbitrary length, constrained only by memory consumption and positional embedding interpolation, allows for increased input image sizes during inference. It is important to note that positional embedding interpolation must be considered when scaling the input images.
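The standard recipe for adapting learned positional embeddings to a larger token grid is to reshape the grid part to 2D and interpolate it; a sketch under our own naming, using cubic spline interpolation from SciPy:

```python
import numpy as np
from scipy.ndimage import zoom

def interpolate_pos_embed(pos_embed: np.ndarray, old_grid: int, new_grid: int) -> np.ndarray:
    """Resize a (1 + old_grid^2, D) positional embedding (class token first)
    to a new token-grid side length. A sketch of the common ViT recipe,
    not the paper's exact code."""
    cls_tok, grid_tok = pos_embed[:1], pos_embed[1:]
    D = pos_embed.shape[1]
    grid = grid_tok.reshape(old_grid, old_grid, D)
    scale = new_grid / old_grid
    resized = zoom(grid, (scale, scale, 1), order=3)  # cubic interpolation per channel
    return np.concatenate([cls_tok, resized.reshape(-1, D)], axis=0)

pe = np.random.rand(1 + 16 * 16, 768)       # trained on 256 px inputs with 16 px tokens
pe_big = interpolate_pos_embed(pe, 16, 64)  # inference on 1024 px inputs
print(pe_big.shape)  # (4097, 768)
```

The class token is left untouched; only the spatial grid is rescaled, which is what allows the same encoder to run on 1024 × 1024 px patches at inference time.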
In preliminary experiments on the MoNuSeg dataset (see Sec. 5.3), we found that our network achieves equal performance when inferring on a single 1024 × 1024 px patch compared to cutting the same patch into 256 × 256 px sub-patches with an overlap of 64 px. Based on these findings, we have chosen to perform WSI inference using large 1024 × 1024 px patches with a 64 px overlap. Due to the high computational overhead, it is not feasible to keep the segmentation results of the entire WSI in memory. Consequently, we process and merge only the overlapping nuclei during postprocessing. By utilizing just a small overlap relative to the patch size, the postprocessing effort is reduced. To efficiently store the results in a structured and readable format, as well as for compatibility with software such as QuPath [65], the nuclei predictions for an entire WSI are exported in a JSON file. Each nucleus is represented by several parameters, including the nuclei class, bounding-box coordinates, the shape polygon of its boundary, and the center of mass as detection location. In the Appendix, we provide example visualizations of the prediction results from an internal esophageal adenocarcinoma and melanoma cohort, imported into QuPath (see Fig. A.2). This approach ensures the accessibility of the instance segmentation results for further analysis and visualization. Moreover, for each detected nucleus ŷ, we store the corresponding embedding token z_ŷ,L ∈ R^D. Importantly, as the cell embedding vectors can be directly extracted during the forward pass and are spatially linked to each nucleus ŷ, there is no need for an additional forward pass on cropped image patches of the detected cells, again saving inference time. If a nucleus is associated with multiple tokens, we average over all token embeddings in which the nucleus is located. The cell embeddings can be used as extracted cell features for downstream DL algorithms addressing problems such as disease prediction, treatment response, and survival prediction.
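The sliding-window tiling described above can be sketched as follows. This is a simplified helper of our own (the actual pipeline additionally merges nuclei inside the overlap regions), assuming the slide is at least one patch large:

```python
def patch_grid(wsi_w: int, wsi_h: int, patch: int = 1024, overlap: int = 64):
    """Top-left coordinates of overlapping inference patches covering a WSI.
    The last row/column is shifted inwards so every patch stays inside the slide."""
    stride = patch - overlap
    xs = list(range(0, max(wsi_w - patch, 0) + 1, stride))
    ys = list(range(0, max(wsi_h - patch, 0) + 1, stride))
    if xs[-1] + patch < wsi_w:
        xs.append(wsi_w - patch)  # extra column flush with the right edge
    if ys[-1] + patch < wsi_h:
        ys.append(wsi_h - patch)  # extra row flush with the bottom edge
    return [(x, y) for y in ys for x in xs]

coords = patch_grid(4000, 3000)
print(len(coords))  # 20: a 5 x 4 grid of 1024 px patches for a 4000 x 3000 px slide
```

Using a 64 px overlap on 1024 px patches means only about 6% of each patch needs overlap handling, which keeps the merge step cheap.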

Datasets
PanNuke We use the PanNuke dataset as the main dataset to train and evaluate our model. The dataset contains 189,744 annotated nuclei in 7,904 256 × 256 px images of 19 different tissue types and 5 distinct cell categories, as depicted in Fig. 3. Cell images were captured at ×40 magnification with a resolution of 0.25 µm/px. The dataset is highly imbalanced; especially the nuclei class of dead cells is severely underrepresented, as apparent in the nuclei and tissue class statistics (see Fig. 3). PanNuke is regarded as one of the most challenging datasets for the simultaneous nuclei instance segmentation task [9].

MoNuSeg
The MoNuSeg [66,67] dataset serves as an additional dataset for nuclei segmentation. In contrast to PanNuke, the dataset is much smaller and does not divide the nuclei into different classes. For this work, we only use the test split of MoNuSeg to evaluate our model. It consists of 14 images with a resolution of 1000 × 1000 px, acquired at ×40 magnification with 0.25 µm/px. In total, the test split contains more than 7,000 annotated nuclei across the seven organ types kidney, lung, colon, breast, bladder, prostate, and brain at several disease states (benign and tumors at different stages). Since no nuclei class labels are included, the dataset cannot be used for evaluating classification performance. To process the dataset more effectively with our ViT-based networks with a token size of 16 px, we resized the data to 1024 × 1024 px. Due to the sufficient patch size of the original data, we also created a ×20 dataset with 0.50 µm/px resolution, where the patch size is 512 × 512 px accordingly.
CoNSeP We utilized the colorectal nuclear segmentation and phenotypes (CoNSeP) dataset by Graham et al. [8] to analyze extracted cell embeddings (see Sec. 3.3) of detected cells on an external validation dataset. This dataset comprises 41 H&E-stained colorectal adenocarcinoma images at a resolution of 0.25 µm/px and an image size of 1000 × 1000 px, which we rescale to 1024 × 1024 px similar to the MoNuSeg data. The dataset exhibits significant diversity, encompassing stromal, glandular, muscular, collagen, adipose, and tumorous regions, along with various types of nuclei derived from their originating cells: normal epithelial, dysplastic epithelial, inflammatory, necrotic, muscular, fibroblast, and miscellaneous nuclei, including necrotic and mitotic cells.

Experiments
In this study, we conducted two experiments on the PanNuke dataset and one on the MoNuSeg dataset to assess algorithm performance. We additionally used an internal dataset for comparing inference speed. Given the higher clinical relevance of the detection task over achieving optimal segmentation quality, we (1) performed an ablation study on PanNuke to determine the most suitable network architecture for nuclei detection. We compared the performance of pre-trained models (see Sec. 4.4) against randomly initialized models and explored the impact of regularization techniques such as data augmentation, loss functions, and customized oversampling, as well as comparing the HoVer-Net decoder method to the STARDIST and CPP-Net decoder methods in our UNETR structure. Based on these investigations, we identified the best models, which were (2) subsequently evaluated for segmentation quality. To assess both detection and segmentation performance, we compared our models with multiple baseline architectures, namely DIST [42], Mask-RCNN [37], Micro-Net [41], HoVer-Net [8], TSFD-Net [9], and CPP-Net [29]. We also re-trained the STARDIST model with a ResNet50 [68] backbone and the hyperparameters of Chen et al.
[29] to retrieve comparable detection results. For comparison, we conducted our experiments using the same three-fold cross-validation (CV) splits provided by the PanNuke dataset organizers and report the averaged results over all three splits. It is worth mentioning that all the comparison models we evaluate in this study adhere to the same evaluation scheme for the PanNuke dataset, with one exception: the TSFD-Net publication reports results based on an 80-20 train-test split, making their results more optimistic. Nevertheless, we include their results for the purpose of comparison. As a third experiment (3), we evaluated our models trained on PanNuke on the publicly available 14 test images of the MoNuSeg dataset to test generalizability. The dataset serves a second purpose next to generalization: we compare various input image sizes and assess the performance of our inference pipeline outlined in Section 3.3. In this context, we evaluate the performance using two scenarios: one involving an uncropped MoNuSeg slide with 1024 px input patch size and the other using cropped 256 px input images. Additionally, we investigate the impact of our overlapping strategy with a 64-pixel overlap, focusing on the 256 px input size. To analyze the cell embeddings z_ŷ^L for detected nuclei with our CellViT models, we utilize the CoNSeP dataset (4). To achieve this, we perform inference with the pre-trained PanNuke models on the CoNSeP images (1024 px input patch size) and extract, for each nucleus ŷ, the token embeddings z_ŷ^L from the last Transformer block that are spatially associated with ŷ. Subsequently, we employ the Uniform Manifold Approximation and Projection (UMAP) method for dimension reduction to transform the cell embedding vectors (of the 27 training images) into a two-dimensional representation, which can be visualized in a two-dimensional scatter plot. We additionally trained a linear classifier on top of the cell embeddings (extracted from the 27 training images) to classify
the detected cells into the CoNSeP nuclei classes and tested the classifier on the cell embeddings of the cells from the 14 test images. Finally, to compare the inference runtime (5), we collected a diverse dataset of 10 esophageal WSIs with tissue areas ranging from 2.79 mm² to 74.07 mm². We measured the inference runtime for the HoVer-Net model, as well as for the CellViT 256 and CellViT-SAM-H models with 256 px and 1024 px patch input size and an overlap of 64 px. For each WSI, we repeated the process three times and averaged the runtime results.

Evaluation Metrics
Nuclear Instance Segmentation Evaluation Usually, the Dice coefficient (DICE) or the Jaccard index are used as evaluation metrics for semantic segmentation. However, as Graham et al. [8] have already shown, these two metrics are insufficient for evaluating nuclear instance segmentation as they do not account for the detection quality of the nuclei. Therefore, a metric is needed that assesses the following three requirements (see Graham et al. [8]):
1. Separate the nuclei from the background
2. Detect individual nuclei instances and separate overlapping nuclei
3. Segment each instance
These three requirements cannot be evaluated with the Jaccard index and the DICE score, as they satisfy only requirement (1). In line with [8] and the PanNuke dataset evaluation recommendations [17], we use the panoptic quality (PQ) [69] to quantify the instance segmentation performance. The PQ is defined as
PQ = Σ_{(y,ŷ)∈TP} IoU(y, ŷ) / (|TP| + ½|FP| + ½|FN|),
with IoU(y, ŷ) denoting the intersection-over-union [69]. In this equation, y denotes a ground-truth (GT) segment, and ŷ denotes a predicted segment, with the pair (y, ŷ) being a unique matching set of one ground-truth segment and one predicted segment. As Kirillov et al.
[69] proved, each pair of segments (y, ŷ), i.e., each pair of true and predicted nuclei, in an image is unique if IoU(y, ŷ) > 0.5 is satisfied. For each class, the unique matching of (y, ŷ) splits the predicted and the GT segments into three sets:
• True Positives (TP): matched pairs of segments, i.e., correctly detected instances
• False Positives (FP): unmatched predicted segments, i.e., predicted instances without a matching GT instance
• False Negatives (FN): unmatched GT segments, i.e., GT instances without a matching predicted instance
The PQ score can be intuitively decomposed into two parts: the detection quality DQ = |TP| / (|TP| + ½|FP| + ½|FN|), similar to the F1-score commonly used in classification and detection scenarios, and the segmentation quality SQ = Σ_{(y,ŷ)∈TP} IoU(y, ŷ) / |TP| as the average IoU of matched segments, such that PQ = DQ · SQ [8,69]. To ensure a fair comparison, we use the binary PQ (bPQ), pretending that all nuclei belong to one class (nuclei vs. background), and the more challenging multi-class PQ (mPQ), taking the nuclei class into account. For mPQ, we calculate the PQ independently for each nuclei class and subsequently average the results over all classes [17].
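The PQ definition above can be made concrete in a few lines. The sketch below represents instances as pixel-coordinate sets and applies the IoU > 0.5 matching criterion; it is an illustration of the metric for one class, not the evaluation code used in the paper:

```python
def iou(a, b):
    """Intersection-over-union of two pixel-coordinate sets."""
    return len(a & b) / len(a | b)

def panoptic_quality(gt, pred):
    """PQ for one class. Matched pairs require IoU > 0.5, which makes the
    matching unique (Kirillov et al. [69]); gt and pred are lists of
    pixel-coordinate sets, one set per instance."""
    matched_iou, matched_pred = [], set()
    for g in gt:
        for j, p in enumerate(pred):
            if j not in matched_pred and iou(g, p) > 0.5:
                matched_iou.append(iou(g, p))
                matched_pred.add(j)
                break
    tp = len(matched_iou)
    fp = len(pred) - len(matched_pred)   # unmatched predictions
    fn = len(gt) - tp                    # unmatched GT instances
    return sum(matched_iou) / (tp + 0.5 * fp + 0.5 * fn)

# One perfect match, one missed GT nucleus, one spurious prediction:
gt = [{(0, 0), (0, 1)}, {(5, 5)}]
pred = [{(0, 0), (0, 1)}, {(9, 9)}]
assert panoptic_quality(gt, pred) == 0.5   # DQ = 0.5, SQ = 1.0
```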
Nuclear Classification Evaluation To evaluate the detection quality of our model, we employ commonly used detection metrics. Similar to the approach used in the PQ score for nuclear instance segmentation evaluation, we split GT and predicted instances into TPs, FPs, and FNs. We use the conventional detection metrics precision (P_d), recall (R_d), and the F_{1,d}-score as the harmonic mean of precision and recall. The index 'd' indicates that these are the scores for the entire binary nuclei detection over all classes c. Thus, the binary detection scores are defined as follows:
P_d = TP_d / (TP_d + FP_d), R_d = TP_d / (TP_d + FN_d), F_{1,d} = 2 · P_d · R_d / (P_d + R_d).
We further break down TP_d into correctly classified instances of class c (TP_c), false positives of class c (FP_c), and false negatives of class c (FN_c) to derive cell-type-specific scores. We then define the F_{1,c}-score, precision (P_c), and recall (R_c) of each nuclei class c analogously. In order to prioritize the classification of different nuclear types, we incorporated an additional weighting factor for the nuclei classes, as suggested in the official PanNuke evaluation metrics [17,8]. Since we cannot use the IoU(y, ŷ) > 0.5 criterion to find matching instances (y, ŷ) between GT instances and predictions for the detection task, we use the methodology of Sirinukunwattana et al. [70] and define a match (y, ŷ) if both centers of mass are within a radius of 6 px (0.50 µm/px) and 12 px (0.25 µm/px), respectively.
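The centroid-based matching can be sketched as follows; we use greedy first-hit matching as a simplification, so details may differ from the protocol of Sirinukunwattana et al. [70]:

```python
import math

def detection_scores(gt_centers, pred_centers, radius=12.0):
    """Precision, recall and F1 under greedy one-to-one centroid matching:
    a prediction counts as a TP if an unmatched GT center of mass lies
    within `radius` px (12 px at 0.25 um/px, 6 px at 0.50 um/px)."""
    unmatched = set(range(len(gt_centers)))
    tp = 0
    for px, py in pred_centers:
        for i in sorted(unmatched):
            gx, gy = gt_centers[i]
            if math.hypot(px - gx, py - gy) <= radius:
                unmatched.remove(i)  # each GT nucleus is matched at most once
                tp += 1
                break
    fp, fn = len(pred_centers) - tp, len(gt_centers) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# One prediction within 6 px of a GT nucleus, one spurious, one GT missed:
assert detection_scores([(0, 0), (100, 100)], [(3, 4), (50, 50)],
                        radius=6.0) == (0.5, 0.5, 0.5)
```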

Model Training
Oversampling Even though the PanNuke dataset has around 200,000 annotated nuclei, they are distributed across only about 8,000 patches with 256 × 256 px patch size. Furthermore, there is a substantial class imbalance among tissue types and nuclei classes (see Fig. 3). Thus, we developed a new oversampling strategy based on class weightings to balance both tissue classes and nuclei classes. For each patch i in the training dataset with N_Train training samples, we calculate the sampling weight p_i(γ_s) as the sum of a tissue term and a cell term (eq. (10)), where w_Tissue(i, γ_s) is a weight factor for the tissue class and w_Cell(i, γ_s) for the nuclei class. The parameter γ_s ∈ [0, 1] is a weighting factor that determines the strength of the oversampling. A γ_s value of 0 indicates no oversampling, while γ_s = 1 corresponds to maximum balancing. To ensure that neither w_Tissue(i, γ_s) nor w_Cell(i, γ_s) dominates the sampling, normalization is applied to both summands in eq. (10). The weighting factor of the tissue class can be calculated directly via eq. (11), as each patch can only belong to one tissue class, denoted by c_{T,i}.
For cell weighting, it must be considered that each patch can contain multiple nuclei from different cell classes. Therefore, we create a binary vector c_i ∈ {0, 1}^C, where each entry is set to 1 for each nuclei type c present in the patch. To get a reference value for scaling similar to eq. (11), we calculate N_Cell = Σ_{i=1}^{N_Train} ∥c_i∥₁. The cell weighting for each training image i is then calculated analogously.
Data Augmentation In addition to our customized oversampling strategy, we extensively employ data augmentation techniques to enhance data variety and discourage overfitting. We use a combination of the following geometric and noise-/intensity-based augmentation methods: random 90-degree rotation, horizontal flipping, vertical flipping, downscaling, blurring, Gaussian noise, color jittering, superpixel representation of image sections (SLIC), zoom blur, random cropping with resizing, and elastic transformations. These augmentation techniques were selected to introduce variations in the shape, orientation, texture, and appearance of the nuclei, enhancing the robustness and generalization capabilities of the model. For detailed information on the augmentation methods utilized, including the selected probabilities and corresponding hyperparameters, please refer to the Appendix.
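One plausible reading of the tissue-weighting scheme interpolates between uniform and inverse-class-frequency sampling with γ_s; this is our sketch, not the paper's exact eq. (11):

```python
from collections import Counter

def tissue_sampling_weights(tissue_labels, gamma_s):
    """Per-patch tissue weights interpolating between uniform sampling
    (gamma_s = 0) and inverse-class-frequency balancing (gamma_s = 1).
    This interpolation is our reading of the scheme, not the paper's
    exact formulation."""
    n, counts = len(tissue_labels), Counter(tissue_labels)
    n_classes = len(counts)
    return [(1 - gamma_s) + gamma_s * n / (n_classes * counts[t])
            for t in tissue_labels]

# gamma_s = 0: every patch is equally likely to be drawn.
assert tissue_sampling_weights(["a", "a", "a", "b"], 0.0) == [1.0] * 4
# gamma_s = 1: the rare tissue class "b" is upweighted.
w = tissue_sampling_weights(["a", "a", "a", "b"], 1.0)
assert abs(w[0] - 4 / 6) < 1e-9 and abs(w[3] - 2.0) < 1e-9
```

Such weights could then drive a weighted sampler with replacement, e.g. PyTorch's `torch.utils.data.WeightedRandomSampler`, matching the sampling-with-replacement described for the training epochs.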

Optimization and Training Strategy
We train all our models for 130 epochs and incorporate exponential learning rate scheduling with a scheduling factor of 0.85 to gradually reduce the learning rate during training (denoted as CellViT hyperparameters). To balance our training, we use our modified oversampling strategy with γ_s = 0.85. For the STARDIST and CPP-Net models, we also conducted experiments using the CPP-Net hyperparameters proposed by Chen et al. [29]. A complete overview of all hyperparameters, including optimizer, data augmentation, and weighting factors of the loss functions in eq. (5), is provided in the Appendix. As for the encoder models, we leverage the ViT 256 model (ViT-S, D = 384, L = 12), which has been pre-trained on histological data (see Sec. 2.2). Additionally, we compare the performance with the three pre-trained SAM checkpoints: SAM-B (ViT-B, D = 768, L = 12), SAM-L (ViT-L, D = 1024, L = 24), and SAM-H (ViT-H, D = 1280, L = 32). These checkpoints provide different model sizes and complexities, allowing us to evaluate their respective performance and choose the most suitable one for our task. During training, we initially freeze the encoder weights for the first 25 epochs. After this initial warm-up phase to train the decoder, we proceed to train the entire model, including the image encoder.
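The warm-up and learning-rate schedule can be summarized per epoch; the base learning rate below is an assumed placeholder, not a value from the paper:

```python
def epoch_settings(epoch, base_lr=1e-4, gamma=0.85, warmup=25):
    """Return (learning rate, encoder_trainable) for a given epoch:
    exponential LR decay with factor `gamma`, and the encoder frozen
    during the first `warmup` epochs (decoder-only warm-up).
    base_lr is an assumed placeholder value."""
    return base_lr * gamma ** epoch, epoch >= warmup

lr0, enc0 = epoch_settings(0)
assert enc0 is False            # decoder-only warm-up phase
assert epoch_settings(25)[1]    # full model trained from epoch 25 on
```

In a PyTorch training loop, the same behavior would typically be realized with `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.85)` and by toggling `requires_grad` on the encoder parameters at epoch 25.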
Implementation All models are implemented in PyTorch 1.13.1. To augment images and masks, we used the Albumentations library [71]. Other libraries used include the official STARDIST [44], CPP-Net [29], and CellSeg-models [72] implementations. For the pre-trained ViT 256 model, we utilized the ViT-S checkpoint provided by Chen et al. [18]. As for the SAM-B, SAM-L, and SAM-H models, we use the encoder backbones of each final training stage of SAM [19], published on GitHub. All experiments were conducted on an 80 GB NVIDIA A100 GPU with automatic mixed precision.

Results
In the section below, the results are given for the experiments: (1) nuclei detection quality and (2) segmentation quality on PanNuke, (3) generalization performance on the independent MoNuSeg cohort, (4) cell-embedding analysis, and (5) inference speed comparisons. If not stated otherwise, all models were trained on the PanNuke dataset with a resolution of 0.25 µm/px.

Detection Quality on PanNuke
Considering the clinical importance of nuclei detection and classification over achieving the best possible segmentation quality, our ablation study aimed to determine the best model based on the detection results using the PanNuke dataset. Tab. 1 presents the precision, recall, and F1-score for both detection and classification performance across all nuclei classes, including the binary case. To determine the optimal settings, we evaluated different variations of our network. These include a randomly initialized network (CellViT-Random), networks with pre-trained weights from the ViT 256 network (CellViT 256), and networks with different pre-trained SAM backbones (CellViT-SAM-B, CellViT-SAM-L, CellViT-SAM-H). To ensure comparability, the CellViT-Random network shares the same architecture (ViT-S, D = 384, L = 12) as the CellViT 256 network. All mentioned CellViT model variants were trained using data augmentation and our customized sampling strategy as regularization methods. The decoder network strategies (HoVer-Net, STARDIST, or CPP-Net decoder) and hyperparameter settings are given after the network name in Tab. 1.
We first analyze the CellViT models with HoVer-Net decoder.
Compared to the baseline models, the randomly initialized CellViT-Random network achieves detection results comparable to the HoVer-Net CNN network. However, when using pre-trained encoder networks, we observe a significant performance increase, reaching state-of-the-art performance. We notice a strong increase in F1-scores compared to all existing solutions, especially for the epithelial nuclei class. Both the ViT 256 and the three different SAM encoders exhibit significantly better performance, all at a similar level, with the CellViT-SAM-H model as the best solution. Notably, we even outperform purely detection-based methods like Mask-RCNN and all state-of-the-art approaches by a large margin, with up to a 26 % increase in the F1,Epi-score of epithelial nuclei.
To demonstrate the effect of extensive data augmentation, the customized sampling strategy, and the Focal Tversky loss, we additionally report in Tab. 1 the results for a CellViT 256 model without regularization (CellViT 256-Raw), with oversampling only (CellViT 256-Over), with data augmentation only (CellViT 256-Aug), and a model trained with oversampling and all augmentations but without Focal Tversky loss (CellViT 256-No-FC). Our experiments reveal that data augmentation, in particular, is a crucial regularization method that significantly enhances performance. Specifically, the addition of data augmentation results in a 0.13 increase in the F1,Dead score for the dead nuclei class compared to the CellViT 256-Raw model. Oversampling and the Focal Tversky loss lead to only minimal improvements in the detection scores. We also tested the STARDIST and CPP-Net decoder structures with the CellViT 256 and CellViT-SAM-H models, using both our hyperparameters and the CPP-Net hyperparameters suggested by Chen et al. [29]. These models usually achieve higher precision values but often a significantly lower recall and a lower F1-score than the models with the HoVer-Net decoder architecture. As an extension of the STARDIST method, the CPP-Net decoder achieves slightly better results. Overall, these models achieve better detection results than comparable CNN-based SOTA networks and outperform the ResNet50-based STARDIST model, but are inferior to our suggested models with HoVer-Net decoder architecture. The results also reveal that our hyperparameters provide better detection performance.
In addition to the provided dataset resolution of 0.25 µm/px, we performed training and evaluation for the two best model variants, CellViT 256 and CellViT-SAM-H, on downscaled PanNuke data (from 256 × 256 to 128 × 128 px patch size), resulting in 0.50 µm/px resolution. The results are presented in the last two rows of Tab. 1. The downsizing leads to a substantial drop in performance compared to the 0.25 µm/px networks, with detection results approaching the baseline models. Notably, the recall of individual classes decreases significantly (by an average of −0.20). In particular, the recall for the dead nuclei class drops to 0.04, indicating that this class is almost never detected. Interestingly, the precision increases minimally or remains almost the same compared to our best 0.25 µm/px models. We conclude that despite detecting significantly fewer nuclei, when a nucleus is identified, it is assigned the true nucleus class with high accuracy for most classes.
For subsequent investigations, we decided to consider only the CellViT 256 and CellViT-SAM-H models to enable a comparison between in-domain and out-of-domain pre-training. To provide a visual representation of the segmentations, we include tissue-wise comparisons between ground truth and segmentation predictions of the CellViT-SAM-H model in Fig. 4.
As observed in the lung example, the instance segmentation of dead cells poses a significant challenge due to their small size. Furthermore, detecting and segmenting dead nuclei becomes even more difficult when the images are scaled down from 0.25 µm/px to 0.50 µm/px resolution.

MoNuSeg Test Performance
In this experiment, we focused on instance segmentation without classification on the MoNuSeg dataset to assess the generalizability of our models (with HoVer-Net decoder only) at resolutions of 0.25 µm/px and 0.50 µm/px. Additionally, we aimed to evaluate the impact of changing the input sequence size by performing inference on large-scale tiles of size 1024 px (0.25 µm/px) and 512 px (0.50 µm/px), respectively, comparing the results to non-overlapping 256 px patches and 256 px patches with an overlap of 64 px derived by a shifting-window approach. We utilized the three final models of the PanNuke training folds for each architecture and conducted inference on the MoNuSeg data without retraining. The evaluation results are presented in Tab. 4. Consistent with the previous experiments, the CellViT-SAM-H model is the best-performing model. It achieves a bPQ-score of 0.672 on 1024 px tiles when no patching was applied and of 0.671 for 256 px tiles with an overlap of 64 px. However, when using 256 px patches without overlap, the bPQ-score decreases to 0.631, likely due to the absence of merging overlapping nuclei at cell borders and double-detected cells (higher recall). Importantly, the overall comparison between larger tiles and smaller tiles with overlap indicates that inference on larger tiles did not lead to a degradation in performance. This justifies our inference pipeline for large-scale WSI, in which we use 1024 px patches with an overlap of 64 px and overlapping merging strategies. The CellViT 256 model yields slightly inferior results compared to the CellViT-SAM-H model. Using the models trained on 0.50 µm/px data on the 0.25 µm/px data and vice versa, the 0.50 µm/px-trained models exhibit poor performance on 0.25 µm/px data, while the 0.25 µm/px-trained models experience a less severe performance drop on the 0.50 µm/px data. Nevertheless, networks trained and evaluated on the same WSI resolution achieved the best performance; thus, it is advisable to
align image resolution between different datasets and use the appropriate model. Consistently, the best results are achieved for WSI acquired with a resolution of 0.25 µm/px. We include a visual demonstration presenting a tissue tile from the MoNuSeg test set along with binary segmentation masks generated by the CellViT-SAM-H and CellViT-SAM-H (0.50 µm/px) models in the Appendix.
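The overlapping tiling used during inference can be sketched as a one-dimensional coordinate computation (the helper name is ours; the paper's pipeline additionally merges nuclei detected in the overlap regions):

```python
def patch_origins(length, patch=1024, overlap=64):
    """Top-left coordinates of sliding-window patches along one image
    axis, with the given overlap; the final patch is shifted so that it
    ends exactly at the image border."""
    stride = patch - overlap
    origins = list(range(0, max(length - patch, 0) + 1, stride))
    if origins[-1] + patch < length:
        origins.append(length - patch)  # cover the remaining border strip
    return origins

assert patch_origins(1024) == [0]            # one tile, no patching needed
assert patch_origins(2048) == [0, 960, 1024] # 64 px overlap between tiles
```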

Token Analysis
In Figure 5, we present the two-dimensional UMAP embeddings of cell tokens from the CoNSeP dataset. The CellViT-SAM-H and CellViT 256 models with HoVer-Net decoder, trained on the PanNuke dataset, were utilized. The tokens were extracted simultaneously with cell detections in a single inference pass. The color overlay in the scatter plots (left) and tissue images (right) indicates the respective nuclei classes. Consistent with Graham et al. [8], we grouped normal and malignant/dysplastic epithelial nuclei into an "epithelial" class, while fibroblast, muscle, and

Inference Runtime
Our inference runtime benchmark shows that our inference pipeline is accelerated by a factor of 2.49 (CellViT 256) and 2.25 (CellViT-SAM-H) when using 1024 px input patches instead of 256 px. The CellViT 256 model with 1024 px input patches is 1.34 times faster than the CellViT-SAM-H model with 1024 px patches. Both CellViT models with our large 1024 px input patch size outperform the HoVer-Net model, with speedups of 1.85 (CellViT 256) and 1.39 (CellViT-SAM-H), respectively.

Discussion and Conclusion
Nuclei instance segmentation is crucial for clinical applications, requiring automated tools that offer high robustness and reliability. In the context of performing large-scale analysis on clinical patient cohorts, accurate detection is considered more important than precise segmentation. In this work, we introduced a novel deep learning-based method for simultaneously segmenting and detecting nuclei in digitized H&E tissue samples. Our work was inspired by the success of previous works using large-scale trained Vision Transformers, particularly by the contributions of Chen et al. [18] (ViT 256) and Kirillov et al. [19] (SAM). The CellViT network proposed in this study demonstrates state-of-the-art performance for both nuclei instance segmentation and nuclei detection on the PanNuke dataset. Additionally, the results on the MoNuSeg dataset validate the generalizability of our model to previously unseen cohorts. Notably, our model surpasses all other existing methods by a significant margin for nuclei detection and classification, elevating nuclei detection in H&E slides to a new level. By leveraging the most recent approaches, we showed that both in-domain pre-training (ViT 256) and the use of the SAM foundation model yield significantly better results than randomly initialized network weights. Our larger inference patch size allows us to be 1.85 times faster than the popular HoVer-Net inference framework by Graham et al.
[8], which can save hours of computational time when dealing with huge gigapixel WSI. Moreover, our framework allows direct assessment of a localizable ViT token from a detected nucleus that can be further used in downstream tissue analysis tasks. Although an evaluation of this aspect is pending, we anticipate promising prospects based on our first results in Sec. 5.4. Our work provides the potential to design interpretable algorithms that directly correlate with specific cells or cell patterns. One possible direction for future research involves graph-based networks with attention mechanisms using these embeddings. Nevertheless, external validation of the results is necessary, and additional datasets are required, especially to verify the detection quality of our model. Furthermore, our models exhibit reliable performance only for WSI acquired at 0.25 µm/px resolution. While the results obtained with 0.50 µm/px images are acceptable in terms of detection, there is room for improvement, as there is a large performance gap between 0.25 µm/px and 0.50 µm/px WSI processing. We recommend scanning tissue samples at a resolution of 0.25 µm/px if technically possible. In the future, we plan to apply the proposed model with extracted nuclei tokens to downstream histological image analysis tasks. This will enable us to validate whether simultaneously extracted tokens are an advantage for building interpretable algorithms for computational pathology. Additionally, it will allow us to evaluate which tokens have achieved a more meaningful representation of the tissue and are better suited for downstream tasks, as there are only minimal differences in the segmentation and detection performance between our best-performing CellViT 256 and CellViT-SAM-H models. To ensure the accessibility of our results, we have made both the code and pre-trained models publicly available under an open-source license for non-commercial use.

Figure 1 :
Figure 1: Network structure of CellViT. An input image is transformed into a sequence of tokens (flattened input sections). By using skip connections at multiple encoder depth levels and a dedicated upsampling decoder network, precise nuclei instance segmentations are derived. Nuclei embeddings are extracted from the Transformer encoder.

Figure 2 :
Figure 2: Network structure of our proposed CellViT network consisting of a ViT encoder connected to multiple decoders via skip connections. Postprocessing is used to separate overlapping nuclei and perform nuclei type classification. For visualization purposes, the tissue classification branch is not illustrated. As encoder networks, we used the pre-trained ViT 256 and SAM models.

Figure 3 :
Figure 3: PanNuke nuclei distribution overview for each of the nineteen tissue types, sorted by the total number of nuclei inside the tissue. The total number of nuclei within a tissue type is given in parentheses. Adapted from [17].

with c_{i,j} the vector entry of c_i at position j. The training images are randomly sampled in a training epoch with replacement based on their sampling weights p_i(γ_s).

Table 1 :
Precision (P), Recall (R), and F1-score (F1) for detection and classification across the three PanNuke splits for each nuclei type. The centroid of each nucleus was used for computing detection metrics for segmentation networks. *TSFD-Net was not evaluated on the official three-fold splits of the PanNuke dataset and is left out of the comparison. **Model re-trained by ourselves. ***Models trained on downscaled 0.50 µm/px PanNuke images.

Figure 4 :
Figure 4: Example of PanNuke patches with ground-truth annotations and CellViT-SAM-H predictions overlaid for each tissue type.

Figure 5 :
Figure 5: Two-dimensional UMAP embedding visualization (left) of the CoNSeP dataset with the CellViT-SAM-H and CellViT 256 (HoVer-Net decoder) models trained on PanNuke. We extract cell tokens for each detected cell with our model, resulting in one embedding vector per cell. On the right side of the figure, representative clusters derived with the CellViT-SAM-H model are displayed alongside corresponding tissue images. The color overlay illustrates the ground-truth nuclei types within the dataset.
endothelial nuclei were grouped into the "spindle-shaped nuclei" class. The global clusters in the scatter plot represent cells from different images, with clusters containing cells from the same tissue phenotype being grouped together. An example of this is cluster 1 for the CellViT-SAM-H model. It comprises cell clusters from two images, both containing multiple glands. Within this cluster, the local spatial arrangement of the cell embeddings allows differentiation of nuclei types (epithelial, spindle-shaped, and inflammatory) despite the model not being explicitly trained for all cell classes (spindle-shaped cells are not explicitly defined in the PanNuke dataset). Cluster 3, which is spatially close to cluster 1, contains even more glands, while the tissue image associated with the distant cluster 2 lacks glands and primarily consists of spindle-shaped and inflammatory nuclei. In summary, the global UMAP arrangement primarily captures differences in the nuclei's tissue environment (e.g., nearby glands, muscles). The local arrangement highlights distinctions between nuclei without the need for fine-tuning the model for specific nuclei types. Notably, for the CellViT 256 model, the global tissue differences are even more pronounced. To quantitatively assess the quality of the embeddings, we trained a linear nuclei classifier (Appendix) on the embeddings of the training data (15,548 nuclei) to classify the nuclei into the CoNSeP classes. We evaluated the classifier on the embeddings
of the test images (8,773 nuclei). The model achieved an area under the receiver operating characteristic curve (AUROC) of 0.963 for the validation data using the CellViT-SAM-H embeddings. When utilizing the CellViT 256 embeddings, the model achieved an AUROC of 0.960. This demonstrates the effectiveness of our embeddings in classifying unknown nuclei classes, with both CellViT-SAM-H and CellViT 256 embeddings yielding high AUROC values.
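As a simplified stand-in for the linear classifier on cell tokens, a nearest-centroid classifier illustrates how the extracted embeddings can be used to label nuclei; it is not the classifier described in the Appendix:

```python
def fit_centroids(embeddings, labels):
    """Per-class mean embedding; a deliberately simple stand-in for a
    trained linear classifier on cell-token embeddings."""
    sums, counts = {}, {}
    for e, y in zip(embeddings, labels):
        s = sums.setdefault(y, [0.0] * len(e))
        for k, v in enumerate(e):
            s[k] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, e):
    """Label of the nearest class centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(centroids[y], e)))

# Toy 2-D "embeddings" with two well-separated nuclei classes:
emb = [[0, 0], [0, 1], [10, 10], [10, 11]]
lab = ["inflammatory", "inflammatory", "epithelial", "epithelial"]
c = fit_centroids(emb, lab)
assert predict(c, [0.5, 0.5]) == "inflammatory"
assert predict(c, [9, 9]) == "epithelial"
```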

Fabian Hörst

Figure A.1:
Figure A.1: Example of one MoNuSeg tissue sample with ground-truth binary masks and predictions of the CellViT-SAM-H model for different input sizes and magnifications.

Table 2 :
Average PQ across the three PanNuke splits for each nuclear category on the PanNuke dataset. *TSFD-Net was not evaluated on the official three-fold splits of the PanNuke dataset and is left out of the comparison. **Model re-trained by ourselves. ***Models trained on downscaled 0.50 µm/px PanNuke images.
We compare our models with HoVer-Net decoder against the best baseline models by computing the binary PQ (bPQ) and the more challenging multi-class PQ (mPQ) for each of the 19 tissue types in PanNuke, providing an assessment of both instance segmentation qualities. As baseline experiments, we include only the best HoVer-Net model by Graham et al. [8], TSFD-Net, and the original STARDIST and CPP-Net models with ResNet50 encoder by Chen et al. [29]. For our

Table 4 :
MoNuSeg validation results for CellViT 256 and CellViT-SAM-H models with HoVer-Net decoder, trained with CellViT hyperparameters on different dataset resolutions and inference patch sizes, averaged over all three PanNuke training folds. The original image size for 0.25 µm/px resolution (×40 magnification, mag.) is 1024 px, and 512 px for 0.50 µm/px (×20 mag.). *Models trained on downscaled 0.50 µm/px PanNuke images.