Domain generalization across tumor types, laboratories, and species - insights from the 2022 edition of the Mitosis Domain Generalization Challenge

Recognition of mitotic figures in histologic tumor specimens is highly relevant to patient outcome assessment. This task is challenging for algorithms and human experts alike, with deterioration of algorithmic performance under shifts in image representations. Considerable covariate shifts occur when assessment is performed on different tumor types, images are acquired using different digitization devices, or specimens are produced in different laboratories. This observation motivated the inception of the 2022 challenge on MItosis Domain Generalization (MIDOG 2022). The challenge provided annotated histologic tumor images from six different domains and evaluated the algorithmic approaches for mitotic figure detection provided by nine challenge participants on ten independent domains. Ground truth for mitotic figure detection was established in two ways: a three-expert majority vote and an independent, immunohistochemistry-assisted set of labels. This work provides an overview of the challenge tasks, the algorithmic strategies employed by the participants, and potential factors contributing to their success. With an F1 score of 0.764 for the top-performing team, we conclude that domain generalization across various tumor domains is possible with today's deep learning-based recognition pipelines. However, we also found that domain characteristics not present in the training set (feline as new species, spindle cell shape as new morphology, and a new scanner) led to small but significant decreases in performance. When assessed against the immunohistochemistry-assisted reference standard, all methods resulted in reduced recall scores, with only minor changes in the order of participants in the ranking.


Introduction
Despite advances in molecular characterization of biological tumor behavior, morphological tumor classification using established histopathologic techniques remains an important factor in tumor prognostication [Makki, 2015, Soliman and Yussif, 2016]. One criterion of particular interest within many tumor grading schemes is the density of cells undergoing division, which are visible as mitotic figures (MFs) in hematoxylin and eosin (H&E)-stained histopathological sections [Veta et al., 2015, 2019]. The number of MFs within a specific tumor area is enumerated by experienced pathologists, resulting in the mitotic count (MC). Despite the prognostic relevance of the MC, low inter-rater consistency on an object level has been reported in many studies [Veta et al., 2016, Meyer et al., 2005, 2009, Malon et al., 2012, Bertram et al., 2021]. The recommendation for pathologists is to select the region of the suspected highest mitotic activity, which is considered to be the best predictor of tumor behavior [Azzola et al., 2003, Meuten et al., 2008, Veta et al., 2015]. Selection of this region of interest (ROI) within the tumor has a great impact on the MC [Bertram et al., 2020a], but is difficult for pathologists to reliably accomplish and is poorly reproducible [Aubreville et al., 2020, Bertram et al., 2021]. While assessment of mitotic activity in the entire tumor section (or, in the case of the digital image, the whole slide image (WSI)) would be preferable in order to identify those mitotic hotspot ROIs, this is not feasible in current practice. Additionally, low inter-rater consistency on an object level within these selected ROIs has been reported in many studies, with a tendency of pathologists to overlook MFs [Veta et al., 2016, Meyer et al., 2005, 2009, Malon et al., 2012, Bertram et al., 2021]. The combination of these circumstances and the recent availability of large-scale digital pathology solutions makes automatic detection of MFs desirable.
Unsurprisingly, MF detection was one of the earliest identified areas of research interest in computational pathology, with the first approaches dating back to 2008 [Malon et al., 2008]. The first challenge on MF detection in breast cancer (MITOS2012 [Roux et al., 2013]) was held at the International Conference on Pattern Recognition (ICPR) and resulted in the first publicly available MF dataset. While this gave rise to algorithm development in the field, it was also an example of questionable dataset quality, as the training and test sets were selected from the same histology slides [Roux et al., 2013]. More recent challenges (MITOS2014 [Roux et al., 2014], AMIDA13 [Veta et al., 2015], TUPAC16 [Veta et al., 2019]) also comprised breast cancer and incorporated a higher number of cases, yet were still limited by having the same digitization device for the training and test set.
As shown by prior research [Aubreville et al., 2021], the digitization device has a decisive influence on the detection quality, as it coincides with a shift in the image representation, leading to a domain shift in the latent representation of the detection models [Stacke et al., 2020, Aubreville et al., 2023a]. Investigation of these limitations was the main idea behind the MItosis DOmain Generalization (MIDOG) challenge, held as a one-time event at the International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI) in 2021. This challenge, which was the first to directly target domain generalization in histopathology, evaluated the detection of MFs in ROIs of human breast cancer, digitized using various devices (WSI scanners).
Since MFs are not only of interest for human breast cancer, the 2022 MIDOG challenge extended the task of MF domain generalization to include further representation shifts of interest: in addition to the use of different WSI scanners, the training dataset was enhanced by including histological specimens from different tumor types as well as different species (human, canine, feline), processed by different laboratories. We define a tumor domain as a specific combination of tumor type, species, laboratory, and WSI scanner. We found that the domain gap between tumor types is substantial [Aubreville et al., 2023b] and appears to be more important than the domain gap between scanners; thus, the cases used for the MIDOG 2022 challenge were primarily categorized by tumor type.

Challenge format and task
As in previous challenges on MF detection, we provided ROIs, selected by an experienced pathologist from a tumor region with the presumed highest mitotic activity and appropriate tissue and scan quality. MF candidates were identified and assessed by a blinded majority vote of three experts (with the third expert only consulted if the first two disagreed). The training set, consisting of 405 tumor cases (corresponding to 405 patients) and featuring 9,501 MF annotations, was released on April 20, 2022. These cases were split across six tumor domains (see Fig. 1), out of which five were provided with labels and one was provided without labels as an additional data source for unsupervised domain adaptation techniques. An extended version of the training set, including two novel domains, was made available under a Creative Commons CC0 license post-challenge [Aubreville et al., 2023b].
The participants were required to package their algorithmic solution in the form of a docker container, which was subsequently evaluated on the test data on the grand-challenge.org platform (https://midog2022.grand-challenge.org) in a fully automatic manner, i.e., no participant had access to any of the test images during or after the challenge. To perform a technical validation of the docker containers, we provided an independent preliminary test set, consisting of four unseen tumor domains. During a preliminary test phase, which started on August 5, participants were allowed to perform one evaluation of an algorithmic approach per day. We explicitly made the participants aware that the four domains of the preliminary test set were disjoint from the tumor domains of the actual challenge test set, so overfitting to those domains by means of hyperparameter or model selection would not be meaningful. The final challenge submission phase started on August 26 and lasted until August 30. During this phase, each participating team was allowed to submit only a single algorithmic approach.
The challenge provided two tracks: as multiple openly accessible datasets on MF detection already exist, we gave participants the choice to either use only data provided by the challenge (track 1) or additionally use publicly available data and labels (track 2). In the second track, participants also had the option to use in-house datasets under the condition that these datasets were made publicly available and announced on the challenge website up to one month prior to the challenge. We opted for this strategy to maximize the reproducibility of the challenge results. However, no participating team chose to use previously non-public datasets.
The structured challenge design includes details about the policies regarding participation, publication, awards, and results announcement, and was made publicly available [Aubreville et al., 2022]. The challenge design was proposed and evaluated in a single-blinded peer review for admission to MICCAI 2022.

Main novelties over the predecessor
While the task (MF detection on ROI images) was identical to the preceding MIDOG 2021 challenge, we incorporated three major modifications in the 2022 challenge design that set it apart from its predecessor:
• We extended the sources of domain shift by not only including the imaging device and the inherent stain differences between cases but also by incorporating different laboratories (and hence tissue processing), different tumor types, and different species, minimizing the gap to real-world data variability.
• The evaluation was carried out on ten independent tumor domains, representing a wide variety of conditions and thus allowing for better generalization of the assessment. The ten domains were additionally disjoint from the four independent domains of the preliminary test set used for technical validation of the docker pipeline.
• We established the ground truth of the test set not only as the majority vote of three experts on the H&E-stained sections (used for challenge evaluation and ranking) but additionally using an immunohistochemical (IHC) stain for Phospho-Histone H3 (PHH3), which is specific for cells entering the mitotic cycle [Hendzel et al., 1997]. The IHC image was superimposed on the H&E image for assisted labeling, aiming to reduce object-level confusion, which is a main source of inter-rater disagreement [Veta et al., 2016].

Material and evaluation methods
For all tumor types included in our datasets, the MC has well-established prognostic relevance for discriminating patient outcome, either as a solitary prognostic test or as part of an established grading scheme. We retrieved human tissue samples from the diagnostic archives (DAs) of the Department of Pathology of the University Medical Center (UMC) Utrecht, The Netherlands, as well as the Institute of Neuropathology and the Institute of Pathology of the University Hospital Erlangen, Germany. All samples were prepared from paraffin-embedded tumor sections stained according to the standard procedures of the respective institutions. We received ethics approval from the UMC Utrecht (TCBio 20-776) and the ethics board of the medical faculty of FAU Erlangen-Nürnberg (AZ 92 14B, AZ 193 18B, 22 342 B). For samples taken from the DAs of veterinary pathology laboratories (Freie Universität Berlin (FUB), Germany, and University of Veterinary Medicine Vienna (VMU), Austria), no ethics approval was required.

Challenge cohort and tumor domains
In our datasets, we included tumors from multiple different morphological categories: aggregated cell pattern, round cell shape, and spindle cell shape. While these categories were used in order to allow comparison of the algorithmic performance depending on the tumor morphology, we acknowledge that some tumor types (see below) are difficult to group into these categories, and the best-fitting category was chosen. In the training dataset, we included 405 cases (see Fig. 1), split into the following domains:
• Domain F: Human melanoma, a neuroectodermal tumor comprising all three morphological patterns, retrieved from the DA of UMC Utrecht. 51 cases digitized with a Hamamatsu XR (C12000-22) at 40× magnification (0.23 µm/px). MC is part of the staging and classification scheme of the AJCC for melanoma [Gershenwald et al., 2017]. This domain was not labeled and only provided as an additional source of data diversity for unsupervised approaches.
While a consecutive selection of cases would ideally be desirable to provide representative samples, we intentionally deviated from this norm in this iteration of the challenge. Specifically, we ensured the inclusion of a minimum number of mitotically active cases across all domains. This was done in order to ensure sufficient dataset support for MF objects in each respective domain.
We prepared a small (20 cases) preliminary test set to check the validity of the algorithmic approaches through the docker submission system. In this dataset, the following domains were included:
• Domain α: Human breast carcinoma, an epithelial tumor with aggregated cell pattern.
• Domain β: Canine osteosarcoma, a mesenchymal tumor with predominantly spindle cell morphology, retrieved from the DA of VMU. Five cases digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px).
• Domain δ: Canine pheochromocytoma, a neuroendocrine tumor with aggregated cell pattern/morphology, retrieved from the DA of VMU. Five cases digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px).
For the evaluation of the challenge, we constructed the so-called final test set, for which only a single evaluation per team was permitted. The dataset, of which an overview is shown in Fig. 2, included 10 cases per domain, encompassing the following domains, evenly divided between human and veterinary samples:
• Domain 1: Human melanoma, a neuroectodermal tumor comprising all three morphological patterns (round cells, spindle cells, aggregated cell pattern), retrieved from the DA of UMC Utrecht, digitized using a Hamamatsu S360 (C13220) at 40× magnification (0.23 µm/px).
MC is part of the staging and classification scheme of the AJCC for melanoma [Balch et al., 2009].
• Domain 2: Human astrocytoma, a neuroectodermal tumor with round nuclear shape and star-like cytoplasmic projections (mostly fitting into the round cell category), retrieved from the DA of the Institute of Neuropathology at University Hospital Erlangen, digitized with a Hamamatsu S60 at 40× magnification (0.22 µm/px).
• Domain 3: Human bladder carcinoma, an epithelial tumor with aggregated cell pattern, retrieved from the DA of the Institute of Pathology at University Hospital Erlangen, digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px). MC is used in the differentiation of tumor types according to Epstein et al. [1998] and was recently confirmed to be prognostically significant by Akkalp et al. [2016].
• Domain 4: Canine breast carcinoma, an epithelial tumor with aggregated cell pattern, retrieved from the DA of VMU, digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px). MC is part of the grading scheme by Peña et al. [2013].
• Domain 5: Canine cutaneous mast cell tumor, a mesenchymal tumor with round cell morphology, retrieved from the DA of FUB, digitized with a Hamamatsu S360 (C13220) at 40× magnification (0.23 µm/px). MC is part of the grading scheme by Kiupel et al. [2011].
• Domain 6: Human meningioma, a mesenchymal/neuroectodermal tumor with spindle cell shape, retrieved from the DA of the Institute of Neuropathology at University Hospital Erlangen, digitized with the Hamamatsu S60 at 40× magnification (0.22 µm/px). MC is part of the 2016 WHO grading scheme [Louis et al., 2016].
• Domain 7: Human colon carcinoma, an epithelial tumor with aggregated cell pattern, retrieved from the DA of UMC Utrecht, digitized using a Hamamatsu S360 (C13220) at 40× magnification (0.23 µm/px). MC is not part of the grading scheme but was shown to predict survival for lymph-node-negative colon carcinoma by Sinicrope et al. [1999].
• Domain 8: Canine splenic hemangiosarcoma, a mesenchymal tumor with spindle cell morphology, retrieved from the DA of VMU, digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px). MC is part of the grading scheme of Ogilvie et al. [1996].
• Domain 9: Feline (sub)cutaneous soft tissue sarcoma, a mesenchymal tumor with spindle cell morphology, retrieved from the DA of VMU, digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px). MC is part of the grading scheme of Dobromylskyj et al. [2021].
• Domain 10: Feline gastrointestinal lymphoma, a mesenchymal tumor with round cell morphology, retrieved from the DA of VMU, digitized with a 3DHistech Pannoramic Scan II at 40× magnification (0.25 µm/px). For cats, the MC is known to correlate with the grade according to the National Cancer Institute working formulation [Valli et al., 2000].
While human melanoma (unlabeled) and canine cutaneous mast cell tumor were already part of the training set, the test set used different scanners for both tumor types.

Establishment of ground truth
The MC is typically assessed on an ROI of 10 high-power fields, the size of which depends on the optical properties of the microscope [Fitzgibbons and Connolly, 2023]. For digital microscopy, it is more sensible to directly define the area, calculated from the resolution of the digitization device, which we set in accordance with previous work [Veta et al., 2015, 2019] to 2 mm². The ROI was selected from each digitized WSI by a pathologist with expertise in tumor pathology (C.A.B.) as the area with appropriate tissue and scan quality and the perceived highest mitotic activity, which was considered to be more likely found in a region with high cellular density. This is in accordance with current guidelines [Donovan et al., 2021, Avallone et al., 2021, Ibrahim et al., 2022, Fitzgibbons and Connolly, 2023].
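The 2 mm² convention above translates directly into pixel dimensions once the scanner resolution is known. The following minimal sketch illustrates this conversion; the rectangular 4:3 aspect ratio is an assumption for illustration, not taken from the challenge.

```python
# Sketch: convert a fixed ROI area (e.g., the 2 mm^2 convention) into pixel
# dimensions for a given scanner resolution, such as 0.25 um/px for the
# 3DHistech Pannoramic Scan II mentioned in the text.
# The 4:3 aspect ratio is an illustrative assumption.

def roi_size_px(area_mm2: float, um_per_px: float, aspect_ratio: float = 4 / 3):
    """Return (width_px, height_px) of a rectangular ROI of the given area."""
    area_um2 = area_mm2 * 1e6            # 1 mm^2 = 1e6 um^2
    area_px = area_um2 / um_per_px ** 2  # total number of pixels in the ROI
    height = (area_px / aspect_ratio) ** 0.5
    return round(aspect_ratio * height), round(height)

# 2 mm^2 at 0.25 um/px corresponds to roughly 32 million pixels
w, h = roi_size_px(2.0, 0.25)
```

At 0.23 µm/px (Hamamatsu S360), the same area spans correspondingly more pixels, which is why challenge ROIs from different scanners differ in pixel dimensions despite covering the same physical area.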
Given the well-known inter-rater disagreements in the identification and annotation of MFs, strategic study design methods are essential to limit the effects of these factors on the ground truth for subsequent (ideally unbiased) evaluation. Two main annotation biases need to be considered: when presented with an MF previously identified as such by another expert, an independent expert might be subject to a confirmation bias. Similarly, it has been reported that the chance of overlooking individual MFs, especially in densely populated cell areas or under sub-optimal image quality, should not be neglected [Bertram et al., 2021].
Our annotation method (described in more detail in Bertram et al. [2019]) takes both factors into account by identifying all candidate objects, i.e., MFs as well as non-mitotic figures (NMFs)/imposters, and then having them independently rated by three experts. For the identification, an expert (C.A.B.) initially identified MF objects and a roughly similar number of NMF objects in the images. To avoid missing MFs in this identification step, we trained a RetinaNet [Lin et al., 2017] single-stage object detection model on the annotations of this expert and carried out model inference in a cross-validation scheme to spot additional candidate objects that were previously overlooked. These additional objects were then also assessed by the first expert to yield a first label for each object. At the end of this first step, both classes had a similar prevalence according to the initial assessment, which was not communicated to the experts conducting the consecutive assessments of the MF/NMF cells. Next, a secondary expert (R.K.), who was blinded to the assessment of the first expert, assessed all previously identified objects according to the same two classes (MF vs. NMF). In case of non-consensus amongst those two experts, a third expert (T.A.D.) was presented the object in question, without any information about previously assigned labels, to render the final vote.
All three experts have more than five years of experience in MF identification. This independent vote counteracts a confirmation bias, while the use of machine-learning support mitigates the omission of individual objects. Prior to the assessment, the experts agreed on common criteria for the identification of MFs [Donovan et al., 2021]. The annotation of all parts of our dataset (training set, preliminary test set, and final challenge test set) was carried out using the same methodology. This ground truth definition was used for performance evaluation and ranking of the participants during the MIDOG 2022 challenge.

PHH3-assisted ground truth
Due to the known high degree of inter-rater disagreement for mitotic figure assessment [Meyer et al., 2005], it is prudent to create a ground truth that relies less on the subjective judgments of experts. Hence, as an alternative ground truth for the test set, we used IHC staining for PHH3 as decision support for annotations by a single expert. This ground truth definition was not available during the MIDOG 2022 challenge and was developed for this summary paper to gain a better understanding of the algorithmic performance. Histone H3 is a protein that is phosphorylated in the early stages of the mitotic phase and represents a specific marker for mitosis [Hendzel et al., 1997, Bertram et al., 2020a, Tellez et al., 2018]. However, the specific stain is less pronounced in the last phase (telophase) of mitosis, a phase which is usually morphologically conspicuous in the H&E stain, and is already present in early prophase, which is usually not apparent based on H&E morphology. Thus, this IHC stain cannot be used alone for annotating mitotic figures according to definitions of the H&E morphology. We hypothesized that the combination of these two staining techniques would increase label consistency. To evaluate H&E and PHH3 in the same cells, we de-stained the H&E-stained slides after digitization and re-stained them with an antibody for PHH3, combined with a secondary antibody equipped with a tailored enzyme that reacts with a substrate to yield a brown stain (see Fig. 3). After digitization of the IHC-stained slide and subsequent manual registration of both scans, a tool based on the EXACT annotation server [Marzahl et al., 2021] was employed by an expert, in which both scans could be superimposed with variable transparency. Hence, it was possible to simultaneously evaluate both the specific immunopositivity for PHH3 and the morphology in the H&E stain for each cell. In case of non-perfect registration between cells in the PHH3 and H&E stains, the expert annotated the exact coordinate of the MF in the H&E stain. Out of 100 cases of the test set, we were able to register 98 to the respective ROIs in the H&E image. For two cases (068 and 100), restaining with PHH3 was not possible due to damage during tissue handling. Immunopositive cells lacking the H&E morphology of MFs (mostly early prophase MFs) were not annotated, as it is impossible to identify them in the H&E images.
A considerable number of objects was ambiguous from H&E alone, in which cases the IHC staining pattern was used to decide on these borderline objects.

Dataset statistics
The MC is expected to vary across tumor types and species. This expectation was confirmed in the distribution of the MC shown in the histogram for the training set (Fig. 4) as well as in the box-whisker plots for the preliminary and the final challenge test set (Fig. 5). Tumor types with a comparatively high MC in our samples were canine lung cancer (domain B), canine lymphoma (domain C), canine osteosarcoma (domain β), as well as human bladder carcinoma (domain 3), human colon carcinoma (domain 7), canine hemangiosarcoma (domain 8), and both feline tumors (domains 9 and 10). The mean MCs of the training, preliminary test, and final challenge test set were 26.84, 18.00, and 34.74, respectively.

Reference approaches
For optimal familiarization, challenge participants were provided with three baseline approaches with algorithmic descriptions and preliminary test results. Out of these three approaches, two were based on the RetinaNet [Lin et al., 2017] single-stage object detection architecture and one was based on the Mask RCNN [He et al., 2017] architecture. The first RetinaNet-based approach used a domain-adversarial [Ganin et al., 2016] branch and was trained solely on the MIDOG 2021 training set, i.e., the identical setting as the reference approach for the MIDOG 2021 challenge [Wilm et al., 2022]. Considering that this approach was only trained on human breast cancer, we expected a considerable domain gap. The second RetinaNet-based approach was trained on the six domains of the training set (A-F) and used additional stain augmentation, based on Macenko's method for stain deconvolution [Macenko et al., 2009]. As the top-performing approaches of MIDOG 2021 all used (instance) segmentation, we also included the Mask RCNN for this purpose. This approach was, however, not trained with any specific domain-generalizing methods besides default image augmentation. We provided a detailed description of both approaches as part of the challenge proceedings [Ammeling et al., 2023].

Evaluation methods and metrics
MF identification is a balanced pattern recognition problem in that both an over- and an underestimation of the MC can lead to equally detrimental consequences: overestimation may lead to excessively aggressive treatment with significant side effects, whereas underestimation may contribute to more conservative treatment, potentially diminishing the overall treatment outcome. As in prior challenges [Veta et al., 2019, 2016, Roux et al., 2013, 2014], we thus decided to use the F1 score as our primary metric: it is the harmonic mean of precision and recall and thus rewards an operating point that balances the two. To counter averaging effects from the strongly heterogeneous distribution of the MC, we opted to calculate the F1 score across all cases/images from the sums of the respective true positives, false positives, and false negatives over all slides. As the F1 score is calculated from thresholded results, it is additionally insightful to see if competing approaches merely chose an unsuitable decision threshold while having otherwise proper pattern discrimination. Hence, we additionally evaluated the average precision (AP) metric, calculated as the mean precision for 101 linearly spaced recall values between 0 and 1. Further, we calculated the precision and recall for all algorithms.
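The two metrics above can be sketched as follows. This is an illustrative reimplementation, not the challenge evaluation code; it assumes detections have already been matched to ground-truth MFs, yielding per-image counts of true positives, false positives, and false negatives.

```python
# Sketch of the evaluation metrics: F1 pooled over all images (counts are
# summed before computing the score, not averaged per image) and AP as the
# mean precision at 101 linearly spaced recall levels.
# Function and variable names are illustrative, not from the challenge code.

def pooled_f1(per_image_counts):
    """F1 from TP/FP/FN summed over all images."""
    tp = sum(c["tp"] for c in per_image_counts)
    fp = sum(c["fp"] for c in per_image_counts)
    fn = sum(c["fn"] for c in per_image_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def average_precision(precisions, recalls):
    """AP as mean precision over 101 recall levels: at each level, take the
    best precision achievable at that recall or higher (interpolated AP)."""
    ap = 0.0
    for r in [i / 100 for i in range(101)]:
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        ap += max(candidates, default=0.0)
    return ap / 101

f1 = pooled_f1([{"tp": 8, "fp": 2, "fn": 2}])
```

Pooling the counts before computing F1 means that a slide with many MFs contributes proportionally more than a near-empty slide, which is exactly the averaging effect the text describes avoiding at the per-image level.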

Statistical analysis of the results
To compare the performance of the approaches, an omnibus test using analysis of variance (ANOVA) was first conducted. Given significant differences, it was followed by Tukey's honest significant difference (HSD) test [Tukey, 1949] as a post-hoc test for pairwise comparisons. Both tests were performed on the respective F1 score per individual image of the test set. As significance level, we chose α = 0.05.
The tests allowed us to determine if there were statistically significant differences between the approaches in terms of their F1 scores. In contrast to multiple pairwise t-tests, Tukey's method inherently controls the family-wise error rate.
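The procedure above can be sketched in Python. The per-image F1 values below are synthetic; the Tukey HSD p-value is computed by hand from the studentized range distribution (assuming equal group sizes) rather than via a dedicated library routine, to make the mechanics explicit.

```python
# Sketch of the statistical comparison: ANOVA omnibus test over per-image F1
# scores grouped by team, followed by a Tukey-HSD-style pairwise comparison.
# Data and names are illustrative, not from the challenge.
import math
from scipy import stats

def tukey_hsd_pvalue(group_a, group_b, all_groups):
    """Tukey HSD p-value for one pair, assuming equal group sizes."""
    k = len(all_groups)                          # number of teams compared
    n = len(group_a)                             # images per team
    df = sum(len(g) for g in all_groups) - k     # residual degrees of freedom
    # pooled within-group mean squared error
    mse = sum(sum((x - sum(g) / len(g)) ** 2 for x in g)
              for g in all_groups) / df
    q = abs(sum(group_a) / n - sum(group_b) / n) / math.sqrt(mse / n)
    return stats.studentized_range.sf(q, k, df)

team_a = [0.80, 0.75, 0.78, 0.82, 0.77]   # synthetic per-image F1 scores
team_b = [0.60, 0.55, 0.58, 0.62, 0.57]
team_c = [0.79, 0.74, 0.80, 0.81, 0.76]
groups = [team_a, team_b, team_c]

f_stat, anova_p = stats.f_oneway(*groups)        # omnibus ANOVA
# post-hoc pairwise tests (only meaningful if anova_p < 0.05)
p_ab = tukey_hsd_pvalue(team_a, team_b, groups)
p_ac = tukey_hsd_pvalue(team_a, team_c, groups)
```

In practice one would typically reach for `statsmodels`' `pairwise_tukeyhsd` or SciPy's `tukey_hsd` rather than the hand-rolled version, but the hand-rolled form shows why the family-wise error rate is controlled: the studentized range distribution accounts for the number of groups k.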
Additionally, we performed a mixed linear model regression analysis to analyze effects that can be attributed to the tumor domain or groups thereof. For the purpose of our study, we categorized tumors into distinct groups based on tumor morphology, species, and the scanner used for digitization. Each of those categories had a previously unseen condition (e.g., an unknown species, scanner, or morphology) in the test set. Morphology was differentiated between aggregated cell pattern, round cell shape, and spindle cell morphology. We differentiated species between human, canine, and feline. Scanners were differentiated between the 3DHistech scanner and the Hamamatsu S360 and S60 scanners. Within this framework, the F1 score was selected as the dependent (endogenous) variable to assess the predictive success of each model. Since each team's algorithmic contribution can be thought of as a random draw from a larger population of algorithms that are not separate and independent, we accounted for the variance attributed to the different teams by incorporating it as a random effect for the intercept, resulting in a linear mixed effects model with a random intercept and fixed slope. This means that each team gets its own intercept estimate but shares a common slope. The models were fitted using the restricted maximum likelihood method, and we report the residual variance and Bayesian information criterion (BIC) values for each fit.
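A random-intercept model of this kind can be sketched with `statsmodels`. The data below are synthetic (each team gets its own baseline offset, and species shifts the mean F1); the model structure, not the numbers, is the point, and this is not the authors' analysis code.

```python
# Illustrative sketch of a linear mixed effects model with a random intercept
# per team and a fixed effect for a domain grouping (here: species), fitted
# with REML. Synthetic data; not the authors' code.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
teams = [f"team{i}" for i in range(5)]
species = ["human", "canine", "feline"]

rows = []
for team in teams:
    team_offset = rng.normal(0, 0.05)            # team-specific intercept
    for sp_idx, sp in enumerate(species):
        for _ in range(20):                      # images per team/species cell
            f1 = 0.70 - 0.05 * sp_idx + team_offset + rng.normal(0, 0.03)
            rows.append({"team": team, "species": sp, "f1": f1})
df = pd.DataFrame(rows)

# per-image F1 as dependent variable, species as fixed effect,
# team as random intercept; restricted maximum likelihood fit
model = smf.mixedlm("f1 ~ species", df, groups=df["team"])
fit = model.fit(reml=True)
coef_human = fit.params["species[T.human]"]      # vs. reference level 'canine'
coef_feline = fit.params["species[T.feline]"]
```

Because the reference level is chosen alphabetically (here `canine`), the fixed-effect coefficients express the mean F1 shift of the other species relative to it, while the team-to-team baseline variation is absorbed by the random intercept.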
To investigate the distribution across samples, tumor domains, and methods, we performed empirical bootstrapping of the results of each test case, i.e., we randomly selected the same number of cases with replacement from the set of results per case before calculating the precision, recall, AP, and F1 values. Bootstrapping offers a robust alternative to individual image-based metrics for statistical analysis, particularly when evaluating metrics like the AP and F1 scores in contexts where the target class, such as MFs, varies widely in prevalence.
The use of bootstrapping mitigates the disproportionate influence that the target class frequency within individual images might exert on our results, as the F1/AP scores are now calculated on instances comprising multiple images. By this methodological choice, we thus obtain values that are empirically aligned with the expected distributional characteristics of the dataset, while simultaneously reducing our dependence on the variable prevalence of mitotic figures in individual images.
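The case-level bootstrap described above can be sketched as follows. The per-case (tp, fp, fn) tuples are illustrative stand-ins for real matching results, and the helper name is not from the challenge code.

```python
# Sketch of the empirical bootstrap: resample cases with replacement, then
# recompute the pooled F1 on each replicate (F1 = 2*TP / (2*TP + FP + FN)).
# Data and names are illustrative.
import random

def bootstrap_f1(per_case_counts, n_boot=1000, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        # draw the same number of cases, with replacement
        sample = rng.choices(per_case_counts, k=len(per_case_counts))
        tp = sum(c[0] for c in sample)
        fp = sum(c[1] for c in sample)
        fn = sum(c[2] for c in sample)
        scores.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
    return scores

# e.g., 10 cases of one domain as (tp, fp, fn) tuples
cases = [(12, 3, 2), (8, 1, 4), (0, 2, 1), (25, 5, 3), (4, 0, 2),
         (9, 2, 2), (15, 4, 5), (2, 1, 1), (7, 3, 3), (11, 2, 4)]
scores = bootstrap_f1(cases)
```

Each replicate pools the counts over the resampled cases before computing F1, so a case with few or no MFs cannot dominate a replicate's score the way it can dominate a per-image average.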
From the bootstrapped (across tumor domain and team) sets, we additionally determined 80% confidence regions of precision and recall for each team and tumor. This was performed by using a Gaussian kernel density estimator over the bootstrapped sample of precision and recall values and thresholding the density at a level that includes 80% of the respective values. We chose an 80% interval since it facilitated an easier comparison of the approaches than larger intervals.
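The density-thresholding step can be sketched as follows: fit a Gaussian KDE over the bootstrapped (precision, recall) pairs, evaluate the density at every sample, and keep the high-density region that contains 80% of the samples. The Gaussian-distributed samples below are synthetic placeholders for real bootstrap output.

```python
# Sketch of the 80% confidence-region construction over bootstrapped
# (precision, recall) pairs via a Gaussian KDE. Synthetic samples.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
precision = np.clip(rng.normal(0.75, 0.04, 2000), 0.0, 1.0)
recall = np.clip(rng.normal(0.70, 0.05, 2000), 0.0, 1.0)
samples = np.vstack([precision, recall])     # shape (2, n), as KDE expects

kde = gaussian_kde(samples)
density = kde(samples)                       # KDE density at each sample
threshold = np.quantile(density, 0.20)       # cut off the lowest 20%
inside = density >= threshold                # mask of the 80% region
frac = inside.mean()                         # fraction retained, ~0.8
```

For plotting, the same `kde` can be evaluated on a precision-recall grid and the contour at `threshold` drawn as the region boundary.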

Pattern recognition tasks
The majority of teams (5/9) chose to frame the task as an object detection task (see Table 1), partially with a second classification stage. Two teams used a semantic segmentation approach and two teams chose a classification-based detection. In particular, the approach by Jahanifar et al. [2022] used fixed-size disks around the centroid coordinates of the MFs to generate the segmentation mask target for track 1 and a segmentation mask generated by the NuClick algorithm [Koohbanani et al., 2020] for track 2, while the approach by Yang et al. [2022] used the filled inner circle of the provided bounding box as the segmentation target. In contrast, Lafarge and Koelzer [2023] used a classification of patches (78×78 px) with a sliding window, as in the original work by Cireşan et al. [2013]. Gu et al. [2023] framed object localization as a weakly-supervised learning task derived from class activation maps of medium-sized (240×240 px) patches that were classified as containing an MF or not.

Architectures
The majority of submissions were derivatives of convolutional neural networks (CNNs), while one team

Ensembling and Test-Time Augmentation
While both ensembling and test-time augmentation (TTA) are strategies well known to enhance model robustness, they were only employed by a minority of participants (see Table 1). Only the winning approach by Jahanifar et al. [2022] employed both ensembling and TTA. The runner-up approach by Kotte et al. [2023] employed a tailored ensembling of model scores of the first and second stages, but only in cases where the score of an object in the first stage did not exceed a given threshold. The approach by Lafarge and Koelzer [2023] ensembled two models trained with different augmentation strategies and integrated the effect of 90-degree rotations for TTA via the use of a rotation-invariant model [Cohen and Welling, 2016]. The approach by Annuscheit and Krumnow [2023] used four-fold TTA based on mirroring of the images.

Augmentation
All participating teams used standard geometric image transformations like rotation, scaling, and elastic deformations. Additionally, the majority of teams opted to use some form of standard color augmentation that manipulates hue, brightness, and contrast, and multiple teams used image perturbations such as blurring, sharpening, and noising. Bozaba et al. [2022] additionally employed mosaic augmentation [Bochkovskiy et al., 2020], and Gu et al. [2023] additionally used balanced mixup [Galdran et al., 2021]. Besides those general computer vision augmentation strategies, specific stain augmentation strategies for H&E-stained images were employed by three teams [Jahanifar et al., 2022, Gu et al., 2023, Annuscheit and Krumnow, 2023], while one approach augmented images by performing a style transfer in the frequency domain [Yang and Soatto, 2020].
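A typical H&E stain augmentation decomposes the image into stain concentrations via color deconvolution, jitters them, and recomposes the image. The sketch below uses the standard Ruifrok-Johnston stain vectors and is an illustration of the general technique under assumed jitter ranges, not any participating team's exact implementation.

```python
import numpy as np

# Ruifrok-Johnston H&E(+residual) stain vectors (rows: H, E, residual).
STAINS = np.array([[0.65, 0.70, 0.29],
                   [0.07, 0.99, 0.11],
                   [0.27, 0.57, 0.78]])
STAINS /= np.linalg.norm(STAINS, axis=1, keepdims=True)

def hed_augment(rgb, alpha, beta, rng=None, eps=1e-6):
    """Jitter the hematoxylin/eosin/residual concentrations of an RGB
    image (values in [0, 1]): c' = a*c + b per stain channel, with
    a ~ U(1-alpha, 1+alpha) and b ~ U(-beta, beta)."""
    if rng is None:
        rng = np.random.default_rng()
    od = -np.log10(np.clip(rgb, eps, 1.0))   # optical density (Beer-Lambert)
    conc = od @ np.linalg.inv(STAINS)        # per-pixel stain concentrations
    a = rng.uniform(1 - alpha, 1 + alpha, size=3)
    b = rng.uniform(-beta, beta, size=3)
    od_aug = (conc * a + b) @ STAINS         # recompose the optical density
    return np.clip(10.0 ** (-od_aug), 0.0, 1.0)

rng = np.random.default_rng(2)
img = rng.uniform(0.2, 0.9, size=(16, 16, 3))
aug = hed_augment(img, alpha=0.05, beta=0.01, rng=rng)
assert aug.shape == img.shape
```

With `alpha = beta = 0` the decomposition and recomposition cancel, so the image is reproduced, which makes the transform easy to sanity-check.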

Use of the unlabeled domain
The unlabeled domain F of the training set (human melanoma) was employed by four teams. Lafarge and Koelzer [2023] designed a hard-negative mining scheme that additionally employed the unlabeled domain by un-mixing the stains into hematoxylin, eosin, and a residual component, and then extracting objects with high residual components (e.g., stain artifacts), which can be mistaken for MFs. Gu et al. [2023] used the surplus domain by treating all of its images as negatives and counteracted these noisy labels with a specifically crafted loss function. Annuscheit and Krumnow [2023] used the domain as an additional domain in a representation learning scheme for domain adaptation. Finally, Wang et al. [2023b] used the additional data in an auxiliary domain classifier in a multi-task learning scheme.

Domain generalization methodologies
Besides augmentation, several teams employed specific strategies targeted at domain generalization. Annuscheit and Krumnow [2023] designed a domain adaptation scheme based on metric learning, where the distance of each sample to prototypes of all domains was minimized to achieve domain generalization. Wang et al. [2023b] employed multi-task learning with two auxiliary tasks: an overall MF classification for the patch and a tumor domain classifier, likely regularizing the model (and hence counteracting domain overfitting). Similarly, Yang et al. [2022] added a weight perturbation to the loss term, as this was shown to regularize the model and make it more robust to domain shifts [Wu et al., 2020].

Additional datasets used in track 2
In the second track of the challenge, it was permitted to use publicly available datasets. Yang et al.

Results
Overall, as shown in Fig. 6a and Fig. 6b, we found that two of the approaches submitted to the challenge had outstanding performance. The evaluation of track 1 on all ten tumor domains of the test set shows that the TIA Centre approach [Jahanifar et al., 2022] yielded the best overall performance (F 1 = 0.764), closely followed by the approach from the TCS Research team [Kotte et al., 2023] (F 1 = 0.757). Breaking this down into the ten tumor domains, we find a similar overall picture, with both approaches scoring first or second in all domains (see Table 2). We also note that the two leading approaches chose different strategies when optimizing the operating point: while the TIA Centre approach yielded a moderately lower recall value at a higher precision value, we found the opposite to be true for the TCS Research approach (see Fig. 6c and Fig. 6d). The F 1 score is roughly reflected in the precision-recall curves of Fig. 7.
In the second track of the challenge, we find a clear superiority of the approach by Jahanifar et al. [2022], further supported by it having the leading edge in all tumor domains.
Comparing the performance in both tracks across tumor domains, we find that tumor domains 2 (human astrocytoma) and 6 (human meningioma), i.e., the neuropathological domains, seem to have been particularly challenging, with overall maximum F 1 scores of 0.63 and 0.68, respectively (see Table 2). In contrast, domains 1 (human melanoma), 3 (human bladder carcinoma), 5 (canine cutaneous mast cell tumor), and 8 (canine splenic hemangiosarcoma) were the tumor domains to which the algorithms generalized best, achieving F 1 scores of up to 0.82, 0.81, 0.82, and 0.82, respectively.

Statistical analysis
The ANOVA yielded an F-value of 7.295 (df group = 13, df residual = 1386, p < 0.0001), indicating a significant difference between at least two of the algorithms. While the F 1 score, which we assessed statistically, differed considerably between the algorithms, the post-hoc analysis, performed as a Tukey HSD hypothesis test (depicted in Fig. A.2 of the supplementary material), yielded statistically significant (p < 0.05) differences only between the results of the MIDOG 2021 domain-adversarial RetinaNet baseline and most other approaches (with the exception of the Mask RCNN baseline and the Virasoft / HITszCPath approaches), between the HTW Berlin and the leading TIA Centre method, and between both the Virasoft and HITszCPath approaches and all approaches by TIA Centre as well as the approach by TCS Research. While the TIA Centre approaches and the TCS Research approach reached a higher overall F 1 score than the MIDOG 2022 domain-adversarial baseline provided by the organizers, this difference was not significant according to this statistical test.
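The F statistic underlying such a one-way ANOVA is the ratio of between-group to within-group mean squares and can be computed directly from sums of squares. The sketch below uses synthetic per-team F 1 samples; the group means and sample sizes are invented for illustration only.

```python
import numpy as np

def one_way_anova(groups):
    """One-way ANOVA F statistic over per-group samples:
    F = (between-group mean square) / (within-group mean square)."""
    all_vals = np.concatenate(groups)
    grand = all_vals.mean()
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b, df_w = k - 1, n - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

rng = np.random.default_rng(3)
# Synthetic bootstrapped F1 samples for three "teams" with a real offset.
groups = [rng.normal(m, 0.03, size=100) for m in (0.70, 0.72, 0.76)]
f_stat, df_b, df_w = one_way_anova(groups)
print(f_stat, df_b, df_w)  # large F -> reject equal means
```

A large F relative to the F(df_b, df_w) distribution justifies the post-hoc pairwise comparison (here, Tukey HSD).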
Visual analysis of the 80% confidence regions of precision and recall per tumor domain, shown in Fig. 8, reveals a wide spread in performance in the human astrocytoma and human meningioma domains, both originating from a lab that did not provide samples to the training set. Furthermore, the analysis yields low recall for the baseline MIDOG 2021 RetinaNet approach in feline soft tissue sarcoma, canine hemangiosarcoma, feline lymphoma, and canine mammary carcinoma. The figure also reveals that the UCLA-HCI approach, while performing well overall, had a particularly low recall in the human melanoma case. The results of the linear mixed effects model shown in Table 3 confirm the visual impression of considerable differences between tumor domains. Since all variables in this regression model are binary projections of the categorical variable and hence mutually exclusive, the coefficients can be interpreted directly as differences in the F 1 score. Feline soft tissue sarcoma, canine mammary carcinoma, and human astrocytoma all showed a highly significant decrease in detection performance. The subgroup analysis of morphological patterns (Table 4) showed a significantly reduced performance for the group of spindle cell tumors (not seen in training). Similarly, in the analysis of domains according to species (Table 5), we find a small but significant drop in performance for feline specimens, which were also not part of the training set. Notably, we also found significant differences between the scanners, with images scanned on the Hamamatsu S360 scanner (which was already seen as part of the training set) performing slightly better than those from the Hamamatsu S60 (Table 6). The variance of the random intercept is small (0.004 in all models), suggesting that the teams have similar intercepts and that the random intercept contributes little to the overall variability.
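The coefficient interpretation used here (mutually exclusive binary domain indicators, so each coefficient is the F 1 difference from the baseline category) can be illustrated with a plain dummy-coded least-squares fit. This simplified sketch omits the random team intercept of the actual mixed model and uses synthetic numbers.

```python
import numpy as np

# Dummy-coded regression: with mutually exclusive 0/1 domain indicators
# and a baseline category, each coefficient estimates the difference in
# mean F1 from the baseline domain.
rng = np.random.default_rng(4)
domains = np.repeat([0, 1, 2], 50)  # 0 = baseline domain
f1_scores = np.array([0.75, 0.63, 0.80])[domains] + rng.normal(0, 0.02, 150)

X = np.column_stack([np.ones(150),                 # intercept (baseline mean)
                     domains == 1, domains == 2])  # binary projections
coef, *_ = np.linalg.lstsq(X, f1_scores, rcond=None)
print(coef)  # ≈ [0.75, -0.12, +0.05]
```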

Runtime analysis
All inference runs were conducted on the same platform, grand-challenge.org, where all jobs are run in the same environment (ml.g4dn.xlarge configuration, 4 virtual CPU cores, 16 GiB RAM, NVIDIA Tesla T4 with 16 GB of VRAM, hosted by Amazon Web Services). This enabled us, within certain limits, to also compare the runtimes of the approaches. The analysis, shown in Fig. 9, reveals that only three approaches (the HTW Berlin and both RetinaNet approaches) had a runtime in the range of 100 s per image. The approaches by USZ/UZH Zurich and HITszCPath reached inference times per image below 200 s, while most approaches were in the 200-400 s range. In contrast, the two-stage approach by the AI medical team required a considerably higher runtime, in the range of 800 s per image.

Assessment on alternative (PHH3-assisted) ground truth
After the full annotation of 98 cases based on the joint information of the H&E- and PHH3-stained images, we found an increase in the count of MFs by 15.0%. Of the mitotic figures identified with the aid of the PHH3-stained images, 28.78% were previously not part of the majority vote of the three experts based on the H&E stain. We performed a post-hoc analysis of all these cells, the results of which are depicted in Table 7. Of those MFs only identified with the help of the PHH3 stain, 20.97% were from the feline lymphoma cases (9% of cases), which are generally difficult due to the small cell size, resulting in low cellular detail at the given image resolution. Over the complete test set, the primary reason for the discrepancy was a borderline mitotic figure morphology that was hard to discriminate from imposters: cells being out of focus or superimposed on other cells in thick tissue sections, poor tissue/image quality such as overstained chromatin structures, prophase morphology without obvious chromatin spikes that is difficult to differentiate from apoptotic cells, or other, not further classified reasons. Less common were difficulties distinguishing the MF from imposters due to a cell cycle phase bordering on the G2 phase with early membrane changes or the G1 phase with formation of the nuclear membranes of the two neighboring daughter cells. In 5.85% of cases the MFs were found to have an unusual morphology, while in 0.54% of cases we found an incomplete capture of the cell at the image borders. Only 1.08% of mitotic figures were considered labeling errors in the H&E approach, as characteristic MF morphology was apparent.

Figure 8: 80% confidence regions for precision and recall for each team/approach and tumor type. Confidence intervals were established using a Gaussian kernel density estimator applied to the bootstrapped datasets. Kindly refer to the supplementary material for an alternative version of this figure, where the subplots compile data over tumor domains for each team.
When evaluating against this alternative, IHC-assisted ground truth, we found overall lower recall values for all approaches, as shown in Fig. 10, also resulting in overall lower AP and F 1 values. However, the order of the approaches, when sorted by the F 1 value, was almost unaltered. Fig. 10 also shows the precision and recall values of both experts using the original H&E images when evaluated on the PHH3-assisted alternative ground truth, as well as the respective values for the three-expert majority vote, indicating a good alignment between the H&E-based and the IHC-assisted GT. For expert 1 (C.A.B.), we found an overall precision, recall, and F 1 value of 0.926, 0.611, and 0.736, respectively, and for expert 2 (R.K.) we found an overall precision, recall, and F 1 value of 0.659, 0.747, and 0.700, respectively. The three-expert majority vote achieved a precision, recall, and F 1 value of 0.818, 0.711, and 0.761, respectively. The fit of the model was determined using the Restricted Maximum Likelihood (REML) method, with a residual variance of 0.0486 and a BIC of −100.79.
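As a quick consistency check, the reported F 1 values follow from the stated precision and recall figures, since F 1 is their harmonic mean:

```python
def f1(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproducing the reported expert scores from their precision/recall values:
print(f"{f1(0.926, 0.611):.3f}")  # expert 1       -> 0.736
print(f"{f1(0.659, 0.747):.3f}")  # expert 2       -> 0.700
print(f"{f1(0.818, 0.711):.3f}")  # majority vote  -> 0.761
```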
Figure 10: Precision and recall of all approaches and the experts, evaluated on the PHH3-assisted alternative ground truth. AP values and curves are only given for approaches where the model scores were provided and consistent. The expert scores represent the independent assessment of experts 1 and 2 on the hematoxylin and eosin-stained images, which was performed when establishing the original challenge ground truth; the majority vote represents the challenge ground truth.

Table 7: Breakdown of mitotic figures that were additionally identified using the phosphohistone H3 (PHH3) stain.

Assessment of employed strategies
The MIDOG 2022 challenge was the first to assess MF recognition across multiple tumor types. This extends the range of covariate shifts to the visual context of the MFs. In the previous challenge, the main domain shift could be attributed to changes in color, sharpness, and depth of field (caused by the differing scanners). In this challenge, the generalization to different tumor types, and hence to unknown tissue types surrounding the MFs, is harder to reflect in dedicated domain generalization strategies, e.g., domain augmentation. This may explain why the participants of this iteration of the challenge did not opt to formulate novel augmentation strategies. It is noteworthy that the top three approaches in track 1 of the challenge used distinctly different strategies to address the pattern recognition problem (semantic segmentation followed by connected-components analysis [Jahanifar et al., 2022], object detection [Kotte et al., 2023], and classification on a sliding window [Lafarge and Koelzer, 2023]), highlighting that the how (i.e., augmentation, sampling scheme, post-processing) of training was likely more important than the what (i.e., the neural network architecture). One commonality between the three top-performing approaches in track 1 was that they all used some form of ensembling technique, which has been reported as a strong determinant of success in biomedical challenges [Eisenmann et al., 2023], and likely contributed directly to domain robustness. Of note, and in contrast to our expectations, approaches in track 2 (using additional data) of our challenge did not show improved performance compared to track 1. This is particularly interesting given the similarity of the approaches of the TIA Centre team for both tracks. We attribute this to the fact that the additional dataset used for training (TUPAC16, Veta et al. [2019]) in the approach by Jahanifar et al. [2022] might introduce semantic diversity in the labels for mitotic figures, since it used a different annotation process [Bertram et al., 2020b]. Given the size of the MIDOG22 training set in comparison to the TUPAC16 set, and that the MIDOG22 dataset already includes three different scanner domains for breast cancer (with TUPAC16 providing another two), there might also be little added informational value in the additional dataset.

Domain generalization on the test set
Our test set contained three notable conditions that were not part of any training set. First, it introduced a new scanner (Hamamatsu S60), which also coincided with specimens from another, yet unseen lab. As the analysis in Table 6 shows, we found a moderate but significant reduction in performance on samples from this scanner/lab. Second, the test set introduced a new species (feline). The regression analysis in Table 5 yielded only a moderate but significant performance drop for the unseen species. The third condition that was not part of the training set was the group of spindle cell tumors, where we also found a moderate but statistically significant reduction in performance (see Table 4). It is worth noting that feline soft tissue sarcoma, a tumor with spindle cell morphology, had the lowest scores across individual domains (refer to Table 3). This is likely to have an impact on both aggregated evaluations. While our test set represents the broadest span of tumor classifications yet, the pool of conditions with shared characteristics remains limited. This prohibits a definitive stratification of influencing aspects at this point. However, it is worth highlighting that all three domain-defining factors that were new in the test set of the challenge yielded drops in performance.
It should be noted that the evaluation using F 1 /AP scores on individual images, and not on the collective, tends to skew results towards lower values. Consider the scenario of a tumor image with a solitary MF. If this singular cellular event goes undetected (false negative), the F 1 score drops from 1.0 to 0.0. On the other hand, if there is a single false positive in addition to the correct detection, the F 1 score reduces from 1.0 to 0.67. Now contrast this with an image containing 100 MFs: there, either event changes the scores only marginally. This means that, especially for cases with a low true MF count, the F 1 score is strongly influenced by the number of false positives. Consequently, we anticipate a more significant decline in the F 1 score for low-grade cases, inherently weighting the deviation from the overall expected performance higher in the macro-average. The bootstrapped results, on the other hand, mitigate this problem by aggregating over a larger sample before calculating the metrics.
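The asymmetry described here is easy to check from the counts-based definition of F 1 (with one true positive plus one false positive on a single-MF image, F 1 is 2/3):

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw detection counts: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Image with a solitary MF: one error swings the score drastically.
print(f1_from_counts(tp=0, fp=0, fn=1))            # MF missed     -> 0.0
print(round(f1_from_counts(tp=1, fp=1, fn=0), 3))  # one extra FP  -> 0.667
# Image with 100 MFs: the same single error barely matters.
print(round(f1_from_counts(100, 1, 0), 3))  # -> 0.995
print(round(f1_from_counts(99, 0, 1), 3))   # -> 0.995
```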

Container-based submission
The use of containers for algorithmic submissions comes with an increased risk of unintended and unexpected technical failures for the participants. For this reason, we made an independent preliminary test set available to the participants. Additionally, test cases were shipped with the container. To avoid overfitting of hyperparameters to this set by the participants and, at the same time, to reduce the computational budget required to evaluate the containers, one daily execution was admitted during a two-week time frame prior to the submission. Since overfitting could still not be ruled out, in this version of the challenge we opted to use four independent domains (disjoint from the challenge test set) in this phase.
For every detected object, the submission format provided a field for the detection class (MF or NMF) as well as for the detection score, which we used for an automatic evaluation of the AP score. It was not mandatory to provide meaningful values in the score field, however. Consequently, we were only able to determine AP scores as well as precision-recall curves for the approaches where these scores were meaningful, whereas the other approaches were excluded from Figures 6b, 7, and 10.
Overall, the usage of containers for algorithmic submissions presented challenges that required careful handling, and our approach of providing a preliminary test set and imposing limitations on executions helped address some of the associated risks.However, since the independence of the test set is of utmost importance for a data challenge, we contend that these additional efforts are well-justified.

PHH3-assisted ground truth
The post-challenge evaluation on the alternative, PHH3-assisted ground truth yielded overall lower recall values for all approaches. We attribute this to the inclusion of multiple MFs having inconclusive morphological features in the H&E image, which could be identified with higher confidence in the IHC due to immunopositivity against PHH3 antibodies. Equivocal or inconclusive morphologies include the MF being out of focus due to the factual three-dimensionality of the sample as well as a general difficulty in clearly differentiating some MF morphologies (particularly prometaphase MFs) from imposters. In the PHH3 stain, however, these structures are clearly distinguishable due to immunoreactivity, which provides an unaltered high contrast, contributing to the overall higher number of MFs. Similar to the expert annotators of the H&E approach, algorithms were trained (based on the ground truth used) to exclude these morphologically inconclusive structures, which explains the lower recall values of all approaches. The good agreement of the challenge ground truth (three-expert majority vote) with the alternative, IHC-assisted ground truth highlights the benefits of multiple blinded expert ensembles for H&E-based MF annotations. The in-depth evaluation of mitotic figures that were only identifiable using the IHC stain as a secondary source of information (Table 7), however, also reveals the limitations of purely H&E-based ground truth definitions, as borderline morphological patterns were found to represent the majority of IHC-positive MFs that were not found in the annotations based solely on the H&E stain. Nonetheless, it is worth pointing out that the PHH3-assisted ground truth should not simply be perceived as an improved version of the H&E-based ground truth on H&E-stained images, even if it is a more accurate description of the biological truth. The majority of cells that were additionally attributed to be MFs after consulting the PHH3 stain were not sufficiently discriminable using the H&E stain alone, since the necessary information was likely just not contained in the images. Training on such annotations could hence exhibit a higher degree of label noise (if only the information available from the H&E image is considered as input to the network).
Our analysis indicates that by using the PHH3-assisted identification of MF objects, we can expect a considerable increase of the MC, which is in line with findings by other works on PHH3 alone [van Steenhoven et al., 2020, Dessauvagie et al., 2015]. Currently, grading schemes predominantly rely on H&E-based counts alone, and changing the methodology could invalidate the respective cutoff values. Hence, the use of PHH3-assisted labels for training MF detectors could also lead to an overall increase of the MC, which would conflict with current grading schemes.

Limitations of the AP metric
One insight from our challenge concerns the limitations of the AP metric, which averages the precision at defined recall values, as a ranking metric. Besides a high number of hyperparameters (such as the maximum number of detections, the interpolation method, and the recall grid), the AP metric is used according to multiple different definitions [Hirling et al., 2023]. Moreover, as can be seen in Fig. 7, none of the algorithms reached a precision of zero, which penalized the approaches in the AP metric. We hypothesize that this is a result of all approaches using a detection threshold before the non-maximum suppression, a common procedure to reduce the computational overhead of matching ground truth and candidates, which is an operation in O(n²). If no value can be meaningfully interpolated for high recall values (e.g., for the MIDOG 2021 baseline approach in Fig. 7 above a recall value of 0.6), the precision value is commonly extrapolated to 0, which penalizes the approach unjustly. Similarly, should the averaging be confined to the maximum achieved recall value, methods employing a high detection threshold would gain an unfair advantage. This is demonstrated in particular when comparing the winning approach by Jahanifar et al. [2022] and the runner-up by Kotte et al. [2023]. While the precision-recall curve in Fig. 7 clearly indicates the superiority of the winning approach, the AP metric (see Fig. 6b) benefits from the lower detection threshold of the approach by Kotte et al. [2023], giving the false impression that the latter approach has a higher decision-threshold-independent performance. This provides additional evidence for the utility of the F 1 score as the primary challenge metric.
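The effect of the two interpolation conventions can be made concrete with a small sketch. The 11-point recall grid and the toy curve below are illustrative assumptions (actual AP definitions vary, as noted above); the point is only how differently the two conventions score the same truncated curve.

```python
import numpy as np

def interpolated_ap(recalls, precisions, num=11, pad_zero=True):
    """Pascal-VOC-style AP: mean of the interpolated precision (the maximum
    precision at recall >= g) over an evenly spaced recall grid.

    pad_zero=True assigns precision 0 to grid points beyond the maximum
    achieved recall, penalizing approaches that cut detections off at a
    high score threshold; pad_zero=False averages only over reachable
    grid points, favoring exactly those approaches.
    """
    grid = np.arange(num) / (num - 1)  # exact 0.0, 0.1, ..., 1.0
    r = np.asarray(recalls)
    p = np.asarray(precisions)
    vals = []
    for g in grid:
        reachable = p[r >= g]
        if reachable.size:
            vals.append(reachable.max())
        elif pad_zero:
            vals.append(0.0)
    return float(np.mean(vals))

# Toy precision-recall curve truncated at recall 0.6 by a detection threshold:
r, p = [0.0, 0.3, 0.6], [1.0, 0.9, 0.8]
print(round(interpolated_ap(r, p, pad_zero=True), 3))   # -> 0.555
print(round(interpolated_ap(r, p, pad_zero=False), 3))  # -> 0.871
```

Neither value fairly summarizes a method that simply never emits low-confidence detections, which is the dilemma discussed above.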

Performance comparison and outlook
We found that the top algorithmic solutions of this challenge detected MFs at a level similar to that of the 2021 MIDOG challenge (top F 1 value of 0.748 in 2021 [Aubreville et al., 2023a] and 0.764 in 2022). Additionally, comparing these performances to published F 1 values for human experts (0.563 for human breast cancer [Aubreville et al., 2023a], 0.79 for canine cutaneous mast cell tumor [Bertram et al., 2021]) indicates that the automatic approaches are in the range of human experts. Nevertheless, it is worth pointing out that human experts typically perform this task not only on ROIs but on the entire slide, which was not the task of this challenge. We hence encourage the creation of further datasets and challenges incorporating annotations on entire WSIs, thus also providing labels for a much more diverse set of tissue characteristics.

Figure 1 :
Figure 1: Random selection of crops of size 128 × 128 px, centered around annotated MFs from the six domains of the training set. Caption indicates the originating lab (UMCU = UMC Utrecht, VMU = University of Veterinary Medicine Vienna, FUB = FU Berlin) and the scanners (S360 = Hamamatsu S360, XR = Hamamatsu XR, CS2 = Aperio ScanScope CS2, 3DH = 3DHistech Pannoramic Scan II). Domain F was not labeled; hence the crops were selected at random.

Figure 2 :
Figure 2: Overview of the domains of the test set. Random cropouts sized 256 × 256 px from four randomly selected images of each domain are shown. Caption indicates the origin of the tissue (UMCU = UMC Utrecht, UKER = University Hospital Erlangen, UKER NP = Institute of Neuropathology at University Hospital Erlangen, FUB = FU Berlin, VMU = University of Veterinary Medicine Vienna) and the scanner (S360 = Hamamatsu S360, S60 = Hamamatsu S60, 3DH = 3DHistech Pannoramic Scan II). The tumor types are categorized by tissue morphology into aggregated cell patterns, round cell morphology, and spindle cell morphology.

Figure 3 :
Figure 3: Correspondence between hematoxylin and eosin (H&E)-stained tissue (top) and immunohistochemistry stain against phospho-histone H3 (PHH3, bottom). The left panel shows two tumor cells (green circles) with clear immunopositivity against PHH3, conclusive for MFs and supporting the H&E morphology. The right panel shows a mitotic figure in telophase where the PHH3 stain is less conclusive, but the morphology in the H&E is characteristic.

Figure 4 :
Figure 4: Histogram of MFs and NMFs in the training set of MIDOG 2022.

Figure 5 :
Figure 5: Box-whisker plot of the distribution of the MC across the domains of the preliminary test set and the final challenge test set. Boxes indicate lower and upper quartile values; colored lines indicate median values.

Figure 6 :
Figure 6: Distribution of the F 1 score, precision, recall, and AP metric as a result of bootstrapping. Only submissions that provided meaningful scores per detection are shown in the AP metric diagram.

Figure 7 :
Figure 7: Precision-recall values and curves (for all participants where the model score per MF was provided and consistent). The markers indicate the operating points calculated from the thresholded detections of the participants. Minor mismatches may be explained by post-processing after thresholding.

Figure 11 :
Figure 11: 80% confidence regions for precision and recall for each team and tumor type, plotted for each individual team. Confidence intervals were established using a Gaussian kernel density estimator applied to empirically bootstrapped datasets.

Figure 12 :
Figure 12: Results of the Tukey HSD test, assessing results of all approaches for statistical significance. Table shows p values.

Table 1 :
Overview of the submitted methods by all participating teams.TTA indicates test-time augmentation.

Table 2 :
F 1 values across all tumor domains for all participants. Values in brackets indicate the 95% confidence interval as a result of bootstrapping. The top group are the baselines, the middle group are the submissions in track 1, and the bottom group are the submissions in track 2 of the challenge.

Table 3 :
Results from the linear mixed effects model assessing the influence of tumor domain on F 1 -Score with 'team' as a random effect for the intercept and canine hemangiosarcoma acting as baseline category.
Table shows p values.