MarrowQuant 2.0: A Digital Pathology Workflow Assisting Bone Marrow Evaluation in Experimental and Clinical Hematology

Bone marrow (BM) cellularity assessment is a crucial step in the evaluation of BM trephine biopsies for hematologic and nonhematologic disorders. Clinical assessment is based on a semiquantitative visual estimation of the hematopoietic and adipocytic components by hematopathologists, which does not provide quantitative information on other stromal compartments. In this study, we developed and validated MarrowQuant 2.0, an efficient, user-friendly digital hematopathology workflow integrated within QuPath software, which serves as a BM quantifier for 5 mutually exclusive compartments (bone, hematopoietic, adipocytic, and interstitial/microvasculature areas and other) and derives the cellularity of human BM trephine biopsies. Instance segmentation of individual adipocytes is realized through the adaptation of the machine-learning-based algorithm StarDist. We calculated BM compartments and adipocyte size distributions of hematoxylin and eosin images obtained from 250 bone specimens, from control subjects and patients with acute myeloid leukemia or myelodysplastic syndrome, at diagnosis and follow-up, and measured the agreement of cellularity estimates by MarrowQuant 2.0 against visual scores from 4 hematopathologists. The algorithm was capable of robust BM compartment segmentation with an average mask accuracy of 86%, maximal for bone (99%), hematopoietic (92%), and adipocyte


Introduction
The use of quantitative digital pathology is growing rapidly because it enables an objective, robust, and automatic assessment of stained slides to inform pathologists on diagnostic and prognostic parameters. [1][2][3][4][5][6] However, the implementation of digital pathology comprises many challenges and gaps for translation toward a clinical context, including difficulty integrating clinical workflows within a user-friendly environment. Such integration requires a joint effort from different stakeholders to assimilate the current landscape of multiple separate applications into an integrated image analysis platform and unified workflow compatible with the diagnostic environment. 1,[3][4][5][6][7] In hematopathology, recent advances in digital and quantitative pathology have renewed interest in developing faster and more quantitative assessments of bone marrow (BM) trephine biopsies, which are a key component in the diagnosis and follow-up of hematologic and nonhematologic disorders. [8][9][10][11] BM biopsy assessment informs on tissue architecture, such as cellularity, necrosis, inflammation, cell lineages, metastatic spread, and stromal modifications, 12 with cellularity being a key factor reflecting hematopoietic function. For instance, low cellularity highlights a central defect in blood production (toxic, constitutional, or idiopathic), whereas high cellularity can suggest a neoplastic transformation. 8,9,[13][14][15][16][17][18] BM cellularity can be estimated quantitatively (histomorphometry: point-counting) or semiquantitatively (visual estimation). 8,10 It is defined as the relative percentage of the area of the hematopoietic tissue within a BM biopsy specimen and assessed on hematoxylin and eosin (H&E)-stained slides. 10,13,[19][20][21] It can be either expressed as a percentage of the total marrow space delimited by the endosteal surface 22,23 or as a percentage of the area occupied by the sum of the hematopoietic and adipocytic compartments, 8,10,19,24,25 which assumes reciprocity of the hematopoietic and adipocytic compartments within the BM space. Hence, when nonadipocytic elements of the stroma constitute a significant compartment within the BM, these 2 definitions might conflict.
Indeed, blood vessels, nerve fibers, and nonadipocytic stromal elements are, in most instances, in minority. Not being the focus of the cellularity assessment, they tend to be neglected when assessing H&E stains. Because these stromal elements are affected in many disorders and different conditions (eg, osteosclerosis, expanded stromal scenarios: fibrosis, edema, or gelatinous transformation), 8,13,17,18 quantitative efforts that focus exclusively on the adipocytic component of the stroma may limit our understanding of the prognostic parameters associated with stromal remodeling in BM histopathology. For instance, although the assessment of fibrosis is highly standardized, 26 very few studies have correlated the architectural characteristics of the BM stroma with clinical outcomes in leukemia. [27][28][29][30] To enable a more thorough assessment of the BM stroma, it is important to remedy the lack of digital pathology tools that enable the simultaneous tracking of the hematopoietic and stromal BM compartments.
In particular, the historical definition of quantitative BM cellularity assessment comes from the point-counting method, which uses an eyepiece with a graticule. This method is labor intensive, requires large biopsies, and is incompatible with clinical routine. It was first adapted from the method of quantitative morphologic tissue analysis of Chalkley 31 by Hartsock et al. 24 Three different categories were identified when counting with this method: hematopoietic tissue, fat, and other structures. The % cellularity is calculated by dividing the total number of hits on the hematopoietic tissue by the sum of the counted hits on the fat and hematopoietic tissues, expressed as a percentage. 24 The equivalent visual estimate of cellularity used in routine clinical assessment is semiquantitative, 32 and for highly trained individuals, it correlates with the point-counting method while being faster and simpler. 19,24,25,33 However, it is still time consuming and may underestimate cellularity. 13 Several quantitative digital pathology and deep learning methods have been recently used to quantify BM cellularity and agreed with the gold standard visual semiquantitative estimations. [34][35][36][37] Nevertheless, they were tested only on selected retrospective training and validation data sets focused on the hematopoietic compartment. To further assess the overall heterogeneity of BM compartments, we recently introduced the semiautomated MarrowQuant workflow and extensively validated it on mouse samples. 38 In this study, we present MarrowQuant 2.0, an adapted version of MarrowQuant for human BM assessment, implemented as a QuPath script, thus allowing for a user-friendly application. 39 We have tested MarrowQuant 2.0 on H&E-stained images from 250 pathologic and nonpathologic BM samples.
Our workflow predicts and quantifies 4 major compartments: bone, hematopoietic cells, adipocytes, and the interstitial/microvasculature area (IMV), and groups unassigned pixels into a fifth "other" compartment that includes the expanded stromal compartment (eg, stromal edema and fibrosis). We validated BM cellularity scores in retrospective longitudinal BM trephine biopsies from patients with acute myeloid leukemia (AML) or high-risk myelodysplastic syndrome (MDS) and control orthopedic surgical bone specimens, as well as in prospectively collected samples from clinical routine diagnostics. In particular, we first validated the predictions of the workflow on a training data set, then tested its compatibility with clinical routine samples (test data set), and finally applied the workflow to the most extreme context of BM remodeling (experimental/validation data set), where we could test the assumption of reciprocity implicit in the definition of BM cellularity. Overall, we demonstrate that MarrowQuant 2.0 constitutes a robust, objective, and accurate assessment of BM tissue in a user-friendly platform, open source, and easy to access. Its use may contribute to stromal biomarker discovery, to open up the assessment of specific hematopathologic parameters in research laboratories (with limited access to expert clinicians) and to homogenize BM cellularity assessments in multisite clinical trials. Potential incorporation in digitalized clinical diagnostic pipelines will face the complex regulatory challenges of pathologist plus machine workstreams. 40

Clinical Samples
This study complied with the Declaration of Helsinki and the local ethical authorities (CER-VD). Three independent sample data sets (H&E-stained images from 250 BM specimens) were collected as detailed further, to form the training, test, and experimental sets from patients treated at the Lausanne University Hospital (CHUV). Figure 1 describes the study design and sample allocation.

Training Set
A retrospective set of 36 anonymous BM trephine biopsies was selected from the Institute of Pathology biobank at CHUV. These biopsies were collected for clinical purposes from patients undergoing treatment for AML or MDS at diagnosis and different times after induction chemotherapy and selected to reflect a range of cellularity or BM remodeling. The samples were processed for standard pathology diagnosis and analyzed blindly, as previously described. 38

Test Set
The test set consisted of 42 prospectively collected trephine biopsies digitalized from all hematopathologic cases received in 2021 over a period of 2 weeks at the Institute of Pathology at CHUV. Diagnostic category, sex, and age are tabulated in Supplementary Table S1.

Experimental/Validation Set
The validation set consisted of 2 cohorts of CHUV specimens: longitudinal follow-up of AML or high-risk MDS trephine biopsies collected for diagnostic purposes and BM specimens collected from the spongy bone in the femoral head or neck of age-matched patients who underwent elective surgical hip replacement. In total, H&E-stained slides from 125 BM specimens were digitalized from 28 patients with AML or high-risk MDS (mean age ± SD = 55 ± 11 years at diagnosis, sex-balanced) at 4 different time points: diagnosis, the peak of aplasia (days 17-21 postinduction chemotherapy with 7 + 3 cytarabine and ida/daunorubicin as in the HOVON/SAKK 132 control arm 41 , or C2 with FLAG ± Ida, or C2 by 5-azacitidine for 2 frail patients), hematopoietic recovery after the first cycle of induction chemotherapy (RC1), and hematopoietic recovery after the second cycle of induction chemotherapy (RC2) (Fig. 1). For age-matched control cases, H&E-stained BM samples were obtained from 15 patients who underwent an elective hip replacement surgery (n = 15, mean age ± SD = 57 ± 13 years, sex-balanced). All patients signed a specific consent for the reuse of biological samples and clinical data in the context of our study. Trephine biopsies were collected, routinely processed, and H&E-stained between 2017 and 2021 but scanned synchronously in 2020 and 2022, reflecting a gradient in H&E contrast associated with the sample age.

MarrowQuant 2.0 Workflow
The workflow is summarized in Figure 2 and Supplementary Figure S1 and detailed further. MarrowQuant 2.0 was implemented as a script for the freely available and open-source QuPath software. 39 The code is accessible on GitHub (see Code and Data Availability section).

Image Acquisition and Preprocessing
For the training, test, and experimental set specimens, the H&E-stained slides were scanned at the Institute of Pathology at CHUV using a NanoZoomer S60 with a 40× objective to generate .ndpi files, with the exception of the H&E-stained slides from the control specimens, which were scanned with an Olympus VS120 slide scanner using a UPLSAPO 20×/0.75 objective to generate .vsi files. Both file formats were loaded into a project in QuPath 0.3.2 using the BioFormats extension for annotation and MarrowQuant 2.0 for quantification. Once the images were inserted within a QuPath project, the user performed a white color balance selection and 3 annotations (Supplementary Video S1). In brief, the user first annotates the tissue boundaries, which correspond to the regions of interest (ROIs) to be quantified by MarrowQuant 2.0. Then, the user selects both a background reference (white space) and the artifact regions to be excluded from the quantification (ie, imprints of detached bone pieces, retraction artifacts, traces of large blood vessels, or highly hemorrhagic regions). This step took, on average, 2-3 minutes per trained user.
The annotations of the training and experimental sets were performed independently by one expert in hematopathology (L.D.L.) and by 2 individuals without any background in BM histopathology (R.S., S.B.). A first reminder was set to pop up as a warning window: "If highly heterogenous marrow, seek expert opinion to select ROI." The annotations of the test set were performed by a senior hematopathology resident (C.R.C.) who also participated in the cellularity assessment of the same samples during the diagnostic pipeline at the Institute of Pathology at CHUV.

MarrowQuant 2.0 for Human Tissue
The MarrowQuant workflow, implemented within QuPath software, was first developed on murine bone H&E-stained images and adapted in this study to human BM quantification. In brief, MarrowQuant 2.0 works by segmenting regions based on color and texture compared with those of background. 38 It predicts 5 mutually exclusive compartments as output areas for each H&E-stained image in the following order: bone (mint blue mask), hematopoietic cells (dark purple mask), interstitial and microvasculature (pink mask), the latter detecting red blood cells and other small eosinophilic structures such as microvasculature, and adipocyte ghosts (yellow mask) (Fig. 2 and Supplementary Fig. S1). Marrow areas not recognized as any of the abovementioned compartments are categorized as the "other" mask. If the percentage of the other compartment is >25%, an error message is displayed to alert the user to manually check the image or seek an expert opinion. This threshold was set based on extensive quantifications of cases with stromal expansion, a component classified as other (Supplementary Fig. S2E and Fig. 3G-H). To calculate the relative contribution of each of the marrow compartments, the "total marrow area (Ma.Ar)," which includes 4 marrow compartments and excludes the bone mask and artifacts, was used as the denominator (Ma.Ar = hematopoietic area + adipocytic area + IMV area + other area). For the percentage of bone compartment, the denominator is defined as the full tissue boundary selected by the user after artifact exclusion.
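The denominators described above can be illustrated with a minimal Python sketch. It is not the MarrowQuant 2.0 script itself: the class and function names are hypothetical, and the tissue area after artifact exclusion is assumed here to equal the bone area plus Ma.Ar.

```python
from dataclasses import dataclass

# Threshold (percent of Ma.Ar) above which the text says a warning is shown.
OTHER_WARNING_THRESHOLD = 25.0


@dataclass
class CompartmentAreas:
    """Mask areas for one annotated image, in any consistent unit."""
    bone: float
    hematopoietic: float
    adipocytic: float
    imv: float
    other: float


def marrow_area(a: CompartmentAreas) -> float:
    # Ma.Ar = hematopoietic + adipocytic + IMV + other (bone excluded).
    return a.hematopoietic + a.adipocytic + a.imv + a.other


def compartment_percentages(a: CompartmentAreas) -> dict:
    ma_ar = marrow_area(a)
    if ma_ar <= 0:
        raise ValueError("marrow area must be positive")
    # Assumption: tissue area after artifact exclusion = bone + Ma.Ar.
    tissue_area = ma_ar + a.bone
    return {
        "hematopoietic": 100.0 * a.hematopoietic / ma_ar,
        "adipocytic": 100.0 * a.adipocytic / ma_ar,
        "imv": 100.0 * a.imv / ma_ar,
        "other": 100.0 * a.other / ma_ar,
        "bone": 100.0 * a.bone / tissue_area,
    }


def needs_manual_check(a: CompartmentAreas) -> bool:
    # Mirrors the >25% "other" alert described in the text.
    return compartment_percentages(a)["other"] > OTHER_WARNING_THRESHOLD
```

The 4 marrow percentages sum to 100% by construction, whereas the bone percentage uses a different denominator and is therefore reported separately.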

Bone Marrow Compartment Detection for Cellularity Calculation
For cellularity measurements, 2 separate calculations were performed and embedded within the output of MarrowQuant 2.0 to consider the 2 alternative denominators described in the literature: either the sum of the areas of the hematopoietic and adipocytic compartments used as denominator (Eq. 1: cellularity assessment score or simply cellularity) or the Ma.Ar used as the denominator (Eq. 2: hematopoietic area ratio or simply %hematopoietic area). Equation 1 reflects the clinical, semiquantitative estimation that constitutes the most prevalent working definition in hematopathology, 17,24,25,42 which assumes reciprocity of the hematopoietic and adipocytic compartments. Equation 2 reflects the bone morphometry definition of %hematopoietic marrow 43 integrating Ma.Ar as the spaces of the skeleton defined by endosteal surfaces. 22

MarrowQuant Validation: Pathologist Cellularity Assessment
For the training and experimental sets, 4 independent hematopathologists (reviewers 1-4) performed a retrospective visual cellularity assessment based on the digitalized H&E-stained images as follows. Pathologists were blinded to the clinical data associated with the patient samples. Electronic booklets were generated for the pathologists to quickly score each BM specimen, which contained an overview of the whole biopsy (4×) along with two 20× images (0.22 μm/pixel) for each trephine biopsy (n =  Fig. 2). For intraobserver variability, the same booklet, but with the image order shuffled and relabeled, was sent to the same pathologists to be scored after a washout period of 8 weeks. 45 For the test set, the senior hematopathology resident (C.R.C.) first visually performed cellularity assessment in the digitalized image. Then, she annotated the image on QuPath and autonomously ran MarrowQuant 2.0 for quantification. For both validation and test sets, the percent cellularity estimation was extracted from the pathology report (consensus of 2-3 pathologists) for comparison.

Figure 2. MarrowQuant 2.0 workflow steps for the example hematoxylin and eosin image shown, from the H&E BM image (Step 1) through bone mask subtraction (Step 6), output based on marrow area (Ma.Ar, 100%; Step 7), hematopoietic mask (Step 8), IMV mask (Step 9), and adipocytes mask (Step 10), with an optional run of "StarDist on Adipocytes." Color code: "bone" mask in mint blue, "hematopoietic cell" mask in dark purple, "IMV" mask in pink, "adipocytes" mask in yellow, and "other" mask (overlaid in green for illustration purposes). If desired, batch run "StarDist for adipocytes" to obtain the adipocyte size distribution and color by size segmentation. B.Ar, bone area; BM, bone marrow; BG, background; Hm.Ar, hematopoietic area; IMV, interstitial and microvasculature; Ma.Ar, marrow area; T.Ar, tissue area; TB, tissue boundary; Tt.Ad.Ar, total adipocyte area.
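Written out from the verbal definitions in the Bone Marrow Compartment Detection for Cellularity Calculation section, the 2 calculations can be expressed as follows. This is a reconstruction from the text rather than a reproduction of the typeset equations. Hm.Ar, Tt.Ad.Ar, and Ma.Ar are the abbreviations used with the workflow figure, and IMV.Ar and Ot.Ar are shorthand introduced here for the IMV and "other" areas.

```latex
\begin{align}
\text{Cellularity (\%)} &= \frac{\text{Hm.Ar}}{\text{Hm.Ar} + \text{Tt.Ad.Ar}} \times 100 \tag{Eq. 1} \\[4pt]
\text{\%Hematopoietic area} &= \frac{\text{Hm.Ar}}{\text{Ma.Ar}} \times 100 \tag{Eq. 2} \\[4pt]
\text{Ma.Ar} &= \text{Hm.Ar} + \text{Tt.Ad.Ar} + \text{IMV.Ar} + \text{Ot.Ar} \notag
\end{align}
```

The 2 values coincide only when the IMV and "other" areas are negligible. As those areas grow, Eq. 2 falls below Eq. 1 for the same hematopoietic area.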

From Adipocyte Area Mask to Individual Adipocyte Detection: StarDist for Adipocytes
Segmentation of individual adipocyte ghosts with the MarrowQuant 2.0 workflow alone was suboptimal to derive a precise human adipocyte size distribution. For precise individual adipocyte detection within QuPath, we added an additional step to the workflow after training StarDist to recognize adipocytes. For this purpose, 12 images covering a wide range of cellularity (<5%-95%) were selected from the training and experimental sets. StarDist training was performed on the manually annotated RGB images after extraction from QuPath. Two additional images were used for validation. Default StarDist parameters 46,47 were used to create the model (32 rays, grid factor of 2 × 2, learning rate = 0.0005). Images were augmented by applying flips and rotations, adding Gaussian noise and independent random intensity changes to the R, G, and B channels. The training was performed for 400 epochs on 256 × 256-pixel patches with a batch size of 4 (100 steps per epoch). The Jupyter notebook is available on GitHub (see Code and Data Availability section). After threshold optimization built in StarDist, the model was evaluated using the validation images. Then, StarDist for adipocytes was integrated within QuPath as described in Supplementary Fig. S5. An additional code was created to classify the size of each adipocyte and report size distribution for each image: very small (300-499 μm²), small (500-899 μm²), medium (900-1999 μm²), large (2000-3499 μm²), and very large (>3500 μm²). The performance of StarDist for adipocytes was assessed at different levels during the training process (Fig. 4G-H and Supplementary Fig. S4).
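The size classification step can be sketched in Python as follows. This is an illustration, not the code distributed with MarrowQuant 2.0. Areas are assumed to be in square micrometers, the bin edges are treated as half-open intervals (so 500 falls in "small" and 3500 in "very large"), objects below 300 are left unclassified, and the distribution is assumed to be a percentage of classified adipocytes by count. Function names are hypothetical.

```python
from collections import Counter

# (label, lower bound inclusive, upper bound exclusive), areas in um^2.
# Boundary handling is an assumption; the text lists 300-499, 500-899,
# 900-1999, 2000-3499, and >3500.
SIZE_BINS = [
    ("very small", 300.0, 500.0),
    ("small", 500.0, 900.0),
    ("medium", 900.0, 2000.0),
    ("large", 2000.0, 3500.0),
    ("very large", 3500.0, float("inf")),
]


def classify_adipocyte(area_um2):
    """Return the size label for one adipocyte, or None if below 300 um^2."""
    for label, lower, upper in SIZE_BINS:
        if lower <= area_um2 < upper:
            return label
    return None


def size_distribution(areas_um2):
    """Percentage of classified adipocytes falling in each size bin."""
    labels = [classify_adipocyte(a) for a in areas_um2]
    counts = Counter(label for label in labels if label is not None)
    total = sum(counts.values())
    return {
        label: (100.0 * counts[label] / total if total else 0.0)
        for label, _, _ in SIZE_BINS
    }
```

For example, `size_distribution([350, 600, 1000, 1500])` assigns one object each to the very small and small bins and two to the medium bin.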

Statistical Analysis
All values quantified by MarrowQuant 2.0 are displayed as mean ± SD. GraphPad Prism (version 9.3.0; GraphPad, La Jolla) was used for Student t test or 2-way ANOVA, linear regression (Pearson correlation), and multiple comparison tests. To build multiple receiver operating characteristic (ROC) curves, we used both the multiclass macroaveraged area under the curve (AUC) function and the AUC metrics computed using the Hand-Till method 48 (Table 2 and Supplementary Fig. S6). The intraobserver variability was calculated with the intraclass correlation coefficient (ICC) using an F test. In addition, the interobserver variability was calculated with the ICC using a 2-way random effect model with an absolute agreement. As recommended, ICC values <0.5 were categorized as poor, 0.5-0.75 as moderate, 0.75-0.90 as good, and >0.9 as excellent. 49 To assess the cellularity scoring of MarrowQuant 2.0, specificity, sensitivity, and AUC tests were performed compared with the clinical references, defined as the score extracted from the pathology report, where sensitivity measures the true positive rate and specificity measures the true negative rate. Cellularity scores were categorized into 3 ranges: low cellularity (scores <25%), as defined for severe aplastic anemia and other BM insufficiency syndromes 16,50,51 ; medium cellularity (25%-50%); and high cellularity (>50%), as applied in several studies for age-adjusted cellularity assessment or myeloid malignancy evaluation. 24,28,32 Our cellularity measurements (whether from the pathologist scoring or MarrowQuant 2.0 output) are provided as absolute values with no age adjustment.
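The categorization rules and the per-range sensitivity and specificity can be sketched as below. This is a hedged illustration in plain Python, not the GraphPad Prism analysis used in the study. The function names are hypothetical, and the handling of values that fall exactly on a boundary (for instance an ICC of exactly 0.75) is an assumption, because the stated ranges share their endpoints.

```python
def icc_category(icc):
    """Qualitative ICC label using the thresholds quoted in the text."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "moderate"
    if icc <= 0.9:
        return "good"
    return "excellent"


def cellularity_category(score_percent):
    """Low (<25%), medium (25%-50%), or high (>50%) cellularity."""
    if score_percent < 25:
        return "low"
    if score_percent <= 50:
        return "medium"
    return "high"


def sensitivity_specificity(reference_scores, predicted_scores, positive):
    """One-vs-rest sensitivity and specificity for one cellularity range.

    reference_scores: clinical reference scores (percent).
    predicted_scores: algorithm scores (percent), same order.
    positive: the range treated as the positive class ("low", "medium", "high").
    Returns (sensitivity, specificity); an entry is None when undefined.
    """
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference_scores, predicted_scores):
        ref_pos = cellularity_category(ref) == positive
        pred_pos = cellularity_category(pred) == positive
        if ref_pos and pred_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity
```

Treating each range in turn as the positive class yields one sensitivity and one specificity per range, which is the one-vs-rest layout assumed in this sketch.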

Adaptation and Validation of MarrowQuant 2.0 for Human Bone Marrow Specimens (Training Set)
Adapting our earlier work on MarrowQuant 38 for the analysis of human BM specimens (Fig. 2) required most preset parameters and classification thresholds to be modified and integrated within the code to consider the larger human hematopoietic and adipocyte cell size, larger average cell separation, and higher content of interspersed red blood cells. A training set of H&E-stained images spreading across the full range of BM cellularity was identified (n = 36), and an error quantification analysis for BM compartment prediction was performed as detailed in the Error Quantification and Confusion Matrix Sections of the Supplementary Materials and Supplementary Figures S2 and S3. In brief, the minimum adipocyte size was validated through perilipin and CD34 immunohistochemistry in contiguous sections to differentiate small adipocyte ghosts from small vascular structures (Supplementary Fig. S2A-D). Then, thresholds were adapted to minimize the detection of false adipocytes, adipocyte nuclei, or megakaryocyte misclassification. This includes images from extreme BM remodeling, such that manual correction did not provide significant improvement compared with that of MarrowQuant 2.0 alone (Supplementary Fig. S3A-C).
For validation, cellularity assessment (Eq. 1) constitutes the only parameter with a clinical reference to which we can compare the performance of our algorithm. In clinical routine, cellularity is visually estimated in a semiquantitative fashion by an expert hematopathologist (reviewed in a previous study 8 ). Cellularity assessment by MarrowQuant 2.0 (Eq. 1) strongly correlated with the mean cellularity assessment of the 4 independent pathologists for the training set (Supplementary Fig. S4A) (n = 36, R² = 0.94). In particular, MarrowQuant 2.0's cellularity values fell within the interobserver variability range, with maximal correlation at medium and high cellularity values. More specifically, the ICC among the 4 reviewer pathologists was 0.96 (95% CI: 0.94-0.98) and remained unchanged at 0.96 (95% CI: 0.92-0.98) when MarrowQuant 2.0's estimation was incorporated as a fifth reviewer. The mean difference between the scores from MarrowQuant 2.0 and the average of all reviewers was −0.55% ± 7.6%. Thus, MarrowQuant 2.0's cellularity assessment performed equivalently to an additional pathologist, indicating noninferiority compared with the clinical reference for the training set.
To compare the overall classification of MarrowQuant 2.0 with that of manual mask classification as measured by the percent true positive pixel classification (visual estimation by the user), we generated a confusion matrix. The average true mask classification rate was 0.86 (Supplementary Fig. S4B). The classification rate of the bone, hematopoietic, and adipocytic masks was excellent (0.99, 0.92, and 0.98, respectively). As for the IMV and "other" masks, the classification rates were 0.69 and 0.72, respectively. When these 2 masks were misclassified, they were confused for one another but not classified as any of the remaining 3 masks. The misclassification seemed to arise from expanded stromal or highly hemorrhagic regions within the BM and, thus, did not affect overall cellularity assessment (Eq. 1). In all, for the training set, MarrowQuant 2.0's precision was validated both through the average mask classification rate across a wide cellularity range and through cellularity assessment compared with that of the clinical reference.
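As a quick arithmetic check, the reported average of 0.86 is consistent with an unweighted mean of the 5 per-mask rates quoted above. The sketch below also shows how such per-mask rates could be read from a confusion matrix. The row-normalized convention (rows are the manual classification, columns the predicted mask) and the function names are assumptions for illustration.

```python
def per_class_rates(confusion):
    """True classification rate per class from a square confusion matrix.

    confusion[i][j] = number of pixels manually assigned to class i
    that were predicted as class j (assumed convention).
    """
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        rates.append(row[i] / total if total else float("nan"))
    return rates


def macro_average(rates):
    """Unweighted mean of per-class rates."""
    return sum(rates) / len(rates)


# Per-mask rates as reported in the text.
reported = {
    "bone": 0.99,
    "hematopoietic": 0.92,
    "adipocytic": 0.98,
    "IMV": 0.69,
    "other": 0.72,
}

print(round(macro_average(list(reported.values())), 2))  # prints 0.86
```

Because the mean is unweighted, the 2 weaker masks (IMV and "other") pull the average down even though they usually cover a small share of the marrow area.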

Testing Utility for Routine Clinical Diagnostic Samples and Identifying the Main Source of Outliers (Test Set)
Next, we tested whether MarrowQuant 2.0 could be compatible with the heterogeneity of samples encountered in clinical routine and user-friendly enough to be used in this setting. Over the course of 2 weeks, all hematopathologic cases that were received for routine diagnostic evaluation at the University Hospital were collected and the diagnostic pathology reports generated, whereas all H&E-stained BM slides were scanned as a batch to generate digital images. Once digitalized, one of the participating pathologists blindly re-scored all images using the full digital image (pathologist digital score), annotated, and ran MarrowQuant 2.0 for each image.
Initially, we found a strong correlation between the cellularity assessment given by the pathologist on the digital image and MarrowQuant 2.0 (R² = 0.89, n = 42, mean difference +0.85 ± 9.9) but a correlation of only 0.7 (n = 42, mean difference +1.13 ± 6.6) for MarrowQuant 2.0 when compared with the cellularity assessment from the pathology report (Fig. 3A). To understand the discrepancy between the pathology report evaluation and the digital image evaluation when compared with the MarrowQuant 2.0 output (Fig. 3A), we searched for outliers. These belonged to 2 distinct categories: 3 heterogeneous marrow tissues (Fig. 3B-E) and 1 case of myelofibrosis (Fig. 3F-G). The discrepancy in the heterogeneous marrow subgroup was due to the pathologists' selection of specific ROIs during diagnosis (Fig. 3E) as opposed to the full digital image annotation on MarrowQuant 2.0 scoring (Fig. 3B-D). In fact, during diagnosis, expert hematopathologists excluded purely adipocytic BM areas corresponding to subcortical marrow spaces, which are frequently hypocellular and may be excluded in the assessment of cellularity as per International Council for Standardization in Haematology guidelines 52 (Fig. 3E). The remaining outlier was a case of myelofibrosis, where MarrowQuant 2.0 overestimated the cellularity because of stromal expansion (MarrowQuant 2.0 score = 60.9%, pathology report = 25%, and pathologist digital image score = 40%) (Fig. 3F-G). Indeed, pathologists confirmed that the cellularity score for this outlier was estimated considering the fibrotic, expanded stromal compartment as part of the denominator, instead of computing exclusively the sum of the adipocytic and hematopoietic areas as denominator. This adapted score corresponds to the alternative cellularity definition (hematopoietic area ratio or %hematopoietic area; Eq. 2), also accepted in hematopathology, 8,10,19,24,25 which uses the Ma.Ar as a denominator.
MarrowQuant 2.0 quantification of this myelofibrosis case using %hematopoietic area (Eq. 2) was 30.1%, falling within the range of variability between the pathology report (25%) and the pathologists' estimation of the rescored digital image (40%) (Fig. 3G), instead of the 60.9% MarrowQuant 2.0 score predicted when using Eq. 1. Thus, we went back to the outlier cases and re-annotated the heterogeneous BM digital images by excluding empty marrow spaces and found coherent scores and a higher correlation of MarrowQuant 2.0 with the pathologist's cellularity assessment (R² = 0.86 for the pathology report and R² = 0.92 for the re-scored digital image, n = 42) (Fig. 3H).
To diversify our cohort and verify the validity of the hematopoietic mask detection workflow in the context of lymphoid aggregates, we selected a set of H&E-stained images from anonymous patients with lymphoid disorders, which are associated with increased cellularity. We then tested the performance of MarrowQuant 2.0 on such cases and found it to correctly assign lymphoid and lymphoblastoid cells to the hematopoietic mask (Supplementary Fig. S6A-I).
Overall, we could validate the good to excellent performance of MarrowQuant 2.0 compared with the clinical reference, within the user-friendly interface of QuPath, to assess BM cellularity in routine diagnostic samples regardless of the underlying condition assessed. The variety of cases emphasized the importance of expert ROI selection and the challenges associated with the choice of the denominator for BM cellularity assessment in cases of expanded stroma (Eq. 1 vs Eq. 2). MarrowQuant 2.0 systematic reporting of both equations is compatible with future prospective analysis to determine the predictive value of either equation in diagnostic hematopathology.

MarrowQuant 2.0 in an Extreme Bone Marrow Remodeling Context (Experimental Set)
The denominator-dependent discrepancy identified for the cellularity assessment of samples with stromal expansion prompted us to experimentally test the assumption of reciprocity for the hematopoietic and adipocytic marrow compartments, which is implied within the working definition of cellularity (Eq. 1). To investigate the performance of MarrowQuant 2.0 in a full range of cellularities and test the assumption of reciprocity, we focused on patients receiving myeloablative chemotherapy, one of the most extreme cases of BM remodeling and recovery.
Next, we measured the agreement of MarrowQuant 2.0 with the clinical reference of visual cellularity assessment. We first assessed the intraobserver variability among the 4 independent reviewers after scoring the same set of training and validation images at t = 0 and t = 8 weeks (washout period), and we observed an excellent agreement among the 4 independent pathologists, with the ICC ranging from 0.95 to 0.99 (Table 1). Then, we tested the agreement between MarrowQuant 2.0 cellularity assessment and all individual reviewer estimations and found an excellent agreement (ICC = 0.96; 95% CI: 0.93-0.97) (Table 1). Furthermore, compared with all reviewers' average score, the ICC was 0.978 (95% CI: 0.955-0.989). Thus, we conclude that MarrowQuant 2.0 used by an expert performed at least equivalently in cellularity measurements as the visual estimation by a pathologist and that variability between individual pathologist reviewers did not significantly affect our evaluation.
Furthermore, we tested whether MarrowQuant 2.0 performed equally across different cellularity ranges when annotations were performed by an expert (supervised). We performed specificity and sensitivity analyses using the cellularity assessment reported in the pathology report as the ground truth (Table 2). Specificity (true negative rate) ranged between 0.89 and 1.0 and was highest for the 2 extremes: low cellularity and high cellularity. Sensitivity (true positive rate) ranged between 0.81 and 0.99. The lowest sensitivity scores were for low cellularity specimens, which included the cases with stromal expansion (Supplementary Fig. S7). Overall, we could thus validate MarrowQuant 2.0's cellularity assessment as good to excellent when compared with the clinical reference in a scenario of extreme BM remodeling (experimental set) and identified low cellularity cases with stromal expansion as the most problematic for high sensitivity.

Quantification of Adipocyte Size Distribution During Bone Marrow Remodeling
In addition to quantifying the area of different BM compartments using MarrowQuant 2.0, we were interested in quantifying BM remodeling by tracking adipocyte size distribution. One limitation of MarrowQuant is that adipocyte segmentation is based in the Fiji/ImageJ analyze particle function, 54,55 which calculates the total adipocyte mask but cannot accurately detect individual adipocyte size. To accurately segment individual adipocytes, we developed an adaptation of the StarDist extension of QuPath (https://github.com/qupath/qupath-extension-stardist), which segments and measures nuclei size as oval objects fitted within polygons. We applied deep learning to train the model to recognize BM adipocytes instead of nuclei. Our measurements of accuracy and precision validated the use of StarDist for adipocyte size detection ( Fig. 4G and Supplementary Fig. S5C, D). Significant differences were detected for the very small (300-499 mm 2 ) and medium (900-1999 mm 2 ) adipocyte categories, but not for the small (500-899 mm 2 ), large (2000-3499 mm 2 ), or very large (>3500 mm 2 ) categories (Fig. 4H). In particular, we observed a significant increase in the very small adipocytes at diagnosis (48.5% ± 22.2%). At RC2, the adipocyte size distribution became comparable with that of the control group (RC2: 22.6% ± 9.9%; control: 19.5% ± 4.1%). For the medium adipocytes, a significant decrease was seen at diagnosis (15.2% ± 10.8%); then, the percentage was restored at RC2 when compared with the control group (RC2: 30.3% ± 6.6%; control: 31.8% ± 3%). In conclusion, we validated the image, a visual cellularity score of 30% was given for the full digital image and a 70% cellularity score was reported in the consensus pathology report. the use of StarDist for the measurement and allocation of BM adipocyte size distribution. 
We found both the total area and certain size ranges of BM adipocytes to be significantly and longitudinally remodeled in patients with AML/high-risk MDS after intensive chemotherapy.
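The allocation of individual adipocytes to size categories can be sketched as below. This is a hedged illustration, not the published script: it assumes a plain list of per-adipocyte areas in μm² (however they were exported from the segmentation), uses half-open bins so that no area falls between categories, and ignores objects below 300 μm². The names `SIZE_BINS`, `size_category`, and `size_distribution` are hypothetical.

```python
from collections import Counter

# Size categories in square micrometers, as (name, lower bound, upper bound).
# Half-open intervals [lower, upper) are an assumption of this sketch.
SIZE_BINS = [
    ("very small", 300, 500),
    ("small", 500, 900),
    ("medium", 900, 2000),
    ("large", 2000, 3500),
    ("very large", 3500, float("inf")),
]


def size_category(area_um2):
    """Return the size category for one adipocyte area, or None if below 300."""
    for name, lower, upper in SIZE_BINS:
        if lower <= area_um2 < upper:
            return name
    return None


def size_distribution(areas_um2):
    """Percentage of adipocytes (by count) in each size category."""
    counts = Counter(size_category(a) for a in areas_um2)
    counts.pop(None, None)  # discard objects below the smallest category
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name, _, _ in SIZE_BINS}
    return {name: 100.0 * counts[name] / total for name, _, _ in SIZE_BINS}


# Hypothetical areas, for illustration only (not study data).
areas = [350, 450, 600, 1000, 1500, 2500, 4000, 100]
distribution = size_distribution(areas)
```

The percentages are computed on counts rather than on summed area, matching the "%Adipocyte (count)" quantity reported for the size categories.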

Reciprocity: Assumption in the Working Definition of Cellularity
The lower sensitivity of MarrowQuant 2.0 for low cellularity samples prompted us to evaluate the limits of reciprocity between the hematopoietic and adipocytic BM compartments. We plotted the adiposity (Tt.Ad.Ar%) versus hematopoietic area (Hm.Ar%) for all images within the experimental set (Fig. 5A and Supplementary Fig. S8A). The hematopoietic and adipocytic areas presented a negative correlation for the 2 data sets (R² = 0.74, n = 79 for experimental set), thus validating a general trend for reciprocity. However, cases with stromal expansion, which could be identified by MarrowQuant 2.0 through a threshold of "other" compartment higher than 25% of the marrow area, clustered outside of the regression line (red dots) and corresponded univocally to the discordant outliers on the %cellularity assessment by the pathologist versus MarrowQuant 2.0 scoring. This is illustrated in Figure 5B-E by a case of stromal edema after intensive chemotherapy, which was scored by the expert pathologists as 5% cellularity (blue column) versus 34% cellularity by MarrowQuant 2.0 (light purple column) when using Eq. 1. The %hematopoietic area (Eq. 2) calculated by MarrowQuant 2.0 was 10% (dark purple column), which was closer to the pathologist's estimate. The overall concordance between %cellularity estimated by the pathologist (blue column) and %hematopoietic area estimated by MarrowQuant 2.0 (Eq. 2, dark purple column) is illustrated in Fig. 5D for the full experimental data set. All stromal expansion cases are outliers (red dots). The discordance between %cellularity assessed by the pathologist versus by MarrowQuant 2.0 (Eq. 1) was only statistically significant for the stromal expansion subset (Fig. 5E). In conclusion, we found %hematopoietic area (Eq. 2) to reflect the %cellularity estimated in the pathologist's clinical reference assessment more reliably than the working definition (Eq. 1) in low cellularity BM and stromal expansion cases.
For high cellularity, Eq. 1 best aligned with the pathologist's assessment of %cellularity because the maximum attainable values differ between Eq. 2 (whose denominator is Ma.Ar) and Eq. 1. Indeed, %hematopoietic area as measured by Eq. 2 was capped on average at approximately 80% because the IMV and other compartments occupied, on average, a minimum of 20% of the BM space. Therefore, we tested the use of Eq. 2 for low cellularity samples and Eq. 1 for medium and high cellularity samples in the experimental and test data sets (Fig. 5F-G). Once all cases identified by MarrowQuant 2.0 as <25% cellularity (Eq. 1) were corrected to use Eq. 2 for cellularity assessment, all stromal expansion outlier cases (red dots) correlated with the clinical reference for both the experimental (Fig. 5F) and test sets (Fig. 5G). In conclusion, and for best congruency with the clinical reference, we found the best results using Eq. 1 for cellularity assessment of medium to high cellularity samples and Eq. 2 for the assessment of low cellularity samples (Fig. 5H). Thus, the results of Eq. 1 and Eq. 2 are both reported in the MarrowQuant 2.0 output, together with a "recommended cellularity value," which uses Eq. 2 for both low cellularity (<25% hematopoietic area) and expanded stromal cases (>25% other mask), and Eq. 1 for all other cases.
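The selection rule for the "recommended cellularity value" can be sketched as follows. This is an illustrative reading of the text and not the MarrowQuant 2.0 source code: it takes Eq. 1 as Hm.Ar / (Hm.Ar + Ad.Ar) and Eq. 2 as Hm.Ar / Ma.Ar, and it assumes that Ma.Ar is the sum of the hematopoietic, adipocytic, IMV, and other areas (bone excluded) and that the low cellularity test is applied to the Eq. 2 value. All function and parameter names are hypothetical.

```python
def cellularity_eq1(hm_ar, ad_ar):
    """Eq. 1 (working definition): hematopoietic / (hematopoietic + adipocytic), in %."""
    return 100.0 * hm_ar / (hm_ar + ad_ar)


def hematopoietic_area_eq2(hm_ar, ma_ar):
    """Eq. 2: hematopoietic area as a percentage of the marrow area."""
    return 100.0 * hm_ar / ma_ar


def recommended_cellularity(hm_ar, ad_ar, imv_ar, other_ar, threshold_pct=25.0):
    """Pick Eq. 2 for low cellularity or expanded stroma, Eq. 1 otherwise.

    Areas can be in any unit as long as it is the same for all compartments.
    Assumption: marrow area = hematopoietic + adipocytic + IMV + other.
    """
    ma_ar = hm_ar + ad_ar + imv_ar + other_ar
    eq1 = cellularity_eq1(hm_ar, ad_ar)
    eq2 = hematopoietic_area_eq2(hm_ar, ma_ar)
    other_pct = 100.0 * other_ar / ma_ar
    use_eq2 = eq2 < threshold_pct or other_pct > threshold_pct
    return eq2 if use_eq2 else eq1


# Hypothetical compartment areas, for illustration only (not study data).
stromal_case = recommended_cellularity(hm_ar=10, ad_ar=20, imv_ar=10, other_ar=60)
cellular_case = recommended_cellularity(hm_ar=60, ad_ar=20, imv_ar=10, other_ar=10)
```

In the first toy case, the expanded "other" compartment inflates Eq. 1 (about 33%) relative to Eq. 2 (10%), so the Eq. 2 value is returned; in the second, neither condition is met and the Eq. 1 value (75%) is returned.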

Robustness of MarrowQuant 2.0
Finally, following the validation of MarrowQuant 2.0 on 2 retrospective data sets and 1 prospectively collected data set, we tested its repeatability when used by experts (Fig. 5I) and its reproducibility when used by nonexperts (Supplementary Fig. S9). Four experts were asked to score the test data set on a fixed ROI predefined by an expert user, which comprised either the totality or a portion of the trephine biopsy judged as contributive for the assessment of BM cellularity. Repeatedly running the MarrowQuant 2.0 workflow on this ROI generated the same cellularity score regardless of how many times we ran the script (ICC = 1.00). However, some variability was detected when comparing the scoring across the 4 experts on the predefined ROI (ICC = 0.90; 95% CI: 0.84-0.94) (Fig. 5I). MarrowQuant 2.0 cellularity scoring in the predefined ROI correlated with each of the 4 pathologists (R² = 0.81-0.92) but correlated best with the cellularity estimation derived from the average of the 4 pathologists (R² = 0.92; ICC = 0.90; 95% CI: 0.85-0.94) (inset in Fig. 5I). Overall, the repeatability of MarrowQuant 2.0 in a predefined ROI was superior to that of the individual pathologists.
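For readers unfamiliar with the intraclass correlation coefficient, the following sketch shows how such an agreement value can be computed from a specimens-by-raters table. The text does not state which ICC form was used; the two-way random effects, absolute agreement, single-rater form (often written ICC(2,1)) is assumed here purely for illustration, and the function name and scores are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table with one row per specimen and one column per rater.

    Assumed form: two-way random effects, absolute agreement, single rater.
    """
    n = len(scores)      # number of specimens
    k = len(scores[0])   # number of raters (or repeated runs)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for a two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-specimen mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical scores, for illustration only (not study data).
repeated_runs = [[10, 10, 10], [50, 50, 50], [90, 90, 90]]  # identical reruns
two_raters = [[10, 20], [50, 40], [90, 85]]                 # raters disagree

icc_runs = icc_2_1(repeated_runs)
icc_raters = icc_2_1(two_raters)
```

Identical repeated runs give an ICC of 1, consistent with a deterministic script, whereas any disagreement between raters pulls the value below 1.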
To further assess the robustness of the tool, we tested the reproducibility of the results when 2 nonexpert users annotated (ROI selection) and performed MarrowQuant 2.0 analysis for the experimental set when compared with the outcome on annotation by a hematopathology expert. Given the challenges identified for heterogeneous marrow scoring (Fig. 3), nonexpert users were prompted to exclude samples with heterogeneous marrow and seek expert opinion. We found a strong correlation between MarrowQuant 2.0's cellularity estimation by the 2 nonexpert users and the clinical reference, both when compared with the mean of the 4 independent reviewers and when compared with the score extracted from the pathology report (Supplementary Fig. S9A) (R² = 0.95). We compared the 2 methods of annotation (expert vs naïve user) and observed an excellent agreement (ICC = 0.98; 95% CI: 0.971-0.98) (Supplementary Table S2). However, we observed a higher correlation with both the pathology report and the 4 expert reviewers when MarrowQuant 2.0 was annotated by an expert compared with nonexpert users (R² = 0.96 for pathology report score, R² = 0.98 for average scoring of 4 reviewers) (Supplementary Fig. S9B and Supplementary Table S2). Correlation with the clinical reference was inferior when MarrowQuant 2.0 was annotated by nonexperts and when including the heterogeneous marrow samples (R² = 0.96) (Supplementary Fig. S9C). Overall, these results suggest that MarrowQuant 2.0 performs very well when annotated by nonexpert users, but it performs better when annotated by an expert (Supplementary Table S3).

Figure 4. … second induction cycle [RC2]). (B-F) MarrowQuant 2.0 output. (C) %cellularity and (D) %hematopoietic area are highest at diagnosis and lowest at the peak of aplasia. A recovery is observed after the first and second cycles of induction (each patient is assigned 1 color for tracking purposes) (n = 28 patients with AML; n = 15 control patients; *P < .05, **P < .01, and ****P < .0001 by multiple comparison test). (E) The %adiposity is lowest at diagnosis, reflecting the infiltration by leukemic cells. The highest adiposity is reached at the peak of aplasia (n = 28 patients with AML; n = 15 control patients; **P < .01, by multiple comparison test). (F) The %IMV increases significantly at the peak of aplasia (n = 28 patients with AML; n = 15 control patients; *P < .05, **P < .01, and ****P < .0001 by multiple comparison test). (G) Adipocyte size distribution detection with an AI-based model trained on adipocytes, called StarDist for adipocytes, on an H&E BM image at recovery after the first cycle of induction chemotherapy in an age + gender … (H) %Adipocyte (count) distribution by size category at 5 different time points as quantified by StarDist: very small: 300-499 μm²; small: 500-899 μm²; medium: 900-1999 μm²; large: 2000-3499 μm²; very large: above 3500 μm² (n = 28 patients with AML/high-risk MDS; n = 15 control patients; *P < .05 and **P < .01, by multiple comparison test).

Discussion
In this study, we present MarrowQuant 2.0, a new version of the MarrowQuant workflow 38 adapted and validated for BM compartment segmentation of H&E-stained human BM specimens. MarrowQuant 2.0 is implemented within QuPath 0.3.2, and together with a specifically trained StarDist model, it segments the BM images into 5 compartments (bone, hematopoietic, IMV, adipocytic, and other). Together, they quantify both the areas of each BM compartment and the individual BM adipocyte count and size distribution. This semiautomated workflow relies on the user's input to define the ROIs and artifacts. Automatic selection of ROI is not technically limiting for future versions of the workflow. However, it would require consensus in the hematopathology field for computer-logic compatible BM ROI selection/exclusion criteria in defined hematologic pathologies. We have seen that this is one of the most important sources of variability in MarrowQuant 2.0 outputs and have, thus, preferred to give the user control over ROI selection.
We relied on the %cellularity scoring to validate our tool because cellularity assessment is the only BM parameter measured by MarrowQuant that has an established clinical reference. 10,19,32,[56][57][58] The assessment of cellularity in the diagnostic pathology report or the average cellularity assessment of 4 pathologists (interobserver agreement: ICC = 0.96) was used as the ground truth. We found satisfactory performance when MarrowQuant 2.0 was used by a trained pathologist and once cases of stromal expansion had been excluded (R² = 0.98; ICC = 0.96-0.98; specificity, 0.89-1.0; sensitivity, 0.81-0.93; AUC = 0.98 for the experimental set, n = 172 BM cases). Cases of stromal expansion could be identified when setting an "other" mask threshold >25%. Cellularity assessment in these cases was satisfactory (fell within the interobserver variability) when the alternative %hematopoietic area (Eq. 2) was used to determine cellularity, which uses Ma.Ar as the denominator instead of the sum of the adipocytic and hematopoietic compartments. Overall, MarrowQuant 2.0 outliers are not due to compartment misclassification but to the expansion of the stromal compartment, which is not considered in Eq. 1. Limitations in color contrast inherent to H&E stains prevented us from defining a separate compartment for the expanded stroma, which was thus classified as "other." Nevertheless, MarrowQuant 2.0 displayed a satisfactory average mask classification rate of 0.86 for all BM compartments in the confusion matrix for the training set, with an excellent accuracy for the assignment of bone (99%), hematopoietic (92%), and adipocyte (98%) areas.
Our goal for MarrowQuant 2.0 was to make it accessible for research applications and to test its compatibility with a clinical setting. For the research scale, MarrowQuant 2.0 was able to compartmentalize the BM space and provide cellularity measurements independently of a pathologist or expert hematopathology supervision (unsupervised/nonexpert annotation performance: R² = 0.95; ICC = 0.93-0.95; specificity, 0.87-0.99; sensitivity, 0.77-0.93; AUC = 0.97 for the experimental set, n = 172 BM cases). However, the performance was even higher when used by an expert pathologist. In a clinical diagnostic setting, the MarrowQuant 2.0 workflow was compatible with integration as an additional quantitative assessment (augmented pathologist) (Fig. 6), thus abiding by the strategy proposed by the Swiss Digital Pathology Consortium. 59 Overall concordance in the clinical test set was satisfactory for the use of Eq. 1 when excluding cases with stromal expansion (R² = 0.86-0.92). Based on our data and to best approximate the visual cellularity assessment that constitutes current clinical routine, we recommend the use of Eq. 2 for low cellularity cases (<25%) and for cases of expanded stroma identified by an "other" area greater than 25%. Eq. 1 can be applied for all other cases. The outputs of both Eq. 1 and Eq. 2 are systematically reported by MarrowQuant 2.0, and our recommendation for the choice of equation as discussed above is indicated in the "recommended cellularity value" output column. Future prospective studies in diagnostic hematopathology should more broadly compare this strategy for the choice of equation and consider the appropriate validation of MarrowQuant 2.0 in a clinical setting, abiding by the challenges associated with the implementation of digital pathology tools, both from the regulatory perspective, 40 especially in an open-source environment, 60 and from the perspective of reproducibility on future software updates. 61

The diagnostic or prognostic value of BM cellularity, adiposity, and adipocyte size quantification at the histopathology level has been confirmed in recent studies for MDS, acute lymphoblastic leukemia, AML, and obesity. 15,28,29,62,63 In the case of extreme BM remodeling upon intensive chemotherapy for patients with AML/high-risk MDS, MarrowQuant 2.0 found a strong reciprocity (negative correlation) between the hematopoietic and adipocytic compartments (R² = 0.68, n = 42 in test set; R² = 0.74, n = 79 in experimental/validation set), matching previous quantitative validations in murine BM. 38,64 In addition, we adapted the deep learning-based StarDist extension as an additional workflow within QuPath 0.3.2, which can be run in parallel to MarrowQuant 2.0 to provide individual adipocyte segmentation and thus adipocyte size distribution measurements and size-based color coding. This analysis constitutes, to our knowledge, the first dedicated method to automatically segment adipocyte size distribution at diagnosis and follow-up of patients with AML/high-risk MDS when compared with those of age-matched control patients. We found very small-sized and medium-sized adipocytes to longitudinally remodel after myeloablative chemotherapy, whereas the proportion of other BM adipocyte subsets (large and very large) remained unchanged. Our observations highlight the plasticity of the BM and the specificity of distinct BM adipocyte subsets to undergo size remodeling. This observation fits with previous work by Lu et al, 63 who underlined that very small adipocytes at AML diagnosis correlate with poor prognosis, and with the current understanding that at least 2 populations of BM adipocytes coexist (regulated or labile and constitutive or stable subsets) with differential capacity for remodeling. 65 We conclude from this analysis that the incorporation of the quantitative stromal remodeling parameters provided by MarrowQuant 2.0 and StarDist in experimental hematopathology could be useful to investigate novel diagnostic or prognostic markers in clinical scenarios of intense BM remodeling, including myeloid malignancies, aplastic anemia, BM insufficiency syndromes, and hematopoietic progenitor transplantation or other advanced cellular therapies.

Figure 5. … cellularity generated by MarrowQuant 2.0 (Eq. 1) with the clinical reference %cellularity score estimation in the test data set after correction of outliers (red dots) associated with myelofibrosis using %hematopoietic area (Eq. 2) (H&E biopsies n = 42 [test set], R² [pathology report] = 0.90, R² [digital image] = 0.93). (H) Cellularity measurement as suggested for the digital pathology analysis, classified per cellularity category (low cellularity x < 25%, medium cellularity 25% ≤ x < 50%, and age-corrected high cellularity x ≥ 50%). Average patient age = 57 ± 9 years. Hm.Ar, hematopoietic area; Ma.Area, marrow area; Ad.Ar, adipocytic area. (I) Correlation of %cellularity generated by MarrowQuant 2.0 with the scoring performed on a preselected ROI by 4 expert reviewers on the validation data set (n = 125, R² = 0.92 [average pathologists] with ICC = 0.90).
Contrasting the performance of MarrowQuant 2.0 with the results of previous studies and existing algorithms or tools, we found that our results are either comparable with or an improvement on previous work (reviewed in Supplementary Table S4). Direct comparison was not possible because these algorithms, unlike MarrowQuant 2.0, are not open source. The closest approach was recently proposed by van Eekelen et al, 34 who developed a deep learning approach to automatically detect compartments within the BM and compared it with the visual estimation of 2 hematopathologists (ICC = 0.78, n = 109). The advantage of our study is that it integrates MarrowQuant 2.0 within the user-friendly platform of QuPath and that we validated our approach in the extreme cases of low and high cellularity (range from below 5% to 100% cellularity in our study compared with a 20% to 80% cellularity range in the study by van Eekelen et al 34). Similarly, Hagiya et al 37 used the HALO imaging algorithm to compare automatic cellularity measurements with visual estimation by 3 pathologists (ICC = 0.81, n = 165), and Kim et al 36 used nuclear counts to assess the cellularity against visual scoring (R² = 0.816, n = 325). Our study used not only the visual estimation of independent reviewers for validation but also the scoring extracted from the diagnostic pathology report. When compared with these studies, we observed a higher agreement between the pathology report cellularity scores and MarrowQuant 2.0's cellularity assessment in both nonexpert and expert annotations, suggesting that our approach may provide a more accurate evaluation for the diagnostic pipeline. Moreover, our algorithm performed similarly to the results reported for the fully automated algorithm developed by Nielsen et al 35 for cellularity assessment, which agreed well with hematopathologists (ICC = 0.80, n = 8). However, their training set was based on only 8 BM H&E-stained slides and segmented the BM tissue into either red (hematopoietic) or yellow (adipocytic) marrow to generate a heatmap with gradients of cellularity scorings. With much larger data sets and a greater diversity of classification categories, MarrowQuant 2.0 outperforms the previously cited tools in simultaneous segmentation of the full BM space into 5 mutually exclusive BM compartments and in overall accuracy of cellularity assessment. Moreover, Brück et al 30 used convolutional neural networks, QuPath, and a multiregression model to extract and quantify morphologic features from BM biopsies of patients with MDS and MDS/myeloproliferative neoplasms. Their main focus was the correlation with clinical data to build predictive models. Their work relied on the pixel classification power of QuPath to extract the hematopoietic compartment and adipocytes, with the bone plus stroma as a joint entity. Unlike Brück et al, 30 the approach used by MarrowQuant 2.0 segments bone and stroma as 2 separate compartments and identifies a separate IMV compartment, thus allowing for comparison of the 2 competing definitions of BM cellularity.
Our study faces some limitations. First, it is a semiautomatic workflow: the user must annotate the ROIs, artifacts, and background. This adds 2-3 minutes of hands-on annotation per image, in addition to the 1-2 minutes needed to have the full H&E-stained image quantified by MarrowQuant 2.0. However, we strategically kept the semiautomated feature because it gives the user control over the ROI selection, especially in cases where the BM tissue is heterogeneous or significantly modified by stromal expansion. The highly user-friendly and interactive interface offered by QuPath allows for rapid custom annotation of tissues and the exclusion of artifacts that should not be counted in the quantification. Second, it is important to note that we used a 25% low cellularity cutoff for sensitivity and specificity analyses. This value is routinely used as the diagnostic cutoff in the context of BM insufficiency and severe aplastic anemia. 15,16,50,51 However, others have used a 30%-40% cutoff in the context of myeloid malignancies. 34,43 Finally, MarrowQuant 2.0 classification, based on color and texture thresholding, has the disadvantage of misclassifying parts of the megakaryocyte cytoplasm as IMV or bone. Our training and experimental sets did not contain samples with a high number of megakaryocytes, leading to a nonsignificant misclassification (Supplementary Fig. S3). The algorithm should be used with caution in clinical scenarios where megakaryocyte hyperplasia is expected, including myeloproliferative syndromes. Future versions of the algorithm, which will incorporate deep learning and machine learning, 34,66 are under development to overcome this problem.
In conclusion, MarrowQuant 2.0 is, to our knowledge, the first robust workflow to simultaneously segment the full BM space in human H&E-stained images into the hematopoietic compartment and 4 stromal compartments (bone, IMV, adipocytic, and other areas). When coupled with StarDist on adipocytes, it automatically provides an adipocyte size classification and distribution for the whole BM biopsy in the user-friendly environment of QuPath. Other applications could focus on the quantification of bone to total marrow area in the context of osteopenia and osteosclerosis associated with hematologic disorders. We expect that our approach may be useful both to support quantitative BM cellularity and stromal compartment assessment in biomarker discovery for the diagnostic setting and to either provide a novel tool for research laboratories with limited access to expert pathologists or to homogenize BM cellularity assessments in clinical research. Potential integration in future digital clinical diagnosis pipelines should be determined by larger prospective studies and by the challenging regulatory landscape of digital histopathology in open-source digital environments.

… and CRSII5_186271. F.S. was funded by the Swiss National Science Foundation (SNF MD/PhD) grant 183986.

Declaration of Competing Interest
The authors state that there are no conflicts of interest to disclose.

Ethics Approval and Consent to Participate
The studies involving human participants were reviewed and approved by the Commission cantonale d'éthique de la recherche sur l'être humain du Canton de Vaud (CER-VD). All patients in the experimental data set signed a specific written informed consent form for participation. Images for the training and test sets were anonymous. A specific written informed consent was not required for this tool development study in accordance with the national legislation and the institutional requirements.