Metabolic tumour volume in Hodgkin lymphoma—A comparison between manual and AI‐based analysis

To compare total metabolic tumour volume (tMTV), calculated using two artificial intelligence (AI)‐based tools, with manual segmentation by specialists as the reference.


| INTRODUCTION
Total metabolic tumour volume (tMTV) is associated with progression-free and sometimes with overall survival in Hodgkin lymphoma (HL) patients staged with [18F]fluorodeoxyglucose (FDG) positron emission tomography/computed tomography (PET/CT) (Barrington & Meignan, 2019). The future hope is to use such quantitative predictors as a clinical tool for precision medicine in HL patients (Al-Ibraheem et al., 2023). However, the software integrated into current workstations is often semi-automatic, using thresholding based on absolute or relative standardized uptake values (SUVs). These tools have been shown to significantly under- or overestimate visible tumours, limiting the application of tMTV measurements in clinical practice and clinical trials (Barrington et al., 2021). Recently, artificial intelligence (AI)-based tools have been developed (Jiang et al., 2022; Kuker et al., 2022; Sibille et al., 2020) and tested (Weisman et al., 2020a,b). Weisman et al. (2020a,b) compared 11 different quantification methods, some based on thresholding and others on deep learning models, and concluded that multiple methods, including a three-dimensional (3D) convolutional neural network (CNN), clustering and an iterative threshold method, achieved both good lesion-level segmentation and patient-level quantification performance in a population of 90 lymphoma patients. The authors recommend these methods over thresholding methods such as 40% and 50% SUVmax, which were consistently found to be significantly outside the limits defined by interphysician agreement (Weisman et al., 2020a,b).
The aim of this study was to compare tMTV in HL patients undergoing staging FDG-PET/CT, calculated using the two AI-based methods PET assisted reporting system (PARS) (Siemens Medical Solutions Inc.) (Sibille et al., 2020) and RECOMIA (recomia.org), using manual measurements by specialists as the reference.

| Patients
All 49 patients with biopsy-proven HL who had undergone staging by [18F]FDG PET/CT between 2017 and 2018 at Sahlgrenska University Hospital were retrospectively included. The patients were newly diagnosed and untreated. One patient was excluded due to an error in recording the uptake time. The final group consisted of 48 patients with a median age of 35 years (range: 7-75), of whom 46% were female; this is the same group as used in a prior publication (Sadik et al., 2021). The study was approved by the Ethics Committee at Gothenburg University, and the need for written informed consent was waived (#2019-01274).
We certify that the study was performed in accordance with the ethical standards laid down in the 1964 Declaration of Helsinki and its later amendments.

| Image acquisitions
[18F]FDG PET/CT scans were obtained using an integrated PET/CT system (Siemens Biograph 64 Truepoint). The adult patients fasted for at least 6 h before injection and were injected with 4 MBq/kg [18F]FDG (maximum 400 MBq). The injected radioactivity for children was according to the EANM Dosage Card (Version 5.7.2016). The standard uptake time was 60 min. Images were acquired with 3 min per bed position from the base of the skull to the mid-thigh. PET images were reconstructed with a slice thickness of 5 mm and slice spacing of 3 mm with an iterative OSEM 3D algorithm (four iterations and eight subsets) and a matrix size of 168 × 168. CT-based attenuation and scatter corrections were applied. A low-dose CT scan (64-slice helical, 120 kV, 30 mAs, 512 × 512 matrix) was obtained covering the same field of view as the PET scan. The CT was reconstructed using a filtered back projection algorithm with a slice thickness and spacing matching those of the PET scan (Sadik et al., 2019, 2021).

| Image interpretation
The manual segmentations used to calculate the reference tMTV were performed by a team of eight nuclear medicine specialists (S. F. B., E. T., B. S., A. L. N., A. L. J., J. L. L., J. L. U., R. K.) from eight different hospitals, all with more than 5 years of experience in reading PET/CT studies. They segmented FDG uptake in tumour sites for tMTV calculations based on the following recommendations (Barrington & Meignan, 2019): 1. viable areas in lymph nodes with increased FDG uptake, 2. focal uptake in the spleen, irrespective of splenic size, 3. focal uptake in the bone marrow or other extra-nodal sites and 4. diffuse increased uptake in the spleen greater than the liver uptake, in the absence of reactive changes in bone marrow (spleen/liver ratio >1.5 and bone marrow/liver ratio <1.0).
The cloud-based RECOMIA software (recomia.org) was used, and every case was presented with CT, PET, fused [18F]FDG PET/CT and maximum intensity projection images (Trägårdh et al., 2020). The interpreter was able to shift between sagittal, coronal and transverse planes. The PET images were displayed by default with an upper SUV threshold of five, and both the SUV threshold and the colour scale could be changed according to the preference of the reader. The CT images could be viewed using standard settings, for example, soft tissue, lung and bone. Before starting, each specialist received an instruction document explaining the purpose of the study and two help videos showing how to perform the analysis.
The eight specialists analyzed 12 cases each. The cases were randomly distributed to them. Each case was analyzed by two different specialists.
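The reading scheme is internally consistent: 8 specialists × 12 cases gives 96 reads, that is, 48 cases × 2 reads. The following is a minimal sketch of one way such a balanced random distribution could be generated. The procedure actually used in the study is not described, so the function names, the seed and the round-robin scheme are illustrative assumptions only.

```python
import random


def assign_double_reads(n_cases=48, n_readers=8, seed=0):
    """Return {case: (reader_1, reader_2)} with two distinct readers per case.

    Hypothetical scheme: shuffle the cases, then deal them round-robin.
    The workload is balanced when n_cases is a multiple of n_readers.
    """
    rng = random.Random(seed)
    cases = list(range(n_cases))
    rng.shuffle(cases)
    assignment = {}
    for i, case in enumerate(cases):
        first = i % n_readers
        # Offset is between 1 and n_readers - 1, so the second reader always differs.
        offset = 1 + (i // n_readers) % (n_readers - 1)
        second = (first + offset) % n_readers
        assignment[case] = (first, second)
    return assignment


def reads_per_reader(assignment, n_readers=8):
    """Count how many cases each reader receives."""
    counts = [0] * n_readers
    for first, second in assignment.values():
        counts[first] += 1
        counts[second] += 1
    return counts
```

With the defaults, every reader receives 12 cases and every case is read by two different readers.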

| Organ-CNN-RECOMIA
The CNN uses a U-Net 3D architecture (Çiçek et al., 2016) and was trained with the procedure described in Trägårdh et al. (2020) to classify each pixel in the image into one of the classes background, liver, spleen, bones and bone marrow.

Focal spleen, liver and bone uptake
The method described in Sadik et al. (2021) was extended to cover focal spleen and liver uptake. The SUV threshold (THR) was defined as: where SUVmean is the average SUV for either organ in the segmentation mask automatically generated by the organ-CNN, and SD is the standard deviation. Pixels with SUV above this threshold were marked as tumour. Bone tumour was segmented as described in Sadik et al. (2021).
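As an illustration of organ-relative thresholding, the sketch below flags voxels inside an organ mask whose SUV exceeds a threshold built from SUVmean and SD. The exact expression for THR is not reproduced in this text, so the form SUVmean + k × SD and the multiplier `k` are assumptions made purely for illustration, as are the function and argument names.

```python
from statistics import mean, pstdev


def focal_uptake_mask(suv_values, organ_mask, k):
    """Flag voxels inside an organ whose SUV exceeds SUVmean + k * SD.

    ASSUMPTION: the form "SUVmean + k * SD" and the multiplier k are
    hypothetical; the source does not state the threshold formula here.
    suv_values and organ_mask are flat, equally long sequences.
    """
    organ_suv = [s for s, inside in zip(suv_values, organ_mask) if inside]
    if not organ_suv:
        return [False] * len(suv_values)
    thr = mean(organ_suv) + k * pstdev(organ_suv)
    # Only voxels inside the organ mask can be marked as tumour.
    return [bool(inside) and s > thr for s, inside in zip(suv_values, organ_mask)]
```

For example, `focal_uptake_mask([2.0, 2.0, 2.0, 2.0, 10.0, 9.0], [1, 1, 1, 1, 1, 0], k=1.5)` marks only the fifth voxel: the sixth has high uptake but lies outside the organ mask.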

Diffuse spleen uptake
The whole spleen was marked as tumour if both conditions from the fourth point under the heading "Image interpretation" were fulfilled: 1. Median SUV for spleen/median SUV for liver >1.5. 2. Bone marrow/liver SUV ratio <1.0.
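The diffuse spleen rule can be sketched as a simple two-condition check, using the ratios given under "Image interpretation" (spleen/liver >1.5 and bone marrow/liver <1.0). Using the median for the bone marrow ratio as well, and the function and argument names, are assumptions of this sketch.

```python
from statistics import median


def spleen_is_diffusely_involved(spleen_suv, liver_suv, bone_marrow_suv):
    """Return True if the whole spleen should be counted as tumour.

    ASSUMPTION: medians are used for all three organs; the text states
    medians explicitly only for the spleen/liver ratio.
    """
    liver_median = median(liver_suv)
    spleen_liver_ratio = median(spleen_suv) / liver_median
    marrow_liver_ratio = median(bone_marrow_suv) / liver_median
    return spleen_liver_ratio > 1.5 and marrow_liver_ratio < 1.0
```

A high bone marrow/liver ratio suggests reactive bone marrow changes, in which case the rule does not mark the spleen even when its uptake is well above that of the liver.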

| Lymphoma-CNN-RECOMIA
It is difficult to directly segment lymph nodes in the CT images, hence an approach similar to what was used to segment tumour in spleen, liver and bone is not practical for lymph node tumours.Instead, we modified the method described in (Borrelli et al., 2021) and trained a CNN to directly segment lymph node tumours.The CNN uses U-Net 3D architecture (Çiçek et al., 2016), with two 25% dropout layers.
The network has two channels with SoftMax activation: one for background and one for tumour. The network has three separate inputs: the CT image, the SUV image and an organ mask constructed using the output from the organ-CNN (Figure 1). If the loss for the validation set did not improve for five epochs, the learning rate was halved, and if it did not improve for 10 epochs, training was stopped. The model was trained for a maximum of 50 epochs. After this initial step, the loss was calculated for every pixel of every image. Using the loss, the sampling was updated such that 20% of the samples were now selected proportional to their loss, 40% from lymph node tumour and 40% from the background. The effect of this rebalancing was that pixels with high loss were sampled more often. This procedure was repeated five times.
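The learning-rate and stopping rule can be illustrated with a small simulation over a list of validation losses. This is one possible reading of the rule (halve once after five epochs without improvement, stop after ten); the function name, the counter handling and the example losses are assumptions, not the authors' implementation.

```python
def simulate_schedule(val_losses, lr=1e-4, halve_after=5, stop_after=10, max_epochs=50):
    """Return [(epoch, learning_rate), ...] for the epochs that would be run.

    ASSUMPTION: the no-improvement counter resets only when the validation
    loss reaches a new best value.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    history = []
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
        if stale == halve_after:
            lr /= 2  # halve the learning rate on a five-epoch plateau
        history.append((epoch, lr))
        if stale >= stop_after:
            break  # stop training on a ten-epoch plateau
    return history
```

With losses that improve for two epochs and then plateau, the rate is halved at epoch 7 and training stops at epoch 12.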
All connected components with tumour lesion uptake less than 1.0 were removed. As a final postprocessing step, all uptake whose closest local maximum in the SUV image was outside of the tumour uptake mask was removed.
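The first postprocessing step can be sketched as a connected-component filter. The sketch assumes 6-connectivity in 3D and interprets "uptake" as the maximum SUV within a component; neither detail is specified in the text, and the data layout (a set of voxel coordinates plus an SUV lookup) is chosen only to keep the example self-contained.

```python
from collections import deque

# ASSUMPTION: face-connected (6-neighbour) voxels form one component.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def remove_low_uptake_components(tumour_voxels, suv, min_uptake=1.0):
    """Keep only components whose maximum SUV reaches min_uptake.

    tumour_voxels: set of (x, y, z) tuples; suv: dict mapping voxel -> SUV.
    """
    remaining = set(tumour_voxels)
    kept = set()
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:  # breadth-first flood fill of one component
            x, y, z = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                neighbour = (x + dx, y + dy, z + dz)
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    queue.append(neighbour)
        if max(suv[v] for v in component) >= min_uptake:
            kept |= component
    return kept
```

Under these assumptions, a low-uptake voxel is kept when it is attached to a lesion with sufficient uptake, and an isolated low-uptake component is dropped.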
The automated lesion segmentations by both AI-based tools were used without any manual modifications to calculate tMTV.
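For reference, converting a segmentation mask into a volume amounts to multiplying the number of tumour voxels by the voxel volume. The sketch below shows this arithmetic; the function name is hypothetical, and the 1.36 × 1.36 × 3 mm voxel size in the usage note is the resampling resolution quoted for the lymphoma-CNN input, used here only as an example grid.

```python
def tmtv_cm3(n_tumour_voxels, voxel_size_mm):
    """Return the volume in cm^3 of n_tumour_voxels voxels of the given size (mm)."""
    dx, dy, dz = voxel_size_mm
    voxel_volume_mm3 = dx * dy * dz
    return n_tumour_voxels * voxel_volume_mm3 / 1000.0  # 1 cm^3 = 1000 mm^3
```

For example, 10 000 tumour voxels of 1.36 × 1.36 × 3 mm correspond to about 55.5 cm³.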
Once the patient examinations were loaded, the tMTV analyses were performed in seconds by RECOMIA and PARS.
In 22 of the 48 patients, a manual tMTV value was closer to the RECOMIA tMTV value than to the other manual tMTV value. In 11 of the remaining 26 patients, the difference between the RECOMIA tMTV and a manual tMTV was smaller than the median difference between the two manual tMTV values (26 cm³). The corresponding numbers for PARS were 18 and 10 patients, respectively.
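The two agreement criteria used here can be written as a per-patient classification. The sketch below is an interpretation of the text, with hypothetical function and label names; the 26 cm³ default is the median inter-reader difference reported above.

```python
def classify_agreement(manual_a, manual_b, ai, median_inter_reader_diff=26.0):
    """Classify an AI tMTV against two manual tMTV values (all in cm^3)."""
    inter_reader = abs(manual_a - manual_b)
    closest_ai = min(abs(manual_a - ai), abs(manual_b - ai))
    # Criterion 1: a manual value is closer to the AI value than to the other manual value.
    if closest_ai < inter_reader:
        return "closer_than_other_reader"
    # Criterion 2: the AI-manual difference is below the median inter-reader difference.
    if closest_ai < median_inter_reader_diff:
        return "within_median_inter_reader_difference"
    return "larger_difference"
```

For a patient with manual values of 1062 and 1217 cm³ and an AI value of 348 cm³, neither criterion is met, so the case falls in the third group.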
The patient with the largest difference (−792 cm³) between RECOMIA tMTV (348 cm³) and the two manual tMTV values (1062 and 1217 cm³) is shown in Figure 3. The main difference between the manual and AI-based segmentations is the decision whether to include or exclude the spleen in the tMTV. The AI-based tool followed the predefined rule (see Section 2) not to segment the entire spleen when the bone marrow/liver ratio is <1.0. The entire spleen was segmented in the two manual segmentations (Figure 3b,c) and the contribution of the spleen (712 and 713 cm³) to the total tMTV explains the difference between RECOMIA tMTV (Figure 3d) and one of the manual tMTV measurements (Figure 3b). One of the manual segmentations (Figure 3b) also included the bone marrow in many vertebrae, explaining the difference of 156 cm³ between the two manual tMTV values (Figure 3b,c). The PARS tMTV (Figure 3e, purple), on the other hand, also included the heart, the spleen and some vertebrae.
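The reported difference for this patient is consistent with subtracting the mean of the two manual values from the AI value, as in a Bland-Altman comparison. A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def difference_from_manual_mean(manual_a, manual_b, other):
    """Return (reference, difference), where reference is the mean manual tMTV."""
    reference = (manual_a + manual_b) / 2
    return reference, other - reference
```

With manual values of 1062 and 1217 cm³ and a RECOMIA value of 348 cm³, the reference is 1139.5 cm³ and the difference is −791.5 cm³, in line with the quoted −792 cm³.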
Figure 4 shows the patient with the second largest difference between RECOMIA calculated tMTV (276 cm³) and the two manual tMTV values (691 and 711 cm³). The incorrect segmentation by RECOMIA was most likely due to the rare appearance of a large tumour in the right side of the mediastinum, most likely not encountered in the training set used for the CNN.
The patient with the third largest difference between RECOMIA tMTV (1143 cm³) and the two manual tMTV values (1434 and 1518 cm³) is shown in Figure 5. There are no obvious mistakes by RECOMIA, but a tendency to segment all lesions as slightly smaller.
The PARS tMTV was only 501 cm³ in this patient, classifying the conglomerate below the diaphragm as physiological uptake while the heart was included in tMTV.

| DISCUSSION
We have demonstrated that, in 33 of the 48 patients in our study, the AI-based tool RECOMIA calculated, without any manual adaptation, a tMTV that was within the range of inter-reader tMTV or close to the manual tMTV (difference < 26 cm³) based on segmentation by nuclear medicine specialists. The corresponding result for PARS was 28 of the 48 patients. The results presented show how these AI-based tools perform without human intervention; however, these tools are not meant to be used unsupervised.

FIGURE 2 Bland-Altman plot showing the differences between the two manual tMTV values (•), PARS (+), RECOMIA (X) and the mean of the two manual tMTV values. One outlier for PARS is not shown (difference 18 096 cm³). tMTV, total metabolic tumour volume.

Furthermore, it is required in clinical practice that image analysis by AI tools can be used with no or minimal need for manual adjustments.
The results of this study indicate that the analysis of RECOMIA and PARS could be used without any major manual adjustment in 69% (33/48) and 58% (28/48) of the patients, respectively.
One explanation why the RECOMIA tMTV was within the range of inter-reader variability, or close to the manual tMTV, in 10% more of the cases than the PARS tMTV could be that RECOMIA was trained solely on staging HL lesions, while the PARS developers trained their system with lesions from lung cancer and lymphoma patients examined before, during and after treatment (Sibille et al., 2020).
Our results show a trend that the RECOMIA tool performs better than PARS in HL patients.
The patient with the largest differences between AI-based (Figure 3d) and manual tMTV values indicated that AI tools could support physicians in following recommendations regarding which FDG uptake to include in tMTV calculations (Figure 3). A large difference between AI-based and manual tMTV values is not necessarily a clinical problem if all values are very high (Figure 5d). After a case is loaded in an AI tool, the system can quantify tMTV in seconds. However, one should keep in mind that AI tools only perform well when trained with many examples of a specific finding.
A limitation of the organ- and lymphoma-CNN-based approaches in the current RECOMIA tool is their restriction to specific organs (bones, spleen, liver, lymph nodes), not taking into account other sites of rare extranodal involvement (e.g., skin manifestations).
The strength of a human expert will still lie in rare cases or rare findings. This is how experts and the technique can complement each other.
Other limitations to our study include that the specialists, when measuring tMTV, might choose to use software with semi-automatic growing algorithms that copy from one slice to another, or automatic contouring based on an SUV threshold rather than performing purely manual segmentations as used in this study, which might overestimate the advantages of the AI-based methods.
In general, the automated tools use organ segmentation extracted from the CT images before classifying increased FDG uptake as pathological or physiological. If the anatomical segmentation fails, physiological uptake will be classified as false positive and, therefore, included in the tMTV quantifications. This might explain the false positive quantifications made by PARS in Figures 3e, 5e and 6e.
AI-based tools have been developed for diffuse large B-cell lymphoma with good performance (Pomykala et al., 2023).Here, we present a novel method (RECOMIA) solely trained on HL lesions and tested on the same type of disease.
No gold standard is available to validate specialists' segmentations because the ground truth is unknown; hence a strength of our study is that we compared both AI tools with manual segmentations from eight specialists, working at eight different hospitals, which reflects the generalizability of the tools. The idea with such future AI tools is to facilitate equal healthcare no matter where the diagnostics are performed.

| CONCLUSIONS
The results of this study indicate that the analysis of the AI tools could be used without any major manual adjustments in 69% (33/48) and 58% (28/48) of the HL patients for RECOMIA and PARS, respectively. This shows the feasibility of using AI tools to provide clinicians with timely and accurate measurement of tMTV in clinical practice. The tools are not meant to be used unsupervised.

ACKNOWLEDGEMENTS
| AI-based tools
The PARS research prototype (version 3.0; Siemens Medical Solutions Inc.) was used to automatically analyse all the PET/CT studies (Sibille et al., 2020). The tool was trained (n = 380) and validated (n = 126) on lung cancer and lymphoma patients examined before, during and after treatment undergoing routine whole-body PET/CT at the University Hospital of Münster, Germany, from August 2011 to August 2013. Two nuclear medicine experts performed a manual delineation of all foci with increased [18F]FDG uptake by using a volume-of-interest tool. The analysis included segmentation of the liver as reference region, segmentation of PET foci using a thresholding algorithm, and classification of anatomical location and characterization of all detected PET foci as likely to represent tumour using a CNN. The volumes of all PET foci classified as suspicious (physiological uptake = FALSE) by PARS were used to calculate PARS-tMTV.
The RECOMIA tool consists of two CNNs: 1. one using only the CT image as input (organ-CNN), used to segment tumour in spleen, liver and bone, and 2. one using CT, PET and an auxiliary mask constructed from the CT image as input, aimed to directly segment lymph node tumours (lymphoma-CNN). The RECOMIA tool was trained to detect and segment nodal and extra-nodal focal lesions based on an independent training data set of 101 retrospectively selected lymphoma patients, with a median age of 43 years (range: 14-85 years), of whom 42% were women. Sixty-seven patients (66%) were untreated biopsy-verified HL patients undergoing staging [18F]FDG PET/CT at Sahlgrenska University Hospital between 2011 and 2016, while the rest were lymphoma patients scanned after treatment between 2008 and 2010. The latter group was used to train the AI tool to recognize cases without any lesions. Two nuclear medicine specialists performed the segmentations in the training set.
The organ mask helps with rough anatomical localization. As preprocessing, the CT images were restricted to [−800, 800] HU and the SUV images to [0, 25]. Both images were then resampled to a resolution of 1.36 × 1.36 × 3 mm, then normalized to [−1, 1]. The input patches were augmented using rotations (−0.15 to 0.15 radians), scaling (−10% to 10%) and intensity shifts of −100 to +100 HU for the CT images and −0.5 to +0.5 for the SUV image. The categorical cross-entropy was used as loss function and optimized using the ADAM (Kingma & Ba, 2014) method with Nesterov momentum and an initial learning rate of 0.0001. The model was trained using randomly selected patches, 50% from background and 50% from pixels marked as lymph node tumour. Each epoch contained 20 000 training and 4000 validation samples. If the loss for the validation set did not improve
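The intensity preprocessing (restricting CT to [−800, 800] HU and SUV to [0, 25], then normalizing to [−1, 1]) can be sketched per voxel value as below. The linear mapping of the clipping window onto [−1, 1] is an assumption, since the text does not specify how the normalization was done, and the function name is hypothetical. Resampling and augmentation are not covered.

```python
def clip_and_normalise(value, low, high):
    """Clip value to [low, high], then map that window linearly to [-1, 1].

    ASSUMPTION: a linear rescaling of the clipping window; the source only
    states the clipping ranges and the target range.
    """
    clipped = min(max(value, low), high)
    return 2.0 * (clipped - low) / (high - low) - 1.0


def preprocess_ct(hu):
    """Preprocess one CT value in Hounsfield units."""
    return clip_and_normalise(hu, -800.0, 800.0)


def preprocess_suv(suv):
    """Preprocess one SUV value."""
    return clip_and_normalise(suv, 0.0, 25.0)
```

Under this mapping, air at −1000 HU is clipped and maps to −1, 0 HU maps to 0, and any SUV of 25 or above maps to 1.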

FIGURE 1 Flowchart for the lymphoma-CNN in RECOMIA. The CNN takes three inputs: CT image, PET image and an automatically generated organ mask. The CNN produces a pixel-wise segmentation. The grey arrow indicates a lymphoma-related enlarged lymph node in the left axilla. CNN, convolutional neural network; CT, computed tomography; PET, positron emission tomography.

tumour in the right side of the mediastinum, most likely not encountered in the training set used for the CNN. The PARS tMTV was in agreement with the manual tMTV values (711 cm³).

FIGURE 4 (a) The original image. (b) The patient with the second largest difference between RECOMIA tMTV and the two manual tMTV values. Upper left image = CT, upper right = fused PET and CT, lower left = PET and lower right = maximum intensity projection. CT, computed tomography; PET, positron emission tomography; tMTV, total metabolic tumour volume.

FIGURE 5 (a) The original image. The patient with the third largest difference between the (b, c) two manual tMTV and (d) RECOMIA tMTV values. (e) PARS tMTV in purple, green areas classified as physiological uptake. (a) Original image, (b) (tMTV 1652 cm³), (c) (tMTV 1020 cm³), (d) (tMTV 1201 cm³), (e) (tMTV 19 432 cm³). tMTV, total metabolic tumour volume.

clinical problem if all values are very high (Figure 5d). Further development of AI tools is, however, needed before widespread adoption in clinical routine, as indicated in Figures 4 and 6e, ensuring that adequate numbers of cases are used for training or refining the network.
This study was supported by the Swedish State under the agreement between the Swedish Government and the County Councils; the ALF agreement (70380). The funders had no specific role in the conceptualization, design, data collection, analysis, decision to publish, or preparation of the manuscript. Sally F. Barrington acknowledges support from the National Institute for Health and Care Research (NIHR) (Grant No. RP-2016-07-001). This work was also supported by core funding from the Wellcome/EPSRC Centre for Medical Engineering at King's College London (Grant No. WT203148/Z/16/Z). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health and Social Care.

FIGURE 6 (a) The original image. The patient with the largest difference between the (b, c) two manual tMTV and (e) PARS tMTV values in purple, green areas classified as physiological uptake. (d) RECOMIA tMTV. tMTV, total metabolic tumour volume.