18F-FDG PET/CT Radiomic Analysis with Machine Learning for Identifying Bone Marrow Involvement in the Patients with Suspected Relapsed Acute Leukemia

18F-FDG PET / CT is used clinically for the detection of extramedullary lesions in patients with relapsed acute leukemia (AL). However, the visual analysis of 18F-FDG diffuse bone marrow uptake in detecting bone marrow involvement (BMI) in routine clinical practice is still challenging. This study aims to improve the diagnostic performance of 18F-FDG PET/CT in detecting BMI for patients with suspected relapsed AL. Methods: Forty-one patients (35 in training group and 6 in independent validation group) with suspected relapsed AL were retrospectively included in this study. All patients underwent both bone marrow biopsy (BMB) and 18F-FDG PET/CT within one week. The BMB results were used as the gold standard or real “truth” for BMI. The bone marrow 18F-FDG uptake was visually diagnosed as positive and negative by three nuclear medicine physicians. The skeletal volumes of interest were manually drawn on PET/CT images. A total of 781 PET and 1045 CT radiomic features were automatically extracted to provide a more comprehensive understanding of the embedded pattern. To select the most important and predictive features, an unsupervised consensus clustering method was first performed to analyze the feature correlations and then used to guide a random forest supervised machine learning model for feature importance analysis. Cross-validation and independent validation were conducted to justify the performance of our model. Results: The training group involved 16 BMB positive and 19 BMB negative patients. Based on the visual analysis of 18F-FDG PET, 3 patients had focal uptake, 8 patients had normal uptake, and 24 patients had diffuse uptake. The sensitivity, specificity, and accuracy of visual analysis for BMI diagnosis were 62.5%, 73.7%, and 68.6%, respectively. With the cross-validation on the training group, the machine learning model correctly predicted 31 patients in BMI. The sensitivity, specificity, and accuracy of the machine learning model in BMI detection were 87.5%, 89.5%, and 88.6%, respectively, significantly higher than the ones in visual analysis (P < 0.05). The evaluation on the independent validation group showed that the machine learning model could achieve 83.3% accuracy. Conclusions: 18F-FDG PET/CT radiomic analysis with machine learning model provided a quantitative, objective and efficient mechanism for identifying BMI in the patients with suspected relapsed AL. It is suggested in particular for the diagnosis of BMI in the patients with 18F-FDG diffuse uptake patterns.


Introduction
Acute leukemia (AL) is a hematological malignancy characterized by a rapid increase in the number of immature blood cells. Despite the high rates of initial complete remission, relapse remains a Ivyspring International Publisher formidable clinical challenge and has become a major cause of failure in treatment [1]. Leukemia relapse can occur intramedullary or extramedullary, or both. Patients typically undergo multiple bone marrow biopsy (BMB) in the follow-up to monitor the intramedullary relapse [2]. However, BMB is an invasive test and only evaluates a small proportion of the entire bone marrow. 18 F-fluorodesoxyglucose positron emission tomography/computed tomography ( 18 F-FDG PET/CT) has been proven to detect more extramedullary lesions missed by routine examinations [3][4][5][6][7][8].
The diagnosis of 18 F-FDG PET/CT-based leukemic bone marrow involvement (BMI) has not been fully evaluated due to the lack of systematic and large-scale studies. From the available leukemic bone marrow studies, mostly are case reports, we could speculate that diffuse uptake is the major pattern [9][10][11][12], and its incidence is much higher than that in the lymphoma studies [13]. It is quite difficult to determine whether diffuse uptake is BMI in visual assessment, because the judgment depends on the physician's experience, and both malignant and benign causes may have similar appearance [11,14,15]. In some lymphomatous bone marrow studies, diffuse uptake was considered to be BMI negative [16,17], while in other studies it was considered as BMI positive [18,19]. Because of the relatively high incidence of diffuse uptake in leukemia patients, it is not appropriate to take diffuse uptake as positive or negative for BMI in patients with suspected relapsed AL. In summary, the clinical 18 F-FDG PET/CT-based diagnosis of BMI in relapsed AL is still challenging.
Radiomics extracted and mined a large number of medical imaging features to quantify tumor phenotypic characteristics and could reveal features of the disease that are incomprehensible to the naked eye. It has been used in many solid tumors [20][21][22], while rarely used in bone marrow assessment. A recently published study indicated that 18 F-FDG PET-based radiomic analysis was helpful in identifying BMI [23]. We hypothesize that high-dimensional, high-throughput radiomic features from both PET and CT images would provide a thorough strategy for extracting the pattern of BMI, and thereby would be helpful in improving the diagnostic power of 18 F-FDG PET/CT in patients with suspected relapsed AL.

Patients
The study has been approved by the institution review board, and the need for written informed consent was waived. This study retrospectively analyzed images of AL patients who underwent 18 F-FDG PET/CT at Peking University People's Hospital between January 2012 and February 2019. The inclusion criteria were as follows: 1) acute myeloid leukemia or acute lymphoblastic leukemia patients who achieved complete remission after induction chemotherapy, 2) Age ≥ 16, 3) clinically suspected recurrence, but not yet started treatment, 4) no chemotherapy or granulocyte stimulation-factor within 1 month, 5) BMB has been completed within 1 week. The simple statistics of selected patients are summarized in Table 1. The patients were divided into two groups, i.e. 35 patients from January 2012 to February 2018 as training group and 6 patients from March 2018 to February 2019 as independent validation group.

PET/CT acquisition and reconstruction parameters
All patients fasted at least 6 h before scan, and the blood glucose level were controlled below 8.3 mM (range 4.7~8.0 mM). 18 F-FDG (provided by Atom high-tech Co., Ltd., Beijing, China) was injected intravenously with a weight-base dose of 5.55 MBq/kg (0.15 mCi/kg). After 60 minutes (60 ± 5 min, range 54 ~63 min) 18 F-FDG injection, the PET scan between the base of skull and the middle of the thigh was performed on a Discovery VCT (GE Healthcare, Milwaukee, Wisconsin, USA) with a 64-slice spiral CT. CT scan was firstly performed with a tube voltage of 140 Kev and a tube current of 80 mAs. The matrix size of CT was 512 × 512 with the voxel size 1.0 × 1.0 × 3.3 mm 3 . The PET data were collected in 3D mode for 2.5 min/bed and were corrected for attenuation with a CT-based attenuation correction method. The PET images were reconstructed using an iterative algorithm (ordered-subset expectation maximization with 2 iterations, 28 subsets) and 6-mm full width at half maximum (FWHM) of Gaussian filter. The matrix size of PET was 128 × 128 with the voxel size 5.5 × 5.5 × 3.3 mm 3 .

Clinical PET/CT review
Three nuclear medicine physicians with 15, 10, and 10 years of PET/CT reading experiences visually assessed bone marrow 18 F-FDG uptake in each patient. They were allowed to refer the corresponding clinical data except for the BMB results. Focal uptake, the presence of 18 F-FDG-avid foci, which could not be explained by benign findings on underlying CT or clinical history, was considered as positive for BMI. Normal uptake, the uptake of bone marrow equal to or lower than the liver, was considered as negative for BMI. For the diffuse uptake, the uptake of bone marrow higher than liver, the physicians made their diagnosis based on their visual assessment in the 18 F-FDG bone marrow uptake distribution, intensity and apparent heterogeneity. In case of discrepancy, the examination was conjointly reviewed to reach a consensus. The BMB results were used as the gold standard or real "truth" for BMI diagnosis in the study. All the true positives (TP) and true negatives (TN) were recorded as successful diagnosis, whereas all the false positives (FP) and false negatives (FN) cases were recorded as failed diagnosis.

PET/CT radiomic analysis with machine learning
As illustrated in Figure 1, the radiomic analysis composed of three major stages. Firstly, based on the manual delineation of the volumes of interest (VOIs) from CT and then ascertained on PET, our model automatically extracted high-dimensional imaging features from both PET and CT VOIs; then important and discriminative features for pattern extraction were selected using harnessed correlation analysis and machine learning models; and finally, a machine learning based prediction model was validated for the classification of BMB cases.
The first stage was radiomic feature extraction. A semi-automatic procedure for axial skeleton VOI definition is described in a previous study which shows high reproducibility [23]. A software XD3 (Mirada Medical) was used for PET-CT image display and processing. The VOI including the spine and the pelvis was firstly determined by CT densities of Hounsfield units >130, and then all irrelevant bone areas were manually excluded. The final CT VOIs were then displayed on fused PET images to check if there were possible regions of increased 18 F-FDG uptake near the skeleton, including extramedullary lesions and bladder. Areas of contiguous bone involvement and bone hyperplasia and sclerosis were also manually excluded.
From PET/CT VOIs, in total 1826 quantitative features including 781 features from PET and 1045 from CT were extracted. We extracted the radiomics features with the PyRadiomics package [24] (https:// github.com/Radiomics/pyradiomics) which is compliant with the Imaging Biomarker Standardization Initiative [25]. From this package, we extracted the radiomics features from the original PET and CT images, filtered images with coiflet wavelet and Laplacian of Gaussian (LoG) respectively. The images were discretized with a fixed bin size of 25 HU, which was quite commonly used in radiomics literature [26][27][28]. The extracted features reflected the disease characteristics including intensity distribution, texture pattern, morphological information, and spatial locations, as well as wavelet features [24]. The detailed list of extracted features was provided in the Supplementary Materials (I. Experimental settings of radiomic features). Conventional PET metrics were also considered with equivalent features included in the features list. Specifically, the maximum and mean of the standard uptake value (SUV) were represented by the "Intensity Histogram" features "Maximum" and "Mean" from the original PET image, and the metabolic tumor volume (MTV) could be represented by "Morphology" feature "Volume". Texture patterns were represented statistically by some common matrix, such as gray level co-occurrence matrix (GLCM), gray level size zone matrix (GLSZM), and gray level run length matrix (GLRLM). In addition, features from LoG and wavelet images were able to depict subtle texture features at different coarseness levels and frequency domains.
The second stage was important feature selection with model construction. To reduce the high dimensionality of features, our selection strategy incorporated both intrinsic and statistical feature relationship as well as an outcome-driven machine learning model. To ensure that the feature-set was accurately clustered, we first repeated consensus cluster sampling for n=50 times to achieve the most stable groups. And then, to select the most important features, our selection process included: 1) from each cluster, the most representative features were selected based on random forest [29] tree importance (importance ≥ 0.01), 2) key features were selected from the representative features by univariate random forest using the area under the curve (AUC ≥ 0.7), 3) to further eliminate the remaining redundant features, we then utilized the pairwise Pearson correlation matrix, 4) recursive feature elimination [30] was adopted to select the most important features to form radiomic pattern. Thereby, the machine learning prediction model could be constructed only with the selected important features using Random Forest algorithm. The detailed settings of Random Forest are provided in the Supplementary Materials (II. Parameters setting of the Random Forest prediction model).
The last stage was model validation. The machine learning model was trained by a Stratified ten-fold cross-validation on the training dataset, and the proportion of the positive-negative sample ratio in training and testing sets were approximately the same as in the original data set. To validate the robustness and stability of the machine learning model, we utilized both cross-validations and independent validations to assess the performance of the model. Ten-fold cross-validations were performed within the training group. As to the independent validations, the model was trained with the entire training group and then evaluated on the independent validation group. Feature importance ranking were adopted in the random forest model to evaluate the representative value of selected features. The feature-set was continuously and randomly permuted and scored, and the importance scores of the variable were obtained.
The performance of the pattern in this model was evaluated using receiver operating characteristic (ROC) curve. Wilcoxon test was utilized for feature P values (P ≤ 0.05) selection for both continuous and classification variables. The sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV) were also computed by Confusion matrix-derived metrics. Statistical analyses were performed "scikit-learn", "scipy", "math" packages in Python programming language.

Clinical visual analysis
The visual analysis was performed on the patients of training group with 16 BMB positive and 19 BMB negative patients. According to the visual analysis, 3 patients were classified as focal uptake, 8 as normal uptake and 24 patients were classified as diffuse uptake. Visual analysis correctly diagnosed all focal uptake patients and 7 out of 8 normal uptake patients. However, as to the diffuse uptake cases, visual analysis correctly diagnosed 14 cases, with 7 TP and 7 TN, failed in 10 cases with 5 FP and 5 FN. In summary, visual analysis achieved a successful diagnosis in 68.6% (24/35) of patients. The AUC of the visual analysis was 0.681 (95% confidence interval was 0.502-0.828). Its sensitivity, specificity, accuracy, PPV and NPV was 62.5%, 73.7%, 68.6%, 66.7% and 70.0%, respectively.

Feature selection and machine learning model
Feature selection procedure and results are illustrated as Figure 2. It could be observed that although Morphology features were extracted from images, these features were eliminated due to their statistical insignificance by statistical analysis. The texture features from original CT image were all eliminated due to their less importance determined by the Random Forest algorithm. The following feature univariate random forest selection showed that the features from original PET and CT images were less predictive in comparison with the features from LoG filtered and Wavelet decomposed images. Finally, after recursive feature elimination process, the machine learning model consisted of two PET and one CT features ( Table 2). It could be observed that the three selected features were all from the wavelet decomposed images capturing the textural information with low pass filters applied to the first two dimensions and high pass filter applied to the last dimension. The feature values extracted from the experimental dataset are normalized and summarized in Table 3. These values were assigned different weights when performing the model prediction.
The model was evaluated with both crossvalidation and independent validation. In the crossvalidation, the model correctly predicted 31 patients with 14 TP and 17 TN, incorrectly predicted 4 (2 FP and 2 FN) patients of 18 F-FDG diffuse uptake. The machine learning model achieved a successful diagnosis in 88.6% (31/35) of patients, which was significantly higher than that of visual analysis by using Pearson Chi-square test (P=0.041). The AUC of the model was 0.885 (95% confidence interval was 0.732-0.968), which was significantly higher than that of visual analysis (P=0.046). Its sensitivity, specificity, accuracy, PPV and NPV was 87.5%, 89.5%, 88.6%, 87.5% and 89.5%, respectively. As to the independent validations, the prediction model could achieve 83.3% (5/6) accuracy on the independent validation dataset. Among the six patients, one (out of two) focal uptake patient was incorrectly predicted as FN, while all the diffuse uptake and normal uptake patients were correctly predicted.

Results analysis and interpretation
Results from the study show that the differences between the two methods mainly existed in the diagnosis of the patients with diffuse uptake. The machine learning model achieved 83.3% (20/24) prediction accuracy, in comparison with 58.3% (14/24) accuracy from visual analysis. Among the 10 visually failed diffuse uptake cases, the machine learning model correctly predicted 9 of them. Visual analysis correctly diagnosed the other three cases in which the machine learning model failed.
Two representative cases from visual analysis and machine learning model are illustrated by Figure  3. As shown in Figure 4 for the distribution histogram of the three normalized features among all experimental data, there existed BMB positive and BMB negative patients sharing same feature value ranges. Therefore, BMB positive and negative patients could not be discriminated from an individual feature (with mean accuracy of 70.8%, 72.7% and 76.7% respectively for Kurtosis, RunEntropy and SRHGLE features). As to the case 3A, according to the first and third features, since there were more BMB negative patients than positive ones exhibiting the same feature value, these two features would suggest that the patient was more probably to be BMI negative. However, the distribution of the second feature was against this negative suggestion. As to the case 3B, although all three features were suggesting that the patient was more likely to be negative, the possibility of a positive case could not be eliminated, given that a few positive cases were exhibiting the same feature values.
Where is the number of discretized grey level intensity in the mask of VOI, is the maximal possible run length in the mage. is normalized the run length matrix. is a discretized grey level and is occurrences of runs with length in matrix. Measuring the distribution of gray levels randomness from an image filter by a mid-frequency wavelet. The higher the value, the stronger the heterogeneity in the texture patterns. Wavelet-LLH_firstorder _kurtosis _PET Formula: is the intensities set included in the ROI intensity mask denoted as { ,1 , ,2 , … , , }. is average gray level intensity within the VOI, Measuring the peak of image VOI pixel value distribution in a mid-decomposition domain by using wavelet filter. The lower the value implies the mass of distribution concentrated towards a peak close to the mean value, vice versa. Wavelet-LLH_GLRLM_S RHGLE _CT Formula: Where is the number of discretized grey level intensity in the mask of VOI, is the maximal possible run length in the mage. be the run length matrix for an arbitrary direction .
be the number of runs in the image along angle . is a discretized grey level and is occurrences of runs with length in matrix. Measuring the distribution of homogeneity by measuring the short run length distribution of higher gray values after mid-pass wavelet filter.
GLRLM: gray level run length matrix, LLH: low, low, and high frequency, SRHGLE: short run high gray level emphasis   The machine learning model quantitatively combined these features for the final prediction while considering their diverse contributions. The contributions of features could be explained by the weighting coefficients derived from Local Interpretable Modelagnostic Explanations (LIME) model which is a local linear approximation of the trained prediction model [31]. The LIME model perturbed the feature values and observed the resulted changes in prediction. The features, which the prediction was more sensitive to, would be assigned higher weight values. Positive weights indicated that the increase in the corresponding features would be more supporting a positive prediction, while negative weights would indicate the changes supporting a negative prediction. The right column of Figure 3 shows the features weights employed in the prediction of the two representative cases, and the predictions were derived from the linear combinations of the features weights and features values.

Discussion
To tackle the well-recognized difficulties of visual analysis of BMI, we developed a 18 F-FDG PET/CT radiomic analysis in the patients with suspected relapsed AL. To the best of our knowledge, there have been no previous studies using radiomic features with machine learning methods to assess leukemic bone marrow uptake, and it is a relatively large-scale bone marrow 18 F-FDG PET/CT study.
Considering the sample size, we employed the Random Forest prediction model in our study. As evaluated by Gunduz et al [32], the random forest model substantially outperformed other techniques on both real life and simulated data regarding the task of robust classification in the high dimension low sample size context. Floares et al [33] further justified that the Random Forest method would derive accurate and robust model from omics data of small sample size. Such characteristic made random forest model more suitable to our study where radiomic pattern would be derived from high dimensional data (a total of 1826 features for each patient) of limited number of sample studies. Additionally, according to the theory of Chalkidou et al [34], 10 to 15 patients are minimally required to test one radiomic feature, our model reduced the number of features to 3 features and would be valid to minimize false detection rates regarding the sample size in our study. The intra-and inter-observer variabilities and their influence on the performance of our prediction model was also evaluated in the study in Supplementary Materials (III: Influence of intra-and inter-observer variability on prediction).
The first finding of this study is that the machine learning model achieved a high accuracy for detecting the BMI, outperforming that of visual analysis, and was particularly excellent in analyzing diffuse uptake patterns. The diagnostic value of machine learning model statistically outperformed visual analysis in terms of AUC (0.885 vs. 0.681, P=0.046), and the successful diagnosis rate of machine learning model was significantly higher than that of visual analysis (88.6% vs. 68.6%, P=0.041). For the diffuse uptake patients, the machine learning model achieved 83.3% (20/24) prediction accuracy, in comparison with 58.3% (14/24) accuracy from visual analysis. The independent validation further justified the excellence of the machine learning model for diffuse uptake pattern. To the best of our knowledge, this is the first study to apply artificial intelligence technology to improve the 18 F-FDG PET/CT-based clinical diagnosis of BMI in the patients with suspected relapsed AL. A comparable radiomic analysis result was reported in the patients with diffuse large B cell lymphoma, where the AUC of a first-order Skewness feature in detecting BMI was 0.821, and its sensitivity and specificity was 81.8% and 81.7%, respectively [23]. The Skewness feature and its variants were also extracted in our experiments, and their performances (mean accuracy of 52%, range 34.7%~67.2%) were all lower than that of the individual three features we selected, and thereby also lower than the performance of our radiomic pattern (Supplementary Materials IV. Comparison of Skewness features with selected features).
Another finding is that this study provided an interpretable insight into the output of BMI from the machine learning model. Due to the complexity and opacity of algorithms, machine learning methods are often criticized as black boxes. We attempted to interpret the results of model predictions based on the LIME model. LIME approximated the machine learning model as a local linear model which is a linear combination of the feature values and the corresponding relative weighting coefficients. With the derived weights of features, the driving factors of the machine learning model prediction could be extracted. A more detailed explanation is in the results section.
Interestingly, a CT feature became an integral part of the model in the present study. Although the value of features extracted from unenhanced lowdose CT has been demonstrated in the studies of non-small cell lung cancer [35], lymphoma [36] and esophageal cancer [37], there are no such published studies on bone marrow. Based on the experience of visual analysis, CT is suitable to visualize cortical and trabecular bone, while not a routine method for bone marrow assessment [38,39]. In the present study, the CT feature contributed with a relatively high weight in some patients. However, the value of CT features on BMI requires a larger number of research samples for further confirmation.
In addition, in comparison to the PET conventional metrics (SUVmax, SUVmean, MTV and TLG), our selected radiomics features possessed much stronger correlations with BMB. The equivalent features to the three conventional metrics, i.e. SUVmax, SUVmean and MTV, were initially included in the extracted radiomics set. However, these three equivalent features were excluded automatically by our feature selection procedure on the basis of their discriminative contributions. We calculated another conventional metric, TLG=MTV*SUVmean [40]. The prediction accuracy for these four individual conventional metrics were 53.9%, 44.2%, 50.5% and 51.5% respectively. Further comparison analysis on the correlations with BMB was performed between PET conventional metrics and our three selected radiomics features (Table 4). The comparison showed that the BMB correlation values of our selected radiomics features were 0.42, -0.41 and -0.38 while the correlation values of the four PET conventional metrics were -2.33E-01, 0.19, 0.22 and 0.29.  The last finding is that our automated radiomic analysis method could serve as a non-invasive test option complementing the visual analysis for the diagnosis of suspected relapsed AL. For the 11 failed cases in visual analysis, our machine learning model correctly predicted 10 of them by analyzing the radiomic features purely based on the PET/CT scans. And that would suggest our model being an eligible non-invasive test option complementing the visual analysis for a more comprehensive and accurate diagnosis.
For the next stage, we will be performing translational research by 1) harnessing automated bone segmentation software with machine learning based prediction model for automated processing and analysis platform, and 2) installing the software platform in our collaborative hospitals for multicenter study for standardization of the imaging biomarkers for BMB.

Conclusion
18 F-FDG PET/CT radiomic analysis with machine learning model provided an objective and efficient mechanism for identifying the BMI in suspected relapsed AL, and could serve as a non-invasive test option complementing the visual analysis to derive a more comprehensive, confident and accurate diagnosis. It is suggested in particular for the diagnosis of BMI in the patients with diffuse uptake.