Enhancing the Clinical Utility of Radiomics: Addressing the Challenges of Repeatability and Reproducibility in CT and MRI

Radiomics, which integrates the comprehensive characterization of imaging phenotypes with machine learning algorithms, is increasingly recognized for its potential in the diagnosis and prognosis of oncological conditions. However, the repeatability and reproducibility of radiomic features are critical challenges that hinder their widespread clinical adoption. This review aims to address the paucity of discussion regarding the factors that influence the reproducibility and repeatability of radiomic features and their subsequent impact on the application of radiomic models. We provide a synthesis of the literature on the repeatability and reproducibility of CT/MR-based radiomic features, examining sources of variation, the number of reproducible features, and the availability of individual feature repeatability indices. We differentiate sources of variation into random effects, which are challenging to control but can be quantified through simulation methods such as perturbation, and biases, which arise from scanner variability and inter-reader differences and can significantly affect the generalizability of radiomic model performance in diverse settings. We offer four recommendations for repeatability and reproducibility studies: (1) detailed reporting of variation sources, (2) transparent disclosure of calculation parameters, (3) careful selection of suitable reliability indices, and (4) comprehensive reporting of reliability metrics. This review underscores the importance of accounting for random effects in feature selection and of harmonizing biases between development and clinical application settings to facilitate the successful translation of radiomic models from research to clinical practice.


Introduction
In the era of precision medicine, translational research that aims to solve specific clinical questions with technical developments has gained increasing popularity due to the availability of structured medical data and rapid advancements in data mining techniques. Imaging is one of the most frequently analyzed data modalities due to the wide availability of imaging data and their rich anatomical, textural, and functional information. Image-derived biomarkers have been used in routine clinical practice, such as the TNM stage determined from multiple imaging modalities and the bone scan index calculated from SPECT [1]. Meanwhile, new imaging biomarkers have been actively investigated to fully explore the potential of imaging data in personalized clinical decision making. One of the most popular quantitative imaging biomarker development techniques is radiomics, in which a comprehensive set of features is extracted from medical images and correlated with the underlying pathophysiology [2]. The general workflow of radiomics is shown in Figure 1. Radiomics has been recognized as a potentially effective tool for diagnosis, survival prognosis, and toxicity prediction by combining selected features into single predictive signatures [3][4][5][6][7].
Despite the large number of radiomic signatures proposed in previous studies, very few have been externally validated in a prospective setting [8], posing a great challenge to reliable clinical applications. A repeatable and reproducible radiomic signature requires complete transparency in signature composition and consistent measurements in the same or similar conditions [9]. The latter, consistent measurement, relies heavily on the repeatability and reproducibility of individual radiomic features, which can be affected by inconsistencies during all the steps of radiomic feature acquisition, including image acquisition, structure delineation, image preprocessing, and feature extraction. The Imaging Biomarker Standardization Initiative (IBSI) attempted to achieve consensus on the settings and procedures in image preprocessing and feature extraction through international collaboration and provided guidelines on quality assurance and feature reporting [10]. On the other hand, absolute agreement in image acquisition and structure delineation is infeasible due to heterogeneous machines and image acquisition protocols, randomness in machine status and patient setups, as well as bias and error in manual structure delineations. Most reviews have approached this issue by summarizing and analyzing the reliability of radiomic features with respect to each aspect and have provided suggestions to mitigate the repeatability issues in each case. More clinical radiomic studies have assessed radiomic feature repeatability and reproducibility under these types of uncertainties using experimental techniques, including test-retest imaging and multiple delineations [11][12][13][14], and perturbation methods [15][16][17].
It has been recognized that each step in the radiomic workflow affects radiomic feature reliability, and several review articles have comprehensively investigated the sources of variation affecting radiomic feature repeatability in radiomic workflows. Zhao [18] discussed the sources of variation in the radiomic workflow, categorizing them into controllable and uncontrollable factors to provide a deeper perspective on the reliability of radiomics, and also proposed potential solutions for each step. Yet, despite extensive discussion of the radiomic workflow, most reviews evaluated sources of variation step by step along the workflow, and there was a lack of discussion on the nature of these sources. Furthermore, most of the reviews summarized the sources of variation without discussing the reported results. Additionally, focusing on individual feature repeatability would be more practical for developing reliable radiomic signatures. For example, assessing radiomic feature repeatability via test-retest and perturbation methods has proven valuable in safeguarding the reliability of radiomic signatures: signatures developed using only repeatable radiomic features show improved internal and external generalizability.
Therefore, this review aimed to (1) summarize the sources of variation in the CT/MR-based radiomic feature repeatability literature in terms of random effects and bias and discuss their implications for different applications and (2) summarize the number/proportion of highly repeatable and reproducible radiomic features under randomness and bias in radiomic workflows reported in previous studies. Specifically, for the second aim, we focused on the comparison of highly repeatable and reproducible features under different sources of variation, including scanners, image acquisition protocols, test-retest imaging, intra-observer contouring, and inter-observer contouring, categorized by imaging modality. This review could provide a more holistic picture of radiomic feature susceptibility to different bias and random factors during applications in real clinical scenarios and facilitate the construction of reliable radiomic signatures.

Eligibility Criteria
Peer-reviewed full-text articles written in the English language and published between 1 May 2017 and 1 December 2023 were eligible for inclusion in this review. Three electronic databases (PubMed, Embase, and Web of Science) were used to search for records. Publications included in our review met all of the following inclusion criteria: (1) peer-reviewed English full-text reports; (2) included radiomic features extracted from CT images or MR images; (3) indicated compliance with IBSI during feature extraction; and (4) focused on the repeatability and reproducibility of radiomic features resulting from variations during image acquisition and segmentation.

Research Strategy
To search for articles, we used the following search string: ("Radiomics" OR "Texture") AND ("Reproducibility" OR "Repeatability" OR "Robustness" OR "Stability") AND ("CT" OR "CT Scan" OR "Computed Tomography" OR "MRI" OR "Magnetic resonant image"). We also screened the Cochrane Database of Systematic Reviews for any previous reviews addressing the robustness of CT-based radiomic features. For all the articles obtained where we used the full text for data extraction, we screened the bibliographic references within them for additional potentially eligible studies. The researchers downloaded these electronic full-text articles using university library subscriptions. Two experienced researchers independently reviewed the eligibility of the studies based on the previously mentioned eligibility criteria.

Data Extraction
Data were extracted to a spreadsheet with a drop-down list for each item, defined by the first author, imaging modalities, sources of variations, criteria for highly repeatable/reproducible features, imaging sites, number/proportion of highly repeatable/reproducible features, and availability of repeatability/reproducibility metric values for individual features.

Overall Results
Overall, 38 publications, including 24 publications on CT scans and 16 publications on MR scans (2 publications involved both CT and MR scans), were included in the analysis. The inclusion flowchart is shown in Figure 2.
The proportion of features found to be highly repeatable against random effects ranged from 15.1% to 93.1% across the literature investigating feature repeatability on CT scans and from 0.50% to 91.6% across the literature investigating feature repeatability on MR scans. The proportion of features that were highly reproducible against bias ranged from 0% to 100% across the literature on CT scans and from 2.5% to 96.7% across that on MR scans. Furthermore, a clear trend was observed: more features are susceptible to inter-scanner/observer variability than to intra-scanner/observer variability.
Lastly, 26 out of 38 included publications had repeatability/reproducibility indices available for individual radiomic features.

Random Effects Affecting CT-Based Radiomic Features
Ten publications were included in the summary of the role of random effects on the repeatability of CT-based radiomic features, as shown in Table 1.

Bias Affecting CT-Based Radiomic Features
Twenty-one publications were included in the summary of the effect of bias on the repeatability of CT-based radiomic features, as shown in Table 2.

Random Effects Affecting MR-Based Radiomic Features
Eight publications were included in the summary of the role of random effects in the repeatability of MRI-based radiomic features, as shown in Table 3.

Bias Affecting MR-Based Radiomic Features
Thirteen publications were included in the summary of the effect of bias on the reproducibility of MRI-based radiomic features, as shown in Table 4.

Significance of Repeatability and Reproducibility in Radiomic Studies
Radiomics has emerged as a pivotal technique to augment the value of medical imaging through high-throughput characterization of medical images. The explicit mathematical definitions of each radiomic feature enhance the interpretability of radiomic signatures, offering a more transparent alternative to the "black box" deep learning models. The key advantage of radiomics over deep learning-based methods is the standardization of image preprocessing and feature definitions achieved through the efforts of Zwanenburg et al. An increasing number of publications have explored the diagnostic and prognostic value of radiomic approaches for various diseases in recent years. However, despite the surge in publications, concerns about reproducibility and repeatability have been prevalent since the inception of the field. It is believed that the use of highly repeatable and reproducible radiomic features should be the first and foremost criterion to safeguard downstream model reliability [16]. Previous evidence has also suggested the positive impact of repeatable radiomic features in improving both the internal and external generalizability of radiomic models [6,17,52]. Guidelines such as the CheckList for EvaluAtion of Radiomics research (CLEAR) [53], the Radiomics Quality Score (RQS) [54], and the Joint EANM/SNMMI guideline on radiomics [55] have underscored the importance of evaluating the reproducibility and repeatability of radiomic features and models. Understanding the source of radiomic feature variability is fundamental to harnessing the potential of radiomics in precision medicine by ensuring its reliability and determining its scope of application.

Randomness: A Fundamental Source of Variation in Radiomic Studies
Randomness, the inherent and unpredictable variability that cannot be controlled in image acquisition and segmentation, is a primary concern when addressing repeatability issues in radiomic studies. The influence of randomness on radiomic features has been the subject of extensive research, particularly through the use of repeated scans with identical scanners at brief intervals. This randomness can arise from factors such as patient positioning and scanner noise, which may induce fluctuations in image intensity and, as a result, affect the consistency of radiomic features. For instance, Muenzfeld et al. [29] investigated the repeatability of CT-based radiomic features by performing multiple scans on a medical phantom with the same scanner, applying a concordance correlation coefficient (CCC) threshold of 0.85 to define repeatability. Their study revealed that a mere 22% (19 out of 86) of the features from original images met this repeatability criterion. Similarly, other research has explored the effects of randomness on radiomic features, particularly with intra-observer segmentations, where a single observer is responsible for multiple delineations on the same subject. Here, the randomness is attributed to the variability in segmentation boundaries, which affects the region of interest (ROI) for feature extraction and, consequently, the radiomic features themselves. Specifically, Duan et al. [31] examined the impact of intra-observer variability by setting a more permissive threshold for high repeatability, with an ICC greater than 0.75. Their results indicated that 78.5% (84 out of 107) of CECT-based radiomic features and 72.0% (77 out of 107) of CT-based radiomic features were repeatable. It is important to recognize that intra-observer variability in ROI delineation may differ depending on the imaging modality, the anatomical site, and the observer's experience. These studies highlight the vulnerability of radiomic features to randomness. The implications are clear: employing non-repeatable features to construct a radiomic signature can render the signature susceptible to randomness, potentially leading to a significant margin of error in its prognostic or diagnostic utility. Therefore, ensuring the repeatability of radiomic features is essential for the development of robust and reliable radiomic signatures.
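As an illustration of how a CCC-based repeatability criterion of this kind can be evaluated, the following minimal sketch computes Lin's concordance correlation coefficient for one feature measured on two repeated scans. The function name and the example values are hypothetical and are not taken from any of the cited studies; the 0.85 cut-off is the threshold quoted above.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient between paired measurements.

    x, y: feature values from scan 1 and scan 2 of the same subjects.
    Population (1/n) variances are used, as in Lin's original definition.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least 2 values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    # The squared mean difference penalizes systematic shifts between scans.
    return 2 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical feature values: the retest equals the test plus a constant offset.
scan_1 = [1.0, 2.0, 3.0, 4.0]
scan_2 = [2.0, 3.0, 4.0, 5.0]

ccc = concordance_correlation(scan_1, scan_2)
is_repeatable = ccc >= 0.85  # threshold quoted in the text above
print(f"CCC = {ccc:.3f}, repeatable at 0.85: {is_repeatable}")
```

In this toy case the two scans are perfectly correlated, yet the constant offset pulls the CCC below the 0.85 threshold, which is why the CCC is a stricter criterion than a plain correlation coefficient.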

Bias: Inter-Observer and Inter-Scanner Variations as a Significant Hurdle to Generalizable Radiomic Signatures
Variations in the measurement settings of radiomic features can significantly impact their consistency. These variations can stem from changes in acquisition protocols or across scanners, or from segmentations being performed by different observers. It is crucial to distinguish variations that are not random but are instead attributable to the inherent biases associated with different scanners or observer practices. Inherent bias refers to systematic differences that are difficult to replicate during applications. For example, a radiomic signature developed using data from a specific scanner and radiologist is likely to underperform in a new institution with a different scanner and radiologist.
Radiomic signatures often demonstrate optimal validation performance within the clinical settings in which they were developed. However, this performance can deteriorate when they are applied in different scenarios, such as when using alternative scanner brands or imaging protocols, or when segmentation is conducted by different observers. The further the application deviates from the original clinical setting, the more pronounced the decline in generalizability becomes, as evidenced by diminished discrimination performance.
The impact of inter-observer and inter-scanner variability on the reproducibility of radiomic features has been the focus of several studies. It has been consistently observed that radiomic features are more vulnerable to variations introduced by different observers or scanners than to those arising from the same observer or scanner. For example, Fiset et al. [13] examined the repeatability and reproducibility of MR-based radiomic features in the context of inter-scanner and intra-scanner rescans. Their findings indicated that while 52.1% of radiomic features remained reproducible in the face of intra-scanner variability, a mere 14.1% maintained reproducibility when confronted with inter-scanner variability. A similar pattern emerged when comparing intra-observer and inter-observer variability.
Understanding the distinction between random effects and bias is imperative for the reporting of results. The metrics used to measure repeatability and reproducibility should differ. High repeatability against random effects is typically defined by the absolute agreement between repeated measures, whereas consistency measures are more appropriate for defining reproducibility in the presence of bias. Koo et al. [56] provided a practical guideline for applying the intra-class correlation coefficient (ICC) in assessing test-retest reliability, inter-rater reliability, and intra-rater reliability. This guideline facilitates the appropriate use of ICCs to account for both random effects and bias, thereby enhancing the reliability of radiomic feature measurements.
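To make the difference between absolute agreement and consistency concrete, the sketch below computes both single-measurement, two-way ICC forms from the mean squares of a two-way ANOVA, as these forms are commonly written in the ICC literature. The function name and the toy data are assumptions introduced for illustration only; a validated statistics package would be preferable in an actual study.

```python
def icc_two_way(scores):
    """Single-measurement two-way ICCs for an n x k table of feature values.

    scores: list of n rows (subjects or lesions), each holding k repeated
            measurements (e.g., k scanners, observers, or scans).
    Returns (agreement, consistency), i.e. ICC(A,1) and ICC(C,1).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)               # between-subject mean square
    ms_cols = ss_cols / (k - 1)               # between-measurement mean square
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual mean square

    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return agreement, consistency


# Toy data: the second measurement equals the first plus a systematic offset,
# mimicking a bias between two scanners or two observers.
scores = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [4.0, 5.0]]
agreement, consistency = icc_two_way(scores)
print(f"ICC(A,1) = {agreement:.3f}, ICC(C,1) = {consistency:.3f}")
```

With a purely systematic offset, the consistency form is unaffected while the absolute-agreement form is penalized, which is the reason the choice of index should match the nature of the variation being studied.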

Efforts to Mitigate Randomness for Repeatable Radiomic Signatures
The primary goal in mitigating the effects of randomness on radiomic features is to develop a robust radiomic signature that can consistently deliver the same results, irrespective of random fluctuations. To achieve this, two key strategies should be employed.
First, the extraction process should be refined to maximize the number of reproducible features, which involves optimizing image preprocessing parameters such as interpolations, rounding intensities, and outlier filters. Second, the construction of a radiomic signature should be based on features that have demonstrated repeatability and resilience to random variations.
Dewi et al. [47] conducted a study to assess the repeatability of features under various preprocessing conditions on T2-weighted MR images. They pinpointed a specific set of preprocessing parameters, namely, a fixed bin count of 64, the absence of signal intensity normalization, and the exclusion of outliers, which resulted in the highest number of repeatable features. However, this raises a critical question: Is the sheer quantity of repeatable features the most reliable indicator of optimal preprocessing settings, or should the sensitivity of radiomic features to preprocessing also be taken into account?
Moreover, the construction of radiomic signatures should prioritize the inclusion of repeatable features. Teng et al. [17] evaluated multiple radiomic signatures by systematically excluding features with low repeatability, applying ICC thresholds of 0, 0.5, 0.75, and 0.95. Their findings indicated that increasing the threshold for feature repeatability not only enhanced the repeatability of the radiomic signatures but also maintained their discriminative capability. Similarly, Zhang et al. [52] showed that the exclusion of features with low repeatability from the signature construction process improved the inter-institutional generalizability of the radiomic model.
However, assessing the repeatability of radiomic features in the face of randomness presents a significant challenge, as gold-standard test-retest scans are often not readily available. The repeatability determined from a limited set of test-retest scans may not be universally applicable to other datasets. To address this data scarcity, Zwanenburg et al. [15] introduced a simulation-based approach as an alternative to actual test-retest scans. This method generates pseudo-test-retest scans by applying transformations such as rotation, translation, and noise addition to the original images, along with contour randomizations at the edges of segmentations. The robust features identified through this simulation technique align with those found to be repeatable in actual test-retest scenarios. Building on this, Zhang et al. [57] further validated that simulation methods for identifying repeatable features can lead to the development of generalizable radiomic signatures comparable to those derived from test-retest scans. This suggests the potential of simulation-based methods as a viable solution for overcoming the limitations posed by the scarcity of test-retest data.
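The threshold-based exclusion described above can be sketched in a few lines. The feature names and ICC values below are invented purely for illustration and are not results from any cited study; whether the comparison should be inclusive (`>=`) or strict is also an assumption made here.

```python
def select_repeatable(icc_by_feature, threshold):
    """Return the names of features whose ICC meets or exceeds the threshold."""
    return sorted(
        name for name, icc in icc_by_feature.items() if icc >= threshold
    )


# Made-up ICC values for illustration only; not results from any study.
icc_by_feature = {
    "firstorder_mean": 0.97,
    "glcm_contrast": 0.81,
    "glrlm_run_entropy": 0.62,
    "ngtdm_busyness": 0.34,
}

# Thresholds quoted in the text above.
for threshold in (0, 0.5, 0.75, 0.95):
    kept = select_repeatable(icc_by_feature, threshold)
    print(f"ICC >= {threshold}: {len(kept)} feature(s) kept -> {kept}")
```

Only the features that survive the chosen threshold would then be passed to signature construction, so the trade-off between the number of candidate features and their repeatability becomes explicit.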
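The idea of generating pseudo-test-retest data can be illustrated with a deliberately simplified sketch. It is not the published perturbation method: it only adds Gaussian noise to the image and shifts the ROI mask by a whole-pixel offset as a crude stand-in for positional and contour uncertainty, and it omits rotation, sub-voxel interpolation, and proper contour randomization. The function name, parameter values, and synthetic image are all assumptions.

```python
import numpy as np


def perturb(image, mask, rng, max_shift=1, noise_sd=5.0):
    """Return one pseudo-retest (image, mask) pair from an original pair.

    Simplifications (assumptions of this sketch): additive Gaussian noise,
    and a whole-pixel shift of the mask relative to the image. np.roll wraps
    around the borders, so the ROI should lie away from the image edges.
    """
    noisy_image = image + rng.normal(0.0, noise_sd, size=image.shape)
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    shifted_mask = np.roll(mask, shift=(dy, dx), axis=(0, 1))
    return noisy_image, shifted_mask


rng = np.random.default_rng(seed=0)

# Synthetic 2D "image" and a square ROI; real data would be a CT or MR volume.
image = rng.normal(100.0, 20.0, size=(32, 32))
mask = np.zeros((32, 32), dtype=bool)
mask[10:20, 10:20] = True

# Extract one simple feature (mean ROI intensity) from each perturbed copy.
roi_means = []
for _ in range(20):
    p_image, p_mask = perturb(image, mask, rng)
    roi_means.append(float(p_image[p_mask].mean()))

print(f"ROI mean over perturbations: min={min(roi_means):.1f}, max={max(roi_means):.1f}")
```

Repeating this for every patient yields a subjects-by-perturbations table per feature, from which a reliability index such as the ICC can be computed in the same way as for real test-retest scans.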

Efforts to Address Bias for Generalizable Radiomic Signatures
To ensure the generalizability of a radiomic model, it is critical that it retains its discriminative ability across diverse clinical settings. This entails maintaining performance despite potential variations, such as differences in scanner brands, raw data acquisition protocols, and the methodologies employed by radiologists or physicians in delineating regions of interest. While these factors are not inherently random, they are often difficult to control when applying radiomic signatures in practice. Thus, the approach to mitigating these issues parallels that of addressing randomness, namely, identifying and utilizing radiomic features that are robust to the variations likely to be encountered in real-world application scenarios. Hoebel et al. [58] reported that normalization and intensity quantization can affect the level of repeatability of radiomic features. Moradmand et al. [26] evaluated various combinations of preprocessing steps for multi-parametric MR images and found that a sequence of bias field correction followed by noise filtering produced the most reproducible radiomic features.
Three primary sources of variation must be considered: the scanner used, the image acquisition protocol, and inter-observer variability in contouring. Of these, inter-observer variability can be mitigated by involving multiple observers in the segmentation process and selecting features that consistently perform well despite differences in observer input. Variations arising from different scanners and acquisition protocols are more challenging to address and typically require repeated scans for thorough evaluation. Unfortunately, the limited availability of repeated scan data often restricts the ability to assess the reproducibility of radiomic features in a dataset-specific manner.
A review of the literature may provide insights into which features are reproducible. However, the transferability of feature repeatability across different studies is not always clear, and the absence of a standardized repeatability index for individual features can complicate the identification of robust features. Additionally, calibration of a radiomic model before its application in new clinical settings is strongly recommended to enhance its adaptability and performance.
In summary, the development of a generalizable radiomic model requires careful consideration of potential variations and the selection of features that are resistant to these changes. By incorporating robust preprocessing steps, involving multiple observers in segmentation, and calibrating the model for different settings, researchers can improve the reliability and applicability of radiomic signatures across various clinical environments.

Enhancing the Reporting of Repeatability and Reproducibility in Radiomic Feature Studies
While research into the repeatability and reproducibility of radiomic features has significantly enhanced our understanding of their sensitivity to random effects and biases, leveraging this knowledge to bolster the repeatability and reproducibility of radiomic signatures is paramount. In the developmental stages of radiomic signatures, the feasibility of conducting supplementary test-retest scans for dataset-specific assessments is often limited. Therefore, it becomes crucial for studies focusing on repeatability and reproducibility to explore whether their findings can aid other researchers in evaluating the repeatability and reproducibility of their own radiomic models, especially in scenarios where additional test-retest scans are not available. To support this endeavor, we propose the following specific recommendations:
(1) Detailed Reporting of Variation Sources: Authors should meticulously document any sources of variation encountered across different measurement settings. These include, but are not limited to, changes in scanner types, imaging protocols, and segmentation processes. Such detailed reporting will provide valuable context for understanding the conditions under which the radiomic features were assessed.
(2) Transparent Disclosure of Calculation Parameters: It is imperative to transparently disclose all parameters used in the calculation of radiomic features. This transparency ensures that other researchers can accurately replicate the feature extraction process, facilitating a more reliable comparison of results across different studies.
(3) Careful Selection of a Suitable Reliability Index: Choosing an appropriate reliability index is critical for assessing the repeatability and reproducibility of radiomic features. Researchers should select indices that most accurately reflect the nature of the variations.
(4) Comprehensive Reporting of Reliability Metrics: The reliability metrics for individual features should be thoroughly reported. This comprehensive reporting will allow other researchers to discern which features are most stable and reliable across different datasets and conditions, thereby informing the selection of features for their own radiomic signatures.
By adhering to these recommendations, researchers can facilitate a more precise evaluation of the repeatability and reproducibility of radiomic signatures, even in scenarios where direct test-retest data are unavailable. This approach not only advances the field of radiomics by ensuring the development of more robust and reliable signatures, but also fosters a culture of transparency and reproducibility within the research community.
Positron emission tomography (PET) is another crucial imaging modality in radiology. Unlike CT and MRI, which provide anatomical images with clear tissue boundaries, PET is a functional imaging technique that relies on the type of radiopharmaceutical tracer used. PET is also significant in radiomic research [59]. The concepts discussed in this review can be applied to identify highly repeatable PET features.

Conclusions
In conclusion, the exploration of repeatability and reproducibility in radiomic features has significantly deepened our understanding of their susceptibility to both random effects and systematic biases. This knowledge is indispensable for the advancement of radiomic research, particularly in the development of robust and reliable radiomic signatures that can withstand the variability inherent in clinical settings. However, the practical challenges of conducting additional test-retest scans for dataset-specific evaluations highlight the necessity of a standardized approach in reporting and assessing the repeatability and reproducibility of radiomic features.
To address these challenges, we have proposed a set of recommendations aimed at enhancing the transparency and reliability of radiomic studies. These include the detailed reporting of sources of variation, transparent disclosure of feature calculation parameters, careful selection of reliability indices, and comprehensive reporting of reliability metrics for individual features. Adherence to these guidelines will not only facilitate more accurate evaluation of radiomic signatures in the absence of extensive test-retest data but also contribute to the broader goal of achieving generalizable and clinically applicable radiomic models.

Figure 1. Steps of radiomic feature extraction that could affect radiomic feature values. The red color in ROI segmentation indicates the gross tumor volume as the target in radiotherapy. The color spectrum of Preprocessing indicates the high Hounsfield units (red) and low Hounsfield units (blue) within the gross tumor volume. The color spectrum of Feature extraction indicates the varying feature values.

Figure 2. Flowchart of the selection of publications for the review.

Table 1. Summary of the literature on random effects affecting CT-based radiomic features.

Table 2. Summary of the literature on bias affecting CT-based radiomic features.

Table 3. Summary of the literature investigating random effects affecting MR-based radiomic features.

Table 4. Summary of the literature investigating bias affecting MR-based radiomic features.