Development and Internal Validation of a Risk Prediction Model to Identify Myeloma Based on Routine Blood Tests: A Case-Control Study

Simple Summary
Due to the rarity and non-specific symptoms of myeloma, a key challenge in diagnosing myeloma in primary care is the clinician considering myeloma and initiating appropriate investigations. We developed an algorithm to identify those at high risk of having undiagnosed myeloma based on results from routine blood tests taken for other reasons. We demonstrated that it is possible to combine signals and abnormalities in several routine blood test parameters to identify individuals at high risk of having undiagnosed myeloma with high predictive accuracy. Identified high-risk individuals could then undergo further automatic testing in the laboratory (a reflex test) specifically for myeloma. We estimate the number of additional tests needed to diagnose one case at different model thresholds. Further work is needed to explore the full potential of such a strategy. If successful, automatic reflex testing would enable large-scale, low-cost case-finding for myeloma, with a significant impact on the diagnosis of myeloma.

Abstract
Myeloma is one of the hardest cancers to diagnose in primary care due to its rarity and non-specific symptoms. A rate-limiting step in diagnosing myeloma is the clinician considering myeloma and initiating appropriate investigations. We developed and internally validated a risk prediction model to identify those with a high risk of having undiagnosed myeloma based on results from routine blood tests taken for other reasons. A case-control study, based on 367 myeloma cases and 1488 age- and sex-matched controls, was used to develop a risk prediction model including results from 15 blood tests. The model had excellent discrimination (C-statistic 0.85 (95%CI 0.83, 0.87)) and good calibration (calibration slope 0.87 (95%CI 0.75, 0.90)). At a prevalence of 15 per 100,000 population and a probability threshold of 0.4, approximately 600 patients would need additional reflex testing to detect one case. We showed that it is possible to combine signals and abnormalities from several routine blood test parameters to identify individuals at high risk of having undiagnosed myeloma who may benefit from additional reflex testing. Further work is needed to explore the full potential of such a strategy, including whether it is clinically useful and cost-effective and how to make it ethically acceptable.


Introduction
Multiple myeloma is a cancer of the plasma cells, with approximately 6000 new cases diagnosed annually in the UK [1]. Myeloma has been identified as one of the hardest cancers to diagnose in primary care [2]; it is a relatively rare cancer, which an individual GP will, on average, encounter only infrequently. Myeloma is preceded by the precursor condition monoclonal gammopathy of undetermined significance (MGUS), which is often diagnosed incidentally via routine blood tests. MGUS is common, with a prevalence of 3-5% and a risk of progression to myeloma of approximately 1% per year [17]. Once diagnosed, regular monitoring of blood tests and clinical symptoms is recommended, often guided by MGUS risk-stratification strategies. This has cost and workload implications for the NHS, as well as for the mental health of these patients, the majority of whom will never require treatment for MGUS.
Prior to developing and implementing a laboratory reflex test for myeloma, it is essential to investigate whether an algorithm based on results from routine blood tests can be developed with sufficient accuracy. In this study, we develop and internally validate a risk prediction model to identify those with a high risk of having undiagnosed myeloma based on results from routine blood tests taken for other reasons.

Study Population
A retrospective case-control study used de-identified data from Leeds Teaching Hospitals Trust (LTHT), including clinical, pathology, and demographic data. Cases were defined as patients diagnosed with multiple myeloma (ICD-10 C90) between 2012 and 2019 at LTHT.
Controls were selected from LTHT patients who had at least two FBCs within a two-year period between 2012 and 2019. Patients diagnosed with any other haematological cancer up to 2021 were excluded. Initially, five controls were selected for each case, matched on age, gender, and year. The date of the last blood test was used as the index date. Further restrictions were applied to the control group to remove groups of patients who were critically ill and, therefore, may have extreme outlying values for blood test results. This involved removing controls for whom cancer was recorded as a comorbidity and those whose blood tests were all requested from intensive or palliative care.
All patients with an MGUS diagnosis recorded on their medical record (ICD-10, D47.2) between 2012 and 2019 were also identified for inclusion in a further sensitivity analysis.

Blood Tests
The blood tests included components of FBC and biochemistry blood tests. All tests carried out between 2011 and the diagnosis/index date were included, so all cases and controls had at least one year of blood test history prior to the index date. The blood tests included those requested from primary care and those conducted in hospital.
The total number and median number of blood tests for each individual were calculated, including all measurements up to five years prior to the diagnosis/index date. This analysis assessed how frequently these tests were conducted and helped select the final list of candidate predictors (Supplementary Table S1).
Although repeat measurements of blood tests over time were available, the prediction model was developed on cross-sectional data, so we retained only one blood test result for each patient. For cases, the date the patient started to undergo tests as part of the diagnostic workup for their myeloma was identified, defined as the date of either a paraprotein, serum electrophoresis, or immunofixation test within the 2 years prior to diagnosis. Only blood tests prior to this date were included, and the test result closest to this date was retained. For controls, the test results closest to the index date were included.
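The per-patient selection step above can be sketched as follows. This is an illustrative example only (the study analysis was conducted in R), and the column names are hypothetical:

```python
import pandas as pd

def closest_prior_result(tests: pd.DataFrame, ref_dates: pd.DataFrame) -> pd.DataFrame:
    """For each patient, keep the single test result taken closest to, and no
    later than, that patient's reference date (diagnostic work-up start for
    cases, index date for controls). Column names are hypothetical."""
    merged = tests.merge(ref_dates, on="patient_id")
    prior = merged[merged["test_date"] <= merged["ref_date"]]
    # the most recent test on or before the reference date, per patient
    idx = prior.groupby("patient_id")["test_date"].idxmax()
    return prior.loc[idx].reset_index(drop=True)

tests = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "test_date": pd.to_datetime(["2018-01-05", "2018-06-01", "2019-02-01", "2018-03-01"]),
    "haemoglobin": [13.1, 12.4, 11.0, 14.2],
})
ref_dates = pd.DataFrame({
    "patient_id": [1, 2],
    "ref_date": pd.to_datetime(["2018-12-31", "2018-12-31"]),
})
picked = closest_prior_result(tests, ref_dates)
```

Here patient 1's 2019 test falls after the reference date and is dropped, and the June 2018 result is retained as the closest prior value.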

Sample Size and Events per Parameter
The sample size was calculated based on the recommendations from Riley et al. for the minimum number of events per parameter [18]. The inputs were the outcome prevalence in this case-control study (27%, after further exclusion of controls), an assumed AUC of 0.80 (based on similar work by Koshiaris et al. [16]), and a shrinkage factor of 0.9. Based on these inputs, a minimum of 9.8 events per parameter was estimated. We included 375 cases and were therefore able to include up to 38 model parameters. We included age, sex, and 15 blood tests, allowing for non-linear terms for continuous variables.
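The final arithmetic of this parameter budget can be checked directly. The 9.8 events-per-parameter minimum is taken from the text; the degrees of freedom assigned per predictor below are our illustrative assumption, not the authors' exact specification:

```python
# Back-of-envelope check of the events-per-parameter budget described above.
events = 375                   # myeloma cases available for model development
min_epp = 9.8                  # minimum events per parameter (Riley et al. criteria)
max_params = events / min_epp  # maximum supported parameters: ~38.3

# assumed budget: 15 blood tests + age as 3-knot splines (2 df each) plus sex (1 df)
n_params = 16 * 2 + 1
assert n_params <= max_params
```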

Candidate Predictors
Candidate predictors initially considered were components of FBC and biochemistry blood tests (shown in Supplementary Table S1) that are commonly conducted in primary care and known to be abnormal at the point of a myeloma diagnosis. These were refined for inclusion in the modelling stage based on existing literature, clinical opinion, and data availability. The final list of variables to take forward to the modelling stage was selected via discussion with the clinical team, including a GP (RN) and haematologists (GC and JC). We only included routine blood tests well recorded within the data extract (recorded for at least 50% of cases and controls). The final fifteen blood tests were: albumin, alkaline phosphatase (ALP), alanine transaminase (ALT), basophils, calcium, creatinine, C-reactive protein (CRP), eosinophils, haemoglobin, lymphocytes, mean cell volume (MCV), monocytes, neutrophils, platelets, and white cell count (WCC) (Supplementary Table S2).

Model Development
A risk prediction model was developed using logistic regression. A full modelling approach was adopted in which all predictors were included in the model regardless of their statistical significance; this avoids the overfitting issues associated with data-dependent predictor selection methods. To account for potential non-linearity, continuous predictors were modelled using restricted cubic splines with three knots placed at the recommended locations (the 10th, 50th, and 90th percentiles) [19].
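As an illustration of this modelling approach (the study itself used R's rms package), a restricted cubic spline basis with three knots can be constructed and fed to a logistic regression as follows; the data here are synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell's parameterisation) for 3 knots:
    returns the linear term plus one non-linear term; the curve is constrained
    to be linear beyond the outer knots."""
    t1, t2, t3 = knots

    def pos3(u):
        # truncated cubic: max(u, 0) ** 3
        return np.clip(u, 0, None) ** 3

    nonlin = (pos3(x - t1)
              - pos3(x - t2) * (t3 - t1) / (t3 - t2)
              + pos3(x - t3) * (t2 - t1) / (t3 - t2)) / (t3 - t1) ** 2
    return np.column_stack([x, nonlin])

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# synthetic binary outcome with a non-linear (quadratic) effect of x
p = 1 / (1 + np.exp(-(x ** 2 - 1)))
y = rng.binomial(1, p)

knots = np.percentile(x, [10, 50, 90])  # knot placement as described in the text
X = rcs_basis(x, knots)
model = LogisticRegression().fit(X, y)
```

With three knots, each continuous predictor contributes two parameters (one linear, one non-linear), which is how non-linearity is captured without data-driven variable selection.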

Missing Data
Multiple imputation using chained equations was used to account for missing data, assuming the data were missing at random [20]. The imputation model included all predictors and the outcome, using predictive mean matching for continuous variables. Model convergence was assessed by examining convergence plots and comparing the distributions of observed and imputed values. Twenty imputed datasets were generated, and analyses were conducted on each imputed dataset and pooled using Rubin's rules [20].
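Rubin's rules for pooling an estimate across imputed datasets can be sketched as follows; the numbers are a toy example, not values from the study:

```python
import numpy as np

def rubins_rules(estimates, variances):
    """Pool a point estimate (e.g. a model coefficient) across m imputed
    datasets: the pooled estimate is the mean, and the total variance combines
    within- and between-imputation variance."""
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = len(estimates)
    pooled = estimates.mean()                   # pooled point estimate
    within = variances.mean()                   # average within-imputation variance
    between = estimates.var(ddof=1)             # between-imputation variance
    total_var = within + (1 + 1 / m) * between  # Rubin's total variance
    return pooled, total_var

# toy example: the same coefficient estimated on 5 imputed datasets
est = [0.80, 0.82, 0.78, 0.81, 0.79]
var = [0.010, 0.012, 0.011, 0.009, 0.010]
pooled, total_var = rubins_rules(est, var)
```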

Internal Validation
The model was internally validated to assess the model discrimination and calibration. Model discrimination, which measures the ability of the model to differentiate between cases and controls, was measured using the C-statistic. Model calibration, which measures the agreement between predicted and observed risks, was assessed using calibration plots and the calibration slope. The calibration plot was obtained by generating 10 groups based on the deciles of predicted probability, plotting the predicted risks against the observed risks, and fitting a weighted smoothing trend line (loess smoother) between model predictions and outcome. The calibration slope was calculated by fitting a logistic regression for the outcome with the model linear predictor as the only independent variable. The coefficient for the linear predictor is the calibration slope (the ideal calibration slope is 1). Internal validation using bootstrapping with multiple imputation was conducted using 1000 bootstrap samples drawn from each imputed data set and results pooled to obtain the optimism-adjusted estimates.
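The two headline validation metrics can be illustrated on synthetic data. This is a sketch only (the study's validation used R with bootstrapping across imputed datasets; the bootstrap optimism correction is omitted here for brevity):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 1000
lp = rng.normal(size=n)                      # model linear predictor (log-odds scale)
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))  # outcomes generated from that predictor

# Discrimination: the C-statistic equals the area under the ROC curve.
c_stat = roc_auc_score(y, lp)

# Calibration slope: refit a logistic regression with the linear predictor as
# the only covariate; the fitted coefficient is the calibration slope (ideal 1).
# C is set large to effectively disable sklearn's default regularisation.
slope = LogisticRegression(C=1e6).fit(lp.reshape(-1, 1), y).coef_[0, 0]
```

Because the outcomes here are generated from the linear predictor itself, the fitted slope comes out close to 1; a slope below 1, as reported for the study model, indicates predictions that are too extreme (overfitting).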

Diagnostic Accuracy Statistics
Model sensitivity and specificity were obtained for probability thresholds ranging from 0.1 to 0.5 (in intervals of 0.05). The prevalence of myeloma in a primary care population (where the prediction algorithm would be implemented) was estimated based on published studies and official national statistics [12,16,21]. We estimated the positive predictive value (PPV), negative predictive value (NPV), number of cancers detected and missed, the total number of patients who would require additional reflex testing, and the number needed to reflex test to diagnose one case at each probability threshold.
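The downstream workload metrics follow from sensitivity, specificity, and prevalence alone. The sketch below uses hypothetical sensitivity and specificity values as placeholders, not the model's estimates:

```python
def reflex_test_workload(sens, spec, prevalence, population=100_000):
    """Derive reflex-testing workload per `population` people from sensitivity,
    specificity, and disease prevalence."""
    cases = population * prevalence
    non_cases = population - cases
    tp = sens * cases              # cases detected
    fp = (1 - spec) * non_cases   # non-cases flagged for reflex testing
    flagged = tp + fp             # total needing additional reflex tests
    ppv = tp / flagged            # positive predictive value
    nnt = flagged / tp            # number needed to reflex test per case found
    return flagged, ppv, nnt

# e.g. a prevalence of 15 per 100,000 with hypothetical sens/spec of 0.60/0.95
flagged, ppv, nnt = reflex_test_workload(0.60, 0.95, 15 / 100_000)
```

At such a low prevalence the false positives dominate, so even high specificity yields a small PPV and a large number needed to reflex test; this is the trade-off explored across thresholds in Table 3.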

Sensitivity Analysis and Additional Models
Several additional models based on further exclusion criteria were run as sensitivity analyses. The additional models are listed below.
Calcium excluded: The model was run again, excluding calcium as a predictor. Calcium is an important predictor of myeloma [11]; however, it was the blood test conducted least often in both the case and control groups and had the highest level of missing data. In clinical practice, if a GP is requesting a calcium test, they may already be considering the possibility of cancer.
Over 60s only: The model was run again, restricting the age of cases and controls to include those over 60 years of age only. The median age at diagnosis for myeloma is 71 years, with over 80% of new cases diagnosed in those over 60 years of age [21]. In our study cohort, there were 66 cases and 348 controls aged <60 years who were excluded from this analysis.
MGUS as controls: The MGUS cohort (n = 392) was excluded from the main model development, as our main aim is to identify myeloma cases and we wish to avoid inadvertently detecting a large number of MGUS cases. In this sensitivity analysis, the MGUS cohort was included in the control group, and the model was developed and validated using this updated control group.
Further models based on combinations of the above criteria were also run:
1. All ages, MGUS as controls, and calcium excluded as a predictor.
2. Over 60s only, MGUS as controls, and calcium excluded as a predictor.
Each of these models was internally validated, and the model's discrimination and calibration were assessed by deriving the C-statistic and the C-slope. The diagnostic accuracy statistics were calculated for each model as described above.

Software
Analysis was carried out in R version 4.0.2 using the following packages: dplyr for data manipulation [22], Hmisc and rms for descriptive statistics and model fitting [19], mice and naniar to explore missing data patterns and perform multiple imputation [23,24], pROC for model discrimination and diagnostic accuracy statistics [25], and psfmi for model validation with multiple imputation [26].
The study is reported in accordance with Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) guidance [27].

Results
There were 439 myeloma patients diagnosed between 2012 and 2019; after all exclusions, 367 cases and 1488 controls were included in the prediction modelling (Figure 1). The baseline characteristics of the cohort are shown in Table 1. The median age of the cases was 71 years (IQR 63, 79), and 58% were male. Supplementary Figure S1 and Table S3 show the characteristics of the MGUS cohort. Supplementary Figures S2 and S3 show the trends and distributions of blood test results for cases and controls.
Approximately one-third of the included blood tests were from primary care (Supplementary Table S4). The percentage of missing data for candidate predictors was greatest for calcium (35%) and CRP (28%), with a higher percentage of missing data among controls than cases. Data were missing for 5-6% of patients for albumin, ALP, and ALT; for all other tests, the percentage with missing data was <1% (Supplementary Table S5).
Abbreviations (Table 1): IQR = interquartile range, ALP = alkaline phosphatase, ALT = alanine transaminase, CRP = C-reactive protein, MCV = mean cell volume, WCC = white cell count.

Predictor Variables
Associations between each predictor and the outcome from the final model are shown in Figure 2, and the model coefficients are in Supplementary Table S6.

Model Discrimination and Validation
The C-statistic directly calculated from the dataset was 0.87 (95%CI 0.84, 0.89). After internal validation via bootstrapping, the C-statistic was 0.85 (95%CI 0.83, 0.87). The calibration slope was 0.87 (95%CI 0.75, 0.90), suggesting some model overfitting. The calibration plot is shown in Figure 3, and there is some suggestion of underestimation of risk in the groups with the highest risk. Supplementary Figure S4 shows histograms of the predicted probabilities for cases and controls.


Sensitivity Analysis
Model fit statistics for the series of sensitivity analysis models are shown in Table 2. The C-statistic ranged from 0.81 (95%CI 0.79, 0.84) for the model including the MGUS cohort as controls and excluding calcium as a predictor, to 0.86 (95%CI 0.84, 0.89) for the model based on the over 60s and including all blood tests. The C-slope ranged from 0.86 to 0.88.


Diagnostic Accuracy
The diagnostic accuracy of all models at a range of threshold probabilities is shown in Table 3. At a probability threshold of 0.2, the sensitivity was 77.1% (95%CI 72.8, …). Abbreviations: CI = confidence interval, PPV = positive predictive value, NPV = negative predictive value.
Assuming a prevalence of 15 cases per 100,000 population and a probability threshold of 0.2, the number of additional reflex tests to be carried out ranges from 14,009 to 21,508 depending on the chosen model, with the corresponding number needed to reflex test to diagnose one case ranging from 1269 to 1853. Increasing the probability threshold to 0.4 decreases the total number of additional reflex tests to between 4006 and 5807, with the number needed to reflex test to diagnose one case ranging from 573 to 704. Supplementary Tables S7 and S8 and Figure S5 show these metrics for different prevalence estimates.

Key Findings
We have developed a risk prediction model to identify individuals at high risk of having undiagnosed myeloma based on the results of blood tests taken for other reasons. The model shows promising performance after internal validation and sensitivity analysis, with good discrimination and calibration. Individual symptoms and blood test abnormalities have low predictive value. However, in this proof-of-concept study, we showed it is possible to combine signals and abnormalities in several routine blood test parameters to identify individuals at high risk of having undiagnosed myeloma who may benefit from additional reflex testing specifically for myeloma.
The model is based on blood tests commonly conducted in primary care. Depending on the chosen probability threshold, the number of patients identified for additional reflex testing varies considerably. At a probability threshold of 0.1, approximately 40,000 individuals would be identified to undergo further testing; this reduces to around 5000 individuals at a probability threshold of 0.4, with the corresponding number needed to test to diagnose one case reducing from around 3000 to 600. However, at higher probability thresholds, the number of cancers missed increases.

Comparison with Other Studies
Previous prediction models have focussed on including blood test results alongside presenting symptoms to identify high-risk patients. Our study was not reliant upon patients reporting specific symptoms, so it is not directly comparable. However, the previously developed models included several blood test abnormalities, including anaemia, leucopenia, hypercalcaemia, and raised inflammatory markers [3,16].
Raised inflammatory markers are associated with increased cancer risk [28]. Specifically for myeloma, previous research has shown that PV and ESR are more useful for diagnosis than CRP [15]. In our model, we included CRP rather than PV or ESR, as it was the most commonly used inflammatory marker in our cohort; PV was missing for 31% of cases and 73% of controls, and ESR was missing for 76% of cases and 87% of controls. These tests are also more likely to be performed in patients where cancer is suspected, and therefore their predictive value may appear higher. While missing data methods could have been used to impute these values and allow these markers to be included in the model development, the tests were conducted in only a small proportion of the study sample; relying on them when the algorithm is implemented clinically would lead to many exclusions, so we decided not to include them in the model development. This is also supported by data from CPRD, which show that in the prediction model developed by Koshiaris et al., ESR was missing for 74%, CRP for 81%, and PV for 96% of the study population [16], and Watson et al. showed that PV was the least commonly used inflammatory marker [28].
Hypercalcaemia plays an important role in the diagnosis of myeloma and is one of the CRAB features essential for its diagnosis [29]. Of all the blood tests included in our model, calcium was the one conducted least often (available for 84% of cases and 60% of controls). Patients with hypercalcaemia have an increased risk of cancer [30], and it may be that when this test is requested, the clinician is already considering the possibility of a cancer diagnosis. For these reasons, we re-ran our model excluding calcium as a predictor and still obtained similar model discrimination and calibration. Again, this is important for implementing the developed model, so that patients are not excluded if they have not had all the included blood tests.

Clinical Implications
A major challenge in diagnosing myeloma in primary care is the clinician suspecting myeloma and appropriately testing for it. Our proposed approach based on reflex testing removes this step by identifying high-risk individuals from blood tests already being conducted, with minimal impact on NHS resources. However, there are challenges and several issues to consider with this strategy. Reflex and reflective testing are common in UK laboratories, but there is variation between laboratories and in the approaches used when deciding whether to contact the requestor before adding a particular test [31]. Whether the additional test would be added automatically, based on our developed algorithm, or would depend on a laboratory specialist adding it, and the mechanisms for doing so, need to be defined. Setting appropriate thresholds for the test is challenging, along with risk communication. There are serious ethical issues around the potential diagnosis of a disease that was not originally considered when the blood test was requested. Further work is needed to explore all these issues with a range of stakeholders, including patients and the wider population, to determine whether this strategy is clinically useful and cost-effective and how to make it ethically acceptable.
The number of individuals who would undergo additional testing varies considerably depending on the probability threshold chosen. Further economic evaluation work is needed to explore whether this type of automatic triage testing pathway can be cost-effective and to establish appropriate thresholds that incorporate the costs of the additional tests. This study provides the evidence base to conduct this health economic analysis, which we are undertaking as the next stage of this project. One advantage of this method is that the additional tests used in the diagnostic workup for myeloma (serum protein electrophoresis, immunofixation, and serum free light chain assays) are sensitive and specific initial tests for the majority of patients and relatively inexpensive compared to other diagnostic methods.
While trying to improve the early diagnosis of myeloma, it is important to avoid inadvertently diagnosing a large number of MGUS cases, which would have a significant impact on patients' mental health, as the majority of patients will never require treatment for MGUS. It would also have significant cost and workload consequences for the NHS. In a sensitivity analysis, we combined the group of patients with a recorded MGUS diagnosis with the control group and found that the model had similar predictive performance to the model excluding this group. It should also be acknowledged that the control group may include patients with undiagnosed MGUS. A population-based screening trial for MGUS is ongoing in Iceland, where participants with MGUS are randomised to different follow-up strategies, providing vital evidence on the cost-effectiveness of screening and monitoring approaches [32].

Strengths and Limitations
The key strengths of this study are the methods used in model development. All blood tests included were those commonly conducted in primary care and undertaken for reasons other than a cancer diagnosis; the model is not reliant upon the patient reporting specific symptoms or the clinician considering myeloma. All test results were considered on a continuous scale, and we did not use reference ranges to define abnormal results, which may vary across laboratories; the dichotomisation or categorisation of data also results in a loss of statistical power [33]. Spline functions were used to model the relationship between predictors and the outcome rather than assuming a linear relationship [34], and multiple imputation was used to account for missing data [20].
The major limitation of our study was the selection of the control group. This was based on hospital patients who had at least two FBCs over two years. These patients may not be representative of the wider population in which our developed model would be implemented. The blood test data we had access to were not limited to those conducted in hospital, and around one-third of those included were requested by GPs. We tried to minimise some of this bias further by excluding tests with extreme values indicative of critical illness. Furthermore, the planned reflex testing would only be applied to patients undergoing blood tests, who may represent a sicker population, and there is evidence to suggest an increased cancer risk in patients who undergo testing even if the test results are within the normal range [28,35-37]. We adopted a case-control approach, rather than a cohort study design, to ensure a sufficient number of cases for robust model development. Further validation of our developed model in a primary care population is needed using a cohort approach.
Full data were not available for all tests included in the model. While we overcame this in the model development using methods to deal with missing data, there may be implications for implementation: if patients have not undergone the required blood tests, we may not be able to calculate their predicted risk. Due to sample size limitations, we could only use up to 15 blood tests. Other tests may have been important, including those used to determine myeloma prognosis, but these tests were less likely to be conducted prior to diagnosis, again limiting the clinical utility of the developed model.
The model was based on cross-sectional data using results from one time point for each individual. For myeloma patients, we used results from before the initiation of the diagnostic work-up for myeloma. Changes in some blood tests can be seen up to 2-3 years before diagnosis [14,15]. Further work is required to identify whether such longitudinal trajectories and changes can be incorporated into a risk prediction model, with growing interest in methods incorporating repeated measures captured within electronic health records [38,39].

Conclusions
To improve the timeliness of myeloma diagnosis, it is essential to identify high-risk individuals and initiate appropriate investigations. In this proof-of-concept study, we showed it is possible to combine signals and abnormalities in several routine blood test parameters to identify individuals at high risk of having undiagnosed myeloma who may benefit from additional reflex testing. Further work is needed to explore the full potential of such a strategy. If successful, automatic reflex testing would enable large-scale, low-cost case-finding for myeloma, with a significant impact on myeloma diagnosis.

Supplementary Materials:
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/cancers15030975/s1. Table S1: Number of tests available up to 5 years prior to the index date for myeloma patients, MGUS patients, and controls. Table S2: Candidate predictor variables included in model development. Table S3: Baseline characteristics of the MGUS cohort. Table S4: Most frequent requesting departments for blood test results used in model development. Table S5: Number and percentage with missing data for blood tests in cases and controls. Table S6: Model coefficients for the final model. Table S7: Estimated prevalence based on published studies and national cancer data. Table S8: Total number to additionally test at different prevalence estimates and probability thresholds for each model. Figure S1: Flow chart of the MGUS cohort. Figure S2: Trend plots of blood tests for cases and controls, mean values (and 95% CIs) within 90-day periods. Figure S3: Box plots of the distribution of blood tests for cases and controls for test results included in the model. Figure S4: Predicted probabilities for cases and controls (over imputed models). Figure S5: Number needed to additionally test to diagnose one case at different prevalence estimates and probability thresholds for each model.
Author Contributions: L.S., J.C., G.C., B.S. and R.D.N. contributed to the conceptualisation of the project. L.S. and R.D.N. are joint Principal Investigators of the study. L.S. performed the data analysis, drafted the manuscript, and is the study guarantor. All authors provided critical comments, edited the manuscript, and approved its final version. All authors have read and agreed to the published version of the manuscript.
Funding: The study is funded by a CRUK Early Detection and Diagnosis Primer award (EDDPMA-May21\100024). This work arises from the CanTest Collaborative, funded by Cancer Research UK (C8640/A23385), of which RDN is an Associate Director and BS a Senior Faculty Member.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki and approved by the Yorkshire and the Humber-Sheffield Research Ethics Committee (20/YH/0304 on 10/11/2020).

Informed Consent Statement:
Not applicable (patient informed consent was not required as the dataset consists of de-identifiable patient information collected from LTHT with the freedom to opt out).

Data Availability Statement:
The data are not publicly available due to restricted access. The data used for this research project are available to the research community via application to the LTHT Research Data and Informatics team. The R code used in the analysis can be found at https://github.com/LFairleySmith/MyelomaPrediction (accessed on 25 January 2023).