A machine learning program to identify COVID-19 and other diseases from hematology data

Aim: We propose a method for screening full blood count metadata for evidence of communicable and noncommunicable diseases using machine learning (ML). Materials & methods: High dimensional hematology metadata were extracted over an 11-month period from Sysmex hematology analyzers from 43,761 patients. Predictive models for age, sex and individuality were developed to demonstrate the personalized nature of hematology data. Both numeric and raw flow cytometry data were used for supervised and unsupervised ML to predict the presence of pneumonia, urinary tract infection and COVID-19. Heart failure was used as an objective to demonstrate method generalizability. Results: Chronological age was predicted by a deep neural network with R²: 0.59; mean absolute error: 12; sex with area under the receiver operating characteristic curve (AUROC): 0.83, phi: 0.47; individuality with 99.7% accuracy, phi: 0.97; pneumonia with AUROC: 0.74, sensitivity 58%, specificity 79%, 95% CI: 0.73–0.75, p < 0.0001; urinary tract infection with AUROC: 0.68, sensitivity 52%, specificity 79%, 95% CI: 0.67–0.68, p < 0.0001; COVID-19 with AUROC: 0.8, sensitivity 82%, specificity 75%, 95% CI: 0.79–0.8, p = 0.0006; and heart failure with AUROC: 0.78, sensitivity 72%, specificity 72%, 95% CI: 0.77–0.78; p < 0.0001. Conclusion: ML applied to hematology data could predict communicable and noncommunicable diseases, both at local and global levels.

uses fluorescence flow cytometry, impedance, hydrodynamic focusing and cyanide-free sodium lauryl sulphate (SLS) for hemoglobin, and is capable of processing up to 100 samples/h using an 88 μl sample volume. Up to 38 clinical parameters and 50 research parameters, or derivatives thereof, are produced with up to 23 scattergrams and 4 histograms. The instrument stores up to 100,000 records in a buffer. A glossary of hematology parameter acronyms and explanations is available in Supplementary Table 2.

SARS-CoV-2 PCR testing
Respiratory pathogen PCR testing was performed at the clinical laboratory at WDHB using a BDMax (Becton, Dickinson and Company, NJ, USA). Testing included screening for SARS-CoV-2, influenza A and B, and respiratory syncytial virus. Positive PCR results were used to identify cases, whereas negative PCR results were considered controls. Untested patients were not used in ML models for COVID-19, as the infection could not be fully excluded in all patients due to shifting diagnostic and testing criteria. Some patients underwent repeated COVID-19 testing for either confirmation or exclusion of infection. PCR results were linked to hematology data through a laboratory information system.

Prediction of age, sex & individuality
After excluding data with missing demographics, ML models were developed for sex, age and individuality. The purpose was to demonstrate the power of hematology data to discriminate and predict objectives using feature patterns that would otherwise be indistinguishable to a human observer. These objectives are clearly definable but not outwardly clinically useful.

Prediction of infectious diseases & COVID-19
First-presentation FBCs were randomly selected from the total dataset as controls for training models for both pneumonia and urinary tract infection. Only one FBC from each unique patient was used to ensure models did not train on features of individuality. Models were trained, tested and then validated in an independent cohort. Due to the low number of COVID-19 PCR-positive cases, serial results for each positive case were used for descriptive statistics and in ML models, on the assumption that this would include the various stages of the disease and convalescence. Hematology data from patients who had undergone a respiratory PCR test, analyzed on North Shore Hospital's XN-3000, were used for training, with validation performed on data from Waitakere Hospital's XN-1000. All data were pooled for binary prediction of a positive SARS-CoV-2 PCR result and for identification of a specific virus in PCR-positive cases. Models were then applied to data from 9 June 2020 to 24 August 2020, during New Zealand's second wave.
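The two cohort-construction steps described above (one FBC per unique patient, and training on one site's analyzer with validation on another's) can be sketched as follows. This is a minimal illustration only: the record fields `patient_id`, `collected` and `site`, the helper names and the toy records are all hypothetical and are not taken from the study's data or code.

```python
from datetime import date


def first_fbc_per_patient(records):
    """Keep only the earliest FBC for each patient (hypothetical helper)."""
    earliest = {}
    for rec in records:
        pid = rec["patient_id"]
        if pid not in earliest or rec["collected"] < earliest[pid]["collected"]:
            earliest[pid] = rec
    return list(earliest.values())


def split_by_site(records, train_site, validation_site):
    """Train on one site's results and validate on another's."""
    train = [r for r in records if r["site"] == train_site]
    validation = [r for r in records if r["site"] == validation_site]
    return train, validation


# Toy records, invented for illustration only.
records = [
    {"patient_id": "A", "collected": date(2020, 3, 2), "site": "North Shore"},
    {"patient_id": "A", "collected": date(2020, 3, 9), "site": "North Shore"},
    {"patient_id": "B", "collected": date(2020, 4, 1), "site": "Waitakere"},
    {"patient_id": "C", "collected": date(2020, 4, 5), "site": "North Shore"},
]

unique = first_fbc_per_patient(records)
train, validation = split_by_site(unique, "North Shore", "Waitakere")
print(len(unique), len(train), len(validation))
```

Restricting each patient to a single result keeps a model from learning patient identity rather than disease, which is the concern raised above for the pneumonia and urinary tract infection models.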

Prediction of heart failure
Heart failure was chosen as an example of a common noncommunicable disease, as it was the original objective of this project in July 2019. Due to the higher number of patients, only single FBC results were used for ML, using an approximate number of matched, randomly selected controls. Multiple models were generated using either hematology data alone or hematology data combined with age, demographics and standard laboratory biochemistry data, and were compared in an independent validation, inclusive of the remaining total dataset.

Statistics & ML
Univariate analysis was performed using Student's t-test for continuous parametric variables, and receiver operating characteristic curve analysis was used to assess the performance of diagnostic biomarkers by c-statistic. All tests were two-tailed and p < 0.05 was deemed statistically significant, except where Bonferroni correction for multiplicity was applied. MedCalc software version 16.8.4 was used to analyze the data. BigML (https://bigml.com/) was used for applying ML models, using decision trees, ensembles, logistic regression and deep neural networks (DNN) with transparency (https://static.bigml.com/pdf/BigML_Classification_and_Regression.pdf?ver=c306567#page=250). Model development involved splitting data 80:20 into training and test sets and, in the context of pneumonia, urinary tract infection and heart failure, included an independent validation set. OptiML, an automated BigML optimization process for model selection and parametrization, was used to find the best supervised model for classifying sex and for predicting age using regression. OptiML uses Bayesian parameter optimization and Monte Carlo cross-validation (https://static.bigml.com/pdf/BigML_OptiML.pdf?ver=c306567). Unsupervised ML included t-distributed stochastic neighbor embedding (t-SNE), using the Embedding Projector (https://projector.tensorflow.org/) applied to numeric CSV data, and uniform manifold approximation and projection (UMAP) to visualize high dimensional flow cytometry FCS data.
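For readers less familiar with the performance measures reported in this paper, the following sketch shows how sensitivity, specificity, the phi coefficient (equivalent to the Matthews correlation coefficient for a binary outcome) and the AUROC (c-statistic) relate to a set of predictions. It is not the MedCalc or BigML implementation. The labels, scores and the 0.5 decision threshold are invented for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 1 (case) and 0 (control)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def phi_coefficient(tp, fp, tn, fn):
    """Phi (Matthews correlation) coefficient; 0.0 if any margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auroc(y_true, scores):
    """AUROC as the probability that a random case outscores a random control."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 4 cases and 4 controls with model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
phi = phi_coefficient(tp, fp, tn, fn)
auc = auroc(y_true, scores)
print(sensitivity, specificity, phi, auc)
```

Sensitivity, specificity and phi depend on the chosen threshold, whereas the AUROC summarizes discrimination across all thresholds.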

Data availability
The materials, data, code and associated protocols are available to readers with application to the corresponding author and Waitematā privacy, security and governance group with a limited data sharing agreement. BigML models will be shared without limitations.

Table 1 and biochemistry in Table 2. Summary results are available in Table 3.

A total of 13 patients with more than 100 unique serial FBCs were selected from the complete dataset with the aim of predicting individuality in FBC patterns. A DNN predicted an individual patient's identity with 99.7% accuracy, F-measure: 0.97, precision: 100%, recall: 94.7% and phi: 0.97. The features used for this prediction varied according to ML method, but MicroR (%) consistently ranked as the highest feature. To visualize the personalized nature of FBC metadata, a t-SNE was created using only ten FBC results from six patients (Figure 2).

UMAP visualizations were created to demonstrate the differences between COVID-19 and non-COVID-19 patients, as well as trajectories in cellular response over time and outcome (Figures 4 & 5). UMAP parameters were generated using all four available parameters in the FCS files.

Results
UMAPs were overlaid with recognizable population labels (lymphocyte, monocyte, neutrophil, immature granulocyte, blast/aberrant lymphocyte and ghost/debris) defined by manual gating (as per Sysmex guidelines) to aid in interpretation (Figure 4, top). Cells from each patient's first (blue) and last (red) time point sampled were plotted together to determine shifts in cellular population characteristics over the course of disease (Figure 4, bottom). In all patients, there was a noticeable shift in the population characteristics between the first and last time points. In most instances, these differences were within immune cell subsets (e.g., a change in neutrophil phenotype) rather than a change in the overall immune signature (e.g., a change in the ratio of neutrophils to monocytes). t-SNE was used to visualize temporal changes and clustering of numeric hematology data, comparing survivors with nonsurvivors (Figure 6).

Prediction of heart failure
A total of 237 heart failure patients and 384 controls were used for training and testing models for heart failure and cardiomyopathy. Models including age, sex and ethnicity differed negligibly from models excluding these demographics, so, to aid reproducibility, these data were excluded. A logistic regression model returned an AUROC: 0.91, phi: 0.72 in training and AUROC: 0.91, phi: 0.62 in the test set. In the validation set of 138 cases and 42,615 controls, the AUROC was 0.78, with sensitivity of 72%, specificity of 72%, 95% CI: 0.77–0.78; p < 0.0001 (Supplementary Figure 2). The highest ranked features were RDW-SD (fL), PCT/M, BA-D# (10^9/L), LY-WY and LY-Z (ch).

Discussion
In this paper, we aimed to demonstrate the hidden power of hematology metadata derived from a standard FBC. We have shown the ability to predict age, sex and individuality with a high degree of accuracy. Similarly, imperceptible patterns within hematology data (Figure 7) allowed the prediction of infectious diseases such as COVID-19, pneumonia and urinary tract infection, as well as noncommunicable diseases such as heart failure. Although important, these predictions may not currently be sufficient to have clinical utility. However, with larger datasets spanning a wider breadth of pathophysiology, there may be an opportunity to improve upon these predictions. There are numerous examples where similar approaches, using standard laboratory data, have been used to predict the presence of COVID-19 with a relatively high degree of accuracy [12,16–18,26–33]. We used here only hematology data, demonstrating individual variables such as HFLC, previously associated with COVID-19 [34,35]. ML applied to larger datasets of COVID-19 cases would likely provide the ability to prognosticate as well as diagnose. In a study by Soltan et al., ML applied to laboratory data and e-vitals, inclusive of temporal trajectories, was validated both retrospectively and prospectively with a reported accuracy of 92.5% [15]. Thanks to the digital infrastructure at Waitematā District Health Board, e-vitals are captured digitally, giving us the capacity to include these data in future model iterations. Anecdotally, we have seen altered sleep/wake cycles, overnight hypoxia and abnormal heart rate variability in individual cases of COVID-19 (Figure 8). Perturbations in these variables, known to be influenced by pathogens [36], are probably what was identified in the DETECT study, which used wearable devices to identify influenza-like illness [37]. This also appears to be translatable to COVID-19 [33,38].

Figure 5. Discriminatory patterns between COVID-19 and non-COVID-19 patients and COVID-19 nonsurvivors.
Samples were processed in the OMIQ platform, which allowed for metadata to be assigned to each sample. Samples were labeled by patient, time series point and whether the patient was deceased or not. Cells were initially manually gated based on the manufacturer's guidelines for WDF samples. A UMAP analysis was performed using all available signal parameters (side scatter, side fluorescence, forward scatter and forward scatter pulse width).

Blood is a rich source of information, and this study demonstrates that rapidly accessible, inexpensive data that are otherwise purged can be put to new uses. Although this method, applied as a screening tool to tens of thousands of FBCs, will generate false positives, it could be deployed with an automated alert or could trigger downstream confirmatory laboratory tests, for example, NT-proBNP for heart failure or metagenomic sequencing. Other emerging high-throughput 'omic' technologies, such as metabolomics, have been used with ML to predict both the presence of COVID-19 with a high degree of accuracy and the severity of pneumonia [39]. This has only been achievable using the shared resource of UK Biobank [26]. Host metabolomic profiling could be achievable in New Zealand, which would facilitate screening the population for targeted isolation or prioritized vaccination [40]. Sysmex hematology analyzers utilize flow cytometry to produce not only extracted numeric data, but also high dimensional data contained within FCS files, similar to other 'omic' technologies. Numerous flow cytometry studies have demonstrated the utility of this method in identifying host responses to COVID-19.
There appear to be individual immunophenotypes which are likely to predict not only prognosis but also response to treatment [41]. Artificial intelligence has been used in numerous ways to combat COVID-19, including not only the use of blood-based laboratory data but also radiomics and remote monitoring devices [42]. The hidden potential of these tools should be explored in other communicable and noncommunicable diseases, as they are likely to provide unexpected long-term benefits for numerous other health conditions, for example, cardiovascular disease and heart failure [43]. For instance, a study using ML applied to longitudinal FBC data showed high accuracy in predicting the presence of bowel cancer, and this has been translated into a clinical tool [44]. Integration of other sources of highly abundant laboratory data with hematology data has been used to predict age or 'biological age' [45], also known as phenotypic age, which itself is associated with poorer outcomes in COVID-19 [46]. Integrating other forms of nonlaboratory data is likely to improve upon these predictions, and we have previously shown that integrating conventional ECG parameters with laboratory data predicts the presence of heart failure with a high degree of accuracy (AUC: 0.94, sensitivity: 74%, specificity: 94%, phi: 0.58) [47].
New Zealand's digital infrastructure has many advantages for the application of this technology. First, for the most part, it has embraced digital medicine and data are abundant; second, the population is ethnically diverse; third, ICD10 and outcome data are centralized and there is excellent longitudinal data capture. Testsafe electronic laboratory results has not only revolutionized access to blood results for both medical professionals and patients but has also been collecting and storing data for well over a decade. Moreover, Sysmex hematology analyzers have an 85% market share in New Zealand (Figure 9), meaning that connecting systems together could increase the power of the network in proportion to the square of the number of connected systems (Metcalfe's law). In such a national, or even international, system the ML methods described here would not only benefit individual clinicians and hospitals but could also provide real-time assessment of populations for public health planning during pandemics. Since flow cytometry involves the profiling of immune cells, networked hematology analyzer systems linked to viral genomic sequences could geographically map the host responses to SARS-CoV-2 viral clades as they emerge. Such a system could be used in geographic-based outbreak analytics visualizations or vaccine programs in near real-time. However, the importance of appreciating the source of these data, which is the population of New Zealand, cannot be overstated. The data used in this project were obtained with ethical approval but without informed consent, due to the impracticality of obtaining it at scale. Secondary use of medical data carries significant issues around the maintenance of security and privacy of individuals. We have shown here that a simple FBC is highly personalized and acts as a fingerprint for each patient. Data sovereignty, particularly for indigenous populations, in this context Māori, must be respected. Governance and kaitiakitanga are paramount.
Establishing a social license for the secondary use of unconsented health data is also necessary [48]. Ideally, data of this nature would be pooled and shared both locally and globally to deliver the scale needed for robust applications of artificial intelligence; however, there are governance, privacy and other issues to address before this can be achieved. As hematology analyzers are ubiquitous across New Zealand, the accessibility of this technology, even in rural areas, is high. As flow cytometry becomes portable, it would be expected that these results will be translatable to handheld systems, further improving access in remote locations. Ultimately, what we have described here is just the beginning of a machine learning program, and there is significant work to be done to ensure the systems described are robust. To support this, government funders will need to build on the existing sparse infrastructure to support both personalized and genomic services in New Zealand, so that its benefits can be realized.

Limitations
The sample of patients with COVID-19 in this study was too small to generate robust ML models; therefore, all FBC results on individual patients were used. This would have led to overfitting in model training, with the predictive model identifying the individual patient perhaps more than the disease. As we used transparent, explainable artificial intelligence (AI) with BigML, we were able to interrogate the features used in predictive models. The COVID-19 data used here would also have been skewed toward those who were the most unwell with the infection and had a large number of blood tests. Most COVID-19 patients were managed at home and would not have undergone an FBC; therefore, the results here do not represent the entire spectrum of disease. Although ML models may have overfitted with reduced availability of data, the models for pneumonia and heart failure, generated on unique patient data, were robust. FCS data were not available on all patients, as the time taken to download them from the IPU made this impractical. ICD10 diagnoses are reliant on accurate coding and are not always reliable, and missing data limited the number of patients used in statistical models. Modest predictions are not clinically useful in the context of a disease at low prevalence, where the positive predictive value in particular is low. To evaluate the clinical utility of the method outlined in this project, ML predictions would need to show incremental benefit over standard of care, comprising a history, physical examination and conventional investigations. RT-PCR was used as the gold standard, but false negatives may have occurred, resulting in misclassification of patients.
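The effect of low prevalence on predictive values can be illustrated with a short calculation. The sketch below applies Bayes' rule to the heart failure validation figures reported in the Results (138 cases, 42,615 controls, sensitivity 72%, specificity 72%). Treating the validation set as representative of a screening population is an assumption made here for illustration, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    tn = specificity * (1 - prevalence)
    fp = (1 - specificity) * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


# Heart failure validation set figures reported in the Results.
cases, controls = 138, 42615
prevalence = cases / (cases + controls)

ppv, npv = predictive_values(0.72, 0.72, prevalence)
print(f"prevalence={prevalence:.4f} PPV={ppv:.4f} NPV={npv:.4f}")
```

Under these assumptions fewer than 1 in 100 positive predictions would be a true case, while the negative predictive value stays high. This suggests that the positive predictive value is the limiting factor for screening at this prevalence, and it is why confirmatory testing downstream of any alert would be needed.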

Future perspective
ML/artificial intelligence will have many applications in medicine. ML predictions made from laboratory data are an excellent example where existing, inexpensive data are converted to more valuable information. Understanding how this will impact and change clinical practice will be important in the future, as the implications of implementing a machine learning tool at scale could be profound. These implications could be positive or negative depending on the downstream effects on resource utilization and clinical outcomes. Laboratory data, including raw machine outputs, are an untapped resource, rich in both snapshot and time series health information. Future research in this field will require pooling data to generate adequately sized datasets of rarer diseases, and careful clinical implementation studies to gauge the impact on physician behavior and patient outcomes.

Summary points
• We undertook a machine learning project, making use of existing hematology analyzer raw data, much of which is purged from healthcare systems.
• These data were rich in information about health states and predicted biological age, gender, individuality and both communicable and noncommunicable diseases, such as COVID-19 and heart failure.
• Further work will be required to evaluate the clinical impact of these machine learning tools.

Supplementary data
To view the supplementary data that accompany this paper please visit the journal website at: www.futurescience.com/doi/suppl/10.2144/fsoa-2020-0207

Ethical conduct of research
Ethics approval was obtained locally from the Research and Knowledge Centre in collaboration with the Institute for Innovation + Improvement (i3), and from the regional HDEC ethics committee (20/CEN/162). Informed consent was waived, as the research was observational and used secondary data.

Data availability
The materials, data, code and associated protocols are available to readers with application to the corresponding author and Waitematā Privacy, Security and Governance (PSGG) group with a limited data sharing agreement. BigML models will be shared without limitations.

Open access
This work is licensed under the Creative Commons Attribution 4.0 License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/