Cluster analysis and visualisation of electronic health records data to identify undiagnosed patients with rare genetic diseases

Moynihan, Daniel; Monaco, Sean; Ting, Teck Wah; Narasimhalu, Kaavya; Hsieh, Jenny; Kam, Sylvia; Lim, Jiin Ying; Lim, Weng Khong; Davila, Sonia; Bylstra, Yasmin; Balakrishnan, Iswaree Devi; Heng, Mark; Chia, Elian; Yeo, Khung Keong; Goh, Bee Keow; Gupta, Ritu; Tan, Tele; Baynam, Gareth; Jamuar, Saumya Shekhar

doi:10.1038/s41598-024-55424-8


Article
Open access
Published: 01 March 2024

Scientific Reports volume 14, Article number: 5056 (2024)

Abstract

Rare genetic diseases affect 5–8% of the population but are often undiagnosed or misdiagnosed. Electronic health records (EHR) contain large amounts of data, which provide opportunities for analysis and data mining. Data mining, in the form of cluster analysis and visualisation, was performed on a database containing deidentified health records of 1.28 million patients across 3 major hospitals in Singapore, in a bid to improve the diagnostic process for patients living with an undiagnosed rare disease, focusing specifically on Fabry disease and familial hypercholesterolaemia (FH). Against a baseline of 4 diagnosed patients, we identified 2 additional patients with a potential diagnosis of Fabry disease, suggesting a potential 50% increase in diagnoses. Similarly, we identified more than 12,000 individuals who fulfil the clinical and laboratory criteria for FH but had not previously been diagnosed. This proof-of-concept study showed that it is possible to mine EHR data, albeit with some challenges and limitations.

Introduction

Rare genetic diseases affect 5–8% of the population and account for a disproportionately increased use of healthcare resources given the multisystemic manifestations^1,2. However, these rare genetic diseases often go undiagnosed, with patients undergoing a prolonged diagnostic odyssey³. One such example is Fabry disease, which is a rare multisystemic genetic disease with a reported annual incidence of 1 in 100,000, although this is believed to be an underestimation as many cases remain undiagnosed⁴. There is a substantial time gap between the age of first symptoms and diagnosis: 9 and 23 years old for males and 13 and 32 years old for females, respectively⁵. Another example is familial hypercholesterolaemia (FH), which is a genetic disease resulting in greatly increased levels of low-density lipoprotein cholesterol (LDL-C). Such elevations in LDL-C lead to the premature development of numerous problems, such as angina and myocardial infarction⁶. Diagnosis is often delayed with patients only diagnosed after presenting with a catastrophic event⁷.

An electronic health record (EHR) gathers comprehensive patient data across institutions and over a long period of time. Such a longitudinal and comprehensive amalgamation of data can go beyond the initial requirement of supporting medical treatment and provide invaluable information about a patient’s overall health⁸. EHRs contain rich volumes of patient data including demographics, medications, laboratory test results, diagnosis codes, and procedures. While their primary use is to facilitate patient care, the existence of such data in a digitised format presents a unique opportunity for analysis and novel clinical insights, including identifying patients with rare genetic diseases. Indeed, the Phenotype Risk Score (PheRS) is a method to detect genetic disease patterns using phenotypes from the EHR and has been shown to be a scalable way to find patients with undiagnosed genetic disease^9,10.

While EHR data can, in theory, be considered standardised across different healthcare systems, there is a paucity of literature on such data mining approaches within Asian healthcare systems¹¹, especially for rare genetic disorders. This could partly be attributed to the lack of EHR systems in many Asian healthcare systems, which rely primarily on traditional paper-based health records, but also to infrastructural and skilled-manpower constraints associated with the implementation and data mining of EHR¹². This study aims to be a preliminary step towards using EHR data mining to identify patients with an undiagnosed rare genetic disease, specifically Fabry disease and FH, and to understand the challenges in an Asian healthcare system context. It is an early step towards increasing the quality of care for people living with rare disease by shortening their diagnostic odyssey so that more time and resources can be spent on managing the disease.

Methods

Data

This work was performed on a dataset consisting of deidentified structured medical records of approximately 1.28 million patients across three healthcare institutions under the Singapore Health Services (SingHealth) cluster in Singapore (National Heart Centre Singapore, KK Women’s and Children’s Hospital, and Singapore General Hospital) over a 3-year window (1 Jan 2018–1 Mar 2022). All methods and analyses were carried out in accordance with relevant guidelines and regulations and were approved by the SingHealth Data Governance committee. The SingHealth Centralised Institutional Review Board waived the need for informed consent for this study.

Data extraction, deidentification and preparation

The initial dataset was collected from various sources, including laboratory, radiology, pathology, diagnoses, and detailed patient information within the SingHealth Database, all of which contained identifiable patient information. Only structured data was extracted for this initial pilot (Supplementary Table 1), and free-text fields were excluded to minimise the risk of exposing sensitive information. As a critical step towards ensuring privacy and compliance with security protocols, a trusted third party identified the data fields deemed sensitive under the “SingHealth Policy for Data Anonymisation” and pseudonymised them. For example, the NRIC of patients was changed from “SxxxxxxxZ” format to “PIDxxxxxx” format. Such sensitive fields were initially randomly sorted and then PID numbers were assigned. After pseudonymisation, the data were transferred to the Office of Insights and Analytics (OIA) High-Performance Computer Lab, which is an air-gapped environment. Strict security guidelines were in place to ensure that only users directly involved in the study were allowed access to the pseudonymised data. Such users were unable to transfer data onto or from the Lab devices themselves and required the permission and assistance of other Lab staff. The record linkage key was not transferred into the Lab and was not exposed to the clinicians involved in this study by any means, thereby ensuring that the privacy of the individuals was not compromised.
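The "randomly sort, then assign PID numbers" step can be illustrated with a minimal sketch. This is not the third party's actual procedure: the function name, the six-digit zero-padded format and the example identifiers are assumptions made for illustration, and a production system would use a cryptographically secure source of randomness.

```python
import random

def assign_pseudonyms(identifiers, seed=None):
    """Map each distinct identifier to a 'PIDxxxxxx' label after a random shuffle.

    Hypothetical helper: the paper only states that sensitive fields were
    randomly sorted and then given PID numbers.
    """
    unique_ids = sorted(set(identifiers))  # de-duplicate; sorted so a given seed is reproducible
    rng = random.Random(seed)
    rng.shuffle(unique_ids)                # random order, so the PID number says nothing about the original ID
    return {original: f"PID{n:06d}" for n, original in enumerate(unique_ids, start=1)}

# Made-up identifiers in the "SxxxxxxxZ" shape; these are not real NRICs.
linkage_key = assign_pseudonyms(["S0000001A", "S0000002B", "S0000003C"], seed=42)

# The linkage key stays outside the analysis environment; only pseudonymised rows move on.
records = [{"nric": "S0000002B", "ldl_c": 5.1}]
pseudonymised = [{"pid": linkage_key[r["nric"]], "ldl_c": r["ldl_c"]} for r in records]
```

Keeping `linkage_key` separate from `pseudonymised` mirrors the description above, in which the record linkage key never entered the Lab.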

Once deidentified, the structured data was normalized and standardized using a third-party platform, Population Builder (Health Catalyst, USA). The Population Builder tool allows the user to select multiple parameters and apply them as filters on the patient population (Supplementary Table 2). Problem list filtering was done with Systematized Nomenclature of Medicine Clinical Terms (SNOMED) codes. In many cases, multiple SNOMED codes refer to the same disease or phenotype (for example, familial hypercholesterolemia and familial hypercholesterolaemia). This poses a challenge when searching the data, because every filter must include all spelling and naming variations. This problem was solved with value sets, a feature of the Population Builder tool that groups codes into one unified set, so that the user can add the set to a filter rather than manually searching for every code. A value set needs to be created only once and is then readily available whenever patient filtering is required.
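The idea behind a value set can be sketched in a few lines. This does not reproduce the Population Builder interface: the dictionary layout, the function name and the codes are all placeholders (the codes are deliberately not real SNOMED identifiers).

```python
# Hypothetical value set: placeholder codes standing in for the several
# SNOMED codes that can describe the same disease.
VALUE_SETS = {
    "familial_hypercholesterolaemia": {"CODE-FH-1", "CODE-FH-2", "CODE-FH-3"},
}

def patients_in_value_set(problem_list, value_set_name, value_sets=VALUE_SETS):
    """Return sorted IDs of patients with at least one problem-list code in the named value set."""
    codes = value_sets[value_set_name]
    return sorted({pid for pid, code in problem_list if code in codes})

# Toy problem list of (patient ID, code) pairs.
problem_list = [
    ("PID000001", "CODE-FH-1"),
    ("PID000002", "CODE-FH-3"),
    ("PID000003", "CODE-OTHER"),
    ("PID000001", "CODE-FH-2"),
]
fh_known = patients_in_value_set(problem_list, "familial_hypercholesterolaemia")
```

Defining the set once and reusing it by name is the property the text highlights: the filter no longer depends on remembering every variant code.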

Rare diseases

As a pilot, we selected two rare diseases for this project: Fabry Disease and Familial Hypercholesterolemia (FH). These diseases were selected as the diagnostic criteria for these two diseases are well defined and datapoints needed for the diagnostic criteria can be extracted or inferred from structured health records. The diagnostic criteria included:

(1) Fabry disease

If the patient is less than 50 years old AND has symptoms from AT LEAST two (2) of the following systems (broken down further into specific diseases):

Kidney
- Chronic kidney disease
- Proteinuria
- Microalbuminuria
Cardiac
- Cardiomyopathy
- Valvular heart disease
- Arrhythmia
Neuro
- Ischemic stroke
- Transient ischemic attack
- Acroparaesthesia
Skin
- Angiokeratomas
- Impaired sweating/hypohidrosis
- Heat and cold intolerance
Eye
- Corneal whorling
- Cornea verticillata
- Corneal and lenticular opacities
- Vasculopathy (retina, conjunctiva)

(2) Familial hypercholesterolemia

If a patient has premature atherosclerotic cardiovascular disease (ASCVD) OR has severely elevated calculated low-density lipoprotein cholesterol (LDL-C) laboratory results. More specifically, if a patient satisfies any of the following:

ASCVD, male, less than 55 years old
ASCVD, female, less than 60 years old
LDL-C greater than 2.6 mmol/L whilst adhering to high-intensity statins
LDL-C greater than 3.9 mmol/L and 18 years old or younger
LDL-C greater than 4.9 mmol/L and more than 18 years old

Patients who meet the criteria are flagged as a Fabry suspect or FH suspect, respectively^13,14. We also created value sets for Fabry disease and FH (Supplementary Fig. 1) so as to identify patients with known diagnosis of these diseases within our database.
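The two sets of screening criteria above can be written as simple rule functions. The sketch below is only an illustration of the stated thresholds: the `Patient` structure, field names and system labels are assumptions, and the actual filtering in the study was performed with Population Builder and SQL rather than with this code.

```python
from dataclasses import dataclass, field
from typing import Optional

FABRY_SYSTEMS = {"kidney", "cardiac", "neuro", "skin", "eye"}

@dataclass
class Patient:
    """Hypothetical record layout, for illustration only."""
    age: float
    sex: str                                   # "M" or "F"
    systems: set = field(default_factory=set)  # affected systems from the Fabry list
    ascvd: bool = False                        # atherosclerotic cardiovascular disease
    ldl_c: Optional[float] = None              # mmol/L; None when no result exists
    on_high_intensity_statin: bool = False

def is_fabry_suspect(p: Patient) -> bool:
    """Under 50 years old with at least two affected systems."""
    return p.age < 50 and len(p.systems & FABRY_SYSTEMS) >= 2

def is_fh_suspect(p: Patient) -> bool:
    """Premature ASCVD, or LDL-C above the threshold for the patient's situation."""
    if p.ascvd and ((p.sex == "M" and p.age < 55) or (p.sex == "F" and p.age < 60)):
        return True
    if p.ldl_c is None:
        return False
    if p.on_high_intensity_statin and p.ldl_c > 2.6:
        return True
    if p.age <= 18:
        return p.ldl_c > 3.9
    return p.ldl_c > 4.9
```

For example, `is_fabry_suspect(Patient(age=42, sex="M", systems={"kidney", "cardiac"}))` evaluates to `True`, whereas the same patient at age 55 would not be flagged.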

Data wrangling

Specific metrics were examined based on each patient cohort. The data used for Fabry true positives (TPs, patients with a confirmed diagnosis) and suspects included demographics (age, race, sex) and systems data (which systems from the diagnostic criteria were affected for each patient). The data used for FH included demographics and LDL-C laboratory results. The data was retrieved from the database using Microsoft SQL Server Management Studio. SQL queries were written and executed, and the outputs were saved as CSV files to be loaded into RStudio. Once the CSV files were converted into R data frames, the data were reshaped into the form required for the two analyses performed (visualisation and statistical testing).

Data analysis

As mentioned above, data analysis came in two forms for this project, and they are described below.

Visualisation

The tidyverse and lubridate R packages were used heavily in the visualisation portion of the project. Demographic data for both diseases was used to generate pie charts to visualise the race and sex breakdown of TPs and suspects for each respective disease. Age data for both diseases was used to generate scatterplots and boxplots (depending on the number of observations) showing the age distribution of male and female patients of each respective disease. Fabry disease systems data was used to create bar graphs for first, second, and third-order interactions and a five-way Venn diagram showing the interactions.
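The counts behind the interaction bar graphs can be sketched as follows. The study itself used R; this Python version is only an illustration, and it assumes that each patient is counted once under the exact combination of systems affected, which the text does not state explicitly. The patient IDs and system labels are toy values.

```python
from collections import Counter

def interaction_counts(patient_systems):
    """Count patients per exact combination of affected systems, grouped by order.

    Returns {order: Counter({(system, ...): n_patients})}, where order is the
    number of systems in the combination.
    """
    by_order = {}
    for systems in patient_systems.values():
        combo = tuple(sorted(systems))  # canonical ordering so equal sets share a key
        if not combo:
            continue                    # patients with no affected system are skipped
        by_order.setdefault(len(combo), Counter())[combo] += 1
    return by_order

toy = {
    "PID000001": {"cardiac", "renal"},
    "PID000002": {"renal", "cardiac"},
    "PID000003": {"cardiac", "renal", "neuro"},
    "PID000004": {"renal"},
}
counts = interaction_counts(toy)
```

Here `counts[2]` holds the two-way combinations and `counts[3]` the three-way combinations, which is the grouping a bar graph per interaction order would need.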

Statistical testing

A two-sample t-test for a difference in means was conducted on the LDL-C data for FH TPs and suspects using the following hypotheses:

H₀: The mean LDL-C levels are equal between TP and suspect cohorts

H_A: The mean LDL-C levels are different between TP and suspect cohorts
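A two-sample test of these hypotheses can be sketched with the standard library alone. The paper does not say whether equal variances were assumed, so the unequal-variance (Welch) form used here is an assumption, as is the normal approximation for the p-value, which is reasonable only for large samples. The function name is hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def welch_t_test(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom, two-sided p-value).

    The p-value uses a normal approximation to the t distribution, so it is
    only a rough guide when either sample is small.
    """
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb  # squared standard error of each mean
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p

# Made-up LDL-C values in mmol/L, purely to show the call; not study data.
tp_ldl = [5.2, 6.1, 4.8, 7.0, 5.5]
suspect_ldl = [5.0, 5.3, 4.9, 5.1, 5.6, 5.2]
t_stat, dof, p_value = welch_t_test(tp_ldl, suspect_ldl)
```

In R, `t.test(x, y)` performs the Welch form by default and reports an exact t-distribution p-value, which would be the more direct route for the workflow described here.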

Results

Fabry disease

Out of the 1.28 million patients, only 4 patients (true positives, TPs) were identified to have a confirmed diagnosis of Fabry disease (Supplementary Fig. 2), giving a prevalence of 1 in 320,000. Assuming a prevalence of 1 in 100,000, this suggests there were potentially undiagnosed patients in our database. All 4 TPs were Chinese, consisting of 3 males and 1 female. The female TP was the youngest.

We then applied our criteria for Fabry disease and identified 2213 Fabry suspects. The suspects’ demographics are slightly more varied, owing to the greater number of observations. Among the suspects, the male-to-female ratio is about 60:40 and, while the majority are Chinese (61%), smaller proportions of other races are present (15.5% Malay, 11.7% Indian, 11.8% others), which is consistent with the ethnic distribution in Singapore. The median ages of male and female suspects are both just over 40 years, with the male median slightly greater than the female median. Unfortunately, the small number of known Fabry cases precluded any further comparison with the suspect cases.

Given that patients with Fabry disease present with multisystemic involvement, we then reviewed the data pertaining to system interactions. Because of the multisystemic nature of the disease, a patient may be in contact with multiple specialists at any given time, each providing care for a specific organ system. Examining affected-systems data may therefore provide insights into significant comorbidities for Fabry. The five systems involved, in descending order of frequency, are renal (922), cardiac (872), neurological (440), ophthalmologic (21), and cutaneous (2) (Fig. 1A). Exploring two-way interactions, we identified 64 cardiac-renal, 27 cardiac-neuro, 23 renal-neuro, and 1 renal-ophthalmologic interactions (Fig. 1B). Expanding to three-way interactions, we identified 4 patients with interactions across the cardiac, renal and neurological systems (Fig. 1C). Interrogating these 4 patients with three-system interactions identified 2 potential cases who deserve further work-up to exclude Fabry disease, while the other 2 had an underlying comorbidity that could explain their three-system interaction (Supplementary Text).

Familial hypercholesterolemia

The TP cohort for FH is substantially larger than that of Fabry, with 161 confirmed diagnoses, giving us a prevalence of 1 in 8000. In contrast, based on a recent genomic epidemiology study of Singaporeans¹⁵, the population prevalence of FH in our local population is estimated to be 1 in 250, suggesting that we were severely underdiagnosing FH in our system. Using our screening criteria, we identified 12,328 FH suspects. There was a slight difference in the gender breakdown between TPs and suspects. TPs have about a 53:47 male to female ratio whereas suspects have a 42:58 ratio. On average, female TPs were older than male TPs, and the same is true for female and male suspects, with more variability in the TPs (greater interquartile range (IQR)).
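The prevalence figures quoted in the Results follow from simple division, as the sketch below shows. It uses only numbers reported in the text; the cohort size of 1.28 million is itself approximate, and the function names are illustrative.

```python
COHORT_SIZE = 1_280_000  # approximate number of patients, as reported

def one_in(n_cases, cohort=COHORT_SIZE):
    """Observed prevalence expressed as '1 in N'."""
    return cohort / n_cases

def expected_cases(one_in_n, cohort=COHORT_SIZE):
    """Number of cases expected in the cohort for a prevalence of 1 in `one_in_n`."""
    return cohort / one_in_n

fabry_observed = one_in(4)                # 4 confirmed Fabry diagnoses
fabry_expected = expected_cases(100_000)  # assumed prevalence of 1 in 100,000
fh_observed = one_in(161)                 # 161 confirmed FH diagnoses
fh_expected = expected_cases(250)         # estimated local prevalence of 1 in 250
```

On these inputs the Fabry figure is 1 in 320,000 against roughly 13 expected cases, and the FH figure is roughly 1 in 7,950 (rounded to 1 in 8000 in the text) against about 5,120 expected cases, which is the gap the text describes as underdiagnosis.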

Laboratory results provide good numeric data about patients. LDL-C is used in the diagnostic criteria; thus, these lab results were extracted for analysis. The TP and suspect cohorts were split by the main diagnostic criteria for graphing. The median LDL-C level for TPs satisfying the ‘early onset ASCVD’ red flag is slightly greater than that of the suspects (Fig. 2A). The same is true for patients satisfying the ‘adult with raised LDL-C’ red flag; the data for suspects in this subset is right-skewed, with most values very close to the red flag level of 4.9 mmol/L (Fig. 2B). We also observed some extreme values, with one patient’s LDL-C well over 20 mmol/L. Grouping all the data together into TP and suspect sets, the median LDL-C value for TPs is very close to that of the suspects (p = 0.1022) (Fig. 2C).

Discussion

Our proof-of-concept study, spanning 1.28 million records, identified 2 potential patients with Fabry disease, a potential 50% increase in the number of patients with Fabry disease within our healthcare system. In addition, we identified over 12,000 patients with suspected FH, which is closer to the expected prevalence of FH in our population, suggesting that rare genetic diseases such as Fabry disease and FH are often underdiagnosed. This represents a missed opportunity, as early identification and treatment are associated with better healthcare outcomes^16,17.

Within the current literature, there are multiple examples of data mining successfully applied to EHR data, in fields such as “Understanding the Natural History of Disease”, “Cohort Identification”, “Risk Prediction/Biomarker Discovery”, “Quantifying the Effect of Intervention”, “Constructing Evidence-Based Guidelines”, and “Adverse Event Detection”¹⁸. Denny described how EHR data can be used for genomic analysis and discussed various challenges accompanying this task, including missing data, incorrect data, and unstructured data¹⁹. Jensen et al. also described the issue of unstructured text data in EHRs, pointing out that improvements in text mining techniques were making these parts of the EHR more accessible for data mining²⁰. Kirk et al. implemented a text-mining approach to match patient records with two clinical vocabularies to perform data clustering on 14,017 patients²¹. In addition to clinical features, administrative factors such as length of stay in the hospital have been examined with the aid of EHR data as well²². More recently, Landi et al. demonstrated that the vast quantities of data within EHRs present an opportunity for deep learning, a method of machine learning that uses neural networks to tease out information from data²³. Liang et al. identified the promise, albeit underdeveloped, of EHRs to assist with the identification of interactions between co-existing morbidities, such as severity of COVID-19 in patients with immunodeficiency²⁴. Many steps are involved in rendering EHR data usable for mining, and many of these steps can be improved using technology. For both structured and unstructured (free-text) EHRs, a search engine can assist in finding patients, natural language processing can be applied to identify phenotypes, and machine learning can be applied to classify patients²⁵.

One of the biggest issues with such approaches is the underlying dataset, which can vary across healthcare systems and present with their own unique set of challenges, and the algorithm needs to be modified accordingly to address those challenges. One of the purposes of this proof-of-concept study was to show what can be done on such a dataset from an Asian healthcare setting.

While the study shows the potential utility of flagging potential cases, it is not possible to confirm the diagnosis in those specific patients, because the data has been deidentified irreversibly. This is a limitation of our study, as we are unable to confirm or refute the diagnosis in patients who were flagged for further review. However, to add a layer of validation, we manually reviewed the deidentified medical records of all the flagged patients and considered them potential cases only if they fulfilled the clinical criteria and did not have a more plausible secondary cause for their multi-organ manifestation (Supplementary Text). For example, individual PIDxxx4, with chronic kidney disease, transient ischemic attacks and cardiomyopathy, may have hypertensive nephropathy rather than Fabry disease, but this is outside the scope of the current study. Nevertheless, our data suggests that such results can be used to guide decision-making and provide evidence for or against diagnostic criteria. It is also important to note that these results cannot be used in isolation to deliver a definitive diagnosis and must be interpreted within the clinical context and in partnership with domain-specific medical expertise.

Although substantial and useful in supporting proof-of-concept, another limitation of our study is that the data used for this project was limited to a 3-year window. This resulted in some patients presenting as TPs for a disease, but no other information about their state prior to/following their diagnosis was available for further context exploration. This is a problem because multiple data fields are required to implement any sort of statistical testing or machine learning. Such data censoring can be overcome by lengthening the window during which data is collected, but that is outside of the parameters of our current accessible data and can be assessed prospectively with this data set as it accumulates with time. In addition, the testing and fine-tuning of diagnostic criteria are critical. The data mining performed in this project was based on algorithms and diagnostic criteria supplied by subject matter experts. While such criteria are derived from relevant literature and have proven validity, care should be taken to ensure their statistical validity.

A phenotype risk score (PheRS) is the sum of the clinical features observed in an individual, based on phecodes (a mapping of human phenotype ontology (HPO) terms to billing codes within the EHR), with each feature weighted by the log inverse prevalence of that feature¹⁰. This approach has been shown to be a strong predictor of case status in cohorts of patients with rare diseases, and it assists with rare variant interpretation and with the identification of rare disease patients whose symptoms overlap with common diseases¹⁰. However, our dataset was limited to structured data, which did not allow us to generate phecodes; hence, we were unable to compare our strategy against PheRS in flagging potential cases within our dataset.
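The weighting described above can be made concrete with a short sketch. It follows only the definition given in the text (a sum over observed features, each weighted by the log of its inverse prevalence); the feature labels and prevalence values are invented for illustration, the natural logarithm is an assumption, and this is not the reference implementation from the cited work.

```python
import math

def phenotype_risk_score(patient_phecodes, disease_phecodes, prevalence):
    """Sum log(1 / prevalence) over the disease's phecodes that the patient has.

    `prevalence` maps each phecode to the fraction of the cohort carrying it,
    so rarer features contribute larger weights.
    """
    return sum(
        math.log(1.0 / prevalence[code])
        for code in disease_phecodes
        if code in patient_phecodes
    )

# Invented feature labels and prevalences, for illustration only.
prevalence = {"feature_a": 0.1, "feature_b": 0.01, "feature_c": 0.5}
disease_features = {"feature_a", "feature_b"}

score_partial = phenotype_risk_score({"feature_a", "feature_c"}, disease_features, prevalence)
score_full = phenotype_risk_score({"feature_a", "feature_b"}, disease_features, prevalence)
```

A patient with both disease features scores higher than one with only the more common feature, and features outside the disease's list (here `feature_c`) do not contribute.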

Lastly, while we did identify potential undiagnosed patients, this approach does not identify all such patients. Indeed, while we identified 2 potential patients with Fabry disease, bringing our total to 6, we would expect between 13 and 30 patients with Fabry disease in our dataset, suggesting that alternative complementary approaches will be required to comprehensively identify all the unidentified patients. However, while our single solution does not solve this complex problem, it offers a unique solution in addressing one element of this healthcare challenge.

This study was a pilot study, a proof of concept, with the goal to test whether EHR data contained enough/suitable data for data mining techniques to be applied. In the specific use case of rare diseases, not as many techniques as initially thought could be carried out. By nature, rare diseases will be very sparse in large patient datasets, especially when drilling down to one very specific disease, such as Fabry Disease. Given enough time or with a broader dataset, as more patients are diagnosed with rare diseases, the data could be built up and more algorithms could be applied. In addition, extracting unstructured data and extracting relevant HPO terms or phecodes could allow us to test the utility of other data mining approaches.

Potential expansion from such studies could include providing clinicians with pre-trained algorithms that automatically flag patients in their system who may be at risk of having a rare disease. Such patients can then be referred for genetic testing. If such a tool were effective at identifying potential rare disease cases, the algorithm could be trained on more and more data, both structured and unstructured, as time goes on, increasing its utility. An ideal end-product would be a computer application available to primary care practitioners during patient consultations, which has access to EHR data and can raise a warning when it detects a rare disease ‘red flag’ from either a pre-determined or a dynamic data mining process.

Conclusion

This project has shown that conducting mining on EHR data is possible. Even in de-identified, sparse, censored data, many useful insights were found. Directions and guidance on what to look for are crucial in tasks like these; as such, continuous and clear communication with subject matter experts is paramount. A wider variety of data fields could be included in analyses to improve on the work outlined in this project. The attributes used in our project were selected based on time constraints and data availability. By showcasing the opportunities and limitations of such a dataset, this project helps to strategize future work in the area.

Data availability

The data used in this study are not publicly available due to privacy and legal restrictions but are available from the corresponding author on reasonable request.

References

1. The Lancet, N. Rare neurological diseases: A united approach is needed. Lancet Neurol. 10, 109. https://doi.org/10.1016/S1474-4422(11)70001-1 (2011).
2. Ferreira, C. R. The burden of rare diseases. Am. J. Med. Genet. A 179, 885–892. https://doi.org/10.1002/ajmg.a.61124 (2019).
3. Bauskis, A., Strange, C., Molster, C. & Fisher, C. The diagnostic odyssey: Insights from parents of children living with an undiagnosed condition. Orphanet J. Rare Dis. 17, 233. https://doi.org/10.1186/s13023-022-02358-x (2022).
4. Germain, D. P. Fabry disease. Orphanet J. Rare Dis. 5, 30. https://doi.org/10.1186/1750-1172-5-30 (2010).
5. Eng, C. M. et al. Fabry disease: Baseline medical characteristics of a cohort of 1765 males and females in the Fabry Registry. J. Inherit. Metab. Dis. 30, 184–192. https://doi.org/10.1007/s10545-007-0521-2 (2007).
6. Ison, H. E., Clarke, S. L. & Knowles, J. W. Familial Hypercholesterolemia. In GeneReviews® (eds Adam, M. P. et al.) (University of Washington, Seattle, Seattle (WA), 1993).
7. Kramer, A. I. et al. Major adverse cardiovascular events in homozygous familial hypercholesterolaemia: A systematic review and meta-analysis. Eur. J. Prev. Cardiol. 29, 817–828. https://doi.org/10.1093/eurjpc/zwab224 (2022).
8. Hoerbst, A. & Ammenwerth, E. Electronic health records. A systematic review on quality requirements. Methods Inf. Med. 49, 320–336. https://doi.org/10.3414/ME10-01-0038 (2010).
9. Morley, T. J. et al. Phenotypic signatures in clinical data enable systematic identification of patients for genetic testing. Nat. Med. 27, 1097–1104. https://doi.org/10.1038/s41591-021-01356-z (2021).
10. Bastarache, L. et al. Phenotype risk scores identify patients with unrecognized Mendelian disease patterns. Science 359, 1233–1239. https://doi.org/10.1126/science.aal4043 (2018).
11. Wang, D. et al. Data mining: Traditional spring festival associated with hypercholesterolemia. BMC Cardiovasc. Disord. 21, 526. https://doi.org/10.1186/s12872-021-02328-4 (2021).
12. Dornan, L. et al. Utilisation of electronic health records for public health in Asia: A review of success factors and potential challenges. Biomed. Res. Int. 2019, 7341841. https://doi.org/10.1155/2019/7341841 (2019).
13. Silva, C. A. B., Andrade, L. G. M., Vaisbich, M. H. & Barreto, F. C. Brazilian consensus recommendations for the diagnosis, screening, and treatment of individuals with Fabry disease: Committee for Rare Diseases—Brazilian Society of Nephrology/2021. J. Bras. Nefrol. 44, 249–267. https://doi.org/10.1590/2175-8239-JBN-2021-0208 (2022).
14. Koh, N. et al. Asian Pacific Society of Cardiology consensus recommendations on dyslipidaemia. Eur. Cardiol. 16, e54. https://doi.org/10.15420/ecr.2021.36 (2021).
15. Chan, S. H. et al. Analysis of clinically relevant variants from ancestrally diverse Asian genomes. Nat. Commun. 13, 6694. https://doi.org/10.1038/s41467-022-34116-9 (2022).
16. Hopkin, R. J. et al. The management and treatment of children with Fabry disease: A United States-based perspective. Mol. Genet. Metab. 117, 104–113. https://doi.org/10.1016/j.ymgme.2015.10.007 (2016).
17. Lee, W. J. et al. Familial hypercholesterolemia genetic variations and long-term cardiovascular outcomes in patients with hypercholesterolemia who underwent coronary angiography. Genes (Basel) https://doi.org/10.3390/genes12091413 (2021).
18. Yadav, P., Steinbach, M., Kumar, V. & Simon, G. Mining electronic health records (EHRs): A survey. ACM Comput. Surv. 50, 85. https://doi.org/10.1145/3127881 (2018).
19. Denny, J. C. Chapter 13: Mining electronic health records in the genomics era. PLoS Comput. Biol. 8, 1002823. https://doi.org/10.1371/journal.pcbi.1002823 (2012).
20. Jensen, P. B., Jensen, L. J. & Brunak, S. Mining electronic health records: Towards better research applications and clinical care. Nat. Rev. Genet. 13, 395–405. https://doi.org/10.1038/nrg3208 (2012).
21. Kirk, I. K. et al. Linking glycemic dysregulation in diabetes to symptoms, comorbidities, and genetics through EHR data mining. Elife https://doi.org/10.7554/eLife.44941 (2019).
22. Baek, H. et al. Analysis of length of hospital stay using electronic health records: A statistical and data mining approach. PLoS ONE 13, e0195901. https://doi.org/10.1371/journal.pone.0195901 (2018).
23. Landi, I. et al. The evolution of mining electronic health records in the era of deep learning. Deep Learn. Biol. Med. 55, 92. https://doi.org/10.1142/9781800610941_0003 (2022).
24. Liang, C. et al. Curating a knowledge base for individuals with coinfection of HIV and SARS-CoV-2: A study protocol of EHR-based data mining and clinical implementation. BMJ Open 12, e067204. https://doi.org/10.1136/bmjopen-2022-067204 (2022).
25. Garcelon, N., Burgun, A., Salomon, R. & Neuraz, A. Electronic health records for the diagnosis of rare diseases. Kidney Int. 97, 676–686. https://doi.org/10.1016/j.kint.2019.11.037 (2020).


Acknowledgements

We would like to acknowledge the help provided by the SingHealth OIA team in data access and analysis.

Funding

Sanofi Aventis provided funding for the project. However, Sanofi Aventis did not have any role in data analysis or manuscript writing. SSJ is supported by NMRC Clinician Scientist Award (NMRC/CSAINV21jun-0003). DM is supported by New Colombo Plan, Department of Trade and Foreign Affairs, Australia. Additional funding support includes grants under National Research Foundation Singapore administered by the Singapore Ministry of Health’s National Medical Research Council to the following individuals: National Precision Medicine Programme (NPM) PHASE II FUNDING (MOH-000588) to WKL.

Author information

Authors and Affiliations

Curtin University, Perth, Australia
Daniel Moynihan, Ritu Gupta & Tele Tan
Health Catalyst, Utah, USA
Sean Monaco
Genetics Service, Department of Paediatrics, KK Women’s and Children’s Hospital, 100 Bukit Timah Road, Singapore, 229899, Singapore
Teck Wah Ting, Sylvia Kam, Jiin Ying Lim & Saumya Shekhar Jamuar
SingHealth Duke-NUS Genomic Medicine Centre, Singapore, Singapore
Teck Wah Ting, Kaavya Narasimhalu, Jenny Hsieh, Sylvia Kam, Jiin Ying Lim, Weng Khong Lim, Sonia Davila, Yasmin Bylstra, Iswaree Devi Balakrishnan & Saumya Shekhar Jamuar
Department of Neurology, National Neuroscience Institute (Singapore General Hospital), Singapore, Singapore
Kaavya Narasimhalu
Department of Internal Medicine, Singapore General Hospital, Singapore, Singapore
Jenny Hsieh
SingHealth Duke-NUS Institute of Precision Medicine, Singapore, Singapore
Weng Khong Lim, Sonia Davila, Yasmin Bylstra & Saumya Shekhar Jamuar
Cancer & Stem Cell Biology Program, Duke-NUS Medical School, Singapore, Singapore
Weng Khong Lim
Laboratory of Genome Variation Analytics, Genome Institute of Singapore, Singapore, Singapore
Weng Khong Lim
National Heart Centre Singapore, Singapore, Singapore
Iswaree Devi Balakrishnan & Khung Keong Yeo
SingHealth Office of Insights and Analytics, Singapore, Singapore
Mark Heng & Elian Chia
Data Analytics Office, KK Women’s and Children’s Hospital, Singapore, Singapore
Bee Keow Goh
Rare Care Centre, Perth Children’s Hospital, Perth, WA, Australia
Gareth Baynam
Western Australian Register of Developmental Anomalies, Perth, WA, Australia
Gareth Baynam

Contributions

D.M.: data analysis, literature review, writing (original draft), and writing (review and editing). S.M.: data analysis and writing (review and editing). T.T.W., K.N., J.H., S.K., J.Y.L., Y.B., I.D.B., W.K.L., S.D., K.K.Y.: study conceptualisation, literature review, and writing (review and editing). M.H., E.C., B.K.G.: data analysis and writing (review and editing). R.G., T.T., G.B., S.S.J.: study conceptualisation and design, writing (review and editing), supervision, and mentorship.

Corresponding author

Correspondence to Saumya Shekhar Jamuar.

Ethics declarations

Competing interests

SM is an employee of Health Catalyst. DM worked as an intern at Health Catalyst. All other authors have no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Figures.

Supplementary Table S1.

Supplementary Table S2.

Supplementary Information 4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Moynihan, D., Monaco, S., Ting, T.W. et al. Cluster analysis and visualisation of electronic health records data to identify undiagnosed patients with rare genetic diseases. Sci Rep 14, 5056 (2024). https://doi.org/10.1038/s41598-024-55424-8

Received: 01 November 2023
Accepted: 23 February 2024
Published: 01 March 2024
DOI: https://doi.org/10.1038/s41598-024-55424-8
