An interactive atlas of genomic, proteomic, and metabolomic biomarkers promotes the potential of proteins to predict complex diseases

Multiomics analyses have identified multiple potential biomarkers of the incidence and prevalence of complex diseases. However, it is not known which type of biomarker is optimal for clinical purposes. Here, we make a systematic comparison of 90 million genetic variants, 1453 proteins, and 325 metabolites from 500,000 individuals with complex diseases from the UK Biobank. A machine learning pipeline consisting of data cleaning, data imputation, feature selection, and model training using cross-validation and comparison of the results on holdout test sets showed that proteins were most predictive, followed by metabolites, and genetic variants. Only five proteins per disease resulted in median (min–max) areas under the receiver operating characteristic curves for incidence of 0.79 (0.65–0.86) and 0.84 (0.70–0.91) for prevalence. In summary, our work suggests the potential of predicting complex diseases based on a limited number of proteins. We provide an interactive atlas (macd.shinyapps.io/ShinyApp/) to find genomic, proteomic, or metabolomic biomarkers for different complex diseases.

omics layers, such as genomics or metabolomics perform compared to the proteomics-based prediction models?(2) How does the prediction focused on the incident cases relate to the prediction based on the prevalent cases?Are the same molecules good markers for earlier and later stages of the diseases?(3) Could a single molecule or a limited number of molecules suffice to predict complex diseases that, in contrast to monogenic diseases, are caused by multiple interacting molecules 6,12 ?This problem is illustrated by a recent study, which identified multiple disease-associated metabolites, each of which could vary greatly between both healthy and sick individuals 13 .
To systematically answer these questions, we performed a comparison of genomic, proteomic, and metabolomic data from the UKBB.We used machine learning to build predictive models for different combinations of genetic variants, proteins, and metabolites, and these models were utilized to search for potential biomarkers for the incidence and prevalence of nine complex diseases (Fig. 1).

Patient cohort
Our analyses were based on 92,916 patients with rheumatoid arthritis (RA), systemic lupus erythematous (SLE), ulcerative colitis (UC), Crohn's disease (CD), psoriasis (PSO), type 2 diabetes (T2D), obesity, atherosclerotic vascular disease (ASVD), and chronic obstructive pulmonary disease (COPD), as well as their age/sex matched controls.These diseases were selected because they had enough samples to perform statistically robust predictions.For each disease, we divided patients into those who were diagnosed after the assessment visit and those who were already diagnosed (incident and prevalent disease, respectively).The patient characteristics are presented in Supplementary Table 1.

Interactive web-based atlas enables the search for biomarkers
The UKBB data that we analysed consists of genotypes data of 90 million genetic variants, 1453 proteins, and 325 metabolite measurements.To find the optimal type and number of potential biomarkers for incidence and prevalence, we used a machine learning pipeline, which consisted of data cleaning, data imputation, feature selection, and model training, with comparisons of the results on holdout test sets.In all models, the data were divided into training and testing datasets.The classification model was trained using a tenfold cross validation.The results are presented in an interactive web-based atlas (macd.shinyapps.io/ShinyApp/),which generates receiver operating characteristic (ROC) curves for the incidence or prevalence of each disease based on user-selected numbers of either proteins or metabolites.In the case of genomics, the ROC curves are generated from published polygenic risk scores (PRS)s for the studied diseases, which were derived from the polygenic score catalog 14 .

Proteomics biomarkers outperforms biomarkers from other omics
To evaluate the predictive performance of the different molecular types for incidence and prevalence of each disease, we started by computing test ROC curves based on only five proteins or metabolites, while for genomics, we used the individual scaled PRSs.We found that there was a significant difference between incidence or prevalence and healthy controls for all diseases in proteins and metabolites, while the differences in PRSs were sometimes not significant (Fig. 2A).The boxplots for proteins and metabolites also indicate that healthy controls can be separated from incident and prevalent cases with a good precision.To further evaluate the models, we computed the area under the ROC curves (AUCs).We aimed for AUCs of 0.8 or greater, which may be of clinical significance 15 .Overall, proteins yielded the highest AUCs.The AUCs reached 0.8 or more for all diseases except CD and UC.The median (min-max) AUC for incidence was 0.79 (0.65-0.86), and for prevalence 0.84 (0.7-0.91).Metabolites yielded median (min-max) AUCs for incidence and prevalence of 0.70 (0.62-0.80) and 0.86 (0.65-0.90), respectively.T2D, obesity and ASVD had the highest AUCs.Genetic variants resulted in median AUCs for incidence and prevalence of 0.57 (0.53-0.67) and 0.6 (0.49-0.70), respectively.The most clinically significant AUCs were found in CD, PSO and T2D.These findings suggested that as few as five proteins may suffice for both predicting incident and diagnosing prevalent disease.However, the optimal number could be context dependent.For example, a serious disease may motivate a larger number of biomarkers than a less serious one.To address this question, we analysed the atlas to compute AUCs for different numbers of proteins and metabolites (Fig. 2B-D and Supplementary Table 2).For most of the diseases, five or fewer proteins sufficed to achieve AUCs of 0.8 or more.For example, in ASVD only three proteins resulted in an AUC of 0.88 for prevalence, namely, matrix metalloproteinase 12 (MMP12), TNF Receptor Superfamily Member 10b (TNFRSF10B), and Hepatitis A Virus Cellular Receptor 1 (HAVCR1), consistent with extant knowledge on the role of inflammation and matrix degradation in atherogenesis.However, for incidence, 18 proteins were needed to achieve an AUC of 0.8 (Fig. 2B).

Gene ontology analysis of potential protein biomarkers showed significant enrichment of a wide variety of pathways
To examine if the 30 most discriminating proteins for each disease were functionally connected, we performed a Gene Ontology Analysis (Fig. 3).Indeed, this showed significant enrichment of the proteins in a wide variety of pathways.In agreement with the general importance of the immune system 16 the term "inflammatory response" was enriched in all the diseases.However, the pathophysiological diversity of the diseases was also reflected by enrichment of pathways regulating highly diverse immunological, structural, proliferative, and metabolic functions.
Figure 1.Overview of the study.(A) Genomic, proteomic, and metabolomic data from patients with (B) nine incident or prevalent complex diseases and age/sex matched controls were (C1) analysed using cross-validation and holdout test datasets, and (C2) A polygenic risk score was computed for genomic data, while (D) feature selection was performed for proteomics and metabolomics.(E1) A machine learning model was trained and tested to (E2) construct an interactive atlas that can be found at macd.shinyapps.io/ShinyApp/.(F) In summary, we present a comparison of different omics layers, and an interactive atlas to derive context-dependent types and numbers of potential biomarkers for incidence and prevalence of the diseases.

Discussion
Our comparison of genomic, proteomic, and metabolomic data provides a systematic solution for the prioritisation of the type and number of potential biomarkers for the incidence and prevalence of nine common complex diseases.The clinical relevance lies in that prioritisation of biomarkers is complicated by each disease involving thousands of genes and gene products that can vary between the same patient before and after diagnosis, as well as between patients with the same diagnosis.Recent organome-, cellulome and genomewide studies show that the same complex diseases can involve variable cellular and molecular changes across multiple organs, and not only in the organ that shows symptoms or signs of disease 17,18 .Moreover, those changes can vary greatly between patients with the same diagnosis 19 .Thus, biomarker prioritisation based on literature, clinical experience, or omics data involves formidable challenges.The main finding of our study is that a limited number of proteins have potential for both prediction and diagnosis, representing substantial dimensionality reduction of the everexpanding pool of big data acquired from patients with these complex disorders.
From a clinical perspective, an advantage of proteins is that they can be measured with routine clinical methods.For any biomarker and disease, the optimal number of proteins is a trade-off between cost, sensitivity and specificity 15,20 .For example, the prediction or diagnosis of a serious disease may motivate a larger number of proteins than a less serious disease.We make all the molecule combinations and their AUCs available to facilitate systematic and context-dependent prioritisation of biomarkers for clinical studies.
It may be difficult to discern whether different omics layers have casual effect in relation to the disease mechanisms or rather reflect the consequences of those mechanisms.For example, we have previously found that different subtypes of diabetes have variable genetic and environmental associations 21 .This heterogeneity is particularly evident in type 2 diabetes, which has a strong genetic component.However, environmental factors, like diet also play a large role.In agreement with this, our analyses of type 2 diabetes, showed that metabolites and proteins had higher AUCs than genetic variants.While the genetic variants likely have causal roles, it is difficult to define if the metabolites or proteins change because of the environmental factors or secondary to intrinsic disease mechanisms.Further studies are warranted to offer a better understanding of the role of biomarkers in disease progression.Because of its longitudinal design, the UKBB provides a unique opportunity to identify incidence biomarkers that are potentially associated with early disease mechanisms.Proteins in blood may be particularly suitable because they reflect changes in tissues and mediate a wide range of disease-relevant functions, such as interactions between cells, immune responses, vascular functions, tissue remodelling 17,22 .
Indeed, our pathway analyses of the 30 most discriminating potential protein biomarkers showed a wide variety of pathways.Some of these were shared between diseases.For example, the general pathway term "inflammatory response" was shared between all the diseases.By contrast more specific immunological pathways were associated with immune-mediated inflammatory diseases (UC, RA, ASVD and SLE).Indeed, in agreement with previous studies more specific pathways, like type 2 interferons was only enriched in SLE 23 .Thus, the incidence biomarkers may help to discover, or prioritise among previously known, early disease mechanisms and thereby identify targets for preventive treatment.www.nature.com/scientificreports/ As one example, ASVD is an important cause of morbidity and mortality worldwide and is associated with myocardial infarction, stroke, vascular dementia, and peripheral arterial occlusive disease.Early prediction of subclinical disease and prevention or treatment are, therefore, key health care objectives with the potential to greatly reduce patient suffering 24 .The incidence biomarker proteins identified by our analysis are mechanistically rational (although they need not be a priori) and are responsible for inflammatory responses in the early plaque (CXCL17, PLAUR), stress responses after (hypoxic or inflammatory) injury (GDF15), innate immunity (WFDC2), and angiogenesis (PLAUR, WFDC2).Similarly, as discussed above, the prevalent biomarker proteins include proteins critical for matrix remodelling (MMP12, KLK4), cytokine-mediated inflammatory responses (TNFRSF10B), and Hif-1alpha-dependent angiogenesis (ADM).
Limitations of our study include the fact that proteins and metabolites in blood may not reflect diseaseassociated changes in tissues or may vary for reasons other than disease.Moreover, a limited number of proteins and metabolites were analysed, with technologies that could have method-dependent variations.Variable numbers of patients and controls were used for different omics layers, and thus, the statistical power of proteomics is lower than the other layers.However, to reduce the risk of spurious findings or overfitting the analyses, our analytical strategy was based on dividing each diagnostic group into a training and a test set.In support of the pathophysiological relevance of the analysed proteins, Gene Ontology Analyses showed that they were significantly enriched for a wide variety of pathways.This supported that they were functionally related rather than random findings.However, another limitation is that the UKBB mainly consists of European participants with restricted age ranges.Additionally, the UKBB participants have been shown to be potentially biased due to selective participation 25 .Given all these limitations, the prediction potential of each biomarker combination should be evaluated by independent studies.However, the focus of our study, comparing different omics layers, should be less effected by the abovementioned limitations.These limitations, however, point to how our atlas can be exploited to facilitate future studies of (1) more diverse populations, starting with targeted analyses of prioritised biomarkers rather than with more costly omics technologies; (2) how to combine and integrate prioritised biomarkers with routine laboratory measurements as well as clinical symptoms and signs.As mentioned above proteins may be particularly suitable to integrate in clinical laboratory analyses, because they can be measured with routine methods.From a clinical perspective, the relevance of integration of novel biomarkers with other types of routine variables is consistent with the clinical experience that physicians are trained to make diagnostic and therapeutic decisions based on combining biomarkers with routine clinical variables; (3) investigation of data integration methods to construct classifiers with the capacity to incorporate different omics layers; and (4) incidence biomarkers to predict and potentially prevent complex diseases.The importance lies in the fact that many complex diseases have vague or no symptoms at early stages but are easier to treat at that point rather than at later stages.A well-known example is how biomarkers for early diagnosis have greatly improved the treatment of RA.However, finding targets for early treatment is complicated by systematic studies of early mechanisms being difficult to perform in human subjects before diagnosis.Thus, transformative changes of clinical practice based on biomarkers will require consorted multi-disciplinary efforts across multiple diseases, including analysing larger numbers of different types of different molecules.
In summary, we found that a limited number of proteins in blood may suffice for the early prediction and diagnosis of complex diseases.We make those proteins, as well as metabolites and genetic variants associated with the PRS of the diseases, available in the form of an interactive atlas for future studies to evaluate their potential.

Data source and participants
Participants of this study were a part of the UKBB dataset, a large prospective cohort study consisting of more than 500,000 participants recruited in the United Kingdom 7 .Full details of the UKBB study can be found on the UKBB website (https:// bioba nk.ndph.ox.ac.uk/ showc ase/).UKBB received ethical approval from the National Information Governance Board for Health and Social Care and the National Health Service Northwest Multi-Center Research Ethics Committee 7 .All participants gave informed consent through electronic signatures before enrolment in the study.This research has been conducted under approved UKB Project ID 102162.The follow up of the individuals was until the 31st of October 2022.
The specific data fields used in this analysis were nuclear magnetic resonance (NMR) metabolomics, proteomics, imputed genomic data, date of recruitment, age, sex, date of diagnosis, and Diagnostic Codes-ICD10.
All methods were performed in accordance with the relevant guidelines and regulations or declaration of Helsinki.

Data processing
NMR spectroscopy measurements took place between June 2019 and April 2020 (Phase 1), April 2020 and June 2022 (Phase 2), using eight spectrometers at Nightingale Health based in Finland.The metabolic biomarkers are involved in multiple metabolic pathways, including lipoprotein lipids in 14 subclasses, fatty acids, and fatty acid composition, as well as various low-molecular-weight metabolites, such as amino acids, ketone bodies, and glycolysis metabolites quantified in molar concentration units.The dataset comprised 249 NMR metabolite measurements along with their associated quality control (QC) matrices.Of these measurements, 168 were absolute, and 81 were ratios.Data preprocessing, technical variation removal, and computation of an additional 76 biomarkers from the post-QC dataset were conducted using the ukbnmr package (version 2.0) 26 .Consequently, 325 metabolite measurements were utilised in subsequent analyses.For individuals with both repeat assessments (2012-2013) and baseline assessments (2006-2010), only the baseline data were retained.Repeated measures were based on the Eid and visit_index columns from the UKBB dataset.Technical variations were removed using the updated Algorithm version 2 in the ukbnmr V2 package, where well positions within each www.nature.com/scientificreports/batch were separately considered and adjusted.Further details on this approach can be found at the ukbnmr GitHub repository.Proteomic profiling of blood plasma samples was collected during participant visits between 2006 and 2010 (UKBB dataset field: 53) using the Olink Explore 1536 platform, measuring 1472 protein analytes and capturing 1463 unique proteins.The criteria for participant inclusion in the UK Biobank Pharma Proteomics Project (UKB-PPP) and the specifics of the proteomics assays and normalisation processes are detailed in an earlier study 8 .
Genetic data was downloaded from UKBB.The genotyping and imputation (and quality control) were performed by the UKBB 27 .Genome-wide data available from the UK Biobank v3 imputed data in BGEN v1.2 format.
The NMR measurement, proteomic profiling, and genomic data were processed using code 3, 143, and code 87, which enabled the decoding of the data.

Patient stratification
The 10th revision of International Classification of Diseases (ICD10) codes was used to assess the diagnosis of the patients.We identified patient groups based on ICD10 codes in the hospital inpatient data (UKBB datasets field: 41270), which is curated from UKBB as provided.In the analysis, we included Crohn's disease (CD) (K50), ulcerative colitis (UC) (K51), psoriasis (L40), systemic lupus erythematosus (SLE) (M32), chronic obstructive pulmonary disease (COPD) (J449), obesity (E66), type 2 diabetes (T2D) (E11), atherosclerotic vascular disease (ASVD) (I70), and rheumatoid arthritis (RA) (M05, M06).In case of genomics, we only selected patients with European origin.We distinguished incident and prevalent cases based on the earliest reported data across the respective date of first inpatient diagnosis (fields 41262 and 41280) columns from the UKBB.Thus, individuals receiving a diagnosis after the time of sampling were labelled incident cases, and patients diagnosed before or at the time of sampling were classified as prevalent cases.For each group of patients, we identified a group of healthy controls, which were defined as all patients without any disease code.We used the MatchIt package 28 to match the healthy control cases with the prevalent and incident cases based on age and sex.The match was performed by the nearest neighbour method using Euclidean distance as a measure of similarity.We used exact matching for sex.

Weighted PRS analyses
The known PRS from the polygenic score catalog for each disease was used to determine the genetic contribution to the probability of developing the disease 14 .The known PRS and genetic variants for each disease can be found in Supplementary Table 1.IMIDs were RA (PGS000194), SLE (PGS000328), UC (PGS001306), CD (PGS001331), and PSO (PGS002293).Chronic diseases included COPD (PGS001332), T2D (PGS000864), obesity (PGS000848), and ASVD (PGS000863).Imputed genetic data (930995623 genetic variants) from UKBB were used to calculate the corresponding PRSs (1595 genetic variants in PRSs) for everyone in each disease separately.Genetic variants for each PRS were pruned for linkage disequilibrium (r2 = 0.5, 250-kb window in PLINK).The weighted PRSs were calculated for all participants included in the analysis of each disease using the standard formula in PLINK v1.9 software 29 .All analyses of the PRSs included participants of European origin only and were adjusted for the first 3 genetic principal components supplied by the UKBB quality control files 21 .The principal components were generated by UKBB 27 .The list of SNPs for each disease is available in supplementary Table 3.

Modelling the probability of health status
To construct the classifiers, we created a pipeline consisting of data cleaning, data imputation, feature selection, training of the classifier and summary of results.The pipeline was created with the following key ideas that we wanted to achieve: (1) Interpretability-we aimed to construct a pipeline that would be understandable by interdisciplinary researchers as well as clinicians.(2) Scalability-we aimed to use methods that would be applicable on diseases with relatively low number of patients (such as ASVD) as well diseases with relatively many patients (such as T2D), (3) Robustness-we aimed to develop a pipeline that would be robust in terms of replicability of the results.In detail, we first removed all molecules with more than 10% NA values and divided the data into training (70%) and testing (30%) groups.We then trained a KNNImputer 30 method to impute NA values.The choice of the KNN method was based on the computational efficiency, simplicity and a recent study 31 , which suggested that KNN has a comparable performance to other more complex methods for the continuous data, in our case proteomics and metabolomics, with a low amount of missing values.Next, we applied the extremely randomised trees (ERT) 32 method for feature selection.We used 10,000 trees, and the main motivation for using ERT was the reduced impact of multicollinearity, which is handled by using multiple trees and utilising random splits.The feature selection ranked molecules from the most important to least important.For the downstream analysis, we chose only the N most important features, where N (1-30) was selected by the user.To train the prediction model, we used a logistic regression with a ridge (L2) penalty 33 , with the N most important proteins as features and binary disease status as the response variable.Even though it may slightly reduce the performance of the model, the use of logistic regression with the L2 penalty over more complex nonlinear methods is motivated by the interpretability of the method, which we consider essential in clinical research.Furthermore, logistic regression is a tool with which many clinicians are familiar and provides us with the possibility to assess whether the feature has a positive or negative association with the disease.The disease status was dependent on the user input.In the case of 'incident' , the aim of the model was to discriminate the incident cases from healthy controls matched by age and sex.Analogously, when setting 'prevalent' , the aim was to discriminate prevalent cases from their respective healthy controls.We used cross-validation as implemented in the LogisticRegressionCV 34 method from the sklearn package to optimise the L2 penalty factor.All the above was performed per training dataset.To measure the performance of the whole pipeline, we applied the trained imputation model, downsampled the feature set based on the trained ERT model and applied the trained logistic model to the test datasets and presented the receiving operating characteristic (ROC) curve together with its corresponding area under the

Figure 2 .
Figure 2. (A) Violin plots showing the probabilities of incident and prevalent diseases, as well as healthy controls, based on PRS scores and five molecules derived from proteomic and metabolomic data.The centre line corresponds to median; (B) line plots showing the relationship of the AUC of incident and prevalent diseases and the number of molecules used in the model; (C) top 10 proteins for the ASVD prevalence, biomarkers with positive coefficients may have disease-inducing roles, while negative coefficient indicate protective roles; (D) the ROC for ASVD prevalence.ASVD = Atherosclerotic Vascular Disease, CD = Crohn's Disease, COPD = Chronic Obstructive Pulmonary Disease, PSO = psoriasis, RA = Rheumatoid Arthritis, SLE = Systemic Lupus Erythematosus, T2D = Type 2 Diabetes, UC = Ulcerative Colitis.*P < 0.05, **P < 0.01, ***P < 0.001, ****P < 0.0001.T test, adjusted P values.