Blood protein assessment of leading incident diseases and mortality in the UK Biobank

The circulating proteome offers insights into the biological pathways that underlie disease. Here, we test relationships between 1,468 Olink protein levels and the incidence of 23 age-related diseases and mortality in the UK Biobank (n = 47,600). We report 3,209 associations between 963 protein levels and 21 incident outcomes. Next, protein-based scores (ProteinScores) are developed using penalized Cox regression. When applied to test sets, six ProteinScores improve the area under the curve estimates for the 10-year onset of incident outcomes beyond age, sex and a comprehensive set of 24 lifestyle factors, clinically relevant biomarkers and physical measures. Furthermore, the ProteinScore for type 2 diabetes outperforms a polygenic risk score and HbA1c—a clinical marker used to monitor and diagnose type 2 diabetes. The performance of scores using metabolomic and proteomic features is also compared. These data characterize early proteomic contributions to major age-related diseases, demonstrating the value of the plasma proteome for risk stratification.


Sample selection
A complete summary of the sample selection, processing and quality control details for the UK Biobank PPP proteomics samples is available in Sun et al, 2022 1. Consortium members chose samples that were enriched for specific diseases of interest. The remainder of the population was randomly sampled, stratified by age, sex and recruitment centre. Day of the week of collection, deprivation index and participant ethnicity were confirmed as representative of the wider UK Biobank cohort. Inclusion
Panels contained dilution blocks to account for the range of proteins present. Samples were serially diluted to 1:10, 1:100 and 1:1000 and transferred to the 384-well plates, which had four blocks for each set of 96 samples. Matched antibodies are labelled with complementary oligonucleotides that bind to the target protein in the sample. Hybridization of the probes can thus be recorded through DNA amplification using polymerase chain reaction (PCR 1) to create amplicons for protein assays. Amplicons were combined across each of the four abundance groups, resulting in one well of amplicons per sample. This signal is quantified using next generation sequencing and validated using methods that have been previously reported 2,3.

Assessment of technical and genetic effects
To assess the potential impact of protein processing batch (0-7), study centre (1-22) and 20 genetic principal components, protein levels were regressed onto these variables and the residuals were correlated with the original protein levels. Across the 1,468 proteins tested, the lowest Pearson correlation was 0.94, indicating that these factors had minimal influence on protein levels. Cox PH models therefore did not incorporate them as covariates. This supports the extensive characterisation previously reported by Sun et al 2023 1, which suggested that the proteomic data in the UK Biobank PPP sample do not have pronounced plate-, batch- or study site-specific variability.
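The residual check described above can be illustrated with a minimal sketch. It is deliberately simplified: it adjusts one protein for a single categorical factor (batch) by subtracting batch means, which is equivalent to ordinary least squares on batch indicator variables, whereas the analysis above also included study centre and 20 genetic principal components. The function names and toy values are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def residualise_on_factor(values, groups):
    """Remove each group's mean from its members (OLS on group dummies)."""
    by_group = defaultdict(list)
    for value, group in zip(values, groups):
        by_group[group].append(value)
    group_means = {g: sum(vs) / len(vs) for g, vs in by_group.items()}
    return [v - group_means[g] for v, g in zip(values, groups)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy data (hypothetical): one protein measured across two processing batches
protein = [0.2, 1.1, -0.4, 0.9, 1.8, 0.3, 1.2, -0.1]
batch = [0, 0, 0, 0, 1, 1, 1, 1]

residuals = residualise_on_factor(protein, batch)
r = pearson(residuals, protein)
print(f"Correlation between residuals and original levels: {r:.3f}")
```

A correlation close to 1 means that removing the batch effect barely changes the protein values, which is the reasoning behind omitting these variables from the Cox PH models.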

Summary of incident disease derivation
Cancer diagnoses were sourced from the cancer registry made available by the UK Biobank (field ID 100092). First occurrence traits made available by the UK Biobank were used to define non-cancer disease diagnoses (field ID 1712). First occurrence traits integrate self-report at baseline with electronic health linkage to ICD9, ICD10 and GP read2/3 codes from healthcare providers across the United Kingdom to identify the earliest date of a given diagnosis for an individual. Self-report data were recorded at the baseline clinic visit through a touchscreen and were then confirmed via verbal interview with a nurse. Any diagnoses included as ICD codes on death registry information (field ID 100093) from the UK Biobank were also integrated. The UK Biobank provides dates of data availability for each data provider at: https://biobank.ndph.ox.ac.uk/ukb/exinfo.cgi?src=Data_providers_and_dates.
Censoring dates for cancer outcomes were set to 2016, which is the earliest date of complete data availability across all providers listed in this guidance. Censoring dates were set to October 2021 for non-cancer diseases and November 2021 for death, which were the dates of the data extractions used.
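As an illustration of how an administrative censoring date turns a first-occurrence date into a time-to-event outcome, a minimal sketch follows. The function name, the participant dates and the exact day of the month used for the censoring date are assumptions made for the example; only the month and year are stated above.

```python
from datetime import date

# Assumption: the day of the month is illustrative; only "October 2021" is stated.
CENSOR_NON_CANCER = date(2021, 10, 31)


def time_to_event(baseline, diagnosis_date, censor_date):
    """Return (follow-up in years, event indicator) for one participant.

    A diagnosis on or before the censoring date counts as an event;
    otherwise follow-up ends at the censoring date with no event.
    """
    if diagnosis_date is not None and diagnosis_date <= censor_date:
        end, event = diagnosis_date, 1
    else:
        end, event = censor_date, 0
    years = (end - baseline).days / 365.25
    return years, event


# Hypothetical participants: one diagnosed during follow-up, one never diagnosed
case_years, case_event = time_to_event(
    date(2008, 1, 1), date(2015, 1, 1), CENSOR_NON_CANCER
)
ctrl_years, ctrl_event = time_to_event(date(2008, 1, 1), None, CENSOR_NON_CANCER)
print(case_years, case_event, ctrl_years, ctrl_event)
```

The sketch does not cover death as a competing end of follow-up or diagnoses that precede baseline, both of which a full derivation would need to address.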

Medication use
Medication self-report at baseline was extracted using field ID 20003 from the UK Biobank.

MethylPipeR R package information
MethylPipeR is an R package that facilitates systematic and reproducible development of

ProteinScore testing covariate preparation
Three sets of increasingly complex covariates were used to model the difference in AUC resulting from the addition of ProteinScores. In addition to age and sex (which were available for all individuals), a further set of 24 covariates was considered. These included the six lifestyle covariates modelled in individual Cox PH analyses (BMI, smoking status, alcohol consumption, social deprivation, education status and physical activity). For the extended set, 18 clinically relevant covariates were selected from the UK Biobank biomarker panel. These have previously been integrated in metabolomics prediction studies of incident disease in the UK Biobank 6 and represent a comprehensive set of measures that are theoretically possible to generate in clinical settings (although not all biomarkers are typically generated as part of clinical practice, as disease-specific biomarkers are often run as isolated tests in specific circumstances). When the 24 variables were considered across the 47,600 individuals, 4,163 individuals had >10% missingness and were excluded.
In the remaining 43,437 individuals, none of the covariates had >10% missingness, so all were retained. Age, sex, six lifestyle covariates and an extended set of 18 covariates were taken forward for ProteinScore testing. Missing covariate information for continuous traits was imputed through k-nearest neighbour (kNN) imputation and these variables were log transformed, whereas categorical and binary variables were imputed through median imputation.
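The per-individual missingness filter and the median imputation step can be sketched as follows. This is a minimal illustration under stated assumptions: the covariate names and toy records are hypothetical, `median_low` is used so that the imputed value is always an observed category code (a choice made for this example), and the kNN imputation and log transformation applied to continuous traits are not shown.

```python
from statistics import median_low


def missing_fraction(record):
    """Fraction of covariates that are missing (None) for one individual."""
    return sum(value is None for value in record.values()) / len(record)


def drop_high_missingness(records, threshold=0.10):
    """Keep individuals whose missingness does not exceed the threshold."""
    return {
        pid: rec for pid, rec in records.items()
        if missing_fraction(rec) <= threshold
    }


def impute_median(records, covariate):
    """Fill missing values of one coded covariate with the observed median."""
    observed = [
        rec[covariate] for rec in records.values() if rec[covariate] is not None
    ]
    fill = median_low(observed)
    for rec in records.values():
        if rec[covariate] is None:
            rec[covariate] = fill
    return fill


# Toy data (hypothetical): four individuals, ten binary covariates each
names = [f"cov{i}" for i in range(10)]
records = {
    "p1": {name: 1 for name in names},
    "p2": {name: 0 for name in names},
    "p3": {name: 1 for name in names},
    "p4": {name: 1 for name in names},
}
records["p2"]["cov0"] = None                          # 10% missing: retained
records["p3"]["cov0"] = records["p3"]["cov1"] = None  # 20% missing: excluded

retained = drop_high_missingness(records)
fill_value = impute_median(retained, "cov0")
print(sorted(retained), fill_value)
```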
This covers only a portion of the UKB sample (376,448 individuals), which, when subset to the population of 47,600 with protein measures available in the present study, results in 35,073 individuals. To model medication use, 124,198 medication name instances that were recorded in the 35,073 individuals were condensed into unified classes of action, using the anatomical therapeutic chemical (ATC) classification categories. This coding system was previously included in the GWAS of medication classes performed by Wu et al 4. The frequency of these medications grouped into 849 ATC classes in the population of 35,073 individuals is summarised in Supplementary Table 10. Blood-pressure lowering medication was defined using the following ATC codes: Antihypertensives (ATC code C02) = 803 individuals, Diuretics (ATC code C03) = 4,227 individuals, Beta blockers (ATC code C07) = 3,660 individuals, Calcium channel blockers (ATC code C08) = 3,674 individuals, Renin-angiotensin system agents (ATC code C09) = 7,288 individuals, Statin use (ATC code C10AA) = 8,351 individuals. Taken together, 14,074 individuals (of the 35,073) indicated they were taking one or more of the above blood-pressure lowering medications at baseline. This was treated as a binary variable, and ischaemic heart disease Cox PH associations were compared with and without adjustment for this variable in the subset of 35,073 individuals. Adjustments for age, sex and six lifestyle factors were included in both sets of analyses, with 2,456 cases and 27,468 controls.
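Deriving the binary medication variable from ATC codes can be sketched as a prefix match against the six code groups listed above. The function name and the participant records are hypothetical, and the sketch assumes each participant's self-reported medications have already been mapped to ATC codes.

```python
# ATC code prefixes listed above for the blood-pressure lowering definition
BP_LOWERING_PREFIXES = ("C02", "C03", "C07", "C08", "C09", "C10AA")


def takes_bp_lowering(atc_codes):
    """Return 1 if any ATC code falls under a listed prefix, else 0."""
    return int(any(code.startswith(BP_LOWERING_PREFIXES) for code in atc_codes))


# Toy data (hypothetical): ATC codes per participant after medication mapping
participants = {
    "p1": ["C07AB02", "N02BE01"],  # beta blocker plus an analgesic
    "p2": ["N02BE01"],             # analgesic only
    "p3": ["C10AA05"],             # statin
}

flags = {pid: takes_bp_lowering(codes) for pid, codes in participants.items()}
print(flags)
```

Because ATC codes are hierarchical, matching on the leading characters captures every substance within a class without listing each one.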
have applied MethylPipeR to incident type 2 diabetes prediction considering DNA methylation sites as informative features 5. However, MethylPipeR allows Cox PH penalised regression models to be run for any input features of interest. Input features are provided to the model in the training sample (in this case, the measurements of 1,468 protein analytes available in the UK Biobank PPP consortium sample), and the features that are predictive of the outcome are selected and assigned weighting coefficients. These coefficients can then be applied to the test sample to project scores and assess performance.
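The projection step, in which coefficients learned in the training sample are applied to test-sample measurements, amounts to a weighted sum over the selected features. The sketch below illustrates that arithmetic only; it is not the MethylPipeR interface, and the protein names, coefficients and levels are hypothetical.

```python
def protein_score(weights, protein_levels):
    """Weighted sum of protein levels over the features selected in training."""
    return sum(coef * protein_levels[name] for name, coef in weights.items())


# Hypothetical coefficients retained by a penalised Cox PH model
weights = {"PROT_A": 0.5, "PROT_B": -0.25}

# Hypothetical test-sample individual; PROT_C was not selected, so it is ignored
test_individual = {"PROT_A": 2.0, "PROT_B": 2.0, "PROT_C": 1.0}

score = protein_score(weights, test_individual)
print(score)
```

Features whose coefficients were shrunk to zero by the penalty do not appear in `weights` and therefore do not contribute to the score.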