Machine learning optimized polygenic scores for blood cell traits identify sex-specific trajectories and genetic correlations with disease

Summary Genetic association studies for blood cell traits, which are key indicators of health and immune function, have identified several hundred associations and defined a complex polygenic architecture. Polygenic scores (PGSs) for blood cell traits have potential clinical utility in disease risk prediction and prevention, but designing PGS remains challenging and the optimal methods are unclear. To address this, we evaluated the relative performance of 6 methods to develop PGS for 26 blood cell traits, including a standard method of pruning and thresholding (P + T) and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer perceptron (MLP) and convolutional neural network (CNN). We evaluated these optimized PGSs on blood cell trait data from UK Biobank and INTERVAL. We find that PGSs designed using common machine learning methods EN and BR show improved prediction of blood cell traits and consistently outperform other methods. Our analyses suggest EN/BR as the top choices for PGS construction, showing improved performance for 25 blood cell traits in the external validation, with correlations with the directly measured traits increasing by 10%–23%. Ten PGSs showed significant statistical interaction with sex, and sex-specific PGS stratification showed that all of them had substantial variation in the trajectories of blood cell traits with age. Genetic correlations between the PGSs for blood cell traits and common human diseases identified well-known as well as new associations. We develop machine learning-optimized PGS for blood cell traits, demonstrate their relationships with sex, age, and disease, and make these publicly available as a resource.

In brief Xu et al. develop and validate polygenic scores (PGSs) for 26 blood cell traits using 6 PGS methods. PGSs developed using machine learning methods show improved polygenic prediction and allow for jointly modeling the effect of correlation, interaction, and low MAF variants. Blood cell trait PGSs were used to stratify the age-based trajectories of blood cell trait levels and showed genetic correlation with common diseases.

INTRODUCTION
Blood cells play essential roles in a variety of biological processes, such as oxygen transport, iron homeostasis, and pathogen clearance. [1][2][3] Abnormalities in blood cell traits, such as the number of cells, the proportions of different types, sizes, and morphology, and thus their likely functions, have been associated with a range of human diseases, such as reticulocyte indices with coronary heart disease 4 or eosinophil counts with asthma. 5 As such, blood cell counts and associated traits are also widely used in clinical practice, where they are among the most common clinical tests worldwide.
Blood cell traits are heritable, and their genetic architecture has been found to be polygenic. Analyses of the UK Biobank (UKB) 6,7 and INTERVAL 8 cohorts have suggested that between 18% and 30% of the variance in erythrocyte counts and morphology can be explained by hundreds of common autosomal variants. 4 It is expected, therefore, that levels of these traits can, to some extent, be predicted by genetic variation through the use of polygenic scores (PGSs) 9 .
PGSs for blood cell traits show the potential for utility in clinical practice. A recent study examining the effects of known pathogenic variants and blood cell trait PGSs on patients with rare blood disorders showed that a 1-standard deviation (SD) increase in PGS was comparable in risk to carrying a rare coding variant in heterozygosity. 10 These results indicate that PGSs for blood cell traits could play important roles in disease risk prediction and prevention, or help in better understanding disease etiology and identifying novel therapeutic targets. 11,12 A PGS is most commonly constructed as a weighted sum of genetic variants, typically single-nucleotide polymorphisms (SNPs), carried by an individual, in which the genetic variants are selected and their weights are set via the per-SNP univariate analysis in a genome-wide association study (GWAS). 9,13 Univariate analysis largely relies on hard cutoff thresholds to identify associated variants-for example, linkage disequilibrium (LD) pruning for selection of independent variants 14 and p value thresholding for selection of significant variants (the P + T method). However, standard methods such as P + T have limitations, including that they do not capture interactions between variants. Machine learning and deep learning methods may provide significantly improved polygenic scores for blood cell traits, as has been demonstrated in applications for celiac disease and type 1 diabetes, [15][16][17][18][19] thus facilitating analyses into the genetic architecture of blood cell traits and their relationships with sexand age-specific effects and the genetics of common diseases.
In this study, we evaluate 6 PGS methods to develop optimized PGS for 26 blood cell traits across 3 blood cell types-platelets, red blood cells, and white blood cells-using data from UK Biobank and INTERVAL (see Figure 1 for study workflow). The 6 PGS methods evaluated in this study include the pruning and thresholding (P + T) method and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer preceptron (MLP), and convolutional neural network (CNN). Our analysis finds that common machine learning methods EN and BR show improved polygenic prediction of blood cell traits and consistently outperform other methods. We assess the compositions of these blood cell trait PGSs and discover that the benefits of EN and BR are in jointly modeling the effect of correlation, interaction, and low minor allele frequency (MAF) variants. Our analyses suggest that the EN and BR methods are the top choices for PGS construction of blood cell traits when sufficient individual-level data are available. When there is no sufficient individual-level data available, LDpred2 is also a good option. We investigate the interactions of PGS with sex as well as stratification of measured blood cell traits across ages. Finally, we perform a genetic correlation scan of blood cell trait PGSs across diverse common diseases. We make the machine learning-optimized PGS models publicly available via the PGS Catalog 20 to facilitate genetic and clinical studies on blood cell traits and associated diseases.

Development of blood cell trait PGSs
Using the optimal variant set (i.e., conditional analysis variants, see STAR Methods) identified by the P + T method, we compared the performance of the 5 learning methods with that of P + T for constructing PGSs for 26 blood cell traits ( Figure 2). Four of the 5 methods, EN, BR, LDpred2, and MLP, consistently outperform the P + T method in terms of Pearson r for nearly every blood cell trait. Notably, the performance of EN and BR were nearly indistinguishable and were the most stable as well as the top-performing methods overall. Although LDpred2 outperformed other learning methods in the internal validation across the majority of the traits, its outperformance largely declined in the external validation with similar or slightly better performance for most traits and notable underperformance for a few traits when compared with EN and BR (e.g., basophil percentage of white cells [BASO%]). With any of these 4 learning methods, PGSs for 11 blood cell traits achieved a nearly R0.02 increase in Pearson r score in internal validation. The following 5 blood cell traits each achieved R0.02 improvement in both internal and external validation using EN or BR, in comparison with the P + T method ( (Figure 2). We found that the incorporation of nonlinear factors, as in MLP and CNN, did not improve genomic prediction of blood cell traits, compared with linear models. For nearly half of the blood cell traits we studied, the CNN resulted in PGS with approximately the same or lower Pearson r as the P + T approach.
Comparing estimated SNP effect sizes between univariate analysis and machine learning BR and EN outperformed P + T due largely to differences in variant effect size estimation; thus, we compared the variant effect sizes estimated by univariate analysis (used in the P + T method) and the EN/BR methods ( Figure S1). We found that almost no effect sizes were set to zero by BR and EN, and effect sizes of most variants using BR or EN are the same or similar to those from the univariate analysis in GWAS. This is consistent with a genetic model in which most common genetic variants are independently and additively contributing to each blood cell trait. In addition, we also found that both EN and BR tended to shrink the effects (sometimes greatly) of variants with low MAF compared with that estimated in univariate analysis; however, this did not necessarily contribute to substantially improved PGSs. For example, we observed the effects of numerous low-MAF variants for traits such as mean corpuscular volume (MCV) and mean corpuscular hemoglobin concentration (MCHC) were substantially shrunk by BR and EN; PGS construction of MCV achieved significant improvement ($0.03 increase in Pearson r score), while PGS for MCHC saw little improvement in internal validation ( Figure 2). In spite of that, the effect of shrinkage of low-MAF variants can result in better model generalization, which means they can offer more stable predictions when applied across datasets. This is likely due to the substantial noise in univariate variants for each trait included many correlated variants, with those with r 2 > 0.1. As expected, for variants in moderate to high LD, EN and BR tended to assign weights that were more different from univariate analysis than for variants not in LD, with some variant effects even changing direction ( Figure S1).
In addition, univariate analysis does not model SNP-SNP interaction effects on the trait; however, modeling interactions can improve PGS construction. 21 For all CA variants, we performed a SNP-SNP interaction analysis on each trait using a Bonfer-roni-adjusted threshold to determine significant interactions (STAR Methods). We found that significant SNP-SNP interactions tended to include SNPs that had different weights when comparing EN/BR to univariate analysis ( Figure S1)-for example, in MCV and mean corpuscular hemoglobin (MCH). The increased performance of EN and BR relative to univariate analysis appeared to be due to the weights assigned for the 3 groups of variants above (low MAF, moderate to high LD, or SNP-SNP interactions). Pearson r score performance of the P + T method for PGS construction of 26 blood cell traits are presented in testing on UKB or INTERVAL. Relative to the P + T method, performance of the 5 learning methods: EN, BR, LDpred2, MLP, and CNN, are presented for each blood cell trait in descending order, left to right, according to EN (largest Pearson r increases on left). Given a particular method, a trait and a cohort, the averaged r performance of the 5 trained models, corresponding to the 5 different training-testing data partitions, is shown. Detailed comparison between variant effect sizes estimated using EN/BR and P + T are presented in Figure S1

EN and LDPred2 improve blood cell trait PGSs on larger variant sets
The above analysis has shown that EN, BR, and LDpred2 are the most promising methods for PGS construction of blood cell traits. We further explored potential improvements by incorporating larger variant sets for LDpred2 and EN (because EN and BR performed nearly identically, we only present results for EN). Our results showed that EN further improved PGS for almost every trait with the expanded variant sets, when compared with EN using conditional analysis variants only (Figures 3 and S2). For example, when incorporating all variants genome-wide (i.e., no p value threshold), 25 of the 26 traits had a further improvement of at least 0.02 in terms of r score in the external validation, in which 6 traits achieved >0.05 additional improvement. When compared with the P + T method, EN achieved greater improvements for PGSs of most blood cell traits by using the largest expanded variant set. For example, 25 of the 26 traits had R10% improvement over their r scores achieved using P + T, in which 2 traits had >20% improvement (i.e., hematocrit [HCT; from 0.24 to 0.30] and WBC# [from 0.32 to 0.39]).
By applying different p value thresholds on all the LD-thinned variants, we also obtained smaller variant sets for each trait. Our results suggested that it is possible that by using a smaller variant set with p value thresholding, EN can achieve performance comparable to when using the largest expanded variant set. For example, the performance differences in terms of r score using EN are within 0.01 for all of the traits between the smaller variant set with p value threshold = 10 À4 and the largest variant set (INTERVAL external validation). Using a more lenient p value threshold (i.e., p value threshold = 10 À2 in this study) can result in overfitting problems. By incorporating the variant set with a p value threshold of 10 À2 , EN significantly outperformed that of using other variant sets in the internal validation with UKB, while it experienced a substantial performance decrease in the external validation with INTERVAL (Figures 3 and S2), with some models even underperforming the P + T method. However, the use of overly stringent thresholds (e.g., p value threshold = 10 À6 or lower) could limit the predictive power of EN.
LDpred2 also showed improved performance for PGSs of 24 traits when using the expanded variant sets with more stringent p value thresholds (i.e., p value thresholds = 10 À6 and 10 À4 ) as compared with using CA variants only. However, LDpred2 models showed overfitting on variant sets with lenient or no p value thresholding (i.e., p value thresholds = 10 À2 and 1.0). Nevertheless, EN consistently outperformed LDpred2 in the external validation with INTERVAL data on almost every expanded variant set for every trait (Figure 3). For example, EN outperformed LDpred2 by >0.02 in terms of r score for 9 traits on the top-performing variant set (p value threshold of 10 À4 ) of both methods. In addition, LDpred2 failed to construct a PGS for the trait MPV on 2 expanded variant sets, indicating that EN may be more robust when using large variant sets.

Sex-specific interactions and PGS-stratified trajectories
Maximizing the accuracy and performance of PGS for blood cell traits raises opportunities for insights into the underlying biology, which is potentially of relevance to disease risk. We next compared the extent to which EN-trained PGS would be used to stratify the levels of blood cell traits in men and women over the age ranges of individuals in INTERVAL (Figures 4 and  S3). There were a wide range of age-dependent dynamics in the levels of many blood cell traits in INTERVAL, with the ENtrained PGS (p value threshold = 1 in variant selection) offering stratification that was largely consistent with Pearson r of the trait (i.e., the larger the Pearson r PGS of the trait received, the better the PGS stratified the population). Blood cell traits exhibited well-known sex differences. 22 Interestingly, PGS for approximately half of blood cell traits resulted in different levels of stratification between men and women, with 10 blood cell traits passing the Bonferroni-adjusted significance threshold in PGS-sex interaction analyses (Table 1). For example, WBC indexes in women significantly decrease after menopause, while the level of these traits in men were relatively stable. 23 Importantly, in both men and women, the EN-trained PGS continued to stratify the trait levels even after the trait levels themselves changed. The average trait levels in the top versus the bottom PGS quintiles were substantially different. The top quintile of the PGS for WBC# had an additional $1.5 WBCs per nanoliter (nL) on average in INTERVAL compared to the bottom quintile (an increase of $25%); similarly, the difference between the top versus the bottom 1% PGSs for WBC# was $2.2 WBCs per nanoliter (a 40% increase). For MCV, individuals in the top PGS quintile had red blood cells with $5 femtoliters (fL) greater volume on average than those in the bottom PGS quintile, and these differences were maintained over all age ranges for both men and women.
Genetic correlations of blood cell traits and common diseases Finally, we examined the landscape of genetic correlations for the EN-trained PGS of blood cell traits and PGS of several common human diseases ( Figure 5). We found 67 genetic correlations passing Bonferroni adjusted significance (p < 10 À4 ), which are consistent with well-known associations between the blood cell traits themselves and the disease. For example, prior studies have demonstrated a strong association of asthma with eosinophil indices, 4 consistent with our analyses, which show that PGSs for eosinophil counts (EO#) and eosinophil percentages (EO%) were correlated with the asthma PGS. The strongest genetic correlation was between schizophrenia and WBC#, consistent with previous studies of the trait and schizophrenia risk. 24 Our analyses also uncovered the genetic correlations for previous trait-level observations for EO# and allergic disease 25 as well as WBC# and Crohn's disease. 26 In addition to the wellknown associations between blood cell traits and common diseases, the genetic association scan also identified new associations. For example, PGS of the immature fraction of reticulocytes (IRF) was significantly associated with the coronary artery disease (CAD) PGS, which is related to a recent finding that reticulocyte levels have an ambivalent association with hypertension and atherosclerosis 27 ; the PGS of MONO# was significantly associated with the schizophrenia PGS, which can be supported by the inflammation hypothesis in the pathogenesis of schizophrenia. 28  Using conditional analysis variants as a base set, we added in the selected variant sets with LD thinning and p value thresholding to form different sizes of expanded variant sets for each trait. We used the CA variant set as the starting point and then observed the performance of P + T, EN, and LDpred2 on these expanded variant sets. Note that in this figure, P + T refers to the method that directly applies the weighted sum on a given variant set with effect sizes from GWAS. See Figure S2 for similar performance comparison in UKB.
6 Cell Genomics 2, 100086, January 12, 2022 Article ll OPEN ACCESS and Crohn's disease, and suggest that via shared genetics, blood cell traits may be either indicators or mediators.

DISCUSSION
Improved polygenic models of blood cells traits aid our understanding of myriad biological processes and diseases. This study demonstrated that common machine learning methods such as EN and BR show improved polygenic prediction of blood cell traits and consistently outperform other methods, likely due to their implicit modeling of SNP-SNP correlations, interactions, and controlling the effect of underrepresented low-MAF variants. We showed that blood cell trait PGSs are able to stratify age-dependent trait levels in both men and women in a population-based setting, and that many blood cell trait PGSs have sexspecific interactions. The landscape of genetic correlations between blood cell traits and common diseases identified wellknown trait-level associations, such as eosinophils and asthma, and intriguing associations such as IRF and CAD, MONO# and schizophrenia.
Our analysis indicated that EN and BR can jointly model the effect of correlation, interaction, and low MAF variants, and result in improved PGSs over P + T for blood cell traits. To overcome the scalability problem in PGS methods such as MLP and CNN and provide directly comparable results across these methods,

. Trait levels by quintiles of ENtrained trait PGS in men and women for traits MCV, WBC#, and neutrophil count (NEUT#) in INTERVAL
The y axis is the observed measurements adjusted only for technical artifacts and season for each blood cell trait. The generalized additive model (GAM) was used to fit the data across INTERVAL samples, and the shaded areas represent 95% confidence intervals. See Figure S3 for results of all other traits. the comparative analysis was conducted based on the optimal variant sets identified by the P + T method. Due to the use of hard cut-off thresholds in P + T, this unified way for variant selection may limit the potential of the regularization-based methods EN and BR, which are known for their strength in feature selection. Our follow-up analysis demonstrated including a larger set of genome-wide variants in EN further improved PGS for almost every blood cell trait. These results suggest the EN/BR method as the top choice for PGS development of these (or similar) traits when sufficient individual-level data are available. However, the increased number of input variants can cause increased requirements for the amount of training data, computational resources when using these methods, which could constrain their utilization in PGS development. To address these issues, we also demonstrate that tightening p value thresholds offers a way to not only reduce the size of input variants but also maintain its performance. However, we also find that overly stringent or lenient p value thresholds could cause either deteriorated performance or overfitting problems, so that these should be applied with caution. Our analyses suggest that selecting an appropriate p value threshold through an external validation step could be the key to the better application of these methods in PGS construction.
LDpred2 is specifically designed to consider LD correlation in polygenic prediction, but it still relies on cutoff thresholds to remove the impact of other factors such as low-MAF variants. 29 They can be potential causes for its underperformance or failures in PGS construction, which were demonstrated by the results of LDpred2 on the expanded variant sets. In spite of that, our results did suggest that LDpred2 is still a competitive option for PGS construction (compared with P + T) when appropriate pre-variant selection steps were taken (e.g., applying stringent low-MAF variant filtering and p value thresholding). It is particularly the case when there are only summary-level data available and/or there is a lack of sufficient individual-level data.
Deep learning models MLP and CNN are based on looser model assumptions than the other methods and are capable of modeling more complex relationships among data. The increased model Cell Genomics 2, 100086, January 12, 2022 7 Article ll OPEN ACCESS complexity of MLP and CNN did not result in improvements for PGS construction of blood cell traits, indicating that explicit incorporation of non-linearity factors in the two models does not offer an advantage in this setting. However, it is well known that the design of customized network structures plays a pivotal role in addressing a specific task in deep learning, which encourages us to further design and optimize networks beyond these standard structures for PGS construction. Meanwhile, it was noted that the scalability problem of existing deep learning frameworks in ultrahigh-dimensional genotype data will be a major challenge for the widespread application of neural network-based methods in PGS development. Thus, deep learning frameworks based on efficient genotype data format (e.g., bed format in Plink 30 ), may represent the future efforts of the area.
We demonstrated that population-based samples can be stratified by the PGSs of these blood cell traits, even for traits exhibiting substantial differences between ages and sexes. These observations may offer therapeutic insights. For example, it is known that some drugs, such as clozapine and dapsone, 31 have neutropenia side effects. The difference between top and bottom quintiles of the neutrophil count (NEUT#) PGS was $1,000 NEUTs per microliter; therefore, there may be clinical utility in a priori knowledge that an individual may have genetically lowered NEUT# so as to guide pharmacotherapy. It is also well known that blood cell traits are associated with the risk of some complex diseases (e.g., between eosinophil counts and asthma 5 ), which indicates that PGS may be useful in followup studies on disease risk prediction.

LIMITATIONS OF THE STUDY
MLP and CNN are two of the most common and fundamental deep learning models, and previous studies have demonstrated their potential in genetic prediction. 19,32,33 The present study is limited to two specific but common deep learning models; thus, we cannot make conclusive suggestions on the applications of the whole line of deep learning methods for PGS development of blood cell traits. Taking the optimized MLP/CNN structures further and learning from the characteristics of EN/BR methods may represent the future for designing customized deep learning methods in the field. Also, there are theoretically infinite possibilities for MLP or CNN structures, so we had to restrict the searching within a fixed set of configurations when identifying the optimal PGS model of blood cell traits. The selection for these configurations was based on recommendations in deep learning studies 34 as well as previous findings on the application of neural networks in genetic prediction. 19,32 While the adopted configurations have wide coverage of common MLP/CNN structures, it is still possible that there are other MLP/CNN structures that can construct better PGSs but are not included in the study.
In addition, while the present study focused on PGS development of blood cell traits, the method comparison analysis provides a potentially useful reference for PGS development of other cellular and molecular traits. However, these methods may perform differently for other phenotypes, such as complex diseases with very different genetic architectures. In this case, dedicated studies will be needed to identify an appropriate method for PGS construction of the phenotype.
The extensive sharing of the polygenic basis for blood cell traits and several common diseases was consistent with known trait-level associations and raised potentially fruitful avenues for future translational research. For example, both EO# and NEUT# are important risk factors for rheumatoid arthritis (RA), and their respective PGSs reflected these associations. Knowledge of their shared genetics and corresponding PGSs may enable early stratification of individuals at increased risk Interactions between PGS and sex were tested for all of the traits on the INTERVAL cohort by using the multivariate linear regression: Overall, this study evaluated a variety of learning methods to construct PGSs for blood cell traits using individual-level genotype data. We demonstrate how the learning methods outperform univariate analysis-based methods, including by adjusting the effects of correlation, interaction, and low-MAF variants. This work highlights the importance of moving beyond standard summary statistics-based methods for PGS, particularly as the biobank-scale cohorts are becoming more common. We have made these PGSs available to the community, demonstrated that they can stratify sex-and age-dependent trajectories, and identify their shared polygenic basis with various common diseases. Future studies leveraging the totality of genetic variation (e.g., the full allelic spectrum and difficulty to genotype/ sequence loci) for blood cell traits, as identified in recent studies 10 may provide further improvements in the PGSs of these traits and may facilitate further studies evaluating their clinical validity.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

Study cohorts UK Biobank
The UK Biobank is a cohort including 500,000 individuals living in the UK who were recruited between 2006 and 2010, aged between 40 and 69 years at recruitment. The participants with the measurements of the 26 blood cell trait and who were identified as European ancestry based on their genetic component analysis were included in our study. The detailed simple sizes used for training and internal validation of PGS of each blood cell trait after quality control were given in Table S2.

INTERVAL Study
INTERVAL is a randomized trial of 50,000 healthy blood donors, aged 18 years or older at recruitment. The participants with measurements of the 26 considered blood cell trait were included in our study. The detailed simple sizes used for external validation of PGS of each blood cell trait after quality control were given in Table S2.

Data quality control
This study analyzed 26 different traits across three blood cell types: platelets, red blood cells, and white blood cells (Tables S1 and S2) that were measured in UK Biobank 6,7 and INTERVAL 8 cohorts. As construction and evaluation of PGS are highly dependent on the quality of both phenotype and genotype data used, we adopted the established protocols described in the previous work, 10 adjusting measured values for blood cell trait values to help account for a variety of environmental and technical factors, as well as the first 10 genetic principal components. Technical variables include the time between venepuncture and full blood cell analysis, seasonal effects, center of sample collection, time dependent drift of equipment, systematic differences in equipment; environmental variables include sex, age, and lifestyle factors, including diet, smoking and alcohol consumption. Approaches to quality control and imputation of the genotype data of UK Biobank have been described previously, 7 which filtered the samples to the European-ancestry only; similarly, the quality control and imputation of the genotype data of INTERVAL has been described in the previous work. 4 For algorithmic purposes, any remaining missing genotypes were mean imputed.

Variant selection
To construct PGS for blood cell traits, a key step is to select genetic variants (e.g., SNPs), that are not only significantly associated with the trait but also independently contribute to the trait. Our previous work 10 investigated a range of different variant selection criteria and validated their performance with the P+T method. It was discovered that the conditionally independent variants yield the best predictive power across all the blood cell traits. Thus, this study first adopted the same variant selection strategy for each blood cell trait and used the conditional analysis (CA) variants as inputs to compare the performance of six PGS methods. Below, we describe the steps of the conditional analysis in brief. A GWAS was first performed for each trait on the UKB cohort to select variants significantly associated with the trait, in which a MAF threshold of 0.005% was applied and an genotype imputation INFO threshold of 0.4. For each variant tested, a genomewide significance threshold of p = 8.31 3 10 À9 was applied as it is widely utilized for common, low frequency and rare variants. 43,44 Details of GWAS for these blood cell traits on UKB have been previously published. 10 Based on these significantly associated variants of each trait, a conditional analysis with a r 2 threshold of 0.9 was further performed to identify the variants that are independently associated with a trait and can best represent the underlying genetic signals of that trait.
The conditional analysis was performed using a stepwise multiple linear regression approach. 4,45 For each blood cell trait, the set of genome wide significant variants was first partitioned into the largest number of blocks such that no pair of blocks are separated by fewer than 5Mb, and no block contains more than 2,500 variants. For each block, variants within the block are tested separately using the multiple-stepwise regression algorithm and independently associated variants are put forward into a larger chromosome wide pool on which a second multiple-stepwise regression algorithm is executed. The multiple-stepwise regression algorithm starts by adding in variants that pass the genome-wide significance threshold (p = 8.31 3 10 À9 ) and have a LD r 2 score lower than 0.9. Then, it fits a multivariate linear regression to remove variants that have a p value larger than the genome-wide significance threshold, which step is iterated until no more variants can be removed from the model. Note that we only keep those CA variants whose genotype data are available on both UKB and INTERVAL studies for the convenience of external tests in this study.
There are PGS methods that are scalable to much higher dimensional genetic data, such as EN and LDpred2, so we further investigated their performance on larger sets of genome-wide variants that were selected with more lenient thresholds. In details, we first selected all the biallelic variants that are shared between UKB and INTERVAL, on which filters, MAF > 0.01, INFO score > 0.4 and variant missing rate < 0.1, were applied to control the quality and total number of selected variants using UKB samples. Then, a LD thinning step was performed on UKB to remove SNP-SNP LD correlations using the indep-pairwise method implemented in plink version 2.00 30 at the threshold of r 2 = 0.5, which resulted in a total of 1,090,437 variants. Finally, several levels of p value thresholding (i.e., 10 À6 , 10 À4 , 10 À2 , and 1) were applied on these selected variants to form different variant sets for each trait, each of which was incorporated to the CA variant set of the trait for its PGS construction.

SNP-SNP correlation and interaction detection
To investigate capability of learning methods in modeling correlations and interactions, we focused on the CA variant sets and performed correlation and interaction tests between any pairs of CA variants for each trait on the UKB cohort. The coefficient of determination r 2 was used to evaluate the correlation between two variants. The multivariate linear regression: y = b 0 + b 1 SNP 1 + b 2 SNP 2 + b 3 SNP 1 SNP 2 was employed for interaction tests with the interaction terms passing the threshold of p = 2.3 3 10 À7 (Bonferroni adjusted for all tested SNP pairs across the 26 traits) were deemed significant. The term ''interaction'' used here refers to statistical interactions and does not imply (biological) epistasis.

Polygenic scoring methods
We constructed PGS for 26 blood cell traits using a conventional P+T method, summary statistics based learning method LDpred as well as a variety of widely used machine learning and deep learning methods. This subsection describes fundamental aspects of the P+T method, elastic net (EN), Bayesian ridge (BR), LDpred method, multilayer precepton (MLP) and convolutional neural network (CNN) methods. Pruning and thresholding (P + T) P+T method assumes that the genetic variants have linear additive effects on PGS of the trait and constructs polygenic scores of a blood cell trait using the weighted sum of genotypes of the selected variants for that trait: 46 where S is the set of SNPs that are identified in the variants selection step; b j is the effect size of the SNP j that is obtained through the univariate statistical association tests in the GWAS using the UKB cohort; x ij is the genotype dosage of SNP j of the individual i. As discussed previously, P+T relies on LD pruning and p value thresholding, or maybe other thresholding strategies, for variants selection. The best set of thresholding parameters is usually identified by comparing their performance on a validation set. Elastic net (EN) EN also assumes that the variants have linear additive effects on the PGS of a trait, i.e., Equation 1, but the effect sizes of variants are obtained using a different way. These effect sizes are estimated by minimizing the penalized squared loss function: in which, N is the set of training samples for a given trait and y i is the trait level of the training sample i; the second term is L1 norm and the third term is L2 norm; a and l are coefficients used to control the contribution of L1 and L2 norms in the model, which are usually set via cross-validation. In EN, effect sizes of the variants selected for a trait are jointly estimated which provides an implicit way to model the correlations among these variants, and the use of L1 and L2 norms helps to control model complexity to address the overfitting problem in which L1 controls the sparsity of the model and L2 controls the contribution of each variable. It has been shown that the application of these regularized multivariate models offers an effective way to improve PGS construction in practice. 17,18 Bayesian ridge (BR) Similarly, BR also has a linear assumption for the effects of the variants, i.e., Equation 1. Different from EN, BR assumes that PGS of a trait follow a Gaussian distribution, and the prior for effect sizes of variants is also given by a spherical Gaussian: ) where a and l are coefficients of the model and subject to two Gamma distribution: Gamma(a 1 , a 2 ) and Gamma(l 1 , l 2 ). These two prior Gamma distributions can be set via a validation step. The b, a, l are then estimated by maximizing the log of the corresponding posterior distribution with respect to b by combining Equations 3 and 4 on the training data. 47 LDpred The LDpred method (both version 1 and version 2) also has a fundamental linear assumption as EN and BR. Differently, it considers the prior for effect sizes of variants by a Gaussian mixture model:  Figure S4 shows an example of a three-layer MLP in which the first layer is known as input layer consisting of the input features, i.e., SNPs in the context of this study; the last layer outputs the final result of the model and the layer(s) in between are called hidden layer(s). A function node in hidden and output layers typically transforms the inputs from the previous layer with a weighted linear sum followed by an activation function. 19 For example, f 1 in Figure S4 can be represented as: where w 11 , w 12 and w 13 , are weights of the three inputs of function f 1 and b 10 is the intercept (or bias); SNP 1 , SNP 2 and SNP 3 are the genotype dosages of three SNPs in our context; f act is an activation function which typically plays the role of introducing non-linearity into the model. Thus, the network architecture and its components of an MLP, e.g., activation function, determine a linear/non-linear mapping space, from which a model, i.e., all the weights across the given network that can best represent the data, is supposed to be learned. Details on the selection of network architectures for this study are given in the next subsection. This learning process is typically implemented by minimizing the difference, i.e., cost function, between the training data and the model distribution, through a back-propagation algorithm. 34 Convolutional neural networks (CNNs) CNNs are a specialized neural network for processing data that have a grid-like topology, 34 e.g., time-series data, image data, genome sequence data. 52 As regularized versions of MLPs, CNNs construct its hidden layers using convolutional and pooling operations which are usually followed by fully connected layers and the output layer. The convolution operation limits the number of input units for an output unit by using kernels, and leads to a sparse connectivity of the network, which allows us to store fewer parameters and largely improve statistical efficiency. A typical convolutional layer in CNNs performs multiple convolutions in parallel which lead to multiple representations of the input units. To help generalize these representations and reduce the chance of overfitting, a pooling layer is usually followed to replace each representation at a certain location with a summary statistic of the nearby output units. 34 There are different pooling operations that can be applied based on different application context, e.g., max pooling and average pooling. Figure S5 shows an example of a simple one-dimensional CNN with illustrations on convolution and pooling operations.

Measurement and hyperparameter tuning
We used Pearson r to measure the performance of various polygenic scoring methods. For each trait and each learning method, we randomly and equally partitioned the UKB samples into 5 portions, from which any 4 portions (80% of the samples) were used as training data to learn a model, and test the respective model's performance on the remaining 20% of UKB samples, as well as an external validation using the whole INTERVAL cohort. For each learning method and each trait, we obtained 5 different models, each with a performance measurement for both the internal UKB test and the external INTERVAL test. By doing so, the training and internal testing covered the whole UKB cohort, affording an effective way to avoid evaluation bias. The P+T method was also tested on the five different UKB testing sets, and the whole INTERVAL cohort. Hyperparameter turning is a crucial step for machine learning and deep learning methods as the choice of hyperparameters can greatly influence the model performance. In this study, we employed SNPNET 42 to implement EN method, in which a was set to 0.5 and 10% of the training samples were used as a validation set to tune l for each trait and each variant set. To identify two appropriate gamma distributions in BR i.e., the selection of a 1 , a 2 , l 1 and l 2 , a grid search across the set [-10 10 , À10 5 ,-10, 0, 10, 10 5 , 10 10 ] was conducted on the training set in which 10% of the samples were used as a validation set. BR was implemented using the scikit-learn package. 41 This study applied the commonly used and top performing grid search option in LDpred2 to learn PGS of these blood cell traits. Summary statistics from GWAS in the variants selection step, and the 102 default options of hyper-parameters for h 2 , p, and sparsity were applied when running LDpred2. Randomly selected 10,000 samples from training data were used to obtain the variantsvariants correlation matrix and all the training samples for each trait were used to validate the performance of different hyperparameter combinations for the optimal model selection.
As this work is, to our knowledge, the first attempt to employ MLPs and CNNs for genomic prediction of blood cell traits, there was no prior information that could be used for the design of network architecture for this task. Therefore, similar to the previous work, 19 we used a genetic algorithm to search for the optimal MLP and CNN architectures as well as other hyperparameters, e.g., the number of layers, the number of neurons at each layer, activation functions, optimizers, dropouts, etc., on the training set, in which 10% of the samples were used as a validation set. MLPs and CNNs were implemented using Keras (keras.io).