Development and validation of polygenic risk scores for prediction of breast cancer and breast cancer subtypes in Chinese women

Studies investigating breast cancer polygenic risk scores (PRSs) in Chinese women are scarce. The objectives of this study were to develop and validate PRSs that could be used to stratify risk for overall and subtype-specific breast cancer in Chinese women, and to evaluate the performance of a newly proposed Artificial Neural Network (ANN)-based approach for PRS construction. The PRSs were constructed using the dataset from a genome-wide association study (GWAS) and validated in an independent case-control study. Three approaches, namely repeated logistic regression (RLR), logistic ridge regression (LRR) and an ANN-based approach, were used to build the PRSs for overall and subtype-specific breast cancer based on 24 selected single nucleotide polymorphisms (SNPs). Predictive performance and calibration of the PRSs were evaluated unadjusted and adjusted for Gail-2 model 5-year risk or classical breast cancer risk factors. The primary PRSANN and PRSLRR both showed modest predictive ability for overall breast cancer (odds ratio per interquartile range increase of the PRS in controls [IQ-OR] 1.76 vs 1.58; area under the receiver operating characteristic curve [AUC] 0.601 vs 0.598) and remained predictive after adjustment. Although estrogen receptor negative (ER−) breast cancer was poorly predicted by the primary PRSs, the ER− PRSs trained solely on ER− breast cancer cases showed a substantial improvement in the prediction of ER− breast cancer. The 24-SNP-based PRSs can provide additional risk information to help breast cancer risk stratification in the general population of China. The newly proposed ANN approach for PRS construction has the potential to replace the traditional approaches, but more studies are needed to validate and investigate its performance.

population in the world over the past few decades has made breast cancer a major public health issue that seriously endangers the health of women in China [3].
The etiology of breast cancer is multifactorial, with both non-genetic risk factors (including reproductive factors, exogenous hormonal medication, and lifestyle factors) and inherited genetic risk factors playing important roles [4][5][6][7][8]. Multiple pathogenic variants of the BRCA1 and BRCA2 genes that confer high relative risks of breast cancer have been identified [9]. However, these variants are too rare in the general population to explain more than a small proportion of breast cancer cases [10,11], especially among Chinese women where the prevalence of BRCA1 and BRCA2 mutations is lower than that in women of European ancestry [12]. In addition to these highly penetrant rare variants, more than 180 common single nucleotide polymorphisms (SNPs) that are associated with breast cancer risk have been identified in genome-wide association studies (GWASs) [13]. Each of these SNPs confers only a small risk of developing breast cancer, but when summarized in the form of a polygenic risk score (PRS), their combined effect can be substantial [14].
Breast cancer PRSs have been shown to have sufficient predictive power to aid risk stratification, and some have already been implemented in clinical practice [15,16]. However, there is a lack of studies examining PRSs in Chinese women, since the majority of GWASs and other studies of breast cancer PRSs to date were conducted among women of European ancestry [13]. Among the limited studies investigating breast cancer PRSs in Chinese women [17][18][19][20][21], the biggest limitation is the lack of validation using independent datasets. These studies used the same datasets to estimate the PRS weighting parameters and to evaluate the PRSs, so their results may not truly reflect the performance of the PRSs. Furthermore, as highlighted by some recent studies, more efforts are needed to optimize PRSs for the prediction of estrogen receptor (ER) negative (ER−) breast cancer [22,23], which is more aggressive and less common than estrogen receptor positive (ER+) breast cancer. Better prediction of ER-specific breast cancer could enable selection of high-risk women who might benefit from prevention with endocrine therapies.
The primary aim of this study was to develop and validate PRSs for use in stratification of the risk of breast cancer and subtype-specific breast cancer in Chinese women. To that end, we used a GWAS dataset to develop PRSs and validated them in an independent test set from a case-control study. We also aimed to compare different approaches for calculating PRSs, including a newly proposed artificial neural network (ANN)-based approach.

Study design and participants
The dataset used for PRS development was obtained from the Shanghai Breast Cancer Genetics Study (SBCGS) [24]. The SBCGS was conducted in 5152 participants (2867 case participants and 2285 control participants) from the following four population-based studies conducted among Chinese women in urban Shanghai between 1996 and 2005: the Shanghai Breast Cancer Study [25], the Shanghai Breast Cancer Survival Study [26], the Shanghai Endometrial Cancer Study (contributing controls only) [27] and the Shanghai Women's Health Study [28]. The samples from the SBCGS were genotyped using Affymetrix Genome-Wide Human SNP Array 6.0. The raw individual-level genotype dataset was provided by the Database of Genotypes and Phenotypes (dbGaP) project phs000799.v1.p1 (https://www.ncbi.nlm.nih.gov/gap). The quality control (QC) procedures applied to the SBCGS dataset are described in Fig. 1. Briefly, we excluded SNPs and samples with a call rate < 99%. We also excluded SNPs with a minor allele frequency < 1%, SNPs with Hardy-Weinberg equilibrium (HWE) test P < 10^−6 and P < 10^−10 for controls and cases, respectively, and samples with KING-robust kinship coefficients > 0.0884 (second-degree relations, first-degree relations and duplicate samples) [29]. QC and imputation were performed using PLINK 1.9 and IMPUTE2 software [30,31]. After QC procedures, the final dataset consisted of 4861 participants (2722 case participants and 2139 control participants) and 569,677 SNPs.
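As an illustration of the SNP-level filters described above, the following is a minimal sketch in Python, not the PLINK 1.9 pipeline used in the study. It applies only the call-rate (< 99%) and minor allele frequency (< 1%) exclusions; the HWE and kinship filters are omitted. The function name `snp_qc_mask`, the genotype coding (0/1/2 allele counts with NaN for missing calls) and the toy data are assumptions made for this example.

```python
import numpy as np

def snp_qc_mask(genotypes, min_call_rate=0.99, min_maf=0.01):
    """Return a boolean mask of SNPs (columns) that pass call-rate and MAF filters.

    genotypes: samples x SNPs array of allele counts (0, 1, 2); np.nan = missing.
    Thresholds follow the text: exclude call rate < 99% and MAF < 1%.
    """
    g = np.asarray(genotypes, dtype=float)
    called = ~np.isnan(g)
    call_rate = called.mean(axis=0)
    # Allele frequency among called genotypes (two alleles per sample).
    n_called = called.sum(axis=0)
    alt_freq = np.nansum(g, axis=0) / (2 * np.maximum(n_called, 1))
    maf = np.minimum(alt_freq, 1 - alt_freq)
    return (call_rate >= min_call_rate) & (maf >= min_maf)

# Hypothetical toy data: 200 samples, 3 SNPs.
rng = np.random.default_rng(0)
toy = rng.integers(0, 3, size=(200, 3)).astype(float)
toy[:10, 1] = np.nan   # SNP 1: 5% missing, so call rate is 0.95
toy[:, 2] = 0          # SNP 2: monomorphic, so MAF is 0
mask = snp_qc_mask(toy)
print(mask)
```

In this toy example only the first SNP passes both filters.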
The independent test set used for PRS validation was obtained from the Sichuan Breast Cancer Case-Control Study (SBCCS) conducted in Chengdu, Sichuan Province. The study design has been described in detail elsewhere [6]. In brief, the SBCCS was conducted in 794 case participants and 805 control participants between 2014 and 2015. Case participants were recruited from primary breast cancer patients diagnosed in three government-owned hospitals, whereas control participants were recruited from healthy women undergoing annual physical examination in two physical examination centers. A standardized questionnaire was used to collect demographic and breast cancer risk factor information from participants. Clinical characteristics of case participants were directly exported from hospitals' information systems. Blood samples were collected from all participants on the day of the questionnaire survey and stored at − 80 °C prior to DNA extraction. DNA was extracted from blood samples using whole blood genomic DNA extraction kits (Tiangen Biotech Company, Beijing, China) and stored at − 80 °C. In the current study, we included 826 DNA samples from 376 control participants and 431 case participants that were available in 2019.

SNP selection and genotyping
We generated two sets of SNPs as potential candidates for genotyping in the SBCCS. The first set of SNPs was selected by reviewing association studies or meta-analyses. Due to budget limitations and the "diminishing returns" effect [13], we focused on susceptibility SNPs that were identified in previous smaller studies and selected 28 SNPs that had been widely found to be associated with breast cancer risk in the Chinese population (Supplementary Table S1). Thirteen SNPs were not represented in the SBCGS dataset, among which five SNPs (rs1801133, rs4973768, rs854560, rs1695 and rs9282861) were excluded because their eligible proxy SNPs, defined as linkage disequilibrium (LD) measure R² > 0.9 determined using the LDLink tool [32], were also not represented in the SBCGS dataset. The remaining eight SNPs were replaced by corresponding proxies (rs1137101 replaced by rs10789190; rs10941679 replaced by rs4479849; rs662 replaced by rs2057681; rs2234767 replaced by rs7097467; rs2981578 replaced by rs10736303; rs2420946 replaced by rs2162540; rs730154 replaced by rs8031463; rs11655505 replaced by rs9646413). We further excluded rs1219648 because it was in tight LD (R² > 0.8) with both rs2162540 and rs2981575 (Supplementary Fig. S1). Twelve SNPs that achieved genome-wide significance (P < 5 × 10^−8) for overall breast cancer in the SBCGS dataset formed the second set of SNPs (Supplementary Table S2). As shown in Supplementary Fig. S2, pairwise LD analysis revealed that no pruning was needed in the second set of SNPs (R² < 0.8). Therefore, a total of 34 SNPs were selected and genotyped (Supplementary Table S1 and Supplementary Table S2).
Before genotyping, QC of DNA samples was performed and 19 samples that failed the DNA QC were excluded, resulting in a total of 807 samples (376 control participants and 431 case participants) plus 30 blind duplicate samples sent for genotyping. Genotyping of the 34 SNPs was carried out blindly by Bio Miao Biological Company Limited. Time-of-flight mass spectrometry was used for genotyping in strict accordance with a standard protocol.
QC of the SBCCS genotyping was carried out by excluding SNPs with call rate < 98%, concordance rate in duplicate samples < 99%, HWE test P < 0.05 (rs6730484), and SNPs that were monomorphic (Supplementary Table  S1 and Supplementary Table S2). Samples were excluded if ≥3 SNPs failed the QC (6 samples were excluded). The remaining sporadic missing genotypes were imputed using population mean values.
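The mean imputation of sporadic missing genotypes mentioned above can be sketched as follows. This is an illustrative example only: the function name `impute_population_mean` and the toy matrix are hypothetical, and the sketch assumes the "population mean" is the per-SNP mean allele count over all non-missing samples, which the text does not specify.

```python
import numpy as np

def impute_population_mean(genotypes):
    """Replace missing genotypes (np.nan) with the per-SNP mean allele count.

    genotypes: samples x SNPs array of allele counts (0, 1, 2).
    Returns a new array; the input is left unchanged.
    """
    g = np.array(genotypes, dtype=float)      # np.array makes a copy
    col_means = np.nanmean(g, axis=0)         # mean over non-missing samples
    rows, cols = np.where(np.isnan(g))
    g[rows, cols] = col_means[cols]
    return g

# Hypothetical toy data: 4 samples, 2 SNPs, two sporadic missing calls.
toy = np.array([[0, 1],
                [2, np.nan],
                [1, 2],
                [np.nan, 0]])
imputed = impute_population_mean(toy)
print(imputed)
```

Imputed values are generally fractional dosages rather than integer allele counts, which is compatible with a weighted-sum PRS.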

PRS development
The 22 SNPs in the first set of SNPs were all included in the PRS development. Of the remaining 11 SNPs in the second set, we included only two SNPs that exhibited the same direction of effect on breast cancer in the SBCGS and SBCCS regardless of P-values (Supplementary Table S2). Therefore, a total of 24 SNPs were included for PRS development (Supplementary Table S3).
In the current study, we used three different approaches to calculate PRSs. The first two approaches were based on the same formula: $\mathrm{PRS} = \sum_{k=1}^{n} \beta_k x_k$, where $n$ is the total number of SNPs, $x_k$ is the number of effect alleles (minor alleles) for the $k$th SNP, and $\beta_k$ is the corresponding effect size, calculated as the per-allele log OR for breast cancer associated with the $k$th SNP. The first approach is known as the repeated logistic regression (RLR) approach. In this approach, $\beta_k$ was estimated in the SBCGS dataset using univariate logistic regression for each SNP individually. The RLR approach is the typical method used to calculate PRSs, since $\beta_k$ estimated from RLR is a summary statistic and can be easily obtained without access to individual-level genotype data. In the second approach, $\beta_k$ was estimated in the SBCGS dataset using multivariate logistic ridge regression, where all 24 SNPs were included in the model simultaneously. The model was also adjusted for age and population structure (first two principal components). The second approach is known as the logistic ridge regression (LRR) approach. The optimal penalty parameter lambda in the ridge regression model was chosen by conducting 10-fold cross-validation on the SBCGS dataset (results shown in Supplementary Fig. S3). The third approach was a newly proposed ANN-based approach. In this approach, the ANN can be regarded as a perceptron that was used to extract a vector of length 6 from the original 24 SNPs, and the final PRS was calculated based on the extracted vector while adjusting for age and population structure. The optimal hyperparameters for the ANN-based model were chosen by conducting 10-fold cross-validation on the SBCGS dataset (Fig. 2). The structure of the final ANN-based model used in the study is shown in Supplementary Fig. S4.
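The weighted-sum formula shared by the RLR and LRR approaches can be illustrated with a short Python sketch. The function name `polygenic_risk_score`, the three per-allele odds ratios and the genotype counts below are hypothetical values chosen for illustration; they are not estimates from the SBCGS dataset.

```python
import numpy as np

def polygenic_risk_score(allele_counts, betas):
    """Weighted sum of effect-allele counts: PRS = sum_k beta_k * x_k.

    allele_counts: samples x SNPs array of effect-allele counts (0, 1, 2).
    betas: per-allele log odds ratios, one per SNP.
    """
    x = np.asarray(allele_counts, dtype=float)
    b = np.asarray(betas, dtype=float)
    if x.shape[-1] != b.shape[0]:
        raise ValueError("number of SNP columns must match number of betas")
    return x @ b

# Hypothetical per-allele odds ratios for three SNPs, converted to log ORs.
betas = np.log([1.10, 1.25, 0.90])
counts = np.array([[0, 1, 2],    # woman A
                   [2, 0, 0]])   # woman B
scores = polygenic_risk_score(counts, betas)
print(np.exp(scores))  # combined odds ratio relative to carrying no effect alleles
```

Under this formula the RLR and LRR approaches differ only in how the weights are estimated (one SNP at a time versus jointly with a ridge penalty), not in how the score is computed.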
The primary PRSs for overall breast cancer were constructed using all breast cancer cases in the SBCGS dataset. We also constructed the PRSs for subtype-specific breast cancer (ER + and ER − ) using corresponding subtype-specific breast cancer cases in the SBCGS dataset.
Fig. 2 Hyperparameter tuning was conducted by applying 10-fold cross-validation to the SBCGS dataset and using average log-loss as the main outcome. The optimal number of iterations, number of hidden layers and dropout rate were 60, 3 and 0.4, respectively. Hyperparameters that were not tuned included: number of hidden neurons in each hidden layer (square root of number of input neurons plus two); learning rate (0.01); activation function of the hidden layers (Leaky ReLU); activation function of the output layer (sigmoid); loss function (sigmoid cross entropy); and optimizer (Adam optimizer). SBCGS: Shanghai Breast Cancer Genetics Study.
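To make the reported architecture more concrete, the following is a hedged NumPy sketch of a forward pass through a network with the stated shape: 24 SNP inputs, three hidden layers with Leaky ReLU activations and a sigmoid output. It is not the authors' implementation. The weights are random rather than trained, the covariate adjustment (age and principal components) is omitted, the Leaky ReLU slope of 0.2 is an assumption, and truncating the hidden width to an integer is an assumption that happens to match the length-6 vector mentioned in the text. Training with sigmoid cross entropy and the Adam optimizer is not shown.

```python
import numpy as np

N_SNPS = 24
N_HIDDEN_LAYERS = 3
# "Square root of number of input neurons plus two", truncated to an integer
# (assumption): int(4.90 + 2) = 6.
N_HIDDEN = int(np.sqrt(N_SNPS) + 2)

def leaky_relu(z, alpha=0.2):
    """Leaky ReLU; the slope alpha is an assumed value."""
    return np.where(z > 0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(rng):
    """Random (untrained) weights and zero biases for each layer."""
    sizes = [N_SNPS] + [N_HIDDEN] * N_HIDDEN_LAYERS + [1]
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params, dropout_rate=0.0, rng=None):
    """Forward pass; use dropout_rate > 0 (with rng) only to mimic training."""
    h = np.asarray(x, dtype=float)
    for w, b in params[:-1]:
        h = leaky_relu(h @ w + b)
        if dropout_rate > 0:
            keep = rng.random(h.shape) >= dropout_rate
            h = h * keep / (1.0 - dropout_rate)   # inverted dropout scaling
    w, b = params[-1]
    return sigmoid(h @ w + b).ravel()

rng = np.random.default_rng(42)
params = init_params(rng)
toy_genotypes = rng.integers(0, 3, size=(5, N_SNPS))   # hypothetical allele counts
probs = forward(toy_genotypes, params)
print(N_HIDDEN, probs.shape)
```

At prediction time dropout is switched off, which is why `dropout_rate` defaults to zero in this sketch.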

Statistical analyses
The performance of the PRSs was assessed from two perspectives: predictive ability and calibration. For predictive ability, we used the odds ratio (OR) per interquartile range (IQR) increase (IQ-OR) in the PRSs in the controls as the primary outcome. Discrimination was also used as a metric for the evaluation of predictive ability. Discrimination was assessed by the area under the receiver operating characteristic curve (AUC) with confidence intervals estimated using the method of Hanley and McNeil [33]. To indirectly compare the predictive ability of our PRSs with previous PRSs, we also compared the odds of breast cancer in the fourth quartile (Q4th) of the PRSs in controls with those in the first quartile (Q1st). Calibration was assessed by comparing the observed OR with the expected OR in each PRS decile and was further estimated using coefficients from log scale linear regression as described by Brentnall et al. [23]. In addition to evaluating the crude performance of the PRSs, we also evaluated their performance after adjusting for non-genetic risk factors or absolute risks predicted by the Gail-2 model, to investigate the ability of our PRSs to provide additional risk information for Chinese women. To this end, we regressed the PRSs (as the dependent variable) against non-genetic risk factors and used the residuals of the PRSs to calculate the evaluation metrics described above. The non-genetic risk factors used for adjustment included age, age at menarche, number of live births, family history of breast cancer, body mass index (BMI), and menopausal status. Sensitivity analyses were conducted as follows: 1) by excluding samples with sporadic missing genotypes in the SBCCS dataset, and 2) by incorporating a more rigorous pruning in the SNP selection process (R² < 0.3).
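Two of the metrics described above can be written out in a few lines of Python. The sketch below uses the Hanley and McNeil standard error for the AUC and converts a per-unit log odds ratio into an OR per IQR increase. The function names are hypothetical, and the z value of 1.96 and the sample sizes in the usage example (427 cases and 374 controls, as reported for the SBCCS) are assumptions about how the reported intervals were obtained.

```python
import math

def hanley_mcneil_ci(auc, n_cases, n_controls, z=1.96):
    """Confidence interval for an AUC using the Hanley and McNeil standard error."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_cases - 1) * (q1 - auc ** 2)
           + (n_controls - 1) * (q2 - auc ** 2)) / (n_cases * n_controls)
    se = math.sqrt(var)
    return auc - z * se, auc + z * se

def iq_odds_ratio(log_or_per_unit, iqr_in_controls):
    """Odds ratio per interquartile-range increase of the PRS in controls."""
    return math.exp(log_or_per_unit * iqr_in_controls)

# Usage with the AUC reported for the primary ANN-based PRS (0.601).
lo, hi = hanley_mcneil_ci(0.601, n_cases=427, n_controls=374)
print(round(lo, 3), round(hi, 3))
```

Under these assumptions the interval rounds to 0.562 to 0.640, which agrees with the interval reported for the primary ANN-based PRS.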

Results
The age and ER status profile of the participants in SBCGS are shown in Supplementary Table S4. ER status information was available for only 1495 case participants (54.9%), among which 985 cases were ER + breast cancer patients and 510 cases were ER − breast cancer patients.
Basic characteristics of the included 427 case and 374 control participants in the SBCCS are shown in Table 1. Due to a relatively small sample size, case and control participants were comparable in terms of several breast cancer risk factors, including BMI, age at menarche and family history of breast cancer (P > 0.05). Furthermore, there were no significant differences in 5-year absolute risks of breast cancer predicted by the Gail-2 model between case and control participants (P = 0.07). Comparison of the basic characteristics of the participants in the SBCCS who were included in the current study with those of the participants not included due to unavailability of DNA samples showed that there were no significant differences between these two groups of participants (Supplementary Table S5). As revealed by Fig. 3, the three primary PRSs for overall breast cancer (PRS RLR , PRS LRR , PRS ANN ) had very weak correlation with other breast cancer risk factors. Associations between the PRSs and Gail-2 model 5-year risk were also very weak (Spearman's ρ = − 0.01, − 0.03, and − 0.01 for PRS RLR , PRS LRR and PRS ANN , respectively), suggesting that the PRSs were independent of absolute risk predicted by the Gail-2 model.
For overall breast cancer, the primary PRSs constructed using the ANN-based approach achieved higher IQ-OR (1.76, 95% CI 1.39-2.24) than the primary PRSs constructed using RLR (IQ-OR 1.49, 95% CI 1.23-1.81) and LRR (IQ-OR 1.58, 95% CI 1.29-1.92, Table 2). In terms of discrimination (Fig. 4), PRS LRR and PRS ANN were comparable (AUC 0.598, 95% CI 0.559-0.637 vs. AUC 0.601, 95% CI 0.562-0.640) and superior to PRS RLR (AUC 0.582, 95% CI 0.543-0.621). As shown in Fig. 4 Table S7). The ANN-based and LRR approaches can compensate for the issue of collinearity; therefore, we incorporated a loose R 2 threshold of 0.8 when selecting the SNPs in order to include more informative SNPs. However, this threshold may have influenced the performance of the PRS RLR . A sensitivity analysis was conducted by incorporating a more rigorous R 2 threshold, which led to the removal of seven additional SNPs (rs2981582, rs3803662, rs9646413, rs2162540, rs10736303, rs4479849, and rs10789190). The performance of the PRS RLR constructed using SNP-17 was slightly improved but did not exceed the performance of the primary PRS LRR and PRS ANN (Supplementary Table  S8).

Discussion
In the current study, the PRSs for the prediction of overall breast cancer and subtype-specific breast cancer in Chinese women were developed using a GWAS dataset and validated in an external case-control dataset. The best PRSs (PRS ANN   These results indicated that our PRSs can provide additional risk information and are therefore suitable for use in conjunction with breast cancer risk prediction models based on non-genetic risk factors to stratify women into different risk groups. Since the Gail-2 model is the only publicly available model that can be implemented in our dataset, we also investigated the combination of the PRS ANN and PRS LRR with the Gail-2 model. Although there was a substantial increase in AUC when the PRSs were combined with the Gail-2 model (increased from 0.531 to approximately 0.58), the combined models had lower predictive ability than the PRSs alone. This was largely due to the poor performance of the Gail-2 model in the SBCCS dataset, which was consistent with a recent meta-analysis reporting a pooled AUC of 0.55 (95% CI 0.52-0.58) for the Gail-2 model in Asian females [34]. Therefore, although our PRSs showed great potential to contribute additional risk information and increase predictive ability when combined with classical breast cancer risk factors, further studies are still needed to investigate their performance when combined with a more accurate non-genetic risk prediction model for Chinese women (e.g., Han Chinese Breast Cancer Prediction model [35]). Another important application of the PRS is to identify women at high risk of breast cancer who could benefit from more frequent breast cancer screening or preventive therapy. Therefore, it is also important to assess the ability of the PRSs in predicting risk in the tails of the distribution. In the current study, the adjusted Q4th vs. Q1st ORs for PRS ANN and PRS LRR were 2.51 and 2.47, respectively, meaning women in the fourth quartiles of the PRSs had approximately 2.5-fold greater odds of having breast cancer than those in the lowest quartiles. This represents a substantial improvement in predictive ability compared with previous PRSs developed for Chinese women (Supplementary Table S9). This improvement can be attributed to the use of individual-level genotype data and a more sophisticated approach for PRS construction. Nevertheless, our best PRSs were still less predictive compared with some recent PRSs developed for women of European ancestry [22,23,36,37], perhaps reflecting the gap between the number of SNPs included. Therefore, the performance of these PRSs can still be improved by including more SNPs associated with breast cancer risk in the Chinese population.
Previous studies conducted in women of European ancestry showed that breast cancer PRSs were generally less predictive of ER− breast cancer than ER+ breast cancer [22]. We confirmed this result in our dataset in the Chinese population. The primary PRSs were significantly less predictive and poorly calibrated for ER− breast cancer, with adjusted IQ-ORs and AUCs ranging from 1.24 to 1.30 and 0.548 to 0.555, respectively. To improve the prediction of ER− breast cancer, we developed ER− PRSs in the current study. The results indicated that training PRSs solely on ER− breast cancer cases yielded a substantial gain in predictive ability for ER− breast cancer. Because ER− breast cancer is a more aggressive subtype, patients with ER− breast cancer have a significantly worse prognosis than patients with ER+ breast cancer. Identifying women at high risk of ER− breast cancer regardless of their overall breast cancer risk is therefore of great value in clinical practice and breast cancer screening. Our results highlighted the need to optimize future PRSs for ER− breast cancer by incorporating more ER− cases in the training dataset and, perhaps, including more SNPs associated with ER− breast cancer.
We compared three different approaches to PRS construction, consisting of the traditional RLR approach using summary statistics, as well as the LRR approach and the newly proposed ANN-based approach using individual-level genotype data. Compared with the traditional summary statistics-based RLR approach, the LRR and ANN approaches can be used to address the issues of overfitting, collinearity and confounding by using individual-level genotype data, thus providing a more accurate estimate of the weighting parameters. Therefore, it is expected that the primary PRSs constructed using the ANN and LRR approaches both achieved better predictive performance than PRS RLR (including SNP-17 based PRS RLR in the sensitivity analysis). Through the use of the non-linear activation function and multiple hidden layers, the ANN model is able to fit high-order interactions between variables [38]. Therefore, in theory, the ANN approach captures the interactions between breast cancer SNPs [39][40][41], and thereby achieves better predictive performance than the linear LRR approach. Our results are consistent with this speculation. The primary PRS ANN showed better predictive ability than the primary PRS LRR in predicting overall and ER+ breast cancer, which suggests the existence of interactions between the included SNPs. To explore possible SNP-SNP interactions, we conducted logistic regression analyses to identify pairwise interactions in the SBCGS dataset. A total of 13 pairs of SNPs with possible SNP-SNP interactions (P < 0.05) were identified, but none of them reached a Bonferroni corrected level of statistical significance (P < 1.8 × 10^−4). Further post-hoc analysis revealed that the interaction between rs10789190 and rs7799039 was statistically significant in both datasets (P < 0.05). The SNPs rs10789190 and rs7799039 are located in the leptin receptor (LEPR) and leptin (LEP) genes, respectively, making their interaction biologically plausible.
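The Bonferroni-corrected threshold quoted above is consistent with testing every pair among the 24 SNPs, although the text does not state the number of tests explicitly. A minimal arithmetic check in Python, under that assumption:

```python
import math

n_snps = 24
alpha = 0.05

# Assumption: one interaction test per unordered pair of SNPs.
n_pairs = math.comb(n_snps, 2)
bonferroni_threshold = alpha / n_pairs

print(n_pairs, f"{bonferroni_threshold:.2e}")
```

With 276 pairs the threshold is about 1.8 × 10^−4, matching the value reported in the text.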
Adding this interaction term to the PRS LRR slightly improved its predictive ability (PRS LRR with interaction term: IQ-OR 1.62; AUC 0.602), indicating that the differences between the primary PRS ANN and PRS LRR can be partially attributed to this interaction term. In other words, the ANN approach automatically captures the potential interactions between SNPs, which are likely to be omitted in the traditional approaches. Nevertheless, the ANN approach is more sophisticated and less flexible than the LRR and RLR approaches. Whether ANN can be considered the optimal approach to PRS construction remains to be investigated. The current study has several strengths. First, the PRSs were validated in an external dataset, which reduces the concern of overfitting. Nevertheless, further validation with an expanded sample, preferably from multiple locations in China, is still needed. Second, we examined the performance of the PRSs by ER status and further optimized the PRSs for ER− breast cancer prediction, which has not been previously conducted in Chinese women. Future studies may consider both ER and human epidermal growth factor receptor 2 (HER2) status when optimizing the PRS for prediction of subtype-specific breast cancer. Third, all the SNPs in the SBCGS and SBCCS were genotyped directly. Imputation was conducted only for sporadic missing genotypes. However, the study also has some limitations. First, the overall performance of the PRSs is not ideal, especially compared with the performance of PRSs in women of European ancestry. Due to budget constraints, the search for candidate SNPs was limited to those that are well-validated in the Chinese population; hence some newly identified SNPs and SNPs that remain to be validated in the Chinese population were omitted. Therefore, the results of our study should be interpreted with caution. Future studies should include more SNPs associated with breast cancer susceptibility, especially those identified in recent GWASs.
High-quality genetic studies are also needed to identify and validate more breast cancer susceptibility SNPs in the Chinese population. Second, assessment of the performance of the PRSs in combination with classical breast cancer risk factors was not sufficient, since there is no suitable breast cancer risk prediction model for Chinese women. Further studies are warranted to investigate the performance of the PRSs when incorporated into more accurate risk prediction models for Chinese women. Third, BRCA status information was unavailable for both the SBCGS and the SBCCS; we were therefore unable to conduct further stratified analyses or make comparisons. In addition, since the SBCGS spanned a long period of time (i.e., 1996 to 2015), we cannot rule out the possibility that changes in recommendations for determining ER status may have influenced the results of our study. Finally, our PRSs and study results are limited to Han Chinese women and may not be generalizable to Chinese women in other ethnic groups, although these groups account for only around 9% of the total population.
In summary, the SNP-24-based breast cancer PRSs showed significantly better predictive ability than previous PRSs developed for Chinese women. Our SNP-24-based PRSs were largely independent of classical breast cancer risk factors and thus have great potential to improve clinical practice and future risk-based breast cancer screening programs by providing additional risk information for the general population. Nevertheless, the predictive performance of the current PRSs is not ideal and can be improved by incorporating more SNPs that are associated with breast cancer risk in Chinese women. Although the subtype-specific PRSs showed substantial improvement for ER− breast cancer prediction, their overall performance is still poor and improvements are needed before they can be implemented to identify women at high risk of ER− breast cancer. Our newly proposed ANN-based PRS construction