Genome-wide association study of serum tumor markers in Southern Chinese Han population

Serum indicators AFP, CA50, CA125, CA153, CA19-9, CEA, f-PSA and SCC-Ag have been confirmed as tumor markers (TMs). We conducted a genome-wide association study (GWAS) of these 8 tumor markers in 427 Han individuals from southern China to identify genetic loci significantly associated with their levels. Genotyping was performed with the Gene Titan multi-channel instrument and Axiom Analysis Suite 6.0 software. Imputation was performed with IMPUTE2, using the 1000 Genomes Project (Phase 3) as the haplotype reference. After quality control and statistical analysis, genetic loci with genome-wide significant association with TMs (p < 5E-8) were identified. We then selected Top SNPs (p < 5E-7) from the GWAS results for a replication test, and used SPSS software to draw box plots of serum TM levels under the different genotypes of the significant loci. Only MUC1 (mucin 1)-rs4072037 was genome-wide significantly associated with CA153 (p = 1.28E-18). In addition, a total of 30 genetic loci showed a suggestive genome-wide association with the levels of the 8 serum tumor markers (p < 5E-6). Three Top SNPs (p < 5E-7) were selected for replication. MUC1-rs4072037 remained significantly associated with CA153 in an independent population (p = 3.73E-08). Compared with the TT genotype of rs4072037, the CA153 level was higher under the CC or CT genotype. In conclusion, MUC1-rs4072037 is genome-wide significantly associated with CA153 level, and 30 genetic loci are suggestively genome-wide associated with the levels of tumor markers in the Han population from southern China.


Background
Tumor markers (TMs) are substances that reflect the existence or development of tumors. They not only play a role in disease detection and targeted treatment of tumor patients, but also have practical value in cancer screening of healthy people. Numerous studies have confirmed that the serum levels of certain indicators are closely associated with the occurrence and development of one or more tumors, so these indicators can be regarded as tumor markers. Examples include alpha fetoprotein (AFP): liver cancer [1,2]; CA19-9: pancreatic cancer, gastric cancer, colorectal cancer, etc. [3][4][5][6]; CA125: ovarian cancer [7]; CA153: breast cancer [8]; CA50: pancreatic cancer [9]; carcinoembryonic antigen (CEA): colorectal cancer [10]; free prostate specific antigen (f-PSA): prostate cancer [11]; and squamous cell carcinoma antigen (SCC-Ag): squamous cell carcinomas [12]. In addition, previous studies suggested that the levels of tumor markers such as CA19-9, CEA and AFP may differ between healthy individuals and cancer patients due to individual genetic differences and environmental factors [13][14][15][16][17]. Therefore, measuring these tumor marker levels helps to monitor the occurrence and development of specific tumors in specific clinical populations.
Genome-wide association study (GWAS) has become a mainstream method for studying complex diseases and their susceptibility genes because it covers single nucleotide polymorphisms (SNPs) across the whole genome [18]. By performing population-level statistical analysis of genotype and phenotype, GWAS can efficiently identify genetic loci related to phenotypic changes, including loci associated with the occurrence and development of diseases. To date, numerous SNPs associated with the levels of tumor markers in different populations have been reported [13,17,19,20,21]. However, there are few GWAS of tumor markers. He, M. et al. identified several loci associated with CA19-9, CEA and AFP levels through a GWAS among 3451 healthy participants [13]. These findings indicate that identifying genetic loci associated with tumor marker levels through GWAS is of great significance, as it provides new ideas and supplementary data for further understanding the genetic mechanism of serum TMs in specific populations.
Therefore, we conducted a genome-wide association study of 8 serum tumor markers (AFP, CA125, CA19-9, CA153, CA50, CEA, f-PSA, SCC-Ag) in the Han population from southern China, with a view to discovering genetic loci that are significantly associated with serum tumor markers in this population. Our study will provide a valuable reference for clinical monitoring and diagnosis of the occurrence and development of tumors among the Han population from southern China.

Study subjects and DNA extraction
The subjects consisted of 427 healthy individuals (221 males and 206 females) from a health examination center. The inclusion criteria for all participants were: age mainly 40-50 years, no family history of tumor, no disease (at least within 2 weeks), and no medication (at least within 2 weeks). Whole blood was collected from each participant, and whole-genome DNA was extracted according to the kit instructions (GoldMag, Xi'an). Subsequently, a GWAS was conducted on 8 serum tumor markers in the participating population: AFP, CA50, CA125, CA153, CA19-9, CEA, f-PSA and SCC-Ag.

Genotyping and quality control
In our study, a Thermo Fisher genotyping chip was used (Applied Biosystems™ Axiom™ Precision Medicine Diversity Array, PMDA). Genotyping was performed with the Gene Titan multi-channel instrument and Axiom Analysis Suite 6.0 software. A genome-wide scan through Axiom yielded a total of 874,190 loci within our sample (427 participants). After excluding indels, copy number variations, sex-chromosome loci and duplicate sites, we performed quality control on the remaining sites (sample call rate > 0.95, marker call rate > 0.90, HWE p > 5E-6). In the end, 796,288 loci remained before imputation.
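The marker-level part of this quality control can be illustrated with a minimal sketch. It is not the Axiom Analysis Suite procedure: the genotype coding (0/1/2 copies of one allele, `None` for a missing call), the function names, and the use of a 1-df chi-square test for Hardy-Weinberg equilibrium (HWE) are assumptions made for illustration, and production pipelines often use an exact HWE test instead. Only the thresholds (marker call rate > 0.90, HWE p > 5E-6) come from the text.

```python
import math

def call_rate(calls):
    """Fraction of non-missing genotype calls (None marks a missing call)."""
    return sum(c is not None for c in calls) / len(calls)

def hwe_p_value(n_aa, n_ab, n_bb):
    """1-df chi-square HWE test from genotype counts (illustrative, not exact test)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0                    # monomorphic marker: nothing to test
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2.0))

def marker_passes_qc(calls, min_call_rate=0.90, hwe_threshold=5e-6):
    """calls: genotypes coded 0/1/2 (hypothetical coding) or None if missing."""
    if call_rate(calls) <= min_call_rate:
        return False
    observed = [c for c in calls if c is not None]
    counts = [observed.count(g) for g in (0, 1, 2)]
    return hwe_p_value(*counts) > hwe_threshold
```

For example, `marker_passes_qc([0] * 25 + [1] * 50 + [2] * 25)` keeps a marker whose genotype counts match HWE expectations, whereas a marker with no heterozygotes among 100 samples is rejected by the HWE filter.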

Imputation and quality control
We used IMPUTE2 software for imputation, with the haplotypes of the 1000 Genomes Project (Phase 3) as the reference. Imputation produces many poor-quality loci, which need to be filtered out. The loci filtered out included loci with MAF = 0, loci with MAF < 0.01 and info > 0.3, and loci with excessive missing rates (98% or more). Finally, loci meeting the following conditions were retained: sample call rate > 95%, marker call rate > 90%, HWE p > 5E-6, and Allele = 2. In the end, a total of 6,423,076 SNPs were used for subsequent analysis.

Statistical analysis
We used Golden Helix SNP & Variation Suite version 8.7 for association analysis. We mainly used the mixed linear model with an additive genetic model, and added the IBD matrix to the model to detect the SNPs associated with each indicator. All analyses were adjusted for age and gender. We also constructed Manhattan plots and quantile-quantile plots for the serum TMs. A p-value less than 5E-8 indicates that the genetic polymorphism is genome-wide significantly associated with the level of a tumor marker. A p-value less than 5E-6 indicates a suggestive genome-wide association with tumor marker level [22].
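As a small illustration of two ideas in this section, the sketch below shows additive genotype coding and the two p-value thresholds. It does not reproduce the mixed linear model itself, which was fitted in the commercial software; the function names and the genotype-string format (for example "CT") are hypothetical.

```python
GENOME_WIDE_THRESHOLD = 5e-8   # genome-wide significance, as stated in the text
SUGGESTIVE_THRESHOLD = 5e-6    # suggestive significance, as stated in the text

def additive_dosage(genotype, effect_allele):
    """Additive coding: number of copies (0, 1 or 2) of the effect allele."""
    return sum(1 for allele in genotype if allele == effect_allele)

def classify_p_value(p_value):
    """Label a GWAS p-value using the two thresholds above."""
    if p_value < GENOME_WIDE_THRESHOLD:
        return "genome-wide significant"
    if p_value < SUGGESTIVE_THRESHOLD:
        return "suggestive"
    return "not significant"

# Example: genotypes of one SNP coded for a hypothetical effect allele "C"
dosages = [additive_dosage(g, "C") for g in ("TT", "CT", "CC")]
```

Under additive coding the model estimates one per-allele effect, so a heterozygote is assumed to lie midway between the two homozygotes on the scale of the analyzed phenotype.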

Repeat detection
SNPs with p < 5E-7 were selected as Top SNPs from all SNPs identified by GWAS, and were then tested for replication among 398 participants from the health examination center of Hainan General Hospital (different from the participants in the GWAS analysis). We removed extreme measurement values (beyond mean ± 3 SD) of each serum TM, and then normalized the remaining values with a rank-based inverse normal transformation (rank-based INT). Genotyping was performed by Agena MassARRAY. The association between Top SNPs and serum TMs was then analyzed with Golden Helix SNP & Variation Suite 8.7, using the mixed linear model with an additive genetic model and the IBD matrix added to the model. The final results were all adjusted for age and gender. We also used False Discovery Rate (FDR) analysis to correct the positive results. Finally, SPSS software was used to draw box plots of serum TM levels under the different genotypes of the significant loci. In this part, p < 0.05 indicates that a candidate SNP is significantly associated with a serum TM.
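The preprocessing and correction steps described above can be sketched as follows. The text does not state which rank offset, tie handling or FDR procedure was used, so the Blom offset (c = 3/8), average ranks for ties and the Benjamini-Hochberg procedure are assumptions, and all function names are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def drop_outliers(values, n_sd=3.0):
    """Keep values within mean ± n_sd sample standard deviations."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= n_sd * s]

def rank_based_int(values, c=3.0 / 8.0):
    """Rank-based inverse normal transformation (Blom offset assumed)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0   # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

A typical order of use would be `rank_based_int(drop_outliers(levels))` for each serum TM before the association test, followed by `benjamini_hochberg` on the resulting p-values.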

GWAS identified 31 single nucleotide polymorphisms
A total of 427 Southern Chinese Han people participated in the GWAS assessment of this study. The number of participants and the mean ± standard deviation of each serum TM are summarized in Table 1. Sample sizes differed between indicators because of missing data during sample collection.

Repeat verification
We selected Top SNPs (p < 5E-7) from the 31 SNPs identified by GWAS to narrow the scope of verification. Finally, 3 SNPs were selected for the verification test. The results (Table 3) showed that MUC1-rs4072037 was still significantly associated with CA153 in different participants (p = 3.73E-08, FDR = 2.31E-06). The average level of CA153 under the different genotypes of rs4072037 is summarized in Table 4. Compared with the wild-type TT genotype of MUC1-rs4072037, the level of CA153 under the CC/CT genotype was significantly increased (p < 0.001). Fig. 2 shows a box plot of the CA153 level under the different genotypes of rs4072037.

Discussion
The GWAS results of our study showed that only MUC1-rs4072037 was genome-wide significantly associated with CA153 (p = 1.28E-18). In addition, a total of 30 genetic loci suggestively genome-wide associated with the levels of TMs were identified in the Han population from southern China. The replication results showed that only MUC1-rs4072037 still had a significant association with CA153 level. The level of CA153 is increased in the serum of patients with malignant breast cancer and ovarian cancer [23,24]. In this study, we found that the association between rs4072037 (MUC1) and CA153 level reached genome-wide significance. In addition, seven genetic loci had a suggestive genome-wide association with CA153 level. Among the genetic loci identified in this study, rs4072037 (MUC1) [25,26] and rs1053878 (ABO) [27] have been reported to be associated with the occurrence and development of cancers. However, their specific mechanism in cancer remains unclear. Gu, X. et al. found important evidence in a large case-control study that MUC1 rs4072037 may be used as a tumor marker [28].
Combining these findings with the results of this study, we speculate that MUC1-rs4072037 and ABO-rs1053878 may affect the expression level of CA153, thereby affecting the occurrence and development of cancer. We conducted the GWAS of CA153 among only 165 participants because of missing data in sample collection. Nevertheless, MUC1 rs4072037 was still significantly associated with CA153 level in the replication analysis and passed FDR correction, which supports the reliability of this result. Compared with the TT genotype, the CA153 level under the CC or CT genotype was significantly higher [29]. Combined with the results of our study, this indicates that the presence of the minor allele 'C' significantly increases the level of CA153. Based on the above, MUC1 rs4072037 may serve as a new genetic signal for tumor prevention in the southern Chinese Han population. To our knowledge, our study is the first to report a significant association between MUC1 rs4072037 and CA153 level. Our study will provide a valuable reference for tumor prevention.
Hepatocellular carcinoma patients with high serum AFP levels have poor median survival [30]. A Japanese GWAS found that the AT/TT genotype of rs17047200, located in an intron of TLL1, was associated with the level of AFP [31]. Our results differ from those of previous studies, perhaps because of differences in study methods, sample size or the genetic backgrounds of the participating populations. In our study, the GWAS results showed that the genetic loci with suggestive genome-wide association with AFP level in the Han population of southern China are rs1524622 (LOC100287704/LOC100287834), rs143687479 (SGCZ) and rs10425983 (ZNF444/GALP). As far as we know, this is the first report that these three SNPs may be associated with changes in AFP level.
High CEA levels are associated with tumorigenesis and are often used to assist in the diagnosis of cancers of the gastrointestinal tract [32]. Carcinoembryonic antigen (CEA) is also highly expressed in malignant breast cancer [23]. To date, previous GWAS work has identified 5 genetic variants of the ABO gene associated with CEA level [13]. However, the association between CEA level and the two SNPs identified by GWAS in our study (TMEM132D-rs11060471; FARP1-rs17568726) is reported here for the first time. They may be new genetic signals for monitoring changes of CEA level in cancer patients, but this requires further experiments to verify.
Serum f-PSA is an important indicator for early detection of prostate cancer. Jeannette et al. found that KLK3 SNPs may be associated with f-PSA level in a GWAS conducted among African Americans and Europeans [11]. Our results differ from those of previous studies. We found that SNPs in PDE4B, PAM, PCAT5/ANKRD30A and PABPN1L/CBFA2T3 were suggestively associated with changes in f-PSA level among the southern Chinese Han population. We speculate that this difference may reflect different genetic backgrounds or environments. Nevertheless, our results still provide supplementary data for GWAS of f-PSA in different populations.
Detection of serum SCC-Ag level can be used in the early diagnosis of cervical cancer recurrence [33]. To date, no GWAS of SCC-Ag has been reported. We conducted the first SCC-Ag GWAS in the Han population of southern China, and identified three genetic loci (rs111241781, rs117089993, rs72773580) that may be associated with SCC-Ag level. They may be used as new genetic signals for early diagnosis of cervical cancer recurrence.
Carbohydrate antigens (CAs) play a key role in tumor progression, and CA153, CA125 and CA19-9 are widely used in screening for various cancers [34]. Studies have reported that CA19-9 is the most sensitive tumor marker for pancreatic cancer [3], and that the expression of CA19-9 is related to recurrence [35]. CA125 is a reliable marker for clinical diagnosis of ovarian cancer [7], and its increasing level is a signal of recurrence of female germline tumors [36]. The expression of CA50 in the serum of pancreatic cancer patients was significantly higher than that of healthy controls [9]. Although some SNPs significantly associated with the level of CA19-9/CA125 have been reported in previous GWAS [13,37], the genetic loci identified in our study have never been reported in the southern Chinese Han population. Our results suggest that these SNPs may be used as new genetic signals for one or more tumor markers.
The characteristics and changes of tumor markers have important guiding significance for health assessment and management, cancer prevention and tumor detection. Our study provides evidence for an in-depth understanding of the mechanism of serum tumor markers. The small sample size of this study is a limitation that cannot be ignored. In subsequent studies, we will expand the sample size or include populations with different genetic backgrounds as study subjects to conduct validation tests. Nevertheless, our study provides a valuable reference for clinical monitoring and diagnosis of the occurrence and development of tumors among the Han population from southern China.

Conclusion
In summary, MUC1-rs4072037 is genome-wide significantly associated with CA153 level (p = 1.28E-18), and 30 genetic loci were suggestively genome-wide associated with the levels of tumor markers (AFP, CA50, CA125, CA153, CA19-9, CEA, f-PSA, SCC-Ag) among the Han population from southern China. Our study provides a valuable reference for clinical tumor prevention, and also provides a new theoretical basis for phenotype research on tumor markers in healthy populations and for individual health management.