Introduction

Angiogenesis is an essential event in tumor growth, progression, and metastasis, and is regulated by the VEGF pathway. Several angiogenic factors have been identified, among which VEGF appears to be the key regulator in neovascularization and enhanced vascular permeability1,2. The VEGF family consists of VEGF-A, -B, -C, -D, -E, and the placental growth factor (PlGF). The VEGF receptors comprise three tyrosine kinases: VEGFR-1 (FLT1), VEGFR-2 (KDR), and VEGFR-3 (FLT).

VEGF-A and VEGFR-2 are considered the most critical of the endothelial cell ligands and receptors, respectively, in solid tumor angiogenesis3,4. Anti-angiogenic therapies are established treatment modalities for many types of cancer5,6. Despite the survival advantage experienced by patients, this treatment strategy still faces many barriers. Among them, the paucity of clinically useful biomarkers that can consistently predict clinical efficacy for this class of agents is a significant impediment.

For certain cancers, the degree of angiogenesis may reflect the malignant potential of individual tumors7,8. The degree of angiogenesis can be quantified using various approaches, such as assessing tumor specimens for microvessel density or measuring tumor levels of angiogenic growth factors and cytokines. Approaches using patient blood are also available, including measurement of circulating angiogenic proteins and genetic analysis of germline DNA.

Circulatory proteins involved with angiogenesis and inflammation are measured to gain insight into the pathophysiology of different conditions. The intrinsic value of this approach is that it provides a safe and efficient way to identify circulatory proteins that reflect the biology of the tumor and its microenvironment. In the setting of oncogenesis, due to overproduction of VEGF-A by tumor cells relative to normal cells, circulating VEGF-A levels in cancer patients might reflect VEGF-A release by both normal and cancer cells. VEGF-A levels, hence, can inform about the biology of spreading tumors. For example, circulating VEGF-A levels decrease after primary tumor resection in colorectal and gastric cancers9,10, and there is a positive relationship between the concentrations of circulating VEGF-A and the extension of epithelial carcinomas11. Circulating proteins could provide information regarding the angiogenic potential of a tumor, or act as tools to evaluate residual disease after surgery, or early indications of response to anti-angiogenic therapy12,13.

The heritability of circulating angiogenesis proteins has primarily been evaluated for VEGF-A in family studies of cardiovascular disease and is high14,15,16. Genetic markers that predict the circulating levels of angiogenesis proteins in cancer patients have not been well-studied. To perform such study is of importance, as the level of heritability in cancer patients might be lower (or even very low) compared to non-cancer individuals. In fact, cancer patients tend to have higher VEGF-A levels as compared to non-cancer individuals, due to the influence of tumor-related VEGF-A production from multiple sources17,18.

Hence, we employed a genome-wide approach to discover genetic variants associated with levels of circulating angiogenesis proteins in cancer patients prior to receipt of treatment. As replication of these findings was crucial, we performed two studies (one discovery and one validation) from two clinical trials of cancer patients with two different diseases. An intergenic variant flanking the VEGFA gene was found to predict plasma levels of VEGF-A in both studies. Because the mechanistic basis of genetic associations needs to be established, we have also discovered that the the NF-AT1 transcription factor binds where the intergenic variant is located and its binding is affected by the variant.

Results

Associations between gene variants and plasma protein levels

The Consolidated Standards of Reporting Trials (CONSORT)19 charts presented in Figs 1 and 2 provide detailed information about the number of patients, proteins, and SNPs measured in the discovery (CALGB 80303) and validation (CALGB 80203) cohorts, respectively. CALGB is now a part of the Alliance for Clinical Trials in Oncology. For each of the 31 proteins in the discovery cohort, the top 50 SNPs (ranked according to P-value) are listed in Supplementary File 1.

Figure 1
figure 1

CONSORT chart for CALGB 80303. Abbreviations: MAF, minor allele frequency; HWE, Hardy-Weinberg Equilibrium.

Figure 2
figure 2

CONSORT chart for CALGB 80203. Abbreviations: MAF, minor allele frequency; HWE, Hardy-Weinberg Equilibrium.

Three SNP-protein pairs (rs7767396, VEGF-A), (rs2284284, MCP1) and (rs7504372, VEGF-C) with a P-value less than 1e–07 were selected for validation in CALGB 80203. This unadjusted P-value cut-off was used only for the purpose of selecting variants for validation. The results for these three pairs are summarized in Table 1. The quantile-quantile (QQ) and Manhattan plots, across all SNPs for each of these three proteins, are provided in Supplementary Figs 1 and 2 respectively.

Table 1 SNPs associated with plasma levels of MCP1, VEGF-C, and VEGF-A in CALGB 80303.

The genotype effect of rs7767396 (A > G) on plasma levels of VEGF-A was validated in the CALGB 80203 cohort (P-value = 5.18e–05) based on our validation criterion. In both the discovery and validation cohorts, the G allele of rs7767396 reduces the plasma levels of the VEGF-A protein, and an apparent gene-dosage effect is observed, i.e., in CALGB 80303, the AA genotype exhibited 2.04 and 3.4-fold higher VEGF-A levels than those with AG and GG genotypes, respectively. In CALGB 80203, the corresponding fold increases were 2.3 and 2.7, respectively. The associations of rs2284284 with MCP1 (P-value = 0.314) and rs7504372 with VEGF-C (P-value = 0.481) were not statistically significant in the validation cohort. The distributions of the levels of the three plasma proteins, conditional on the genotypes, are shown in the boxplots in Fig. 3. Within the framework of a multivariable regression model, with age and gender as baseline covariates, the association between rs7767396 and VEGF-A level remained strong for both CALGB 80303 (P-value = 1.3e–08) and CALGB 80203 (P-value = 4.5e–04). The percentage of variance in VEGF-A levels explained by rs7767396 was 14.5% for CALGB 80303 and 13.9% for CALGB 80203. Compared to cancer patients with the AA genotype, patients with the G allele (AG and GG genotypes) of rs7767396 have 56% and 57% reductions in median VEGF-A levels, in CALGB 80303 and CALGB 80203, respectively.

Figure 3
figure 3

Associations between rs2284284 and MCP1 levels, rs7504372 and VEGF-C levels, and rs7767396 and VEFG-A levels in both CALGB 80303 and 80203. Boxes represent 25th (Q1) and 75th (Q3) percentiles; Horizontal lines indicate the medians; Upper whiskers indicate min(max(x), Q3 + 1.5 * IQR); Lower whiskers indicate max(min(x), Q1–1.5 * IQR). Points indicate observations.

Associations between either rs7767396 or plasma VEGF-A levels and VEGFA mRNA levels in the tumor

The data from the validation study (CALGB 80203) show no evidence to support the hypothesis that the variability in the VEGFA mRNA levels in the primary tumors, as quantified by any of the four isoforms (121, 145, 165, and 189) individually or the total VEGFA mRNA levels, is associated with either the circulating VEGF-A levels in plasma or the rs7767396 germline variant (Supplementary Fig. 3).

Transcription factor binding and relationship with rs7767396

A bioinformatics analysis was conducted to explore potential mechanisms for the observed association between rs7767396 and VEGF-A plasma levels. HaploReg indicates that the binding of NF-AT1 and ZBRK1 transcription factors may be altered by the presence of the rs7767396 variant, based on the DNA binding motifs of the two factors. For the NF-AT1 transcription factor, DNA binding specificity data is also available for all possible 8-bp sequences from the cis-BP database15. The corresponding protein binding microarray (PBM) data to confirm ZBRK1 is not available. Analysis of rs7767396 using these data (Fig. 4) shows that the reference allele is bound with high affinity by NF-AT1, while the alternative allele is not bound specifically by the protein.

Figure 4
figure 4

rs7767396 disrupts NF-AT1-DNA binding in vitro. Plot shows the NF-AT1 8-mer PBM data for a 21-bp genomic region centered at rs7767396. The reference allele is bound specifically by the protein, as indicated by the fact that it overlaps two 8-mers with a binding score above 0.37 (dotted line), which corresponds to a false discovery rate (FDR) of 0.001 for calling transcription factor binding sites46. The alternate allele is not bound specifically by NF-AT1 in vitro, as indicated by the fact that all 8-mers overlapping the allele have low binding scores.

Discussion

The present study has assessed, for the first time, the genomic basis of differences in levels of circulatory proteins among cancer patients. Using a discovery/validation approach, a common, heritable variant was associated with plasma levels of VEGF-A in two different clinical trials of patients with two types of cancers.

This variant, rs7767396, is located 173 Kbs downstream from VEGFA and could be regarded as a cis protein quantitative trait locus (pQTL) for circulating levels of VEGF-A. Its relative minor allelic frequencies in the CALGB 80303 and 80203 studies are 0.47 and 0.52, respectively, compared to 0.49 reported for the 1000 Genomes Project 1 CEU population20. In both CALGB 80303 and 80203, the association between this variant and circulating VEGF-A levels is strong, with median differences in circulating VEGF-A levels of 2–3 folds across genotypes.

Debette et al. have evaluated the genetic basis of heritability in VEGF-A levels in non-cancer individuals21. For their validation cohort, they report four common variants (rs6921438, rs4416670, rs6993770, and rs10738760) that explain up to 48% of the heritability of serum VEGF-A levels. Although rs7767396 is not among those four variants, they report the results for the top hit of the present study, rs7767396, from their discovery set. In Debette et al., rs7767396 was associated with VEGF-A levels with an estimated effect size of beta = −0.71 and a P-value < 1.71e–48221. The estimated effects in our study are concordant, namely that copies of the G allele are associated with lower circulating VEGF-A levels. It is of interest to note that while only rs10738760 and a proxy for rs4416670, rs3734693, were genotyped in CALGB 80303, there was no evidence for linkage disequilibrium between rs7767396 and rs10738760 (r2 = 0 in CALGB 80303 and 1000 G CEU), or rs7767396 and rs3734693 (r2 = 0 in CALGB 80303 and r2 = 0.01 in 1000 G CEU). However, rs6921438 is in high LD with rs7767396 in 1000 G CEU (r2 = 0.93), and it’s likely that rs7767396 is the causal variant of that haplotype, due to its mechanism elucidated by our results.

Choi et al. report a genome-wide association study (GWAS) meta-analysis consisting of six discovery cohorts to identify SNPs associated with circulating VEGF-A levels22. They identify a panel of 10 SNPs, including three SNPs from the Debette et al. panel, which they report as accounting for 52% of the variability in circulating VEGF-A. For the top hit of the present study, they report a P-value of <4.85e–128422. The reported effect size in each of their cohorts is positive with respect to the A allele, i.e. the A allele is associated with an increase in VEGF-A levels. Again, this is concordant with the direction of the effect size of rs7767396 in our study. Sun et al.23 report results from a pQTL GWAS using blood protein analytes in current and former smokers with or without chronic obstructive pulmonary disease (COPD). They report rs7767396 as the top variant for VEGFA (P-value = 5e–26). They do not report the direction of the effect.

For the first time, we report that rs7767396 predicts levels of circulating VEGF-A also in patients with cancer, in addition to its prediction in patients without cancer, as shown by the studies reported above. This novel finding is of relevance, as cancer patients tend to have higher VEGF-A levels as compared to non-cancer patients, probably due to the influence of somatic and tumor-specific factors9,17,18,24,25. This study cannot establish a biological link between the plasma levels of VEGF-A and the expression of the gene in tumor cells. In fact, rs7767396 in VEGF-A does not predict for mRNA levels of VEGFA in the tumor. This lack of association does not contradict the pQTL observation. It suggests that other factors, very likely of somatic nature and/or less heritable, control the variability of mRNA levels of VEGFA in the tumor of colorectal cancer patients. In addition to the possibility that technical noise is introduced by fixation of tumor cells with formalin (which can affect mRNA stability and quality), another explanation could be that VEGFA mRNA levels in the tumor cells might not be reflective of the expression in the endothelial cells. Additional studies should shed light on the genetic and environmental basis of VEGFA expression in the tumor.

Similar to other genomic studies, the difficulty of interpreting phenotypic associations lies in the demonstration of the mechanistic basis of the findings. Previous studies of pQTLs for VEGF-A have not provided the mechanistic basis of the associations16,17,18 and performing functional experiments can identify new regulators of VEGFA expression. The NF-AT1 transcription factor was predicted by HaploReg to bind to the region where rs7767396 is located, and variation of rs7767396 alters the binding of the NF-AT1 motif. A complementary analysis using the PBM data from the cis-BP database has confirmed this prediction, providing strong evidence in support of the role of rs7767396 with respect to the NF-AT1 transcription factor. More importantly, the PBM data have demonstrated that the binding efficiency of NF-AT1 is dramatically reduced by the presence of the G allele of rs7767396 as compared to the A allele (Fig. 4). VEGF-A triggers activation of NF-AT1 in vascular endothelial cells26,27,28, and NF-AT1 inhibition has been shown to block expression of certain VEGF-A-induced genes29, although it is currently unknown whether NF-AT1 can act as a distal regulator of VEGF-A expression28,30,31. Our study results support the mechanism that, because of reduced transcriptional activation of VEGFA through reduced binding of NF-AT1, subjects with the G allele of rs7767396 have significantly reduced VEGF-A plasma levels in the circulation.

In our discovery cohort, the P-value for the SNP-protein pair (rs7767396-VEGF-A) (P-value = 5.8e–09) did not surpass the strict Bonferroni-corrected threshold of statistical significance (P-value < 3.3e–9). In our validation cohort, however, the corresponding P-value (5.2e–5) surpassed the equivalent threshold (P-value < 1.7e–2). The evidence in support for this pQTL is further bolstered by the strong associations reported in the Debette et al.21 and Sun et al.23 papers.

In addition to the results discussed above, we also provide genome-wide associations of SNPs regarding an additional 28 proteins (Supplementary File 1) to guide future research on these associations. Availability of results of SNP-protein pair associations will allow replication of hits with less statistical significance by other investigators who will test such associations prospectively.

In summary, we have identified a heritable common cis variant that regulates circulating VEGF-A levels in plasma of patients with advanced adenocarcinoma of the pancreas or colon. This variant has been reported to regulate circulating VEGF-A levels in a large cardiology cohort and a large COPD case-control cohort, but no data have been reported in a cancer population. This study is the first to identify a common variant near the VEGFA gene regulating the plasma levels of the protein in cancer patients. It is also the first study elucidating how this variant changes VEGF-A plasma levels, by altering the binding of the NF-AT1 transcription factor to a regulatory element about 170 Kb distant from VEGFA. Similar studies in tumor types outside of the gastrointestinal tract are warranted.

Methods

Clinical trials and patients

CALGB 80303 was the discovery cohort study. CALGB 80303 was a double-blind, placebo-controlled, randomized phase III study of bevacizumab in combination with gemcitabine in treatment-naïve advanced pancreatic adenocarcinoma patients. Patient eligibility, characteristics, stratifications, response evaluation, and treatments have been previously described32. The characteristics of the genetically estimated European patients for whom genotype data and plasma protein levels were available are described in Table 2.

Table 2 Baseline clinical characteristics of patients in CALGB 80303 and 80203.

CALGB 80203 was the validation cohort study. CALGB 80203 was a double-blind, randomized study in patients with previously untreated, advanced or metastatic colorectal cancer. Patients were treated with 5-fluorouracil/oxaliplatin or 5-fluorouracil/irinotecan with or without cetuximab. Additional design details of the CALGB 80203 study were previously described33,34. The characteristics of the self-reported white and non-Hispanic patients, for whom genotype data and plasma protein levels were available, are described in Table 2.

This research was approved by the Institutional Review Board of each participating institution. Each participant signed an IRB-approved, protocol-specific informed consent. All experiments were performed in accordance with relevant guidelines and regulations for the clinical, SNP and protein marker analysis; work was performed under the auspices of protocol number Pro00018430.

Genotyping

In CALGB 80303, germline genome-wide genotype data of 484,523 directly interrogated SNPs were collected from 294 genetically estimated European patients using the Illumina 550 K platform35. Quality control of the genotyping has been previously described35. Among these 294 patients, 216 had consented samples available for measurement of plasma proteins of angiogenesis (Fig. 1). The genotyping was conducted at the Center for Genomic Medicine at the RIKEN Institute.

In CALGB 80203, three SNPs (rs2284284, rs7504372, and rs7767396) were genotyped from germline DNA in 117 self-reported white, non-Hispanic patients (Fig. 2), using TaqMan SNP genotyping assays (Applied Biosystems, Foster City, CA). Sanger-based DNA sequencing (Mammalian Genotyping Core, University of North Carolina-Chapel Hill) was used to validate representative samples and determine thresholds for allelic discrimination. PCR primers used for amplification of genomic DNA prior to sequencing are listed in Supplementary Table 1.

Measurement of plasma proteins of angiogenesis

In CALGB 80303 and 80203, 31 proteins were measured (Table 3) using the SearchLight multiplex platform (Aushon BioSystems, Inc., Billerica, MA). The VEGF-A assay detects the VEGF-A165 isoform (the predominant isoform among the circulating ones) preferentially, but it is not considered isoform-specific36,37. Plasma samples were collected before treatment and stored at −80 °C until analysis. The frozen samples were thawed on ice, centrifuged at 20,000 × g for 5 min to remove any precipitate, and appropriately diluted before placement onto multiplex plates. Plasma samples had no more than two freeze-thaw cycles and all assays were performed in duplicate. The number of patients for whom proteins were measured is described in Table 2. Detailed information about the distributions of these protein markers and their relationship to clinical outcomes in CALGB 8030336 and 8020337 have been published. The analyses were carried out at the Phase I Biomarker Laboratory at Duke University Medical Center.

Table 3 Proteins measured in plasma of patients enrolled in CALGB 80303 and 80203.

Statistics

A two-stage approach was used to detect genetic associations with circulating protein levels. CALGB 80303 served as the discovery cohort, and selected genetic variants were then genotyped in CALGB 80203, which served as the validation cohort. In the discovery cohort, the genome-wide analyses used an additive genetic model, and only autosomal SNPs were evaluated.

The Jonckheere-Terpstra38,39 test was used to test the association between each SNP and protein level. The Jonckheere-Terpstra test was selected for its desirable qualities of being rank-based, making it robust to outliers, and distribution-free, meaning its validity would not depend on the normality and homogeneity of the variances. Since the test is powered for ordered alternatives, it is applicable to pQTL-type analyses (analogous to the Cochran-Armitage test for binary outcomes). The variance of the Jonckheere-Terpstra test was approximated using expression 6.19 in Hollander, et al.40. For each protein level, the distribution of the marginal P-values across all SNPs was assessed empirically using QQ and Manhattan plots. A robust linear regression rank-based approach41,42 was used to estimate the proportion of variance of the phenotype explained by SNPs. For this analysis, the protein level was log base 2 transformed. This regression approach was also used to investigate the relationship between protein level and genotype, accounting for sex and age (log10 transformed) as baseline covariates, in both CALGB 80303 and 80203. The concordance between plasma protein levels and mRNA levels in tumors was assessed using Kendall’s test of concordance43.

All statistical analyses were conducted using two-sided alternative hypotheses and were restricted to genetic Europeans for CALGB 80303 or self-reported white, non-Hispanic patients for CALGB 80203. For validation of SNP-protein pairs, we specified that the direction of effect with respect to the minor frequency allele must be the same in both studies and used a marginal two-sided P-value cutoff of 0.05/k, where k denotes the number of pairs chosen for validation based on the results from the discovery cohort. Additional statistical details are reported in the Supplementary Information section.

Protein-DNA binding microarray data

After using HaploReg v4.144 to identify putative transcription factors binding in the region of rs7767396, we tested the specificity of binding used PBM data from the cis-BP database45. PBM experiments provide, in addition to DNA motif models, quantitative measurements of in vitro protein-DNA binding specificity for all possible 8-bp DNA sequences. Such measurements can be downloaded from cis-BP in the form of 8-mer binding enrichment scores, which vary from −0.5 to 0.5, with values above 0.37 corresponding to 8-mers specifically bound by the protein (false discovery rate, FDR, 0.001)46.

Quantification of mRNA VEGF-A isoforms in primary tissues from CALGB 80203 patients

These methods are reported in the Supplementary Information section.