Strong cis-acting expression quantitative trait loci for the genes encoding SNHG5 and PEX6

Supplemental Digital Content is available in the text


Introduction
As of 2003, more than 170 million base pairs had been completely sequenced for human chromosome 6. [1] Genome-wide association studies (GWAS) have unraveled a large number of singlenucleotide variants associated with human traits. According to the GWAS Catalog, 32,768 genetic associations were reported across hundreds of diseases and quantitative traits as of October 2016 (P < 10 À5 ; https://www.ebi.ac.uk/gwas). The GWAS signals were more frequently found in chromosome 6 than in any other chromosomes, even in chromosome 1, which is 1.5 times longer than chromosome 6. The chromosome 6 has medical significance, with about 120 causal genes reported for complex diseases, such as cancers, heart disease, diabetes, Alzheimer disease, rheumatoid arthritis, and multiple sclerosis. [2][3][4] In particular, major histocompatibility complex and related regions including the human leukocyte antigen (HLA) genes in chromosome 6 are critically related to infection, immunity, and inflammation. [2] Many nucleotide variants turned out to be important for susceptibility to complex diseases from GWAS. One example is rs9378815, an intergenic single-nucleotide polymorphism (SNP) on chromosome 6, which has been identified as a rheumatoid arthritis risk locus from a trans-ethnic GWAS meta-analysis using >100,000 subjects (P = 1.7 Â 10 À10 ). [5] Another example is rs2046210, an SNP upstream of the gene encoding estrogen receptor alpha on chromosome 6, which was associated with susceptibility to breast cancer in a study involving over 31,000 subjects (eg, P = 1.5 Â 10 À30 in Chinese women). [6] Nevertheless, their underlying mechanisms are unknown. Identifying nucleotide variants that influence RNA transcript expression levels is critical to understand underlying mechanisms of GWAS signals. The objective of the current study was to identify expression quantitative trait loci (eQTLs) for genes located in chromosome 6, which might have abundant potential regulatory sequences.

Methods
The study used RNA expression data for lymphoblastoid cells of 376 unrelated individuals from European populations of CEPH (CEU), Finns (FIN), British (GBR), and Toscani (TSI) to identify regulatory variants for expression of the 573 genes in chromosome 6, which were obtained from the study by Lappalainen et al. [7] The genes included protein-coding genes and long intergenic noncoding RNA genes. The expression level was calculated as the sum of reads per kilobase of transcript per million mapped reads values of all the transcripts of each gene for each individual. A read was counted in an exon if either start or end coordinate of the read was in the exon. To avoid overestimating gene expressions, the count was divided by the number of overlapping exons among split reads. Their corresponding genotypes were obtained from the 1000 Genomes project (phase I; http://phase1browser.1000genomes.org/) that originally conducted whole genome sequencing with an average depth of 5Â, and whole exome sequencing with an average depth of 80Â using 1092 individuals from 14 populations of Europeans, East Asians, Saharan Africans, and Americans. A total of 404,240 SNPs on chromosome 6 among 5,941,815 genome-wide SNPs were included in the current analysis after a filtration process (excluding SNPs with minor allele frequency [MAF] <0.05). Three individuals with missing rate >0.05 of the genotypes were excluded from the analysis. Ethical approval was not necessary because we dealt with publically available data.
Genetic associations of SNPs with RNA expression were conducted to determine the significance of regression coefficient for minor allele effect. We employed the following mixed model incorporating a genomic relationship matrix (GRM) that reflected polygenic covariance among individuals to avoid population stratification [8] : where y is the vector of gene expression levels, m is the overall mean, 1 is the vector of 1 s, b is the fixed effect for minor allele of the SNP to be tested for association, and x is the vector of the numbers of minor alleles of the SNP, for example, 0(0, 0) for homozygote of the major allele, 1 (1, 0) for heterozygote, and 2 (1, 1)for homozygote of the minor allele in additive (dominant, recessive) model. Vector g is the vector of random polygenic effects with g ∼ Nð0; As 2 g Þ, where A is GRM estimated using SNPs, and s 2 g is the polygenic variance component. Elements of A consist of the following pair-wise genomic relationship coefficients estimated using genotypes of SNPs in linkage equilibrium (r 2 > 0.8): where a jk is the genomic relationship coefficient between j th and k th individuals, N S is the number of SNPs, x ij (x ik ) is the number (0, 1, or 2) of the minor allele at the i th SNP for the j th (k th ) individual, p i is the frequency of the minor allele at the i th SNP. e is the vector of random residuals with e ∼ Nð0; Is 2 e Þ, where s 2 e is the residual variance component and I is the identity matrix. All the association analyses were conducted using Genome-wide Complex Trait Analysis v1.24 freeware. [9] Multiple testing was applied employing Bonferroni correction. A total of 231,629,520 association tests for 573 genes and 404,240 SNP markers resulted in the significance threshold value of 2.16 Â 10 À10 . The association tests were analyzed using PLINK ver.1.04 (Free Software Foundation, Inc. Boston, MA).
Linkage disequilibrium (LD) blocks were constructed based on the algorithm of Gabriel et al [10] (Haploview version 4.2). The algorithm used a 95% confidence interval (CI) of pair-wise D' estimate between SNPs with MAF >0.05.
Promoters/enhancers including the significant SNPs were examined using genome-wide chromatin interaction analysis with paired-end-tag sequencing (ChIA-PET) data, which resulted from the study by Li et al. [11] The paired-end-tag (PET) sequences were mapped with the human reference genome of hg19. Highconfidence intrachromosomal PET clusters with P FDR <0.05were included with genomic span of 8 Kb to 1 Mb and PET count >3 for each PET cluster from the saturated libraries. Interaction anchors flanking transcription start site ±5 Kb were defined as promoters.
Enrichment analysis was conducted to analyze associations of significant SNPs with diseases and biological processes using genomic regions enrichment of annotations tool. [12] Diseases and biological processes based on the Gene Ontology (http://www. geneontology.ogr/) were used as input annotation terms. Each gene involved in input annotation terms was assigned to a regulatory domain that extends up to 1 Mb in both directions. Significance of enrichment was determined by a hypergeometric test over genes. The significance threshold was adjusted by false discovery rate (FDR) to control false-positives.
Other enrichment analyses were also performed to determine enrichment of cis versus trans regulation, promoter/enhancer versus nonpromoter/nonenhancer regions, and downstream versus upstream regions. Their significances were all determined with P < 0.01 by the Fisher exact test.

Results
The eQTL analysis revealed that 24,447 SNPs were associated with expression of the 88 genes located on chromosome 6 (P < 2.16 Â 10 À10 ). Most were identified using the additive model (24,213 SNPs). They were enriched (P FDR < 0.05, where P FDR is P value adjusted by FDR) with genes involved in a variety of diseases (Supplementary Table 1, http://links.lww.com/MD/ B483) and biological processes (Supplementary Table 2, http:// links.lww.com/MD/B483).The SNPs were enriched (P < 0.01) in promoter and enhancer regions resulting from a ChIA-PET study. [11] There were 18,872 cis-eQTLs within 1 Mb from target genes and 5575trans-eQTLs >1 Mb away from target genes, and they were largely cis-regulatory eQTLs (77% = 18,872/24,447) by the enrichment analysis (P < 0.01). The cis-eQTLs appeared considerably clustered with 68% (12,886/18,872) located within 100 Kb. Also, 4960 SNPs were associated with expression of multiple genes, which resulted in 18,378 unique SNPs. In particular, 36 SNPs were observed with extremely low P value (P < 10 À100 ; Table 1). All 36 SNPs influenced RNA expression of the genes encoding small nucleolar RNA host gene 5 (SNHG5) and peroxisomal biogenesis factor 6 (PEX6). The SNPs were all located in or near the corresponding gene. On chromosome 6, the SNHG5 gene is located from 85,676,990 to 85,678,736 bp (GRCh38/hg38), and the PEX6 gene is located from 42,964,335 to 42,979,150 bp (GRCh38/hg38). Considerable numbers of SNPs flanking the genes were associated with their expression (P < 2.16 Â 10 À10 ; Fig. 1). Most of the strongly significant SNPs with P < 10 À100 were found downstream of the SNHG5 gene and within the PEX6 gene. The SNPs were observed in strong linkage (r 2 > 0.8), forming LD blocks (Fig. 2).

Discussion
The current study identified 24,447 SNPs associated with RNA expression of genes located in chromosome 6, including cis-eQTLs (77%) that may affect transcription initiation complex on the same physical chromosome. The cis-eQTLs tended to be considerably clustered with 68% located within 100 Kb from their corresponding target gene, which concurred with the previous study where 80% of cis-eQTLs was identified within 100 Kb using 9 types of tissues. [13] The current study showed the pleiotropy that 4960 SNPs were associated with expression of multiple genes. This implied direct In particular, 36 SNPs showed strong associations with P < 10 À100 , and they were all associated with expression of the gene encoding SNHG5 or PEX6. All the SNPs were located flanking the corresponding gene within 200 Kb. An enrichment analysis showed that the SNPs strongly associated with expression of the SNHG5 were enriched downstream of the gene (P < 0.01). However, SNPs with P < 2.16 Â 10 À10 were enriched upstream of the gene (P < 0.01).The difference in the degree of significance implied that downstream and upstream signals may act as regulatory sequences with different importance. The association signal of rs6922 (P = 1.26 Â 10 À54 ) was located far from the target gene (SNHG5; 181,402 bp apart), and thus it was suspected as an enhancer variant. This speculation led us to compare it with the sequences resulting from a ChIA-PET study. [11] As a result, we found that 1 anchor of an interacting PET cluster included rs6922,and the other anchor was the promoter of the SNHG5gene. The current studyalso showed that nucleotide substitution from G to T at rs6922 decreased the expression of SNHG5 (beta = À53.7). The SNPs with P < 2.16 Â 10 À10 were enriched in the promoter and enhancer regions of the SNHG5 gene (P < 0.01; 7 SNPs in the promoter and 23 SNPs in the enhancer regions).
Nucleotide substitution of such variants might influence the expression of SNHG5 gene and is candidate genetic factor for susceptibility to cancer. This is supported by previous studies in which a variety of cancers were induced by low expression of snoRNAU50, a product generated from introns 4 and 5 of the SNHG5 gene. Expression of U50 is down-regulated in prostate Table 1 Genetic associations of nucleotide variants with mRNA expression of genes located in chromosome 6 * . Only SNPs strongly associated with gene expression are presented (P < 10 À100 ). † Its expression was associated with SNP. ‡ Major allele/minor allele.
x Minor allele frequency. jj P value for testing Hardy-Weinberg equilibrium. All the SNPs with P < 10 À100 were in Hardy-Weinberg equilibrium (P HWE > 0.05). ¶ Regression coefficient estimated using the additive model. Positive estimate means that minor allele increased expression of the gene, and negative estimate means that minor allele decreased expression. # Enhancer/promoter. The SNPs in second column are within enhancer or promoter of gene in first column. and breast cancer samples. [14,15] These studies also showed that the expression of U50 can inhibit colony formation of prostate and breast cancer cells. SNHG5 gene is located at the chromosomal translocation breakpoint involved in B-cell lymphoma (provided by RefSeq, July 2008). It was recently reported that the serum levels of SNHG5 were up-regulated in patients with malignant melanoma, and that SNHG5 may function in melanomagenesis or melanoma metastasis. [16] Further studies are needed to understand the underlying mechanisms of nucleotide substitutions on cancers and their subtypes.
The current study revealed many intragenic eQTLs for the PEX6 gene. The gene encodes a member of the ATPases associated with diverse cellular activities. In particular, the PEX6 plays a key role in biogenesis of peroxisome, which is a small eukaryotic organelle specialized to perform oxidative reactions. Its mutations cause peroxisome biogenesis disorders of complementation groups 4 and 6. [17] This suggests that the eQTLs detected from the current study might influence on susceptibility to the peroxisome biogenesis disorders by abnormal expression of PEX6.
Two of the eQTLs for PEX6 had been identified for association with prostate cancer from 2 previous studies which revealed rs10948059 and rs9462856 linked to it in European populations, respectively. [18,19] These findings were interpreted that the genetic associations with prostate cancer were produced by regulating the expression of the glycine N-methyltransferase (GNMT) gene that overlaps with the PEX6 gene. The authors suspected the GNMT gene because the rs10948059 and rs9462856 were their intronic variants. Since the nucleotide variants were revealed as a strong eQTL for PEX6 gene from the current study (P = 6.98 Â 10 À108 ), the variants associated with prostate cancer might be identified through expressional change in PEX6. This was supported by the results of the Human Protein Atlas in which PEX6 is down-regulated in prostate cancer cells (http://www. proteinatlas.org/ENSG00000124587-PEX6/cancer).
We found that the strong eQTLs (P < 10 À100 ) using additive model ("AA" vs "AB" vs "BB," where A is major allele and B is minor allele [20] ) were all significant using dominance ("AA + AB" vs "BB") and recessive ("AA" vs "AB + BB") models (P < 2.16 Â 10 À10 ). However, they were not as strong as observed using the additive model, and all showed P > 10 À100 . The best fit into the additive model implied that the expressional regulation of eQTLs was correlated with the number of minor alleles.
The current study also identified eQTLs with 10 À100 < P < 2.16 Â 10 À10 . In particular, some eQTLs for PEX6 (rs9471950, rs9471951, rs7754294, rs13203402, rs7740252, rs7758576, and rs7759112; 5.63 Â 10 À12 < P < 1.91 Â 10 À10 ) turned out to be enhancer variants of the serum response factor (SRF) gene, which resulted from chromatin interaction analysis with pairedend-tag using RNA polymerase II. [11] The regulatory variants could be plausible because SRF is a well-known transcript factor for the PEX6 gene.  In this study, we employed a mixed model incorporating pair-wise polygenic relationship among individuals to control population stratification, which may reduce false-positive eQTLs. [8] The analysis revealed strong cis-acting eQTLs for SNHG5 and PEX6, and offered insights about their cellular mechanism of transcriptome variation. However, this study was limited to gene expression in lymphoblastoid cells. Other strong eQTLs would be identified with different cell types. In addition, this study had another limitation in understanding specific functions of eQTLs in gene expression (eg, splicing regulation and transcript stability).
In conclusion, the current study revealed novel eQTLs for SNHG5 and PEX6 genes in chromosome 6.Nucleotide substitutions of the eQTLs might be candidate factors for a variety of cancers by regulating expression of the 2 genes. Further studies are needed to understand their underlying mechanisms.