Twelve New Genomic Loci Associated With Bone Mineral Density

Aiming to identify more genomic loci associated with bone mineral density (BMD), we conducted a joint association analysis of 2 genome-wide association study (GWAS) by the integrative association method multi-trait analysis of GWAS (MTAG). The first one is the single GWAS of estimated heel BMD (eBMD) in the UK biobank (UKB) cohort (N = 426,824), and the second one is the GWAS meta-analysis of total body BMD (TB-BMD) in 66,628 participants from 30 studies. Approximate conditional association analysis was performed in the identified novel loci to identify secondary association signal. Statistical fine-mapping was conducted to prioritize plausible credible risk variants (CRVs). Candidate genes were prioritized based on the analyses of cis- expression quantitative trait locus (cis-eQTL) and cis-protein QTL (cis-pQTL) information as well as the functional category of the SNP. By integrating the information carried in over 490,000 participants, this largest joint analysis of BMD GWAS identified 12 novel genomic loci at the genome-wide significance level (GWS, p = 5.0 × 10−8), nine of which were for eBMD and four were for TB-BMD, explaining an additional 0.11 and 0.23% heritability for the two traits, respectively. These loci include 1p33 (lead SNP rs10493130, peBMD = 3.19 × 10−8), 5q13.2 (rs4703589, peBMD = 4.78 × 10−8), 5q31.3 (rs9324887, pTB−BMD = 1.36 × 10−9), 6p21.32 (rs6905837, peBMD = 3.32 × 10−8), 6q14.1 (rs10806234, peBMD = 2.63 × 10−8), 7q21.11 (rs10806234, pTB−BMD = 3.37 × 10−8), 8q24.12 (rs11995866, peBMD = 6.72 × 10−9), 12p13.31 (rs1639122, peBMD = 4.43 × 10−8), 12p12.1 (rs58489179, peBMD = 4.74 × 10−8), 12q24.23 (rs75499226, peBMD = 1.44 × 10−8), 19q13.31 (rs7255083, pTB−BMD = 2.18 × 10−8) and 22q11.23 (rs13056137, pTB−BMD = 2.54 × 10−8). All lead SNPs in these 12 loci are nominally significant in both original studies as well as consistent in effect direction between them, providing solid evidence of replication. Approximate conditional analysis identified one secondary signal in 5q13.2 (rs11738874, pconditional = 5.06 × 10−9). Statistical fine-mapping analysis prioritized 269 CRVs. A total of 65 candidate genes were prioritized, including those with known biological function to bone development (such as FGF1, COL11A2 and DEPTOR). Our findings provide novel insights into a better understanding of the genetic mechanism underlying bone development as well as candidate genes for future functional investigation.


INTRODUCTION
Osteoporosis is a common aging-related disease characterized by low bone mass and micro-architectural deterioration of bone tissue with a consequent increase in bone fragility and susceptibility to fracture (1). These later complications are associated with significant individual morbidity and related healthcare costs. Fifteen per cent of white people over 50 years old suffer osteoporotic fracture in their remaining lifetime, and the projected costs expended on this disease will exceed $25 billion in the United States alone by year 2025 (2). Therefore, a better understanding of the mechanisms underlying osteoporosis may help to develop medications for osteoporosis prevention and treatment.
In the present study, aiming to identify more genomic loci that are responsible for BMD variation, we conduct a joint analysis of 2 GWAS analyses. The first one is the largest single GWAS study of estimated heel BMD (eBMD) in 426,824 UK biobank (UKB) participants (17), and the second one is the GWAS meta-analysis of total body BMD (TB-BMD) in 66,628 participants (15), which is so far the second largest study for BMD. This analysis integrates the information from an expanded size of over 490,000 participants, and therefore has the potential maximal statistical power of gene mapping to date. We further prioritize plausible functional variants and candidate genes for future experimental validation.

MATERIALS AND METHODS
We performed a joint analysis of summary statistics from two large-scale BMD GWAS studies. No new ethnic approval was required.

Study Samples
Two studies were incorporated into this joint association analysis. The first one is the single GWAS of eBMD in the UKB cohort (N = 426,824). In brief, the UKB sample is a large prospective cohort study of ∼500,000 participants from across the United Kingdom, aged between 40 and 69 at recruitment.
Heel BMD was estimated based on quantitative ultrasound speed of sound (SOS) and broadband ultrasound attenuation (BUA). Genome-wide genotypes were available for all participants at 784,256 genotyped autosome markers, and were imputed into UK10K haplotype, 1000 Genomes project phase 3 and Haplotype Reference Consortium (HRC) reference panels by IMPUTE2 (18). After quality controls, 426,824 qualified participants were used in GWAS analysis (17).
The second one is the GWAS meta-analysis of TB-BMD in 66,628 participants from 30 studies (15). In brief, TB-BMD was measured by dual energy X-ray absorptiometry (DXA). All participants had genome-wide genotype data and were imputed into the 1,000 Genomes project phase 1 or the combined 1,000 Genomes project phase 3 and the UK10K reference panel. Almost all popular imputation methods were used across studies (Supplementary Table 1). In their analyses, association was performed in each individual study, and the summary statistics across the 30 studies were meta-analyzed by a fixed-effects model.
A total of 1,553 participants from the UKB cohort were included in the TB-BMD study, accounting for 2.3 and 0.4% of TB-BMD and eBMD sample sizes, respectively. Association summary statistics for both studies were publicly available at the genetic factors for osteoporosis consortium (GEFOS) website (http://www.gefos.org), and were downloaded for analysis.

SNP Inclusion Criteria
The SNP inclusion criteria were same as described previously (19). In brief, in each study, non-SNP variants and ambiguous SNPs (i.e., multiple SNPs corresponding to one single identifier) were excluded. In addition, SNPs presenting significant metaanalysis genetic heterogeneity (I 2 > 50% or Q p-value < 0.1) in the TB-BMD study were excluded. After quality control (QC), 8,680,009 SNPs are available in both studies. After removing SNPs that are not concordant between the two studies (i.e., A/G vs. T/G polymorphisms), a total of 7,369,691 SNPs in common to both studies were used in subsequent analyses.

Joint Association Analysis
The recently developed multi-trait analysis of GWAS (MTAG) method was used for joint association analysis, which accounts for sample overlap and trait heterogeneity between studies (20). In brief, MTAG estimates per SNP effect size for each trait by incorporating information contained in other correlated traits, and therefore has potential to improve statistical power of association test. MTAG takes summary statistics from multiple studies as input. The output of MTAG contains re-estimated effect size and p-value for each SNP in each trait.
The linkage disequilibrium score regression (LDSC) method was applied to the MTAG results to estimate the amount of genomic inflation due to confounding factors such as population stratification and cryptic relatedness (21). LDSC takes GWAS summary statistics as input and partitions overall inflated association statistic into one part attributable to polygenic architecture and another part due to population stratification and cryptic relatedness. Reference LD scores for the European population were downloaded from the software website (https://github.com/bulik/ldsc). The relative contribution of confounding factors was measured by attenuation ratio (AR), which is defined as (intercept-1)/(mean chi 2 −1), where intercept and mean chi 2 are estimates of confounding and the overall association inflation, respectively (21).
Genome-wide significance (GWS) level was set to be 5.0 × 10 −8 . An independent locus was defined as one 1-MB region, which consists of two 500 kb regions from the lead SNP to both directions. A novel locus was declared if it was neither reported in previous GWAS studies nor significant at the GWS level in eBMD or TB-BMD study.
Individual variant effect size was estimated with the formula 2f (1-f )β 2 , where f is allele frequency and β is regression coefficient, which was estimated by MTAG.

Approximate Conditional Analysis
To identify additional signals in regions of association, approximate conditional association analysis was performed in each region using the genome-wide complex trait analysis (GCTA) tool (22). A reference sample of 100,000 unrelated participants from the UKB cohort was generated for estimating LD pattern. Specifically, a total of 369,968 unrelated participants were inferred with kinship-based inference for GWAS (KING) software (23), from whom the 100,000 participants of the reference sample were randomly drawn.
A recursive conditional association analysis was performed. In each iteration, an approximate conditional analysis conditioning on the current list of lead variants was performed. A secondary significant variant was defined at the conditional GWS level (conditional p < 5 × 10 −8 ). The variant with the smallest pvalue among such identified ones was added into the list of lead variants. Iterations of the conditional analysis were run until no significant signal can be identified.

Statistical Fine-Mapping of Credible Risk Variants
Statistical fine-mapping of credible risk variants (CRVs) was performed with FINEMAP (24). FINEMAP uses GWAS summary statistics and applies a shotgun stochastic search algorithm to efficiently exploring a set of most important causal configurations in the associated region. It relies on a matched reference panel for LD estimation. Again, the above reference panel of 100,000 unrelated UKB participants was used for LD estimation with LDstore (25). Software parameters were set by default. The outputs of FINEMAP include the posterior probability of each SNP being causal. For each locus, we sorted the posterior probabilities in an descending order, and constructed the set of CRVs by including those SNPs whose posterior probabilities were within one order of magnitude of the largest posterior probability.

Candidate Gene Prioritization
We prioritized candidate genes in the identified novel loci by annotating the CRVs for their cis-expression quantitative trait locus (cis-eQTL) and cis-protein QTL (cis-pQTL) effects, and for their distances to genes.
Cis-eQTL effect was assessed from two datasets. The first one is the 44 tissues from the GTEx project (v6) (26). Pre-compiled cis-eQTL results were downloaded from the GTEx web portal (www.gtexportal.org/). The distance between the SNP and the transcription starting site (TSS) of the target gene was assumed to be <500 kb. Assuming a maximal number of 5,000 independent variants over such a 1-MB region, significant cis-eQTL was declared at p < 1.0 × 10 −5 (0.05/5,000). The second one is the lymphoblastoid cell lines of 373 European individuals from the 1,000 genomes project (27). Pre-compiled cis-eQTL results were downloaded from the study website (https://www.ebi.ac.uk/ Tools/geuvadis-das/). Significant cis-eQTL was declared under the same criteria.
Cis-pQTL information was accessed from Sun et al. (28). In that largest proteome study to date, the authors measured plasma protein levels of 3,301 healthy individuals using the SOMAscan platform (SomaLogic, Inc., Boulder, Colorado, USA) comprising 4,034 distinct aptamers (SOMAmers) covering 3,623 proteins. GWAS summary statistics for 3,284 proteins were downloaded from the study's website. Cis-pQTL was searched within 500 kb distance from a target gene. Analogously to the cis-eQTL analysis, significant cis-pQTL was declared at p < 1.0 × 10 −5 .
SNPs were annotated by variant effect predictor (VEP) for their functional category (29). SNP-gene distance was annotated by prioritization of candidate causal genes at molecular QTLs (ProGeM) software (30). Genes nearest, second nearest and third nearest to each lead SNP were listed.

Main Association Results
There are 13,753,401 and 18,259,434 SNPs in the eBMD and TB-BMD studies. After QC, 7,369,691 SNPs present in both studies were qualified for analysis. Genetic correlation coefficient was reported to be 0.57 (19), implying shared BMD heritability between the two studies. In the MTAG analysis of the eBMD study, the intercept and mean chi-square are 1.09 and 3.30, respectively, suggesting that 96.1% of the inflation in the mean chi-squared statistic is from polygenic architecture rather than from population stratification. Similarly, the intercept and mean chi-square in the MTAG analysis of the TB-BMD study are 0.78 and 1.84, respectively, suggesting that most of the inflation is from polygenic architecture.
The original UKB eBMD study summary statistics contain 76,400 SNPs significant at the GWS level, encompassing 659 distinct loci. In the MTAG analysis, all the 659 lead SNPs are consistent in effect direction with original ones. Among them, 556 (84.4%) remain significant at the GWS level. Though the other 103 lead SNPs become non-significant at the GWS level, they are all nominally significant (p < 0.05) with p-value ranging from 5.02 × 10 −8 to 8.88 × 10 −5 . Of the 556 GWS significant  SNPs, p-value at 193 (34.7%) SNPs gets smaller while those at 363 SNPs gets higher. The original GEFOS TB-BMD study summary statistics contain 3,842 SNPs significant at the GWS level, encompassing 68 distinct loci. In the MTAG analysis, all the 68 lead SNPs are consistent in effect direction with original ones. Among them, 59 (86.8%) remain significant at the GWS level, while the other nine become non-significant at the GWS level. Only one SNP rs1037011 becomes non-significant even at the loosely nominal significance level (p = 0.07). Interestingly, this SNP is GWS significant in both original studies (p eBMD = 3.40 × 10 −9 , p TB−BMD = 1.54 × 10 −12 ), but the effect allele T has opposite direction (β eBMD = 0.01, β TB−BMD = −0.04). The possible reason for the opposite directions could be strand alignment error or true opposite genetic effects, pending further investigation. Of the 59 GWS significant SNPs, p-value at up to 53 SNPs gets smaller while that at the remaining 6 SNPs gets higher.
To search for additional loci, we evaluated MTAG results of all SNPs excluding those within 500 kb of a locus that was either GWS significant in either original study or was reported previously. This identified 225 SNPs in the eBMD study and 15 SNPs in the TB-BMD study. To increase the confidence of association results, only SNPs of p-value < 0.05 in both original studies were kept, resulting in 79 SNPs in the eBMD study and again 15 SNPs in the TB-BMD, encompassing nine and four distinct loci, respectively. Only one locus overlapped between the two studies, resulting in a total number of 12 distinct novel loci. All the 12 lead SNPs are consistent in effect direction in both original studies. Manhattan plot is displayed in Figure 1 and main results are listed in Table 1. Regional plots at all the novel loci are displayed in Supplementary Figure 1. The 12 lead SNPs collectively explain 0.11 and 0.23% phenotypic variance for eBMD and TB-BMD, respectively.
In Supplementary Table 2, we listed genomic distance and LD value (r 2 ) between the lead SNP at each of the identified novel loci and the nearest GWS SNP at the nearest known locus. The results show that all pairs of SNPs are in complete linkage equilibrium, confirming the novelty of the identified loci.
No secondary associations were found in the TB-BMD analysis.

Credible Risk Variants
With FINEMAP, we performed a statistical fine-mapping analysis. In the eBMD analysis, 214 CRVs were prioritized at the 12 novel loci, with an average of 18 variants per locus. The locus with most number of variants is 12q24.23, in which up to 38 variants were prioritized. There is one locus at 19q13.31 with only one causal variant being prioritized, which is the lead  Table 3).
In the TB-BMD analysis, 64 variants are prioritized at 12 loci, with an average of 5 variants per locus. The locus with most number of variants is again 12q24.23, in which 7 variants are prioritized. No locus is found to have only one causal variant (Supplementary Table 3).
The two sets of CRVs derived from both analyses were merged into one single set of 269 CRVs, where 9 CRVs overlap between the two analyses.

Newly Identified Loci/Genes
We next prioritized candidate genes based on the annotations of cis-eQTL and cis-pQTL effects and functional categories in the above set of CRVs.
In the 44 tissues of the GTEx (v6) dataset, 116 CRVs from 10 regions are cis-associated with the expression of nearby genes at a variety of tissues. The two regions with no evidence of cis-eQTL effect are 6q14.1 and 7q21.11 (Supplementary Table 4).
In the lymphoblastoid cell lines, 12 variants from four regions are associated with nearby gene expressions, where seven variants overlapped with the GTEx variants. There are three associated genes that are not found in the list of GTEx prioritized genes (Supplementary Table 4).
In the pQTL dataset, multiple SNPs at 12p13.31 are cisassociated with TAPBPL (TAP Binding Protein Like) protein level. The most significant association is observed at rs1639122 (p = 3.31 × 10 −43 ), which is a mis-sense mutation in a nearby gene CHD4 (Chromodomain Helicase DNA Binding Protein 4). Another SNP rs11609726 at this region is also associated with one additional protein C1RL (Complement C1r Subcomponent Like, p = 4.36 × 10 −8 ) (Supplementary Table 5).
Combining the various annotations, a total of 65 candidate genes were prioritized at the 12 novel loci ( Table 2).

DISCUSSION
By conducting a joint association study of 2 large-scale GWAS analyses, the present study integrated the information contained  in over 490,000 participants, making it the largest BMD GWAS analysis to date. As a result, 12 new genomic loci were identified, demonstrating the enhanced statistical power. Because the two integrated traits are not same, naive metaanalysis is not applicable. Instead, we used the recently developed method MTAG for integration analysis. MTAG is applicable to analyzing genetically correlated traits. It uses two sources of information to integrate association signals. The first one is that the true effect of target SNP is correlated across traits, and the second one is that the estimation error of the SNP' effects is correlated across traits (20). When two GWAS summary statistics datasets have overlapped individuals, for example, shared control individuals, MTAG is indeed capable of handling this relatedness. It accomplishes this by applying the bivariate LD score regression to estimating the correlation in GWAS estimation error due to sample overlap. In our analysis, 1,533 participants overlapped between the eBMD and the TB-BMD studies, accounting for 0.36% of the total UKB participants. This tiny portion of overlap is not expected to have a major effect on the results, let alone that MTAG analysis took the overlap into account.
We were able to identify only one bone-related dataset, which is the lymphoblastoid cell lines. Only 12 CRVs from this dataset were found to exert cis-eQTL activity. The use of the 44 GTEx tissues was not motivated by their specific relationship to BMD, but was by a maximal coverage of available cis-eQTL SNPs. Because while some cis-eQTL activities are tissue-specific, some others are common across tissues. For example, Of the 12 cis-eQTL variants identified from the lymphoblastoid cell lines, seven overlapped with the GTEx cis-eQTL variants, implying that a considerable portion of cis-eQTL sites were common across tissues.
The prioritized candidate genes include those being linked to bone biology in previous literature. At 5q31.3, for example, the lead SNP rs9324887 is located in the intron of FGF1 (Fibroblast Growth Factor-1) gene and is associated with its expression. FGF1 is a member of the FGF signaling pathway that participates in the regulation of bone development (31). Local and systemic FGF1 increases new bone formation and bone density. It also appears to restore bone microarchitecture and prevent bone loss associated with estrogenwithdrawal (32). At 6p21.32, COL11A2 (Collagen Type XI Alpha 2 Chain) is one of the prioritized genes by cis-eQTL analysis. It is a member of the collagen family of extracellular proteins. It is a critical positive factor in the regulation of extra cellular matrix (ECM), which mineralizes to bone (33,34). At 8q24.12, two genes DSCC1 (DNA Replication And Sister Chromatid Cohesion 1) and DEPTOR (DEP Domain Containing MTOR Interacting Protein) were prioritized. It was only recently that DEPTOR was found to play a novel function in osteogenic differentiation by inhibiting MEG3-mediated activation of BMP4 signaling, suggesting its involvement in osteoporosis (35).
In addition to those established candidate genes to bone biology, we also prioritized multiple candidate genes with no known function to bone metabolism. Among them, SEMA3D (Semaphorin 3D) at 7q21.11 belongs to a member of the semaphorin III family of secreted signaling proteins that are involved in axon guidance during neuronal development (36). Despite the lack of evidence for its involvement in bone development, one of its paralogs, SEMA3A (Semaphorin 3A), is found to regulate bone remodeling indirectly by modulating sensory nerve development instead of by acting on osteoblasts (37). In addition, SEMA3A and SEMA3E (Semaphorin 3E), are also reported to be associated with hypogonadotropic hypogonadism (38,39). Hypogonadotropic hypogonadism is known to regulate bone density (40). These lines of evidence imply SEMA3D may have a regulatory role on bone development.
In conclusion, by conducting a joint analysis of two largescale genome-wide association study and meta-analysis, we have identified 12 novel loci associated with BMD. Our findings provide candidate genes for future functional investigations and for a better understanding of the genetic mechanism underlying bone development.

AUTHOR CONTRIBUTIONS
LZ was in charge of conceptualization and design. LL, MZ, and Z-GX contributed to data collection, formal analysis, and manuscript drafting. JL and H-PP contributed to the literature review and manuscript editing. Y-FP, H-PS, and LZ are responsible for supervision and funding. All authors approved the final version to be published.

ACKNOWLEDGMENTS
We are grateful to the GEFOS consortium and the eBMD study for releasing the large-scale GWAS summary statistics that were used in this study. LZ and Y-FP are partially supported by the national natural science foundation of China (31571291, 31771417 and 31501026) and a project of the priority academic program development (PAPD) of Jiangsu higher education institutions. The numerical calculations in this paper have been done on the supercomputing system of the National Supercomputing Center in Changsha.