Genome Wide Association Study of Age at Menarche in the Japanese Population

Age at menarche (AAM) is a complex trait involving both genetic and environmental factors. To identify the genetic factors associated with AAM, we conducted a large-scale meta-analysis of genome-wide association studies using more than 15,000 Japanese female samples. Here, we identified an association between SNP (single nucleotide polymorphism) rs364663 at the LIN28B locus and AAM, with a P-value of 5.49×10−7 and an effect size of 0.089 (year). We also evaluated 33 SNPs that were previously reported to be associated with AAM in women of European ancestry. Among them, two SNPs rs4452860 and rs7028916 in TMEM38B indicated significant association with AAM in the same directions as reported in previous studies (P = 0.0013 with an effect size of 0.051) even after Bonferroni correction for the 33 SNPs. In addition, six loci in or near CCDC85A, LOC100421670, CA10, ZNF483, ARNTL, and RXRG exhibited suggestive association with AAM (P<0.05). Our findings elucidated the impact of genetic variations on AAM in the Japanese population.


Introduction
Age at menarche (AAM), the onset of the first menstrual period in girls, is considered as a landmark of female pubertal development. Menarche generally occurs after a series of complex neuroendocrine events leading to full activation of the hypothalamic-pituitary-gonadal axis [1]. Menarche is associated with physical, emotional, and social development [2]. In addition, AAM was shown to be associated with the risk of various diseases. Early AAM is reported to be one of the significant risk factors for depression [3], eating disorders [4], obesity [5], diabetes [6], breast cancer [7], and coronary heart disease [8]. On the other hand, late AAM has been associated with osteoporosis [9] and taller adult stature [10]. Therefore, the identification of loci contributing to variation in AAM could lead to a better understanding of a wide range of phenotypes.
AAM is known to be a complex trait determined by an array of genetic and environmental variables [11][12][13]. Twin and family studies suggest a significant genetic contribution to AAM with a heritability of more than 50% [12,14]. Several genetic variations within candidate genes such as the estrogen receptor genes (ESR1 and ESR2) [15,16], CYP19A1 [17], and the SHBG gene [18] were shown to be associated with AAM. To date, a number of genomewide linkage analyses [14,19,20] and genome-wide association studies (GWAS) [21][22][23][24][25] for genes underlying variation in AAM have been performed. In 2009, the association of genetic variations in LIN28B with AAM was identified by four independent groups. Currently, more than 30 loci have been shown to be significantly associated with AAM. However, most of these studies were conducted using women of European ancestry. Here we performed a large scale meta-analysis of GWAS using more than 15,000 Japanese female samples.

Results
A total of 15,495 Japanese female subjects from four GWAS using different SNP genotyping systems were enrolled in this analysis. Characteristics of samples and genotyping methods are summarized in Table 1. All the subjects were of Japanese origin and obtained from the Biobank Japan Project [26]. Samples consist of patients that were classified into 33 disease groups. The average and S.D. of AAM in each disease cohort is shown in Table S1. Some diseases such as breast cancer and osteoporosis are likely to be associated with early or late AAM, as reported previously [7,9]. We also found that AAM was negatively associated with birth year (p,0.0001). Thus we used disease status and birth year as covariates in this study. Genotyping was performed with over 500,000 SNP markers using Illumina HumanHap 550 Genotyping BeadChip, Illumina610-Quad Genotyping BeadChip, or Illumina Omni Express (Illumina, CA, USA). We applied stringent quality control criteria as mentioned in the methods section. We also conducted principal component analysis [27] to evaluate potential population stratification. To extend the genomic coverage and conduct meta-analysis, we subsequently performed a whole-genome imputation of the SNPs, using HapMap Phase II genotype data [28]. After the imputation, we performed SNP quality control (minor allele frequency$0.01 and an imputation score (Rsq value by MACH software [29])$0.7) and found that more than two million autosomal SNPs satisfied these criteria.
The associations of these imputed SNPs with AAM were evaluated using a linear regression model [30] and meta-analysis. Quantile-Quantile plots of P-values indicated the Inflation factors to be as low as 1.039 (Figure 1), suggesting no substantial population stratification existed in our study population. Although we could not identify significant association that satisfied the genome-wide significant threshold (P,561028), SNP rs364663 which is located within intron 2 of the LIN28B gene at 6q21 indicated the strongest association with a P-value of 5.49610 27 (Table 2, Figure 2). SNP rs364663 exhibited the association with AAM in all four cohorts without significant heterogeneity in both effect sizes and directions (P-value for heterogeneity = 0.29; Table 3), suggesting that the observed associations at LIN28B are not the result of false-positives due to study-specific bias. Regional p-value plots indicated that all of the AMM-associated SNPs were clustered around the LIN28B locus ( Figure 2). Previously reported SNPs rs314263 and rs7759938 near the LIN28B locus [21,24] also associated with AAM (P = 1.0361026 and 1.1261026, respectively) in the same direction. In addition to the 6q21 locus, 17 SNPs in seven genomic regions exhibited suggestive association with a p-value of,161025 ( Table 2). To further investigate the physiological role of these loci, the associations of these variations with body mass index (BMI) and height were evaluated using previously published results in 26,620 Japanese subjects [31]. SNP rs9404590 near the LIN28B locus associated with height with p-value of 0.0003 (Table S2), but we did not find significant association (P,0.01) between these loci and BMI (Table S2).
Since AAM is associated with various disease risks, we conducted separate analyses for each disease and then performed meta-analysis for the top 42 loci. As a result, some SNPs showed slightly stronger association, but none cleared the genome wide significant threshold (Table S3). Similar to the current study, our group had previously conducted QTL analyses using disease status as a covariate and successfully identified many QTL loci [31][32][33][34][35]. Therefore, different background due to disease status was unlikely to significantly affect the result of association analysis.
Additionally, we examined the loci already known to show significant association with AAM in women of European ancestry [22][23][24][25]36]. We selected 37 SNPs for this candidate analysis and successfully obtained the genotyping results of 33 SNPs (Table 4), and the risk allele was consistent with previous reports for 31 SNPs. In addition, eight SNPs in or near RXRG, CCDC85A, LOC100421670, TMEM38B, ZNF483, ARNTL, and CA10 indicated possible associations with AAM (P,0.05). Among them, rs4452860 and rs7028916 at the TMEM38B locus exhibited significant association even after Bonferroni's correction (P,0.0015 = 0.05/33). Taken together, these variations as well as LIN28B are likely to be common AAM loci, although their effect sizes are different between women of European ancestry and those of Japanese.

Discussion
AAM is a complex trait that is influenced by both genetic and environmental factors. In recent years, genome wide association analyses have become a standard method to identify genetic factors related with various diseases and phenotypes. In 2002 our group performed the first GWAS for myocardial infarction and successfully identified LTA as a disease susceptibility gene. Using this method, we have identified a number of loci associated with various phenotypes and common diseases [37][38][39][40][41][42].   Through the analysis of more than 15,000 Japanese female subjects from the Biobank Japan Project, we here report the association of LIN28B with AAM. In our analysis, SNP rs364663 within intron 2 of the LIN28B gene exhibited the strongest association. Previously reported SNPs rs314263, rs7759938, rs314280, and rs314276 near or in the LIN28B gene also showed equivalent association with p-values of 1.0321.5761026. Although the directions of associations were consistent between women of European ancestry and those of Japanese, the effect sizes among Japanese (0.085-0.087) were less than those among European ancestry (0.09-0.14). Thus LIN28B is a common AAM loci, but its effects are relatively small among the Japanese population.
The lin-28 gene was originally identified through C. elegans mutants showing abnormality in developmental timing [43]. Deleterious mutations in lin-28 produce an abnormal rapid tempo of development through larval stages to adult cuticle formation [43]. Two mammalian homologs, Lin28a and Lin28b, possess similar biochemical activities [44,45]. Transgenic mice expressing Lin28a exhibited increased body size, crown-rump length and delayed onset of puberty due to increased glucose metabolism and insulin sensitivity [46]. In humans, both LIN28B and LIN28A control preprocessing of the let7 microRNA family [45]. Lin28B is highly expressed in the majority of human hepatocellular carcinomas and embryonic stem cells [47], and regulate cell pluripotency [48] as well as cancer growth [47]. However the role of genetic variations in the LIN28B locus should be investigated in the future study.
Among the seven suggestive loci indentified in our GWAS, ZFHX4 and NKX2.1 loci were previously shown to be associated with height in the Caucasian population [49,50]. These loci also exhibited association with height among the Japanese population (p = 0.053 and 0.023, respectively). The NKX2.1 gene encodes a thyroid-specific transcription factor that was shown to be associated with hypothyroidism [51]. Since delayed pubertal development is a common manifestation of hypothyroidism, variation at NKX2.1 locus might affect AAM through the regulation of serum thyroid hormone level. The ZFHX4 gene encodes a homeodomain-zinc finger protein. ZFHX4 mRNA is highly expressed in brain, liver, and muscle, however the molecular mechanism whereby variations within this genetic locus affect AAM is not clear. In addition, KIRREL3 locus was shown to be associated with breast size in women of European ancestry [52]. The KIRREL3 gene encode a member of the nephrin-like protein family. KIRREL3 was highly expressed in brain and kidney, and shown to be associated with mental retardation autosomal dominant type 4 [53] and neurocognitive delay associated with Jacobsen Syndrome [54]. Taken together, these three loci seem to be common development traits in women of both European ancestry and Japanese.
In our candidate analysis, the association of SNPs rs4452860 and rs7028916 on 9q31.2 with AAM was also validated. This locus was repeatedly shown to be associated with AAM [25,55]as well as serum ALT (alanine transaminase) level [56]. SNPs rs4452860 and rs7028916 were located more than 300 kb away from the TMEM38B gene. In mice, TMEM38B is strongly expressed in brain, and TMEM38B deficient mice is neonatal lethal [57]. In humans, a homozygous mutation of TMEM38B was associated with autosomal recessive osteosgenesis imperfecta [58]. These findings also suggest an important role of TMEM38B in developmental processes.
In this study, we observed early AAM among breast cancer patients compared with overall samples (12.99 vs 13.44, Table  S1). On the other hand, patients with osteoporosis exhibited late AAM (14.33 y.o.). These findings are concordant with previous reports. Since we do not have age-matched controls, the association of these diseases and AAM should be examined using more appropriate sample sets.
Although more than 15,000 samples were employed in this study, no SNP cleared the genome wide significance threshold (p,5610 28 ). This could be explained by some limitations in our study. Firstly, the information about AAM was self-reported and might not be very accurate. Although this distinct event is often well recalled many years later [59], possible error due to ageassociated memory impairment would increase the risk of false negative association. Secondly, we did not have a replication sample set, and the total sample size may be inadequate to detect variations with modest effects. At present, we unfortunately do not have sufficient samples for replication, since more than 2,900 additional samples are necessary to obtain a genome wide significant association for the top SNP rs364663. Previous studies comprising up to 17,510 women detected only one or two genome-wide significant signals [22][23][24][25]. However, 30 significant loci were identified by using 87,802 subjects in the screening stage [21]. Although no SNPs reached the genome wide significance in our study, analyzing an increased number of subjects in a relatively younger generation would improve statistical power as well as data accuracy, and subsequently enable us to identify other genetic factors with modest effects.

Samples
The subjects enrolled in the GWAS meta-analysis for age at menarche (AAM) (n = 15,495) consisted of female patients that were classified into 33 disease groups; cancer (lung, esophagus, stomach, colon, liver, cholangiocarcinoma, pancreas, breast, uterine cervix, uterine body, ovary, hematopoietic organ), diabetes, myocardial infarction, brain infarction, arteriosclerosis, arrhythmia, drug eruption, liver cirrhosis, amyotrophic lateral sclerosis, osteoporosis, fibroid, and drug response, rheumatoid arthritis. chronic hepatitis B, pulmonary tuberculosis, keloid, heat cramp, brain aneurysm, chronic obstractive lung disease, glaucoma, and endometriosis (Tables 1 and 2). All subjects were collected under the support of the BioBank Japan Project [26], in which the individuals with any one of the 47 common diseases were enrolled between 2003 and 2008. Subjects under 17 years of age were not included in this study. Various life style information such as AAM was obtained through a face-to-face interview by trained medical coordinators using questionnaire sheets at each hospital. All participants provided written informed consent as approved by the ethical committees of each institute. For participants between the age of 17 and 20 years old, we obtained written informed consent from both the participant and her parents. AAM was collected by self-report on the questionnaire. The subjects with AAM between the ages of 10 and 17 were enrolled for the study. This project was approved by the ethical committees of the BioBank Japan Project [26] and the University of Tokyo.

Genotyping and quality control
In the four GWAS enrolled in the meta-analysis of AAM, all samples were genotyped at more than 500,000 loci using one of the following platforms: Illumina HumanHap550v3 Genotyping BeadChip, Illumina HumanHap610-Quad Genotyping Bead-Chip, or Illumina Omni express Genotyping BeadChip ( Table 1). Then we applied the following quality control for each GWAS separately: exclusion criteria; subjects with call rates,0.98, SNPs with call rates,0.99 or with ambiguous clustering of the intensity plots, or non-autosomal SNPs. We then excluded subjects whose ancestries were estimated to be distinct from East-Asian populations using principle component analysis performed by EIGENSTRAT version 2.0 [27]. Subsequently, the SNPs with MAF,0.01 or P-value of the Hardy-Weinberg equilibrium test,1.0610 27 were excluded.
After the quality control criteria mentioned above were applied, genotype imputation was performed by MACH 1.0.16 [29] using the genotype data of Phase II HapMap JPT and CHB individuals (release 24) [28] as references, in a two-step procedure as described elsewhere [60]. In the first step of the imputation, recombination and error rate maps were estimated using 500 subjects randomly selected from the GWAS data. In the second step, imputation of the genotypes of all subjects was performed using the rate maps estimated in the first step. Quality control filters of MAF$0.01 and Rsq values$0.7 were applied for the imputed SNPs.

Statistical analysis
Associations of the SNPs with AAM were assessed by linear regression assuming the additive effects of the allele dosages on AAM, using mach2qtl software [29]. Years of birth and affection statuses of the diseases were used as covariates. Meta-analysis of all four GWAS was performed using an inverse-variance method from the summary statistics of beta and standard error (SE), using the Java source code implemented by the authors [61]. Genomic control correction was applied for each GWAS separately, and applied again for the results of GWAS meta-analysis. We set the Pvalues of 5.0610 28 and 0.0015 ( = 0.05/33, Bonferroni's correction based on the numbers of the evaluated loci) as significance thresholds in GWAS and candidate gene analysis, respectively. Heterogeneity of the effect sizes among the studies was evaluated using Cochran's Q statistics.

Body mass index and Height QTL analysis
The associations of 42 SNPs with BMI and height were evaluated using previously published results, in which a total of 26,620 subjects with 32 diseases from Biobank Japan were enrolled [31]. Of the 26,620 subjects, 12,350 were also included in this AAM study. Genotyping was performed using the Illumina HumanHap610-Quad Genotyping BeadChip. Genotype imputation was performed using MACH 1.0. BMI was calculated based on self-reported body weight and height data. A rank-based inverse-normal transformation was applied to the BMI values of the subjects. Associations of the SNPs with transformed values of BMI were assessed by linear regression assuming additive effects of allele dosages (bound between 0.0 and 2.0) using mach2qtl software using gender, age, smoking history, the affection statuses of the diseases and the demographic classifications of the medical institutes in Japan where the subjects were enrolled were used as covariates.

Supporting Information
Table S1 Age at menarche in each disease cohort. (DOCX)