RAvariome: a genetic risk variants database for rheumatoid arthritis based on assessment of reproducibility between or within human populations

Rheumatoid arthritis (RA) is a common autoimmune inflammatory disease of the joints and is caused by both genetic and environmental factors. In the past six years, genome-wide association studies (GWASs) have identified many risk variants associated with RA. However, not all associations reported from GWASs are reproduced when tested in follow-up studies. To establish a reliable set of RA risk variants, we systematically classified common variants identified in GWASs by the degree of reproducibility among independent studies. We collected genetic associations comprehensively from 90 papers reporting GWASs and meta-analyses. The genetic variants were assessed according to statistical significance and reproducibility between or within nine geographical populations. As a result, 82 and 19 single nucleotide polymorphisms (SNPs) were confirmed as intra- and inter-population-reproduced variants, respectively. Interestingly, the majority of the intra-population-reproduced variants from European and East Asian populations were not shared between the two populations, but their nearby genes appeared to be components of common pathways. Furthermore, a tool to predict an individual's genetic risk of RA was developed to facilitate personalized medicine and preventive health care. For further clinical research, the list of reliable genetic variants of RA and the genetic risk prediction tool are provided through the open access database RAvariome. Database URL: http://hinv.jp/hinv/rav/


Introduction
Rheumatoid arthritis (RA; MIM180300) is a common autoimmune disease characterized by the chronic inflammation of the bones and joints. Several epidemiological studies reported that RA prevalence varies among different populations (1). In North America and Northern Europe, the estimated prevalence of RA is 0.5-1.1%, but in Southern Europe, a lower prevalence of 0.3-0.7% has been reported. In East Asia, RA prevalence in the urban areas of Japan and Taiwan is 1.04% and 0.93%, respectively, but that in mainland China ranges from 0.2% to 0.37% (2,3). Twin studies on RA have led to a heritability estimate for RA of 65% in the Finnish study and 53% in the UK study, and genetic factors account for an estimated 60% of the disease risk (4).
As the first human genetic variation database for RA, RAvariome aims to provide a reliable set of RA risk variants that have been systematically assessed according to their reproducibility. Candidate gene association studies and GWASs are known to be vulnerable to a range of errors and biases, especially those arising from differences in experimental and study designs (10). These problems are reflected in equivocal or inconsistent results and may lead researchers to design inappropriate follow-up studies or medical applications. Accordingly, we collected association studies comprehensively, classified the studies by the ethnicity of the subjects, re-evaluated the associations against a unified significance level and assessed the associations by population-based reproducibility. All data are publicly available in a regularly updated web database, RAvariome. Additionally, an online tool for predicting the genetic risk of RA for an individual was developed to support further analysis for preventive intervention in carriers of genetic RA risk.

Collection and Extraction of Data
We collected 153 English-language publications from the NHGRI catalog, HuGE Navigator and the automatic paper recommendation system PubMedScan (http://medals.jp/pubmedscan/) (Figure 1). Manual screening of their abstracts then excluded articles about other autoimmune diseases, pharmacogenomics and gene-environment studies. After filtering, 90 publications on GWASs, fine mapping studies and meta-analyses were kept for further reading. Association results, both statistically significant and not significant, were extracted manually from the full text, tables or supplementary data, together with the ethnicity of the subjects, the country where the subjects were recruited, the total number of cases and controls, the analysis platform and the study design. The association results covered SNPs, HLA alleles, copy number variations and variable number of tandem repeats markers. Finally, 7730 association results were stored in a database.
The results of statistical tests in the original publications were re-evaluated according to the following criteria. We defined two sets of significance levels: one for GWASs, which assay genome-wide SNPs, and one for follow-up studies, such as meta-analyses, which assay only a few SNPs of interest. For a GWAS, a P-value < 5.0 × 10⁻⁸ was judged as significant evidence of strong association, and a P-value between 5.0 × 10⁻⁸ and 1.0 × 10⁻⁵ was judged as significant evidence of moderate association. Associations with P > 1.0 × 10⁻⁵ were judged as not significant. For a follow-up study, associations were judged as 'strong' for P < 0.01, 'moderate' for a P-value between 0.01 and 0.05 and 'not significant' for P > 0.05. As an exception, a meta-analysis of GWAS results and a combined analysis of GWAS and replication results were judged by the genome-wide significance threshold. According to these significance levels, 5970 associations were classified as statistically not significant, and 1760 associations were classified as either strong or moderate associations.
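The two-tier criteria above can be sketched as a small classifier. This is a minimal illustration, not the database's actual code: the function name `classify_association` and the `genome_wide` flag are hypothetical, and the handling of P-values that fall exactly on a threshold is an assumption, because the text does not specify it.

```python
# Thresholds taken from the criteria described in the text.
GWAS_STRONG = 5.0e-8
GWAS_MODERATE = 1.0e-5
FOLLOWUP_STRONG = 0.01
FOLLOWUP_MODERATE = 0.05


def classify_association(p_value: float, genome_wide: bool) -> str:
    """Return 'strong', 'moderate' or 'not significant' for one association.

    genome_wide=True applies the GWAS thresholds (also used for meta-analyses
    of GWAS results and combined GWAS + replication analyses); False applies
    the follow-up thresholds.
    Assumption: a P-value exactly equal to the upper threshold counts as
    'moderate'; the source text leaves boundary cases open.
    """
    if genome_wide:
        strong, moderate = GWAS_STRONG, GWAS_MODERATE
    else:
        strong, moderate = FOLLOWUP_STRONG, FOLLOWUP_MODERATE

    if p_value < strong:
        return "strong"
    if p_value <= moderate:
        return "moderate"
    return "not significant"
```

For instance, P = 1.0 × 10⁻⁴ would be 'not significant' in a GWAS but 'strong' in a follow-up study of selected SNPs, which is why the study type has to be recorded alongside each result.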

Confirmation of Reproducibility of RA Risk Variants
To confirm the reproducibility of genetic associations on a geographical basis, all associations, including nonsignificant results, were grouped into the following nine populations: European (including European American, European Australian, European New Zealander, West European, North European, South European and East European), East Asian (including South-East Asian), West Asian, South Asian, South American, Central American, North African, South African and African American. Because many follow-up studies of previous GWASs have been conducted, the representative association for each variant in each geographical population was taken from the study with the largest number of case subjects. Out of 6740 representative associations, 1296 were statistically significant and were retained for further reproducibility assessment. Notably, 34 variants reached the significance level in small-scale studies but were not confirmed by larger studies.
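The selection of representative associations can be sketched as follows. The `Association` record and the function name are hypothetical, chosen only for illustration, and the tie-breaking rule (keep the first record seen when case counts are equal) is an assumption not stated in the text.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class Association(NamedTuple):
    """One reported association (hypothetical record layout)."""
    variant: str      # e.g. a refSNP ID
    population: str   # one of the nine geographical populations
    n_cases: int      # number of case subjects in the study
    p_value: float


def select_representatives(
    associations: Iterable[Association],
) -> Dict[Tuple[str, str], Association]:
    """Keep, per (variant, population), the result from the largest study.

    'Largest' means the greatest number of case subjects, as in the text.
    Assumption: on a tie, the first record encountered is kept.
    """
    best: Dict[Tuple[str, str], Association] = {}
    for assoc in associations:
        key = (assoc.variant, assoc.population)
        if key not in best or assoc.n_cases > best[key].n_cases:
            best[key] = assoc
    return best
```

Under this rule, a variant that is significant in a small study can end up with a nonsignificant representative association when a larger study in the same population fails to confirm it.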
The genetic associations were classified into two classes based on reproducibility between or within populations. To assess the reproducibility of an association between different populations, we identified the variants that were reproduced by independent representative associations. A genetic variant that showed opposite directions of association in different populations was excluded. Accordingly, 40 representative associations of 19 variants located in 12 loci were confirmed as 'inter-population reproduced' based on independent studies of different populations such as African American, Central American, East Asian, European, South American, South Asian and West Asian (Table 1).
To assess the reproducibility of an association within a particular population, we identified the statistically significant representative association that was reproduced by an
Confirmed RA risk loci that were not mentioned in RA reviews (10, 24-27). Related genes, refSNP ID and allele, association P-value, study type used to set the significance level, OR, sample sizes of cases/controls, geographical population of samples and references are shown. Genotyping coverage indicates whether the study assayed genome-wide (GW) SNPs or selected SNPs.
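The inter-population rule can be illustrated with a short sketch. It assumes that each representative association carries an odds ratio reported for the same allele in every population, so that the direction of effect can be read from whether the OR is above or below 1. The tuple layout and function name are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# (variant, population, odds_ratio, is_significant): hypothetical layout
Representative = Tuple[str, str, float, bool]


def inter_population_reproduced(reps: Iterable[Representative]) -> Set[str]:
    """Return variants significant in at least two populations with one direction.

    Assumption: odds ratios refer to the same allele across populations, so
    OR > 1 in one population and OR < 1 in another means opposite directions,
    which excludes the variant.
    """
    hits: Dict[str, List[Tuple[str, float]]] = {}
    for variant, population, odds_ratio, significant in reps:
        if significant:
            hits.setdefault(variant, []).append((population, odds_ratio))

    reproduced: Set[str] = set()
    for variant, entries in hits.items():
        populations = {pop for pop, _ in entries}
        directions = {odds_ratio > 1.0 for _, odds_ratio in entries}
        if len(populations) >= 2 and len(directions) == 1:
            reproduced.add(variant)
    return reproduced
```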

Prediction of Genetic RA Risk
To facilitate personalized medicine and preventive health care, a tool to calculate the genetic risk of RA for an individual was developed based on confirmed RA risk variants (intra- and inter-population-reproduced variants). To avoid overestimation, if confirmed RA risk variants were located close to each other, i.e. in the same linkage disequilibrium block, only the genetic variant with the smallest P-value was used as the risk marker of that locus.
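The marker selection step can be sketched as below. The sketch assumes that each confirmed variant has already been assigned to a linkage disequilibrium block by some upstream procedure; how blocks are defined is not covered here, and the block identifiers and function name are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (variant_id, ld_block_id, p_value): hypothetical layout
ConfirmedVariant = Tuple[str, str, float]


def pick_risk_markers(variants: Iterable[ConfirmedVariant]) -> List[str]:
    """Keep one marker per LD block: the variant with the smallest P-value.

    Assumption: LD block membership is given as input; this function does
    not compute linkage disequilibrium itself.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for variant_id, block_id, p_value in variants:
        if block_id not in best or p_value < best[block_id][1]:
            best[block_id] = (variant_id, p_value)
    return sorted(variant_id for variant_id, _ in best.values())
```

Keeping a single marker per block avoids counting the same underlying association signal more than once when the per-marker risks are later combined.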
The genetic risk score (GRS) and the relative genetic risk (RGR) of RA for an individual were calculated from the combination of RA risk markers. As described in previous studies, an unweighted GRS, which simply counts the number of risk alleles carried by an individual, was not applicable to RA because HLA-DRB1 alleles had substantially higher odds ratios (ORs) than alleles at non-HLA loci (28-30). Accordingly, we used a weighted GRS that increases additively by the log-OR of the risk allele (31). GRS was calculated as follows:
$$\mathrm{GRS} = \sum_{i=1}^{n} X_i \ln(\mathrm{OR}_i)$$
where $n$ is the number of markers available, $\mathrm{OR}_i$ is the OR of the risk allele at marker $i$ and $X_i$ represents the copy number of the risk allele at marker $i$. If an individual carries 1 risk allele at the $i$th marker, $X_i$ is 1, and if an individual carries 2 risk alleles, then $X_i$ is 2.
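To calculate the RGR relative to the average population for each marker, an individual genotype-specific risk $s_i$ is estimated as:
$$s_i = r_i^{X_i}$$
where $r_i$ is a risk factor assumed to equal the allelic OR of the $i$th marker. The RGR of an individual marker is estimated as:
$$\mathrm{RGR}_i = \frac{s_i}{(1 - p_i)^2 + 2 p_i (1 - p_i)\, r_i + p_i^2 r_i^2}$$
Here $p_i$ is the risk allele frequency of the $i$th marker in the control group. Thus, the overall RGR of an individual estimated from multiple markers is calculated as:
$$\mathrm{RGR} = \prod_{i=1}^{n} \mathrm{RGR}_i$$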
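The two indices can be sketched in code under stated assumptions. The GRS follows the description in the text directly (sum of risk allele copies weighted by log-OR). For the RGR, the sketch assumes a multiplicative model: the genotype-specific risk is the allelic OR raised to the number of risk alleles, and the population average is taken over genotype frequencies in Hardy-Weinberg proportions at the control risk allele frequency. These modelling choices and all names are assumptions made for illustration.

```python
import math
from typing import Iterable, Tuple

# (allelic_odds_ratio, risk_allele_frequency_in_controls, risk_allele_copies)
Marker = Tuple[float, float, int]


def genetic_risk_score(markers: Iterable[Marker]) -> float:
    """Weighted GRS: sum over markers of copies * ln(OR)."""
    return sum(copies * math.log(odds_ratio) for odds_ratio, _, copies in markers)


def relative_genetic_risk(markers: Iterable[Marker]) -> float:
    """Overall RGR as the product of per-marker relative risks.

    Assumptions: genotype risk is OR ** copies (multiplicative model) and the
    population-average risk uses Hardy-Weinberg genotype frequencies.
    """
    rgr = 1.0
    for odds_ratio, freq, copies in markers:
        genotype_risk = odds_ratio ** copies
        population_mean = (
            (1.0 - freq) ** 2                         # no risk allele
            + 2.0 * freq * (1.0 - freq) * odds_ratio  # one risk allele
            + freq ** 2 * odds_ratio ** 2             # two risk alleles
        )
        rgr *= genotype_risk / population_mean
    return rgr
```

With a single hypothetical marker of OR 2.0 and control risk allele frequency 0.5, a heterozygous carrier has a GRS of ln 2 and an RGR of 2.0 / 2.25, slightly below the population average because the average itself is raised by homozygous carriers.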

RAvariome Web Interface and Usage
To provide our results and a genetic risk prediction tool to researchers and clinicians, a web database, RAvariome, was developed. A comparison of the number of publications and association data with other existing literature-based GWAS databases shows that RAvariome provides the most comprehensive dataset of RA associations (Table 2). RAvariome consists of four sections: an overview of RA, the list of confirmed RA variants, the collection of reported RA associations and the genetic risk prediction tool. In the overview section, the heritability, environmental factors and protective factors of RA are described.
The page entitled 'Confirmed RA variants' provides users with a list of genetic variants and their reproducibility class (Figure 2A). The list provides detailed study information on the representative association for a specific variant and a particular geographical population. In this section, users can search the confirmed RA risk variants by reproducibility class, geographical population of the subjects, gene name or refSNP ID.
The page entitled 'Reported RA associations' leads users to a list of all statistically significant and statistically not significant associations described in the literature. Users can search association data by study design, geographical population of the subjects, nationality of the subjects, number of subjects and sub-phenotype of cases, such as cases that are 100% seropositive for rheumatoid factor and/or anti-CCP.
The page entitled 'Genetic risk predictor' provides a tool for predicting the genetic risk of RA from the ethnicity and genotypes of an individual (Figure 2B). After the user chooses either the European or the East Asian population, the confirmed RA risk markers and their three common genotypes are displayed in a list. Users can select one genotype for each risk marker, and the selected genotypes are used to estimate RGR and GRS. RGR is an index that takes the risk of the average population as its baseline. GRS is an index that takes the non-risk allele carrier as its baseline and is used in basic research. For example, if genotypes AG for rs3761847, AA for rs2476601 and GA for rs6920220 are selected for the European population, the predicted lifetime risk is 2.6%, which is 2.6-fold higher than the prevalence of RA in the general European population (1%).

Discussion
RAvariome was developed to provide a list of RA-associated genetic variants with their degree of reproducibility, based on a comprehensive re-evaluation of the results of genetic association studies. By comparing our results with reviews of RA (10, 24-27), we found that two loci among the inter-population-reproduced variants were not reported in the reviews and have been validated in recent meta-analyses (Table 1). Thus, we conclude that our database provides the newest and most reliable genetic risk variants based on comparative verification.
As the intra-population-reproduced variants were observed only in European and East Asian populations, we used gene set enrichment analysis (32) to discover common functional mechanisms underlying the different gene sets confirmed specifically in European and East Asian populations. Among the inter- and intra-population-reproduced variants, 10 genes were confirmed in both European and East Asian populations, whereas 39 and 30 genes were confirmed only in the European and East Asian populations, respectively. Interestingly, the European-specific and East Asian-specific gene sets had the following gene sets in common: genes involved in the immune system, genes upregulated by CD40 signaling in Ramos cells, genes upregulated in TK6, WTK1 and NH32 cell lines in response to ionizing radiation, genes modulated in HeLa cells by TNF via the NFKB pathway, genes whose promoters are bound by FOXP3 based on a chromatin immunoprecipitation (ChIP)-chip analysis, genes involved in cytokine signaling in the immune system and genes downregulated in freshly isolated CD31− versus the CD31+ counterparts (Table 3). This result suggests that even though the same genes were not reproduced between European and East Asian populations, several pathways were common to both populations.
The purpose of the open access database RAvariome is to be a standard resource not only for RA researchers but also for RA clinicians and the general public. RAvariome is therefore designed to make it as simple as possible to obtain confirmed RA genetic risk variants for each geographical population. With ongoing progress in sequencing technology, the number of genetic studies of RA will continue to grow. RAvariome will be periodically updated in concert with progress in RA genetic research and will incorporate a new genetic risk prediction method (33). An open access resource would be valuable for raising the precision of clinical genetic tests and for developing effective RA prevention programs based on genetic and population differences among individuals.