Keywords
Bioinformatics, Evolutionary/Comparative Genetics, Genomics
This article is included in the Preclinical Reproducibility and Robustness gateway.
Quartz Bio and the Stochastic Information Processing group are involved in the PRECISESADS project (http://www.precisesads.eu/), which aims to reclassify Systemic Autoimmune Diseases (SADs) based on genetic and molecular biomarkers rather than clinical criteria. SADs are a group of chronic inflammatory conditions characterized by the presence of non-specific autoantibodies in the serum and resulting in serious clinical consequences.
In order to use genetic similarities to deliver personalized treatments to patients affected by SADs as well as other diseases, it is important to first understand the genetic structures in healthy populations.
In 2008, Li et al. [1] showed that although specific world regions have different genetic origins, all revealed population structures in principal component analyses (PCAs). Similar population structures were also observed in studies using other genome-wide variation datasets [2,3].
Li et al. applied PCAs to subsets of individuals from two geographic regions, Europe and the Middle East & North Africa, and displayed the results on the first two principal components in Figures 2A and 2B of their article, respectively (the latter labeled only Middle East).
In an attempt to replicate these two figures, we performed quality control, minor allele frequency filtering, tag SNP selection [4], and PCAs on both regional subsets of the SNP microarray data. The PCAs were then displayed on the first two principal components.
The replicated figures closely matched the original figures, confirming a successful replication.
The dataset consisted of two files: a zip file including the genotype data of 660,918 SNPs from 1,043 individuals with the annotations of the SNPs, and a text file composed of the annotations of 953 individuals (see Data and software availability).
The annotations of individuals were used to create two subsets of the data. The first contained 157 individuals from Europe and the second contained 163 individuals from the Middle East & North Africa.
For each geographic region subset of the data, we verified that no individuals had missing value rates above 3% and excluded SNPs with missing value rates above 1%. An additive genetic model was then used to encode each A/B SNP (A/A = 0, A/B = 1, B/B = 2), which converts categorical SNP values to numeric values by assuming that the effects of the A/B heterozygote and B/B homozygote are proportional to the number of B alleles. SNPs with minor allele frequency below 5% were excluded to remove rare variants, which are more prone to genotyping errors. In addition, to decrease the required computation time and memory usage, redundant SNPs were removed by applying TagSNP [4] (r² > 0.8, window of 500,000 base pairs). Missing values were imputed by random sampling within each SNP. Each SNP was then centered and scaled to unit variance. All steps were performed using the SNPClust R package v1.0.0 [2].
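The preprocessing steps described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the SNPClust implementation: the function names `encode_additive` and `preprocess`, the genotype call strings, the use of the sample standard deviation, and the choice to raise an error when an individual exceeds the missing value threshold are all assumptions made for the example. Tag SNP selection is omitted.

```python
import numpy as np

def encode_additive(calls):
    """Map genotype calls to B-allele counts (AA=0, AB=1, BB=2).

    Any other string (hypothetical missing-call marker) becomes NaN.
    """
    codes = {"AA": 0.0, "AB": 1.0, "BB": 2.0}
    return np.array([[codes.get(c, np.nan) for c in row] for row in calls])

def preprocess(geno, max_ind_missing=0.03, max_snp_missing=0.01,
               min_maf=0.05, seed=0):
    """geno: individuals x SNPs array of 0/1/2 with NaN for missing calls."""
    rng = np.random.default_rng(seed)
    missing = np.isnan(geno)

    # Check (rather than filter) the per-individual missing value rate.
    if (missing.mean(axis=1) > max_ind_missing).any():
        raise ValueError("an individual exceeds the missing value threshold")

    # Exclude SNPs with too many missing values.
    geno = geno[:, missing.mean(axis=0) <= max_snp_missing]

    # Minor allele frequency from the mean B-allele count of observed calls.
    b_freq = np.nanmean(geno, axis=0) / 2
    maf = np.minimum(b_freq, 1 - b_freq)
    geno = geno[:, maf >= min_maf]

    # Impute each missing call by sampling from that SNP's observed calls.
    for j in range(geno.shape[1]):
        col = geno[:, j]  # view: assignments below modify geno in place
        miss = np.isnan(col)
        if miss.any():
            col[miss] = rng.choice(col[~miss], size=miss.sum())

    # Center each SNP and scale it to unit (sample) variance.
    return (geno - geno.mean(axis=0)) / geno.std(axis=0, ddof=1)
```

With the default thresholds the function mirrors the cut-offs reported in the text (3%, 1%, 5%); on small toy inputs the missing value thresholds need to be relaxed, since a single missing call already exceeds 1%.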
For the Europe subset, a total of 375,164 SNPs from 157 individuals were selected for analysis. This defines our Europe analysis set.
For the Middle East & North Africa subset, a total of 412,979 SNPs from 163 individuals were selected for analysis. This defines our Middle East & North Africa analysis set.
For comparison, the supporting online material of Li et al. reported that individuals with missing value rates above 2.5% and SNPs with missing value rates above 5% were excluded. Table S1 of Li et al. reports that 156 individuals from Europe and 160 from the Middle East & North Africa were used and the supporting online material reports that 642,690 SNPs were used.
PCAs were applied to the two analysis sets and displayed using the SNPClust R package v1.0.0 [2]. Principal component analysis (PCA) is a dimensionality reduction method that combines SNPs linearly into successive axes, i.e. principal components, each maximizing the variance of the projected individuals while being constrained to be orthogonal to the previous axes.
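As an illustration of the method, PCA of a centered and scaled genotype matrix can be computed from a singular value decomposition. The sketch below uses NumPy and is not the SNPClust code; the function name `pca` and its return values are chosen for the example.

```python
import numpy as np

def pca(x, n_components=2):
    """PCA of x (individuals x SNPs), whose columns are already centered.

    Returns the individual scores, the fraction of variance explained by
    each retained component, and the component axes (SNP loadings).
    """
    # Thin SVD: x = u @ diag(s) @ vt. The rows of vt are orthonormal axes
    # ordered by decreasing variance of the projected individuals.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components], vt[:n_components]
```

Plotting the two columns of `scores` against each other gives the kind of display used in the figures, and the `explained` values correspond to the percentages reported for PC1 and PC2.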
The supporting online material of Li et al. reports that they first computed the Identity-by-State (IBS) matrix among the 938 individuals by using PLINK (version not provided) [5] and then performed PCAs on the IBS matrix for each region separately. In this study, PCAs were applied on the analysis sets and not on IBS matrices.
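For readers unfamiliar with IBS, the sketch below computes a pairwise IBS similarity matrix from additively encoded genotypes, using the common definition of the proportion of alleles shared identical-by-state. It is an assumption-laden illustration in NumPy, not a reproduction of the PLINK computation used by Li et al.: the function name `ibs_similarity` is hypothetical, missing genotypes are not handled, and how the original authors applied PCA to the resulting matrix is not specified here.

```python
import numpy as np

def ibs_similarity(geno):
    """Pairwise IBS similarity for geno (individuals x SNPs, values 0/1/2).

    At each SNP two individuals share 2 - |g_i - g_j| alleles
    identical-by-state; the similarity is the shared fraction over all SNPs.
    """
    geno = np.asarray(geno, dtype=float)
    n = geno.shape[0]
    sim = np.empty((n, n))
    for i in range(n):
        shared = 2 - np.abs(geno - geno[i])  # alleles shared with individual i
        sim[i] = shared.mean(axis=1) / 2
    return sim
```

The result is a symmetric individuals-by-individuals matrix with ones on the diagonal, in contrast to the individuals-by-SNPs matrix on which the PCAs of this study were computed.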
The PCA of the Europe analysis set was displayed on the first two principal components (Figure 1). Individuals were grouped by population and the replicated figure closely matched Li et al.'s Figure 2A.
The explained variance was similar: the replication showed 2.1% for PC1 and 1.6% for PC2, while Li et al.'s Figure 2A showed 2.4% and 1.6%, respectively.
The PCA of the Middle East & North Africa analysis set was displayed on the first two principal components (Figure 2). Individuals were grouped by population and the replicated figure closely matched Li et al.'s Figure 2B.
Two differences from Li et al.'s analysis were noted. First, the Bedouin and Druze populations exhibited a larger spread on PC1 in the original figure. Second, one Bedouin individual was located among the Mozabite individuals in the replication, which was not the case in Li et al.'s Figure 2B.
The explained variance was lower: the replication showed 3.1% for PC1 and 2.2% for PC2, while Li et al.'s Figure 2B showed 5.0% and 2.6%, respectively.
The replicated figures closely matched the original figures, although two differences appeared when examining the Middle East & North Africa subset: the smaller spread of two populations on PC1 and the presence of an outlier.
Therefore, the main results were replicated and can be independently reproduced by using publicly available data, source code, and computing environment.
We successfully confirmed that although the two geographic regions studied had different genetic origins, both exhibited population structures in PCAs.
Understanding the genetic structure of healthy populations will enable us to use genetic similarities to deliver personalized treatments to patients affected by SADs. Using this replication, the PRECISESADS project will be able to compare clusters of patients affected by SADs to clusters of healthy individuals, independently of their ancestry-driven genetic structure [2].
As stated in Li et al. [1], the data sets are freely available online. Although the links that were provided are now outdated, the two data files are available from HGDP-CEPH: http://www.hagsc.org/hgdp/files.html (download link: http://www.hagsc.org/hgdp/data/hgdp.zip and http://www.cephb.fr/en/hgdp_panel.php#serie2; ftp link: ftp://ftp.cephb.fr/hgdp_v3/hgdp-ceph-unrelated.out).
The PCAs were computed and displayed using the previously published R package SNPClust v1.0.0 [2].
Computing environment in a Docker container is available from: https://hub.docker.com/r/thomaschln/reproducible-hgdp
Source code required to generate this article and the definition of the corresponding computing environment, in which all required software is installed: https://github.com/ThomasChln/reproducible-hgdp
Archived source code as at time of publication: doi, 10.5281/zenodo.3451376
License: GNU General Public License version 3.0
The data were previously published [1] and approved by ethics committees. No samples were used and records were de-identified.
Conceptualization: JW SV; Formal analysis: TC; Funding acquisition: JW; Investigation: JW ADC; Methodology: TC JW; Project administration: JW; Software: TC; Supervision: JW SV; Validation: TC JW ADC; Visualization: TC; Writing - original draft: TC; Writing - review & editing: JW ADC SV.
Thomas Charlon, Alessandro Di Cara, and Jérôme Wojcik are employees of Quartz Bio S.A., Switzerland. The authors declare no competing interests related to this commercial affiliation. This does not alter the authors’ adherence to F1000Research policies on sharing data and materials.
Quartz Bio S.A. provided support in the form of salaries for Thomas Charlon, Alessandro Di Cara, and Jérôme Wojcik, but did not have any additional role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript. This work has received support from the EU/EFPIA/ Innovative Medicines Initiative Joint Undertaking PRECISESADS (grant no. 115565).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Is the work clearly and accurately presented and does it cite the current literature?
Partly
Is the study design appropriate and is the work technically sound?
Partly
Are sufficient details of methods and analysis provided to allow replication by others?
Yes
If applicable, is the statistical analysis and its interpretation appropriate?
Yes
Are all the source data underlying the results available to ensure full reproducibility?
Yes
Are the conclusions drawn adequately supported by the results?
Yes
Reviewer Expertise: Population genetics, biostatistics, bioinformatics
Competing Interests: No competing interests were disclosed.
Version 1 (15 Mar 17): reports from invited reviewers 1 and 2.