Research Article

Replication of the principal component analyses of the human genome diversity panel

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 15 Mar 2017

This article is included in the Preclinical Reproducibility and Robustness gateway.

Abstract

Background. In 2008, Li et al. published several principal component analyses (PCAs) applied to 660,918 single-nucleotide polymorphisms (SNPs) from 938 individuals from 51 worldwide populations of the Human Genome Diversity Panel. PCAs were applied to subsets of individuals sharing a common geographic origin and showed that, in several geographic regions, genome-wide SNP variation grouped individuals by population in the first two principal components. In this study, we replicated the PCAs applied to two geographic subsets: first, individuals from Europe, and second, individuals from the Middle East & North Africa.

Methods. Quality control, feature selection, and PCA were applied to each geographic subset. The results were displayed on the first two principal components and compared to the original figures.

Results. The replicated figures closely matched the original figures.

Conclusions. The main results were therefore replicated and can be independently reproduced by using publicly available data, source code, and computing environment.

Keywords

Bioinformatics, Evolutionary/Comparative Genetics, Genomics

Introduction

Quartz Bio and the Stochastic Information Processing group are involved in the PRECISESADS project (http://www.precisesads.eu/), which aims at reclassifying Systemic Autoimmune Diseases (SADs), a group of chronic inflammatory conditions characterized by the presence of unspecific autoantibodies in the serum and resulting in serious clinical consequences, based on genetic and molecular biomarkers rather than clinical criteria.

In order to use genetic similarities to deliver personalized treatments to patients affected by SADs as well as other diseases, it is important to first understand the genetic structures in healthy populations.

In 2008, Li et al.1 showed that although specific world regions have different genetic origins, all revealed population structures in principal component analyses (PCAs). Similar population structures were also observed in studies using other genome-wide variations datasets2,3.

Li et al. applied PCAs to subsets of individuals from two geographic regions, Europe and the Middle East & North Africa, and displayed the results on the first two principal components in their article as Figures 2A and 2B, respectively (the latter labeled only Middle East).

In an attempt to replicate these two figures, we performed quality control, minor allele frequency filtering, tag SNP selection4, and PCAs on both regional subsets of the SNP microarray data. The PCAs were then displayed on the first two principal components.

The replicated figures closely matched the original figures, which confirmed a successful replication.

Methods

Genotype data

The dataset consisted of two files: a zip file including the genotype data of 660,918 SNPs from 1,043 individuals with the annotations of the SNPs, and a text file composed of the annotations of 953 individuals (see Data and software availability).

The annotations of individuals were used to create two subsets of the data. The first contained 157 individuals from Europe and the second contained 163 individuals from the Middle East & North Africa.

Analysis sets

For each geographic subset of the data, we verified that no individual had a missing value rate above 3% and excluded SNPs with missing value rates above 1%. An additive genetic model was then used to encode each A/B SNP (A/A = 0, A/B = 1, B/B = 2). This converts categorical SNP values to numeric values by assuming that the effects of the A/B heterozygote and the B/B homozygote are proportional to the number of B alleles. SNPs with a minor allele frequency below 5% were excluded to remove rare variants, which are more prone to genotyping errors. In addition, to decrease computation time and memory usage, redundant SNPs were removed by applying TagSNP4 (r² > 0.8, window of 500,000 base pairs). Missing values were imputed by random sampling within each SNP. Each SNP was then centered and scaled to unit variance. All steps were performed using the SNPClust R package (v1.0.0)2.
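The filtering and encoding steps above can be illustrated with a minimal Python sketch. It is not the SNPClust implementation: the function names, the genotype strings ("AA", "AB", "BB", anything else treated as missing), and the use of the population standard deviation for scaling are assumptions made for illustration, and the TagSNP pruning step is left out. Imputation is assumed here to mean drawing from the observed calls of the same SNP.

```python
import random
from statistics import mean, pstdev

# Assumed encoding table for an A/B SNP under the additive model.
ADDITIVE = {"AA": 0, "AB": 1, "BA": 1, "BB": 2}

def encode(genotype):
    """Return the number of B alleles, or None for a missing call."""
    return ADDITIVE.get(genotype)

def missing_rate(values):
    """Fraction of missing (None) calls in a list of encoded genotypes."""
    return sum(v is None for v in values) / len(values)

def individual_missing_rates(snps):
    """Missing rate per individual; snps maps SNP id -> genotype strings."""
    columns = zip(*snps.values())  # one tuple of genotypes per individual
    return [missing_rate([encode(g) for g in col]) for col in columns]

def minor_allele_frequency(values):
    """Frequency of the less common allele among non-missing calls."""
    called = [v for v in values if v is not None]
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)

def impute_by_sampling(values, rng):
    """Replace missing calls with random draws from the observed calls."""
    called = [v for v in values if v is not None]
    return [v if v is not None else rng.choice(called) for v in values]

def prepare(snps, max_snp_missing=0.01, min_maf=0.05, seed=0):
    """Filter, encode, impute, and standardize each SNP."""
    rng = random.Random(seed)
    prepared = {}
    for snp_id, genotypes in snps.items():
        values = [encode(g) for g in genotypes]
        if missing_rate(values) > max_snp_missing:
            continue  # too many missing calls for this SNP
        if minor_allele_frequency(values) < min_maf:
            continue  # rare variant
        imputed = impute_by_sampling(values, rng)
        mu, sd = mean(imputed), pstdev(imputed)
        if sd == 0:
            continue  # constant SNP cannot be scaled to unit variance
        prepared[snp_id] = [(v - mu) / sd for v in imputed]
    return prepared
```

Calling `prepare` on a dictionary of SNPs returns only the SNPs that pass both filters, each with mean zero and unit variance. The individual-level check (no individual above 3% missing) can be done beforehand with `individual_missing_rates`.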

For the Europe subset, a total of 375,164 SNPs from 157 individuals were selected for analysis. This defines our Europe analysis set.

For the Middle East & North Africa subset, a total of 412,979 SNPs from 163 individuals were selected for analysis. This defines our Middle East & North Africa analysis set.

For comparison, the supporting online material of Li et al. reported that individuals with missing value rates above 2.5% and SNPs with missing value rates above 5% were excluded. Table S1 of Li et al. reports that 156 individuals from Europe and 160 from the Middle East & North Africa were used and the supporting online material reports that 642,690 SNPs were used.

Principal component analyses

PCAs were applied to the two analysis sets and displayed using the SNPClust R package (v1.0.0)2. Principal component analysis (PCA) is a dimensionality reduction method that builds successive axes, the principal components, as linear combinations of the SNPs. Each axis maximizes the remaining variance while being constrained to be orthogonal to the previous axes.
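As a rough illustration of this definition, the sketch below computes the coordinates of individuals on the first principal components and the fraction of variance each one explains. It is an assumption-laden toy version, not the SNPClust code: it uses only the Python standard library, works on the individuals-by-individuals Gram matrix (convenient when SNPs far outnumber individuals), and extracts components by power iteration with deflation. All function names are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract the column mean from each column (rows = individuals)."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def gram_matrix(rows):
    """Matrix of dot products between all pairs of individuals."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]

def top_eigenpair(matrix, iterations=500):
    """Largest eigenvalue and eigenvector of a symmetric PSD matrix."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary start
    for _ in range(iterations):
        nxt = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    # Rayleigh quotient gives the eigenvalue for the unit vector found.
    value = sum(v * sum(m * w for m, w in zip(row, vec))
                for v, row in zip(vec, matrix))
    return value, vec

def pca_scores(rows, n_components=2):
    """Return (scores per component, explained variance fractions)."""
    gram = gram_matrix(center_columns(rows))
    total = sum(gram[i][i] for i in range(len(gram)))  # total variance
    scores, explained = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(gram)
        scores.append([math.sqrt(max(value, 0.0)) * v for v in vec])
        explained.append(value / total)
        # Deflation: remove the component just found before the next one.
        gram = [[g - value * vi * vj for g, vj in zip(row, vec)]
                for row, vi in zip(gram, vec)]
    return scores, explained
```

With four individuals placed at (-2, 0), (2, 0), (0, -1), and (0, 1), the first component carries 80% of the variance and the second 20%, because the spread along the first coordinate is four times larger in squared terms. The sign of each component is arbitrary.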

The supporting online material of Li et al. reports that they first computed the Identity-by-State (IBS) matrix among the 938 individuals by using PLINK (version not provided)5 and then performed PCAs on the IBS matrix for each region separately. In this study, PCAs were applied on the analysis sets and not on IBS matrices.
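To make the difference concrete, here is a minimal sketch of an IBS similarity matrix computed from additive genotype codes. It uses the common allele-sharing definition (two individuals share 2 minus the absolute difference of their B-allele counts at each SNP) and is not claimed to reproduce PLINK's exact computation; the function names and the handling of missing calls are assumptions.

```python
def ibs_similarity(geno_a, geno_b):
    """Mean proportion of alleles shared identical-by-state.

    Genotypes are B-allele counts (0, 1, 2) or None when missing.
    SNPs missing in either individual are skipped.
    """
    shared, compared = 0, 0
    for a, b in zip(geno_a, geno_b):
        if a is None or b is None:
            continue
        shared += 2 - abs(a - b)  # 2, 1, or 0 alleles shared at this SNP
        compared += 1
    if compared == 0:
        raise ValueError("no SNP genotyped in both individuals")
    return shared / (2 * compared)

def ibs_matrix(genotypes):
    """Symmetric matrix of pairwise IBS similarities between individuals."""
    n = len(genotypes)
    matrix = [[1.0] * n for _ in range(n)]  # an individual matches itself
    for i in range(n):
        for j in range(i + 1, n):
            sim = ibs_similarity(genotypes[i], genotypes[j])
            matrix[i][j] = matrix[j][i] = sim
    return matrix
```

A PCA on such a matrix operates on pairwise similarities between individuals, whereas the PCAs in this study operate directly on the standardized SNP values.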

Results

PCA of the Europe analysis set

The PCA of the Europe analysis set was displayed on the first two principal components (Figure 1). Individuals were grouped by population and the replicated figure closely matched Li et al.'s Figure 2A.


Figure 1. First two principal components of the Europe analysis set.

Visualization of the principal component analysis on 375,164 SNPs from 157 individuals from Europe. Individuals from North and South were differentiated in the first principal component and located in the lower and upper sides, respectively. Individuals from East and West were differentiated in the second principal component and located in the right and left sides, respectively.

The explained variance was almost identical: the replication gave 2.1% for PC1 and 1.6% for PC2, while Li et al.'s Figure 2A reported 2.4% and 1.6%, respectively.


Figure 2. First two principal components of the Middle East & North Africa analysis set.

Visualization of the principal component analysis on 412,979 SNPs from 163 individuals from the Middle East & North Africa. Individuals from East and West were differentiated in the first principal component and located in the right and left sides, respectively. Individuals from North and South were differentiated in the second principal component and located in the lower and upper sides, respectively.

PCA of the Middle East & North Africa analysis set

The PCA of the Middle East & North Africa analysis set was displayed on the first two principal components (Figure 2). Individuals were grouped by population and the replicated figure closely matched Li et al.'s Figure 2B.

Two differences from Li et al.'s analysis were noted. First, the Bedouin and Druze populations exhibited a larger spread on PC1 in the original figure than in the replication. Second, one Bedouin individual was located with the Mozabite individuals in the replication, which was not the case in Li et al.'s Figure 2B.

The explained variance was smaller in the replication: 3.1% for PC1 and 2.2% for PC2, while Li et al.'s Figure 2B reported 5.0% and 2.6%, respectively.

Discussion

The replicated figures closely matched the original figures, although two differences appeared in the Middle East & North Africa subset: a smaller spread of two populations on PC1 and the presence of an outlier.

Therefore, the main results were replicated and can be independently reproduced by using publicly available data, source code, and computing environment.

We successfully confirmed that although the two geographic regions studied had different genetic origins, both exhibited population structures in PCAs.

Understanding the genetic structure of healthy populations will enable us to use genetic similarities to deliver personalized treatments to patients affected by SADs. Using this replication, the PRECISESADS project will be able to compare clusters of patients affected by SADs to clusters of healthy individuals, independently from their ancestry-driven genetic structure2.

Data and software availability

As stated in Li et al.1, the data sets are freely available online. Although the links that were provided are now outdated, the two data files are available from HGDP-CEPH: http://www.hagsc.org/hgdp/files.html (download link: http://www.hagsc.org/hgdp/data/hgdp.zip and http://www.cephb.fr/en/hgdp_panel.php#serie2; ftp link: ftp://ftp.cephb.fr/hgdp_v3/hgdp-ceph-unrelated.out).

The PCAs were computed and displayed using the previously published R package SNPClust (v1.0.0)2.

The computing environment is available as a Docker container from: https://hub.docker.com/r/thomaschln/reproducible-hgdp

The source code required to generate this article, together with the definition of the corresponding computing environment in which all required software is installed, is available from: https://github.com/ThomasChln/reproducible-hgdp

Archived source code as at time of publication: doi, 10.5281/zenodo.3451376

License: GNU General Public License version 3.0

Ethical statement

The data were previously published1 and approved by ethics committees. No samples were used and records were de-identified.

how to cite this article
Charlon T, Di Cara A, Voloshynovskiy S and Wojcik J. Replication of the principal component analyses of the human genome diversity panel [version 1; peer review: 1 approved, 1 approved with reservations] F1000Research 2017, 6:278 (https://doi.org/10.12688/f1000research.11055.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Key to reviewer statuses:
Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.
Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.
Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.
Reviewer Report 18 Apr 2017
Michael G. B. Blum, TIMC-IMAG laboratory (Techniques for biomedical engineering and complexity management – informatics, mathematics and applications – Grenoble), Grenoble Alpes University, Grenoble, France 
Approved with Reservations
The authors replicate the ascertainment of worldwide population structure obtained by Li et al. (2008). They perform PCA to capture population structure. The PC axes closely match the ones obtained by Li et al.

However, the authors ...
HOW TO CITE THIS REPORT
Blum MGB. Reviewer Report For: Replication of the principal component analyses of the human genome diversity panel [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2017, 6:278 (https://doi.org/10.5256/f1000research.11923.r21151)
Reviewer Report 28 Mar 2017
Zoltán Kutalik, Department of Computational Biology, University of Lausanne, Lausanne, Switzerland 
Approved
This manuscript reports on the re-running of two PCA analyses presented in an earlier publication (Li et al. 2008). The authors confirm the PCA results presented in the original paper and point out two minor differences.

The ...
HOW TO CITE THIS REPORT
Kutalik Z. Reviewer Report For: Replication of the principal component analyses of the human genome diversity panel [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2017, 6:278 (https://doi.org/10.5256/f1000research.11923.r21333)
