ABO allele-level frequency estimation based on population-scale genotyping by next generation sequencing

Background: The characterization of the ABO blood group status is vital for blood transfusion and solid organ transplantation. Several methods for the molecular characterization of the ABO gene, which encodes the alleles that give rise to the different ABO blood groups, have been described. However, the application of those methods has so far been restricted to selected samples and has not been extended to population-scale analysis.

Results: We describe a cost-effective method for high-throughput genotyping of the ABO system by next generation sequencing. Sample-specific barcodes and sequencing adaptors are introduced during PCR, rendering the products suitable for direct sequencing on Illumina MiSeq or HiSeq instruments. Complete sequence coverage of exons 6 and 7 enables molecular discrimination of the ABO subgroups and many alleles. The workflow was applied to genotype more than a million samples for ABO. We report the allele group frequencies calculated on a subset of more than 110,000 sampled individuals of German origin. Furthermore, we discuss the potential of the workflow for high-resolution genotyping, taking the observed allele group frequencies into account. Finally, sequence analysis revealed 287 distinct, previously undescribed alleles, the most abundant of which was identified in 174 samples.

Conclusions: The described workflow delivers high-resolution ABO genotyping at low cost, enabling population-scale molecular ABO characterization.

Electronic supplementary material: The online version of this article (doi:10.1186/s12864-016-2687-1) contains supplementary material, which is available to authorized users.


Background
ABO is the clinically most relevant blood group system in transfusion and transplantation medicine [1]. Using classical serological methods, donor/recipient pairs are routinely classified phenotypically into four major blood groups (A, B, O, and AB). Additional phenotypes with weak expression patterns are recognized and have been adopted for ABO subgroup classification [2].
To supplement serological typing, several medium- to high-throughput molecular typing methods have been developed for the glycosyltransferase-encoding ABO gene on human chromosome 9. These methods rely on a broad range of techniques such as restriction fragment length polymorphism (RFLP), sequence-specific primer (SSP) PCR, single-strand conformation polymorphism (SSCP) analysis, or DNA microarray hybridization [3][4][5][6]. Most molecular typing methods exclusively target exons 6 and 7, which code for the catalytic domain and comprise the majority of the coding sequence, and focus on single-nucleotide polymorphisms within these two exons.
While these methods generally suffice for clinical applications [7], they do not easily scale to the requirements of routine upfront ABO genotyping of large cohorts of blood donors [4]. Moreover, the currently most commonly used molecular typing methods are restricted to detecting the specific set of alleles included in the assay, so novel alleles are unlikely to be detected. To date, 367 ABO alleles are reported in the Blood Group Antigen Gene Mutation Database (BGMUT, http://www.ncbi.nlm.nih.gov/projects/gv/rbc). Of these, 95 alleles have been added to BGMUT since the database was last described in an academic publication four years ago [8]. As this trend is not likely to slow down in the short term, DNA sequencing of the entire exonic sequences should be employed for more accurate genotyping results. Several studies have discussed the application of next generation sequencing (NGS) for ABO blood group genotyping or demonstrated its potential [9][10][11][12]. However, so far it has not been applied at population scale to determine ABO allele frequencies.
Our lab has been applying NGS for high-throughput HLA genotyping based on direct sequencing of amplicons on the Illumina MiSeq and HiSeq platforms since the beginning of 2013 [13]. We provide this service to stem cell donor centers, mainly DKMS German Bone Marrow Donor Center and affiliated centers, to characterize newly registered potential stem cell donors. More than 2.5 million volunteers have been typed using our NGS approach since 2013. The implementation of the NGS-based workflow for HLA genotyping has reduced costs by more than 50 % compared to Sanger-based genotyping. Furthermore, this type of workflow allows additional genes of interest to be added to the donor genotyping profile at a minimal surcharge. Since donor-patient matching of the ABO status simplifies transplantation-related patient treatment procedures and may even improve outcome [14], we chose to extend the existing NGS-based HLA genotyping workflow to additionally provide ABO genotyping. Currently, the major bone marrow donor registries accept ABO data only at the blood group resolution level (A, B, AB, O). Therefore, we designed the assay to provide ABO blood group resolution at minimal cost. This resulted in the selection of exons 6 and 7 for sequencing, which, based on the currently known alleles, enable unambiguous determination of the ABO blood group status. To date, we have typed 1.69 million samples using this approach.
Here, we describe the workflow and analyze the level of resolution beyond the blood group level that may be obtained by an exon 6 and 7 restricted approach: 99.9 % of the samples can be resolved at the ABO allele or allele group level. We analyze a subset of 113,367 samples of German descent to report ABO genotype frequencies at the resolution level of allele groups. Furthermore, our workflow readily identifies so far unknown alleles and we describe 287 unique novel ABO variants in exon 6 and 7.

Workflow
We describe a high-throughput workflow for ABO genotyping based on direct PCR amplicon sequencing on Illumina MiSeq or HiSeq instruments as described for HLA typing in Lange et al. [13]. The main advantages of this workflow are simplicity and cost effectiveness. Hundreds of samples may be pooled immediately after the PCR reaction as the samples are tagged during PCR with a molecular barcode that is read during sequencing. This approach significantly reduces costs and hands-on time for all post-PCR processing steps. In addition, since adapters are incorporated during the PCR, only four straightforward steps are required to initiate sequencing: PCR cleanup, quantification, denaturation and dilution.
Amplicon length is restricted to the combined forward and reverse sequencing read length, currently 600 (MiSeq) or 500 (HiSeq) bases. We designed two assays targeting exons 6 and 7 of the ABO gene (Fig. 1, Additional file 1: Table S1). Since the length of exon 7 (686 to 692 bp) exceeds the amplicon size limit of our approach, ABO assay 1 includes two overlapping PCR reactions to cover exon 7 and one PCR reaction targeting exon 6 (Fig. 1). This delivers high resolution but prevents multiplexing the PCR in one reaction. ABO assay 2 is set up as a multiplex PCR reaction covering exon 6 and the central 507 bases of exon 7 (Fig. 1). The multiplexed ABO assay 2 was developed to simplify the automated hitpicking setup.
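The length constraint described above can be illustrated with a small calculation. The following Python sketch is not part of the published workflow; it only estimates how many overlapping amplicons are needed to tile a target of a given length. The function name and the 50-base overlap are hypothetical, and primer and flanking sequences are ignored for simplicity.

```python
import math


def amplicons_needed(target_len: int, max_amplicon_len: int, overlap: int) -> int:
    """Minimum number of overlapping amplicons needed to tile a target region.

    n amplicons of length L that overlap by o bases cover n * (L - o) + o bases.
    """
    if target_len <= max_amplicon_len:
        return 1
    step = max_amplicon_len - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than the maximum amplicon length")
    return math.ceil((target_len - overlap) / step)


EXON7_MAX_LEN = 692   # longest exon 7 length stated in the text
ASSUMED_OVERLAP = 50  # hypothetical overlap between neighbouring amplicons

# Maximum amplicon lengths as stated in the text: 600 (MiSeq) and 500 (HiSeq) bases.
miseq = amplicons_needed(EXON7_MAX_LEN, 600, ASSUMED_OVERLAP)
hiseq = amplicons_needed(EXON7_MAX_LEN, 500, ASSUMED_OVERLAP)
print(f"amplicons for exon 7: MiSeq={miseq}, HiSeq={hiseq}")
```

Under these assumptions exon 7 requires two amplicons on either platform, which is consistent with the two overlapping PCR reactions of ABO assay 1.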
The workflow has been applied routinely as part of the stem cell donor typing program since April 2014. As of March 2016, a total of 1.69 million samples have been successfully ABO genotyped. ABO assay 1 PCR is performed alongside assays for HLA, Rh, CCR5Δ32 and KIR genotyping using 48.48 or 192.24 Fluidigm Access Array chips or 384-well PCR plates. In addition, ABO assay 2 is applied on 384-well PCR plates to confirm results that do not meet the internal quality criteria or failed in the first round.
In our setting, sequencing yields on average more than 1000 reads per amplicon and sample. These reads are matched to the alleles listed in the BGMUT database [8] by our NGS genotyping software neXtype [13]. In contrast to many other molecular approaches, our sequencing assays are not limited to detecting a subset of the more frequent alleles but can detect the full spectrum of frequent and rare alleles as well as previously undescribed variants. Our restricted focus on exons 6 and 7, however, limits the obtainable resolution to alleles that differ in those exons. Seven alleles lacking exon 6 and 7 sequence information, four alleles lacking the major part of exon 7 sequence information, 19 alleles lacking phenotype information, and one allele of uncertain phenotype were excluded (Additional file 1: Table S2), reducing the data basis to 108 A, 68 B and 73 O alleles.
In contrast to Sanger sequencing, NGS delivers phased information: Every read carries information about the phasing between two or more heterozygous positions. However, in a short amplicon sequencing workflow, phasing information between the individual amplicons is lacking. Often there is only one possibility for joining amplicons that results in valid sequence combinations of e.g. exon 6 and 7. Sometimes, however, both sequences of one exon can be joined with both sequences of the other exon. In such cases, the true sequences or alleles cannot be determined without additional information. For instance, analysis of all ABO alleles and their sequences showed that samples with the genotype ABO*O.02.01/ABO*B1.01.01 cannot be distinguished from the genotype ABO*O.02.14/ABO*Ax.02.01 due to the lack of phasing information between exons 6 and 7 (Fig. 2). Given the high abundance of the ABO*O.02.01 and ABO*B1.01.01 alleles and the low likelihood of a genotype combining the two rare alleles ABO*O.02.14 and ABO*Ax.02.01, we decided to disregard the possibility of an ABO*O.02.14/ABO*Ax.02.01 combination. Likewise the genotype A1.01.02/O.02.14 is neglected in favor of the more likely B1.01.01/O.02.02 allele combination. However, those rare alleles are unquestionably identified in all other genotypes.

Validation
We validated our workflow by genotyping 468 samples with known serological blood group status. For 15 samples (3.2 %) typing failed, either because read counts in one of the amplicons were too low (8 samples), or because the allele groups identified in the 3 amplicons did not intersect and could therefore not be resolved unambiguously (7 samples). [...] On the other hand, the AO genotype obtained by sequencing was confirmed by an SSP assay. These conflicting results may partly be explained by the identified alleles: ABO*Ax.13.01.1 (sample 324) is expected to give rise to a very weak A phenotype [1,15]. In accordance with that, the serum assay failed to detect anti-A2 antibodies at room temperature. The serological status for the other sample (226), however, did not show any abnormalities. The molecular genotype group includes several weak [...]

This demonstrates the limitations of an exon 6/7 restricted genotyping assay. Such a restricted assay therefore cannot currently replace serological analysis. It does serve, though, as a cost-effective extension that adds detailed allelic information. Conceptually, adding the missing exons (and introns) is straightforward. Within the scope of our approach, however, the ABO genotypes are intended to serve as additional selection criteria in the search for HLA-matched unrelated hematopoietic stem cell donors. Furthermore, the ABO status provided by our screening approach has to be confirmed by standard methods during donor clearing before stem cell transplantation. Against this background, a deviation rate of 0.5 % from the serological status seems acceptable to us.

ABO allele frequencies
The primary purpose of this project was to supplement HLA genotyping with basic ABO blood group information at low cost. However, the data generated lends itself to an allele-level frequency analysis. We chose to analyze a subset of 113,367 samples from individuals of German descent processed from June 2014 to September 2014. Low DNA concentration was identified as the main source of error when working with PCR volumes in the sub-microliter range [16]. Therefore, the data set was restricted to samples with a DNA concentration of at least 20 ng/μl in order to minimize the risk of wrong assignments, which could distort the frequency calculations, particularly for low-frequency alleles.
As discussed above, two conceptual limitations interfere with full allele-level resolution for every sample: limited sequence coverage and missing phasing information. Despite the limited sequence coverage most observed alleles can be resolved to the third field (e.g. A1.02.01). However, three A, one B and five O genotypes include alleles spanning several subtypes (e.g. A1 and Aw, O.02 and O.67) (see Table 1 for the allele groups with ambiguities observed in the data set and Additional file 1: Table S3 for an exhaustive list of allele groups with ambiguities). Based on the remarks on the abundance of alleles and allele groups in "The Blood Group Antigen FactsBook" [17] we assume that most of those alternative alleles occur at low frequencies in the German population. However, given the large sample size, some of those alternative alleles may be present in the data set. This may result in a slight overestimation of the more abundant subgroups (e.g. A1, A2, B1) and an underestimation of the less frequent subgroups (e.g. Ax, Aw, Bw).
Depending on the alleles present in a particular sample, the missing phase information may create an additional level of ambiguity for genotyping. However, when analyzing a large data set retrospectively, a Bayesian probabilistic framework allows us to partially account for the uncertainty in the data. In short, the algorithm takes advantage of the fact that the frequencies can be determined for many samples without phasing problems. Based on the frequencies for these problem-free samples, the frequencies for the samples with phasing problems are estimated. In addition, we obtain an estimate of the level of uncertainty for each frequency estimate.
When applied to our data set of 113,367 genotyped samples this approach delivered frequency estimates for 82 alleles or allele groups ( Fig. 3 and Table 2) ranging from 0.002 % to 32 %. To our knowledge this is the first estimation of allelic ABO frequencies on such a large dataset.

High-resolution genotyping
We further explored the possibilities for high-resolution genotyping based on exon 6 and 7 sequencing. While an algorithm as discussed in the previous paragraph is limited to retrospective frequency estimations, these estimates may help to distinguish between alternative allele combinations based on their relative likelihood. Given a sufficient difference, the less likely allele combination may be ignored, accepting a minor increase in the error rate. In cases of allele combinations with similar likelihood, both allele combinations should be included in the genotype result, which reduces the resolution.
We analyzed all observed genotyping results with unresolved phasing with regard to their frequencies and the relative likelihoods of their allele combinations (Table 3). We identified 24 unique cases with diverse properties in our data set. The 15 least abundant cases have a cumulative frequency of 0.11 % and will affect only very few samples. The six most abundant cases sum up to a cumulative frequency of 32 %, demonstrating the prevalence of the issue. Likewise, the relative likelihoods of the two possible allele combinations range from close to 1 to several million. Two cases (cases 4 and 11, Table 3), as discussed above, would lead to a different ABO genotype (OB1 versus OAx/OA1). Disregarding the OAx/OA1 combination seems warranted without incurring an unacceptable error risk, at least for German samples: even though the OB1/OA1 ambiguity has an odds ratio of only 6,271, given the low frequency of that ambiguity this would theoretically result in one error in 23 million typed samples. Twelve of the remaining cases are irrelevant for determining the subgroups as the ambiguities affect only the next field (e.g. A2.01.01* versus A2.06.01, O.01.01 versus O.01.05). In most circumstances those differences will not be of interest. Two cases remain with a frequency above 0.1 % and an effect on subgroup results. For case 2, the OAx combination can be safely ignored, as its relative likelihood is smaller than 1/100,000. Case 3 remains problematic as it combines a high frequency of 4.5 % with a moderately low relative likelihood of one in 617. In most scenarios an error rate of roughly 1 in 10,000 samples (4.5 % divided by 617) is probably acceptable. Otherwise, OA2 versus OA1 cannot be resolved in 4.5 % of the samples. We conclude that the missing phasing information does not interfere with high-resolution genotyping for German samples, as for 99.9 % of the samples phasing could be resolved at least to the subgroup level if an error rate of 0.01 % (A1 versus A2, case 3) is acceptable.
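The error estimate for case 3 can be reproduced with a few lines of arithmetic. The Python sketch below follows the approximation used in the text (case frequency divided by the relative likelihood of the two allele combinations); the function name is hypothetical and the calculation is only an illustration, not part of the published analysis.

```python
def expected_error_rate(case_frequency: float, relative_likelihood: float) -> float:
    """Approximate share of all samples that are mistyped when the less likely
    allele combination of an ambiguity case is always ignored.

    Uses the approximation from the text: frequency of the case divided by the
    relative likelihood (odds) of the two allele combinations.
    """
    return case_frequency / relative_likelihood


# Case 3 as quoted in the text: frequency 4.5 %, relative likelihood 1 in 617.
case3 = expected_error_rate(0.045, 617)
one_in = round(1 / case3)
print(f"case 3 error rate: {case3:.2e} (about 1 in {one_in} samples)")
```

The result is on the order of 1 in 10,000 samples, i.e. about 0.01 %, as stated above.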

Novel alleles
An intrinsic advantage of sequencing approaches compared to other molecular methods is the ability to identify and characterize novel alleles. As of March 2016 we have genotyped 1,693,287 samples for ABO. We identified 20,190 samples (1.2 %) with indications of the presence of a novel allele. As the characterization of novel alleles was not the primary focus of the project, we attempted verification by replicate sequencing only for a subset of 4,375 samples (21.7 %). That subset was not systematically selected over the time course of the project. Rather, it reflects historic changes in analysis policies and automation capacities. For 815 samples (18.6 %) the replicate sequencing confirmed the presence of a novel allele with the identical, previously unreported sequence. A total of 287 unique novel allelic sequences were found (Additional file 1: Table S4). While 193 of [...] Table S4), an orthogonal technology with a completely different error profile. All PacBio resequencing results confirmed the original findings. This study demonstrates the potential of our workflow for the detection of novel alleles. More work is required before the identified novel sequences can be submitted to the BGMUT database. In particular, an approach that covers the whole gene and provides fully phased sequence information should be applied. Based on our experience with the submission of fully phased HLA genes, we intend to embark on this task soon.

Discussion
Here, we propose a cost-effective workflow for high-throughput ABO genotyping. Despite the restricted coverage of exons 6 and 7, the approach delivers allelic or allele group level resolution for 99.9 % of the samples.
While the frequency analysis based on 113,367 samples from individuals of German descent allowed us to propose ways to handle ambiguities originating from unresolved phasing, we lack such detailed frequency information for the alleles comprising the unresolvable allele groups (Table 1), whose sequences differ only in the regions not covered by our approach. Therefore, we cannot judge whether those alleles prevent ABO subgroup-level resolution, since they may be too frequent to disregard. This limitation is, however, shared with many published molecular approaches that a priori limit the data basis to the more frequent alleles that are readily distinguishable. Our workflow, in contrast, lends itself to extending the targeted region to the other exons. Such an extended workflow could deliver true allelic level ABO genotyping and resolve the frequencies of the less common alleles in the so far unresolved allele groups. Even such an extended workflow would still be very cost-effective. The main cost factors of our workflow, when applied at high throughput, are DNA isolation, PCR reactions (including target-specific and barcoding primer oligonucleotides) and sequencing. Costs for DNA isolation and PCR reactions depend largely on the chosen reagent providers. Current reagent costs for sequencing on an Illumina MiSeq (2x300 bp) are well below 1 € per 20,000 reads, which would deliver more than enough reads to cover the whole ABO gene. This assumes, however, that ABO genotyping is performed together with other targets and/or that the throughput is sufficiently high to utilize the full capacity of the instrument. In our setup, ABO genotyping is performed alongside genotyping of six HLA loci, CCR5Δ32, KIR and the Rh blood group.

Table note: Allele group identifiers ending with an asterisk combine alleles across subgroups.
Up to 4,800 samples are jointly analyzed on one rapid-run flow-cell on HiSeq 2500 instruments resulting in 60,000 reads per sample on average, at sequencing reagent costs of about 1 €. This underscores the cost effectiveness of the described workflow. Given the low sequencing costs, applying these strategies to an extended blood group panel seems feasible. The major challenge would lie in developing highly multiplexed efficient PCR assays targeting the genes of interest. While the workflow is slim compared to other sequencing approaches, the sequencing alone runs for two full days. Taken together, genotyping results can be obtained within four days. However, in a high-throughput optimized setting the turn-around-time would probably extend to two or three weeks.

Conclusions
The application of next generation sequencing to blood group genotyping has been proposed [9,10] and its feasibility demonstrated in proof-of-concept studies [18][19][20]. We report the application of NGS to ABO analysis and have successfully genotyped more than 1.5 million samples. Despite the restricted focus on exons 6 and 7, the data enabled us to report frequency data on 82 distinct alleles or allele groups. For most of the less abundant alleles this constitutes the first quantitative frequency estimation. While this approach can by no means substitute for serological ABO status analysis, it could serve as a cost-effective complement that reveals the molecular ABO genotype.

Samples, DNA isolation and quantification
Samples were provided by DKMS German Bone Marrow Donor Center and other donor centers for HLA and blood group typing between April 2014 and December 2015. DNA was isolated from 150 μl whole blood or a single buccal swab using the magnetic-bead-based "chemagic DNA Blood Kit special" or "chemagic DNA Buccal Swab kit special" (Perkin Elmer, Baesweiler, Germany), respectively. DNA was eluted in 100 μl elution buffer (10 mM Tris-HCl pH 8.0). DNA concentrations were measured by fluorescence (SYBR Green, Biozym, Hessisch Oldendorf, Germany) using the TECAN infinite 200Pro (Tecan, Männedorf, [...] [14].

We used a thermal profile of 50°C for 2 min, 70°C for 20 min, 95°C for 10 min, followed by 20 cycles at 95°C for 25 s, 60°C for 30 s and 72°C for 90 s and additional 15 cycles at 95°C for 25 s, 50°C for 30 s and 72°C for 90 s and a finishing step at 72°C for 5 min. PCR setup included the target-specific and the barcoding primers.

Alternatively, amplification was performed in 384-well plates with 2 μl template DNA, 1 μl 10x buffer mix without MgCl2 (Roche Fast Start Kit), 0.8 μl 25 mM MgCl2, 0.5 μl DMSO, 0.2 μl 10 mM dNTPs each (Roche Fast Start Kit), 0.1 μl Fast Start Taq Polymerase (5 U/μl) (Roche Fast Start Kit), 4.4 μl PCR grade water and 1 μl of target-specific primer mix. We used a thermal profile of 95°C for 4 min followed by 35 cycles at 95°C for 25 s, 57°C for 30 s and 72°C for 90 s, and a finishing step at 72°C for 5 min. Amplicons belonging to one sample were pooled with a CyBi-Well Vario system (Analytik Jena AG, Jena, Germany) and 2 μl transferred to a 9 μl pre-aliquoted PCR master mix including 1 μl 10x buffer mix without MgCl2 (Roche Fast Start Kit), 1 μl 25 mM MgCl2, 0.2 μl DMSO, 0.2 μl 10 mM dNTPs each (Roche Fast Start Kit), 0.1 μl Fast Start Taq Polymerase (5 U/μl) (Roche Fast Start Kit), 3.5 μl PCR grade water as well as 2 μl of barcode primers (2 μM equimolar mix of index 1 and index 2).
Barcoding PCR was performed with the following thermal profile: 95°C for 4 min, 10 cycles at 95°C for 25 s, 50°C for 30 s and 72°C for 90 s, and a finishing step at 72°C for 5 min.

Library preparation and sequencing
For sequencing library preparation we pooled 48 barcoded samples from a 48.48 Access Array, 2 x 96 barcoded samples from a 192.24 Access Array, or all barcoded amplicons from one 384-well plate, respectively.
Pooled PCR products were purified with SPRIselect Beads (BeckmanCoulter, Brea, USA) using a ratio of 0.7:1 beads to PCR product. Purified amplicon pools were diluted 1:4000 for quantification by qPCR. Pooling, purification and subsequent dilution for qPCR quantification were performed on Biomek 3000 or Biomek NX workstations (Beckman Coulter, Brea, USA). qPCR was performed on an ABI-StepOnePlus qPCR cycler (Thermo Fisher, Carlsbad, USA) using the Library Quant Illumina Kit (KAPA Biosystems, Boston, USA) with standards in a range from 0.2 fM to 20 pM.
The purified and quantified amplicon pools were mixed in equimolar amounts and prepared as recommended by Illumina (MiSeq Reagent Kit v2-Reagent Preparation Guide). Libraries were loaded at 12.5 to 18.5 pM onto MiSeq or HiSeq flow cells with 10 % PhiX spiked in. Paired-end sequencing was performed at 249, 251 or 260 (ABO assay 2) cycles.
Confirmatory sequencing of novel ABO alleles was performed using single molecule real-time (SMRT) sequencing on a Pacific Biosciences RS II instrument. ABO genes spanning exons 3 to 7 were amplified by long-range PCR. A barcoded sequencing library was prepared using the SMRTbell Template Prep Kit 1.0 (Part number 100-259-100) from Pacific Biosciences following standard protocols. Base calling and demultiplexing were performed using Pacific Biosciences' SMRT Portal. Sequence reads were mapped against ABO reference alleles (ABO*A1.01.01.1 for antigens A and O, ABO*B1.01.01.1 for antigen B) with BWA-MEM [21] using PacBio settings (-x pacbio) and subsequently visually inspected in IGV [22].

Genotyping
Next generation sequencing based ABO genotyping was implemented in the genotyping application neXtype using the same principles as described previously for HLA allele typing [13]. Briefly, neXtype utilizes a set of reference allele sequences against which query sequences are matched for each exon separately. If different alleles share an exonic sequence, a query sequence will match multiple target alleles per exon. We refer to these sets of matched target alleles as Exon Allele Groups (EAGs). A final ABO genotype assignment for a sample is obtained by intersecting the member alleles of all EAGs across exons 6, 7a and 7b. If only one (in a homozygous sample) or two (in a heterozygous sample) alleles are shared across EAGs across exons, the sample ABO genotype can be fully resolved. If multiple alleles share the same sequences in the amplified region, those alleles cannot be resolved unambiguously. We distinguish the following levels of resolution (low to high): blood group (A, B, O), subgroup (e.g. A1, Ax, B1, O1, O2), allele groups (set of alleles derived from at least two subgroups), second field allelic resolution (e.g. A1.01), third field allelic resolution (e.g. A1.01.02) and full allelic resolution (e.g. A1.01.01.1). Allele group identifiers and their constituent alleles are provided in Table 1.
An additional layer of ambiguity arises if, e.g., the two EAGs at one exon intersect each with two EAGs at another exon. In this case the phase between exons cannot be resolved and multiple solutions for the underlying genotype exist.
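The intersection of EAGs and the resulting phasing ambiguity can be sketched as follows. This Python snippet is not the neXtype implementation; it is a minimal illustration under simplifying assumptions. The allele names (X1, X2, Y1, Y2), the function names and the toy EAGs are hypothetical, and only two amplicons are used instead of the three (exons 6, 7a and 7b) of the actual assay.

```python
from itertools import combinations_with_replacement


def explains(a, b, exon_eags):
    """Check whether the allele pair (a, b) is compatible with the EAGs of one exon.

    exon_eags holds one set (homozygous exon) or two sets (heterozygous exon).
    """
    if len(exon_eags) == 1:
        return a in exon_eags[0] and b in exon_eags[0]
    first, second = exon_eags
    return (a in first and b in second) or (a in second and b in first)


def candidate_genotypes(eags_per_exon):
    """Return all allele pairs that are compatible with the EAGs of every exon."""
    alleles = sorted(set().union(*(eag for exon in eags_per_exon for eag in exon)))
    return [
        (a, b)
        for a, b in combinations_with_replacement(alleles, 2)
        if all(explains(a, b, exon) for exon in eags_per_exon)
    ]


# Hypothetical alleles: X1 and Y1 share their exon 6 sequence, as do X2 and Y2;
# X1 and Y2 share their exon 7 sequence, as do X2 and Y1.
ambiguous_sample = [
    [{"X1", "Y1"}, {"X2", "Y2"}],  # exon 6, heterozygous
    [{"X1", "Y2"}, {"X2", "Y1"}],  # exon 7, heterozygous
]
resolved_sample = [
    [{"X1", "Y1"}],                # exon 6, homozygous
    [{"X1", "Y2"}, {"X2", "Y1"}],  # exon 7, heterozygous
]

ambiguous = candidate_genotypes(ambiguous_sample)
resolved = candidate_genotypes(resolved_sample)
print(ambiguous)  # two genotypes remain: the phase between exons is unresolved
print(resolved)   # a single genotype remains
```

In the first toy sample both X1/X2 and Y1/Y2 explain the observed sequences, which mirrors the situation in which each EAG at one exon intersects with two EAGs at another exon.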

Allele group frequency estimation
For ABO allele group frequency estimation a subset of 124,206 donor samples typed between June 2014 and September 2014 for DKMS German Bone Marrow Donor Center was extracted from the total data set. Only samples with a DNA concentration ≥ 20 ng/μl were included. For each sample, genotyping was performed using neXtype as described above. 123,250 of the samples could be successfully genotyped. Therefore, the failure rate for these high quality samples was below 1 %. Based on self-declared ethnic origin, 113,367 of these donors were of German descent and therefore included in the final analysis. For 67 % of these samples (76,276) an unambiguous genotype or allele group genotype could be assigned, i.e. there was no phasing ambiguity across exons (see above). In 33 % of the cases (37,091 samples) multiple genotypes or allele group genotypes could be mapped to the particular EAG compositions. Both groups were included in further analyses.
To estimate allele frequencies, we mapped observed genotype frequencies for unambiguous genotype assignments to allele frequencies assuming standard Hardy-Weinberg proportions. For ambiguous genotype assignments, observed frequencies were mapped to the sum of the individual genotype probabilities that an ambiguous genotype was composed of. We used these mappings to formulate a Bayesian model for multinomial frequency estimation implemented in JAGS (version 3.4.0, http://mcmc-jags.sourceforge.net) [23]. Additional file 2 (ABO_model.bug) provides the model specifications in the BUGS language [24]. Additional file 3 (ABO_count.data) provides the raw count data for EAG groups in the order in which they appear in the model. Additional file 4 (AlleleGroupIdentifiers.csv) provides the mapping from EAG group codes to the allele group identifiers used in this paper. The mean and standard deviation of allele group frequency estimates were estimated from the posterior distribution generated by 10,000 MCMC iterations after a burn-in of 10,000 iterations. All calculations were performed in R (version 3.2.2) [25], using the package rjags (version 3-15, http://CRAN.R-project.org/package=rjags).
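The mapping from genotype counts to allele frequencies can be illustrated with a simplified stand-in for the Bayesian model. The Python sketch below is not the JAGS model provided in Additional file 2; it substitutes an expectation-maximization (EM) maximum-likelihood point estimate under Hardy-Weinberg proportions and yields no uncertainty estimates. The allele names, counts and function names are made up for illustration.

```python
from collections import defaultdict


def hw_prob(genotype, freq):
    """Hardy-Weinberg probability of an unordered genotype (a, b)."""
    a, b = genotype
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def em_allele_frequencies(observations, n_iter=200):
    """EM point estimate of allele frequencies.

    observations: list of (count, candidates); candidates lists the genotypes
    compatible with one observed result (one entry if unambiguous, several if
    the phase between exons is unresolved).
    """
    alleles = sorted({a for _, cands in observations for g in cands for a in g})
    freq = {a: 1 / len(alleles) for a in alleles}
    total_alleles = 2 * sum(count for count, _ in observations)
    for _ in range(n_iter):
        expected = defaultdict(float)
        for count, cands in observations:
            # E-step: split the count over candidates by their HW probability.
            weights = [hw_prob(g, freq) for g in cands]
            norm = sum(weights)
            for g, w in zip(cands, weights):
                for allele in g:
                    expected[allele] += count * w / norm
        # M-step: re-estimate frequencies from the expected allele counts.
        freq = {a: expected[a] / total_alleles for a in alleles}
    return freq


# Toy data with hypothetical alleles A, B, C and O.
observations = [
    (60, [("A", "O")]),              # unambiguous
    (30, [("O", "O")]),              # unambiguous
    (10, [("A", "O"), ("B", "C")]),  # phase-ambiguous result
]
freq = em_allele_frequencies(observations)
print({allele: round(f, 4) for allele, f in freq.items()})
```

In this toy data set the ambiguous results are attributed almost entirely to the combination of the two frequent alleles, because the alternative combination consists of alleles that are never observed unambiguously. This is the same intuition that underlies the use of problem-free samples to resolve samples with phasing problems.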

Validation Serology
Serological characterization of the blood samples was performed by the German Red Cross (DRK) Blood Donor Service Nord-Ost gGmbH (Institut Chemnitz) and the DRK Blood Donor Service Baden-Württemberg-Hessen, Ulm and Baden-Baden, Germany.