Benchmarking the HLA typing performance of three HLA assays and seven NGS-based HLA algorithms


 Background

With the great progress made recently in NGS (Next Generation Sequencing) technology, sequencing accuracy and throughput have increased, while the cost for data has decreased. Various HLA (Human Leukocyte Antigen) typing algorithms and assays have been developed and have begun to be used in clinical practice. However, there is no systematic benchmarking to evaluate the HLA typing performance of different HLA assays and algorithms. In this study, we compared the HLA typing performance of three HLA assays and seven NGS-based HLA algorithms and assessed the impact of sequencing depth and length on HLA typing accuracy.
Results

Seven HLA typing algorithms at 4- and 6-digit allele levels were compared on three different assays in terms of accuracy, read depth and read length. The algorithms HLA-HD and HISAT-genotype showed the highest accuracy at both 2- and 4-digit resolution, followed by HLAscan. We designed a capture-based HLA assay, which showed comparable or even better performance compared with WES (Whole Exome Sequencing). In the depth evaluation, the sequencing data were down-sampled from 500X to 10X based on the depth of HLA genes. We found that the minimal depth was 100X for HLA-HD and HISAT-genotype to obtain more than 90% HLA typing accuracy at the 6-digit allele level. The accuracy of all three algorithms did not change when the read length decreased from 150 bp to 76 bp.
Conclusion

Although HISAT-genotype and HLA-HD may need more computing resources, we recommend using them for NGS-based HLA genotyping because of their higher accuracy and robustness to sequencing depth and read length. We propose that the minimal sequence depth for obtaining more than 90% HLA typing accuracy at the 6-digit allele level is 100X. Besides, targeting capture-based NGS HLA typing may be more suitable than WES in clinical practice due to its lower sequencing cost and higher HLA sequencing depth.


Abstract Background
With the great progress made recently in NGS (Next Generation Sequencing) technology, sequencing accuracy and throughput have increased, while the cost for data has decreased. Various HLA (Human Leukocyte Antigen) typing algorithms and assays have been developed and have begun to be used in clinical practice. However, there is no systematic benchmarking to evaluate the HLA typing performance of different HLA assays and algorithms. In this study, we compared the HLA typing performance of three HLA assays and seven NGS-based HLA algorithms and assessed the impact of sequencing depth and length on HLA typing accuracy.

Results
Seven HLA typing algorithms at 4-and 6-digit allele levels were compared on three different assays in terms of accuracy, read depth and read length. The algorithms HLA-HD and HISAT-genotype showed the highest accuracy at both 2-and 4-digit resolution, followed by HLAscan. We designed a capture-based HLA assay, which showed comparable or even better performance compared with WES (Whole Exome Sequencing). In the depth evaluation, the sequencing data were down-sampled from 500X to 10X based on the depth of HLA genes. We found that the minimal depth was 100X for HLA-HD and HISAT-genotype to obtain more than 90% HLA typing accuracy at the 6-digit allele level. The accuracy of all three algorithms did not change when the read length decreased from 150 bp to 76 bp.

Conclusion
Although HISAT-genotype and HLA-HD may need more computing resources, we recommend using them for NGS-based HLA genotyping because of their higher accuracy and robustness to sequencing depth and read length. We propose that the minimal sequence depth for obtaining more than 90% HLA typing accuracy at the 6-digit allele level is 100X. Besides, targeting capture-based NGS HLA typing may be more suitable than WES in clinical practice due to its lower sequencing cost and higher HLA sequencing depth.

Background
The human leukocyte antigen (HLA), commonly referred to as major histocompatibility complex (MHC), is located within a region of approximately 4 M in length on the short arm of human chromosome 6 (6p21.3), with more than 200 protein-coding genes [1]. Except for identical twins, no two individuals have exactly the same HLA. Therefore, HLA is also known as the "identity card" of the human cell. It is a marker for the mutual recognition of immune cells in different individuals. MHC gene products are expressed on different cell surfaces and play a key role in antigen presentation and immune signaling.
HLA mainly includes three regions, namely HLA-I, HLA-II and HLA-III. HLA-I genes include HLA-A, HLA-B and HLA-C, which are distributed on almost all nucleated cell surfaces with the highest lymphocyte surface density [2]. HLA-II genes include the HLA-D family, mainly HLA-DP, HLA-DQ and HLA-DR, which are mainly distributed on the surface of professional antigen-presenting cells such as B lymphocytes, macrophages and dendritic cells. The HLA-III gene contains approximately 75 genes, most of which are   of unknown function. HLA-I (MHC I) and HLA-II (MHC II) genes are molecules that encode binding and  presenting antigens, allowing cytotoxic T lymphocytes to bind to mature HLA cell surface proteins via antigen-binding channels. HLA-I genes mainly encode antigens to CD8 + T cells, and HLA-II genes mainly encode antigens to CD4 + T cells.
HLA has been widely used in bone marrow transplantation, detection of susceptibility genes in immunerelated diseases and drug allergy testing. Recent studies have demonstrated that HLA typing complexity is associated with the e cacy of cancer immune checkpoint blockade (ICB) [3]. Furthermore, the combined effect of HLA class I heterozygosity and tumor mutation burden (TMB) on improved survival is greater as compared with mutation load alone. Researchers have also sequenced the CDR3 of the hypervariable region of the T cell receptor (TCR) and found that the TCR CDR3's tumor-associated clones are signi cantly elevated in patients with greater heterogeneity of the HLA class of molecular sites. That is to say, in the treatment of ICB, the diversity of HLA molecules in patients will affect the clonal expansion of T cells against new tumor antigens and thus affect the therapeutic effect [4]. The highly polymorphic HLA genes present unique challenges for the development of molecular approaches to genotyping HLA alleles. According to the traditional method, both alleles of a particular HLA locus are PCR ampli ed and Sanger-sequenced together, resulting in multiple heterozygous positions in the electropherogram tracing. With the development of next-generation sequencing (NGS) technology, each fragment of HLA DNA is ampli ed and sequenced independently, dramatically reducing the phase ambiguities encountered with Sanger sequencing. Since 2009, many different approaches for HLA genotyping by the NGS method have been reported using a variety of capture strategies and sequencing platforms [5][6][7][8][9][10]. Many bioinformatics approaches have also been developed to produce HLA genotyping information from amplicon-based NGS, targeted capture including whole-exome sequencing and non-targeted whole-genome sequencing [11][12][13][14][15][16][17][18] (software used in this study are listed in Table 1).
All these algorithms can be generally divided into two categories: alignment-based methods and assembly-based methods. The former category aligns the sequencing data to the HLA reference database IPD-IMGT [19,20] and predicts HLA genotypes using probabilistic models [21], whereas the latter assembles reads into contigs and aligns those to the known HLA allele reference sequences.
Several studies have been conducted to compare the accuracy of different software [21][22][23][24]. Bauer et al. evaluated the HLA typing accuracy of ve computational methods on three different data sets, nding that PHLAT has the highest accuracy, and sequencing coverage has a weak correlation with accuracy [21]. However, no conclusions have been made regarding several critical questions: Which HLA typing assay is more suitable in a clinical context? Are any HLA typing algorithms biased towards a speci c NGS assay? What are the basic sequencing requirements for accurate HLA genotyping? To answer these questions, we evaluated the performance of different combinations of HLA NGS typing assays and software using our in-house benchmarking dataset.
HLA typing accuracy for all assay-software combinations As a preliminary screening, we rst compared the HLA typing accuracy of all possible assay-software combinations at the 2-, 4-and 6-digit allele levels. The results were much more discordant among different algorithms than among the capture assays used. At the 2-digit allele level, six of the seven algorithms had an overall accuracy of higher than 75% no matter which assay was used ( Fig. 2A). HLA-HD and HISAT-genotype had almost perfect accuracy, whereas the accuracy of HLAVBseq was much lower (the accuracy was 68%, 65% and 50% for Internal, WES and Bofuri, respectively). Among all three assays used, the overall accuracy of Bofuri was the lowest, and our internal NGS assays showed comparable or even better performance compared with WES. As the HLA resolution increased from 2-to 6-digit allele levels, the accuracy of HLA tying gradually decreased ( Fig. 2B and 2C; HLA typing results for HLAminer and HLAseq2HLA at the 6-digit allele level were not available). Only HLA-HD and HISATgenotype showed greater than 75% accuracy at the 6-digit allele level. Thus, the combination of the HLA-HD/HISAT-genotype with our internal assay/WES showed the highest HLA typing accuracy.

Computer resource consumption
All HLA programs were run on a Linux server with the maximum eight threads if possible. As expected, with the increase in panel sizes of NGS capture assays, the running times for all of the software increased (Fig. 3). Unsurprisingly, the running time for WES increased exponentially compared with the other two assays (median running time: WES, 77 min; Internal, 4 min; Bofuri, 3 min). In the other two assays, the most time-consuming algorithms were HLA-HD and HISAT-genotype.

Discordant HLA typing patterns across algorithms
We investigated the speci c patterns of discordance in each algorithm. Among all the algorithms, HLA-VBSeq had the highest number of miscalled HLA typing at the 4-digit allele level, followed by HLAminer (Fig. 4A). Out of the ve HLA genes, HLA-A gene was the most frequently miscalled gene, and the most discordant pattern was A*02:07 to A*02:01 (Fig. 4B). Each algorithm had biases on ratios of miscalled HLA typing within speci c serological allele groups. For example, 81% (57 out of 70) HLA-A miscalled errors observed in HLAforest were within the same serological allele group, whereas the ratio decreased to less than 15% for HLAscan, HLA-HD and HISAT-genotypes (Supplementary Table 2).
The impact of sequence depth and length on HLA typing accuracy Based on the above evaluations, we focused on the three algorithms with the highest accuracies, that is, HISAT-genotype, HLA-HD and HLAscan, to investigate the impact of read length and read depth on HLA typing.
Regarding the depth evaluation, when the sequencing data of Bofuri were down-sampled from 700X to 10X, the accuracies of HLA-HD and HISAT-genotype at the 4-digit allele level were still above 95% at 50X and higher read depths, and then they decreased gradually when the sequence depths were less than 50X (Fig. 5A). The overall accuracy of HLAscan was lower than the other two algorithms. The required sequence depth for HLA-HD and HISAT-genotype to get more than 90% HLA typing accuracy was above 100X at both 4-and 6-digit allele levels (Fig. 5B).
Regarding the read length evaluation, we manually generated paired-end 100 bp (PE100) and paired-end 75 bp (PE75) sequence data based on paired-end 150 bp (PE150) using an in-house pipeline. When the read length decreased from PE150 to PE100 and PE75, the overall HLA typing accuracy was quite similar for each algorithm, except that HLAscan had lower accuracy, as in the depth research ( Fig. 5C and 5D), demonstrating that the selected three HLA typing algorithms were robust to the read length.

HLA typing performance in validation data sets
We selected another 998 Chinese population samples sequenced by the 3DMed internal developed assay.
The reference HLA typing results were de ned as the most concordant HLA types called by these seven algorithms. HISAT-genotype, HLA-HD and HLAscan showed higher accuracy than other algorithms again, and no obvious difference was found for the ve HLA genes when these algorithms were selected ( Fig. 6A and 6B), rea rming our comparison results on HLA typing accuracy.

Discussion
In this study, we performed a benchmarked analysis of HLA typing based on seven algorithms and three capture-based sequencing methods. We found that the choice of NGS-based HLA typing algorithm and the sequencing depth contributed most to the overall HLA typing accuracy. Among the seven algorithms tested, HLA-HD and HISAT-genotype displayed the highest overall accuracies at both 4-digit and 6-digit allele levels. HLA-HD constructed an extensive dictionary of HLA alleles and calculated a score based on weighted read counts to select the most suitable pair of alleles [17]. The high accuracy of HLA-HD was more likely related to its elaborate reference database. For HISAT-genotype, it not only had higher HLA typing accuracy but also could be used in CYP (cytochrome P450) typing and V(D)J (variable (V), diversity (D), and joining (J) recombination) typing, which have broad clinical applications. Besides, it could provide 8-digit allele level HLA genotyping, although no reference HLA genotype was available to evaluate the accuracy.
Though NGS-based HLA typing can type HLA alleles on each homologous chromosome and can function at higher HLA resolutions, it is also limited by read length and read depth because of the highly polymorphic nature of the HLA system [21]. For example, Ka et al. [15] found that read depth is a critical factor for successful HLA typing by HLAscan and recommended a coverage depth over 90X to ensure 100% predictive accuracy for clinical use, whereas in another accuracy evaluation study of ve HLA typing methods, only a weak Pearson correlation between HLA typing accuracy and coverage was found [21]. In this study, we evaluated the impact of read depth and read length on the HLA typing accuracy of three algorithms, and the result showed that HLA typing accuracy decreased gradually when the sequence depth was down-sampled from 700X to 10X regardless of which algorithm was used, demonstrating that read depth was a critical factor for accurate HLA typing. To achieve more than 90% HLA typing accuracy at the 4-digit level, the minimal read depth was 50X for the three algorithms used, whereas 100X read depth was needed for HLA-HD and HISAT-genotype to obtain 90% overall accuracy at the 6-digit level.
Though HLA genotyping accuracy was generally concordant among the three NGS assays, our internal capture-based assay showed comparable or even better performance compared with WES, no matter which algorithms were used. Our internal assay designed exon probes of 10 HLA genes (HLA-A, HLA-B, HLA-C, HLA-DPA1, HLA-DPB1, HLA-DQA1, HLA-DQB1, HLA-DRA, HLA-DRB1 and HLA-DRB5). The geographic range is the union of the coding regions of all possible transcripts of the gene. Design rationale included but were not limited to the following: (1) For exons longer than the length of the probe, the target area is completely covered by overlapped probes, and the overlaps are larger than 60 nt. (2) Each probe was aligned to the whole genome by BLAT [27]. The total score was calculated based on the number of hits. The higher the score, the worse the probe speci city. Probes with scores greater than 2 were not considered. (3) Probes were not considered in regions of homologous repeats (e.g., SINE, LINE, LTR, etc.). A well-designed probe may improve probe speci city and HLA exon coverage, thus contributing to the accuracy of NGS-based HLA genotyping.
Different algorithms showed different miscall patterns, with HLA-A*02:07 to HLA-A*02:01 being the most widely miscalled allele by HLAforest, seq2HLA and HLA-VBSEq. It has been reported that the only difference in the peptide sequence between HLA-A*0201 and HLA-A*0207 is the 123rd amino acid, which is either Tyr or Cys [28], making it di cult to type HLA accurately by less sensitive algorithms. Researchers have also demonstrated that HLA-A*0207 is the most common HLA-A2 subtype among Chinese [29], and the HLA-A0207 peptide binding repertoire is limited to a subset of the A0201 repertoire [30], so we need to pay more attention to this allele in practice when these algorithms are used.
One of the drawbacks of this study was that only seven HLA typing algorithms (which were selected considering the ease of use of the software and the number of citations of the corresponding articles) were used in this benchmarking evaluation. For example, Polysolver [31] was not evaluated in this study because it depends on Novoalign, which requires commercial components and hence was not executable for us. Besides, all algorithms were run with their default parameters, which may not represent the best performance of the algorithms.

Conclusions
In conclusion, the choice of algorithm and sequencing depth are very important to the accuracy of HLA typing. Among all the algorithms evaluated here, we recommend using HISAT-genotype or HLA-HD for NGS-based HLA genotyping because of their higher accuracy and robustness to sequencing depth and read length. We also propose a minimal sequence depth of 100X for obtaining more than 90% HLA typing accuracy at the 6-digit allele level.

Sample preparation
A total of 24 samples were collected, and genomic DNA was extracted from white blood cell samples HLA genotyping assays HLA genotyping from the amplicon assay NGSgo-AmpX was used as the benchmark reference. NGSgo-AmpX consists of dedicated primer sets for the ampli cation of individual HLA genes, enabling the ampli cation of the following HLA genes: Class I: HLA-A, HLA-B and HLAC-C; and Class II: HLA-DRB1 and HLA-DQB1 (GenDx, Utrecht, The Netherlands). Three capture-based assays include 1) Agilent SureSelect Human All Exon V5 + UTR kits according to the Menu and paired-end sequencing (150PE) was carried out using standard Illumina protocols on an Illumina HiSeq X10 system (WES for short). Each sample met the average depth over 100X and capture on-target ratio > 50%. 2) IDT xGen® Exome Research Panel kits according to the Menu and paired-end sequencing (150PE) was carried out using standard Illumina protocols on an Illumina HiSeq X10 system (Bofuri for short). Each sample met the average depth over 100X and capture on-target ratio > 60% (10 samples were not available). 3) 3DMed Inc. in-house designed and developed HLA speci c probes and paired-end sequencing (150PE) was carried out using standard Illumina protocols on an Illumina HiSeq X10 system (Internal for short). Each sample met the average depth over 100X and capture on-target ratio > 60%.

NGS-based HLA genotyping algorithms
We compared seven publicly available algorithms for HLA typing: seq2HLA, HLAminer, HLAscan, HLA-VBSeq, HLA-HD, HLAforest and HISAT-genotype. The algorithms were chosen considering their accessibility and number of citations. All algorithms were run according to their respective manuals with default parameters. For HLAscan and HLA-VBSeq analysis, raw sequence data were rst mapped to the human reference genome UCSC hg19, and reads from chr6 of the BAM les were then generated as an input. For the other algorithms, we used raw sequence les as an input. HLA typing accuracy was de ned as the percentage of correctly identi ed alleles among all the reference alleles. We tested the HLA typing accuracy of all seven algorithms and selected the three with the highest overall accuracy for our read depth and length evaluation.
Linux server hardware con guration All software were run on a Linux server (CentOS6.5, kernel version: 2.6.32-431.11.2) with the hardware con guration as follows: Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30 GHz / 250 GB RAM / more than 10 TB disk space. R software was used for statistical analysis and plot creation (version: 3.6.1).
Abbreviations NGS: next-generation sequencing; HLA: human leukocyte antigen; MHC: major histocompatibility complex; ICB: immune checkpoint blockade; TMB: tumor mutation burden Declarations Ethics approval and consent to participate The study protocol was reviewed and approved by the Research Ethics Committee of the First A liated Hospital, College of Medicine, Zhejiang University (Number: 2017KYKSD678-1). Written informed consent for genetic testing was obtained from each participant.

Consent for publication
Written consents were obtained from each participant to publish their details.

Availability of data and materials
The raw sequencing data (FASTQ) generated during this study are not publicly available due to the policy in China. Benchmarking HLA typing results of the 24 samples and the number of miscalled HLA genotypes used in this studies are available as supplemental data.  Tables   Table 1. HLA-typing software used in this study. Figure 1 Work ow of HLA typing using benchmarked data sets. All HLA typing algorithms were run with default parameters.

Figure 3
Running time for different HLA typing software.  Accuracy of the three tools for HLA typing at 4-digit allele or 6-digit allele resolution for different depths and read lengths. Depth evaluation at A) 4-digit allele level; B) 6-digit allele level. For sequence depth evaluation, alignment les of the 24 Bofuri samples were down-sampled from 700X to 10X based on the raw depths of HLA genes. C) and D) are the overall HLA typing accuracy at 4-digit allele and 6-digit allele level, respectively, while the read length decreased from 150 bp to 76 bp.

Figure 6
Positive prediction agreement of the seven algorithms for HLA typing at different resolutions and genes.
A) Agreement for the seven algorithms at 2-or 4-digit allele levels. B) Agreement for the seven algorithms for different HLA genes at 4-digit allele level.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.