SpecHLA enables full-resolution HLA typing from sequencing data

Summary Reconstructing diploid sequences of human leukocyte antigen (HLA) genes, i.e., full-resolution HLA typing, from sequencing data is challenging. The high homogeneity across HLA genes and the high heterogeneity within HLA alleles complicate the identification of genomic source loci for sequencing reads. Here, we present SpecHLA, which utilizes fine-tuned reads binning and local assembly to achieve accurate full-resolution HLA typing. SpecHLA accepts sequencing data from paired-end, 10×-linked-reads, high-throughput chromosome conformation capture (Hi-C), Pacific Biosciences (PacBio), and Oxford Nanopore Technology (ONT). It can also incorporate pedigree data and genotype frequency to refine typing. In 32 Human Genome Structural Variation Consortium, Phase 2 (HGSVC2) samples, SpecHLA achieved 98.6% accuracy for G-group-resolution HLA typing, inferring entire HLA alleles with an average of three mismatches fewer, ten gaps fewer, and 590 bp less edit distance than HISAT-genotype per allele. Additionally, SpecHLA exhibited a 2-field typing accuracy of 98.6% in 875 real samples. Finally, SpecHLA detected HLA loss of heterozygosity with 99.7% specificity and 96.8% sensitivity in simulated samples of cancer cell lines.


INTRODUCTION
Human leukocyte antigen (HLA) genes encode the major histocompatibility complex (MHC) molecules essential in adaptive immune system regulation. 1 HLA typing assists in organ transplantation and preimplantation genetic diagnosis 2 and yields insights into the molecular mechanism of autoimmune disorders, 3 infectious diseases, 4 cancer immunotherapy, 5 etc.Many HLA typing methods identify the most compatible pair of alleles for each gene that match the sequencing data from the database. 6For instance, OptiType 7 maximizes the number of mapped reads on the inferred alleles by employing integer linear programming.PolySolver 8 selects alleles via a Bayesian model by integrating base qualities of aligned reads, observed insert sizes, and ethnicity-dependent allele frequency.HLA-VBSeq 9 optimizes both read alignments to alleles and relative quantities of reads based on variational Bayesian inference.HLA-HD 10 maps reads to exons and introns of alleles in the database and utilizes weighted read counts to choose suitable allele pairs.HLA*PRG 11 transforms the allele database to a population reference graph and applies the read-to-graph alignment algorithm to infer the most likely allele pair.The method arcasHLA 12 aligns reads to the de Bruijn graph constructed from the reference transcriptome and then identifies the allele pair using the k-mer structure.HLA*LA 13 projects the linear read alignment on the reference graphs, allowing allele inference from short-or long-read data.However, these methods have several limitations: (1) they are imprecise if the target alleles are incomplete or absent from the MOTIVATION HLA genes play a crucial role in the regulation of the adaptive immune system, and HLA typing is an essential approach for understanding numerous diseases.Nevertheless, existing HLA typing methods are underdeveloped due to two primary challenges.First, the high similarity among alleles of different HLA genes poses a challenge.Second, the exceptional polymorphism of HLA alleles within the same gene further complicates the typing process.In this study, we utilize fine-tuned reads binning to overcome the issue of allele similarity and leverage local assembly to address the challenge of allele polymorphism, allowing accurate read alignment.By identifying and phasing variants based on enhanced read alignment, we can accurately reconstruct the diploid sequences of HLA genes.Additionally, the genotype frequency of phased variants allows us to detect HLA loss of heterozygosity.
database 14 ; (2) they are inefficient in adopting the ever-expanding list of alleles, as the HLA database updates frequently 6 ; (3) they are inaccurate for the high level of resolution 14 ; and (4) they cannot infer the exact sequence of HLA alleles.Recently, researchers have attempted to reconstruct the diploid sequences of HLA genes, i.e., full-resolution HLA typing.Lee and Kingsford developed Kourami 6 to assemble HLA exons using the modified partial-order graph.Kim et al. implemented HISAT-genotype 15 that splits aligned reads into k-mers and resolves assembly ambiguities to generate full-length HLA alleles.
Assembly and alignment are indispensable for full-resolution HLA typing.The assembly-based approach is to assemble reads into haplotype-resolved sequences, and the alignment-based approach is to align reads to a reference allele and then identify and phase the variants.Nevertheless, assembly and alignment both suffer from two challenges.First, the alleles of different HLA genes possess a high similarity, 7 making it complicated to identify the genomic sources of reads unambiguously.Hence, typing an HLA locus would be interfered with by reads derived from other homologous loci.For instance, reads originating from the HLA-A-like pseudogene HLA-Y hamper typing HLA-A. 10Although a previous method attempted to discard the reads potentially generated from other loci to type the target locus, the method cannot reconstruct HLA allele sequences and is not open source. 16Second, HLA alleles of the same gene are exceptionally polymorphic; consequently, the reads might be highly different from the reference allele.Due to the limitation of scoring systems of alignment algorithms, homologous reads are possibly mapped to alternative coordinates.Moreover, the highly variable reads render assembly highly fragmented and prone to error. 17ere, we present a software package, SpecHLA, for accurate full-resolution HLA typing on HLA-A, -B, -C, -DPA1, -DPB1, -DQA1, -DQB1, and -DRB1 genes.The package adopts finetuned reads binning and local assembly for precise read alignment and reconstructs complete diploid sequences of HLA loci through variant phasing.SpecHLA determines the reads belonging to an HLA locus by aligning the reads to known HLA alleles to solve the first challenge.Furthermore, SpecHLA assembles reads mapped to each highly divergent region into contigs independently and realigns the reads to the reference allele through the assembled contigs to overcome the second challenge.SpecHLA achieved 97.6%, 98.4%, and 99.1% 2-field typing accuracies in 230 whole-genome sequencing (WGS) samples, 183 whole-exome sequencing (WES) samples, and 462 RNA sequencing (RNA-seq) samples from the 1000 Genomes Project, respectively.In the 32 WGS samples from the Human Genome Structural Variation Consortium, Phase 2 (HGSVC2) Project, 18 SpecHLA achieved 98.6% HLA typing accuracy at G group resolution.Also, on average, it reconstructed entire HLA alleles with three mismatches fewer, ten gaps fewer, and 590 bp less edit distance than HISAT-genotype compared with ground-truth per allele.
Furthermore, the development of sequencing technologies makes it necessary to support HLA typing for multiple data protocols.Unlike most methods that are limited to a single data protocol, SpecHLA accepts the sequencing data of paired-end (PE), 103 Genomics, 19 high-throughput chromosome conformation capture (Hi-C), 20 Pacific Biosciences (PacBio), and Oxford Nanopore Technology (ONT). 21Additionally, it has been recognized that utilizing pedigree relations 22 and genotype frequency 23 can guide variant phasing and help to produce more precise diploid sequences.SpecHLA can incorporate the pedigree relations and combine genotype frequency to generate more reliable HLA typing results.
HLA loss-of-heterozygosity (LOH) events frequently occur in patients with cancer and can cause immune evasion during cancer evolution. 24For example, loss of HLA-CÃ08:02 was proposed to result in immune evasion in metastatic colorectal cancer. 25Computational methods have been developed to identify LOH events from sequencing data.LOHHLA 26 infers LOH by counting the uniquely mapped reads onto the typed alleles.DASH 27 applies machine learning to predict LOH using features such as adjusted b-allele frequency, sequencing depth ratio, consistency of sequencing depth, etc. SpecHLA detects LOH events based on the genotype frequency of phased variants.SpecHLA exhibited 99.7% specificity and 96.8% sensitivity for LOH detection in 300 simulated samples from cancer cell lines with various tumor purities.SpecHLA enables accurate full-resolution HLA typing and LOH detection in MHC class I and II genes from different data types.

Overview of SpecHLA
SpecHLA collects the sequencing reads of the HLA region (Figure 1A) and then maps the collected reads to the HLA allele database and bins the reads to HLA loci according to their sequence identity with the mapped alleles (Figure 1B).At each HLA locus, the first recorded allele in the IMGT/HLA database is utilized as the IMGT representative reference.We align the binned reads to the IMGT representative reference.Then, we extract the reads aligning to each highly divergent region of the reference and assemble them into contigs independently (Figure 1C).The extracted reads are realigned to the IMGT representative reference coordinated by the assembled contigs.We next call small variants and long insertions or deletions (indels) (Figure 1D) and phase the variants across the gene.Small variants are phased using spectral graph theory 28 (Figures 1E and S1A), and long indel variants are distinguished with supporting reads.We generate diploid sequences of each HLA locus through the phased variants.The generated sequence is aligned to the IMGT/HLA database 29 for the official designation, considering the ethnicity-dependent allele frequency (see Figure S1B for the naming scheme of HLA alleles).To detect LOH events, we infer the sequence frequency at each HLA locus based on the genotype frequency of phased variants (Figure 1F).We validated SpecHLA for 2-field typing, sequence reconstruction, and LOH detection using both actual and simulated samples (Figure 1G).The functionalities of SpecHLA and the state-of-the-art tools related to HLA are summarized in Figure 1H.
Additionally, we evaluated the methodology of SpecHLA in various aspects.First, we investigated the impact of different representative reference alleles on SpecHLA's performance.We conducted separate runs of SpecHLA in the 230 WGS samples from the 1000 Genomes Project, using ten randomly chosen reference alleles and the default IMGT representative reference.For most genes, the 2-field HLA typing results showed no significant differences when using different reference alleles, except for the HLA-DRB1 gene (see Figure S1C).Notably, when using the randomly selected allele DRB1*09:01:02:02 as the reference, the accuracy of HLA-DRB1 significantly decreased to 65.3%.Conversely, employing the default reference allele resulted in relatively high accuracy across all genes, suggesting that Report ll OPEN ACCESS SpecHLA exhibits greater robustness when utilizing the default reference allele.Second, in terms of computational resource consumption, SpecHLA exhibited relatively higher CPU time requirements but demanded less memory compared with other tools (Figures S1D and S1E).Third, we evaluated the effect of alignment tools on the read-binning performance in 2,000 simulated samples.When using Novoalign, the read-binning step exhibited an average precision of 99.8% and a recall of 89.4%.Second, the typing accuracy of SpecHLA was robust with varying sequencing depths.The 230 WGS samples have sequencing depths ranging from 103 to 493 in the MHC region.We observed that the 2-field typing accuracy of SpecHLA raised from 98.1% to 99.7%, with the depth cutoff increasing from 103 to 223 (Figures S2A and S2B).Moreover, we subsampled the 87 IHWG cell line WES samples 30 for various sequencing depths (103, 153, 203, 303, 353, and 403).The average MHC class I gene typing accuracies of SpecHLA, HISAT-genotype, HLA-HD, HLA*LA, HLA-VBSeq, Kourami, OptiType, and Polysolver across different depths were 93.4%, 80.7%, 90%, 82.7%, 80.2%, 57%, 93%, and 88.3%, respectively (Figures 2D and  2E).The average MHC class II gene typing accuracies of SpecHLA, HISAT-genotype, HLA-HD, HLA*LA, HLA-VBSeq, and Kourami were 91.7%, 88.7%, 86.1%, 86.9%, 88.1%, and 72.7%, respectively.SpecHLA had a typing accuracy that increased steadily with the increase of sequencing depths.
Furthermore, SpecHLA reconstructed sequences highly trio consistent in two family trios.We calculated the difference between the child's inherited and the parent's original sequences.The difference was measured by mismatch rate and gap rate.On average, the mismatch and gap rates were 7.69eÀ4 and 2.63eÀ4, respectively (Figure S2I).We visualized the phased exonic variants of the trios using Oviz-Bio. 31The phased genotypes were consistent with the trio structure (Figure 2M).Also, the 4-field typing results of SpecHLA were trio consistent for 93.8% (30/32) of the alleles in the children.These results consistently implied that SpecHLA is accurate in full-resolution HLA typing.

SpecHLA works with different data types
SpecHLA can integrate short and long reads for HLA typing.Seven individuals of HGSVC2 have both PE and PacBio data, and with these data, we performed SpecHLA using PE data and PE plus PacBio data separately.The mean mismatch rates without and with PacBio data were 1.95eÀ3 vs. 1.37eÀ3, and gap rates were 3.16eÀ4 vs. 2.98eÀ4 (Figures 3A and 3B).The result indicated that SpecHLA could efficiently integrate short and long reads to yield better results.SpecHLA can also accept long reads independently for HLA typing.Apart from the seven HGSVC2 PacBio samples, we collected a PacBio and an ONT sample of the individual NA12878, resulting in a total of nine long-read samples.In the nine long-read samples, SpecHLA and HLA*LA achieved G-group-resolution typing accuracies of 95.1% and 94.4%, respectively (Figure S2J).
In addition, simulated data showed that SpecHLA could combine PE data with PacBio, ONT, 103, and Hi-C data.We generated 50 individuals using simulated novel alleles and produced sequencing reads with different protocols.We performed SpecHLA with different data permutations of the 50 individuals.The average mismatch rates were 5.84eÀ4, 5.46eÀ6, 8.92eÀ6, 2.61eÀ6, 2.61eÀ6, 2.03eÀ5, and 1.97eÀ4 using PE, PacBio, ONT, +PacBio, +ONT, +103, and +Hi-C data, respectively (Figure 3C).The plus symbol (+) indicates the integration of PE and the present data.The highest mismatch rate was observed when utilizing only PE data.The lowest mismatch rate was achieved by integrating PE data with either PacBio or ONT data.Incorporating Hi-C data was not as efficient as long-read and 103 data.
SpecHLA could refine HLA typing with pedigree relations in actual and simulated family trios.We executed SpecHLA without and with pedigree relations in two family trios from the 1000 Genomes Project.The mean mismatch rates for children without and with pedigree relations were 1.10eÀ3 and 7.89eÀ4, respectively, while the corresponding gap rates were 3.39eÀ4 and 3.21eÀ4 (Figures 3D and 3E).We also simulated 50 family trios with PE reads; the read length, depth, and error rate were 75 bp, 303, and 1%, respectively.In simulated trios, SpecHLA exhibited average mismatch rates of 3.59eÀ3 and 2.10eÀ3 for children without and with pedigree relations, respectively, and the corresponding gap rates were 9.06eÀ4 and 7.73eÀ4 (Figures S2K and S2L).The results indicated that SpecHLA reduced variant phasing ambiguities using pedigree relations.
SpecHLA demonstrated the ability to incorporate genotype frequency for HLA typing in allelic imbalance samples.We ran SpecHLA separately using only reads (''read''), using the integration of reads and genotype frequency (''hybrid''), and using only genotype frequency (''geno'').First, to investigate whether SpecHLA can incorporate genotype frequency, we performed SpecHLA on 200 allelic imbalance samples, which were LOH samples simulated from cancer cell lines.When solely using genotype frequency, SpecHLA achieved a 2-field typing accuracy of 90% (Figure 3F).Next, we examined the sequence reconstruction performance of SpecHLA after using genotype frequency.We ran SpecHLA in a simulated dataset comprising 80 allelic imbalance samples.Each gene contained one haplotype with a depth of 203 and the other haplotype with a depth of 803.On average, in the dataset, the mismatch rates for ''read,'' ''hybrid,'' and ''geno'' were 3.49eÀ4, 3.98eÀ4, and 2.79eÀ4, respectively (Figure 3G).Notably, using only genotype frequency achieved the lowest mismatch rate.Furthermore, we tested SpecHLA's genotype frequency incorporation ability with different allelic depths.SpecHLA was applied to a dataset of 100 simulated novel-allele samples with varying allelic depths (Figure 3H).In this dataset, the average mismatch rates for ''read,'' ''hybrid,'' and ''geno'' were 9.93eÀ4, 4.34eÀ4, and 6.40eÀ4, respectively.The combined utilization of sequencing reads and genotype frequency achieved the best performance.These results showed that SpecHLA can efficiently utilize genotype frequency for HLA typing.

SpecHLA precisely infers HLA LOH events
SpecHLA identified HLA LOH events more accurately than DASH and LOHHLA in simulated data from cancer cell lines.We simulated 300 samples by combining HLA exonic data from cancer cell lines and matched normal cells, in which 200 samples possessed LOH events.For MHC class I genes, the LOH detection specificities of SpecHLA (99.7%) and LOHHLA (100.0%) were much higher than that of DASH (44.3%) (Figure 4).SpecHLA exhibited a higher sensitivity for LOH detection than LOHHLA across a range of tumor purities from 0.2 to 0.9, and the average sensitivities of SpecHLA and LOHHLA with tumor purity ranging from 0.1 to 1 were 96.8% and 92.8%, respectively.SpecHLA was more accurate than DASH and LOHHLA with MHC class I genes in the 300 samples.Additionally, SpecHLA exhibited a high LOH detection accuracy with a specificity of 97.5% and a sensitivity of 92.0% for MHC class II genes (Figure 4), but DASH and LOHHLA cannot handle such genes.Furthermore, through additional simulated data, we demonstrated the robustness of SpecHLA in handling allelic imbalance samples (Figure S2M) and different tumor purity and ploidy conditions (Figure S2N).

DISCUSSION
HLA genes persist in high sequence identity with the genes in the same class.For instance, HLA-A has high identities with HLA-B, HLA-C, and the pseudogene HLA-Y. 10Reads from different HLA genes are mixed during sequencing; some of them are prone to align to the faulty gene, introducing errors in HLA typing.SpecHLA identifies the exact HLA locus for sequencing reads based on two assumptions: (1) HLA alleles belonging to the same HLA gene have a higher sequence identity than those from different genes, and (2) de novo alleles have high sequence identities with the known ones.SpecHLA permits adjusting read-assignment parameters to retain more novel small variants for de novo allele reconstruction.The curated HLA database grows rapidly. 29SpecHLA has the advantage of adapting the iterative updates of the database.Based on the reads-binning strategy, we can use continuously added alleles in the database to assign reads more precisely.On the contrary, the database-matching methods optimized for a specific collection of alleles thus cannot efficiently remodel the ever-expanding list of alleles.Furthermore, they may even suffer from increased HLA alleles due to the increment of searching complexity. 14

Report
Previous research has highlighted the incompleteness of current HLA allele databases, 14 which poses a challenge for accurate HLA designation.Even if the reconstructed sequence is correct, the absence of the allele in the database can lead to erroneous designation results.Conversely, directly analyzing the HLA sequences is not affected by the incompleteness of the database.Furthermore, clustering the HLA sequences into groups using self-defined distance measurements has the potential to offer novel biological insights.As a result, we anticipate that conducting HLA typing at full resolution will become a prominent trend in future research endeavors.
MHC class II molecules present processed antigens to CD4(+) T lymphocytes, which is crucial for antigen-specific immune response. 32Notably, LOH events can occur in MHC class II genes, as observed in a previous study on a patient with acute myeloid leukemia. 33Tumor-specific MHC class II expression may be a clinically actionable biomarker of response to immune checkpoint inhibition, while LOH of MHC class II genes may reduce response to immunotherapies. 34Many HLA-related tools merely focus on MHC class I genes.SpecHLA can perform full-resolution HLA typing and detect LOH events in MHC class II genes, which provides the opportunity to study MHC class II genes further.

Limitations of the study
To achieve accurate reconstruction of HLA sequences, our method, SpecHLA, requires a relatively high read alignment depth, typically 53 or greater.Any sequence regions with insufficient read depth are automatically masked by SpecHLA.Moreover, low-depth data can negatively impact the accuracy of SpecHLA's long indel detection, leading to lower completeness and increased contamination in the reconstructed HLA sequences.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
This work described in this paper is funded by the Applied Research Grant 9667204.We thank Prof. Sean Michael Boyle for providing the CRL-2314, CRL-5915, and CRL-5922 cell lines data and matched normal cell data.We also thank Mr. Yonghan Yu for the assistance in using SpecHap and thank Mr. Bowen Tan for the help in the long indel phasing.

AUTHOR CONTRIBUTIONS
S.C.L. designed and supervised the study.S.W. and M.W. developed and evaluated the software.L.C. was involved in software implementation.G.P.  them into contigs with Fermikit v.0.13. 39We may have one or more assembled contigs for each highly divergent region.After that, each extracted read is then mapped to the assembled contig with the highest alignment score by BWA MEM.We also map the contigs to the IMGT reference using Blastn v.2.3.0. 36Last, we re-build the alignment between the extracted reads and the IMGT reference by projecting the reads onto the IMGT reference according to aligned loci between the reads and the contig, and between the contig and the IMGT reference.

Calling variants
With the reads aligned to the IMGT reference, we identify the variants.We categorized the variants into two types, small variants and long indels.Small variants consist of single nucleotide variants (SNVs) and short indels.Long indels are insertions and deletions longer than 150 bp.We detect both small variants and long indels from the reads alignment.Freebayes v.1.2.0 49 is used to call small variants, and ScanIndel v.1.3 38is adopted to detect long indels.To handle PacBio reads, we also allow adopting pbsv v.2.6.2 to call long indels based on the alignment file processed by pbmm2 v.1.4.0 (https://github.com/PacificBiosciences/pbbioconda).At the variant locus with more than two alleles, if the sum of the frequency of the two highest-frequency alleles is larger than 0.7, we retain the two highest-frequency alleles of the variant.Otherwise, we discard the variant.
Phasing small variants guided by spectral graph theory SpecHLA phases small variants based on spectral graph theory using SpecHap v.1.0. 28We model the variant phasing problem as a graph-bipartition problem.With n heterozygous variants, we construct an undirected graph G of 2n vertices.Assume each locus has two variants.Each vertex represents an allele of the variant locus (0 or 1).The edge among two vertices of different loci indicates the two variants are from the same haplotype; the edge weight is the logarithmic likelihood of the corresponding haplotype.We designate reads that have been mapped to a minimum of two variant loci as phase-informative reads.These reads can originate from a range of sequencing protocols, including but not limited to PE, 10x, Hi-C, PacBio, and ONT.We calculate the edge weight from phase-informative reads.Assume q i;j as the likelihood of nucleotide(s) at variant locus j is mistaken on read R i .At two variant loci, given the haplotype h, the likelihood of observing the read R i is as where ð1 À q i; j Þ Ri; j = hj is ð1 À q i; j Þ if R i; j and h j are equal, and 1 otherwise; ðq i; j Þ Ri; j shj is ðq i; j Þ if R i; j and h j are different, and 1 otherwise.Denote h as the complementary haplotype of h, the likelihood of the read R i derives from h is pðhÞ.We determine which haplotype the read supports by comparing the likelihoods with maxfpðhÞ;pðhÞg.Given the self-complement haplotype pair H = ðh;hÞ, the likelihood of observing the read set R is inferred as where i indicates the index of the read in the set R. Then, we assume the edge weight of conflicting haplotype H 1 and H 2 as 3) In this way, we deduce the edge weight between every two variant loci.After the graph construction, the 2n vertices of the graph G can be partitioned into two subgroups by the sign (+/À) of the Fiedler vector.Therefore, we obtain the haplotypes equivalent to vertex subgroups.The locus with the absolute value of the Fiedler vector lower than 1e-5 will not be phased.We illustrate an example of phasing variants based on the Fiedler vector in Figure S1A.

Incorporating genotype frequency to phase in allelic imbalance samples
The alleles from different variant loci on the same haplotype should share similar frequencies.Hence, genotype frequency can guide variant phasing in allelic imbalance samples. 23SpecHLA incorporates genotype frequencies in edge weight computation.Assume bðh j Þ as the frequency of allele h j at variant locus j.At two variant loci, given a self-complement haplotype pair H = ðh;hÞ, we define the function f as (Equation 5) Given conflicting haplotype H 1 and H 2 , we calculate the edge weight from both sequencing reads and genotype frequency as Cell Reports Methods 3, 100589, September 25, 2023 e3 Report where w is hyper-parameter, 0 % w % 1, w is 0 by default.The genotype frequency is incorporated into variant linkage graph construction for variant phasing.Evaluation of SpecHLA with different values of w can be seen in Figures S2O-S2Q.
Adopting pedigree relations to phase SpecHLA has the capability to enhance phasing results through the utilization of pedigree relations when such relations are accessible.We refine the phasing outcome by establishing a connection between the child's unlinked blocks guided by the parents' phased blocks.Moreover, at each phased block of the child, we employ parental locus linkage as a reference, enabling the correction of the child's haplotype in instances where conflicts arise.Specifically, let S represent the child's phased blocks, F denote the father's phased blocks, and M denote the mother's phased blocks.SpecHLA identifies connections between distinct phased blocks of the child based on parental blocks.Let S i and S j represent two unconnected phased blocks of the child.If a parental block f (f ˛FWM) encompasses both blocks, we consider a link to exist between S i and S j .Two types of links may exist: (i) direct link from S i to S j , and (ii) link from S i to the flipped version of S j .If the parental blocks support different link types, SpecHLA assigns a score 6 to each link type and determines the final link type with a higher 6 value.The calculation of 6 involves examining the variant sites in the overlap region of f with S i and S j separately.Consider one haplotype of a parental block f.If the alleles from the child and parents are the same at a site, 6 is incremented by one; otherwise, it is decremented by one.This process is repeated for the other haplotype of f (i.e., the complement haplotype), and the maximum value is selected as the final 6 score.
SpecHLA can also correct falsely phased variants of children using parental phased blocks.If a variant within the child-phased block conflicts with both the paternal and maternal haplotypes, SpecHLA identifies this site as wrongly phased and flips it.For instance, consider a child's phased block with two variant sites: the first haplotype is [0,1], and the other haplotype is [1,0].If the haplotype of the father is [0,0] and the haplotype of the mother is [1,1], SpecHLA will flip one of the variant sites in the child.Phasing unlinked blocks guided by HLA database SpecHLA uses known HLA alleles from the IMGT/HLA database to phase unlinked blocks based on spectral graph theory.The procedure is based on the assumption that a haplotype with higher similarity to known HLA alleles is more likely to be correct.To construct the linkage graph, we denote the self-complement haplotype pair of each phased block as two vertices.The edge indicates that the haplotypes should be linked.We merge the haplotypes from two blocks and align the merged haplotype h to known HLA alleles.For each mapped allele, we calculate an alignment score g according to the mapping identity I and the mapped length L by g = IL.dðhÞ represents the highest g value among the mapped alleles of h.At two blocks, for a self-complement merged haplotype pair H = ðh; hÞ, we have dðHÞ = dðhÞ + dðhÞ: (Equation 8) The edge weight of conflicting haplotype H 1 and H 2 is defined as dðH 1 Þ and dðH 2 Þ, respectively.Afterward, unlinked blocks are phased by partitioning the vertices with the Fiedler vector.

Phasing long indel loci
We link the long indel to the small variant phased haplotypes.The long insertion sequences and the IMGT representative reference are combined to generate a modified reference.The modified reference is split into segments based on the breakpoints of long indels.In particular, if a deletion overlaps with another, it would be divided into two segments at the overlap boundary.The segments can be classified into three categories: insertion, deletion, and regular.We align reads to the modified reference using BWA MEM.The copy number of long indels is estimated based on the reads alignment.We compute the average depth of all regular segments (m) and each insertion segment (s).The copy number of the insertion segment is deduced as two if s=m > 0:85 and one otherwise.Phasing of small variants takes place within insertion segments characterized by a copy number of two.For each deletion segment, the unmapped ratio (representing the proportion of zero-depth regions) is assessed.A copy number of zero is inferred for deletion segments if the unmapped ratio exceeds 0.2; otherwise, it is determined as one.
The heterozygous long indels are phased with the supporting reads.This involves the collection of reads that independently support each haplotype phased for small variants.Additionally, the determination of which allele is supported by the reads at each long indel locus allows for a linkage to a haplotype featuring a greater number of shared supporting reads.The diploid sequences are generated with phased small variants and long indels.Finally, we slide a window of 20 bp along the gene; the window with a mean depth less than z is masked with 'N' (z is five by default).Sequence inference of the HLA-DRB1 duplication The region spanning 3,900 to 4,400 bp within the HLA-DRB1 locus harbors a long duplication in the IMGT representative reference.In response to this challenge, a strategic approach was implemented: homologous sequences of this region were extracted from the entire ensemble of HLA-DRB1 alleles contained within the IMGT/HLA database.Through subsequent multiple sequence alignments, a discernable set of unique sequences emerged from this process.Among these, a judicious selection was made of the eight most indicative sequences, which collectively constituted a targeted regional reference database.For a given sample, we extract reads mapped to this region and methodically align them to this reference database.The two sequences commanding the highest rankings in terms of coverage and depth are then chosen for inclusion.These selected sequences are seamlessly integrated into the phased haplotypes.Within each haplotype, the sequence possessing the greatest aligned read count is identified.The net outcome of these procedures culminates in the comprehensive reconstruction of full-length diploid sequences of the HLA-DRB1 locus.Reporting HLA official designations SpecHLA assigns each reconstructed sequence an official designation corresponding to the best-matched allele sourced from the IMGT/HLA database. 29To achieve this assignment, a comprehensive comparison is conducted between the reconstructed sequence and all alleles documented in the database, leveraging Blastn.During this process, a sequence is annotated with the official designation of the allele exhibiting the highest identity to this sequence.In instances where multiple alleles share the same highest identity, the allele endowed with the highest ethnicity-dependent prior frequency is prioritized as the best-matched allele.Notably, all alleles achieving the highest identity are duly reported in the results.Furthermore, we undertake a direct alignment of the inferred HLA sequences with the exon database to extract G group resolution annotations.
SpecHLA provides five choices to determine how to use the population-dependent allele frequency, including Asian, Black, Caucasian, Unknown, and nonuse.We classify samples into Asian, Black, and Caucasian populations based on the geographical regions.The HLA allele population frequency table was downloaded from the Allele Frequency Net Database (http://www.allelefrequencies.net/).SpecHLA will use the population-specific allele frequency for designations if given Asian, Black, or Caucasian.SpecHLA will apply the mean frequency among the populations if given Unknown and will ignore the population frequency information if given nonuse.In the 183 WES samples from the 1000 Genomes project, we found that SpecHLA maintained consistent 2-field typing accuracy, irrespective of the inclusion or exclusion of population frequency information.

Detecting LOH events in cancer samples
To detect LOH events in cancer samples, we first compute haplotype frequencies (a 1 and a 2 ) with the genotype frequency of phased small variants.At each variant locus j, there are two alleles 0 and 1, and the frequency of allele 1 is denoted as b j .b values can be observed from sequencing data and deduced from the haplotype.The observed allele frequency b b j is obtained from the VCF file at each locus j.Moreover, we deduce the expected allele frequency b j with phased genotype g j = ðg j;1 ; g j;2 Þ, where g j;1 ; g j;2 f0; 1g, g :;1 are alleles on the first haplotype, and g :;2 are on the other haplotype.We have b j = g j;1 3 a 1 + g j;2 3 a 2 : (Equation 9) The observed and expected b values should be the same in the ideal setting.To infer haplotype frequencies, we minimize the difference between observed and expected b values at all loci.Assume the number of small variant loci is n, and our optimization objective is (Equation 10) Based on the least squares method, the haplotype frequencies can be estimated by 11) 12) Based on the inferred haplotype frequencies, we then calculate the copy number of two alleles (ε 1 , ε 2 ) at each HLA locus.Calculating the copy number of HLA alleles should consider tumor purity because the cancer tissue is contaminated by normal cells.With tumor purity (r) and tumor ploidy (c) predicted by ABSOLUTE v1.2, 40 we have the following equations: (Equation 13) (Equation 14) The copy number of the two alleles is then ε 1 = a 1 3 r 3 j À a 1 3 r+a 2 3 r+a 1 À a 2 r ; (Equation 15) Cell Reports Methods 3, 100589, September 25, 2023 e5

Report
Assessing HLA sequence reconstruction We compared the full-resolution HLA typing utility of SpecHLA and HISAT-genotype in the HGSVC2 dataset.The extracted HLA allele sequences of HGSVC2 samples were used as ground truth to assess the sequence reconstruction accuracy.We counted the number of samples with reconstructed sequences for each HLA locus separately.In addition, we evaluated the exon reconstruction accuracy of SpecHLA in the HGSVC2 dataset.The alleles in the same G group have identical exons encoding the peptide binding domains (exon 2 and 3 for MHC class I and exon 2 only for MHC class II alleles).We compared the HLA typing accuracy of SpecHLA, Kourami, HISAT-genotype, HLA-VBSeq, HLA*LA, and HLA-HD at G group resolution.The methods were executed with default settings.The typing results of all methods were converted to G group resolution using the same G grouping file (date: 2023-01-12).The ground truth G group resolution HLA types were inferred by mapping the phased assemblies of HGSVC2 samples onto the HLA exons database using Minimap2. 42n assessing trio-consistency of two family trios, the exclusion of HISAT-genotype was necessary due to its inability to successfully recover alleles for all trio members of a given HLA locus, rendering an assessment of trio consistency unfeasible.To gauge sequence accuracy, we opted to select the most closely inferred sequence from the parental pool for each child's sequence.A comparative analysis of these two sequences was performed, employing metrics such as mismatch rate and gap rate.Notably, a lower mismatch rate and gap rate indicate a heightened level of trio consistency.
Additionally, the two trios were employed to assess SpecHLA's prowess in integrating pedigree relations for phasing.This evaluation involved separate runs of SpecHLA on the trios, both with and without pedigree relations.A subsequent comparison of sequence accuracy between these runs was undertaken.During the SpecHLA runs, the guidance of block linking from the database was intentionally omitted (-b 0).Moreover, to exemplify SpecHLA's ability to effectively integrate short and long reads, SpecHLA was also carried out with -b 0.

Computational resource comparison
We conducted a comparative analysis of the computational resource utilization across various HLA typing tools on a LINUX cluster featuring 8-core processors, a 143 MB cache, and a total of 16 TB global shared memory.We employed a random simulation involving a sample with PE150 reads, a sequencing depth of 50x, and a 1% sequencing error rate.Each tool was executed five times, and we documented both the CPU time and the peak RAM usage for each individual run of all the tools.Benchmark tool parameter setting We compared SpecHLA with ten state-of-the-art and widely used tools: Polysolver, OptiType, HLA-VBseq, Kourami, HISAT-genotype, HLA-HD, HLA*LA, arcasHLA, DASH, and LOHHLA.The default parameters were employed for running these tools (details provided in Table S3).Running LOHHLA to detect LOH was based on the HLA typing results of OptiType, as the authors of LOHHLA suggested.In the simulated datasets, the alleles were chosen from the database randomly, so we ran SpecHLA without considering the HLA population frequency using nonuse.We employed our read extraction procedure to obtain HLA-related reads for all the tools to save computational resources.The extracted reads were mapped to the HG19 reference as input for Polysolver and HLA*LA.

QUANTIFICATION AND STATISTICAL ANALYSIS
All procedures involving statistical analysis were described in the Method details section.The number of samples used in each experiment was described in both the figure legend and Results sections.

Figure 1 .4
Figure 1.The SpecHLA workflow (A) Extract the reads mapped to the HLA region and align them to the HLA allele database.(B) Bin reads to the HLA loci according to their sequence identity with known alleles.(C) Align the binned reads to the IMGT representative reference and perform local realignment.(D) Call SNVs and short indels using Freebayes and call long indels using ScanIndel.(E) Reconstruct the diploid sequences by phasing the variants.(F) Compare the reconstructed sequences with the database to report the official designations and compute sequence frequencies to detect LOH events.(G) Evaluation setting of SpecHLA on actual and simulated data.(H) Functionalities of state-of-the-art tools in HLA research.A checkmark (O) indicates that the software has the feature.PE refers to paired-end reads, while SE indicates single-end short reads.The term ''pedigree'' indicates the incorporation of pedigree information, while ''imbalance'' represents the adoption of genotype frequency in samples with allelic imbalance.See also Figure S1.

Figure 2 .
Figure 2. Validation of SpecHLA for 2-field and full-resolution HLA typing

Figure 3 .
Figure 3. Validation of SpecHLA for using different data types (A and B) Evaluation of the performance difference between SpecHLA with and without PacBio data across seven HGSVC2 individuals, quantified through the metrics of mismatch rate (A) and gap rate (B).''SpecHLA'' indicates only using PE data, and ''SpecHLA-hybrid'' means integrating PE and PacBio data.The y axis is scaled using a square root transformation.(C) The mismatch rate of SpecHLA with different data permutations of 50 simulated individuals.The plus symbol (+) indicates the integration of PE and the present data.(Dand E) The comparison of SpecHLA without and with pedigree relations in two family trios from the 1000 Genomes Project, measured by mismatch rate (D) and gap rate (E).''SpecHLA'' indicates only using PE data, and ''SpecHLA-trio'' means integrating PE data and the pedigree relations.(F-H) Evaluation of SpecHLA for incorporating genotype frequency.''Read'' represents using only reads, ''hybrid'' represents using the integration of reads and genotype frequency, and ''geno'' represents using only genotype frequency.(F) 2-field HLA typing accuracy in 200 LOH samples generated by cancer cell lines and normal cell data.(G) Mismatch rate in 80 simulated allelic imbalance samples.In each sample, the depths of the two haplotypes are 203 and 803.(H) Mismatch rate of SpecHLA in 100 simulated novel-allele samples with different allelic depths, e.g., ''30_70'' indicates that the depths of the two haplotypes are 303 and 703.See also FigureS2.

Figure 4 .
Figure 4. Validation of SpecHLA for HLA LOH detection in 300 simulated samples from cancer cell line data The specificity and sensitivity of LOH detection were calculated separately for MHC class I and II genes.Only 200 out of the 300 samples harbored LOH events.Bar colors indicate different methods.DASH and LOHHLA are inapplicable for MHC class II genes.See also Figure S2.

TABLE
d RESOURCE AVAILABILITY B Lead contact B Materials availability B Data and code availability d METHOD DETAILS B Algorithm of SpecHLA B Incorporating genotype frequency to phase in allelic imbalance samples B Benchmark datasets B Evaluation methods d QUANTIFICATION AND STATISTICAL ANALYSIS