Data on haplotype-supported immunoglobulin germline gene inference

Data relating to IGHV (immunoglobulin heavy chain variable) germline gene inference, based on sequences of IgM-encoding transcriptomes obtained by Illumina MiSeq sequencing, are described. Such inference is used to establish personalized germline gene sets for in-depth antibody repertoire studies and to detect new antibody germline genes in widely available immunoglobulin-encoding transcriptome data sets. Specifically, the data have been used to validate the inference process, as reported in "Parallel antibody germline gene and haplotype analyses support the validity of immunoglobulin germline gene inference and discovery" (DOI: 10.1016/j.molimm.2017.03.012) (Kirik et al., 2017) [1]. Validation was based on analysis of the association of the inferred germline genes with the donors' different haplotypes, as defined by their different expressed IGHJ alleles and/or IGHD genes/alleles. The data are important for the development of validated germline gene databases containing entries inferred from immunoglobulin-encoding transcriptome sequencing data sets, and for the generation of valid, personalized antibody germline gene repertoires.


Value of the data
The data are valuable for the development of computational inference approaches that give improved confidence in the outcome of the inference process.
The data are valuable for the development of validated immunoglobulin germline gene databases.
The data are valuable for the validation of computational inference of personalized antibody germline gene repertoires.
The data are valuable for the analytical process that precedes studies of the evolution of immune responses.

Data
The data of this article summarize the identity and accession numbers of the sequencing data files (Table 1), the sizes of the sequence sets at the different stages of data processing (Table 2), and the outcome of validation of newly inferred genes/alleles identified using IgDiscover and TIgGER (Table 3). The frequencies of readily inferable [2] IGHD (immunoglobulin heavy D-gene) genes used by the two haplotypes of five subjects are summarized (Table 4). Furthermore, the data illustrate the effect on gene inference of using a germline gene database that extends beyond codon 105 (Fig. 1), and summarize the outcome of TIgGER-based germline gene inference from six transcriptomes (Fig. 2). The data also illustrate how low sequencing quality scores are associated with some, but not all, inferred germline gene alleles (Fig. 3), and summarize the IGHJ (immunoglobulin heavy J-gene) alleles used in the transcriptomes of six subjects (Fig. 4). The link between inferred IGHV (immunoglobulin heavy V-gene) germline genes/alleles and different alleles of IGHJ6 in bone marrow (BM)- and peripheral blood (PB)-derived transcriptomes of two heterozygous subjects is shown (Fig. 5). The data summarize the linkage of different IGHD genes to two different haplotypes defined either by alleles of IGHJ6 or by heterozygous IGHV genes (Fig. 6). The linkage of IGHV1-8, IGHV3-9, IGHV5-10-1, and IGHV3-64D germline genes to different haplotypes in subjects with two different IGHD gene-defined haplotypes is shown (Fig. 7). The association of IGHV germline genes/alleles with particular IGHD genes in five subjects with different IGHD-defined haplotypes is shown (Fig. 8), as is the extent of association of alleles of IGHV4-59 with particular IGHD genes (Fig. 9). Finally, data describing the assessment of alleles of IGHD genes detected in IgM-encoding transcriptomes of six subjects (Fig. 10), and of IGHV germline genes associated with the different alleles of IGHD genes in two subjects (Fig. 11), are shown.
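The haplotype linkage summarized for Fig. 5 rests on a simple counting idea: in a subject heterozygous for IGHJ6, an IGHV allele located on only one chromosome should rearrange almost exclusively with one of the two IGHJ6 alleles, whereas an allele present on both chromosomes should be found with both. The following is a minimal sketch of that idea, not the procedure used in Ref. [1]. The function name, the input format (pairs of V and J allele calls), and the toy read counts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def v_allele_linkage(records, anchor_alleles, min_reads=1):
    """Return, per IGHV allele, the fraction of reads paired with each anchor allele.

    records        -- iterable of (v_allele, j_allele) calls, one per rearrangement
    anchor_alleles -- the two IGHJ alleles that define the haplotypes
    min_reads      -- hypothetical cut-off; V alleles with fewer anchored reads are dropped
    """
    counts = defaultdict(Counter)
    for v_allele, j_allele in records:
        # Only rearrangements that use one of the anchor alleles are informative.
        if j_allele in anchor_alleles:
            counts[v_allele][j_allele] += 1

    linkage = {}
    for v_allele, per_anchor in counts.items():
        total = sum(per_anchor.values())
        if total >= min_reads:
            linkage[v_allele] = {j: per_anchor[j] / total for j in anchor_alleles}
    return linkage


# Invented toy data, not taken from the data sets described in this article.
records = (
    [("IGHV1-8*01", "IGHJ6*02")] * 9
    + [("IGHV1-8*01", "IGHJ6*03")] * 1
    + [("IGHV3-9*01", "IGHJ6*03")] * 4
    + [("IGHV4-59*01", "IGHJ6*02")] * 5
    + [("IGHV4-59*01", "IGHJ6*03")] * 5
    + [("IGHV1-2*02", "IGHJ4*02")] * 3  # uses no anchor allele, so it is ignored
)
linkage = v_allele_linkage(records, ("IGHJ6*02", "IGHJ6*03"))
```

In this toy example, a strongly skewed fraction (IGHV1-8*01) suggests linkage to one haplotype, while an even split (IGHV4-59*01) suggests presence on both. Real data would additionally need a read-count threshold and some tolerance for the residual cross-association caused by PCR or sequencing artefacts.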

Experimental design, materials and methods
IgM heavy chain variable domain-encoding gene repertoires were isolated by RT-PCR from transcriptomes of PB and BM collected, outside the season of most seasonal allergens, from six allergic subjects [3]. Ethical approval had been obtained, as had informed consent from all donors. Sequencing was performed using the 2 × 300 bp MiSeq technology (Illumina, Inc., San Diego, CA, USA) at the National Genomics Infrastructure (SciLifeLab, Stockholm, Sweden) [3]. Details of sequence output and availability are outlined in Table 1. Data were pre-processed using pRESTO [4] and Change-O [5] as summarized in Fig. 1 in Ref. [1]. Germline gene inference was performed using TIgGER [6] and IgDiscover [7]. Additional bioinformatics analysis was performed as outlined elsewhere [1], including analysis performed using GIgGle (release 0.2), which is available under the Apache License at https://github.com/ukirik/giggle. Immunoglobulin gene names and sequence numbering comply with the nomenclature defined by the international ImMunoGeneTics information system® (IMGT) (http://www.imgt.org) [8,9].

Table 3. Summary of sequence variants of germline genes not present in the IMGT germline gene database but inferred from BM transcript data using IgDiscover or TIgGER.

Table 4. Estimated frequency * of use of readily identified IGHD germline genes [2] in haplotypes of five lymphocyte donors, and the ratio of estimated frequency † of these genes in the two haplotypes.

Fig. 2 in Ref. [1]) and peripheral blood (PB). † indicates that only one name is shown for a set of different alleles of the gene that cannot be differentiated by the analysis approach.

Fig. 7. Linkage of IGHV1-8*01, IGHV3-64D*06, IGHV3-9*01, and IGHV5-10-1*01 to different IGHD genes in transcripts of donors 1, 3, and 5. While germline genes IGHV1-8*01 and IGHV3-9*01 were linked to the haplotype also carrying IGHD genes not present on both haplotypes, IGHV3-64D*06 and IGHV5-10-1*01 were not.

Fig. 8. Association of IGHV genes/alleles of donors 1 (A) and 3-6 (B-E) with different IGHD genes as indicators of association with different haplotypes represented by IGHD. Analysis was performed on sequences found in cells of PB using the final filtered output of IgDiscover (diff = 0). Only IGHV genes/alleles represented by at least 50 sequences with V_errors = 0 and D_coverage > 35 in the IGHD gene set shown in dark blue are shown. The frequencies of IGHV sequences associated with IGHD genes found in both haplotypes are shown in blue, while the corresponding frequencies of IGHV sequences associated with IGHD genes expressed from only one of the inferred haplotypes are shown in red. † indicates that only one name is shown for a set of different alleles of the gene that cannot be differentiated by the analysis approach.

Fig. 9. Differential association of inferred alleles of IGHV4-59 with different haplotypes of IGHD of donors 1 (A), 3 (B), and 6 (C). The frequencies of sequences associated with IGHD genes apparently expressed from both haplotypes are shown in blue, while the frequencies of sequences associated with IGHD genes apparently expressed from only one of the haplotypes are shown in red. The fraction of reads represented by IGHV4-59*01 (blue) and *08 (green) in all three subjects is shown (fraction of sequences to the left and fraction of unique CDR3 to the right) (D). † indicates that only one name is shown for a set of different alleles of the gene that cannot be differentiated by the analysis approach.

Fig. 11. Immunoglobulin IGHV gene haplotype analysis based on heterozygous presence of IGHD alleles of donor 1 (A, B) and donor 5 (C, D). Transcripts found in BM (A, C) and PB (B, D) were analysed. The analysis of transcripts derived from PB employing IGHD2-21 was not included due to the low number of such sequences. Detailed sequence analysis (E) may be used to define whether or not IGHD allele assignments are appropriate. The rare reads in some BM-derived transcripts of donor 1 that associate IGHV1-2*02 with IGHD2-21*01 (grey) instead of the expected IGHD2-21*02 (black) (see A) do not cover the base within the IGHD that defines the individual alleles. IGHD2-21 allele calls for both alleles of IGHV4-59*01 include the allele-differentiating base, and rearrangements involving IGHV4-59*08 include the base identifying IGHD2-21*02. The arrow indicates the only base that differentiates IGHD2-21*01 and *02. Mutated bases within the sequences derived from IGHD genes are spelled out. † indicates that only one name is shown for a set of different alleles of the gene that cannot be differentiated by the analysis approach.
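The selection criteria stated in the caption of Fig. 8 (sequences without V errors, with sufficient D coverage, and IGHV alleles represented by at least 50 such sequences) can be illustrated with a small filter. This is a minimal sketch under stated assumptions, not the code used for the figure. The dictionary keys `V_gene` and `D_gene`, the function names, the simplified reading of the 50-sequence rule (counted over all sequences that pass the filters), and the toy rows are all hypothetical; `V_errors` and `D_coverage` are the field names quoted in the caption.

```python
from collections import Counter


def passes_filters(row, min_d_coverage=35.0):
    """Keep rows with an error-free V assignment and D_coverage above the cut-off."""
    return row["V_errors"] == 0 and row["D_coverage"] > min_d_coverage


def one_haplotype_d_fraction(rows, one_haplotype_d_genes, min_sequences=50):
    """Per IGHV allele, the fraction of filtered reads using a D gene found on one haplotype only.

    one_haplotype_d_genes -- assumed, user-supplied set of IGHD genes expressed
                             from only one of the inferred haplotypes
    """
    kept = [row for row in rows if passes_filters(row)]
    total = Counter(row["V_gene"] for row in kept)
    specific = Counter(
        row["V_gene"] for row in kept if row["D_gene"] in one_haplotype_d_genes
    )
    # Report only IGHV alleles supported by enough filtered sequences.
    return {v: specific[v] / n for v, n in total.items() if n >= min_sequences}


def make_rows(v_gene, d_gene, v_errors, d_coverage, n):
    """Build n identical toy rows (invented data for illustration only)."""
    return [
        {"V_gene": v_gene, "D_gene": d_gene, "V_errors": v_errors, "D_coverage": d_coverage}
        for _ in range(n)
    ]


rows = (
    make_rows("IGHV1-8*01", "IGHD2-8", 0, 90.0, 45)
    + make_rows("IGHV1-8*01", "IGHD3-10", 0, 90.0, 15)
    + make_rows("IGHV3-9*01", "IGHD2-8", 0, 90.0, 40)   # fewer than 50 sequences
    + make_rows("IGHV1-8*01", "IGHD2-8", 2, 90.0, 10)   # removed: V errors present
    + make_rows("IGHV1-8*01", "IGHD2-8", 0, 20.0, 10)   # removed: D coverage too low
)
fractions = one_haplotype_d_fraction(rows, {"IGHD2-8"})
```

With these toy rows, only IGHV1-8*01 is reported, at a fraction of 0.75. A fraction far above or below that seen for IGHV alleles known to be present on both chromosomes would point to linkage with one IGHD-defined haplotype.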