AIRR community curation and standardised representation for immunoglobulin and T cell receptor germline sets

Analysis of an individual’s immunoglobulin or T cell receptor gene repertoire can provide important insights into immune function. High-quality analysis of adaptive immune receptor repertoire sequencing data depends upon accurate and relatively complete germline sets, but current sets are known to be incomplete. Established processes for the review and systematic naming of receptor germline genes and alleles require specific evidence and data types, but the discovery landscape is rapidly changing. To exploit the potential of emerging data, and to provide the field with improved state-of-the-art germline sets, an intermediate approach is needed that will allow the rapid publication of consolidated sets derived from these emerging sources. These sets must use a consistent naming scheme and allow refinement and consolidation into genes as new information emerges. Name changes should be minimised, but, where changes occur, the naming history of a sequence must be traceable. Here we outline the current issues and opportunities for the curation of germline IG/TR genes and present a forward-looking data model for building out more robust germline sets that can dovetail with current established processes. We describe interoperability standards for germline sets, and an approach to transparency based on principles of findability, accessibility, interoperability, and reusability.

Motivation
Understanding and cataloguing receptor germline genes and allele sequences is critical to the analysis of AIRR data. While the human set is relatively well understood in outline, although probably still far from complete, the sets of other species, even those that are relatively closely studied, are at a much earlier stage. There is an urgent need to define a standardised format for listing such genes, so that they can be shared between researchers and easily consumed by software tools.
For V-genes, an IMGT-gapped sequence (i.e., a sequence delineated in accordance with the IMGT numbering scheme) is provided in AlleleDescription. Other delineations, such as Chothia and Kabat, can be provided via linked SequenceDelineationV objects. A GermlineSet brings together multiple AlleleDescriptions from the same locus to form a curated set. The schema assumes that germline sets will be published by multiple repositories. A germline set may be uniquely referenced by means of the germline_set_ref, which is a composite field containing the repository id, germline set label, and version.
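As an illustration of how a tool might consume the composite germline_set_ref, the following minimal sketch splits a reference into its three components. The colon-separated layout (repository:label:version), the example value, and the names GermlineSetRef and parse_germline_set_ref are assumptions made for this example; the schema should be consulted for the authoritative form.

```python
from typing import NamedTuple


class GermlineSetRef(NamedTuple):
    """The three components named in the text (hypothetical container)."""
    repository: str
    label: str
    version: str


def parse_germline_set_ref(ref: str) -> GermlineSetRef:
    # Assumption: the components are joined with ':' in the order
    # repository, germline set label, version.
    parts = ref.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'repository:label:version', got {ref!r}")
    return GermlineSetRef(*parts)


# Illustrative value only; not taken from any particular repository release.
ref = parse_germline_set_ref("OGRDB:Human_IGH:2021.11")
print(ref.repository, ref.label, ref.version)
```

Rejecting malformed references early, rather than guessing at missing components, keeps the set and version used for an analysis unambiguous.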

Gene and Allele Naming
AlleleDescription contains a label field, which should contain the accepted name for the allele, as determined by the authors/curators of the record. The Nomenclature Committee of the International Union of Immunological Societies (IUIS) allocates gene symbols for receptor genes, and, if a gene symbol has been allocated, this should be used as the label. Where a gene symbol has not been allocated (for example, because the gene or allele has only recently been discovered, or because the available evidence does not meet IUIS standards), a 'temporary label' should be used. It is anticipated that publishers of gene sets will provide mechanisms to issue these temporary labels, and to allow researchers to review the change history of AlleleDescriptions and GermlineSets. To provide consistency across research groups, the Germline Database Working Group of the AIRR Community is developing a community-wide approach to the allocation of temporary labels.

Genotypes
A GenotypeSet describes the specific receptor alleles found in a subject, and also identifies genes that are not found (this could be either because they are not present in the chromosomal locus, or because they are not expressed or expressed only at low levels).
Depending on the data available and the inference method used, genotypes may contain haplotyping information, which may be full or partial. As an example of partial haplotyping, the genotype may have been determined from genomic sequencing in which the sequence of the locus was assembled into contigs, but could not be fully assembled. In this case the co-location of alleles within each contig has been established, but co-location across the entire locus cannot be. Co-location is therefore indicated by means of the phasing parameter, which in this case would be assigned a different value for the alleles on each contig.
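The partial-haplotyping case can be sketched as follows: alleles that share a phasing value are known to be co-located, while nothing is asserted about alleles with different values. The record layout, the integer phasing values, the allele labels, and the helper names are illustrative assumptions for this example, not a statement of the schema.

```python
from collections import defaultdict

# Hypothetical genotype records: one phasing value per assembled contig.
documented_alleles = [
    {"label": "IGHV1-69*01", "phasing": 0},
    {"label": "IGHV3-23*01", "phasing": 0},
    {"label": "IGHV1-69*02", "phasing": 1},
    {"label": "IGHJ6*02", "phasing": 2},
]


def group_by_phasing(alleles):
    """Collect allele labels under the phasing value they were assigned."""
    groups = defaultdict(list)
    for allele in alleles:
        groups[allele["phasing"]].append(allele["label"])
    return dict(groups)


def co_located(alleles, label_a, label_b):
    """True only if some phasing group contains both labels."""
    return any(
        label_a in labels and label_b in labels
        for labels in group_by_phasing(alleles).values()
    )


print(group_by_phasing(documented_alleles))
print(co_located(documented_alleles, "IGHV1-69*01", "IGHV3-23*01"))
```

A False result from co_located means only that co-location has not been established, not that the two alleles lie on different chromosomes.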

MHC Genotypes
Similarly to the IG/TR genotypes, the MHCGenotype and MHCGenotypeSet objects describe the MHC alleles found in a subject. MHCGenotype objects assemble alleles from one class: MHC-I, MHC-II or MHC-nonclassical. The method used to determine the genotype can be provided in the mhc_genotyping_method field. As different methods might be used for the various classes, this field is located in the MHCGenotype object, not the MHCGenotypeSet.
The mhc_genotyping_method field allows free-text descriptions; however, data curators are asked to keep close to the following terms where applicable:
• PCR-based typing: Methods whose read-out is the amplification of specific sequences, but which do not provide sequence data by themselves. This includes SSP and SSOP.
• Sequencing-based typing: Clinical-grade NGS-based assays, providing high quality and resolution.
• Inference-based typing: Allele inference based on genome-wide DNA or RNA sequencing.

File Format Specification
Files are YAML/JSON with a structure defined below. Files should be encoded as UTF-8.
Identifiers are case-sensitive. Files should have the extension .yaml, .yml, or .json.

Germline Set File Structure
The Germline Set file has a standardised structure that is used by all top-level AIRR Schema Objects and defined by the DataFile schema. It is intended to contain all information necessary to annotate receptor sequences derived from a single germline locus, and to be directly usable by annotation tools and other processing software.
The file must contain a YAML or JSON representation of one or more GermlineSet objects, including the associated AlleleDescription objects. It may optionally include other associated objects: SequenceDelineationV, RearrangedSequence, UnrearrangedSequence, Acknowledgement. These should all be embedded into the overall GermlineSet as specified in the schema.
• The file as a whole is considered a dictionary (key/value pair) structure with the keys Info, GermlineSet, and AlleleDescription.
• The GermlineSet contains the fields release_version, release_description and release_date, which are intended to be used for version identification, under the control of the authors of the GermlineSet as identified by the fields author, lab_name and lab_address. If the set is modified by a party other than these authors, these six fields should be modified to reflect the authors of the modification and their own version identification. These modifications MUST be made if the GermlineSet is, or is likely to become, public, in order to avoid confusion with the original set prior to modification. Repositories are encouraged to manage version fields automatically.
• The file can (optionally) contain an Info object, at the beginning of the file, based upon the Info schema in the OpenAPI specification. If provided, version in Info should reference the version of the AIRR schema for the file.
• The file should contain a list of GermlineSet objects, using GermlineSet as the key to the list.
• The file should contain a list of AlleleDescription objects, using AlleleDescription as the key to the list.
• There should be only one AlleleDescription for each allele in the list.
• Each AlleleDescription object should contain a top-level key/value pair for allele_description_id that uniquely identifies the allele description object in the file.
• Each GermlineSet object should contain a top-level key/value pair for germline_set_id that uniquely identifies the germline set object in the file.
• Some fields require the use of a particular ontology or controlled vocabulary.
• GermlineSet and AlleleDescription contain reference fields germline_set_ref and allele_description_ref. These are intended to be globally unique references (containing identifiers of the repository, object and version) that can be used in a query API.
• The structure is the same regardless of whether the data is stored in a file or retrieved from a data repository. For example, the ADC API will return a properly structured JSON object that can be saved to a file and used directly without modification.
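Some of the structural rules listed above lend themselves to a simple automated check. The sketch below, which uses only the Python standard library and a JSON document, tests that the GermlineSet and AlleleDescription keys hold lists and that the germline_set_id and allele_description_id values are present and unique within the file. The function name find_problems and the example document are hypothetical, and the check is deliberately partial: it does not validate against the AIRR schema itself.

```python
import json
from collections import Counter

# Top-level list keys and the identifier each entry is expected to carry.
ID_FIELDS = {
    "GermlineSet": "germline_set_id",
    "AlleleDescription": "allele_description_id",
}


def find_problems(doc):
    """Return a list of human-readable problems; empty if none are found."""
    problems = []
    for key, id_field in ID_FIELDS.items():
        objects = doc.get(key)
        if not isinstance(objects, list):
            problems.append(f"'{key}' should be a list")
            continue
        ids = [obj.get(id_field) for obj in objects]
        if any(value is None for value in ids):
            problems.append(f"a {key} entry lacks {id_field}")
        counts = Counter(value for value in ids if value is not None)
        for value, count in counts.items():
            if count > 1:
                problems.append(f"{id_field} {value!r} appears {count} times")
    return problems


# Hypothetical, heavily abbreviated file content for illustration.
example = json.loads("""
{
  "Info": {"title": "example", "version": "x.y"},
  "GermlineSet": [{"germline_set_id": "set-1"}],
  "AlleleDescription": [
    {"allele_description_id": "allele-1"},
    {"allele_description_id": "allele-1"}
  ]
}
""")

for problem in find_problems(example):
    print(problem)
```

Because the structure is the same for files and for repository responses, the same check could be applied to a JSON object returned by an API before it is saved.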