IMGT, the international ImMunoGeneTics information system®: a standardized approach for immunogenetics and immunoinformatics

IMGT, the international ImMunoGeneTics information system®, was created in 1989 by the Laboratoire d'ImmunoGénétique Moléculaire (LIGM) (Université Montpellier II and CNRS) at Montpellier, France. IMGT is a high quality integrated knowledge resource specialized in immunoglobulins (IG), T cell receptors (TR), major histocompatibility complex (MHC) of human and other vertebrates, and related proteins of the immune system (RPI) of any species which belong to the immunoglobulin superfamily (IgSF) and to the MHC superfamily (MhcSF). IMGT consists of five databases, ten on-line tools and more than 8,000 HTML pages of Web resources. IMGT provides a common access to standardized data from genome, genetics, proteome and three-dimensional structures. The accuracy and the consistency of IMGT data are based on IMGT-ONTOLOGY, a semantic specification of terms to be used in immunogenetics and immunoinformatics. IMGT-ONTOLOGY comprises six main concepts: IDENTIFICATION, CLASSIFICATION, DESCRIPTION, NUMEROTATION, ORIENTATION and OBTENTION. Based on these concepts, the controlled vocabulary and the annotation rules necessary for the immunogenetics data identification, classification, description and numbering and for the management of IMGT knowledge are defined in the IMGT Scientific chart. IMGT is the international reference in immunogenetics and immunoinformatics for medical research (repertoire analysis of the IG antibody sites and of the TR recognition sites in autoimmune and infectious diseases, AIDS, leukemias, lymphomas, myelomas), veterinary research (IG and TR repertoires in farm and wild life species), genome diversity and genome evolution studies of the adaptive immune responses, biotechnology related to antibody engineering (single chain Fragment variable (scFv), phage displays, combinatorial libraries, chimeric, humanized and human antibodies), diagnostics (detection and follow up of residual diseases) and therapeutical approaches (grafts, immunotherapy, vaccinology). 
IMGT is freely available at http://imgt.cines.fr.


Introduction
IMGT, the international ImMunoGeneTics information system® (http://imgt.cines.fr) [1,2], was created in 1989 by Marie-Paule Lefranc at the Laboratoire d'ImmunoGénétique Moléculaire (LIGM) (Université Montpellier II and CNRS) at Montpellier, France, in order to standardize and manage the complexity of immunogenetics data. Fifteen years later, IMGT is the international reference in immunogenetics and immunoinformatics, and provides a high quality integrated knowledge resource specialized in the immunoglobulins (IG), T cell receptors (TR) and major histocompatibility complex (MHC) of human and other vertebrates, and in related proteins of the immune system (RPI) of any species which belong to the immunoglobulin superfamily (IgSF) and to the MHC superfamily (MhcSF) [3][4][5][6][7][8][9][10][11][12][13]. The number of potential protein forms of the antigen receptors, IG and TR, is almost unlimited. The potential repertoire of each individual is estimated to comprise about 10^12 different IG (or antibodies) and TR, and the only limiting factor is the number of B and T cells that an organism is genetically programmed to produce. This huge diversity is inherent to the particularly complex and unique molecular synthesis and genetics of the antigen receptor chains. This includes biological mechanisms such as DNA molecular rearrangements in multiple loci (three for IG and four for TR in humans) located on different chromosomes (four in humans), nucleotide deletions and insertions at the rearrangement junctions (or N-diversity), and somatic hypermutations in the IG loci (see FactsBooks [3,4] for review). Although IMGT was initially implemented for the IG, TR and MHC of human and other vertebrates [6], data and knowledge management standardization, based on the IMGT unique numbering [14][15][16][17][18][19], has now been extended to the IgSF [15][16][17][20][21][22] and MhcSF [18,23,24] of any species.
Thus, standardization in IMGT contributed to data enhancement of the system and new expertised data concepts were readily incorporated.

IMGT-ONTOLOGY concepts and IMGT Scientific chart rules
The IMGT Scientific chart [2] comprises the controlled vocabulary and the annotation rules necessary for the immunogenetics data identification, description, classification and numbering and for knowledge management in the IMGT information system. Standardized keywords, labels and annotation rules, standardized IG and TR gene nomenclature, the IMGT unique numbering, and standardized origin/methodology were defined, respectively, based on the six main concepts of IMGT-ONTOLOGY: IDENTIFICATION, CLASSIFICATION, DESCRIPTION, NUMEROTATION, ORIENTATION and OBTENTION [2,5] (Table 1). The IMGT Scientific chart is available as a section of the IMGT Web resources (IMGT Marie-Paule page). Examples of IMGT expertised data concepts derived from the IMGT Scientific chart rules are shown in Table 1.

IMGT sequence databases, tools and Web resources
IMGT sequence databases, tools and Web resources correspond to the IMGT genetics approach that refers to the study of genes in relation with their polymorphisms, mutations, expression, specificity and evolution (Table 2). The IMGT sequence knowledge management and the IMGT genetics approach heavily rely on the DESCRIPTION concept (and particularly on the V-REGION, D-REGION, J-REGION and C-REGION core concepts for the IG and TR), on the CLASSIFICATION concept (gene and allele concepts) and on the NUMEROTATION concept (IMGT unique numbering [14][15][16][17][18]).

IMGT sequence databases
IMGT/LIGM-DB
IMGT/LIGM-DB is the comprehensive IMGT database of IG and TR nucleotide sequences from human and other vertebrate species, with translation for fully annotated sequences [7]. It was created in 1989 by LIGM (Montpellier, France), and has been on the Web since July 1995 [6]. In August 2005, IMGT/LIGM-DB contained more than 96,500 sequences of 150 vertebrate species [7]. The sole source of data for IMGT/LIGM-DB is EMBL, which shares data with the other two generalist databases GenBank and DNA DataBank of Japan (DDBJ). Based on expert analysis, specific detailed annotations are added to IMGT flat files. The annotation procedure includes the IDENTIFICATION of the sequences, the CLASSIFICATION of the IG and TR genes and alleles, and the DESCRIPTION of all IG and TR specific and constitutive motifs within the nucleotide sequences. The Web interface allows searches according to immunogenetic specific criteria and requires no knowledge of a computing language. The selection criteria are displayed at the top of the result pages, so users can check their own queries. Users can modify their request or consult the results with a choice of nine possibilities. The IMGT/LIGM-DB annotations (gene and allele name assignment, labels) allow data retrieval not only from IMGT/LIGM-DB, but also from other IMGT databases. Thus, the IMGT/LIGM-DB accession numbers of the cDNA expressed sequences for each human and mouse IG and TR gene are available, with direct links to IMGT/LIGM-DB, in the IMGT/GENE-DB entries. IMGT/LIGM-DB data are also distributed by anonymous FTP servers at CINES ftp://ftp.cines.fr/IMGT/ and EBI ftp://ftp.ebi.ac.uk/pub/databases/imgt/ and from many Sequence Retrieval System (SRS) sites http://imgt.cines.fr/textes/IMGTotheraccesses.html. IMGT/LIGM-DB can be searched by BLAST or FASTA on different servers (EBI, IGH, INFOBIOGEN, Institut Pasteur, etc.).
Figure 1. IMGT, the international ImMunoGeneTics information system® (http://imgt.cines.fr). Databases and tools for sequences, genes and structures are in green, yellow and blue, respectively. The IMGT Repertoire and other Web resources are not shown. Interactions in the genetics, genomics and structural approaches are represented with dotted, continuous and broken lines, respectively.
Table 2 (fragment): IMGT sequence analysis tools [1]; IMGT Repertoire "Proteins and alleles" section [2] (2).
(1) IMGT/Automat [29,30] is an integrated internal IMGT Java tool which automatically performs the annotation of rearranged cDNA sequences, which represent half of the IMGT/LIGM-DB content. So far 7,418 human and mouse IG and TR cDNA sequences have been automatically annotated by the IMGT/Automat tool, with annotations being as reliable and accurate as those provided by a human annotator.

IMGT sequence analysis tools
The IMGT sequence analysis tools comprise IMGT/V-QUEST [10], for the identification of the V, D and J genes and of their mutations, IMGT/JunctionAnalysis [11] for the analysis of the V-J and V-D-J junctions which confer the antigen receptor specificity, IMGT/Allele-Align for the detection of polymorphisms, and IMGT/PhyloGene [12] for gene evolution analyses.

IMGT/V-QUEST
IMGT/V-QUEST (V-QUEry and STandardization) is an integrated software for IG and TR [10]. This easy-to-use tool analyses an input IG or TR germline or rearranged variable nucleotide sequence. The IMGT/V-QUEST results comprise the identification of the V, D and J genes and alleles and the nucleotide alignments by comparison with sequences from the IMGT reference directory, the FR-IMGT and CDR-IMGT delimitations based on the IMGT unique numbering, the translation of the input sequence, the display of nucleotide and amino acid mutations compared to the closest IMGT reference sequence, the identification of the JUNCTION and results from IMGT/JunctionAnalysis (default option), and the two-dimensional (2D) IMGT Collier de Perles representation of the V-REGION [10] ("IMGT/V-QUEST output" in IMGT/V-QUEST Documentation).

IMGT/JunctionAnalysis
IMGT/JunctionAnalysis [11] is a tool, complementary to IMGT/V-QUEST, which provides a thorough analysis of the V-J and V-D-J junction of IG and TR rearranged genes. IMGT/JunctionAnalysis identifies the D-GENEs and alleles involved in the IGH, TRB and TRD V-D-J rearrangements by comparison with the IMGT reference directory, and delimits precisely the P, N and D regions [11] ("IMGT/JunctionAnalysis output results" in IMGT/JunctionAnalysis Documentation). Several hundred junction sequences can be analysed simultaneously.

IMGT/Allele-Align
IMGT/Allele-Align is used for the detection of polymorphisms. It allows the comparison of two alleles highlighting the nucleotide and amino acid differences.
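As an illustration only, the kind of comparison described above can be sketched in Python. This is not the IMGT/Allele-Align implementation: the function name `compare_alleles`, the requirement that the two sequences be already aligned, gap-free, in frame and of equal length, and the two toy sequences are all assumptions made for the example. Positions are plain 1-based sequence positions, not IMGT unique numbering.

```python
from itertools import product

# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def compare_alleles(seq_a, seq_b):
    """Return (nucleotide differences, amino acid differences) between two
    pre-aligned, in-frame coding sequences of equal length (hypothetical helper)."""
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, in frame and of equal length")

    # Nucleotide differences as (1-based position, base in A, base in B).
    nt_diffs = [(i + 1, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]

    # Amino acid differences as (1-based codon number, residue in A, residue in B).
    aa_diffs = []
    for i in range(0, len(seq_a), 3):
        aa_a = CODON_TABLE[seq_a[i:i + 3]]
        aa_b = CODON_TABLE[seq_b[i:i + 3]]
        if aa_a != aa_b:
            aa_diffs.append((i // 3 + 1, aa_a, aa_b))
    return nt_diffs, aa_diffs


# Hypothetical toy sequences, not taken from any IMGT entry.
allele_1 = "TGTGCGAGA"  # translates to C A R
allele_2 = "TGTGCAAAA"  # translates to C A K
nt_diffs, aa_diffs = compare_alleles(allele_1, allele_2)
print(nt_diffs)
print(aa_diffs)
```

In this toy case one of the two nucleotide differences is silent (codon 2) and the other changes the encoded amino acid (codon 3), which is the distinction a polymorphism report needs to make visible.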

IMGT/PhyloGene
IMGT/PhyloGene [12] is an easy to use tool for phylogenetic analysis of variable region (V-REGION) and constant domain (C-DOMAIN) sequences. This tool is particularly useful in developmental and comparative immunology. The users can analyse their own sequences by comparing with the IMGT standardized reference sequences for human and mouse IG and TR [12] (IMGT/PhyloGene Documentation).

IMGT gene databases, tools and Web resources
IMGT gene databases, tools and Web resources correspond to the IMGT genomics approach that refers to the studies of the genes within their loci and on their chromosome [2] (Table 3).

IMGT/GENE-DB, the IMGT gene database
Genomic data are managed in IMGT/GENE-DB, which is the comprehensive IMGT genome database [8]. The human IG and TR gene names [3,4] were approved by the Human Genome Organisation (HUGO) Nomenclature Committee (HGNC) in 1999 [27], and entered in IMGT/GENE-DB [8], Genome DataBase GDB (Canada) [32], LocusLink and Entrez Gene at NCBI (USA) [33], and GeneCards [34]. Reciprocal links exist between IMGT/GENE-DB and the generalist nomenclature (HGNC Genew) and genome databases (GDB, LocusLink and Entrez at NCBI, and GeneCards). All the mouse IG and TR gene names with IMGT reference sequences were provided by IMGT to HGNC and to the Mouse Genome Database (MGD) [35] in July 2002. Queries in IMGT/GENE-DB can be performed according to IG and TR gene classification criteria. IMGT reference sequences have been defined for each allele of each gene based on one or, whenever possible, several of the following criteria: germline sequence, first sequence published, longest sequence, mapped sequence [2]. IMGT/GENE-DB interacts dynamically with IMGT/LIGM-DB [7] to download and display gene-related sequence data. As an example, and as mentioned earlier, the IMGT/GENE-DB entries provide the IMGT/LIGM-DB accession numbers of the IG and TR cDNA sequences which contain a given V, D, J or C gene. This is the first example of an interaction between IMGT databases using the CLASSIFICATION concept.

IMGT gene analysis tools
The IMGT gene analysis tools comprise IMGT/LocusView, IMGT/GeneView, IMGT/GeneSearch, IMGT/CloneSearch and IMGT/GeneInfo. IMGT/LocusView and IMGT/GeneView manage the locus organization and the gene location and provide the display of physical maps for the human IG, TR and MHC loci and for the mouse TRA/TRD locus. IMGT/LocusView allows users to view genes in a locus and to zoom in on a given area. IMGT/GeneView allows users to view a given gene in a locus. IMGT/GeneSearch allows users to search for genes in a locus based on IMGT gene names, functionality or localization on the chromosome. IMGT/CloneSearch provides information on the clones that were used to build the locus contigs displayed in IMGT/LocusView (accession numbers are from IMGT/LIGM-DB, gene names from IMGT/GENE-DB, and clone position and orientation, and overlapping clones from IMGT/LocusView). IMGT/GeneInfo [13] provides and displays information on the potential TR rearrangements in human and mouse.

IMGT gene Web resources
The IMGT gene Web resources are compiled in the IMGT Repertoire "Locus and genes" section that includes Chromosomal localizations, Locus representations, Locus description, Gene exon/intron organization, Gene exon/intron splicing sites, Gene tables, Potential germline repertoires, the complete lists of human and mouse IG and TR genes, and the correspondences between nomenclatures [3,4] (Table 3). The IMGT Repertoire "Probes and RFLP" section provides additional data on gene insertion/deletion.

IMGT structure database, tool and Web resources
The IMGT structural approach refers to the study of the 2D and 3D structures of the IG, TR, MHC and RPI, and to the antigen or ligand binding characteristics in relation with the protein functions, polymorphisms and evolution (Table 4). The structural approach relies on the CLASSIFICATION concept (IMGT gene and allele names), the DESCRIPTION concept (receptor and chain description, domain delimitations), and the NUMEROTATION concept (IMGT unique numbering).
Table 4 (fragment): ... [3,4,16,17,19] (1); 2D Colliers de Perles MHC [18,36]; 2D Colliers de Perles RPI [16-18,21,22,24,37]; IMGT classes for amino acid characteristics [31]; IMGT Colliers de Perles reference profiles [31]; 3D representations.
(1) Amino acids are shown in the one-letter abbreviation. Arrows indicate the direction of the beta strands that form the two beta sheets of the immunoglobulin fold [3,4]. Hatched circles correspond to missing positions according to the IMGT unique numbering [16,17]. In the IMGT Collier de Perles on the IMGT Web site http://imgt.cines.fr, hydrophobic amino acids (hydropathy index with positive value) and Tryptophan (W) found at a given position in more than 50% of analysed IG and TR sequences are shown in blue, and all Proline (P) are shown in yellow.
Structural and functional domains of the IG and TR chains comprise the variable domain or V-DOMAIN (9-strand beta-sandwich), which corresponds to the V-J-REGION or V-D-J-REGION and is encoded by two or three genes [3,4], the constant domain or C-DOMAIN (7-strand beta-sandwich), and, for the MHC chains, the groove domain or G-DOMAIN (four beta-strands and one alpha-helix). A uniform numbering system for IG and TR V-DOMAINs of all vertebrate species has been established to facilitate sequence comparison and cross-referencing between experiments from different laboratories whatever the antigen receptor (IG or TR), the chain type, or the species [14][15][16]. In the IMGT unique numbering, conserved amino acids from frameworks always have the same number whatever the IG or TR variable sequence, and whatever the species they come from. Examples are Cysteine 23 (in FR1-IMGT), Tryptophan 41 (in FR2-IMGT), hydrophobic amino acid 89 and Cysteine 104 (in FR3-IMGT) (Figure 2). This numbering has been applied with success to all the sequences belonging to the V-set of the IgSF [20], including non-rearranging sequences in vertebrates (human CD4, Xenopus CTXg1, etc.) and in invertebrates (drosophila amalgam, drosophila fasciclin II, etc.) [15,16,21]. The IMGT unique numbering, initially defined for the V-DOMAINs of the IG and TR and for the V-LIKE-DOMAINs of IgSF proteins other than IG and TR, has been extended to the C-DOMAINs of the IG and TR (Figure 2B), and to the C-LIKE-DOMAINs of IgSF proteins other than IG and TR [17]. An IMGT unique numbering has also been implemented for the groove domain (G-DOMAIN) of the MHC class I and II chains (Figure 3), and for the G-LIKE-DOMAINs of MhcSF proteins other than MHC [18].
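Because the conserved framework positions keep the same number in every V-DOMAIN, a numbered sequence can be checked against them with a simple lookup. The Python sketch below illustrates this under stated assumptions: the input is taken to be already numbered (a mapping from IMGT position to one-letter amino acid), the helper name `check_conserved` and the toy domains are hypothetical, and the hydrophobic set used for position 89 is assumed to be the residues with a positive Kyte-Doolittle hydropathy index. It is not an IMGT tool and does not perform the numbering itself.

```python
# Conserved V-DOMAIN positions named in the text (IMGT unique numbering).
# Assumption: "hydrophobic" means a positive Kyte-Doolittle hydropathy index.
HYDROPHOBIC = set("ACFILMV")
CONSERVED = {
    23: ("FR1-IMGT", {"C"}),
    41: ("FR2-IMGT", {"W"}),
    89: ("FR3-IMGT", HYDROPHOBIC),
    104: ("FR3-IMGT", {"C"}),
}


def check_conserved(numbered_domain):
    """Return (position, region, residue) for each conserved position that is
    missing (residue is None) or carries an unexpected amino acid."""
    problems = []
    for pos, (region, allowed) in sorted(CONSERVED.items()):
        residue = numbered_domain.get(pos)
        if residue not in allowed:
            problems.append((pos, region, residue))
    return problems


# Hypothetical, heavily truncated toy inputs: only the four positions of interest.
toy_domain = {23: "C", 41: "W", 89: "L", 104: "C"}
toy_variant = {23: "C", 41: "W", 89: "L", 104: "Y"}

print(check_conserved(toy_domain))
print(check_conserved(toy_variant))
```

The same lookup works for IG or TR, any chain type and any species, which is the practical benefit of a numbering in which conserved positions never shift.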
Hatched circles correspond to missing positions according to the IMGT unique numbering [18]. Positions in colour correspond to the IMGT contact sites provided, for each peptide/MHC 3D structure, in IMGT/3Dstructure-DB [36].

IMGT/StructuralQuery tool
The IMGT/StructuralQuery tool [9] analyses the interactions of the residues of the antigen receptors IG and TR, MHC, RPI, antigens and ligands. The contacts are described per domain (intra- and inter-domain contacts) and annotated in terms of IMGT labels (chains, domain), positions (IMGT unique numbering), and backbone or side-chain implication [37]. IMGT/StructuralQuery allows the retrieval of IMGT/3Dstructure-DB entries based on specific structural characteristics: phi and psi angles, accessible surface area (ASA), amino acid type, distance in angstrom between amino acids, CDR-IMGT lengths.
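To make the "distance in angstrom between amino acids" criterion concrete, the following Python sketch lists residue pairs whose closest atoms lie within a cutoff. It is a minimal illustration, not the IMGT/StructuralQuery algorithm: the function name `residue_contacts`, the 4.0 angstrom default cutoff, the use of the minimum atom-to-atom distance, the (chain, position) keys and the toy coordinates are all assumptions.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def residue_contacts(residues, cutoff=4.0):
    """List residue pairs whose closest atoms are within `cutoff` angstroms.

    `residues` maps a (chain, position) key to a list of (x, y, z) atom
    coordinates. Returns (key_a, key_b, distance) tuples (hypothetical helper).
    """
    contacts = []
    for (id_a, atoms_a), (id_b, atoms_b) in combinations(sorted(residues.items()), 2):
        # Minimum distance over all atom pairs of the two residues.
        d = min(dist(p, q) for p in atoms_a for q in atoms_b)
        if d <= cutoff:
            contacts.append((id_a, id_b, round(d, 2)))
    return contacts


# Hypothetical toy coordinates, not taken from any 3D structure entry.
toy_residues = {
    ("H", 23): [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    ("H", 104): [(3.5, 0.0, 0.0)],
    ("L", 41): [(20.0, 0.0, 0.0)],
}
contacts = residue_contacts(toy_residues)
print(contacts)
```

Keying residues by chain and position means that, with positions expressed in the IMGT unique numbering, contacts from different structures could be compared position by position.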

IMGT structure Web resources
The IMGT structure Web resources are compiled in the IMGT Repertoire "2D and 3D structures" section which includes 2D representations or IMGT Colliers de Perles [16][17][18][19], 3D representations, FR-IMGT and CDR-IMGT lengths [16], amino acid chemical characteristics profiles [31], etc. In order to appropriately analyse the amino acid resemblances and differences between IG, TR, MHC and RPI chains, eleven IMGT classes were defined for the 'chemical characteristics' amino acid properties and used to set up IMGT Colliers de Perles reference profiles [31]. The IMGT Colliers de Perles reference profiles allow easy comparison of amino acid properties at each position whatever the domain, the chain, the receptor or the species. The IG and TR variable and constant domains represent a privileged situation for the analysis of amino acid properties in relation with 3D structures, because their 3D structure is conserved despite divergent amino acid sequences, and because of the considerable amount of genomic (IMGT Repertoire), structural (IMGT/3Dstructure-DB) and functional data available. These data are not only useful to study mutations and allele polymorphisms, but are also needed to establish correlations between amino acids in the protein sequences and 3D structures and to determine amino acids potentially involved in immunogenicity.

Conclusion
In order to allow any IMGT component to be automatically queried and to achieve a higher level of interoperability inside the IMGT information system and with other information systems, our current objectives include the modelling of the three major IMGT biological approaches (genomics, genetics and structural approaches), the analysis of the IMGT components (databases, tools and Web resources) in relation with the concepts, and the development of Web services http://www.w3.org/2002/ws/ [2]. These are the first steps towards the implementation of IMGT-Choreography [2], which corresponds to the process of complex immunogenetics knowledge [25] and to the connection of treatments performed by the IMGT component Web services. IMGT-Choreography aims to combine and join the IMGT database queries and analysis tools. In order to keep only significant approaches, a rigorous analysis of the scientific standards [3,4], of the biologist requests and of the clinician needs [39][40][41][42] has been undertaken in the three main biological approaches: genomics, genetics and structural approaches. The design of IMGT-Choreography and the creation of dynamic interactions between the IMGT databases and tools, using the Web services and IMGT-ML, represent novel and major developments of IMGT, the international reference in immunogenetics and immunoinformatics. IMGT-Choreography enhances the dynamic interactions between the IMGT components to answer complex biological and clinical requests.
Since July 1995, IMGT has been available on the Web at http://imgt.cines.fr. IMGT has an exceptional response with more than 140,000 requests a month. The information is of much value to clinicians and biological scientists in general. IMGT databases, tools and Web resources are extensively queried and used by scientists from both academic and industrial laboratories, from very diverse research domains: (i) fundamental and medical research (repertoire analysis of the IG antibody sites and of the TR recognition sites in normal and pathological situations such as autoimmune diseases, infectious diseases, AIDS, leukemias, lymphomas, myelomas), (ii) veterinary research (IG and TR repertoires in farm and wild life species), (iii) genome diversity and genome evolution studies of the adaptive immune responses, (iv) structural evolution of the IgSF and MhcSF proteins, (v) biotechnology related to antibody engineering (single chain Fragment variable (scFv), phage displays, combinatorial libraries, chimeric, humanized and human antibodies), (vi) diagnostics (clonalities, detection and follow up of residual diseases) and (vii) therapeutical approaches (grafts, immunotherapy, vaccinology).

Citing IMGT
If you use IMGT databases, tools and/or Web resources, please cite [1] and this paper as references, and quote the IMGT Home page URL address, http://imgt.cines.fr.