IMGT/HighV-QUEST: the IMGT® web portal for immunoglobulin (IG) or antibody and T cell receptor (TR) analysis from NGS high throughput and deep sequencing

Background The number of antigen receptors, immunoglobulins (IG) or antibodies and T cell receptors (TR), of the adaptive immune response in vertebrates with jaws is almost unlimited (2 × 10^12 per individual in humans). This huge diversity results from complex mechanisms in the synthesis of the variable (V) domains, which include DNA molecular rearrangements of the V, diversity (D) and joining (J) genes, N-diversity at the V-(D)-J junctions and, for IG, somatic hypermutations. The specificity of the V domains is conferred by the complementarity determining regions (CDR) and more particularly the CDR3. IMGT®, the international ImMunoGeneTics information system®, has developed online tools that provide a detailed and accurate sequence analysis of the V domains (IMGT/V-QUEST) and CDR3 (IMGT/JunctionAnalysis), based on IMGT-ONTOLOGY. However, online analyses are limited to 50 sequences per batch. The challenge was to provide identical high-quality analysis for the huge number of sequences obtained by Next Generation Sequencing (NGS) high throughput and deep sequencing.
Results IMGT® has developed IMGT/HighV-QUEST that analyses up to 150,000 IG or TR V domain sequences per batch and performs statistical analysis on the results of up to 450,000 sequences. IMGT/HighV-QUEST provides users with: (i) a friendly web interface for submission and results retrieval, (ii) high-quality detailed results of IMGT/V-QUEST and IMGT/JunctionAnalysis, based on the IMGT-ONTOLOGY concepts and IMGT Scientific


BACKGROUND
The number of antigen receptors, immunoglobulins (IG) or antibodies and T cell receptors (TR), of the adaptive immune response in vertebrates with jaws (or gnathostomata) is almost unlimited. In humans, the potential repertoire of each individual is estimated to comprise about 2 × 10^12 different IG and TR, and the limiting factor is only the number of B and T cells that an organism is genetically programmed to produce [1,2]. This huge diversity is inherent to the particularly complex and unique molecular synthesis and genetics of the antigen receptor chains [3]. This includes biological mechanisms such as DNA molecular rearrangements of the variable (V), diversity (D) and joining (J) genes (combinatorial diversity) in multiple loci (three for IG and four for TR in humans) located on different chromosomes (four in humans), nucleotide deletions and insertions at the rearrangement junctions (or N-diversity) and, for IG, somatic hypermutations (for review, see [1,2]). IMGT®, the international ImMunoGeneTics information system® (http://www.imgt.org) [3][4][5], was created in 1989 by Marie-Paule Lefranc, Laboratoire d'ImmunoGénétique Moléculaire LIGM (Université Montpellier 2 and CNRS) at Montpellier, France, in order to standardize and to manage the complexity of immunogenetics data. IMGT® has reached that goal through the building of a unique ontology, IMGT-ONTOLOGY [6][7][8][9][10][11], the first ontology in immunogenetics and immunoinformatics.
IMGT-ONTOLOGY is now acknowledged as the global reference in immunogenetics and immunoinformatics, allowing IMGT® to bridge biological and computational spheres in bioinformatics [10]. IMGT-ONTOLOGY manages the immunogenetics knowledge through diverse facets that rely on the seven axioms of the Formal IMGT-ONTOLOGY or IMGT-Kaleidoscope: "IDENTIFICATION", "DESCRIPTION", "CLASSIFICATION", "NUMEROTATION", "LOCALIZATION", "ORIENTATION" and "OBTENTION" [9][10][11]. These axioms postulate that any object, any process and any relation has to be identified, described, classified, numbered, localized and orientated, and that the way it is obtained can be characterized [9][10][11]. From these axioms, concepts were generated that led to the IMGT Scientific chart rules (http://www.imgt.org): standardized keywords (concepts of identification) [12], standardized labels (concepts of description) [13], standardized gene and allele name and nomenclature (concepts of classification) [14], IMGT unique numbering and IMGT Colliers de Perles (concepts of numerotation) [15][16][17]. Owing to that standardization, IMGT® has become an internationally acknowledged high-quality integrated knowledge resource that comprises several databases for sequences (e.g., IMGT/LIGM-DB [18]), genes (IMGT/GENE-DB [19]), two-dimensional (2D) and three-dimensional (3D) structures (IMGT/2Dstructure-DB, IMGT/3Dstructure-DB [20][21][22]), monoclonal antibodies and fusion proteins for immune applications (FPIA) (IMGT/mAb-DB [23]), seventeen tools for analysis of nucleotide sequences (IMGT/V-QUEST [24][25][26], IMGT/JunctionAnalysis [27,28]), amino acid sequences (IMGT/DomainGapAlign [29]), genes and structure analysis [30], and >15,000 pages of web resources [3][4][5]. Among the nucleotide sequence analysis tools, IMGT/V-QUEST [24][25][26] is the most popular one, as it allows the standardized identification and very detailed description of any rearranged IG or TR sequence of human, mouse and rat. It constantly evolves with other species being added, following the IMGT® annotation of IG and TR loci of newly sequenced genomes.
IMGT/V-QUEST provides a detailed and accurate characterization of the submitted IG and TR sequences entirely based on the IMGT-ONTOLOGY concepts of identification, description, classification and numerotation [12][13][14][15][16][17]. It identifies the V, D and J genes and alleles in rearranged V-J and V-D-J sequences by alignment with the germline IG and TR gene and allele sequences of the IMGT reference directory from IMGT/GENE-DB. It delimits the framework regions (FR-IMGT) and complementarity determining regions (CDR-IMGT) according to the IMGT unique numbering for V domain [15]. The tool describes the V-REGION mutations and identifies the hot spot positions in the closest germline V gene. It detects and accurately describes insertions and deletions in the submitted sequences by reference to the IMGT unique numbering [15].
IMGT/V-QUEST integrates IMGT/JunctionAnalysis [27,28] for a detailed analysis of the V-J and V-D-J junctions (it identifies the D-REGION if present, the nucleotides (nt) deleted as a result of exonuclease trimming and the nontemplated N-REGION nucleotides added at random by the terminal deoxynucleotidyl transferase TdT), and uses IMGT/Automat [31,32] for a full annotation of the V-J- and V-D-J-REGION.
In the context of Next Generation Sequencing (NGS) [33-38], computational power is required in order to analyze huge amounts of data in less time with IMGT® tools. High Performance Computing (HPC) systems are designed to solve advanced computational problems that are highly challenging, complex and time consuming, but using them means working with different HPC technologies and dealing with constantly changing hardware fronted by diverse operating systems. Even under a similar operating system, aspects such as the job queue differ from one system to another. The challenge for IMGT® in providing IMGT/V-QUEST high-quality results for the analysis of IG and TR sequences from NGS high throughput and deep sequencing was to create a friendly and system-neutral environment in which, from the user's point of view, the distributed character and heterogeneity of the computational system components are transparent. To reach that goal, IMGT® has developed IMGT/HighV-QUEST [39][40][41], the first web portal for IG and TR analysis from NGS high throughput and deep sequencing and a secure system designed to run a standalone version of IMGT/V-QUEST on remote computational resources.

IMGT/HighV-QUEST user friendly interface for analysis submission
IMGT/HighV-QUEST users come from different scientific backgrounds. To make the tool convenient for all of them, a simple interface was developed. Submitting an analysis is an easy task: the user goes to the IMGT/HighV-QUEST Search page [39][40][41] by clicking on the link in the menu bar. On the search page (Figure 1) the user gives a title for the analysis and selects the species and the antigen receptor type (IG or TR) or the locus (for instance, IGH or TRB). The user uploads a file with the sequences to be analysed in FASTA format (up to 150,000 sequences per file). The user can choose to be notified by e-mail of the progress of the analysis (when the analysis is queued, when it is submitted (dispatched) to computational resources and/or when it is completed). By clicking on 'Start', the analysis is performed with the default parameters.
Prior to submitting the analysis, the user may customize the results display options in 'Display results' (these options are identical to those of IMGT/V-QUEST [26]) (Figure 1). 'Display results' comprises: 'A. Detailed view' for the display of the results of each analyzed sequence in individual result files. The user can choose whether or not to include them in the output. If they are included, the user can choose the 'Nb of nucleotides per line in alignments' (60 by default) and select among 13 different results displays. 'B. Files in CSV' for the choice of the CSV files to be retrieved in the final outputs (Summary, Nt-sequences and Parameters are provided by default).
For sophisticated queries or for unusual sequences, users can modify the default values in 'Advanced parameters'. The customizable values, identical to those of IMGT/V-QUEST [26], are: (i) 'Selection of IMGT reference directory set' used for the V, D and J gene and allele identification and alignments with a choice of four sets ('F+ORF', 'F+ORF+in-frame P' (by default), 'F+ORF including orphons' and 'F+ORF+in-frame P including orphons', where F is functional, ORF is open reading frame and P is pseudogene). This allows sequences to be compared with only relevant gene sequences (e.g., orphon sequences are relevant for genomic but not for expressed repertoire studies). The selected set can also be chosen either 'With all alleles' or 'With allele *01 only'. (ii) 'Search for insertions and deletions in V-REGION' is selected by default ('Yes') and can be deactivated if the user does not want to take indels into account in the alignment with germline genes and alleles. (iii) 'Parameters for IMGT/JunctionAnalysis': 'Nb of accepted D-GENE in JUNCTION' (provided for the IGH, TRB and TRD junctions) and 'Nb of accepted mutations' in 3'V-REGION, D-REGION and 5'J-REGION (default values are indicated per locus in the IMGT/V-QUEST Documentation and in [26]). (iv) 'Parameters for Detailed View': 'Nb of nucleotides to exclude in 5' of the V-REGION for the evaluation of the nb of mutations' (to avoid, e.g., counting primer specific nucleotides) and/or 'Nb of nucleotides to add (or exclude) in 3' of the V-REGION for the evaluation of the alignment score' (e.g., in case of low or high exonuclease activity).
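The 150,000-sequence limit per uploaded file means that larger NGS data sets have to be split before submission. The following is a minimal sketch of one way to do this on the user's side, assuming plain multi-line FASTA input; the function names `read_fasta` and `split_into_batches` are illustrative and are not part of IMGT/HighV-QUEST.

```python
from typing import Iterable, Iterator, List, Tuple

# Per-file limit stated in the text; everything else here is an assumption.
MAX_SEQUENCES_PER_BATCH = 150_000

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_into_batches(
    records: Iterable[Record], batch_size: int = MAX_SEQUENCES_PER_BATCH
) -> Iterator[List[Record]]:
    """Group records into lists of at most batch_size sequences."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch can then be written to its own FASTA file and submitted as a separate analysis.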

IMGT/HighV-QUEST analysis life cycle
After submission, the required information is checked and a popup message appears if a required field is not filled. After the transfer of the FASTA file to the local web server, a syntax check is also performed so that the user can correct the syntax before launching the analysis. This saves time for the user by preventing the analysis of syntactically incorrect sequences. The submitted analysis is kept in the local web server analysis queue and dispatched to a remote computational resource when a resource can accept it. This acceptance is based on different criteria, such as the number of sequences, free resources, etc. Once the analysis by IMGT/V-QUEST on the remote resource is completed and the results are prepared, the user is notified by e-mail (if the user has chosen to be informed), and the temporary files and folders are cleaned from the local and remote resources. The analysis results are then kept for 15 days after the analysis completion date and are removed afterwards. Five days before the expiration (that is, 10 days after the completion date), the user is notified by e-mail that the results will expire in 5 days. When an analysis is deleted by the user or because of its expiration, all user data and results regarding that analysis are removed from the system. However, if the analysis has been selected in a statistical analysis, it cannot be removed by the user or the system.
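The retention rules above reduce to simple date arithmetic. The sketch below illustrates them under stated assumptions: the 15-day retention and the 5-day notice come from the text, while the function names and the choice to treat results as removable only on the day after the expiration date are hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple

RETENTION_DAYS = 15            # results are kept 15 days after completion (from the text)
NOTICE_DAYS_BEFORE_EXPIRY = 5  # e-mail notice is sent 5 days before expiration (from the text)


def retention_schedule(completed_on: date) -> Tuple[date, date]:
    """Return (notification date, expiration date) for a completed analysis."""
    expires_on = completed_on + timedelta(days=RETENTION_DAYS)
    notify_on = expires_on - timedelta(days=NOTICE_DAYS_BEFORE_EXPIRY)
    return notify_on, expires_on


def is_removable_by_system(today: date, completed_on: date, used_in_statistics: bool) -> bool:
    """Assumed rule: remove after expiration unless a statistical analysis uses the results."""
    _, expires_on = retention_schedule(completed_on)
    return today > expires_on and not used_in_statistics
```

For an analysis completed on 1 March, this gives a notification on 11 March (10 days after completion) and an expiration on 16 March.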

IMGT/HighV-QUEST analysis results outputs
The IMGT/HighV-QUEST analysis results outputs comprise a set of text files in two folders (Figure 2): the main folder and, if chosen, the individual result files folder, archived in a single ZIP file.
The IMGT/HighV-QUEST main folder includes eleven files (if all are selected) in CSV format (results equivalent to those of the Excel file of IMGT/V-QUEST online [26]):
(i) the 'Summary' file provides the synthesis of the analysis (the sequence functionality, the names of the closest V, D and J genes and alleles with identity percentage, FR-IMGT and CDR-IMGT lengths, amino acid (AA) JUNCTION, the description of insertions and deletions if any),
(ii) the 'IMGT-gapped-nt-sequences' file includes the nucleotide (nt) sequences of labels that have been gapped according to the IMGT unique numbering,
(iii) the 'Nt-sequences' file includes the ungapped nt sequences of all described labels,
(iv) the 'IMGT-gapped-AA-sequences' file includes the AA sequences of labels that have been gapped according to the IMGT unique numbering,
(v) the 'AA-sequences' file includes the ungapped AA sequences of labels,
(vi) the 'Junction' file includes the results of IMGT/JunctionAnalysis,
(vii) the 'V-REGION-mutation-and-AA-change-table' file includes the list of mutations (nt mutations, AA changes, AA class identity (+) or change (-), total for the V-REGION and per FR-IMGT and CDR-IMGT),
(viii) the 'V-REGION-nt-mutation-statistics' file includes the number (nb) of nt positions including IMGT gaps, the nb of nt, the nb of identical nt, the total nb of mutations, and then the nb of silent mutations, the nb of nonsilent mutations, the nb of transitions and the nb of transversions, total for the V-REGION and per FR-IMGT and CDR-IMGT,
(ix) the 'V-REGION-AA-change-statistics' file includes the nb of AA positions including IMGT gaps, the nb of AA, the nb of identical AA, the total nb of AA changes, and then the nb of AA changes according to AAclassChangeType (e.g., +++) [26], and the nb of AA class changes according to AAclassSimilarityDegree (e.g., Very similar) [26], total for the V-REGION and per FR-IMGT and CDR-IMGT,
(x) the 'V-REGION-mutation-hotspots' file indicates the localization of the hot spot motifs detected in the closest germline V-REGION with positions in FR-IMGT and CDR-IMGT,
(xi) the 'Parameters' file includes the date of the analysis, the IMGT/V-QUEST version, and the parameters used for the analysis.
The IMGT/HighV-QUEST individual result files folder includes the individual files of all the sequence results (up to 150,000). They allow visualization of the results corresponding to 'Detailed view' for each analysed sequence (results identical to those of IMGT/V-QUEST online in Text; they have been detailed elsewhere [26] and are only briefly described here). Each file comprises: (i) the result summary that summarizes the main characteristics of the analysed sequence with the names of the closest V and J genes and alleles with their alignment score and the percentage of identity, the name of the closest D gene and allele determined by IMGT/JunctionAnalysis with the D-REGION reading frame, the FR-IMGT and CDR-IMGT lengths and the AA JUNCTION sequence, and if selected, (ii) the alignment for V, D, J genes and alleles, (iii) the detailed analysis of the JUNCTION by IMGT/JunctionAnalysis, (iv) different displays of the V-REGION, (v) the analysis of the mutations and AA changes, (vi) the localization of the mutation hot spots, and (vii) the annotation by IMGT/Automat.

IMGT/HighV-QUEST statistical analysis submission and life cycle
The IMGT/HighV-QUEST statistical analysis submission is performed from the 'Launch statistics' page, for which the link is accessible from the menu bar. On this page (Figure 3), the tool gives the user a list of all the user's current analyses under '1. Analysis results selection'. The user can choose the different analysis results on which to perform the statistical analysis. The analyses must meet the following criteria (available in the table): they should be completed without errors or warnings, they should be on the same species and receptor type or locus (e.g., 'Homo sapiens' 'TRB'), and they should have been analysed with the same IMGT reference directory set (e.g., 'F+ORF+in-frame P') and with the same indel option (e.g., 'Yes'). For consistency of the statistical analysis, users should themselves verify that the different analysis results were obtained with the same IMGT/HighV-QUEST version, IMGT/V-QUEST version and IMGT/V-QUEST reference directory release. Under '2. Statistical analysis title', the user chooses a title (required for job identification). Selecting the option 'Graphical elements' adds to the outputs separate graphical elements in PNG format, so that they can be used in other documents. The user can add optional comments which will be included in the final reports (this functionality is provided because the output PDF reports are not editable). Once the 'Start' button is clicked, the job is sent to the local web server.
The IMGT/HighV-QUEST statistical analysis life cycle is the following: once on the local web server, the job is queued until a free resource is available to perform the statistical analysis. Once the job is dispatched, it is monitored automatically and regularly until it is completed and then the statistical results are prepared. The completed statistical analysis is kept on the local web server until 15 days after its completion date, after which it is removed from the system. Five days before the expiration (that is, 10 days after the completion date), the user is notified by e-mail that the results will expire in 5 days.
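The selection criteria for a statistical analysis can be expressed as a small consistency check. The sketch below is an illustration only: the `Analysis` record, its field names and the status values are hypothetical, while the criteria themselves (completed status, same species, same receptor type or locus, same reference directory set, same indel option) and the 450,000-sequence ceiling are taken from the text.

```python
from dataclasses import dataclass
from typing import List

MAX_SEQUENCES_FOR_STATISTICS = 450_000  # ceiling stated in the text


@dataclass(frozen=True)
class Analysis:
    """Hypothetical record describing one completed IMGT/HighV-QUEST analysis."""
    title: str
    status: str              # assumed values: "completed", "error", "warning"
    nb_sequences: int
    species: str
    receptor_or_locus: str
    reference_set: str       # e.g. "F+ORF+in-frame P"
    indel_search: str        # "Yes" or "No"


def selection_problems(analyses: List[Analysis]) -> List[str]:
    """Return the reasons why a selection is not eligible (empty list if eligible)."""
    if not analyses:
        return ["no analysis selected"]
    problems = []
    for a in analyses:
        if a.status != "completed":
            problems.append(f"'{a.title}' was not completed without error or warning")
    shared_fields = (
        ("species", "species"),
        ("receptor_or_locus", "receptor type or locus"),
        ("reference_set", "IMGT reference directory set"),
        ("indel_search", "indel option"),
    )
    for field, label in shared_fields:
        values = {getattr(a, field) for a in analyses}
        if len(values) > 1:
            problems.append(f"analyses differ in {label}: {sorted(values)}")
    total = sum(a.nb_sequences for a in analyses)
    if total > MAX_SEQUENCES_FOR_STATISTICS:
        problems.append(
            f"total of {total} sequences exceeds the limit of {MAX_SEQUENCES_FOR_STATISTICS}"
        )
    return problems
```

The check does not cover the tool and reference directory versions, which the text leaves to the user to verify.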

IMGT/HighV-QUEST statistical analysis outputs
The IMGT/HighV-QUEST statistical analysis is automatically generated from the 'Summary' and 'Nt-sequences' CSV files of IMGT/HighV-QUEST results and contains the following output items:
1. Comments: if added by the user.
2. Analysis list table (Figure 4): this table recapitulates the list of the analysis results chosen by the user for the statistical analysis.
3. Summary table (Figure 5): this table recalls, but only for the first analysis results in the Analysis list table, the IMGT/HighV-QUEST version, the IMGT/V-QUEST version and the IMGT/V-QUEST reference directory release and the 'PARAMETERS' used for the analyses. The 'RESULTS' section gives the general results of the statistical analysis that lead to 'Result category' with, for each category, the nb of sequences and the sequence average length (in nt).
4. Terminology (Figure 6): this section provides the definition of the result categories and the terminology of the statistical analysis report (see details in Figure 6 legend).
5. Number of '1 copy' with 'single allele' and 'several alleles (or genes)' (for V, D and/or J) tables and histograms (Figure 7): two tables are provided depending on whether IMGT/V-QUEST found one 'single allele' for the identified gene (expected to be an unambiguous result) or 'several alleles (or genes)' (usually in the case of too short sequences). Each table shows, per locus, the number of sequences '1 copy' for each V, D and J gene, separately, and in combination. Histograms visualize the information from the tables per locus.
6. IMGT/HighV-QUEST gene and allele tables for '1 copy' with 'single allele' (Figure 8): a table is provided per group of V, D and J genes (e.g., TRBV in Figure 8) [14]. Each table shows the list of identified genes, with for each gene the IMGT gene and allele name (with the taxonomy 6-letter abbreviation, for instance Homsap for Homo sapiens), the functionality (F, ORF or P) [12], number of '1 copy' sequences ('Total'), sequence average length (in nt) and in the column 'id=100%' the number (and in parentheses the percentage) of sequences with an identity percentage of 100% by comparison with the germline gene.
7. IMGT/HighV-QUEST gene histograms for '1 copy' with 'single allele' (Figure 9): a histogram is provided per V, D and J gene group [14]. The histograms visualize (and recapitulate in parentheses) the number of sequences per gene in a given V, D or J group (e.g., TRBJ in Figure 9). The list of all V, D or J genes is provided according to their position in the concerned locus.
8. IMGT/HighV-QUEST CDR3-IMGT tables for '1 copy' and 'in-frame junction': three tables are provided for '1 copy' with 'single allele' for V and J genes (e.g., between TRBV and TRBJ in Figure 10), '1 copy' for 'several alleles (or genes)' for V and/or J genes (see 11. below), and the last one for all together. Each table gives, for each length of CDR3-IMGT observed in in-frame junctions between V and J, the length in nt and AA, the number ('Total') and percentage ('Percent') of sequences for each length, the sequence average length, the number of sequences with different CDR3-IMGT (in nt and AA) and the number of sequences (and in parentheses the number of sets) with identical CDR3-IMGT (in nt and AA).
9. IMGT/HighV-QUEST CDR3-IMGT histogram for '1 copy' and 'in-frame junction': three histograms are provided for '1 copy' with 'single allele' for V and J genes (e.g., between TRBV and TRBJ in Figure 11), '1 copy' for 'several alleles (or genes)' for V and/or J genes (see 11. below), and the last one for all together. Each histogram is the graphical illustration of the corresponding CDR3-IMGT table and gives, for each length of CDR3-IMGT (nt and AA) observed in in-frame junctions between V and J: Nb of sequences with different (unique) CDR3-IMGT (nt), Nb of sequences with different (unique) CDR3-IMGT (AA), Nb of sequences (in sets) with identical CDR3-IMGT (nt) and Nb of sequences (in sets) with identical CDR3-IMGT (AA).
10. IMGT/HighV-QUEST CDR3-IMGT sets (identical nt and AA) tables for '1 copy' and 'in-frame junction': three tables are provided for '1 copy' with 'single allele' for V and J genes (e.g., between TRBV and TRBJ in Figure 12), '1 copy' for 'several alleles (or genes)' for V and/or J genes (see 11. below), and the last one for all together. Each CDR3-IMGT sets table only shows lines from the corresponding CDR3-IMGT table (see 8. above) for which sets contain at least two sequences with identical CDR3-IMGT in nt or AA (number in parentheses greater than 0). Below each recall line, details are provided with number of sets and number of sequences in the set (nt and AA) (Figure 12). For example, for the CDR3-IMGT length of 24 nt (8 AA), 55 sequences belong to 17 sets (both at the nt and AA level) which correspond to 8 sets of 2 sequences, 5 sets of 3 sequences, 2 sets of 4 sequences and 2 sets of 8 sequences.
11. IMGT/HighV-QUEST gene and allele tables for '1 copy' with 'several alleles (or genes)' (Figure 13): these tables are provided with the same type of information as for '1 copy' with 'single allele' (see above in 6.). However, these sequences are not shown as histograms owing to the uncertainty in the identification of the allele (or even gene), the region of interest in the sequences being too short to allow a correct analysis of the V-REGION by IMGT/V-QUEST. With the progress of the NGS sequencing methodology, the percentage of longer sequences should increase and this category should decrease. Despite the uncertain identification of the V and/or J alleles (or genes), these sequences are not excluded from CDR3-IMGT tables, histograms and sets (see 8., 9. and 10.).
12. IMGT/HighV-QUEST list of sequences in 'More than 1' (Figure 14): these sequences represent redundancies. These sequences are filtered out and excluded from statistical analysis in order to obtain one and only one copy ('1 copy') of each sequence. The 'More than 1' sequences are listed below each corresponding '1 copy', with their sequence number and sequence ID.
13. IMGT/HighV-QUEST list of sequences with 'Warnings' (Figure 15): these are sequences with warnings for the V-REGION ('different CDR lengths' and/or 'id<85%'). These sequences are filtered out and excluded from statistical analysis. The sequences with 'Warnings' are listed, with their sequence number and sequence ID.
14. Sequences with 'Unknown functionality': list of sequences for which no functionality was detected. This category corresponds to the sequences for which the junction cannot be identified (no evidence of rearrangement, no evidence of junction anchors).
15. Sequences with 'No results': list of sequences for which IMGT/HighV-QUEST did not return any result.
The IMGT/HighV-QUEST statistical analysis outputs are in PDF format (6 reports) and PNG (separate graphical elements), archived in a single ZIP file (Figure 16). The content of the PDF reports is described below, with, in parentheses, the results (paragraphs above) to which they refer:
1. IMGT report all: this report contains all results of the statistical analysis (1. to 15.)
2. IMGT report summary: contains Comments (from user, optional), Analysis results list, Summary

Figure 3. The 'Analysis results selection' table shows the list of the analyses with their status. In this table, the user selects the analyses on which to perform the statistical analysis, respecting the criteria defined in the text.
Figure 4. The 'Analyses list' table recapitulates the list of the analysis results chosen for the statistical analysis. For each of them, it recalls the Title, Nb of sequences, IMGT/V-QUEST reference directory species and IMGT/V-QUEST receptor type or locus.
Figure 5. The IMGT/HighV-QUEST statistical analysis 'Summary table' indicates the title of the statistical analysis (as entered by the user), recalls the version of IMGT/HighV-QUEST and IMGT/V-QUEST, the IMGT/V-QUEST reference directory release and 'PARAMETERS' used for the analyses and provides, in 'RESULTS', an overall view of the statistical analysis results that lead to 'Result category' (see details in Figure 6).
Figure 6. This distribution in categories gives the user, at first glance, an idea of how much the data can be relied on. '1 copy': sequences in one copy, and therefore different by their length and/or their sequence, and retained in 'filtered-in' sequences. For each set of identical sequences, only one copy is retained in '1 copy' and the other redundant sequences for that copy are put into 'More than 1'. The following four categories are excluded from statistical analysis (filtered-out sequences). 'More than 1': redundant identical sequences (after one copy of each set of identical sequences has been retained in '1 copy'). 'Warnings': sequences with warnings for the V-REGION ('different CDR lengths' and/or 'id<85%'; 'different CDR lengths' means sequences with different AA lengths for CDR1-IMGT and/or CDR2-IMGT compared to the CDR1-IMGT and/or CDR2-IMGT lengths, respectively, of the closest identified germline V gene and allele). 'Unknown functionality': sequences for which no functionality was detected. This category corresponds to the sequences for which the junction cannot be identified (no evidence of rearrangement, no evidence of junction anchors). 'No results': sequences for which IMGT/HighV-QUEST did not return any result.
Figure 7. The statistical analysis is performed on the '1 copy' category divided in two sets, depending on the IMGT/HighV-QUEST result: 'single allele' (only one gene and allele identified by IMGT/HighV-QUEST), 'several alleles (or genes)' (several alleles (or genes) identified by IMGT/HighV-QUEST). The results are provided for each V, D or J gene and for any combination of them, for 'single allele' (on the left hand side) and for 'several alleles (or genes)' (on the right hand side). Below the tables, histograms are provided, per gene, for each concerned locus. Color code for histograms: green for V genes, red for D genes, yellow for J genes, green with red hatchings for the combination of V and D genes, green with yellow hatchings for the combination of V and J genes, green with red and yellow hatchings for the combination of V, D and J genes.
Figure 8. IMGT/HighV-QUEST statistical analysis gene and allele table for '1 copy' with 'single allele'. The gene and allele table is provided per Group [14] (here TRBV) and shows a list of all genes (green lines) and alleles (white lines) found for '1 copy' with 'single allele'.
Figure 9. The gene histogram is provided per Group [14] (here TRBJ) for '1 copy' with 'single allele'. For each gene and allele table, a gene histogram is shown, localizing the gene in the locus, with the number of sequences found in parentheses.
Figure 10. IMGT/HighV-QUEST statistical analysis CDR3-IMGT table. The CDR3-IMGT tables are provided for '1 copy' and 'in-frame junctions'. The table shown is for '1 copy' with 'single allele' for both V and J, here TRBV and TRBJ.
Figure 12. The CDR3-IMGT sets (identical nt and AA) tables show recall lines from the corresponding CDR3-IMGT tables for which sets contain at least two sequences with identical CDR3-IMGT (nt or AA). Below each recall line, details are provided with number of sets and number of sequences in the set (in nt and AA). The figure shows details for lines #2 to #6 (for which the number of sets is greater than 0) from Figure 10.
Figure 13. The gene and allele table (here TRBV) shows a list of all genes (gray lines) and alleles (white lines) found for '1 copy' with 'several alleles (or genes)'.
Figure 14. The table lists the sequences present in multiple copies. Green lines show the '1 copy' sequences that were taken into account for the detailed statistical analysis and blue lines show the 'More than 1' sequences that were filtered out from the detailed statistical analysis. A green line and the blue lines below it thus illustrate a single pool of identical sequences.
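The CDR3-IMGT sets arithmetic in the worked example (55 sequences in 17 sets for the 24 nt length) can be reproduced with a simple grouping of identical CDR3-IMGT strings. This is a minimal sketch, not the IMGT/HighV-QUEST implementation: the function name `cdr3_sets` and the returned keys are hypothetical, and it assumes that "different" CDR3-IMGT means a CDR3-IMGT observed exactly once while a "set" is a group of at least two sequences sharing the same CDR3-IMGT.

```python
from collections import Counter
from typing import Dict, Iterable


def cdr3_sets(cdr3_sequences: Iterable[str]) -> Dict[str, object]:
    """Summarize identical CDR3-IMGT (nt or AA strings) for one CDR3-IMGT length."""
    counts = Counter(cdr3_sequences)
    # A set is a group of two or more sequences with the same CDR3-IMGT (assumption).
    set_sizes = [n for n in counts.values() if n >= 2]
    return {
        "different": sum(1 for n in counts.values() if n == 1),
        "in_sets": sum(set_sizes),
        "nb_sets": len(set_sizes),
        "sets_by_size": dict(sorted(Counter(set_sizes).items())),
    }
```

With 8 sets of 2, 5 sets of 3, 2 sets of 4 and 2 sets of 8 sequences, the totals are 8×2 + 5×3 + 2×4 + 2×8 = 55 sequences in 8 + 5 + 2 + 2 = 17 sets, in agreement with the example given for item 10.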

DISCUSSION
The IMGT/HighV-QUEST web portal provides the high-quality results of IMGT/V-QUEST and IMGT/JunctionAnalysis [24][25][26][27][28] in the analysis of the antigen receptor (IG and TR) repertoire sequences generated from NGS high throughput and deep sequencing. High-quality results are based on the IMGT Scientific chart rules: standardized gene and allele nomenclature [1,2,14,19], standardized description and delimitation of labels [7,8,13], particularly the CDR-IMGT and FR-IMGT [15][16][17], and extensive and accurate analysis of the JUNCTION [27][28]. In contrast to computational software developed for handling huge amounts of short sequences (e.g., from Illumina sequencing [42][43][44][45][46][47][48][49]), IMGT/HighV-QUEST works on longer sequences (from 454 Life Sciences sequencing) and, from the start, provides highly reliable results on sequences of good quality. Accessing the full immune antigen receptor repertoires requires sequences including the complete variable domains (~360 nt) for a reliable analysis, particularly for the IG. Currently the 454 Life Sciences technology [33], which provides the longer sequences, is the best suited for the analysis of the IG and TR repertoires. The 454 sequencing of variable domains has been used to analyse the IGH repertoire in zebrafish [50], to estimate the diversity of a combinatorial human antibody library [51], to monitor human B cell clonality in hematological lymphoid malignancies [52] and to analyse the TRA and TRB repertoires in humans with the comparison of eight T cell subsets in one healthy individual [53]. More specific analyses for human IGH sequences include individual polymorphic variations [54] and quantification of minimal residual disease in chronic lymphocytic leukemia [55] and, for human TRB sequences, analysis of specific rearrangements shared between clonotypes [56].
However, not only do NGS methodologies still need to improve, in particular to deal with sequencing errors [57,58], but great care should also be taken in obtaining the sequences themselves to avoid biases that could lead to a skewed repertoire [59]. Thus 5'RACE (5' rapid amplification of cDNA ends) type protocols [60,61] should be favored over the use of multiplex PCR (polymerase chain reaction) amplification, both to avoid biases and to obtain complete variable domains in 5'. In this context, IMGT/HighV-QUEST provides a unique standardized frame for the user to assess the quality of the experimental sequencing by comparing the percentages of the 'Result category' in the statistical analysis: sequence length ('single allele' vs 'several alleles'), sequencing quality (the 'Warnings' and 'No results' categories should be as small as possible), amplification bias ('1 copy' vs 'More than 1', although this may indicate, for reliable data, relative clone expression). Moreover, IMGT/HighV-QUEST allows results to be compared, with the same criteria, whatever the antigen receptor (IG or TR) and the species (human, mouse, rat, etc.). It is always possible to check unusual results visually with IMGT/V-QUEST online, as the same version and the same IMGT reference directory release are used and individual files are provided in the IMGT/HighV-QUEST results.

Figure 15. The table lists the sequences with 'Warnings' with their position number in the input file and their identifier. 'Warnings' are defined as having an identity percentage of less than 85% in the alignment with the germline gene and/or 'different CDR lengths'. In order to reduce doubt and to increase the reliability of the statistical analysis, the sequences with 'Warnings' are omitted from the analysis.
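The comparison of 'Result category' percentages mentioned above can be illustrated with a small sketch that assigns each sequence to one category and reports the share of each. Everything specific here is an assumption made for illustration: the `SequenceResult` fields, the order in which the filters are applied, and the case-insensitive comparison of sequences are not specified in the text; only the category names and the 85% identity threshold come from the description of the statistical analysis.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Set


@dataclass(frozen=True)
class SequenceResult:
    """Hypothetical per-sequence record; field names are illustrative only."""
    seq_id: str
    sequence: str
    has_result: bool = True
    functionality: Optional[str] = "productive"  # None when no functionality was detected
    v_identity: float = 100.0                     # % identity with the closest germline V
    cdr_lengths_differ: bool = False              # CDR1/CDR2 lengths differ from germline


def categorize(results: Iterable[SequenceResult], min_identity: float = 85.0) -> Dict[str, str]:
    """Assign each sequence to one result category (the filter order is an assumption)."""
    seen: Set[str] = set()
    categories: Dict[str, str] = {}
    for r in results:
        if not r.has_result:
            category = "No results"
        elif r.functionality is None:
            category = "Unknown functionality"
        elif r.v_identity < min_identity or r.cdr_lengths_differ:
            category = "Warnings"
        elif r.sequence.upper() in seen:
            category = "More than 1"
        else:
            seen.add(r.sequence.upper())
            category = "1 copy"
        categories[r.seq_id] = category
    return categories


def category_percentages(categories: Dict[str, str]) -> Dict[str, float]:
    """Return the percentage of sequences falling in each category."""
    counts = Counter(categories.values())
    total = sum(counts.values())
    return {category: 100.0 * n / total for category, n in counts.items()}
```

A high share of 'Warnings' or 'No results' would point to sequencing quality problems, whereas a high share of 'More than 1' may reflect either amplification bias or genuine clonal expansion, as noted in the text.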

CONCLUSIONS
IMGT/HighV-QUEST, the high throughput version of IMGT/V-QUEST, has become the standard reference for the analysis of IG and TR V domain sequences generated from NGS high throughput and deep sequencing [40,41].
It analyses up to 150,000 nucleotide sequences per batch and performs statistical analysis on the results of up to 450,000 sequences. IMGT/HighV-QUEST provides users with: (i) a friendly web interface for submission and results retrieval, (ii) highly standardized and detailed IMGT/V-QUEST and IMGT/JunctionAnalysis results based on the IMGT-ONTOLOGY concepts and IMGT Scientific chart rules, (iii) a standardized frame for NGS statistical analysis based on 'Results category' ('1 copy', 'More than 1', 'single allele', 'several alleles (or genes)'), (iv) detailed statistical analysis tables and histograms (e.g., V, D and J usage, CDR3-IMGT (nt and AA) lengths). IMGT/HighV-QUEST has been freely available to academics on the IMGT® Home page (http://www.imgt.org) since October 2010. More than 123 million sequences were submitted during its first year. The jobs required 70,000 computational hours of resources and generated about three terabytes of results data. More than 83% of the sequences were submitted by users from the USA; most of the others were submitted by users from the European Union (EU), but also from China, Japan, Australia, Canada, Korea and Venezuela. Sequencing data are from both IG and TR [62,63] and from any vertebrate species for which the IMGT reference directory is available.
Beyond the complexity and diversity of the immune responses, it becomes possible, using IMGT/HighV-QUEST, to establish reliable repertoires of IG and TR V domains. These repertoires will contribute to the comparison of individual immunoprofiles in diverse immune situations (healthy vs disease-related repertoires, vaccination, autoimmunity, cancer, infections, immune reconstitution following bone marrow transplant, etc.) and on different B and T cell populations (e.g., characterized by their phenotype markers, differentiation state, activation state, etc.). They will also contribute to the characterization of potential therapeutic antibodies from combinatorial libraries.

Distributed system
IMGT/HighV-QUEST is an automatic system. It provides users with a web service to launch analyses of up to 150,000 IG and TR nucleotide sequences, and this service must remain constantly available to end users from the scientific community. Resources are likely to be inaccessible for periods of time, planned or unplanned, yet the tool should continue to accept user sequences for analysis. The system must therefore be able to use more than one HPC resource, so that at least one resource is working when another is down. This simultaneous use of several HPC systems requires a generic distributed system. In this architecture the whole system is distributed over different resources (computers, servers, HPC resources) and the tasks are shared amongst them. There are two main kinds of resources: a local web server and a number of remote computational resources. The local web server manages all the tasks related to IMGT/HighV-QUEST as scheduled tasks and offers user interaction via a web interface. The computational resources, on the other hand, are used to analyze user sequences using standalone versions of the analysis applications. At present, IMGT/HighV-QUEST uses computational resources on several HPC systems at the Centre Informatique National de l'Enseignement Supérieur (CINES) and at the Institut de Génétique Humaine (IGH). XML parameter files are used to add or remove resources from the list; the generic nature of the system allows administrators to add as many resources as they want and to set the performance-related parameters used for load balancing between the available resources.
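How a list of resources declared in XML could drive load balancing can be sketched as follows. The XML layout, the element and attribute names, the resource names and the selection rule (most free job slots among the resources that are not down) are all assumptions made for illustration; the actual parameter files and balancing policy are not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical parameter file: element and attribute names are assumptions.
EXAMPLE_CONFIG = """
<resources>
  <resource name="hpc-a" mode="pbs" max_jobs="8"/>
  <resource name="hpc-b" mode="simple" max_jobs="2"/>
</resources>
"""


def load_resources(xml_text):
    """Parse the resource list from an XML parameter file."""
    root = ET.fromstring(xml_text)
    return [
        {
            "name": node.get("name"),
            "mode": node.get("mode"),
            "max_jobs": int(node.get("max_jobs")),
        }
        for node in root.findall("resource")
    ]


def pick_resource(resources, running_jobs, down):
    """Return the usable resource with the most free job slots, or None.

    running_jobs maps a resource name to its current number of jobs;
    down is the set of resource names currently marked as unavailable.
    """
    candidates = [
        r for r in resources
        if r["name"] not in down
        and running_jobs.get(r["name"], 0) < r["max_jobs"]
    ]
    if not candidates:
        return None  # keep the analysis in the local queue
    return max(
        candidates,
        key=lambda r: r["max_jobs"] - running_jobs.get(r["name"], 0),
    )
```

With such a scheme, adding or removing a resource only requires editing the parameter file, and an analysis stays in the local queue whenever no resource is usable.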

A layered architecture
The distributed nature of the system requires an internal management approach that makes it easy for developers to extend the system and to characterize errors and exceptions. For this reason the tasks are divided into three layers.
1. Web Service (WS) Layer: this layer is responsible for user interactions. The interface design is simple, so that users with minimal knowledge of the Internet can use the tool easily. The heterogeneity of the background system is not felt here. The user simply submits the analysis through a classical web interface and chooses whether to be notified by e-mail when the job is complete. The user can download the results of completed analyses in one click and can see the status of submitted analyses at any time. The interface also lets users know whether an error occurred during the execution of the analysis or whether there is a warning concerning its results. The web service is available via a simple HTTP connection and via an HTTPS connection secured with SSL.

System dynamics and reactivity
The system uses its configurations and uniform exception objects to localize errors and exceptions, to resolve them where possible, or to remember them in order to prevent their recurrence. IMGT/HighV-QUEST has a short term memory that lets it tolerate breakdowns by remembering them and preventing the same failures in the future. It also notifies administrators by e-mail of the exceptions that occurred, if necessary. This reactivity lets the administrators find the source of a problem and act in real time. This automatic routine, combined with human intervention, makes IMGT/HighV-QUEST a stable system, suited to a heavy load of analyses running in parallel on different resources.
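The idea of a short term memory can be illustrated with a small sketch. The class name, the retention delay and the choice of keying failures by resource name are assumptions; the text does not specify how failures are stored or for how long.

```python
import time


class ShortTermMemory:
    """Remember recent failures per resource and forget them after a delay."""

    def __init__(self, retention_seconds=600.0, clock=time.monotonic):
        self.retention_seconds = retention_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._failures = {}  # resource name -> time of the last failure

    def remember(self, resource):
        """Record that an operation on this resource has just failed."""
        self._failures[resource] = self.clock()

    def should_avoid(self, resource):
        """Return True while a failure on this resource is still remembered."""
        failed_at = self._failures.get(resource)
        if failed_at is None:
            return False
        if self.clock() - failed_at > self.retention_seconds:
            del self._failures[resource]  # the memory has expired
            return False
        return True
```

A scheduler could call `should_avoid` before dispatching a job, so that a resource that has just failed is skipped for a while instead of triggering the same exception again.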

Error-prone system and error toleration
Errors can occur during program execution or job management. No system is free of errors. The number of errors and exceptions is directly proportional to the heterogeneity of the background system. In IMGT/HighV-QUEST a single core application is connected to multiple resources running different distributions of different operating systems with different configurations. Exceptions may therefore occur during execution and management, concerning the connections and configurations, but also on the remote resources themselves. The system manages these exceptions when they occur in order to minimize the loss of user data and time, and it tries to compensate for errors whenever possible. The IMGT/HighV-QUEST system detects and localizes errors by means of a single exception class, which makes error detection uniform and thus reduces the complexity of localizing and preventing errors.
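A single exception class of this kind might look like the following sketch. The class name `PortalError`, its fields and the rule that connection-type failures are recoverable are assumptions made for illustration, not the actual IMGT/HighV-QUEST class.

```python
class PortalError(Exception):
    """Single exception type recording where a failure happened and why."""

    def __init__(self, layer, resource, message, recoverable=False):
        super().__init__(message)
        self.layer = layer              # e.g. 'WS', 'ST' or 'CR'
        self.resource = resource        # name of the resource involved
        self.recoverable = recoverable  # whether a retry may succeed

    def describe(self):
        """Return a one-line description suitable for a log or an e-mail."""
        return f"[{self.layer}] {self.resource}: {self}"


def run_on_resource(resource, action):
    """Run action() and convert any low-level failure into a PortalError."""
    try:
        return action()
    except PortalError:
        raise  # already in the uniform format
    except OSError as exc:
        # Assumption: I/O and connection problems are worth retrying.
        raise PortalError("CR", resource, str(exc), recoverable=True) from exc
    except Exception as exc:
        raise PortalError("CR", resource, str(exc)) from exc
```

Because every failure reaches the caller as the same type, the code that logs, retries or remembers errors only has to handle one class.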

Timers and optimization of resource use
A good system not only performs its tasks but also completes them in an appropriate time and manner. The IMGT/HighV-QUEST system is responsible for three concerns simultaneously. The first is user interaction. An analysis can take more than 24 hours to complete; a user cannot wait in front of the screen for that long and may close the web browser a few minutes after submission, yet still expects the interface to respond in real time. The second is the performance of the tool, which must be reactive to users and administrators and, more importantly, to events (exceptions, errors, etc.). The third is the completion of scheduled tasks in good time. For all these reasons, a system of timers was designed in IMGT/HighV-QUEST to synchronize the different types of tasks on heterogeneous systems. Computational resources do not all have the same performance and do not behave in the same way (performance may drop if more than one job runs on a server using one core). To make good use of these resources and to maximize the quality of the web service, a specific timer is created for each task. A timer is the length of time between two successive triggers of a specified event (task). At each trigger, a test determines whether it is useful to enter the core of the task. If it is not, IMGT/HighV-QUEST exits the task without doing anything. This early exit is important in order to minimize connections to remote resources, for two reasons. First, the automatic routines interact with the resources via SSH connections, which demand computing capacity and data transfer, so the timers avoid spending resources on doing nothing. Second, without such control the number of connections can increase dramatically, and the more connections there are, the more likely an exception is to occur, in which case the server status is set to 'down'. As noted above, performance differs between resources. That is why each resource has its own timer for monitoring jobs, in order to optimize, for example, the number of connections and resource use.
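The timer mechanism described above can be sketched as follows. The class and its return values are hypothetical; the sketch only illustrates the two ideas in the text, namely a per-task interval and a cheap usefulness test performed before any remote connection is opened.

```python
class TaskTimer:
    """Trigger a task at a fixed interval and skip its body when it is useless."""

    def __init__(self, interval_seconds, is_useful, task):
        self.interval_seconds = interval_seconds
        self.is_useful = is_useful  # cheap local test, no remote connection
        self.task = task            # costly part, e.g. work done over SSH
        self.next_due = 0.0

    def tick(self, now):
        """Call regularly with the current time; report what happened."""
        if now < self.next_due:
            return "not due"
        self.next_due = now + self.interval_seconds
        if not self.is_useful():
            return "skipped"  # exit without opening any connection
        self.task()
        return "ran"
```

Giving each resource its own `TaskTimer` with a different `interval_seconds` would let a slower resource be polled less often than a faster one.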

IMGT/HighV-QUEST administration and parameterization
IMGT/HighV-QUEST is designed for the analysis of large numbers of sequences. Multiple analyses of more than 100,000 sequences regularly run on the tool. Although the tool manages these jobs automatically, human intervention is sometimes necessary, especially in abnormal situations. To this end, an administration interface was developed that comprises three levels of expertise.
1. Regular administration: at this level the administrator can see submitted analyses, online users, and information on the functioning of local and remote servers. The administrator can cancel analyses or, in more urgent situations, cancel the jobs of an analysis directly on the remote server, hold queued analyses, etc. Using this level does not require special precautions.
2. Advanced administration: at this level the administrator can start or stop the scheduled tasks, back up database data in backup tables, and tell the tool to hold all analyses submitted from then on. This level needs to be used with caution.
3. Direct database and application context interaction: the SQL tool sends SQL queries directly to the server, and the direct context interaction facility changes the context attributes of the application that are used for management purposes. This level should be used only in very urgent situations and for (performance) experimentation, and care should be taken during the manipulations.
The parameterization of the tool lets the administrator change parameters that are rarely updated: adding a new resource, deleting a resource, adding logging e-mail addresses, modifying the database connection parameters, etc. XML was used for parameterization for the sake of simplicity and generality.

Figure 3 IMGT/HighV-QUEST statistical analysis 'Analysis results selection' table. The 'Analysis results selection' table shows the list of the analyses with their status. In this table, the user selects the analyses on which he/she wants to perform the statistical analysis, respecting the criteria defined in the text.

Figure 4 IMGT/HighV-QUEST statistical analysis 'Analyses list' table. The 'Analyses list' table recapitulates the list of the analysis results chosen for the statistical analysis. For each of them, it recalls the Title, Nb of sequences, IMGT/V-QUEST reference directory species and IMGT/V-QUEST receptor type or locus.

Figure 5 IMGT/HighV-QUEST statistical analysis 'Summary' table. The IMGT/HighV-QUEST statistical analysis 'Summary' table indicates the title of the statistical analysis (as entered by the user), recalls the version of IMGT/HighV-QUEST and IMGT/V-QUEST, the IMGT/V-QUEST reference directory release and the 'PARAMETERS' used for the analyses, and provides, in 'RESULTS', an overall view of the statistical analysis results that lead to the 'Result category' (see details in Figure 6). This distribution into categories gives the user, at first glance, an idea of how much he/she can rely on his/her data.

Figure 7 IMGT/HighV-QUEST statistical analysis Number of '1 copy' with 'single allele' and 'several alleles (or genes)' (for V, D and/or J) tables and histograms. The results are provided for each V, D or J gene and for any combination of them, for 'single allele' (on the left hand side) and for 'several alleles (or genes)' (on the right hand side). Below the tables, histograms are provided, per gene, for each concerned locus. Color code for histograms: green for V genes, red for D genes, yellow for J genes, green with red hatchings for the combination of V and D genes, green with yellow hatchings for the combination of V and J genes, green with red and yellow hatchings for the combination of V, D and J genes.

Figure 9 IMGT/HighV-QUEST statistical analysis gene histogram for '1 copy' with 'single allele'. The gene histogram is provided per Group [14] (here TRBJ) for '1 copy' with 'single allele'. For each gene and allele table, a gene histogram is shown, localizing the gene in the locus, with the number of sequences found between parentheses.

Figure 12 IMGT/HighV-QUEST statistical analysis CDR3-IMGT sets (identical nt and AA) tables. The CDR3-IMGT sets (identical nt and AA) tables show recall lines from the corresponding CDR3-IMGT tables for which sets contain at least two sequences with identical CDR3-IMGT (nt or AA). Below each recall line, details are provided with the number of sets and the number of sequences in each set (in nt and AA). This figure shows details for lines #2 to #6 (for which the number of sets is greater than 0) from Figure 10.

Figure 14 IMGT/HighV-QUEST statistical analysis Sequences in 'More than 1'. The table lists the sequences present in multiple copies. Green lines show the '1 copy' sequences that were taken into account for the detailed statistical analysis and blue lines show the 'More than 1' sequences that were filtered out from the detailed statistical analysis. A green line and the blue lines below it thus illustrate a single pool of identical sequences.

Figure 15 IMGT/HighV-QUEST statistical analysis Sequences with 'Warnings'. The table lists the sequences with 'Warnings' with their position number in the input file and their identifier. 'Warnings' are defined as having an identity percentage of less than 85% in the alignment with the germline gene and/or 'different CDR lengths'. To reduce uncertainty and increase the reliability of the statistical analysis, the sequences with 'Warnings' are omitted from the analysis.

Figure 16 IMGT/HighV-QUEST statistical analysis Outputs. The IMGT/HighV-QUEST statistical analysis outputs comprise six reports in PDF format. Separate graphical elements (figures in PNG format) are also included (as shown here) if the user made this choice during the submission.

2. Scheduled Tasks (ST) Layer: this layer is responsible for all job management aspects of IMGT/HighV-QUEST. It runs periodically and, each time, accomplishes its tasks by considering the current situation and the timers. It dispatches the analyses that are in the local queue, monitors previously launched analyses, prepares the results of completed analyses, deletes expired analyses and notifies the concerned user if an analysis will expire in five days. In the current configuration, it runs once every 60 seconds. Before exiting, it saves a history of the date and time of its execution and other important information. This information is very useful for administrators and lets them maximize the use of resources. Each exception that occurs in the ST layer is reported only once to the administrators and, if it is localized, is saved in the short term memory in order to prevent it from recurring.
3. Computational Resources (CR) Layer: this layer is where the actual scientific program is run in its standalone version to analyze the user sequences. IMGT/HighV-QUEST supports three modes for launching programs and monitoring jobs. In simple mode, the program is launched using simple BASH scripts and monitored using the process identifier (PID) of the concerned process. In PBS mode, the Portable Batch System (PBS) technology is used to launch and monitor jobs. Finally, the IBM LoadLeveler mode uses IBM LoadLeveler (version 3.5.1.13). All interactions with computational resources are done via SSH connections. This technology was chosen to enforce the security and integrity of transferred user data. The results of completed analyses are archived in a single file in ZIP format so that users of all operating systems can handle the outputs with their default archive tools. The results are then transferred to the local server via an SFTP connection to facilitate their download, and the concerned user is notified by e-mail (if he/she has chosen to be informed).
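The archiving step described in the CR layer, in which all results of a completed analysis are packed into a single ZIP file, can be sketched with the Python standard library. The function name and directory layout are assumptions, and the SSH and SFTP transfers are not shown.

```python
import os
import zipfile


def archive_results(results_dir, zip_path):
    """Pack every file under results_dir into one ZIP archive.

    zip_path should lie outside results_dir so that the archive does not
    include itself. Returns the list of names stored in the archive.
    """
    names = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for folder, _subdirs, files in os.walk(results_dir):
            for filename in sorted(files):
                full_path = os.path.join(folder, filename)
                # Store paths relative to the results directory.
                arcname = os.path.relpath(full_path, results_dir)
                archive.write(full_path, arcname)
                names.append(arcname)
    return names
```

A single archive in a widely supported format means that only one file has to be transferred to the local server and downloaded by the user.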