T-cell repertoire analysis and metrics of diversity and clonality
Current Opinion in Biotechnology, 2020

Recent developments in high-throughput bulk and single-cell sequencing technologies have accelerated our understanding of the complexity of immune repertoire dynamics, particularly when combined with transcriptomics. Profiling cellular repertoires in health or disease also requires statistical metrics that capture clonal diversity, characterized by clone frequency, repertoire richness and convergence. Here we present the common technologies for bulk and single-cell sequencing of T-cell receptors (TCRs), discuss current knowledge of computational tools that cluster TCR repertoires and predict their specificity from shared structural motifs, and review the main indices used for repertoire diversity and convergence analyses. These tools represent potential biomarkers for deciphering the fitness of immune repertoires in diseased or treated patients, and they illustrate the promise of computational approaches to advance personalized immunotherapy.


Estimating T-cell repertoire diversity by computational and mathematical modeling
Unlike the innate immune system, which is mobilized by general threats, adaptive immunity is highly specific to antigens and plays a central role in the fight against pathogens and cancer as well as in autoimmune or inflammatory diseases. Recognition of non-self or self-antigens is mainly driven by T and B cells. The efficacy of T-cell immunity in identifying peptide fragments of antigens bound to major histocompatibility complex (MHC) molecules depends on the diversity of the T-cell repertoire. The development of next-generation sequencing (NGS) and single-cell approaches revolutionized the characterization of immune repertoires by allowing massively parallel TCR sequencing [1,2]. This led to the development of a wide range of computational and mathematical tools to model interactions between TCR and peptide-MHC (pMHC) and to describe repertoire diversity. In the present review, we describe NGS approaches allowing structural characterization of TCRs, which is the basis of clustering models inferring shared antigen specificity of immune repertoires [3]. Aside from these specificity-based clustering models, we also present the different mathematical indices currently used to interpret TCR diversity and convergence of immune repertoires [4•]. However, defining a diversity measure that accounts for both the number of distinct clones and their frequencies in a repertoire is not trivial. Different diversity measures are therefore available, each giving a distinct weight to the relative frequency of clonotypes and thus capturing slightly different aspects of diversity. Moreover, experimental sampling only partially estimates the diversity of repertoires [5•]. Therefore, caution must be taken when interpreting and comparing immune repertoire diversity within and across studies.

TCR structural diversity driving antigen specificity
Most human T-cell receptors are disulfide-linked α/β heterodimers, with each chain composed of a constant and a variable domain [6]. These chains are formed by somatic rearrangement of the variable (V), diversity (D), and joining (J) gene segments together with random addition or deletion of nucleotides [7]. These diversification mechanisms yield a huge variety of TCRs [3]. TCR diversity is confined to six variable hairpin loops located in the α/β variable domains, named complementarity-determining regions (CDRs), with three CDRs per chain (CDR1α, CDR2α and CDR3α, and CDR1β, CDR2β and CDR3β, respectively). Because of the way V(D)J recombination proceeds, CDRs 1 and 2 are entirely encoded in germline DNA segments, whereas the CDR3α and CDR3β loops are products of junctional diversity and are consequently the most variable [8,9]. The binding between TCRs and peptide antigens displayed by MHC is of relatively low affinity [10,11] and is degenerate, meaning that many TCRs recognize the same peptide antigen and many peptide antigens are recognized by the same TCR [12,13]. During recognition events, CDR1α, CDR1β, CDR2α and CDR2β contact the MHC [14,15], while CDR3α and CDR3β directly contact the peptide antigen [16,17••] (Figure 1). However, all six CDRs might be involved in antigen recognition [18,19]. As shown in Figure 1C, the direct contact between the peptide and CDR1α is an exception to the rule that CDR3s are responsible for peptide specificity, while the limited contacts between CDR3β and residues Trp5 and Met4 argue against the idea that CDR3β, above all, drives peptide recognition. This shows that taking all CDRs, as well as detailed structural aspects [20], into account in TCR clustering approaches might be necessary to achieve the highest efficacy [21,22].

Sequencing approaches to capture TCR diversity
Although the diversity of immune repertoires was difficult to appreciate in the past, the arrival of NGS revolutionized the field of TCR analysis and promoted the emergence of several high-throughput TCR sequencing (TCR-Seq) assays to characterize T-cell repertoires. The first factor to consider for TCR sequencing is the source of material, i.e. DNA or RNA. DNA has been widely used owing to its stable number of copies per cell, which allows straightforward quantification of clonotype frequency. However, DNA-based methods are less sensitive and do not account for allelic exclusion, therefore overestimating diversity. Conversely, RNA is less stable and its expression level may vary from cell to cell, which affects TCR quantification [1]. Nevertheless, RNA-based methods are more sensitive, circumvent the allelic exclusion issue and allow the implementation of unique molecular identifiers (UMIs) that correct for amplification and sequencing errors [23].
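The role of UMIs in quantification can be illustrated with a minimal Python sketch. It assumes that each read has already been assigned a UMI and a clonotype, and it counts distinct UMIs per clonotype so that PCR duplicates of one original molecule are counted once. The function name and the example sequences are hypothetical, and the sketch addresses amplification bias only: correcting sequencing errors would additionally require building a consensus sequence per UMI, which is not shown.

```python
from collections import defaultdict

def count_clonotypes_by_umi(reads):
    """Count molecules per clonotype as the number of distinct UMIs.

    reads: iterable of (umi, clonotype) pairs, one pair per sequencing read.
    """
    umis_per_clonotype = defaultdict(set)
    for umi, clonotype in reads:
        umis_per_clonotype[clonotype].add(umi)
    return {clonotype: len(umis) for clonotype, umis in umis_per_clonotype.items()}

# Illustrative reads (made-up UMIs and CDR3 sequences).
reads = [
    ("AACG", "CASSLGQAYEQYF"),
    ("AACG", "CASSLGQAYEQYF"),  # PCR duplicate of the molecule above
    ("AACG", "CASSLGQAYEQYF"),  # another duplicate
    ("TTGA", "CASSLGQAYEQYF"),  # a second original molecule
    ("GGCT", "CASSPDRGNTEAFF"),
]
umi_counts = count_clonotypes_by_umi(reads)
print(umi_counts)
```

Counting raw reads would give a 4:1 ratio between the two clonotypes here, whereas counting UMIs gives 2:1.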

Bulk TCR sequencing
Among the latest high-throughput sequencing methods for the analysis of bulk immune populations, three main technical concepts have emerged: 1) TCR amplification by multiplex PCR [24], 2) addition of common adapters prior to PCR amplification [25-27] and 3) TCR amplification following gene capture [28]. Multiplex PCR is the most commonly used, but heterogeneity in primer efficiency introduces bias during amplification, leading to misrepresentation of the relative proportion of clones [29]. In addition to multiplex amplification, strategies that add a common adapter to the 5' end for amplification were developed, such as 5'RACE PCR [25]. Like other ligation-based methods, 5'RACE is limited by the suboptimal ligation efficiency of the adapter [30]. This affects quantification accuracy and the detection of low-frequency TCRs and could explain why 5'RACE was shown to be less reproducible than multiplex PCR [1,31]. Altogether, biases introduced by current bulk methods affect repertoire analyses and hamper the pairing of α/β chains required for functional analyses and therapeutic applications. To address this, Howie and colleagues introduced a new concept to pair TCR chains based on multiple sequencing runs of the same sample and combinatorial analyses [32]. This high-throughput method, called pairSEQ, requires a large number of cells from a given clone to allow chain pairing, thus limiting its application to large samples and highly represented clones.

Single-cell TCR sequencing
In recent years, several single-cell approaches have emerged that allow pairing of α and β chains, potentially in association with transcriptional profiling [2,33•]. Originally, physical single-cell isolation combined with multiplex PCR and Sanger sequencing [34] or high-throughput sequencing [35] was developed to obtain paired TCRαβ sequences. Using a single-cell barcoding strategy, Han and colleagues increased the number of cells that could be sequenced and, in addition to determining both TCR chains together, could sequence specific genes linked to T-cell functionality [35]. However, these methods only allowed hundreds to a few thousand cells to be sequenced. A major improvement in the throughput of single-cell TCR sequencing (scTCR-Seq) came with emulsion-based approaches. Using microfluidics, water-in-oil emulsion droplets each containing a single cell trapped with small volumes of reagents are created, greatly increasing the number of cells analyzed. One method using emulsion-trapped cells paired the α and β transcripts by overlap extension reverse-transcription PCR directly within the droplet [36]. Despite the high number of cells used as starting material, the yield was low, which affected the detection of rare clones. A few years later, an updated version was released as a new platform for T-cell repertoire analysis [37]. This low-cost technology allowed, for the first time, full high-precision profiling of TCR sequences from millions of cells. Recently, Spindler and colleagues presented a high-throughput method linking TCR identification with direct functional testing to determine TCR reactivity and avidity using a microfluidics-based system [38]. Currently, a commercially available and easy-to-use system is widely used for single-cell profiling of immune cells, for instance to characterize intratumoral immune populations or clonal changes upon anti-PD-1 therapy [39••,40]. This microfluidics technology, developed by 10x Genomics, generates so-called Gel Beads-in-emulsion containing bead-attached primers with DNA barcodes that capture polyadenylated mRNA and yield barcoded cDNA. Although the 10x Genomics approach detects fewer genes than other single-cell RNA sequencing (scRNA-Seq) methods, it can cover up to 15,000-20,000 cells and can combine scTCR-Seq with transcriptional profiling of T-cell subsets. Other commercially available single-cell encapsulation methods are being developed, but major drawbacks of these technologies are the need for microfluidics devices, which are not always accessible to research laboratories, and the high cost of ready-to-use assays such as those from 10x Genomics. Moreover, scalability is often limited compared to bulk sequencing because of the microfluidics technology itself, and the yield can be low: the 10x Genomics system reaches 50-60% successful cell encapsulation. An overview of the applications and limitations of the aforementioned bulk and single-cell TCR-sequencing methods is presented in Figure 2. Although attractive for multiplexed data, single-cell transcriptomic profiling requires material with high cellular viability and substantial downstream computational analysis. Nevertheless, the major developments in single-cell immune repertoire sequencing coupled to transcriptomic signatures are shedding new light on T-cell clonality and dynamics across a wide range of applications, such as the development and improvement of immunotherapeutic treatments in cancer research.

Specificity clustering of TCR based on sequence similarity architecture
The prediction of epitopes recognized by a repertoire of T cells (i.e. the epitome) from TCR sequences remains one of the biggest challenges for cellular and computational immunologists. Identifying TCRs by deep sequencing of immune repertoires allows the discovery of receptor patterns that might be linked to antigen specificity or to clinical outcomes. Recent computational studies demonstrated that common patterns can be inferred among TCR sequences interacting with the same epitope [

Global and local motifs similarity: the GLIPH algorithm
Analysis of 52 TCR-pMHC structures highlighted the possible determination of pMHC contact sites in CDR3s, notably in CDR3β, as an opportunity to cluster with a high probability TCRs on the basis of the prediction of shared specificity [17••]. Based on this assumption, the authors developed a clustering algorithm, called GLIPH (grouping of lymphocyte interactions by paratope hotspots), built on global and local TCR sequences similarity. GLIPH specificity groups, likely to recognize the same or very similar MHC ligands, are scored based on the enrichment of common V-genes, the CDR3 lengths, clonal expansions, shared HLA alleles among contributors, motif significance and cluster size. When benchmarking GLIPH on a training set of 2,068 unique sequences spanning eight pMHC specificities, 94% of TCRs were correctly grouped in clusters of TCRs with common specificity, even when originating from different donors. Such an approach could be used to predict the specificity of a new TCR, by verifying its affiliation to a specificity group determined by GLIPH. Essentially, it also provides information regarding a given immune response and its complexity through the analysis of the number and size of the clusters determined by GLIPH.

Distance measure: the TCRdist algorithm
Also based on sequence similarity, Dash and colleagues defined a novel distance measure on the space of TCRs, TCRdist, allowing for clustering and visualization of repertoire diversity [41••]. This quantitative measure of similarity is obtained by listing the residues belonging to the CDR1, 2 and 3 loops, all known to possibly contact the pMHC, and by computing a similarity-weighted mismatch distance based on the BLOSUM62 substitution matrix, with a gap penalty to capture variations in the length of CDRs. Of note, a higher weight was given to the CDR3 sequence in view of its prominent role in epitope binding. This distance can then be calculated for each possible pair of TCRs belonging to a given repertoire, generating a so-called distance matrix. It can be used for TCR clustering or for the construction of hierarchical distance trees to analyze the diversity and complexity of TCR repertoires. The high-dimensional TCR landscape can also be projected onto two-dimensional plots, with each dot representing a TCR, through dimensionality reduction of this distance matrix. Using these analytical tools based on their definition of the distance between two TCRs, the authors found that TCR repertoires often contain dominant clusters of TCRs whose sequence similarity arises partly from the use of common V- and J-regions and partly from the similarity of CDR3 motifs. Moreover, each epitope-specific repertoire contained a clustered group of receptors with strong sequence similarities, together with divergent non-clustered receptors, both providing different solutions to the pMHC binding challenge. Finally, they highlighted key conserved residues driving TCR binding to pMHC.
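The idea of a CDR-weighted mismatch distance and its distance matrix can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the published TCRdist implementation: it uses a flat mismatch cost instead of BLOSUM62-derived costs, compares loops position by position from their first residue, and the numerical values of the mismatch cost, gap penalty and CDR3 weight are assumptions chosen for the example. The function names and the example loop sequences are hypothetical.

```python
def loop_distance(a, b, mismatch_cost=4.0, gap_penalty=4.0):
    """Distance between two CDR loops given as amino-acid strings.

    Simplification: residues are compared position by position from the
    start of the loop, and each residue of length difference costs one gap.
    """
    shared = min(len(a), len(b))
    mismatches = sum(x != y for x, y in zip(a[:shared], b[:shared]))
    gaps = abs(len(a) - len(b))
    return mismatch_cost * mismatches + gap_penalty * gaps

def tcr_distance(tcr_a, tcr_b, cdr3_weight=3.0):
    """Sum of loop distances over CDR1, CDR2 and CDR3, with CDR3 up-weighted."""
    total = 0.0
    for loop in ("cdr1", "cdr2", "cdr3"):
        weight = cdr3_weight if loop == "cdr3" else 1.0
        total += weight * loop_distance(tcr_a[loop], tcr_b[loop])
    return total

def distance_matrix(tcrs):
    """Symmetric matrix (list of lists) of pairwise TCR distances."""
    n = len(tcrs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = tcr_distance(tcrs[i], tcrs[j])
    return matrix

# Illustrative single-chain receptors (made-up loop sequences).
tcrs = [
    {"cdr1": "SGHAT", "cdr2": "FQNNGV", "cdr3": "CASSLGQAYEQYF"},
    {"cdr1": "SGHAT", "cdr2": "FQNNGV", "cdr3": "CASSLGRAYEQYF"},  # one CDR3 mismatch
    {"cdr1": "MNHEY", "cdr2": "SVGAGI", "cdr3": "CASSPDRGNTEAFF"},
]
matrix = distance_matrix(tcrs)
```

A matrix of this kind is the input that hierarchical clustering or a dimensionality-reduction method would take to produce the trees and two-dimensional projections described above.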

Clustering based on TCRs biophysicochemical properties
Recently, Ostmeyer and colleagues introduced a novel class of methods for analyzing immune repertoires of patients in order to cluster and identify disease-associated TCRs [42••]. Their approach consists of feeding machine-learning techniques, based on logistic regression, with biophysicochemical descriptors of the TCR interface rather than with TCR sequences. The biophysicochemical characteristics of sliding windows of four consecutive residues of CDR3β (so-called 4-mers), excluding the first four and last three residues, are described using five Atchley factors encoding codon diversity, secondary structure, molecular size, polarity and electrostatic charge of the residues. The method identified a short list of preferred values for these descriptors at key positions in TCRs present in tumors, which permitted the identification of disease-associated TCRs. Although this approach leads to the hypothesis that these TCRs share the same specificity, this was not validated. In addition, restricting the analysis to 4-mers of CDR3β, a choice resulting from the analysis of a small number of TCR-pMHC structures, constitutes a limitation of the method. Nevertheless, it represents a first step in the direction of physics-based predictors that can potentially condense the extremely large sequence diversity of immune receptors into a limited number of quantitative characteristics at key positions. Although the method needs to be retrained for each set of TCRs and remains restricted to CDR3β, which limits its predictive ability, this type of property-based approach could circumvent some of the drawbacks of purely sequence-based analyses. Indeed, very large numbers of disease-associated TCR sequences are no longer necessary for training, and it becomes possible to detect potential antigen-binding TCRs whose sequences diverge from those previously encountered. This approach can also be used to cluster and analyze TCR repertoires, by defining a possible distance between two receptors as the difference between the five Atchley factors of the most similar pair of 4-mers taken from their respective CDR3β (clustering tree example in the abstract figure).
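The sliding-window step described above can be sketched as follows. The function name and the example CDR3β sequence are hypothetical, and the sketch stops at the extraction of 4-mers: mapping each residue to its five Atchley factors would require the published factor table, which is not reproduced here.

```python
def cdr3_kmers(cdr3, k=4, skip_start=4, skip_end=3):
    """Return the sliding k-mers of a CDR3 sequence after trimming its ends.

    The defaults follow the description in the text: windows of four
    residues, excluding the first four and last three residues.
    """
    core = cdr3[skip_start:len(cdr3) - skip_end]
    # An empty list is returned when the trimmed core is shorter than k.
    return [core[i:i + k] for i in range(len(core) - k + 1)]

# Illustrative CDR3β sequence (made up for the example).
kmers = cdr3_kmers("CASSLGQAYEQYF")
print(kmers)
```

Each 4-mer would then be encoded as 4 x 5 = 20 numerical descriptors before being passed to the classifier or to a distance calculation.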

Quantifying clonality, diversity and convergence of TCR repertoires
Aside from the diversity in antigenic specificity of T-cell repertoires (i.e. the epitome), clonotype diversity can capture immune fitness during disease development or in response to treatment. Numerous computational algorithms that analyze TCR sequence reads and characterize repertoire clonality have been established [45]. The broad structural diversity of TCRs makes the analysis of immune repertoires challenging but allows fingerprinting of T-cell clones, which can be tracked within different tissues (peripheral blood, tumor tissue, adjacent normal tissue, etc.) and at different time-points in immune profiling studies. In the past years, several studies centered their analyses on TCR repertoire dynamics as indicators for immune monitoring in inflammatory diseases such as multiple sclerosis [46], autoimmune diseases [47], viral infection [48,49] or cancer [43••,50-52], as well as biomarkers of response to immunotherapy [40,53-55••]. Models for visualizing immune repertoires and statistically derived descriptive indices for estimating repertoire diversity and homology have therefore emerged, although no consensus analytical method has been described [4•]. In the following section, we recapitulate the main indices characterizing diversity and similarity of T-cell repertoires and discuss their limitations.

Diversity measures: Hill numbers and Rényi entropy
Most diversity indices are mathematically derived from information theory and are widely used in ecology to quantify ecosystem biodiversity [5•,56]. In T-cell repertoires, diversity takes into account the clonal composition, that is, the number of unique TCR sequences (hereafter referred to as richness), and the distribution of these sequences, that is, their relative abundance (hereafter referred to as evenness). Diversity relates to the level of uncertainty in predicting which T-cell clone (i.e. unique TCR sequence) a TCR sequence drawn at random from a repertoire belongs to. Commonly used measures of diversity are related to the Hill numbers, also referred to as effective numbers of species, from which one can retrieve the effective number of distinct clonotypes (i.e. the number of equally abundant sequences producing the given value of diversity) in the dataset [57,58]:

$$ {}^{\alpha}D = \left( \sum_{i=1}^{N} p_i^{\alpha} \right)^{\frac{1}{1-\alpha}} \tag{1} $$

where $p_i$ is the frequency of sequence $i$ in the repertoire and $N$ is the total number of unique sequences. The order α parametrizes the diversity index and allows different features of immune repertoire diversity to be calculated. The Hill diversity numbers are based on a generalized measure of entropy, the Rényi entropy, which quantifies the diversity or randomness of a system [4•,58,59]:

$$ H_{\alpha} = \frac{1}{1-\alpha} \log_b \left( \sum_{i=1}^{N} p_i^{\alpha} \right) \tag{2} $$

where $b$, the base of the logarithm, determines the choice of units of the entropy measure.
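As a minimal sketch, the Hill number of order α and the corresponding Rényi entropy can be computed from a list of clonotype frequencies as shown below. The function names are hypothetical, the frequencies are assumed to sum to 1, and the order α = 1, where the general expression is undefined, is handled through its limit, the exponential of the Shannon entropy.

```python
import math

def hill_number(freqs, alpha):
    """Effective number of clonotypes of order alpha.

    freqs: clonotype frequencies, assumed to sum to 1.
    """
    p = [f for f in freqs if f > 0]
    if math.isclose(alpha, 1.0):
        # Limit alpha -> 1: exponential of the Shannon entropy (natural log).
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** alpha for x in p) ** (1.0 / (1.0 - alpha))

def renyi_entropy(freqs, alpha, base=math.e):
    """Rényi entropy of order alpha, i.e. the logarithm of the Hill number."""
    return math.log(hill_number(freqs, alpha), base)

even = [0.25, 0.25, 0.25, 0.25]   # four equally abundant clonotypes
skewed = [0.7, 0.1, 0.1, 0.1]     # one dominant clonotype

for alpha in (0, 1, 2):
    print(alpha, hill_number(even, alpha), hill_number(skewed, alpha))
```

For the even repertoire every order returns four effective clonotypes, whereas for the skewed repertoire the effective number decreases as α increases, because higher orders give more weight to the dominant clonotype.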

Diversity of order 1: Shannon entropy
The order α sets the degree of sensitivity of the diversity index to species abundance in the system. When α→0, all species are weighted equally and (1) is equivalent to species richness, meaning the number of unique sequences in a repertoire independently of their abundance. When α→1, the generalized form of the entropy (2) is equivalent to the Shannon entropy or Shannon diversity index [60]:

$$ H_{1} = -\sum_{i=1}^{N} p_i \log_b p_i $$

Figure 3A shows that monoclonal (i.e. 1 TCR) and oligoclonal repertoires (i.e. emergence of a few dominant clones) have a Shannon index closer to 0. Moreover, when there is a unique dominant clone and the other clones are evenly represented, the Shannon index is higher than in the case of oligoclonality, owing to the higher uncertainty of the outcome of picking one sequence in the repertoire in the first case. Thus, when a repertoire is composed of evenly distributed sequences, the Shannon entropy reaches its maximum (i.e. maximal diversity), which is the logarithm of the number of unique sequences. Because this index is widely described, it is often used in immune studies. For example, when profiling dynamic changes in the peripheral T-cell repertoire during cervical carcinogenesis, the Shannon entropy index revealed a drop in diversity in patients with advanced cancer, potentially reflecting the emergence of expanded clones [50]. Shannon entropy was also used to discriminate diversity changes in melanoma-bearing mice receiving different combinations of immunotherapy [61] and was linked to clinical prognosis in patients with advanced lung cancer [62].

Diversity of order 2: Gini-Simpson index
Finally, when α→2, the generalized entropy formula (2) becomes:

$$ H_{2} = -\log_b \left( \sum_{i=1}^{N} p_i^{2} \right) = -\log_b \lambda $$

where $\lambda = \sum_{i=1}^{N} p_i^{2}$ represents the Simpson index [63], the probability that two entities chosen randomly in a system (sampling with replacement) belong to the same species. To follow the intuitive principle that a high index expresses high diversity, one commonly uses unity minus the Simpson index, referred to as the Simpson diversity index or Gini-Simpson index:

$$ 1 - \lambda = 1 - \sum_{i=1}^{N} p_i^{2} $$

with a value close to 0 characterizing a repertoire with no diversity (i.e. highly oligoclonal) and 1 representing infinite diversity (i.e. a polyclonal repertoire with equivalent representation of each clone). In Figure 3A, the highly diverse scenarios (#4, #7, #10 and #13) have a Gini-Simpson index that increases with higher richness and approaches 1. Like the Shannon entropy, the Gini-Simpson index decreases with the appearance of dominant clones, since the probability that two selected sequences are different drops. Several studies of repertoire diversity use the Gini-Simpson index rather than the Shannon entropy. Lately, it was applied to describe the clonal architecture of patients with adult T-cell leukemia/lymphoma [64] or to assess the clinical prognostic value of T-cell repertoires from peripheral blood or metastases in patients with primary melanoma [51•].
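The behavior of the Shannon entropy and the Gini-Simpson index on monoclonal, oligoclonal and even repertoires can be checked with a short sketch. The function names and the clone-size vectors are made up for illustration and do not reproduce the scenarios of Figure 3A; natural logarithms are assumed for the Shannon entropy.

```python
import math

def frequencies(counts):
    """Convert clone sizes into frequencies, ignoring empty clones."""
    sizes = [c for c in counts if c > 0]
    total = sum(sizes)
    return [c / total for c in sizes]

def shannon_entropy(counts, base=math.e):
    """Shannon entropy of a repertoire given as clone sizes."""
    return -sum(p * math.log(p, base) for p in frequencies(counts))

def gini_simpson(counts):
    """One minus the Simpson index (probability that two draws match)."""
    return 1.0 - sum(p * p for p in frequencies(counts))

repertoires = {
    "monoclonal": [100],
    "oligoclonal": [45, 45, 5, 5],
    "even, 4 clones": [25, 25, 25, 25],
    "even, 100 clones": [1] * 100,
}
for name, counts in repertoires.items():
    print(f"{name:>17}: Shannon = {shannon_entropy(counts):.3f}, "
          f"Gini-Simpson = {gini_simpson(counts):.3f}")
```

In this toy comparison the Shannon entropy of an even repertoire equals the logarithm of its richness, while the Gini-Simpson index of an even repertoire of N clones equals 1 - 1/N and therefore approaches 1 as richness grows.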

Entropy-based diversity indices limitations
As mentioned, the α order determines the indices' sensitivity to rare or common species. Orders lower than 1 reflect a diversity measure highly affected by the number of rare species whereas increasing α orders tend to be more sensitive to abundant species and when α=1, each species is weighted by its proportional abundance [65]. Therefore, the Shannon diversity index encounters higher variation upon addition of low frequency clones than the Gini-Simpson index. In Figure 3A, the Gini-Simpson index, in contrast to the Shannon entropy, is barely affected by the increasing number of unique TCRs in the repertoire. Moreover, within a repertoire composed of equal numbers of unique TCRs, the Shannon entropy is more impacted by the presence of low frequency clones than the Gini-Simpson index. Most of the studies do not mention the rationale behind the choice of the diversity indices. Moreover, all these diversity indices behaving non-linearly, caution should be taken when correlating them to biological interpretation and statistical tests should be adapted. The best way to correctly interpret these entropy-derived measures would be to analyze them simultaneously (i.e. "diversity profiles") to be able to derive any biological meaning from the observed differences [66•].

Evenness measure: Pielou's index
Aside from the degree of uncertainty and heterogeneity of a system, the equivalency in species abundance can also be described. This measures the dominance of clones in a repertoire and is thus referred to as clonal diversity or clonal evenness. In a study describing changes in peripheral blood TCR diversity upon ipilimumab treatment in metastatic melanoma, the authors characterized clonal diversity as the ratio between the number of sequences accounting for 50% of the total repertoire abundance (i.e. the cumulative frequency of these sequences) and the repertoire richness [67]. This measurement, referred to as diversity evenness 50 (DE50), was used to describe increasing oligoclonal responses in TILs from melanoma-bearing mice treated with an optimal combination immunotherapy [61]. In parallel to DE50, the clonal evenness of a repertoire can be calculated using Pielou's index, which is the ratio between the Shannon entropy and its maximal value, reached when species are evenly distributed within a sample [68]:

$$ J' = \frac{H_{1}}{\log_b N} = \frac{-\sum_{i=1}^{N} p_i \log_b p_i}{\log_b N} $$

As shown in Figure 3A, the complement of clonal evenness (1-Pielou's index) is often used to obtain a clonality score, where 0 represents a maximally diverse population with even frequencies and values close to 1 represent a repertoire driven by clonal dominance. Figure 3A also shows that, even though the abundance of dominant clones in repertoires #3, #6, #9 and #12 is identical, clonal evenness increases since the dominance of these oligoclonal sequences is more important in the case of a repertoire with high richness. In a study examining peripheral and tumoral T-cell clonality in patients with metastatic melanoma treated with immunotherapy drugs, an association between clonal expansion, represented by 1-Pielou's index, and clinical response was highlighted [53]. Recently, T-cell repertoires obtained from 236 NSCLC patients showed higher TCR clonality, measured by 1-Pielou's evenness, in healthy tumor-adjacent tissue compared to tumor tissue, suggesting an impaired antigenic response [43••].
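The two evenness-related measures described above, DE50 and the clonality score 1-Pielou's index, can be sketched from a list of clone sizes as follows. The function names and the example repertoires are hypothetical. Returning an evenness of 0 for a repertoire with a single clonotype, where Pielou's ratio is undefined, is a convention assumed for this sketch so that a monoclonal repertoire receives a clonality of 1.

```python
import math

def pielou_evenness(counts):
    """Shannon entropy divided by its maximum, log(richness)."""
    sizes = [c for c in counts if c > 0]
    if not sizes:
        raise ValueError("repertoire is empty")
    if len(sizes) == 1:
        return 0.0  # assumed convention: the ratio is undefined for one clonotype
    total = sum(sizes)
    shannon = -sum((c / total) * math.log(c / total) for c in sizes)
    return shannon / math.log(len(sizes))

def clonality(counts):
    """Clonality score: 0 for an even repertoire, close to 1 for dominance."""
    return 1.0 - pielou_evenness(counts)

def de50(counts):
    """Fraction of clonotypes (largest first) needed to reach 50% of all cells."""
    sizes = sorted((c for c in counts if c > 0), reverse=True)
    if not sizes:
        raise ValueError("repertoire is empty")
    half = sum(sizes) / 2
    running = 0
    for rank, size in enumerate(sizes, start=1):
        running += size
        if running >= half:
            return rank / len(sizes)

even = [25, 25, 25, 25]
dominated = [70, 10, 10, 10]
print(clonality(even), de50(even))
print(clonality(dominated), de50(dominated))
```

Because Pielou's index is a ratio of two logarithms, the result does not depend on the base of the logarithm.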

Inequality measure: Gini coefficient
Another index, the Gini coefficient (not to be confused with the Gini-Simpson index), is sometimes used to represent the clonal distribution of a repertoire. It is a measure of inequality that is widely used in economics to study wealth distribution [69]. It quantifies the balance of a system (i.e. evenness of distribution) rather than its variety (i.e. species richness) [70]:

$$ G = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} \left| p_i - p_j \right|}{2 N^{2} \bar{p}} $$

with $p_i$ and $p_j$ the frequencies of the $i$th and $j$th sequences in the repertoire and $\bar{p}$ the average of clone frequencies. The Gini coefficient ranges from 0, representing maximal diversity of the repertoire (i.e. equal abundance of each sequence), to 1, with high values representing extreme inequality (i.e. high clonality towards one sequence). Thus, in Figure 3A, the Gini coefficient increases as the number of abundant clones rises, which further reduces the frequency of less represented clones (i.e. higher inequality). Moreover, with increasing richness of repertoires, the inequality between dominant and sub-dominant (low-frequency) clones gets wider, leading to a small rise in the Gini coefficient. In a recent study interpreting T-cell evolution in melanoma patients treated with checkpoint inhibitors, repertoire clonality was assessed using the Gini coefficient [55••]. Moreover, a linear discriminant analysis was built to distinguish patients based on their clinical response using clonal dominance (i.e. Gini coefficient) and diversity (i.e. Rényi entropy with α=1) as repertoire features.
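A direct implementation of the mean-absolute-difference definition of the Gini coefficient is sketched below. The function name and example repertoires are hypothetical. The double loop is quadratic in the number of clonotypes, which is acceptable for a small illustration but would be replaced by a sort-based formula for large repertoires.

```python
def gini_coefficient(counts):
    """Gini coefficient of clone sizes (or, equivalently, clone frequencies).

    Uses G = sum_i sum_j |x_i - x_j| / (2 * N^2 * mean(x)); the result is
    unchanged if all values are rescaled, so counts and frequencies agree.
    """
    sizes = [c for c in counts if c > 0]
    if not sizes:
        raise ValueError("repertoire is empty")
    n = len(sizes)
    mean = sum(sizes) / n
    total_difference = sum(abs(a - b) for a in sizes for b in sizes)
    return total_difference / (2 * n * n * mean)

even = [25, 25, 25, 25]
dominated = [97, 1, 1, 1]
print(gini_coefficient(even))
print(gini_coefficient(dominated))
```

With this definition the maximum for a repertoire of N clonotypes is 1 - 1/N, so values close to 1 are only reachable when richness is high and a single clonotype dominates.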

Repertoires overlap measures
Aside from measures of diversity and clonality that are applied to a single repertoire, TCR sequencing data also call for similarity analyses that compare the overlap between T-cell repertoires. A first similarity index, the Jaccard index, is defined as the number of overlapping species divided by the size of the union of both compared samples [71]:

$$ J_{ij} = \frac{c_{ij}}{N_i + N_j - c_{ij}} $$

with $c_{ij}$ being the number of overlapping sequences and $N_i$ and $N_j$ the total numbers of sequences in repertoires $i$ and $j$, respectively. A related index, the Sorensen index or Sorensen-Dice coefficient, differs by counting the shared sequences twice (once in both the numerator and the denominator) [72,73]:

$$ SD_{ij} = \frac{2 c_{ij}}{N_i + N_j} $$

Both indices vary from 0 (no similarity) to 1 (total similarity between repertoires). From the Sorensen index, the Bray-Curtis index of dissimilarity can be deduced as its complement (i.e. Bray-Curtis index = 1-Sorensen index) [74]. All these similarity indices are based on the presence or absence of specific sequences, therefore retaining sensitivity in more heterogeneous repertoires but not taking into consideration the relative abundance of the overlapping sequences. Thus, repertoire homology between healthy tumor-adjacent tissue and tumor tissue based only on the Jaccard index is not robust enough to support any conclusion, and other metrics should be used in parallel [43••]. To this end, the Morisita-Horn overlap index, which considers the relative frequency of species in the compared samples, is a widely used measure of dispersion [75,76]:

$$ C_{MH} = \frac{2 \sum_{i=1}^{S} f_i g_i}{\sum_{i=1}^{S} f_i^{2} + \sum_{i=1}^{S} g_i^{2}} $$

with $f_i = n_{1i}/n_1$ and $g_i = n_{2i}/n_2$, $n_{1i}$ and $n_{2i}$ being the clone sizes of the $i$th sequence (i.e. entities representing a sequence) and $n_1$ and $n_2$ the total numbers of entities in samples 1 and 2, respectively. $S$ is the total number of unique sequences found in both samples. The index ranges from 0 (i.e. no overlap between repertoires) to 1 (i.e. repertoires identical in terms of richness and evenness). The Morisita-Horn index can be used to compare immune repertoires during viral infection [49], among different T-cell compartments in cancer patients [43••], to observe T-cell repertoire turnover upon treatment [53] or to track persistence of clones from an immune therapeutic product in peripheral blood after adoptive cell transfer [54].
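The difference between presence/absence indices and an abundance-weighted index can be made concrete with a small sketch. Repertoires are represented here as dictionaries mapping a clonotype label to its clone size; this representation, the function names and the toy repertoires are assumptions made for the example, and both repertoires are assumed to be non-empty.

```python
def jaccard(rep_i, rep_j):
    """Shared clonotypes divided by the size of the union."""
    shared = len(set(rep_i) & set(rep_j))
    return shared / (len(rep_i) + len(rep_j) - shared)

def sorensen(rep_i, rep_j):
    """Twice the shared clonotypes divided by the sum of both richness values."""
    shared = len(set(rep_i) & set(rep_j))
    return 2 * shared / (len(rep_i) + len(rep_j))

def morisita_horn(rep_1, rep_2):
    """Abundance-weighted overlap computed from clone frequencies."""
    n_1, n_2 = sum(rep_1.values()), sum(rep_2.values())
    clonotypes = set(rep_1) | set(rep_2)
    f = {c: rep_1.get(c, 0) / n_1 for c in clonotypes}
    g = {c: rep_2.get(c, 0) / n_2 for c in clonotypes}
    numerator = 2 * sum(f[c] * g[c] for c in clonotypes)
    denominator = sum(v * v for v in f.values()) + sum(v * v for v in g.values())
    return numerator / denominator

# Toy repertoires: clonotype label -> clone size.
blood = {"A": 50, "B": 30, "C": 20}
tumor = {"A": 10, "B": 10, "D": 80}

print(jaccard(blood, tumor))
print(sorensen(blood, tumor))
print(1 - sorensen(blood, tumor))   # Bray-Curtis dissimilarity as defined in the text
print(morisita_horn(blood, tumor))
```

In this toy case half of all clonotypes are shared, yet the Morisita-Horn index is low because the shared clonotypes are abundant in one repertoire and rare in the other.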

Undersampling -"unseen species" problem
All aforementioned metrics are widely used to profile T-cell repertoires. However, because of the high diversity of TCR sequences and limitations in sequencing methods, the estimated frequency distribution of clones in a repertoire and its richness are largely biased by the fact that only a fraction of the repertoire is analyzed, leading to undersampling (i.e. the "unseen species" problem) [5•]. This translates into biases in diversity measures, as shown in Figure 3B, where 18 cells were sampled out of a repertoire composed of 180 cells with 10 unique clones. Undersampling was repeated ten million times to obtain the frequency of occurrence of the most probable scenarios. The top five scenarios and five additional ones selected randomly with a Monte Carlo approach are shown. Strikingly, the probability of each subsampled scenario is low, even for scenario #1, which recapitulates the richness and evenness of the total repertoire. This illustrates the heterogeneity in clone distribution obtained by sampling a large TCR repertoire. The fold changes between each undersampled scenario and the total repertoire are highlighted for four diversity metrics. In the didactic example shown, clonal evenness (i.e. 1-Pielou's evenness) is the index most affected by undersampling, as the clonal distribution is biased relative to the total repertoire. The Gini coefficient, which also relies on clone distribution, can be less sensitive to undersampling because unique TCR sequences present in the total repertoire disappear, balancing the inequality brought by changes in frequency distribution, as in scenarios #7 and #8. Between the two diversity indices derived from the Rényi entropy, the Shannon entropy is more sensitive to undersampling than the Gini-Simpson index, mostly because of changes in the number of low-frequency clones (i.e. repertoire richness). In addition, scenarios #4 and #5 present a case of homogeneity between the indices because the number and frequency of each TCR sequence are stable. However, the indices do not show that different sequences were sampled from the original repertoire, each scenario capturing a different structural diversity.
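The effect of undersampling can be explored with a small Monte Carlo sketch in the spirit of the example above: 18 cells are drawn without replacement from a repertoire of 180 cells and 10 unique clones. The clone-size distribution of the original example is not given in the text, so an even repertoire of 18 cells per clone is assumed here; the function names, the number of trials and the random seed are likewise choices made for illustration.

```python
import math
import random
from collections import Counter

def shannon_entropy(counts):
    """Shannon entropy (natural log) of a repertoire given as clone sizes."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simulate_undersampling(clone_sizes, sample_size, n_trials, seed=0):
    """Repeatedly sample cells without replacement and record two metrics."""
    rng = random.Random(seed)
    # One list entry per cell, labeled with the index of its clone.
    cells = [clone for clone, size in enumerate(clone_sizes) for _ in range(size)]
    richness, entropy = [], []
    for _ in range(n_trials):
        sample = Counter(rng.sample(cells, sample_size))
        richness.append(len(sample))
        entropy.append(shannon_entropy(list(sample.values())))
    return richness, entropy

full_repertoire = [18] * 10          # assumed: 10 clones of 18 cells each
richness, entropy = simulate_undersampling(full_repertoire, sample_size=18, n_trials=2000)

mean_richness = sum(richness) / len(richness)
mean_entropy = sum(entropy) / len(entropy)
full_entropy = shannon_entropy(full_repertoire)
print(f"richness: full = 10, mean sampled = {mean_richness:.2f}")
print(f"Shannon:  full = {full_entropy:.3f}, mean sampled = {mean_entropy:.3f}")
```

Under these assumptions the sampled richness and Shannon entropy fall, on average, below the values of the full repertoire, which is the bias that the "unseen species" problem refers to.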

Conclusions
The diversity of clonotypes composing a repertoire is a major feature of the immune system and reflects the epitome of naïve as well as antigen-experienced T cells. Even though scRNA-Seq methods now allow TCR sequences to be coupled with transcriptomics, predicting the antigen specificity of a given repertoire remains challenging. In theory, computational approaches based on structural modeling raise opportunities for epitome mapping and for predicting TCR cross-reactivity between completely different sequences. Moreover, even though deep repertoire profiling increases the capacity to capture TCR diversity, sampling of repertoires commonly leads to an inaccurate estimation of diversity. This limits the interpretation of dynamic clonality changes of immune repertoires captured with diversity metrics. Various methods are now being developed to accurately estimate the true diversity of cellular repertoires. Moreover, a gold-standard method for immune repertoire analysis has not yet been described, which calls for caution when comparing studies that use different measurement methods. Nevertheless, deep profiling of T-cell repertoires represents a potential biomarker to characterize immune fitness in diseased or treated patients, and the development of computational tools to measure diversity changes could foster immunology research, for example in cancer immunotherapy.

Figure 2 (caption). Each of the presented methods is shown with its corresponding reference publications. The material source is displayed as well as the features linked to the sequencing approach. The throughput ranges from bulk to single-cell methods, showing the limitation of scTCR-Seq in terms of the number of sequenced cells compared to bulk methods. The reduction of amplification bias, reproducibility and detection of low-frequency TCRs apply only to bulk sequencing, since these quantitatively affect the sequenced TCR chain. In the case of single-cell sequencing, the frequency of a TCR is directly linked to the ratio of cells of a specific clonotype to the total number of cells and is not distorted by amplification bias or reproducibility on TCR transcripts. *Compared to scTCR-Seq, pairSEQ pairs α and β chains from bulk sequencing, but the yield is much lower since many cells are needed for the combinatorial analyses that allow successful pairing.

Figure caption fragment: [...] where $n_i$ is the clone size of the $i$th clonotype (i.e. number of entities weighting a specific sequence) and $n$ is the total number of entities found in the overall repertoire.

Funding and acknowledgements
This work was supported by FNS grant 310030_182384. We thank Fabrizio Benedetti for assistance in the mathematical and statistical interpretation of diversity indices.