Characterization of human T cell receptor repertoire data in eight thymus samples and four related blood samples

The T cell receptor (TCR) is a heterodimer consisting of TCRα and TCRβ chains that are generated by somatic recombination of multiple gene segments. The nascent TCR repertoire undergoes thymic selection, in which non-functional and potentially autoreactive receptors are removed. In recent years, the development of high-throughput sequencing technology has allowed large-scale assessment of the TCR repertoire, and multiple analysis tools are now available. In our recent manuscript, "Human thymic T cell repertoire is imprinted with strong convergence to shared sequences" [1], we show highly overlapping thymic TCR repertoires in unrelated individuals. In the current Data in Brief article, we provide a more detailed characterization of the basic features of these thymic and related peripheral blood TCR repertoires. The thymus samples were collected from eight infants undergoing corrective cardiac surgery, two of whom were monozygous twins [2]. In parallel with the surgery, a small aliquot of peripheral blood was drawn from four of the donors. Genomic DNA was extracted from mechanically released thymocytes and circulating leukocytes. The TCRα and TCRβ repertoires were sequenced on the ImmunoSEQ platform (Adaptive Biotechnologies). The obtained repertoire data were analysed using relevant features of the immunoSEQ® 3.0 Analyzer (Adaptive Biotechnologies) and the freely available VDJTools software package [3]. The current data analysis displays the basic features of the sequenced repertoires, including observed TCR diversity, various descriptive TCR diversity measures, and V and J gene usage. In addition, multiple methods to calculate repertoire overlap between two individuals are applied. The raw sequence data provide a large database of reference TCRs in healthy individuals at an early developmental stage. The data can be exploited to improve existing computational models of TCR repertoire behaviour as well as to generate new models.

Specifications Table
Subject: Immunology
Specific subject area: T cell antigen receptor (TCR) alpha chain and beta chain diversity and characteristics in thymus and in peripheral blood
Type of data: Table: Sample description by immunoSEQ and VDJTools software (Table 1), repertoire diversity metrics (Table 2), resampled repertoire diversity metrics (Table 3), repertoire overlap measures (Table 4). Graph: V gene usage heatmap (Figure 1), J gene usage heatmap (Figure 2), rarefaction plots (Figure 3), clustering of overlap analyses (Figure 4).
How data were acquired: TCRAD and TCRB sequencing was performed on the ImmunoSEQ platform (Adaptive Biotechnologies). TCR analysis was performed using immunoSEQ® 3.0 Analyzer (Adaptive Biotechnologies) and VDJTools software [3].
Data format: Raw; Analysed
Parameters for data collection: Thymus samples were obtained from eight immunologically healthy infants undergoing open cardiac surgery for congenital heart defects. A small aliquot of blood (0.5-1 mL) was drawn from four subjects during the operation. The study was approved by the Pediatric Ethical Committee of the Helsinki University Hospital (HUS/747/2019) and written informed consent was obtained from the parents.
Description of data collection: Thymocytes were extracted mechanically from tissue resects. Blood samples were treated with ACK lysis buffer (Thermo Fisher Scientific) to remove erythrocytes. DNA was extracted from 10-30 million thymocytes and from all available PBMCs. TCRAD and TCRB sequencing was performed as previously described [4] from a standardized quantity of genomic DNA using the ImmunoSEQ assay (Adaptive Biotechnologies), which exploits a multiplex PCR system spanning the V(D)J region at a length that is sufficient to identify V and J genes and cover unique CDR3 regions.

Value of the Data
• These data consist of a unique collection of over 62 million T cell receptor (TCR) sequences obtained directly from human thymus. It is a large scale resource of human TCRα and TCRβ repertoires at an early developmental stage before clonal selections by peripheral antigens and devoid of medical or immunological interventions.
• The data are useful for those who wish to compare TCR repertoires from healthy thymus and from individuals affected by immunological diseases or other medical conditions. The large scale thymic repertoire data can also benefit computational experiments which have been typically limited to peripheral blood TCR data.
• These data can be directly exploited to improve existing computational models on TCR repertoire generation as well as in the generation of new models. These data can also guide design of human TCR sequencing experiments and serve as a reference database for new experiments.

Data Description
All TCRAD and TCRB sequences obtained from eight thymus samples (donors A-D and donors 1-4) and four related blood samples (donors 1-4) have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB41936 (https://www.ebi.ac.uk/ena/browser/view/PRJEB41936). In addition, the sequences are available at the immuneACCESS® repository in the immunoSEQ™ output format and can also be downloaded as raw FASTA files (https://clients.adaptivebiotech.com/pub/heikkila-2020-mi). On average, we obtained 4.1 million unique TCRα and 810,000 unique TCRβ clonotypes from each thymus. From blood samples we obtained on average 150,000 and 84,000 unique TCRα and TCRβ sequences, respectively. An overview of sequence diversities, total counts and sequence productivity (in-frame vs. non-coding sequences) was generated with both immunoSEQ™ and VDJTools software and is displayed together with donor details in Table 1. Two of the donors (A and B) were monozygous twins, and the influence of genetics on the repertoire has been analysed previously [2]. V and J gene usage has been shown to be biased in the peripheral blood and also already in the thymus [5-7]. The gene segment usage in the current samples is also biased (Figs. 1 & 2).
TCR diversity has been previously assessed both in the peripheral blood and in the thymus, and multiple diversity metrics are available [4,8-10]. The diversity estimates for the current samples were calculated using VDJTools software with default settings. To estimate the lower bound of total species richness, VDJTools provides the unmodified Chao1, extrapolated Chao (chaoE) and Efron-Thisted estimates, while the repertoire diversity is depicted with Shannon's index and the inverse Simpson's index (Table 2). The species richness and repertoire diversity indexes are also calculated for datasets down-sampled to the size of the smallest dataset to facilitate the comparison of samples with different sequencing depths (Table 3). Furthermore, a rarefaction curve based on the relationship between the sample diversity and the sample size was plotted for TCRα and TCRβ with extrapolation to the size of the largest sample (Fig. 3).
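As an illustration of what these estimates measure, the following minimal Python sketch computes Chao1, Shannon's index and the inverse Simpson's index from a list of clonotype read counts, and down-samples a repertoire to a fixed number of reads. It uses textbook definitions; the function names are hypothetical, and the exact formulas, normalisation and resampling scheme used by VDJTools may differ.

```python
import math
import random
from collections import Counter


def chao1(counts):
    """Textbook Chao1 lower-bound estimate of total species richness."""
    s_obs = len(counts)
    f1 = sum(1 for c in counts if c == 1)  # clonotypes seen exactly once
    f2 = sum(1 for c in counts if c == 2)  # clonotypes seen exactly twice
    if f2 == 0:
        # Bias-corrected form, used when no doubletons are observed
        return s_obs + f1 * (f1 - 1) / 2.0
    return s_obs + f1 * f1 / (2.0 * f2)


def shannon_index(counts):
    """Shannon entropy (natural log) of the clonotype frequency distribution."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


def inverse_simpson(counts):
    """Inverse Simpson's index: 1 / sum of squared clonotype frequencies."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def downsample(counts, size, seed=0):
    """Draw `size` reads without replacement; return the resulting counts.

    Expands every read explicitly, so this is only suitable for small
    illustrative inputs, not for repertoires with millions of reads.
    """
    rng = random.Random(seed)
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    picked = rng.sample(reads, size)
    return list(Counter(picked).values())
```

Comparing two samples then amounts to down-sampling both to the read count of the smaller one and applying the same index functions to the resulting counts.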
Despite the high potential diversity of TCR repertoires, a surprisingly high fraction of the repertoire is shared between individuals [1]. Here, we calculated various overlap measures with VDJTools: Pearson correlation, relative overlap measure (rationale explained in [11]), Jaccard index and Morisita-Horn index (Table 4). The calculations were performed on the entire repertoire, and exact matching of the V gene, the J gene and the CDR3 region was required. The clustering of different samples with multidimensional scaling is depicted for the Jaccard index (Fig. 4).
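Two of these measures can be sketched in a few lines of Python. The example below assumes each repertoire is a dictionary mapping a (V gene, J gene, CDR3 nucleotide sequence) key to a read count, so that only exact matches of all three fields count as shared. The function names and data layout are hypothetical, the definitions are the textbook ones, and VDJTools may normalise its output differently.

```python
def jaccard_index(rep_a, rep_b):
    """Shared unique clonotypes divided by all unique clonotypes in either sample."""
    keys_a, keys_b = set(rep_a), set(rep_b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def morisita_horn(rep_a, rep_b):
    """Morisita-Horn index computed from clonotype frequencies.

    MH = 2 * sum(p_i * q_i) / (sum(p_i^2) + sum(q_i^2)),
    where p and q are the clonotype frequencies in the two samples.
    """
    total_a = sum(rep_a.values())
    total_b = sum(rep_b.values())
    freq_a = {k: v / total_a for k, v in rep_a.items()}
    freq_b = {k: v / total_b for k, v in rep_b.items()}
    cross = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() & freq_b.keys())
    simpson_a = sum(f * f for f in freq_a.values())
    simpson_b = sum(f * f for f in freq_b.values())
    return 2.0 * cross / (simpson_a + simpson_b)


# Made-up toy repertoires keyed by (V gene, J gene, CDR3 nucleotide sequence)
sample_a = {("TRBV5-1", "TRBJ2-7", "AAA"): 2, ("TRBV7-2", "TRBJ2-7", "CCC"): 2}
sample_b = {("TRBV5-1", "TRBJ2-7", "AAA"): 1, ("TRBV9", "TRBJ1-1", "GGG"): 1}
```

The Jaccard index ignores clonotype abundance, whereas the Morisita-Horn index weights shared clonotypes by their frequencies, so the two can rank sample pairs differently.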

Experimental Design, Materials and Methods
Thymus samples were obtained from eight immunologically healthy infants undergoing corrective cardiac surgery for congenital heart defects. The study was approved by the Pediatric Ethical Committee of the Helsinki University Hospital (HUS/747/2019). Written informed consent was obtained from the parents. Thymocytes were extracted mechanically from tissue resects and stored as pellets of 10-30 million thymocytes at −70 °C. From four donors, a small aliquot of 0.5-1 mL peripheral blood was drawn during the surgery. To remove erythrocytes, the blood samples were treated with ACK lysis buffer (Thermo Fisher Scientific, USA) according to the manufacturer's instructions, and the obtained leukocytes were stored as pellets at −70 °C. Genomic DNA was extracted from frozen pellets with QIAsymphony™ (Qiagen, Germany) according to the manufacturer's instructions. TCRAD and TCRB regions were sequenced from a standardized quantity of quality-controlled genomic DNA using the ImmunoSEQ™ assay (Adaptive Biotechnologies). The assay uses a multiplex PCR system spanning the TCRAD VJ and TCRB VDJ regions at a length that is sufficient to cover unique CDR3 regions and to identify V and J genes. Amplicon sequencing was performed on an Illumina platform. TCRAD and TCRB definitions were based on the IMGT database (www.imgt.org). Primer bias and sequencing errors were corrected as previously described [4].
For each sequenced sample, the ImmunoSEQ™ assay outputs a file of unique nucleotide sequences covering the V and J genes and the CDR3 region, the count and frequency of each sequence, the CDR3 region length, and whether the sequence is in-frame, out-of-frame or contains a premature STOP codon. For in-frame and 'has stop' sequences the nucleotide sequence is converted to a CDR3 amino acid sequence, in which an asterisk (*) indicates the STOP codon. In addition, the V gene, D gene and J gene names, the number of non-templated nucleotide insertions and the locations of insertions in V and J gene segments are provided. The raw FASTA files are also available but were not directly used in the present analysis.
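The productivity summary described above can be illustrated with a short Python sketch. It assumes, purely for illustration, that each clonotype is reduced to a (CDR3 amino acid string, read count) pair, that an empty string means no translation was produced (out-of-frame), and that an asterisk marks a STOP codon. The actual immunoSEQ™ export carries an explicit frame status field whose column name is not reproduced here.

```python
from collections import Counter


def frame_status(cdr3_aa):
    """Classify one clonotype from its CDR3 amino acid string.

    Assumed convention: an empty string means the sequence was not
    translated (out-of-frame); '*' marks a premature STOP codon.
    """
    if not cdr3_aa:
        return "out-of-frame"
    if "*" in cdr3_aa:
        return "has stop"
    return "in-frame"


def productivity_summary(clonotypes):
    """Return {status: (fraction of unique clonotypes, fraction of reads)}.

    `clonotypes` is an iterable of (cdr3_aa, count) pairs.
    """
    unique = Counter()
    reads = Counter()
    for cdr3_aa, count in clonotypes:
        status = frame_status(cdr3_aa)
        unique[status] += 1
        reads[status] += count
    n_unique = sum(unique.values())
    n_reads = sum(reads.values())
    return {s: (unique[s] / n_unique, reads[s] / n_reads) for s in unique}


# Made-up example records: (CDR3 amino acid sequence, read count)
example = [("CASSLGQETQYF", 6), ("CASS*GQETQYF", 1), ("", 3)]
```

Reporting both fractions is useful because a few highly expanded clonotypes can dominate the read-level fraction while contributing little to the unique-clonotype fraction.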
In the current article we applied TCR analysis tools from two platforms: immunoSEQ™ ANALYZER 3.0, run on the Adaptive Biotech website (adaptivebiotech.com/products-services/immunoseq/immunoseq-analyzer/), and the Java-based non-commercial software package VDJTools [3]. From immunoSEQ™ we adapted "Sample Overview" to calculate the sample diversity and counts. VDJTools readily accepts the basic immunoSEQ™ output format and converts it to a VDJTools output file. From VDJTools we used the "CalcBasicStats" command to calculate the sample diversity and counts, the "CalcSegmentUsage" command to produce V and J gene usage heatmaps, the "CalcDiversityStats" and "RarefactionPlot" commands with default settings to calculate and visualise diversity estimations, and finally the "CalcPairwiseDistances" command to calculate the sequence overlap between two samples. For sequence overlap we selected the setting "strict", which requires matching CDR3 nucleotide regions as well as matching V genes and J genes. For visualisation of Jaccard index overlap values we used the "ClusterSamples" tool, which provides a multi-dimensional scaling plot created with the isoMDS() function of the MASS package for R.

Ethics Statement
The study was approved by the Pediatric Ethical Committee of the Helsinki University Hospital (HUS/747/2019) and a written informed consent was obtained from the parents.

CRediT Author Statement
NH and TPA conceptualised the study and wrote the original manuscript. NH and RV collected and prepared the samples. IK, DAY and JS implemented the software usage. IPM provided the study material. All authors reviewed and accepted the manuscript.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships which have, or could be perceived to have, influenced the work reported in this article.