Early life imprints the hierarchy of T cell clone sizes

The adaptive immune system responds to pathogens by selecting clones of cells with specific receptors. While clonal selection in response to particular antigens has been studied in detail, it is unknown how a lifetime of exposures to many antigens collectively shape the immune repertoire. Here, using mathematical modeling and statistical analyses of T cell receptor sequencing data, we develop a quantitative theory of human T cell dynamics compatible with the statistical laws of repertoire organization. We find that clonal expansions during a perinatal time window leave a long-lasting imprint on the human T cell repertoire, which is only slowly reshaped by fluctuating clonal selection during adult life. Our work provides a mechanism for how early clonal dynamics imprint the hierarchy of T cell clone sizes with implications for pathogen defense and autoimmunity.


Introduction
The hallmark of adaptive immunity is the generation of diversity through genetic recombination and clonal selection. Their interplay balances the breadth and specificity of the~10 12 T cells in the human body ( Figure 1A; Arstila et al., 1999;Farber et al., 2014): The genetic recombination of the T cell receptor (TCR) locus, termed VDJ recombination, generates an enormous potential diversity of receptors ranging from early estimates of~10 15 (Davis and Bjorkman, 1988) to more recent estimates of~10 61 ) different possible receptor TCRab heterodimers. Clonal selection expands the number of specific cells during an infection for effector functions, a fraction of which is retained over prolonged periods of time as immune memory (Ahmed and Gray, 1996;Farber et al., 2014).
Much progress has been made deciphering the mechanisms of regulation and control of T cell dynamics over the last few decades (Antia et al., 2005;Sallusto et al., 2010;Farber et al., 2014). However, much of that progress has focused on the dynamics of subsets of T cells specific to a particular antigen and has come from experiments in mice. An important open question is how exposures to many antigens over a human lifetime collectively shape our T cell repertoire (Farber et al., 2014;Davis and Brodin, 2018).
High-throughput repertoire sequencing enables direct surveys of the diversity and clonal composition of T cells from human blood or tissue samples and thus promises to provide quantitative answers to this question (Robins et al., 2009;Thomas et al., 2014;Britanova et al., 2016;Emerson et al., 2017;Oakes et al., 2017;Thome et al., 2016;Robins et al., 2010;Qi et al., 2014;Lindau et al., 2019;TRACERx consortium et al., 2019). However, while the TCR locus provides a natural barcode for clonal lineages due to its large diversity, this same diversity also makes inferring past clonal dynamics a challenging inverse problem, in particular given practical limitations on sequencing depth and temporal resolution in longitudinal studies. Mathematical modeling can help address this challenge by solving the forward problem of linking clonal dynamics to emergent statistical patterns (Desponds et al., 2016;Lythe et al., 2016;Dessalles et al., 2019;Altan-Bonnet et al., 2020;de Greef et al., 2020). Comparing patterns to data can provide insights about dynamics from static snapshots of repertoire organization in different individuals. A particularly striking such pattern has been the observation of power-law scaling of clone sizes spanning several orders of magnitude (Robins et al., 2009;Thomas et al., 2014;Britanova et al., 2016;Emerson et al., 2017;Oakes et al., 2017). In a typical sample of T cells from peripheral blood, a large fraction of clones is only seen once within 10 5 -10 7 sampled sequences, while the most abundant clones account for more than 1% of all sequencing reads. Such power-law scaling of clone sizes has been shown to arise at steady state in models of fluctuating clonal selection driven by different antigen encounters (Desponds et al., 2016). However, it is unclear whether this mechanism alone is sufficient to explain how clone size scaling is established, and more broadly how variable the clonal hierarchy is over time.
Here, we develop a theory of T cell dynamics throughout the human lifespan based on the statistical laws of repertoire organization and dynamics revealed by cohort and longitudinal human TCR repertoire sequencing (Britanova et al., 2016;Emerson et al., 2017;Chu et al., 2019;Lindau et al., 2019). We find that clonal expansions during repertoire formation play an important role in establishing the clone size hierarchy, which is only slowly reshaped by fluctuating clonal selection during adult life.

A scaling law of human T cell repertoire organization
An important statistic to summarize repertoire organization is the clone size distribution, which tabulates the number of clones found at different multiplicities within a repertoire or sample. Multiple previous studies have shown that these distributions are heavy-tailed (Robins et al., 2009;Thomas et al., 2014;Britanova et al., 2016;Emerson et al., 2017;Oakes et al., 2017). However, potential confounding by noise introduced during the sequencing process has remain debated (Altan-Bonnet et al., 2020) and systematic analyses of how variable these distributions are across healthy individuals have been lacking. To fill these gaps, we reanalyzed data from two large-scale cohort repertoire sequencing studies of human blood samples, which used fundamentally different eLife digest The human immune system develops a memory of pathogens that it encounters over its lifetime, allowing it to respond quickly to future infections. It does this partly through T cells, white blood cells that can recognize different pathogens. During an infection, the T cells that recognize the specific pathogen attacking the body will divide until a large number of clones of these T cells is available to help in the fight. After the infection clears, the immune system 'keeps' some of these cells so it can recognize the pathogen in the future, and respond quicker to an infection.
Over the course of their lives, people will be infected by many different pathogens, leading to a wide variety of T cells that each respond to one of these pathogens. However, it is not well understood how various infections throughout the human lifespan shape the overall population of different T cells. Gaimann et al. used mathematical modelling to study how the composition of the immune system changes in people of different ages. Different populations of T cells -each specialized against a specific antigen -had been previously identified through genetic sequencing. Gaimann et al. analyzed their dynamics to show that many of the largest populations originate around birth, during the formation of the immune system.
These findings suggest a potential mechanism for how exposure to pathogens in infancy can influence the immune system much later in life. The results may also explain variations in how people respond to infections and in their risk of developing autoimmune conditions. This understanding could help develop new treatments or interventions to guide the immune system as it develops. sequencing pipelines and thus have different sources of noise (Materials and methods -Experimental data sources). Both studies sequenced the locus coding for the hypervariable TCR CDR3 b-chain from peripheral blood T cells of healthy human volunteers spanning a large range of ages ( Figure 1figure supplement 1). In both studies, T cells were sequenced without regard to their phenotypic characteristics. A clone thus represents the full lineage of cells derived from a common ancestral cell irrespective of differentiation status (see Appendix 5 -Relation between clone size and cellular phenotypes and Figure 1-figure supplement 2 for a phenotypically resolved analysis).
After normalizing clone sizes to account for variations in sampling depth and for the increasing fraction of T cells of memory phenotype with age ( Figure 1-figure supplement 3), we found that the tails of the clone size distributions collapsed to the same statistical law across individuals and , which then undergo clonal selection (middle) together shaping the immune repertoire. The T cell receptor (TCR) locus acts as a natural barcode for clonal lineages, which can be read out by sequencing (right). (B, C) Clone size distributions in two large cohort studies of human blood samples using disparate sequencing protocols display a power-law relationship between the rank and size of the largest clones. Each line shows the size distribution of all T cell clones in an individual in an unsorted blood sample, that is independently of the phenotypes of the cells making up the different clones. Ages are color coded as indicated in the legend. The black line shows a power law with a slope of -1 for visual comparison. Normalized clone sizes were defined as the number of reads of a given receptor's sequence divided by the total number of reads within a sample and a factor equal to the average fraction of T cells with memory phenotype at different ages to account for variations in sampling depth and in the subset composition of peripheral blood, respectively ( Figure 1-figure supplement 3). Only a single individual is displayed per 2-year age bracket to improve visibility. (D) Power-law exponents as a function of the age (legend: linear regression slope and coefficient of determination). Data sources: Britanova et al., 2016, Emerson et al., 2017 The online version of this article includes the following figure supplement(s) for figure 1:     cohorts ( Figure 1B,C): Ranking clones by decreasing size, the rank of the largest clones approximately scales with their size C as a power law, rank~C Àa ; (1) where a is a scaling exponent. To quantify the apparent similarity of the scaling relationship, we determined a for each sample by maximum likelihood estimation. Only a small fraction of all T cells are sampled, which poses a challenge because subsampling a power law leads to deviations from scaling at small clone sizes (Stumpf et al., 2005). To overcome this challenge, we used a trimming procedure and excluded clones smaller than a minimal size from the fitting, which decreases bias arising from subsampling (Appendix 1). Determined in this subsampling-robust manner the fitted power-law exponents agree remarkably well within the range of ages covered by both cohorts ( Figure 1D): with a = 1.17 ± 0.03 (mean ± standard error [SE]) and a = 1.18 ± 0.01 in the Britanova cohort and Emerson cohort, respectively. Moreover, the fitted exponents varied little between individuals in both cohorts; with a sample standard deviation of fitted exponents of 0.14 and 0.21, respectively. The agreement of the mean exponents is noteworthy given the different sequencing pipelines and provides strong evidence that the scaling relationship (Equation 1) is a true feature of the clone size distribution and not of the measurement process. What drives the emergence of a power-law distributed hierarchy of clone sizes? Given the reproducibility of the scaling law across individuals, we might hope for a statistical explanation independent of the precise antigenic history that has driven the expansion of specific cells in an individual. To test hypotheses about mechanisms underlying scaling, we describe repertoire dynamics using a general mathematical framework based on effective stochastic rate equations for the recruitment of new clones, and the proliferation and death of already existing clones within a T cell compartment (Materials and methods -Mathematical framework). In macroecology, where such reductionist approaches have a long history, simple neutral models within this framework have had surprising success in describing species abundance distributions only accounting for demographic stochasticity  Figure 2. Emergence of power-law scaling of clone sizes in a minimal model of repertoire formation. (A) Sketch of the stochastic dynamics of recruitment, proliferation, and death of T cells. Proliferation is inversely proportional to total repertoire size, which reflects increased competition as the repertoire grows. (B) Clone size distributions in simulated repertoires display power-law scaling (blue lines), in contrast to steady-state predictions that conform with those of a null model based only on demographic stochasticity (black line, Equation 48). (C) Illustration of the mechanism: early in life rates of proliferation exceed clonal turnover (lower panel). As the total repertoire size increases (gray line, upper panel) the proliferation rate decreases due to increased competition. The dynamics of selected clones after their recruitment marked by a dot is indicated by colored lines (upper panel). The line position shows the cumulative size of all prior clones, while the line width indicates the size of the clone (not to scale). The earlier a clones is recruited the larger it expands during the period of overall repertoire growth. (D) Dependence of the clone size distribution on parameters. Simulated repertoires at 5 years of age were subsampled to 10 6 cells to mimick the experimental sampling depth (solid lines). The simulated data closely follow predictions from a continuum theory of repertoire formation (dashed lines). Model parameters: (B,D) clonal death rate d = 0.2/year, clonal recruitment rate q = 10 6 /year, clone size at recruitment C 0 = 1; (B) total proliferation rate b 0 = 10 7 /year (implying a recruitment to proliferation ratio g = 0.1), (D) variable b 0 as indicated in the legend by the ratio g. The online version of this article includes the following figure supplement(s) for figure 2:    (Volkov et al., 2003), but this source of variability is insufficient to account for the observed breadth of T cell clone sizes (Desponds et al., 2016;de Greef et al., 2020; for a detailed discussion see Appendix 2). The failure of this null model has prompted a search for other mechanisms that explain scaling.
To constrain this search, we analyzed how fitted exponents varied with age. In particular, based on a finite time solution we derived for a previously proposed model (Desponds et al., 2016) of how power-law scaling can emerge from the cumulative effect of temporal fluctuations in clonal growth rates (Materials and methods -Slow convergence to steady-state scaling), we expected a substantially steeper tail in young individuals. While exponents overall decreased slightly with age, the dependence on age accounted for surprisingly little variation in both cohorts ( Figure 1D), including when controlling for sex and cytomegalovirus (CMV) exposure status (Figure 1-figure supplement 4). Notably, scaling is established within the first decade of life, with significant clone size variability existing as early as at birth (Figure 1-figure supplement 5), defying previous model predictions.
A mechanism for the emergence of scaling during repertoire formation We hypothesized that scaling might result from clonal expansions during repertoire formation, which would naturally explain the early onset of scaling. Our hypothesis is based on experimental evidence in mice (Le Campion et al., 2002;Min et al., 2003;Haluszczak et al., 2009;Kawabe et al., 2017) and human (Rufer et al., 1999;Schö nland et al., 2003) that repertoire formation is driven not only by increased thymic output, but also by large proliferative expansion of some T cell clones. Additionally, multiple studies Hammarlund et al., 2003;Pogorelyy et al., 2017;Tanno et al., 2020 have shown that some T cell clones can persist over multiple decades, which suggests that clonal turnover might be sufficiently slow for transient expansionary dynamics early in life to shape repertoire organization over a prolonged time period (see also Appendix 2 -Relaxation time scale in a neutral model).
To test our hypothesis, we constructed a minimal model of repertoire formation based on known T cell biology ( Figure 2A). Following previous work (Bains et al., 2009;Lythe et al., 2016), we assume that T cells proliferate at a rate inversely proportional to the total number of cells N already present in the repertoire. This assumption leads to increased proliferation early in life before the repertoire has reached its homeostatic size, for which there is experimental evidence Le Campion et al., 2002;Min et al., 2003;Haluszczak et al., 2009;Kawabe et al., 2017;Rufer et al., 1999;Schö nland et al., 2003. This assumption is also compatible with a simple mechanistic model of T cell competition (Materials and methods -Mechanistic motivation for the competition function). We further assume that cells die at a rate d, and that new clones are recruited at rate q with an initial size C 0 . For simplicity, we set C 0 equal to one in the following and we assume constant rates d and q. Importantly, recruitment of new clones and total expansion of already existing clones maintain a constant ratio throughout development under these assumptions in line with findings that the fraction of cells with TCR excision circles, which are diluted during peripheral division, is constant during fetal development (Schö nland et al., 2003) and infancy (Douek et al., 2001;Bains et al., 2009). We simulated the model starting from an empty repertoire and found that large clones displayed power-law scaling ( Figure 2B blue lines). The simulation results contrast with steady-state predictions ( Figure 2B black line), where the model effectively reduces to the neutral null model introduced earlier (Materials and methods -Steady-state distribution). The power-law tail persisted over multiple decades of aging, much beyond the timescale of cellular turnover, 1/d = 5 years, assumed in the simulations. A mathematical analysis shows that relative timescales of clonal and cellular turnover are controlled by the control parameter g ¼ C 0 =b 0 , which is the ratio of the contribution of recruitment and proliferation to overall compartment maintenance (Appendix 2). The long timescale of clonal turnover emerges because the chosen parameters are in the biological parameter regime g<1, where most cell death is balanced by proliferation (den Braber et al., 2012;Macallan et al., 2017). Thus, we find that repertoire formation can produce transient but long-lasting power-law scaling of clone sizes.
To obtain intuitive insight into how scaling is established, we developed a continuum theory of clonal dynamics during repertoire growth (Materials and methods -Continuum theory of clonal growth). We find that the clone size C i of the i-th clone recruited at time t i follows a subexponential growth law C i ðtÞ ¼ C 0 ðt=t i Þ 1=ð1þgÞ . Clones recruited early grow large deterministically until competition lowers proliferation rates below the death rate ( Figure 2C, lower panel). Different clones are recruited at different times and thus have more or less time to grow ( Figure 2C, upper panel), which leads to a clone size distribution that follows power-law scaling with an exponent a ¼ 1 þ g. We note that this origin of the power-law scaling is closely related to a well-known generative mechanism for power-laws first studied by Yule, 1924 (for a detailed discussion see Appendix 4 -A unified view on mechanisms generating power laws in different growth processes).
The predicted exponent closely matches simulation results for different values of g ( Figure 2D dashed lines). Intuitively, when recruitment rates are higher, clones founded early have less time to outgrow later competitors, and thus the power law is steeper (a is larger). Importantly, in the biological parameter regime in which proliferation dominates, g<1, the exponent is compatible with experiments ( Figure 1B-D). We thus find, that the model -without fine tuning of parametersreproduces the observed scaling exponent.
To expose a basic mechanism capable of producing broad clone size distributions, we have kept the model deliberately simple. More detailed models demonstrate the conditions and limits on the generalizability of this mechanism (Appendix 3). Variable recruitment sizes only affect the distribution of small clones ( Long-lived incumbency advantage shows early expansions imprint clone size hierarchy Our proposed theory for the rapid emergence of scaling predicts that large clones have expanded massively during repertoire formation. To test this prediction, we need to trace the dynamics of early    founded clones. To this end, we exploit a change in the recombination statistics taking place during fetal development (Feeney, 1991;Rechavi et al., 2015;Park et al., 2020; Figure 3A). While T cells are produced by the thymus from the late first trimester the enzyme terminal deoxynucleotidyl transferase (TdT), which inserts non-templated nucleotides during VDJ recombination, is not expressed until the mid second trimester . Therefore, many more T cells in fetal and neonatal blood have zero insertions than expected from the adult recombination statistics (Rechavi et al., 2015). This enables a statistical dating of individual clones in a repertoire based on their sequence (Sethna et al., 2017;Pogorelyy et al., 2017). If our model is correct, we expect abundant clones to be more likely to have zero insertions than smaller clones. Analyzing data from the Emerson cohort, we find that zero insertion clones are indeed highly enriched within the most abundant clones ( Figure 3B). This generalizes a previous report of such an enrichment within the naive compartment (Pogorelyy et al., 2017). The large cohort size allows us to perform a fine-grained analysis of how the fraction of zero-insertion clones depends on clonal abundance and age. We find that enrichment is particularly pronounced in the young and decreases with age at different speeds depending on clone size. Among the largest clones many more still have zero insertions than expected from the adult recombination statistics even multiple decades after repertoire formation. This suggests that the incumbent large clones created during repertoire formation are only slowly replaced by clones expanding later in life, similarly to what has been observed in mice (Gossel et al., 2017;Hogan et al., 2019).
Additional analyses rule out other potential explanations for the relation between insertion statistics and clonal abundance. First, sequences with zero insertions are similarly enriched among the largest clones in productive and unproductive sequences (    support the conclusion that dynamics during the perinatal time window of repertoire formation leave a long-lasting imprint on the T cell clonal hierarchy well into adulthood.
Longitudinal clone size fluctuations predict the dynamics of the clone size hierarchy with aging Building on this successful validation of a core prediction of our theory, we asked whether we could leverage the detailed pattern of enrichments at different ranks and ages to quantify how much being part of the wave of early expansions determines the fate of a clone relative to other sources of clone size variability. To this end, we extended our model beyond repertoire formation and allowed clonal proliferation rates to fluctuate over time to model the net effect of clonal selection by changing antigenic stimuli during adult life (Desponds et al., 2016). Specifically, we added a stochastic term to the growth rate of each clone to provide an effective Langevin description of the long-term dynamics induced by fast-changing antigen stimuli, To determine a biologically plausible fluctuation strength s, we analyzed the variability of clone sizes over time in a longitudinal repertoire sequencing study (Chu et al., 2019). We first analyzed how much recently expanded clones contribute to the tail of the clone sizes, and found that only a small fraction of the largest clones in any sample were not already large at the earliest time point ( Figure 4A and Figure 4-figure supplement 1). To minimize confounding by transient dynamics affecting these clones, we excluded them from further analysis. We found that large clones had remarkably stable abundances over time, which we quantified by calculating the variance of log-foldchanges in clone size between the second and every subsequent time point ( Figure 4B  Using the fitted fluctuation strength, we constructed an in silico cohort of individuals of different ages according to the extended model (Materials and methods -Simulated cohort). In short, we computationally assigned each newly recruited clone to have zero insertions in a way that mimics the change in fetal recombination statistics, and we simulated memory repertoire dynamics based on the combined effect of early expansion and fluctuating clonal selection (Equation 2). The enrichment of zero insertion clones in the simulated cohort ( Figure 4C) closely recapitulated the empirical findings using plausible parameter values. Notably, the longer lasting enrichment of zero insertion clones among the very largest clones is also found in the simulated cohort, and the timescales over which the enrichment decays agree remarkably well.
To obtain analytical insight into the enrichment dynamics, we studied how fluctuating growth rates reorder the clone size hierarchy established during repertoire formation (Materials and methods -Relaxation of the zero insertion distribution). The early clone size hierarchy in our model is dominated by the time of recruitment, which leads to a steep gradient in the zero insertion probabilities as a function of rank in the first decade of life ( Figure 4C). We thus calculated how the zero insertion probabilities P 0 ðr; tÞ for clones of rank r at time t change due to fluctuating clonal growth starting from an idealized initial condition resembling the early clone size hierarchy, in which the clone sizes follow power-law scaling and the r $ largest clones have a zero insertion probability p 0;À and all others a probability p 0;þ . We find that the probability of zero insertion clones then follows a sigmoidal shape as a function of log clone size rank, which changes with age as follows, where Dp 0 ¼ p 0;À À p 0;þ is the difference of zero insertion probabilities, t d a characteristic timescale, and erfcðxÞ the complementary error function. These analytical results suggest a two-parameter rescaling of the enrichment of zero insertion clones as a general test of our theory. To demonstrate the feasibility of fitting these parameters from the enrichment dynamics, we determined them from the simulated data using weighted least squares fits setting r and t in Equation 3 to the mid-value of each bin. Rescaling the data with the fitted r $ and t d lead to a collapse of all simulated datapoints onto the predicted sigmoidal curve (Figure 4-figure supplement 3). We then applied the same fitting and rescaling procedure to the experimental data, and found that it also leads to a remarkably good data collapse ( Figure 3C). The fitted parameters quantify key features of long-term repertoire dynamics, with t d characterizing the timescale over which fluctuations change the clone size hierarchy, and r $ being related to the number of clones recruited during early repertoire growth. In line with the long-lived enrichment of zero insertion clones, the fitting reveals a remarkably slow timescale of about a decade over which the clone size hierarchy is reordered during healthy aging. The fitted r $ indicates that early repertoire formation involves the expansion of a large number of different clones. Overall, the agreement between theory and data demonstrates that our model quantitatively captures how early expansions and ongoing fluctuating selection together shape the clone size hierarchy.

Discussion
The evolution of the adaptive immune system has endowed vertebrates with the ability to adapt to pathogens that evolve on a timescale faster than host reproduction (Mayer et al., 2016). However, this ability comes with a cost: every generation needs to rebuild immune memory anew. As the organism first comes into contact with the outside world, it quickly needs to train its adaptive immune system to tolerate innocuous antigens and build up immune memory against pathogens. Here, we have shown that this process of rapid adaptation leaves a long-lasting imprint on the organization of the human T cell repertoire. More broadly, we propose a theory of repertoire dynamics that quantitatively describes how early expansions during repertoire formation combine with a lifetime of exposures to cumulatively shape the T cell hierarchy. Notably, we find that the T cell repertoire is remarkably stable over time in adult individuals outside of the punctuated expansions and contractions of specific clones in acute responses. Our study demonstrates that despite its vast complexity, repertoire dynamics is partially predictable by quantitative models. The model predictions can help guide future longitudinal studies, which in turn will allow refinements of modeling assumptions. The current work thus provides a stepping stone toward a detailed quantitative understanding of T cell dynamics that we hope will ultimately power the rational development of immunodiagnostics and therapeutics.
The general mechanism we describe for imprinting in the adaptive immune system provides a unified lens through which to view a number of converging lines of evidence about how a developmental time window shapes adaptive immunity (Guerau-de-Arellano et al., 2009;Farber et al., 2014;Gostic et al., 2016;Constantinides et al., 2019;Li et al., 2019;Davenport et al., 2020;Hong et al., 2020). In our model, overall repertoire growth early in life amplifies the effect of any early exposures, as the responding clones continue to proliferate as memory cells and thus are much larger than memory produced from similar exposures after the homeostatic repertoire size is reached. We thus expect early pathogen exposures to be particularly potent, as has been observed in influenza, where disease severity across age cohorts for different strains depends on the first exposure (Gostic et al., 2016). Conversely, we expect the presence of tolerizing factors early in life to be particularly crucial during repertoire formation to avoid autoimmunity, as has been observed for the autoimmune regulator gene AIRE, for which expression is only essential during a perinatal time window (Guerau-de-Arellano et al., 2009).
A limitation of datasets used in this study is that they do not provide direct information about the phenotypic characteristics of cells belonging to different clones. Repertoire sequencing of phenotypically sorted blood samples shows that the largest clones predominantly consist of cells with memory phenotype (Appendix 5). This indirectly suggests that the clonal expansions during repertoire formation produce memory cells as we have assumed in our simulated cohort ( Figure 4C). Supporting this interpretation, a substantial number of memory cells circulate in the blood quickly following birth (Pediatric AIDS Clinical Trials Group et al., 2003) and recent evidence suggests that memory-like T cells are already generated in particular tissue sites such as the intestine even before birth Li et al., 2019). However alternatively, early expansions could also set up a broad distribution of naive T cell clone sizes (de Greef et al., 2020), whose hierarchy would then need to be roughly maintained during the transition into memory to be compatible with the observed impact of early expansions on the hierarchy of the most abundant clones. Advances in repertoire sequencing of T cells sorted with increasing granularity using cell surface markers (Oakes et al., 2017;Soto et al., 2020) along with advances in single-cell technologies linking TCR sequencing and cellular phenotyping could help differentiate between these scenarios in the future.
An important question raised by our work is which antigens drive the expansion of early T cell clones. To address this question, it will be necessary to determine the exposures that imprint the abundance of these clones, as has been done recently for mucosal-associated invariant T cells (Constantinides et al., 2019), a subset of non-conventional T cells. Going forward, the highly abundant clones with sequences close to the genetically inherited gene templates resulting from the absence of TdT expression during early fetal development are a particularly interesting target of study. They might constitute an evolutionarily controlled set of innate-like defenses within the adaptive immune system. Determining what imprints their abundances will help resolve the question of whether their large abundances are simply a byproduct of rapid repertoire formation or whether these clones serve particular functions.

Materials and methods
Experimental data sources The Britanova cohort comprises 71 individuals spanning ages 6-103 years, as well as eight cord blood samples. The Emerson cohort spans ages 1-74 years and consists of a training and validation set of 666 and 120 individuals, respectively. From the training set, we excluded 111 samples with missing age information and 62 samples with a conflicting data format. We used only samples from the training set to analyze how the scaling law of repertoire organization changes with age (Figure 1). For the zero insertion enrichment analyses ( Figure 3B,C), we combined both the training and validation set together with separately published repertoire sequencing data from eight elderly individuals Lindau et al., 2019 generated using the same experimental pipeline (immunoSEQ, Adaptive Biotechnologies, Seattle) to achieve the broadest possible coverage of all age groups.
The longitudinal study by Chu et al., 2019 performed repertoire sequencing of peripheral blood from three healthy female volunteers (using the immunoSEQ pipeline) over eight time points spanning a~1 year time frame. One individual in the study was in mid-adulthood (24-45 years, Subject 3 in the original study), while two were in early adulthood (18-24 years, Subjects 1 and 2 in the original study). In Figure 4A,B, we display data from the older individual as we expect dynamics of large clones to be masked less by measurement noise as the large clones increase in relative abundance with age.
All studies from which we analyzed data sequenced the locus coding for the TCR CDR3 b-chain only, and we thus define clones as collections of cells sharing the same CDR3 b-chain. Clone sizes are defined as the number of distinct unique molecular identifiers (UMIs) sequenced (Britanova cohort), or based on sequencing reads (Emerson cohort). The definition of a clone solely based on the CDR3 b-chain neglects convergent recombination of the most easily produced receptors with different CDR3 a-chains, but we expect convergent recombination to be sufficiently rare overall for this distinction not to qualitatively affect clone size distributions. For all studies we used data, which was preprocessed as described in the original study. This data is publicly available using the links provided in Table 1.
We also used flow cytometry data on the fraction of naive cells from Britanova et al., 2014 (avail-

Data analysis Fitting power-law exponents
We estimate the power-law exponent from sampled clone sizes fC i g, i ¼ 1; . . . ; M, which exceed a minimal size C min by numerically maximizing the log-likelihood of the data (Clauset et al., 2009), where zðx; kÞ is the incomplete Riemann zeta function. We use C min ¼ 16 for both cohorts, which provides a balance between minimizing bias of the estimated exponents induced by subsampling while not overly increasing the variance of the estimator by excluding most of the data (see Appendix 1- figure 2).

Fitting the zero insertion profiles
To fit the zero insertion fractions to the theory prediction (Equation 3) we determine the values for r $ and t d by a weighted least squares fit. We set r and t to the mid-value of each bin for the data. We weight each value by the sum of its empirical standard error and a fixed model specification error of 2 Á 10 À3 chosen based on fits to the simulated cohort ( Figure 4-figure supplement 3). To demonstrate the feasibility of the parameter inference, we reinferred the parameters from the simulated data and recovered those used as parameter values for the simulation. We also fitted the values of p 0;À and p 0;þ , but we note that they are not used in the rescaling and are only needed to display the theoretical curve (Equation 3).

Normalization of clone sizes
Variations in sampling depth can confound comparisons of clone sizes (Appendix 1). Intuitively, if we sample more cells overall we also expect to sample proportionally more cells belonging to each given clone. This suggests to use the frequency with which cells are sampled from a given clone as a more robust measure, which can be empirically estimated by normalizing each clone size by the total sample size. We further normalize clone sizes by the fraction of memory T cells found in people of different ages to account for the increase in memory cell fraction in peripheral blood with age (Appendix 5). Together these two normalization steps lead to a large degree of data collapse as compared to unnormalized clone sizes (Figure 1-figure supplement 3).

Regression analyses
We determine 95% confidence intervals on regression lines by bootstrapping using case resampling (Efron and Hastie, 2016).

Mathematical framework
We describe T cell dynamics using the following general set of stochastic rate equations. The class of models we consider are known in the mathematical literature as birth-death-immigration models. The number of cells C i , i ¼ 1; . . . ; M of each of the M clones in the repertoire changes according to proliferation : C i " biðC;X;tÞCi C i þ 1; death : C i " diðC;X;tÞCi C i À 1; where the rate of proliferation b i ðC; X; tÞ or cell death d i ðC; X; tÞ generally can depend on the repertoire composition C, on the time t, and on the state of the environment XðtÞ representing for example the levels of different antigens and cytokines in the organism at a given time. We furthermore consider that new clones are added at rate ðX; tÞ at a size C 0 , recruitment : This recruitment represents thymic output and antigen-driven differentiation of naive cells for the naive and memory compartment, respectively. b i ðC; X; tÞ ¼ b 0 =N; d i ðC; X; tÞ ¼ d; ðX; tÞ ¼ where NðtÞ ¼ P MðtÞ j¼1 C j ðtÞ is the total repertoire size. In the results section 'Long-lived incumbency advantage shows early expansions imprint clone size hierarchy' w modify this model by adding a noise term that describes the effective influence of environmental variations XðtÞ on clonal proliferation, where hh i ðtÞh j ðt 0 Þi ¼ d ij dðt À t 0 Þ.

Modeling repertoire formation Mechanistic motivation for the competition function
We consider a population of N T cells that proliferate at a rate proportional to the concentration S of a set of stimuli (stimulatory cytokines), b / S. We assume that the cytokines are produced by other cells at some fixed rate p and degraded at a basal rate q. We further assume that competition between T cells is mediated by their consumption of cytokines. The dynamics of S is then described by where ÀkSN is a mass action term describing how T cells lower cytokine levels. Assuming a separation of timescales in which cytokine concentrations change quickly we obtain the quasi steady state approximation When the consumption term dominates relative to basal decay, kN ) q, we obtain b / S / 1=N.

Mean-field competition approximation
We simplify the full stochastic model (Equation 5-Equation 7) using a mean-field approximation for the competition, which decouples the dynamics of individual clones while retaining the full stochasticity on the clonal level. This approximation replaces the dependence of the proliferation rate on N by a dependence on its continuum theory average given by Equation 14. We exactly simulated a system of reduced size to validate the mean-field approximation. The distributions of the exact and mean-field simulations agree to within stochasticity ( Figure 5), with the exception of the largest clone, which is larger in the exact simulations as has been discussed elsewhere (Dodds et al., 2017).

Continuum theory of clonal growth
To obtain insight into why the model produces power-law scaling we present a simple continuum theory of early clonal dynamics. We approximate the clone size dynamics of the i-th clone C i as with C i ðt i Þ ¼ C 0 at the time of recruitment t i . The total repertoire size N ¼ P i C i evolves according to whose solution is given by For times large compared to 1=d the total repertoire size given in Equation 14 reaches a steadystate, because competition for proliferation signals acts as a homeostatic regulator. By combining Equation 14 and Equation 12 we derive the clonal growth law C i ðtÞ ¼ C 0 e dt À 1 e dti À 1 1=ð1þgÞ e ÀdðtÀtiÞ ; where g as in Appendix 2 is the recruitment to proliferation ratio which in this model is given by g ¼ C 0 =b 0 . To simplify we expand the growth law at leading order for small times, t i <t ( 1=d, to obtain This expression can also be derived directly by noting that early repertoire growth is linear NðtÞ » ðb 0 þ C 0 Þt, and that the early dynamics is dominated by proliferation and not death such that which is solved by Equation 17. Given the constant recruitment of new clones the distribution of the t i 's is uniform, which with Equation 17 implies a clone size distribution that follows power-law scaling with an adjustable exponent that depends on g. Note that the exponent for PðCÞ differs by one from the exponent for the rank (Clauset et al., 2009), which is a complementary cumulative distribution, and thus a ¼ 1 þ g.

Steady-state distribution
To derive the power-law scaling, we have expanded the total repertoire size for small times (or death rates). How does the clone size distribution change later in life? At large times, the division rate b 0 =NðtÞ falls below the constant death rate d as the steady-state repertoire size N ¥ is approached following Equation 14. In this model this happens at a time t $ ' logð1 þ 1=gÞ=d, after which the large clones experience a deterministic force toward extinction. For times t ) t $ the model effectively reduces to the neutral birth-death dynamics considered in Appendix 2. (The growth rate fluctuations produced by variations of the total population size around steady state asymptotically vanish for large N ¥ .) We thus expect the steady-state clone size distribution to be equivalent to that of the neutral model (Equation 48). Indeed this distribution accurately describes the distribution of small clones in old age ( Figure 2B). The neutral distribution is not compatible with data as discussed before. However, the timescale over which large early founded clones vanish is long (Appendix 2) such that a tail of large clones resulting from the early growth dynamics can be maintained much beyond t $ until t ) t c (see also Appendix 2 -Relaxation time scale in a neutral model).
Modeling long-term repertoire dynamics with fluctuating clonal growth rates Slow convergence to steady-state scaling Multiplicative stochastic processes are a classical generative mechanisms for heavy-tailed distributions (Sornette and Cont, 1997;Gabaix, 1999;Newman, 2005). In the context of lymphocyte dynamics this mechanism has first been proposed by Desponds et al., 2016, who argued that fluctuations in antigen availability can lead to multiplicative stochastic dynamics producing power-law scaling at steady state. Here, we expand on this earlier work by analyzing a simple fluctuating fitness model out-of-steady-state. Our analytical results show that the emergence of scaling can be slow when the fluctuation amplitude is small.
We opted to treat proliferation rate fluctuations as temporally uncorrelated for computational tractability (Equation 2). Correlations in proliferation rate fluctuations are clearly an important feature of short-term dynamics -for example to describe the quick expansion and contraction during and following acute infection over a timescales of days and weeks, respectively (Mayer et al., 2019). However, given finite correlation times we expect to be able to capture dynamics over the long timescales which we are interested in here, with uncorrelated noise with an effective net fluctuation strength that averages over the short-term dynamics.
In this limit clone sizes follow a geometric Brownian motion, that is x ¼ log C=C 0 follows the Langevin equation with initial condition xðt i Þ ¼ 0, where s sets the fluctuation strength and where hh i ðtÞh j ðt 0 Þi ¼ d ij dðt À t 0 Þ. A negative mean fitness f 0 <0 balances the recruitment of new clones and the net expansion induced by the fluctuating term. In general, we might want to include also demographic noise and the extinction of clones as an absorbing boundary condition (Desponds et al., 2016), but here for simplicity we will neglect those effects. Equation 20 is a diffusion equation for the logarithmic clone size x and has the well-known Green's function Gðx; y; tÞ ¼ which describes how the distribution spreads out from an initial d-distribution centered at size y. The clone size distribution at time T is given by where t is the clonal age. For a constant immigration rate t is uniformly distributed and we obtain by integration where ðxÞ is the Heaviside step function, ðxÞ ¼ 0 for x<0 and ðxÞ ¼ 1 otherwise. For large T and x>0 this reduces to PðxÞ ! e f 0 x s 2 =ðÀf 0 TÞ; which implies PðCÞ~C Àð1þaÞ with a ¼ Àf 0 =s 2 ; Plotting the cumulative distribution of clone sizes at different effective ages ( Figure 6) we observe that the convergence of clone size distributions is slow when s 2 is small. Based on estimates for the fluctuation strength from longitudinal data ( Figure 4B), we would expect significant deviations from the steady state power-law scaling that persist into adulthood. Thus, this mechanism alone is unable to account for the observed power-law scaling in data.
A note on the scaling exponent A minimal requirement for the existence of a steady state is f 0 <0 ensuring that clones eventually die to balance the recruitment of new clones. This condition still allows such multiplicative processes to produce power-laws with arbitrary exponents as noted before (Desponds et al., 2016). Here, we propose that the parameters should fulfill a stronger condition. In particular, it seems reasonable to require that the large clones do not deterministically take up a larger fraction of the overall repertoire, or equivalently that their expected change in clone size should not exceed one. The mean of the lognormal distribution of clone size change is given by e f0þs 2 , and thus we find the stronger condition Àf 0 <s 2 : Importantly, it follows that exponents in the vicinity of a ¼ À1 arise without fine-tuning as long as the timescale of expected net clonal decay is large compared to the diffusion timescale.
Another perspective on the parameterization is provided by noting that the Langevin equation for C (not x ¼ log C) in the Stratonovich convention includes an extra drift term Às 2 , to keep hDCi independent of the choice of s. Alternatively, in the Ito convention the extra drift term arises by Ito's lemma when transforming the equation from C to x.

Predictions for longitudinal fluctuations in clone sizes
To quantify longitudinal fluctuations, we calculate the mean and variance of log-clonesize changes with respect to a reference time t 0 . From the model we, according to Equation 21, expect hðxðtÞ À xðt 0 Þ À hxðtÞ À xðt 0 ÞiÞ 2 i ¼ 2s 2 t: The variance of log-clonesize changes in empirical data involves an additional term s 2 S accounting for sample-to-sample variability. This term is expected not to depend on the time difference, and we can thus determine s 2 by linear regression with an intercept that captures the sampling variability s 2 S ( Figure 4B).
We note that a similar approach has been independently proposed in unpublished work by Ferri, 2018.

Relaxation of the zero insertion distribution
Here, we solve for the relaxation dynamics of the zero insertion distribution in a simplified setting. Throughout we use log clone sizes x ¼ log C for notational convenience. We posit that at time 0 the power-law distribution Pðx; 0Þ ¼ ae Àax is already established and we further assume that the r $ largest clones have zero insertion probability p 0;À and all smaller or later added clones have probability p 0;þ . Then the probability that a clone of a given size x has zero insertions is given by where Dp 0 ¼ p 0;À À p 0;þ and f early ðx; tÞ is the fraction of clones of size x and time t that derive from the r $ largest clones at time 0.
In the following, we determine an analytical formula for f early ðx; tÞ under the assumption that the dynamics leaves the distribution unchanged Pðx; tÞ ¼ Pðx; 0Þ. We then have f early ðx; tÞ ¼ R ¥ xmin dy e Àay Gðx; y; tÞ e Àax ; where Gðx; y; tÞ as before is the Green's function of the fluctuating proliferation rate dynamics and x min is defined such that the total number of clones times Pðx>x min Þ equals r $ . By integration one obtains which after setting f 0 ¼ Àas 2 reduces to f early ðx; tÞ ¼ To convert clone size into ranks, we note that rank~e Àax and thus x min À x~1 a log r r $ À Á . In combination with Equation 33 and Equation 30 we thus obtain Defining a characteristic timescale for the diffusive dynamics as t d ¼ 1=ðasÞ 2 , we can simplify this expression to

Repertoire formation
To simulate the model efficiently at large scales, we use a mean-field competition approximation (Materials and methods -Mean-field competition approximation). We verified the validity of the mean-field assumption by comparing them to full stochastic simulations of the coupled birth-deathimmigration equations, which we simulated using the Gillespie algorithm (Press et al., 2007;Figure 5). In the mean-field approximation, the proliferation rate is time-dependent, which requires a specific procedure for sampling event times. The time interval until the next event depends on the total rate for all possible processes lðtÞ ¼ þ bðtÞ þ d : To sample an interval of time Dt between two events from an inhomogeneous Poisson process of rate lðtÞ one can sample from a Poisson process with a rate function l $ ðtÞ fulfilling the majoration condition l $ ! lðtÞ 8t and then reject a proposed time interval Dt $ with a probability of 1 À lðt þ Dt $ Þ=l $ ðt þ Dt $ Þ ( Lewis and Shedler, 1979). The thinned set of event times follows the statistics of the Poisson process with rate lðtÞ. Here, because competition is increasing with time, lðtÞ decreases monotonically. Therefore, the homogeneous Poisson process with a constant rate function l $ ðtÞ ¼ lðt 0 Þ, satisfies the majoration condition. Using this thinning technique, we are able to efficiently sample the next event time while accounting for the time-dependence of the proliferation rate. The source code that allows reproduction of the statistical analyses and numerical results reported in the manuscript is published (Gaimann et al., 2020).

Simulated cohort
As empirical evidence shows that the tail of the clone size distribution is almost exclusively driven by cells with memory phenotype (Appendix 5), we focused on the clone size dynamics within the memory compartment. We assumed that the recruitment size for memory cells is independent of the prior naive cell dynamics, and we thus did not explicitly model the clone size dynamics within the naive compartment. Within the memory compartment, we modeled clone size dynamics under the combined effect of early deterministic expansions during repertoire formation and fluctuating clonal growth rates according to Equation 2. Given the large sizes of memory clones, we expect demographic stochasticity to be negligible relative to clone size variability introduced by fluctuating selection. For tractability, we thus ignored demographic fluctuations, which allowed us to combine the continuum solution to the deterministic clonal growth (Equation 16) with the stochastic propagator for the fluctuating dynamics (Equation 21) to efficiently simulate the dynamics. To study the enrichment of zero insertion clones in silico, we assigned newly recruited memory clones as having zero insertions with a probability equal to the fraction p 0 ðtÞ of zero insertion clones within the naive compartment. We assumed p 0 ðtÞ ¼ p 0;À before TdT expression turn-on at time t † and p 0 ðtÞ ¼ p 0;À t=t † þ p 0;þ ð1 À t=t † Þ for t>t † , where t=t † is the fraction of naive clones produced since the switch to the adult recombination statistics. Taken together, these simplifications lead to the following direct sampling scheme: . Sample the age T of an individual uniformly from the range ½0; 80 years. . Set the number of clones equal to T (rounded to the nearest integer), where is the rate of recruitment of new clones to the memory compartment.
. For each clone determine its recruitment time t i by drawing uniformly from the range ½0; T. . Assign each clone as having zero insertions with a probability Sample the size C i ðTÞ of each clone as follows (Equation 16 and Equation 21), where d; g; s 2 are model parameters and y~Nð; s 2 Þ indicates x being drawn from a normal distribution of mean m and variance s 2 .
. Finally to mimic the experimental sampling depth of N sample reads we determine sampled clone sizesC i by Poisson sampling, where x~PoisðlÞ indicates x being drawn from a Poisson distribution of parameter l.

Parameter choices
In the following, we summarize the parameters we used to simulate repertoire dynamics and provide additional motivation for our parameter choices. Lifetimes of several years and several months have been measured by deuterium labeling for naive and memory T cells, respectively (De Boer and Perelson, 2013;Borghans et al., 2018). Clonal turnover can be substantially slower than cellular turnover when proliferation balances most death (Appendix 2). This has been shown to be the case for the maintenance of naive cells in human , where the aging-associated decline of the fraction of T cells with TCR excision Table 3. Parameters of the simulated cohort ( Figure 3C). circles (TRECs) suggests g~0:1. Similarly, memory T cell numbers decline much more slowly overall than suggested by the deuterium labeling literature, which is thought to be driven by homeostatic proliferation in the absence of reinfection (Macallan et al., 2017). For example, T cell memory has been observed to decline with half-lives of 8-15 years by following titers after small pox vaccination (Hammarlund et al., 2003). Additionally, the relatively short average lifetime of memory T cells likely masks substantial heterogeneity with a subset of more long-lived cells also contributing to the slower long-term decline of memory cells (Akondy et al., 2017). Another line of direct evidence for long clonal persistence has come from two studies of identical twins (Pogorelyy et al., 2017;Tanno et al., 2020), which have shown an excess sharing of identical clones decades after in utero blood exchange in monochorionic twins.
To simulate repertoire formation ( Figure 2B), we used a set of parameters summarized in Table 2. In choosing these parameters, we were not trying to reproduce exact values of any particular T cell subset, but rather illustrate a plausible biological parameter regime that characterizes T cell dynamics. We note that qualitatively the scaling presented in Figure 2B only depends on g as shown by our theoretical analysis, in a way that is shown in Figure 2D. For the recruitment rate , we used a rate intermediate between those suggested by estimates of thymic output and the rate of recruitment of memory clones suggested by estimates of the diversity of the memory compartment (see below). We note that under mean-field competition the rate of recruitment only determines the overall number of clones, but does not influence the dynamics of individual clones. While decreases during adulthood for both the naive compartment (as thymic production wanes with age) and the memory compartment (as new primary responses are rarer at advanced age), we used a constant for simplicity. We expect this simplifying assumption not to qualitatively affect the results in Figure 2 due to a separation of timescales: The clone size scaling emerging from the expansionary dynamics only depends on the rate during infancy and not on slower changes in happening during adulthood. We chose a recruitment size of one to numerically investigate potential deviations from the continuum theory predictions due to demographic stochasticity. The number of recruited clones is of course much larger in practice in the memory compartment, and even naive cells undergo a few rounds of division before thymic export and thus have C 0 >1. Assuming a larger C 0 will further decrease the small effects of demographic stochasticity relative to the continuum theory.
To study the enrichment of zero insertion clones in a simulated cohort ( Figure 4C), we used the same recruitment to proliferation ratio and death rate as in the previous simulation of repertoire formation. To determine the absolute number of large clones that have zero insertions in these simulations, the choice of the recruitment rate is important. Based on order-of-magnitude estimates of the clonal diversity of the memory compartment (Robins et al., 2009;Qi et al., 2014), we chose a value of ¼ 10 5 /year. Additionally, we chose a fraction of zero insertion clones within the early naive compartment of p 0;À ¼ 0:07 (roughly equal to their overall fraction in cord blood [Pogorelyy et al., 2017]) and in the late naive compartment equal to p 0;þ ¼ 0:02 (roughly equal to their overall fraction in adult blood). Finally, we used t † ¼ 0:05 years for the time of the recombination switch, which together with the choice of produces~10 4 excess zero insertion clones recruited during repertoire formation in line with the empirical enrichment data in the <10 years age group ( Figure 3B). All parameters are summarized in Table 3.

Inference of scaling exponent
Given a clone of size C in the repertoire, the number of reads from that cloneC follows a distribution PðCjCÞ. The form of PðCjCÞ depends on the sampling process. To build intuition let us consider the simplest case, in which every cell is sampled independently with a probability h, the subsampling fraction. Then the sampling distribution is binomial The mean of this distribution is hCi ¼ hC; which implies that sampled clone sizes are on average smaller by a factor h than the actual clone size. In the practically relevant limit where the sampling fraction is small, h ( 1, we can further simplify and assume that the counts from the large clones follow a Poisson distribution. In the Poisson limit, the sampled clone size varies around its mean value with a coefficient of variation that scales as an inverse of the square root of the mean sampled count, Importantly, the stochastic sampling introduces a subsampling scale,C ¼ hC~1, at the clone size C ¼ 1=h, from which on average we expect a single sampled cell. Due to the existence of this scale subsampling breaks scale-invariance: even if PðCÞ follows a perfect power law, the distribution of sampled counts deviates from power-law scaling close toC ¼ 1. This intuition can be made rigorous using a generating function formalism (Stumpf et al., 2005): for example for PðCÞ ¼ C À2 =zð2Þ one obtains for C>1 PðCÞ~1 CðC À 1Þ : As expected the scaling with an exponent À2 is recovered asymptotically, but subsampling leads to a deviation from scaling whenC is close to 1.
The deviation from scaling due to subsampling leads to biases in naive estimates of the scaling exponent. How can we determine a power-law exponent in a way that is robust to subsampling? When the sampling distribution is known or can be inferred from replicate sequencing the exponent can be determined using maximum likelihood estimation of a model with an underlying power law distribution of clone sizes convolved with the sampling probability (Puelma Touzel et al., 2020).
Here, we propose a simpler approach that does not require precise knowledge of the sampling process. We exploit the fact that the deviations from scaling vanish asymptotically for largeC (Equation 42), by excluding small clones below some minimal size C min from the fitting. The power-law exponent is expected to converge as we increase C min , which we confirm using simulated data (Appendix 1-figure 1, blue dots). We can also consider more realistic models for the sampling process that account for overdispersion, that is their coefficient of variation exceeds the minimal value of one set by Poisson sampling. Mechanistically, such overdispersion arises for a number of reasons, most importantly because in practice we are not actually directly counting cells: in the DNA-based sequencing pipeline every cell can give rise to multiple sequencing reads due to the PCR amplification step, and in the mRNA-based sequencing pipeline despite the addition of unique molecular identifiers several of them can originate from different mRNA molecules from the same cell. As long as the number of reads from each cell is independently and identically distributed the law of large numbers ensures that the relative frequencies of large clones converge. We thus expect that the trimming method of fitting only to counts greater than C min also works for overdispersed sampling. We test the trimming method on simulated data, in which the sampling follows a negative binomial distribution with mean m and variance þ a 2 (which reduces to Poisson sampling for a ¼ 0). We find that trimming allows a correct estimate of a (Appendix 1-figure 1, orange and green dots).
Appendix 1-figure 1. Estimated power-law exponents converge to correct value using trimming method. Fitted exponent as a function of the cutoff choice in simulated data (errorbars ±2 SE over 50 independent draws). The fitted exponent changes drastically for small C min before leveling off indicating deviations from true power-law scaling at the smallest clone sizes. Such a deviation is expected due to subsampling despite the true power-law scaling in the underlying distribution (see text). Simulations: 10 7 clones were drawn from a discrete power-law distribution with a ¼ 2:15. A sample of size 5 Á 10 5 cells was then drawn from the underlying power law based on a Poisson (blue dots) or negative binomial sampling (orange and green dots show two choices of the overdispersion coefficient a).
Applying the same method to the empirical data we find that the fitted exponents also depend on C min (Appendix 1-figure 2). In practice, we chose C min ¼ 16 to balance a trade-off between minimizing bias and variance, which increases as more of the data is excluded from the fit Appendix 1figure 2.

A B
Appendix 1-figure 2 continued on next page

Modeling neutral repertoire dynamics
In the following, we review results on the neutral dynamics of clone sizes in which the continuous recruitment of new clones is balanced by a net negative growth of already established clones b<d. These models have a long history in ecology (Volkov et al., 2003), and have also been proposed as null models in the context of T cell dynamics previously (Desponds et al., 2017;Desponds et al., 2016;de Greef et al., 2020). While the results can be found in the literature their inclusion serves to introduce a parametrization highlighting the recruitment to proliferation ratio g as a key quantity governing clonal dynamics.

Steady-state clone size distribution
At steady state, the probability distribution PðCÞ of clone sizes C needs to fulfill the balance condition for all C>C 0 which yields This distribution is characterized by power-law scaling with an exponent of 1 for small clone sizes, and, importantly, has an exponential cutoff at C $ ¼ 1= logðd=bÞ. In contradiction with this model experiments point toward power-law scaling with an exponent~2 (Note that PðCÞ~C ÀaÀ1 when rank~C Àa ). Additionally, the size of large clones seen experimentally is incompatible with the predicted exponential cutoff as we discuss below.
The total repertoire size follows the following continuum equation such that at steady-state, dN dt ¼ 0, the repertoire has a total size For a more interpretable alternative parametrization, we introduce the recruitment to proliferation ratio for the maintenance of cells at steady state Using this relation to rewrite Equation 44 we obtain implying a cutoff clone size of C $ ¼ 1= logð1 þ gÞ. The largest clones represent on the order of one percent of the repertoire, which assuming independent sampling from the underlying repertoire would correspond to~10 10 cells in the complete repertoire. For small g, we can expand C $ » 1=g, so in order to have a cutoff clone size C $ of this order of magnitude one would need to have an unreasonably small g~10 À10 .

Relaxation time scale in a neutral model
Over what timescale do transiently expanded clones disappear in a neutral model? The time scale t c ¼ 1 dÀb for deterministic clonal decay can be much larger than the lifespan 1=d of a single cell when birth and death are closely balanced. Rewriting the birth rate in terms of g and d we obtain demonstrating that for g ( 1 clonal dynamics is a factor of 1=g slower than cellular dynamics.

Influence of model assumptions on repertoire formation process
For tractability and interpretability, we have kept the model presented for repertoire formation in the main text deliberately simple. Here, we explore how a saturation of the proliferation rate, competition for specific resources, or variations in the recruited number of cells impact the clone size distributions.

Saturation of proliferation rate
Cellular growth is not arbitrarily fast, which is not accounted for in the simple model in which cells proliferate very rapidly early in life. To understand how such a saturation effect influences clone size distributions, we introduce an upper limit on birth rate that limits proliferation in the absence of competition. Following De Boer and Perelson, 1995, we set the clonal birth rate to bðtÞ ¼ b 0 =ðK þ NÞ for some constant K, which sets the repertoire size below which competition is negligible. Given this choice the birth rate remains limited to a value b 0 =K even in the absence of any competitors. Increasing K leads to deviation in the scaling of the largest clones, but the same scaling remains at intermediate clone sizes (Figure 2-figure supplement 2). In the model early clonal growth is exponential until the total repertoire has reached size NðtÞ~K, which explains the different distribution of the largest clones. However, the number of clones that are recruited during this phase grows only logarithmically with K due to the exponential increase in total repertoire size.

Competition for specific resources
T cells respond to stimuli from peptide-MHC complexes, which could also act as limiting resources. T cells then compete only with those cells specific to the same antigens in contrast to the global competition considered previously. To assess how assumptions about the mechanisms of competition influence, our results we simulated the repertoire formation process using a classical description of competition for antigens (De Boer and Perelson, 1994;De Boer et al., 2001;Mayer et al., 2015). We consider a fixed number of antigens N a and encode the specificity of the M clones in a matrix K of size M Â N a , where K ij ¼ 1 if clone i recognizes the antigen j and K ij ¼ 0 otherwise. We draw the entries of K independently with a fixed binding probability p b . We assume that the proliferation rate of a cell of the i-th clone is proportional to the amount of antigenic stimulation: where the availability of antigen j is given by The normalization of Equation 50 ensures that total proliferation is comparable to a global resource model with the same parameters independent of N a . For computational tractability, we simulated the clone size dynamics without taking into account demographic stochasticity in proliferation and death of cells. While more specific competition (smaller p b ) leads to a deviation in the distribution of the largest clones, we find that clone size distributions are heavy tailed independently of the choice of p b and all display the same scaling at intermediate clone sizes (Figure 2

Variations of the recruitment size
The numbers of cells C 0 that are recruited might also be variable. In particular, this will be the case when modeling the memory compartment, in which C 0 represents the number of cells from a clone recruited into memory following infection. To understand how such variations modify the dynamics of repertoire formation, we derive an analytical prediction in the case where the distribution of recruitment sizes, PðC 0 Þ, is lognormal. Given a lognormal distribution with parameters 0 and s 0 the mean introduction size is given by C 0 ¼ e 0þs 2 0 =2 . To keep the mean introduction size constant while changing the variability of clone sizes, we use a parametrization in terms of C 0 and s 0 and set 0 ¼ logð C 0 Þ À s 2 0 =2. To determine the clone size distribution resulting from early repertoire growth, we integrate the continuum theory prediction, PðC=C 0 Þ / ðC=C 0 Þ À2Àg over the distribution of C 0 : The complementary error function ðxÞ saturates for x ( À1. Thus, the distribution follows the same power-law PðCÞ~C À2Àg for large clones, log C= C 0 ) s 0 ffiffi ffi 2 p þ 3þ2g 2 s 0 À Á , while it deviates for smaller clones within the range of recruitment sizes (Figure 2-figure supplement 1).

A unified view on mechanisms generating power laws in different growth processes
The origin of power-law scaling during repertoire formation is reminiscent of a class of stochastic processes widely studied in the literature as a mechanism underlying power-law distributions found in diverse contexts (Yule, 1924;Luria and Delbrück, 1943;Barabasi and Albert, 1999), which has been rediscovered multiple times since the pioneering work of Yule on speciation (Yule, 1924). Common to these processes is that the distribution of types at a given point is the result of a balance between the growth of existing types and the addition of new types. The different models depending on their context differ in (i) the growth rate rðtÞ of the number of units of each already existing type and (ii) the rate function ðtÞ at which new types are introduced. They all share the same basic mathematical mechanism that produces a power law distribution of types as we review below. The three maybe most well-known instances of this class of processes are the following: . The Yule model of speciation (Yule, 1924), in which (i) species within a genus speciate at some constant rate, and (ii) new genus is created at a rate proportional to the number of already existing genera.
. The Luria-Delbrü ck model of population genetics during exponential growth (Luria and Delbrück, 1943) in which (i) each cell divides at a constant rate, and (ii) new alleles arise through random mutation at a constant rate per cell division.
. The Barabá si-Albert model of network growth (Barabasi and Albert, 1999), in which at every time step (ii) a new node is added, and is (i) linked to m already existing nodes with a probability proportional to the number links that a chosen node already has.
Despite differing in their assumptions about the functional form of the growth and innovation rates we show in the following that these different models all share a common mathematical basis. To provide a common terminology, we will use the language of urn models and refer to different types as urns and to the different number of units of each types as balls in each urn. In an attempt to unify the different models, we develop a continuum theory for these growth-innovation processes. To do so, we rescale time to such that new urns are added at unit rate, ðt Þ ¼ 1. The number of balls in each urn then grows according to The key to the power-law scaling in all these models is the existence of a regime in which that is the growth rate scales inversely with rescaled time with a proportionality factor 1=a. Equation 55 has the same form as Equation 18 that we derived for our model of repertoire formation. Thus following the derivation of Equation 19 within Materials and methods -Continuum theory of clonal growth we obtain a subexponential growth of balls in already existing urns, which leads to a power law scaling of the distribution of balls per urn, with an adjustable exponent that depends on a.
Before deriving how Equation 55 arises in specific contexts let us first remark on a general consequence of this form of effective growth law: The total number of balls added to all existing urns per rescaled time unit is constant. To derive this let us assume that each new urn is populated by C 0 balls, then the total number of balls NðtÞ ¼ P i C i grows according to which is solved by Nðt Þ ¼ C 0 a a À 1 t : Multiplying by the growth rate Equation 55 yields a constant, thus showing the equivalency between the assumed growth rate dependency on rescaled time and the constancy of how many balls are added per rescaled time unit.
In the Yule process, we have rðtÞ ¼ r and recruitment is proportional to the number of genera, ðtÞ ¼ sGðtÞ, which grow at a (generally different) rate s, GðtÞ ¼ G 0 e st . By integration we obtain t ðtÞ ¼ G 0 e st À 1 ð Þ, which leads to zðt Þ ¼ r st þsG0 . Thus zðt Þ » r st when the number of newly created genera exceeds the initial number t ) G 0 . The exponent of the power-law is determined by the ratio of the growth of genera and species, a ¼ s=r.
In the Luria-Delbrü ck model, we have rðtÞ ¼ r assuming mutations do not change the growth rate. Recruitment is proportional to the total population size ðtÞ ¼ rNðtÞ, where m is the mutation probability per replication and where NðtÞ ¼ N 0 e rt . By integration we obtain t ðtÞ ¼ N 0 e rt À 1 ð Þ, which leads to zðt Þ ¼ 1 t þN0 . Thus zðt Þ ¼ 1 t when t ) N 0 . In contrast to Yule's model, the power-law exponent is fixed at a ¼ 1 under our assumption that mutations are neutral, because the same growth process governs the increase in ðtÞ and in cell numbers.
In the Barabasi-Albert model, the introduction rate ðtÞ ¼ 1 is constant, but rðtÞ decreases with time. The m newly added links attach preferentially to those nodes that already have a large degree. The growth rate rðtÞ ¼ m=NðtÞ of a node thus decreases proportionally to the total degree N ¼ 2mt of all present nodes. We have rðtÞ ¼ 1=ð2tÞ, which implies zðt Þ ¼ 1 2t and a ¼ 2.

Relation between clone size and cellular phenotypes
In both cohorts, all T cells from peripheral blood were sequenced irrespective of their phenotypes. Antigenic challenges drive large clonal expansions and we thus expect clones with effector or memory cells to be larger than naive clones all else being equal (Farber et al., 2014;Mayer et al., 2019). This has generally been confirmed by TCR repertoire sequencing studies (Oakes et al., 2017), but there have also been some reports (Qi et al., 2014;Pogorelyy et al., 2017) of expanded naive clones with similar sizes to the largest memory clones. Given this unclear picture from the literature, we analyzed the relative contribution of naive and memory cells to clones of different sizes. Overall, we might expect that naive clones dominate the clone size distribution at the smallest sizes. To test this idea, we compared sequencing and flow cytometry data from the Britanova cohort and found that the fraction of naive cells in different individuals explains a remarkably high 88% of variability in the number of clones sequenced only once after subsampling all repertoires to the same size (Figure 1-figure supplement 2A). To further determine how cells from clones of different sizes partition phenotypically, we analyzed data from a study in which T cells were sequenced both in unsorted blood as well as after sorting into naive and memory cells (Chu et al., 2019). We find that the sizes of large clones follow the same scaling in unsorted blood and in the memory compartment ( Figure 1-figure supplement 2B) Within the naive compartment most clones are small, in particular when excluding clones from which cells are also found in the memory compartment (Figure 1-figure supplement 2B, red line). We note from the plot that all the largest 200 clones in unsorted blood have memory phenotype cells, and less than 1% of the top 1000 clones are not found within the memory compartment. This rules out that the enrichment of zero insertion clones among the most abundant clones found in Figure 3 is driven by naive clones as has been suggested in a previous study (Pogorelyy et al., 2017). The relative frequency of a clone within the memory compartment is larger by a constant fold-factor (Figure 1-figure supplement 2C), likely reflecting an increased relative frequency of the large clones when excluding naive cells from the denominator.
Our analysis of the clone size distributions in unsorted blood shows that the largest clones take up an increasing fraction of all sampled reads with age ( Figure 1-figure supplement 3B,E). Flow cytometry data shows that the overall fraction of memory T cells in blood also increases with age ( Figure 1-figure supplement 2D; Pediatric AIDS Clinical Trials Group et al., 2003;Britanova et al., 2016). How much is the former increase explained by the latter? In the following, we show how to answer this question in the absence of direct data on sorted memory T cell repertoires. Let us define C þ i as the number of memory cells of the i-th clone and C À i as the number of naive cells. What we are measuring in unsorted blood is the overall fractional size f i of the clone, What we are interested in instead is the fraction of the memory compartment taken up by cells belonging to the same clone, We have shown above that empirically for the largest clones C þ i ) C À i . Thus where r þ is the fraction of memory cells within the repertoire. Finally, we approximate r þ by its mean value expected at each given age as fitted from the flow cytometry data (Figure 1-figure  supplement 2D). We find that this normalization mostly collapses the tails of empirical clone size distributions (Figure 1-figure supplement 3C,F). This implies that most of the increase in the sizes of the largest clones is explained by the overall expansion of the memory compartment, while the relative clone sizes within the memory compartment are more stable.