RESTAMP – Rate estimates by sequence-tag analysis of microbial populations

Microbial division rates determine the speed of mutation accumulation and thus the emergence of antimicrobial resistance. Microbial death rates are affected by antibiotic action and the immune system. Therefore, measuring these rates has advanced our understanding of host-pathogen interactions and antibiotic action. Several methods based on marker-loss or few inheritable neutral markers exist that allow estimating microbial division and death rates, each of which has advantages and limitations. Technical bottlenecks, i.e., experimental sampling events, during the experiment can distort the rate estimates and are typically unaccounted for or require additional calibration experiments. In this work, we introduce RESTAMP (Rate Estimates by Sequence Tag Analysis of Microbial Populations) as a method for determining bacterial division and death rates. This method uses hundreds of fitness neutral sequence barcodes to measure the rates and account for experimental bottlenecks at the same time. We experimentally validate RESTAMP and compare it to established plasmid loss methods. We find that RESTAMP has a number of advantages over plasmid loss or previous marker based techniques. (i) It enables to correct the distortion of rate estimates by technical bottlenecks. (ii) Rate estimates are independent of the sequence tag distribution in the starting culture allowing the use of an arbitrary number of tags. (iii) It introduces a bottleneck sensitivity measure that can be used to maximize the accuracy of the experiment. RESTAMP allows studying microbial population dynamics with great resolution over a wide dynamic range and can thus advance our understanding of host-pathogen interactions or the mechanisms of antibiotic action.


Introduction
During the last decade, considerable advances have been made toward understanding the detailed population dynamics of pathogens [1][2][3][4][5][6]. These were made possible by theoretical [1,3,7] and methodological developments such as signature tagging [1,8] and next-generation sequencing [9]. Mutations mainly occur during replication, and therefore microbial division rates are the main driver of the rates with which pathogens acquire antibiotic resistance or evade vaccines. A complete understanding of the complex population dynamics from the level of individual dynamical processes, such as division and death, offers a way to counter the rise of antibiotic resistance [5,10,11] and for rational design of vaccines and therapies against pathogen colonization and infection [1][2][3][4].
One set of methods for determining the division and death rate of a microbial population relies on a single identifiable marker that loses signal strength with each division. These markers include phenotypic tags such as conditionally non-replicative plasmids [12] unstable plasmids [13] and fluorescent inclusion bodies [14]. sion [15,16]. The ratio of marker-to-marker-less cells or the magnitude of the marker signal, are used to estimate the division and death rate. These methods yield robust and accurate rate estimates for times short enough to ensure that the markers have not been completely diluted out of the population. For example, Frenoy et al. [12] used conditionally non-replicative plasmids in Escherichia coli to show that mutation rates are systematically overestimated in bacteria under stressful conditions unless the individual division and death rates are taken into account. In the study by Myhrvold et al. [14] self-aggregating fluorescent proteins were used to estimate division rates and death rates, which were subsequently used to inform a mathematical model of the population dynamics of E. coli in the mouse gut. Although successful, the application of marker-based techniques is dependent on the assumption that the markers are fitness neutral, i.e. do not change the wildtype microbial division rates and death rates. This is typically experimentally challenging to confirm within the context of within-host infection models.
An alternative approach relies on the use of distinguishable inheritable markers, where each marker labels a subpopulation of cells and the change in the composition of the total population is used to infer division and death rates by quantitative stochastic population-dynamic approaches. In principle, these methods offer an unlimited observation time with their accuracy limited by the total number of unique tags. For example, sequence tags inserted into a fitness neutral locus in the genome (wild type isogenic tagged strains; WITS) have been successfully used to investigate the pathogenesis of Salmonella enterica serovar Typhimurium [1] to quantify the effects of different vaccines [3] and investigate the colonization of the cecal lymph node [2]. Vlazaki et al. reviews the underlying mathematical approaches for estimating rates and the applications in more detail [15]. However, it remains unclear how the error in the rate estimates depends on the specific experimental protocol. Importantly, the impact of technical population bottlenecks on rate estimates is typically not accounted for and when addressed require additional calibration experiments [3]. Technical bottlenecks can change the population composition and thereby the basis for rate estimates [17]. Bottlenecks are often inevitably introduced during the experiment, for example when sampling a small volume from a large volume or when sequencing a limited number of cells. Moreover, the typical mathematical analysis of WITS data constrains the experimental design, and requires a uniform distribution of tags in the starting culture to work accurately. This is experimentally challenging to achieve for a large number of tags and therefore restricts the number of useable distinguishable sequence tags.
A number of studies have employed population genetic concepts to study microbial dynamics qualitatively [8,18,19]. The sequence tag-based analysis of microbial populations (STAMP) method allows for an indefinite number of sequence tags to be incorporated, limited only by the throughput of next-generation sequencing. Like WITS, STAMP allows for a much longer observation time than marker-loss methods. STAMP also prescribes a simple way to aggregate the change of each individually tagged subpopulation into a single measure, the founder population size. The founder population size carries a simple interpretation as the size of a population that survived a death event [20] (Fig. 1). These events are also often referred to as bottlenecks and can correspond to host-pathogen interactions, e.g. physical barriers, immune defenses, nutritional limitations, etc. The smaller the size of the founder population, compared to the initial population, the greater the stringency of the bottleneck. This measure can be affected by artificially changing the composition of the tagged population by technical handling of the sample after the biological process, for example by transferring only part of a sample by pipetting or by sequencing to an insufficient sequencing depth. These events can be seen as random sampling events [17].
In this work, we specifically distinguish them as technical bottlenecks in contrast to the biological bottlenecks due to birthdeath processes. In combination with the current bacterial burden, measured by colony forming units (CFU), the founder population size can help understand the detailed dynamics of a population. For example, if the current bacterial burden is linked to a small founder population size that would indicate growth of the microbes, while the same bacterial burden linked to a large founder population size would indicate slower growth. Loss of tag diversity indicates bacterial death, and the more bacteria are killed, the more tag diversity is lost. Fig. 1D illustrates how tags are lost depending on the division or death rates. At one extreme, when there is only death, the founder population size decreases in step with the bacterial burden. At the other extreme, when there is only growth, the tag diversity does not change and the founder population size remains constant. Hence, the simultaneous time-course data for the bacterial burden and the founder population size provides a unique signature for how rapidly cells divide and die that has been exploited to qualitatively assess the relative contribution to bacterial division and death events [8] (Fig. 1E). However, to date there exists no mathematical framework for the STAMP method able to infer these rates from CFU and founder population size values.
In this work, we develop and expand the population genetic framework for the STAMP method to quantify the relationship between the founder population size and microbial division and death rates. Our method, RESTAMP (Rate Estimates by Sequence Tag Analysis of Microbial Populations), allows for estimating rates via measurements of the founder population size and colony forming units that are independent of the distribution of the sequence tags. The RESTAMP method relies on estimating the magnitude of the fluctuations in the genetic composition of cells as they undergo random birth-death processes. Unavoidable technical bottlenecks such as sampling a small volume and sequencing influence the fluctuations in the genetic composition and can lead to biased rate estimates. We show that measuring the founder population size permits for a simple analytical scheme to correct for technical bottlenecks. Finally, we propose a bottleneck sensitivity measure and show how the bottleneck sensitivity depends on experimental parameters. This measure can be used to design experiments to maximize the accuracy and precision of bacterial division rate and death rate estimates. The method is validated by simple control experiments, aimed at emulating a pure death process (Fig. 1AB), and for cells growing in lysogeny broth (LB) media (Fig. 1C) by comparison with the well-established marker-loss plasmid segregation (PS) method.

Results -theory
In this section we develop the population genetic framework for the experimental STAMP method (equation (1)) against the backdrop of a random birth-death process. In subsection 2.1 we consider the ideal case where there is no influence of technical bottlenecks, e.g. due to sequencing or sampling, and derive an equation (equation (5)) that relates the mean founder population size to the division and death rate of a population of cells. In subsection 2.2 we explore the influence of technical bottlenecks on the analysis of sequence tags for estimating division and death rates. Here we also explore the impact of using an experimental estimate of the reference state at time 0 by sampling. The main result of this subsection is equation (7), which prescribes how to subtract the influence of technical bottlenecks. In subsection 2.3 we propose a bottleneck sensitivity measure, the purpose of which is to quantify the sensitivity in the rate estimates against making an error in the bottleneck correction terms in equation (7). The rationale for the bottleneck sensitivity measure is to maximize the variance in the frequencies of sequence tags due to random birth-death events relative to the variance in the frequencies induced by experimental sampling events.
2.1. RESTAMP -bacterial division rate and death rate estimates from the founder population size To model the population dynamics of k (see Table 1 for the definition and meaning of variables) distinguishable and independent subpopulations of cells we adopt the standard stochastic framework for which the trajectories are assumed to be continuoustime Markov processes [22].
We consider a birth-death process where both the time until the next event and the type of event are random variables. The event is either a division with rate b per unit time, defined as the inverse average time it takes for a cell to divide, or a death event with rate d per unit time, defined as the inverse average time it takes for a cell to die. The division rate and the death rate do not depend on the specific sequence tag i = 1, 2, 3,. . .,k since the insertion is fitness neutral. Hence, each subpopulation i undergoes random division and death events with the same rates for a length of time t resulting in a random subpopulation size, n i (t). Consequently, the proportion of cells with a sequence tag i, f i (t), is also is the total population size at time t. The implication is that the founder population size, N B (t), as determined by equation (1), is also to be treated as a random variable.
The equation for the founder population size is derived in the context of a multinomial random-sampling process, where it is interpreted as the population size that survived a multinomial random sampling event [20]. The equation was originally derived by [8] for diploid organisms and adapted by [20] for haploid organisms. The validity of this interpretation is contingent on a small volume being sampled so that the proportion of subpopulation i before sampling, f i (0), remains unchanged after sampling, f i . Here we introduce the notation f i without an explicit time dependence to signify that the change in subpopulation proportions are due to a random sampling process (technical bottleneck) in contrast to changes in subpopulation proportions due to a birth-death process (biological bottleneck) (see Table 1). In this section, we focus on the stochastic birth-death process where we seek to relate the founder population size to the division rate and death rate. By taking the inverse of equation (1) and apply the mean operator, <>, we get where the mean is with respect to repeating the birth-death processes given an initial proportion, f i (0). This assumes that the initial tag frequencies can be determined precisely. Next, we assume that the total population size is large enough to make the error in the approximation of the mean subpopulation proportion as Since the mean subpopulation size for a birth-death process is given by < n i t ð Þ >¼ n i 0 ð Þe bÀd ð Þ t and the total population size is given by where by definition, ance in the subpopulation proportions. Equation (3) is likewise valid for a multinomial random sampling process for which the mean subpopulation size after sampling N B cells is Using the error propagation method we derive (see 5.4 -The variance in the proportion of cells with respect to repetitions for a birth-death process) the variance in the subpopulation proportions for a birth-death process which reads Substituting equation (4) in (3) and making the approximation < N B t ð Þ >% 1= < N B t ð Þ À1 > (i.e. the average of the inverse is not equal to the inverse of the average) we get Equation (5) shows that the mean founder population size due to a birth-death process is independent of the distribution of tags. This simplifies the work of an experimenter aiming to estimate bacterial division rates and death rates as care need not be taken to produce a library of cells with a specific distribution of sequence tags. Consequently, it becomes a simple matter to analyze an arbitrary number of sequence tags which are all aggregated into the founder population size. We also note that < N B (t) > is directly pro- Fig. 1. Schematic of the experimental setups. (A) Illustrates a pure death process for a population of bacteria with i = 1, 2, 3,. . .,k = 1000 unique 30 base pairs sequence tags at a fitness neutral location in the genome. The three magnified cells illustrate the genome within the bacteria (black circle) with potential fitness neutral locations (rectangles) and the fitness neutral location with a sequence tag (color of rectangle/bacterium). The i:th subpopulation is initially present with a frequency f i (0) where the total population size is N(0). The bacteria undergo random death events for a length of time t 1 with rate d per unit time after which the total population size is N(t 1 ) and the frequency of the i: th subpopulation is f i (t 1 ). The founder population size at time t 1 , N B (t 1 ), is calculated by comparing the frequency of bacteria at time t 1 with the initial frequency, f i (0) (equation (1)). After an elapsed time t 2 > t 1 fewer cells remain with N(t 2 ) < N(t 1 ) and N B (t 2 ) < N B (t 1 ). (B) Random sampling of the initial population with the aim of emulating a pure death process. The volume Dvt is sampled from a large volume in which tagged bacteria are suspended such that the number of cells sampled is equal to the number of cells having undergone a death process for a length of time t (A). The founder population size, N B ('t 0 1 ), and the population size, N('t 0 1 ), are determined in the sample Dv t and are used to estimate the death rate, d, as it would be in a pure death process (see 5.1 -Emulating a death process by sampling). The random sampling process in itself is not a real time-dependent process, which is indicated by apostrophes around t. (B) A birth-death process that includes both death events and division events with rate d per unit time and b per unit time, respectively. By measuring the total population size and the genetic drift in terms of the founder population size the division rate and death rate can be estimated. (D) Simulation for tag loss in two populations with the same initial composition but different growth and death rates. The mathematical framework from [21] was adopted to investigate the mean fraction of unique sequence tags (y-axis, RESTAMP) that survives until time t (x-axis) in a birth-death process. The black line represents a population that dies more quickly than it replicates. The red line represents a population that replicates more than it dies. The parameters for the simulation are the same as in Figure Supplementary   portional to the total population size at t = 0, N(0), and is given in units of per volume. This implies that the mean founder population size must be expressed with respect to the same unit of volume for equation (5) to be dimensionally consistent. For example, if N(0) is determined as the CFU per ml and only 200 ml is sampled for determining N B , then the measured N B values need to be multiplied by 5 (200 ml Â 5 = 1 ml). Using equation (5) and the exponential growth model for which < N t ð Þ >¼ N 0 ð Þe bÀd ð Þ t , we solve for the division rate and death rate to get where r is the net growth rate and can be experimentally estimated as the slope of a regression line of ln(CFU) versus time. Net growth rate. The net growth rate is defined as the difference between the division rate b and the death rate d, The length of time cells undergo a birth-death process.
Average number of cells with a sequence tag insertion i at time t. <N(t)> Total average number of cells at time t. f i (t) The proportion of cells having undergone a birth-death process for a length of time t with a sequence tag i.
<f i (t) > denotes the average proportion of cells where the average is with respect to realizations. f i (0) is the proportion of cells with sequence tag i in the inoculum.
The variance in the proportion of cells having undergone a birth-death process for a length of time t with a sequence tag insertion at site i.
The variance is with respect to repetitions of the experiment.
f i The proportion of cells having undergone a random sampling event with a sequence tag insertion at site i.
In this work we assume that experimental samplings (technical bottlenecks), e.g. due to pipetting or sequencing, are modeled as random sampling processes, whereby a subset of cells are sampled such that each cell has the same chance of ending up in the sample.
The variance in the proportion of cells having undergone a random sampling event with a sequence tag insertion at site i.
The founder population size is calculated by comparing the proportion of cells with tag i at time t with the proportion of cells with tag i in the inoculum (equation (1)). The sample size of the j:th random sampling event.

I j
The inoculum size of the j:th random sampling event. Section 2.3: RESTAMP -Bottleneck sensitivity of bacterial division and death rate estimates N B The sample size in a random sampling event. N B is the notation used for the magnitude of a technical bottleneck e.g. the sample size in the transferred volume when pipetting or the sequencing depth when loading the sample on a sequencing chip. s B (t) Bottleneck sensitivity. Defined as the ratio of the variance in the subpopulation proportions due to a random sampling event and a birth-death process, i.e.
Þ. The purpose of S B (t) is to serve as a measure of how sensitive division rate and death rate estimates are to technical bottlenecks. Section 5.1: Materials and Methods -Emulating a death process by sampling Dv t The size of the sampled volume. This is the volume sampled to emulate a death process at time t where Dv t = Dv 0 e -dt with Dv 0 being the sampled volume of the starting culture at t = 0 (Dv 0 = 1 ml in our experiments). s The number of cells at time t for a death process or the number of cells in the sample Dv t . P(n i = s) The probability of sampling s cells with a sequence tag insertion at site i in the sample Dv t . P(n i (t) = s) The probability of s cells remaining at time t in a pure death process. p Probability of a single cell surviving until time t. p = e -dt Section 5.2: Materials and Methods -Plasmid Segregation (PS) F(t) Proportion of cells with a plasmid at time t.
a All cell numbers are implicitly expressed as per unit volume. b Averages over repetitions are denoted with angular brackets <>.

RESTAMP -Correcting for technical bottlenecks
Equations (6ab) prescribe how to estimate the division rate and death rate from CFU measurements and N B measurements assuming that the only contribution to the variance in the subpopulation proportions in equation (3) is due to a birth and death process. Another implicit assumption made with regard to equation (3) is that the variance is conditional on f i (0) which means that f i (0) is treated deterministically. However, a typical experiment involves additional technical bottlenecks that add to the variance in the subpopulation proportions. For example, sequencing is a technical bottleneck due to the limited capacity of the sequencing chip and sample preparation can impose bottlenecks. Additional technical bottlenecks include sampling the initial proportions which means that f i (0) in itself is a random variable. The aim of this section is to understand how to subtract the added variance in the subpopulation proportions due to technical bottlenecks so that the experimentally determined founder population size value is consistent with the assumptions made in deriving equations (6ab) for esti-mating the bacterial division rate and death rate. Equation (3) is central in this endeavor and whose derivation (see 2.3 RESTAMP -Bottleneck sensitivity of bacterial division rate and death rate estimates) relied on the approximation < f i |f i (0)>% f i (0), where < f i | f i (0) > is the mean tag proportions conditional on the tag distribution at t = 0. In this section f i (0) is treated as a random variable, hence < f i >=<f i (0) > by the law of total expectation [23]. Thus, for the derivation of equations (6ab) to be valid the initial proportions in equation (1) need to be substituted for the average initial proportion, <f i (0) > . Experimentally, we use triplicate samples of f i (0) to estimate the mean initial proportions of sequence tags, <f i (0)> (see 6.7-RESTAMP). To separate the contributions from a birth-death process to N B and the contributions due to sampling bottlenecks we iteratively apply the law of total expectation and the law of total variance to propagate the variance in the frequency of sequence tags throughout the experiment illustrated in Fig. 2A (see 6.5 -Correcting for technical bottlenecks by the iterative application of the law of total expectation and the law of total variance). From this analysis we find that subtracting the added genetic vari- Fig. 2. The bottlenecks in a typical experimental setup for RESTAMP rate estimates. The schematic lays out a typical experimental setup to determine division and death rates of cells by RESTAMP. An initial population with i = 1,2,. . .,k = 4 unique sequence tags, indicated by green, purple, beige and blue color, respectively, undergoes a birth-death process (biological bottleneck; indicated by red arrows) for time t with division rate b [min À1 ] and death rate d [min À1 ]. At the beginning of the experiment and at time t, S 1 cells are sampled, i.e. pass through a technical bottleneck (small opening in big grey bars). This could for example represent harvesting a set of cells during a time-lapse experiment. This changes the proportion of subpopulation i from f i (t) to f i,1 . Following genome extraction (black loops), the genetic tag regions (colored triangles) are amplified by PCR, which we assume is unbiased. Hence, the proportion of the subpopulations do not change and remain f i,1 . The amplified tag regions are then sequenced, which constitutes another technical bottleneck (grey bars), where S 2 sequence reads are sampled and the proportions are changed from f i,1 to f i,2 . (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.) ation due to j = 1,2,..,m I technical bottlenecks of size I j at time 0 and j = 1,2,..,m S technical bottlenecks of size S j at time t, can be achieved by analyzing the frequency of sequence reads according to where <S j > is the j:th sequential mean sample size taken at time t and <I j > is the j:th sequential mean sample size at t = 0 taken to estimate <f i (0)>. Equation (7) corresponds to the equation used in [8] to analyze data for m I = m S = 1 and is validated against simulations and experiments in section 3. Stochastic Simulations and Experimental Results. The typical experimental setup for RESTAMP used in this work is illustrated in Fig. 2 which highlights the technical bottleneck events where m I = m S = 2.

RESTAMP -Bottleneck sensitivity of bacterial division and death rate estimates
Tight technical bottlenecks can affect growth and death rate determination despite corrections. How much technical bottlenecks affect rate determinations and the impact on experimental design remain unanswered questions. The variance in the subpopulation proportions determines the magnitude of the founder population size (equation (3)), and the total variance is approximatively the sum of the variances due to a birth-death process (Var BD ) and a bottleneck event (Var B ). Hence, if Þ ( Var B f i ð Þ and the technical bottleneck event becomes dominating. As a numerical and artificial example, if Var B (f i ) is 100 and Var BD (f i ) is 1 then the total variance is 101. Suppose that the error in estimating the sample sizes in equation (7) has an error rate of 10% due to experimental noise. Thus, one might conceivably estimate the bottleneck correction term to be 90. The variance due to the birthdeath process is then estimated as (1 + 100)-90 = 11. This overestimates the variance by one order of magnitude and results in a one order of magnitude underestimation of the founder population size values and approximately a one order of magnitude overestimate in the rates according to equations (6ab). In another case, one might estimate the correction term to be 110 which would lead to a negative founder population size value. While negative founder population size values cannot be used to estimate rates, they are a strong indicator of the presence of very stringent technical bottlenecks. Considering the reverse situation where Var BD (f i ) is 100 and Var B (f i ) is 1, the impact of experimental noise due to technical bottlenecks is negligible. Thus, we define a bottleneck sensi- Þ. The variance in the subpopulation proportions due to a birthdeath process is given by equation (4) and is derived in 5.4 -The variance in the proportion of cells with respect to repetitions for a birth-death process. For a technical bottleneck event, modeled as a multinomial random sampling process, the variance is where the sample size is N B . The bottleneck sensitivity measure is therefore given by Equation (8) shows that the bottleneck sensitivity, s B (t), increases with smaller sample sizes (N B ) as expected. What might be less intuitive is the dependence of s B (t) on the total population size at t = 0. This results from the variance in the proportion of cells, due to a birth-death process, being smaller for larger population sizes (equation (4)). Importantly, the total population size at t = 0, N(0), is implicitly expressed as per unit volume, which means that s B (t) is also a quantity that depends on the volume. Since CFUs are typically reported as per ml, we define s B (t) to be in a volume of 1 ml meaning that the value for N(0) to be put into equation (8) is the CFU count per ml. We perform control experiments to find a threshold value for s B (t) below, which we expect to result in accurate rate estimates in section 3. Another feature of equation (8) is that sampling bottlenecks can become much more important to account for than the sequencing bottleneck. For example, consider a sampling bottleneck where 2x10 5 cells are sampled for determining the founder population size. After genome extraction, shearing, and PCR amplification, the sequence tags are sequenced on a chip with a capacity of the order of 2x10 7 sequence reads. Hence, the variance in the subpopulation proportions due to the sampling bottleneck are 100 times higher than for the sequencing bottleneck. Failure to correct for the sampling bottleneck as prescribed by equation (7) will result in underestimating the founder population size due to a birth-death process and consequently overestimating the division rate and death rate (equations (6ab)).
Equation (8) also shows that technical bottlenecks become more dominating for shorter times, or for stationary-like cells where the division rate and death rate are small in magnitude. Shorter times and smaller rates will therefore magnify technical bottleneck effects and could potentially lead to overestimating the rates. Importantly, the bottleneck sensitivity measure depends on experimentally controllable parameters. Using an estimate of the individual rates as b%r, d%0 for a growing population of cells or b%0, d%r for a dying population of cells we can plan the experiment so that the sensitivity is minimized and the accuracy of rate estimates maximized.

Results -stochastic simulations and experiments
We first test the theory developed in section 2 by comparing against stochastic tau-leaping simulations [24] for the case of cells dying on average with d = 0.03 min À1 , b = 0.01 min À1 (Fig. 3A, 3B and 3C) and for cells growing on average with d = 0.01 min À1 , b = 0.03 min À1 (Fig. 3D, 3E and 3F) for t = 120 min. The sequence tag distribution at t = 0 is geometric with the probability parameter set to 1/1000 for k = 1000 unique sequence tags, which results in an inoculum size, N(0), of approximately 10 6 cells. This is sufficient information to calculate the theoretical average founder population size (equation (5)) as a function of time, shown as red dotted lines in Fig. 3A and 3D. The corresponding stochastic tau-leaping simulations were ran for 100 iterations with the founder population size calculated using equation (1) for each iteration. The mean and the standard deviation for both the N B values (black dashed line) and the CFU values (black solid line) where subsequently calculated and plotted in Fig. 3A and 3D. Fig. 3A and 3D show an excellent agreement between the theoretical average founderpopulation size and the simulated average. Next, we use equations (6ab) to determine the division and death rate for both cases ( Fig. 3B and 3E). The estimated death rate (solid red line) and division rate (solid blue line) agree very well with the target rates (dashed lines), with the deviations being negligibly small.
Lastly, we tested equation (7) which prescribes how to correct for technical bottlenecks, including sampling from a larger volume and sequencing with a limited chip capacity. The population at t = 120 min and the inoculum, the population at t = 0, undergoes two random sampling events (m = 2) where I 1 = I 2 = S 2 = 10 6 and S 1 = 10 5 . The ideal N B value without any added technical bottlenecks, the estimated N B value without correction terms (equation (1)) and the estimated N B value with correction terms (equation (7)) are plotted in Fig. 3C and 3D. The results show excellent agreement between the theoretical mean N B value and the N B value as calculated using equation (7) with technical bottleneck correc-tions. However, we observe that the standard deviation relative to the ideal case without technical bottlenecks increase more when a sample is very small relative to the total population size. Notably, the standard deviation can even be larger than the mean founder population size value (Fig. 3F). It is therefore important to carefully design experiments aimed at estimating division rates and death rates to minimize this source of error. Figure Supplementary  Fig. 2 shows additional stochastic simulation results where the accuracy and precision in rate estimates are compared to the corresponding bottleneck sensitivities. The results show that the accuracy and precision for RESTAMP rate estimates increase as the magnitude of the bottleneck sensitivity decrease.
Next, we tested the RESTAMP method by devising a control experiment where we controled death rates (see 5.1 -Emulating a death process by sampling). We aimed to emulate a pure death process by sampling a volume Dv t = Dv 0 e -dt from a flask that contained E. coli MG1655 cells tagged with k = 1000 unique, 30 bp long sequence-tags that are fitness neutral as experimentally verified ( Figure Supplementary Fig. 3). The target division rate is 0 and the death rate and time points can be freely chosen. We set a high target death rate of d = 0.1 min À1 and a low target death rate of d = 0.015 min À1 . The time points were set to t={20,25,30,35,40} min. The sampled volume of the inoculum at t = 0, Dv 0 , was 1 ml and equation (7) was used to calculate the founder population size values for each time point with m I = m S = 1 bottleneck corrections due to the limited sequence chip capacity. The specific instantiation of equation (7) used to analyze the sequence reads for this experiment is therefore In addition, matching a death-process with a multinomial random sampling process requires scaling the N B values by the factor 1/(1-p) where p = e -dt (see 5.1 -Emulating a death process by sampling). The mean founder population size at each time point and the corresponding mean CFU were substituted into equations (6ab) to estimate the bacterial division rate and death rate. The mean CFU at t = 0 was determined by serial dilution. The CFU was determined as < N(0) > e -dt at time t. We also performed the same experiment using plasmid (pAM34-Plac) containing cells and the established plasmid segregation method to estimate rates (Materials and Methods -Plasmid Segregation) [12]. Fig. 4F illustrates a simple schematic of the experimental workflow. Fig. 4A shows the mean estimated rates and the standard error of the mean, at all time points, for the high death rate experiment. The corresponding data for the low death rate experiment is shown (Fig. 4B). The mean rate estimates (diamond markers) agree well with the target death rates and division rate of 0 for both methods. However, RESTAMP does have a slight propensity to overestimate the rates, particularly for the high death rate case. From the time-resolved RESTAMP rate estimates we see that much of the contribution to the noise is for the shorter time points (Fig. 4C), where we also observe larger fluctuations in the mean rate estimates. In contrast, the time-resolved RESTAMP rate estimates for the low death rate experiment are robust (Fig. 4D). We exploit this difference between the high death rate and the low death rate experiments to define a threshold for the sensitivity measure s B (t) (equation (8)), below which we expect rate estimates to be accurate and robust against making an error in estimating the technical bottleneck correction terms in equation (7). The sensitivity measure is plotted in Fig. 4E for the high death rate (black line) and the low death rate (magenta line) experiments, where the parameter N B was set to 2x10 7 (i.e. the sequencing bottleneck) in equation (8). The total population size at t = 0, N(0), were experimentally determined to be 5.8x10 5 CFU/ml for the high death rate experiment and 7.7x10 4 CFU/ml for the low death rate experiment. The graph shows that s B (t) is larger for the high death rate experiment, primarily a consequence of N(0) being larger by a factor 7.5. A larger N(0) means that the variance in the subpopulation proportions due to the death process is smaller. Hence, the variance due to technical bottlenecks becomes more dominating and the sensitivity in the rate estimates increases. The death rate estimates for the high death rate case are robust after 25 min, however we do notice large fluctuations for the bottleneck sensitivity value at t = 25 min and at t = 30 min. . Therefore, we set a threshold s B (t) = 0.17, corresponding to the bottleneck sensitivity value at   (8), this means that the variance in the frequency of subpopulation i due to random birth-death processes in a volume of 1 ml should be at least 5 times larger than the variance due to sampling. Fig. S4 shows the rate estimates and the bottleneck sensitivities for the individual replicates at all time points.
Next we perform an experiment for bacteria growing in complex media (LB) where we do not control the division and death rates. Fig. 5H illustrates a simple schematic of the experimental workflow. Fig. 5A shows the mean net growth rate and the standard error of the mean of three repetitions of the plasmid segregation experiment where the bacteria grow for a time t= {20,40,60,80} min. The mean net growth rate is stable at approximately 0.025 min À1 corresponding to a generation time of 28 min. Fig. 5B and 5C show the plasmid segregation division and death rate estimates, respectively. Here we see that the division rate is close to the net growth rate, meaning that the bacteria are not dying. The death rates are close to 0, although may be estimated as negative due to experimental noise. Hence, we expect that the RESTAMP rate estimates produce the same result with no death and a division rate close to the net growth rate. Fig. 5D shows the natural logarithm of the CFUs as estimated in the RESTAMP experiment at the time points t={20, 40, 60, 80} min. The slope of the regression line is an estimate of the net growth rate of 0.025 min À1 . Fig. 5E and 5F show the division and death rate estimates for each time point, respectively. Here we see that the division rate estimates are accurate where the mean division rate over all time points (black dashed line) correlate very close to the net growth rate (blue dashed line). Likewise, we see that the death rate estimates are close to 0, and can potentially be estimated as negative due to experimental noise as discussed in section 2.3 RESTAMP -Bottleneck sensitivity of bacterial division rate and death rate estimates. Fig. 5G shows the magnitude of the experimental bottleneck sensitivities for sequencing where the dashed red line correspond to the bottleneck sensitivity threshold, s B (t) = 0.17. Since the bottleneck sensitivities exceed the threshold value we expect that the rate estimates are more sensitive to making errors in estimating the bottleneck correction terms in equation (7). We confirm this by reanalyzing the data without bottleneck correction terms using equation (1) and comparing with the rate estimates with bottleneck correction terms ( Figure Supplementary Fig. 5). Important to note is that the RESTAMP rate estimates need to be closely integrated with the experimental protocol. In this experiment we have two technical bottleneck events (Fig. 2), the first from sampling 200 ll and the second from sequencing (Fig. 5H). Therefore, m I = m S = 2 and the specific instantiation of the equation used to calculate N B values (equation (7)) is

Discussion
Powerful methods have been devised to investigate the detailed population dynamics of pathogens in order to gain insight into the disease causing mechanisms and establish guiding principles and strategies for disease prevention. Typically, these methods are based on tracking and identifying a marker that loses signal strength as the cells divide or die. Plasmid segregation (PS) have been shown to be a very capable method which uses conditionally non-replicative plasmids as the marker [12]. Since the proportion of cells that contain a plasmid decrease exponentially with time (Materials and Methods -Plasmid Segregation), the PS method and other marker-based techniques work best during a short time window. It can also be experimentally challenging to ensure that the accessory genes or fluorescent markers of the plasmids do not change the wild-type division and death rates, especially in the context of within-host infection models. Methods such as WITS can overcome some of the limitations of the plasmid segregation method [1]. The WITS method infer migration rates, division rates and death rates based on changes in the composition of tags in the total population. The tradeoff is that WITS is more sensitive to technical bottlenecks, e.g. sampling and sequencing, as these influence the variation in the genetic composition of the total population [17]. However, it should be noted that previous work have taken steps towards detecting technical bottlenecks in [25,26] which is summarized in [27]. In addition, the mathematical analysis of WITS data relies on the assumption that the tags are initially evenly distributed [1][2][3]. Consequently, these studies are typically constrained to using~10 tags, which limits the accuracy in the rate estimates.
In this work, we develop the RESTAMP method (Rate Estimates by Sequence Tag Analysis of Microbial Populations) that takes into account, and corrects for, the impact of technical bottlenecks. The mathematical framework for RESTAMP is constructed on top of the experimental STAMP method [8] and provides a simple way to aggregate information about many sequence tags into a single measure; i.e. the founder population size. Hence, our method can handle any number of DNA sequence tags, limited only by the sequence chip capacity. Furthermore, we show that the average founder-population size is independent of the initial tag distribution. This simplifies the process of estimating rates for the experimentalist, as the sample does not require an exact composition of tags.
The independence of the mean founder population size on the sequence tag distribution at t = 0 (equation (5)) relies on a fixed initial tag distribution, f i (0). Ideally, this means that the samples taken to determine mean founder population size values exactly reflect the sequence tag distribution of the population they were sampled from, regardless of the initial distribution. However, it is experimentally challenging to completely remove the influence of sampling on tag distributions. Thus, to minimize the effect of sampling on the accuracy of the rate estimates, multiple samples are taken from the same culture. Ultimately, we find that the impact of this variation is minimal since we get reasonable rate estimates as validated by control experiments and by comparison with the PS method (Figs. 4 and 5).
The successful application of the bottleneck corrections is contingent on an accurate measurement of the CFUs used to determine the sample sizes. Severely underestimating the sample size can result in calculating a negative founder population size value. Even with an accurate CFU estimate, there is also a chance to produce negative founder-population size values in the presence of stringent technical bottlenecks. This is due to the noise from random division and death events being overwhelmed by the magnitude of stochastic variation induced by technical bottlenecks (see 2.3 RESTAMP -Bottleneck sensitivity of bacterial division rate and death rate estimates). We used this to define a bottleneck sensitivity measure, s B (t), in the rate estimates as the ratio between the variance in the subpopulation proportions due to a birth-death process and the variance due to a multinomial random sampling event used to model a technical bottleneck. By devising a control experiment, we a priori set the target death rate by emulating a death processing by sampling (see 5.1 -Emulating a death process by sampling) and find a threshold for the sensitivity measure, below which we expect the rate estimates to be robust and accurate (Fig. 4). For the experimental results illustrated in Fig. 5 we find that the bottleneck sensitivity is larger than the sensitivity threshold. We therefore expect that the rate estimates are more sensitive to making an error in the bottleneck correction terms in equation (7). We confirm this by comparing the rate estimates with and without bottleneck correction terms in Figure Supplementary Fig. 5.
The sensitivity measure predicts an increasingly accurate rate estimate for longer observation times. In control experiments, we find that the relative fluctuations (the standard error of the mean relative to the mean) are 16-23% for the high death rate case (d = 0.1 min À1 ) and 18-21% for the low death rate case (d = 0.015 min À1 ) (Fig. 4). We find that the magnitude of the relative fluctuations are more stable at 18-21% at each time point as expected based on the shallow slope of the bottleneck sensitivity measure versus time (Fig. 4E -magenta line). By comparison with the PS method, RESTAMP mean rate estimates are slightly closer to the target division and death rates. In terms of the uncertainty in the rate estimates, RESTAMP performs equally well as the plasmid segregation method. However, the advantages of RESTAMP comes into the forefront when it is experimentally challenging to assess whether the plasmids are fitness-neutral or where long observation times ( Figure Supplementary Fig. 1) are needed; e.g. in studying within-host population dynamics.
In Figure Supplementary Fig. 1 we investigate the precision and accuracy of rate estimates as a function of time up until 24 h for both RESTAMP and PS by performing stochastic tau-leaping simulations. If the cells are growing, on average ( Figure Supplementary  Fig. 1A) we observe a practically unlimited observation time with RESTAMP while PS is limited to~17 h. The reason is that the plasmid containing cells are unaffected by division events and are lost on average at a rate proportional to e Àdt . However, the precision and the accuracy of the rate estimates using the PS method are very robust during the time window where PS works. This remains true for the PS method also in the case when cells are dying (Figure Supplementary Fig. 1B), where on average the observation time is reduced to~5 h as the death rate for the cells is increased. The RESTAMP observation time also becomes limiting for this case where we observe a deterioration in the precision of the rate estimates beginning at~5-6 h, a consequence of lost sequence tags. To calculate the fraction of unique sequence tags that survives until time t in a birth-death process, we adopted the mathematical framework for a transposon insertion sequencing experiment [21]. The variables are reinterpreted in the context of RESTAMP where the number of transposon insertion sites correspond to the number of barcodes (k = 1000). The fitness coefficients (w i ) are set to 1 and the equation for the extinction probability for a birthdeath process was used (equation (6) in [21]). The extinction probabilities were subsequently used to calculate the mean reduction in library complexity (equation (8) in [21]), interpreted as the fraction of unique sequence tags that survives until time t. Upon multiplying the fraction of sequence tags by k we get the number of unique sequence that survives until time t. This is plotted in Figure Supplementary Fig. 1D for both the case of cells growing on average and for cells dying on average. For the former case we do not observe any loss of tags while for the latter the number of tags decrease as a function of time where there is less than 400 tags at 5-6 h. In principle, one could increase the initial population size to drive the observation times to be longer for both methods. For example, in Figure Supplementary Fig. 1C the inoculum size was increased by a factor of 100 to N(0) = 10 8 for the case of cells dying on average.
In summary, it might be advantageous to increase the number of unique sequence tags prior to executing a RESTAMP experiment to buffer against loss of accuracy depending on whether severe death events are expected. To aid in this decision, this type of analysis could be used to estimate the expected number of extinct sequence tags, e.g. in within-host infection models, from the CFU time-course data by setting r%d for cells dying on average.
The RESTAMP method abstracts away the details of the underlying stochastic dynamics that drive birth-death processes and simplifies the analysis of an arbitrary number of sequence tags by providing an explicit equation relating the division and death rates with the average N B and CFU values (equations (6ab)). It was previously shown that the accuracy of N B estimation can be improved by increasing the number of sequence tags [8]. Hence, there is potential for the accuracy of the rate estimates to improve with respect to the founder population size. However, there is a limit to this improvement due to other unavoidable sources of experimental errors. For example, the PCR amplification step might not be unbiased and error in CFU determination is heavily influenced by the experimentalist's consistency in methodology (e.g. sample dilution, time to plate samples, counting). We minimized potential technical bottlenecks due to PCR by minimizing the number of amplification cycles and performing the amplification in triplicate and pooling the results. Our experimental results suggest that the best way to increase the accuracy is to take the average of multiple rate estimates as the average rates accurately approximate the target rates (Fig. 4AB).
The quantitative analysis of WITS data typically relies on expressing the dynamics in the form of a master equation, i.e. an equation for the probability of having a certain number of cells at a particular time point, whereby the rates are inferred using maximum likelihood estimates [1][2][3]. In contrast, RESTAMP abstracts away the details with respect to the master equation that drives the stochastic population dynamics. It is centered on analyzing the frequency of sequence tags by defining a particular function of the frequencies, e.g. equation (1) for the founder population size, such that it becomes relatively simple to tie it to experiments and correct for unavoidable technical bottlenecks.
A recent alternative approach employed the moment-closure method where a system of ordinary differential equations for the mean, variance, and covariance in the subpopulation sizes are solved for, and the rates inferred by adopting an appropriate divergence measure between the WITS data and the generated moments [7]. Likewise, the RESTAMP approach for inferring rates also depends on higher order moments, namely the variance in the proportion of subpopulations (equation (3)). Theoretically, the calculation of the average founder population size can be integrated with the moment-closure framework which would allow RESTAMP to also estimate migration rates in addition to division and death rates. By expressing the mean founder population size in terms of the first and second moments, we can expand to a multi-compartment model with arbitrary topology and location dependent division, death, and migration rates. This approach would allow RESTAMP to determine rates for bacteria that are not strictly growing exponentially, e.g. logistic growth, in multiple compartments and is best studied in the context of a within-host infection model.
Nevertheless, the RESTAMP framework as is can be used to provide a low-resolution picture of in vivo dynamics by treating the animal model as a single compartment. The first step in using RESTAMP to plan an in vivo or an in vitro experiment is to measure CFUs over time, i.e. a growth curve. The growth conditions for this experiment must be the same as the growth conditions that will later be used to determine division and death rates. It is not necessarily required to use the barcoded library of cells, as long as the used strain and the final tagged library have the same division and death rates, e.g. when the used tags are fitness neutral and the CFUs over times are measured with the untagged parental strain. Our model is restricted to exponential growth or decay. The CFU over time data can be used to check if the growth conditions fulfill this requirement and given a certain inoculum size, the time interval of exponential growth or decay can be determined. In this interval, the net growth rate r can be estimated. Next, it is necessary to identify whether the environment is particularly hostile to the cells. If there is a sharp decrease in the CFUs over time, then it might be necessary to increase the number of identifiable subpopulations (k) to buffer against barcode extinction events. The number of identifiable subpopulations should be at minimum k = 400-500 [8]. Whether this is a necessity can be checked by setting r%d and using the rationale as discussed above to plot a graph corresponding to Figure Supplementary Fig. 1D and read out the number of tagged subpopulations that survives until time t. The next step is to identify the number of technical bottlenecks in the experiment that all samples are subjected to, i.e. at time 0 (m I ) and at time t (m s ) (Fig. 2). In principle, all technical bottlenecks can be accounted for. However, in this work we only consider the two (m s = m I = 2) major technical bottlenecks, which typically are the harvesting of the cells from the growth environment and the limited sequencing depth of next-generation sequencing. This will normally remain true in an in vivo experiment, where all the cells are harvested from the organ of interest and then sampled and loaded onto a sequencing chip. The next step is to estimate the sample sizes for each technical bottleneck, e.g. the number of harvested cells (S 1 , I 1 ) or the number of generated sequences (S 2 , I 2 ). These are required to calculate the bottleneck sensitivity, s B (t). Within the constraints of the experimental setup, these can be freely chosen. At this point the experimenter only needs an estimate of the division rate b and the death rate d to calculate s B (t) for each sample by using equation (8). The number of cells per ml at the beginning of the experiment is known (N(0)), the net growth rate is known (r), the sampling time is known (t) and the sample size is known (N B ). For a first estimate the (b + d) term in equation (8), one can set r%d for a dying population of cells or r%b for a growing population of cells. A better approach would be to calculate s B (t) over a range of biologically plausible division and death rates. For example, if the minimum expected time until the population doubles in size is~15 min then the maximum division rate is ln(2)/15 min À1~0 .046 min À1 . Likewise, if the minimum expected time until the population halves in size is 5 min then the maximum death rate is ln(2)/5 min À1~0 .14 min À1 . If this calculation shows an s B (t) value that exceeds the threshold of s B (t) = 0.17 (see justification above, Fig. 4E) then the experimenter can change the experimental setup, i.e. lower the initial concentration (N(0) per ml), sample more cells (increase N B ) or choose to sample at later time points. Lastly, given that the experiment has been designed such that the s B (t) values do not exceed the sensitivity threshold, all that remains is to determine the N B (t) values using equation (7). Note that the CFU estimates are experimentally decoupled from N B (t) estimates. In our setup, the CFUs were determined from 1 ml of culture, while only the DNA extracted from 200 ml of culture was sequenced. Therefore, it is important to scale both measures to the same volume (see section 2.1). Finally, the microbial division and death rates are estimated by parameterizing equations 6ab with the CFU and the scaled N B (t) values.
To provide the reader with a guideline for the typical m I = m s = 2 experimental system (Fig. 2) we will assume a minimum number of sequences of 10 6 = I 2 = S 2 per sample and the sample size for the inoculum to be 10 6 = I 1 . We choose a sampling time point t = 60 min and an initial concentration N(0) = 10 5 cells per ml. The division rate and death rate are varied between [0, 0.046] min À1 and [0, 0.14] min À1 in 0.001 min À1 increments. The calculations of s B (t) using equation (8) shows that this system does not exceed the bottleneck sensitivity threshold for nearly the whole range for the division and death rate when at minimum 10 6 = S 1 cells are sampled at t = 60 min. However, the bottleneck sensitivity does tend to sharply increase when the division rate and death rate approach 0 i.e. when b and d are less than 0.005 min À1 . Therefore, extra care should be taken in estimating the division and death rate for cells growing slowly, where the CFUs do not change appreciably during the time of observation.
One of the assumptions that underlies the RESTAMP model is that all cells divide and die at an equal rate (section 2). This is a limitation, which prevents using RESTAMP to study e.g. experiments with high selection pressure over extended periods of time, where mutants with altered fitness could accumulate or phenotypic heterogeneity, where genetically identical cells can manifest different phenotypes in a constant environment [28]. A striking example of this would be persister cells, where a fraction of cells in a genetically identical population survives longer in the presence of antibiotics [29]. In Figure Supplementary Fig. 6, we test the performance of the RESTAMP method in the presence of variation in the division rate and the death rate. The rates were independently drawn from a normal distribution where the mean death rate is 0.03 min À1 and the mean division rate is 0.01 min À1 . We also consider the case where the rates are interchanged, i.e. the mean division rate is 0.03 min À1 and the mean death rate is 0.01 min À1 . Figure Supplementary Fig. 6C shows the rate estimates for the former case while Figure Supplementary Fig. 6D shows the rate estimates for the latter case as functions of the standard deviation for the normal distribution, from which the rates were drawn. The results suggest that the estimated rates correspond to the mean rates for standard deviations smaller than 10 -3 min À1 , i.e. when the standard deviation relative to the mean is less than 10%. For wider distributions, when the standard deviation is 10 -2 min À1 , the rate estimates drift towards very large values. Accounting for this requires extending the RESTAMP theory for example by considering the division rate and death rate to be random variables. This potential extension of RESTAMP is best studied in the context of environments that induce phenotypic heterogeneity.
In this work, RESTAMP was tested and validated against experiments using E. coli. While the experimental (RE)STAMP protocol has been optimized and calibrated for bacteria, there is in principle no reason why it cannot be extended to other organisms. Likewise, the mathematical framework developed in this work does not make any organism-specific assumptions. We therefore believe that our approach can be useful to study the population dynamics of other pathogens, such as viruses.

Emulating a death process by sampling
Here we seek to emulate a pure death process (Fig. 1A) by random sampling events (Fig. 1B) with the aim of devising an experiment where the death rate can be controlled. This control experiment can then be used to test how well the death rate, with the target division rate being 0, can be estimated using equations (6ab). The total average number of cells at time t in a pure death process with death rate d is given by < N t ð Þ >¼ N 0 ð Þe Àdt . This is emulated by sampling a small volume Dv t = Dv 0 e -dt from a large volume. Since the rate estimates require not only the mean number of cells but also the mean founder population size at time t, one needs to ensure that <N B (t)> is correctly matched as well. Analogously, this translates to a requirement of matching the mean subpopulation size and the variance in the subpopulation size with the death process and the random sampling process. This is due to equation (3) that states the magnitude of the founder population size is inversely proportional to the variance in the subpopulation proportions. To proceed, we model the random sampling event as a multinomial random sampling process for which the probability of sampling s cells with a sequence tag insertion at site i is binomial.
The binomial distribution applies for n i (t) cells surviving until time t in a death process as well where and p = e -dt is the probability of a single cell surviving until time t. The mean number of cells in the random sample is < n i >¼< N t ð Þ > f i 0 ð Þ ¼ n i 0 ð Þe Àdt ¼ n i 0 ð Þp which agrees with the mean number of cells for a death process. However, the variance in the number of cells with a sequence tag at site i in the sample Dv t is Var n Þwhich is not equal to the variance in the death process where Var n i t ð Þ ð Þ¼n i 0 ð Þp 1 À p ð Þ . Hence, for the random sampling experiment to match a death process, Var (n i ) need to be scaled with the factor (1-p)/(1-f i (0)). From equation (3), this translates to scaling the mean inverse founder population size as In our experiments k is 1000 and the factor (1-f i (0)) in the correction term is negligible and can be approximated as 1 with a negligible effect on the founder population size values. Thus, the experimentally determined mean founder population size as estimated in the random sampling experiment is simply scaled by the factor (1-p) -1 .

Plasmid segregation
In this work, we adopt a simple mathematical framework for describing the dilution of an identifiable marker within cells that was recently used to quantify the dilution of self-aggregating fluorescent proteins [14]. Within this framework, the fraction of cells containing a plasmid at time t, <F(t)>, is given by < F(t)>=F(0)e -bt where b is the division rate and r = b-d is the net growth rate. Solving for the rates we get Where <N(t)> is the average total population size at time t and is experimentally estimated as the number of colony forming units.

Protocol for removing spurious sequence reads
The development of next generation sequencing technologies have revolutionized the quantification and analysis of the structures of microbial communities [9]. In particular, Illumina's MiSeq platform [30] was successfully used in establishing the STAMP method for qualitatively investigating the population dynamics of cells [8]. The workflow of STAMP includes clustering and tallying individual sequence reads, the purpose of which is to remove spurious sequences that typically arise in using next generation sequencing technologies [31]. The aim of this section is to update the clustering step in the STAMP workflow [8] with a simple model for the expected number of extraneous spurious sequences. The consequence is that the sequence identity threshold in clustering is replaced with a query for the expected number of distinguishable subpopulations, i.e. the expected number of unique barcodes. The overarching strategy will be to estimate the expected number of the extraneous sequences that come about due to sequencing errors assuming that the error of a misread is equal and independent of basepair position. The additional sequences that remain after clustering k unique sequence tags including the ones that arise due to sequencing errors are then designated as spurious and removed from the analysis. All the variables used for modeling sequencing errors and their meaning are summarized in Table 2.
Using STAMP, we generate on the order of 10 7 51 base pairs long sequence reads, where the first 30 base pairs are random and the last 21 base pairs are constant, and integrate them in a neutral position of the genome of our bacterial model. We denote the random sequence as the random barcode and the fixed sequence as the strain barcode and denote the lengths of these sequences as N R and N S , respectively. Artificial genetic variation is introduced in our bacterial model by using the random barcodes and is subsequently exploited to investigate the population dynamics of cells. The purpose of the strain barcode is to validate that the sequence reads with a random barcode are sequenced as opposed to sequencing random DNA snippets. In addition, the strain barcode allows for multiplexing different strains, which can be used to study interactions between populations. After filtering, the sequence reads with respect to the strain barcode and clustering the 100% matching sequences, our experiments result in 10 4 -10 5 sequences. However, we expect k unique random barcodes where k is 1000 in this study. The additional sequences are due to a convolution of different effects such as sequencing errors, PCR errors or pooling of multiple barcodes [31]. The task at hand is to determine how many of the extraneous sequences arise due to sequencing errors and to designate the rest of the sequences as spurious which are then removed from the analysis. We use a simple model that assumes that the probability of reading an incorrect nucleotide at any position in the sequence is equal for all positions and independent of the position. Hence, the expected number of sequences after filtering with respect to the strain barcode (n F ) is where p is the probability of a correct nucleotide at any position and N is the total number of sequence reads. We use equation (15) to estimate p as Given p, the probability of m incorrect reads in the barcode region is binomially distributed where q = 1-p is the probability of an incorrect nucleotide at any position. After sorting the sequences with respect to abundance we remove all sequences beyond k with average number of m mismatches for which n $ m < 1 as spurious sequences. The average is used because we compare every extraneous sequence with all the k barcodes and thus we get a distribution of mismatches. This leaves k unique barcode sequences plus a mix of spurious-and sequencing error sequences. The last step in the algorithm is to pick out, top to bottom, n

The variance in the proportion of cells with respect to repetitions for a birth-death process
Here we use the error propagation approximation to derive the variance in the proportion of subpopulation i, Var(f i (t)), for a stochastic birth-death process. Let n i (t) be the number of cells for subpopulation i at time t and let N(t) be the total number of cells at time t. Thus, f i (t) = n i (t)/N(t). For notational simplicity we substitute x = n i (t), y = N(t) and f i (t) = g(x,y) = x/y. Using the error propagation method, the variance of a ratio of random variables is where m x =<n i (t)>, m y =<N(t) > and the angular brackets <> denote an average over repetitions. For independent subpopulations the covariance term reduces to Var(x). Cov x; y ð Þ ¼Cov n i t ð Þ; N t ð Þ ð Þ¼Cov n i t ð Þ; n i t ð Þ ð Þ¼Var Typically on the order of 10 7 . n F Total number of sequences after filtering for the strain barcode.
Empirically it is often between 10 4-5 . p Probability of a correct nucleotide at any position in the sequence.
Typically, p is between 0.98 and 0.99 in our experiments. q Probability of an incorrect nucleotide at any position in the sequence.

P(m)
Probability of m mismatches in the random barcode region. For the birth-death process with division rate b, death rate d and net growth rate r = b-d we have the well-known relations for the means and the variances for the i:th subpopulation of cells and the total population.
Substituting equations (19)(20) in (18) the variance in the proportion of cells is 5.5. Correcting for technical bottlenecks by the iterative application of the law of total expectation and the law of total variance.
In this section we consider a typical RESTAMP experiment, which involves technical bottlenecks such as sampling a small volume from a larger volume and sequencing due to the limited sequence chip capacity, in addition to a stochastic birth-death process. The variance of the founder population size will therefore include contributions due to these technical bottlenecks. An additional source of variation comes from estimating the initial tag frequencies, f i (0), by experimental sampling. The task at hand is to separate these contributions from the birth-death process by propagating the added variance in the frequencies of sequence tags by iteratively applying the law of total expectation [23] and the law of total variance [23].
For simplicity and notational clarity, we consider a single technical bottleneck event. Using equation (1) to estimate the founder population size for cells having undergone a birth-death process and an additional downstream bottleneck event, we determine a founder population size that is a function of the total variance in the proportions, Var(f i,1 ) (equation (3)). See Fig. 2A for an illustration of the experimental setup. Using the law of total variance [23] Var(f i,1 ) can be decomposed as where <Var(f i,1 |f i (t))> is the mean contribution to the total variance due to the bottleneck event. Here, f i,1 is the frequency of sequence tag i after sampling the population of cells having undergone a random birth-death process for a time t and f i (t) is the frequency of sequence tag i at the end of the birth-death process at time t ( Fig. 2A). By modeling the bottleneck as a multinomial random sampling process we have <f i,1 |f i (t)>=f i (t), i.e. sampling does not change the frequency of sequence tag i on average. Hence, Var(<f i,1 |f i (t) > ) = Var(f i (t)) which is the total variance due to a birth-death process. Applying the law of total variance on Var(f i (t)) we get where <Var(f i (t)|f i (0))> is the mean variance in the sequence tag frequency due to birth-death process given an initial subpopulation proportion f i (0). Substituting equation (23) in equation (22) and using < f i (t)|f i (0)>%f i (0), the total variance in the subpopulation proportions with a sequence tag i becomes Substituting equation (24) in equation (3), with f i (0) substituted for <f i (0)> as discussed in section 2.2, we get where < N BD B t ð Þ À1 > is the contribution to the founder population size due to a birth-death process, < N B B t ð Þ À1 > is the contribution due to the bottleneck (a multinomial random sampling event) and < N 0 B t ð Þ À1 > is the contribution due to sampling the inoculum at t = 0. In the context of sampling, the founder population size is equivalent to the sample size and we used equation (3) to equate the left hand side of equation (25) with the right hand side. We simplify the notation where N B B = S 1 and N B 0 = I 1 and equation (25) becomes The founder population size due to a birth-death process, having accounted for a single bottleneck event, can therefore be estimated as This result is generalized for j = 1, 2, 3. . .,m S bottleneck events for the sample taken at time t and j = 1, 2, 3. . .,m I bottleneck events for the reference sample at t = 0 by iteratively applying the law of total variance, which leads to subtracting the sum of the average inverse sample sizes.

Plasmid Segregation: Experimental
E. coli MG1655 pAM34-pLAC was recovered from frozen stocks on LB agar with carbenicillin 50 mg/ml and IPTG 1 mM. From a single colony 5 ml LB broth carbenicillin 50 mg/ml and IPTG 1 mM was seeded and grown for 16 h. Cultures were then pelleted and resuspended twice in 5 ml PBS (Sigma, Cat. #P4417) to remove residual IPTG. The culture was diluted 1:1600 into pre-warmed 37°C LB broth and grown to 1.0 O.D. 600 nm to obtain~20-30% pAM34-pLac positive cells. Culture was diluted to target starting concentration of 2.4x10 5 CFU/ml in 25 ml of pre-warmed 37°C LB broth (~1:1470 dilution). For wild-type plasmid-loss controls, every 20 min, 600 ml sample was removed from the culture, serial diluted in PBS, and 100 ml of each dilution plated in triplicate for colony counts on LB agar carbenicillin 50 mg/ml and IPTG 1 mM and LB agar IPTG 1 mM. To simulate death, the starting culture was grown as above to a concentration of 2.4x10 5 CFU/ml culture in 400 ml LB broth and then grown to an O.D. 600 0.5. For a target death rate of 0.015 min À1 , the culture was then serial diluted 4 times at 74 ml, 93 ml, 93 ml, and 93 ml to a total volume of 100 ml to represent targeted decreases in CFU at 20, 25, 30, and 35 min respectively. For a target death rate of 0.1 min À1 , the sample was diluted 13.5 ml into 100 ml 4 fold for 20, 40, 60, and 80 mins. Immediately following dilutions, 100 ml were serial diluted and plated in triplicate on LB agar carbenicillin 50 mg/ml and IPTG 1 mM and LB agar IPTG 1 mM. Percent-positive pAM34-pLac colonies for were determined by the ratio of CFUs on carbenicillin containing agar to no antibiotics. All experiments were performed in triplicate.

RESTAMP
Frozen 1 ml aliquots of E. coli MG1655 (SoA2898) were recovered in 300 ml of pre-warmed 37°C LB broth and grown at 37°C to 0.3 O.D. 600 nm . Cells were then diluted in 25 ml of pre-warmed 37°C LB broth to a concentration of 2.4x10 5 CFU/ml (~1:133 dilution). For measuring wild-type death rates, every 20 min 600 ml sample was removed from the culture, and serial diluted in PBS. 100 ml of each dilution was plated in triplicate for colony counts on LB agar chloramphenicol 5 mg/ml. Harvest plates for RESTAMP were prepared by plating 200 ml undiluted sample on to LB agar chloramphenicol 5 mg/ml and growing for 16 h at 37°C. To simulate death rates, 1 ml frozen aliquots of E. coli MG1655 (SoA2898) were recovered as described above and diluted to a calculated O.D. 600 nm of 0.001 (~1x10 6 CFU/ml). Sample volumes were then taken from the solution, with decreases in volumes representing increased time, and diluted to a volume of 1 ml in LB and allowed to grow to an O.D. 600 nm of 0.1 (~1 h). Short log-phase growth does not alter tag frequencies. The samples were the pelleted, and DNA extracted for N B determination as described later. For the time points of 0, 20, 25, 30, 35, and 40 mins the volumes taken for the target death rate of 0.015 min À1 were 100, 74.1, 68.7, 63.8, 59.1, and 54.9 ml respectively. For the target death rate of 0.1 min À1 , the sample volumes were 1000, 135.3, 82.1, 49.8, 30.2, and 18.3 ml. All experiments were performed in triplicate.

RESTAMP: Sample processing
Plates for RESTAMP analysis were harvested by placing 5 ml PBS on top of the plate and scraping. The O.D. 600 of a 1:10 dilution of the sample was measured and the dilution factor calculated for an O.D. 600 of 1.0. Using the calculated dilution factor the original solution was diluted and 1 ml pelleted. Genomic DNA extraction was performed on the pellet by adding 600 ml 2% sodium dodecyl sulfate (w/v) 0.5 M Ethylenediaminetetraacetic acid (aq) pH 8.0 lysis buffer for 5 min at 80°C then 3 ml RNase A solution (Sigma, Cat. #R6148) was added for 30 mins at 37°C. Cell debris was precipitated with 200 ml 7.5 M ammonium acetate (aq) then centrifuged. DNA was precipitated from the supernatant with 800 ml of isopropanol then washed with 70% ethanol (aq) and suspended in 100 ml molecular grade water.
Illumina Miseq sequencing samples were generated using PCR with primers targeting the barcode flanking sequences with custom indexes and sequencing primer overhangs and manufacturer's recommended P5 P7 regions (S1 Table). PCR was performed with OneTaq 2x Master Mix (NEB, Cat. #M0482) spiked with 1 U of Phusion High-Fidelity DNA Polymerase (NEB, Cat. #M0530). Three 50 ml PCR reactions were performed per sample with 20 cycle reaction to minimize replication bias then combined and purified using QIAquick PCR Purification Kit (Qiagen, Cat. # 28104) per manufacturer's protocol. PCR products were confirmed by gel electrophoresis and concentration determined by Nanodrop (ThermoFisher, ND-1000). Samples were combined for a concentration of 10 ng/ ml of each sample. The final concentration of the sample was measured by Qbit (ThermoFisher, Cat. #Q32854) and diluted to 8 nM. Sequencing was performed on Illumina MiSeq System TruSeq HT assay per manufacturer's protocol using MiSeq Reagents Kit v2 50 cycles (Illumina, Cat. #MS-102-2001) with custom sequencing primers (S1 Table).

Code
Codes for all figures were implemented in MATLAB (R2017b, The MathWorks, Natick, MA, USA). All code for reproducing the results and the raw sequencing data is available on SourceForge: https://sourceforge.net/projects/restamp/. All scripts are free software and are free to redistribute and/or modify under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or any later version.
To make the method in this work accessible we implemented an analysis pipeline in Matlab R2017b that takes next-generation sequencing files and produce founder population size values via a graphical user interface. The analysis pipeline is an extension of [8] and has been updated with a protocol for removing extraneous spurious sequences that typically arise in next-generation sequencing technologies [31] (see 5.3 -Protocol for removing spurious sequence reads). This software was used to analyze the nextgeneration sequencing data and produce founder population size values, which were subsequently used to calculate the rates using equations (6ab). The software is freely available for download on

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.