Repertoire-wide phylogenetic models of B cell molecular evolution reveal evolutionary signatures of aging and vaccination

Significance High-affinity antibodies that protect us from infection are produced by B cells through an evolutionary process of mutation and selection during adaptive immune responses. B cell repertoire sequencing combined with phylogenetic methods has provided unprecedented potential to study B cells as an evolving population. However, phylogenetic models operate on individual lineages rather than the thousands of lineages often found in B cell repertoires. Here, we develop an evolutionary framework that incorporates B cell-specific features and combines information across lineages to characterize mutation and selection dynamics of entire repertoires. We use this technique to demonstrate evidence of age-associated changes in somatic hypermutation targeting and uncover a general trend within our datasets toward negative selection over the course of affinity maturation.


HLP19 model definition
We begin with the HLP17 substitution model (1), in which instantaneous rates of codon substitution are parameterized by the nonsynonymous/synonymous mutation rate ratio (w), transition/transversion mutation rate ratio (k), a vector of codon frequencies (p), and a vector of modified substitution rates h = (h WRC , h GYW … , h (a) ) where a is an SHM hot-or cold-spot motif, such as WRC (2); W=A/T, R=A/G). We modify the HLP17 model by removing codon frequencies from the Q matrix, and parameterize the instantaneous rate of change from codon i to j, where i ≠ j, as: Diagonal elements of the Q matrix (i.e. i = j) are equal to the negative sum of the off-diagonal elements of the same row. More explicitly, for each codon i: Each element in the Q matrix is then divided by , such that for the resulting calibrated Q matrix, = 1. This calibration makes phylogenetic branch lengths interpretable as the expected number of substitutions per site, assuming codon frequencies have converged to a stationary distribution. Codon frequencies (π) are an important part of this calculation, and if they do not accurately reflect the composition of the sequence at a particular time point this may lead to inaccurate branch length estimates. Typically, codon frequencies are assumed to have reached their stationary distribution, and π is set to the equilibrium codon frequency distribution. This assumption is made in virtually all substitution models, including HLP17 (1), which estimates state frequencies using maximum likelihood.
Calibrating Q matrices using equilibrium frequencies is likely a poor choice for B cell lineages because B cells are known to begin out of sequence equilibrium and change over time (4).
Because the germline sequence of B cell lineages is predicted or specified before phylogenetic estimation, it is possible to use either the empirical codon frequencies of the germline sequence (π (g) ) or the empirical codon frequencies of the tip sequences (π (t) ) to calibrate the Q matrix.
However, both of these reflect codon frequencies at only the beginning and end of the affinity maturation evolutionary process. Because codon frequencies are modelled as changing from π (g) to π (t) using the substitution process described in Q, we can use π (g) and Q to predict the codon frequency change over a defined interval. We use this to calibrate our Q matrix using predicted codon frequencies at the midpoint of the tree (π (m) ). To predict these values, we first calibrate the Q matrix using π (g) to give Q (g) and calculate the transition probability matrix (S) = exp ( (X) ), where m is half the mean root-tip divergence of all sequences in the initial tree -which we estimate using the GY94 (5, 6) model on the same data. From these pieces of information, we calculate π (m) using: after interval m given the starting frequencies π (g) . This gives the predicted frequency of codons over an interval equal to the average midpoint between the root and each tip. Because this procedure does not use codon frequencies from the tips, the accuracy of the predicted midpoint frequencies depends on how well the substitution process is described by the Q matrix.
In addition to being used in the calibration of branch lengths, codon frequencies are also used to calculate "# (;) values. To solve for the latter, we use π (g) values to create the (;) matrices used in constructing Q (g) . Ideally, we would then use the estimated π (m) values to construct the matrices (;) used in Q (m) . However, doing so would require recalculating all "# (;) values each time any parameter is updated. This is computationally expensive, requiring 61 4 computations rather than 61 2 computations for simply updating the Q matrix. We therefore calculate each element of "# (;) once using the values of π (g) and keep the (;) matrices constant throughout remaining parameter estimation. This dramatically improves runtime. In this way, Q (m) depends on π (m) through calibration and on π (g) through (;) matrix values.
Further, in many of our analyses, we partition our sequences into complementarity determining regions (CDRs) and framework regions (FWRs), and estimate separate values of w for each. In these cases, π (g) and π (m) estimation is the same as described above but performed on the separate partitions independently: each tree has two Q matrices, which are identical except that Z[\ (S) uses w CDR and is calibrated using codon frequencies in the CDRs, while ]^\ (S) uses w FWR and is calibrated using codon frequencies in the FWRs. Likelihood calculations involving ambiguous codons in the root sequence are averaged over all compatible codons, weighted by their frequencies in either CDRs or FWRs obtained from an empirical dataset of four healthy subjects (7).
Overall, compared to HLP17, the updated model defined above, which we refer to as HLP19, uses less than half the number of free parameters and models the non-equilibrium nature of affinity maturation in a more interpretable manner. It is also more formally similar to previous nonreversible phylogenetic model forms (8,9).

Repertoire dataset processing
The "Age" dataset consists of samples taken from 27 healthy individuals in two consecutive years (10). Subjects varied in age from 20 to 81 years old, and both male and female subjects were included. Sequencing was performed in three replicates: two using genomic DNA and one using mRNA. Raw bulk-sequencing reads using the 454 (Roche) platform from PBMCs as described in (10) (14). The clonal distance threshold was determined by manual inspection to identify the local minima between the two modes of the bimodal distance-to-nearest histogram produced using SHazaM v0.1.8) (15). The V and J genes of unmutated germline ancestors of each clone were predicted using Change-O, with D segment and N/P regions masked by "N" nucleotides. Clonal clustering was performed using all three sequencing replicates (two obtained from genomic DNA and one mRNA) pooled together; however, after this mRNA sequences were separated and exclusively used for subsequent analysis.
The "Vaccine" dataset consists of samples from three male donors at 10 time points: -8 days, -2 days, -1 hour, +1 hour, +1 day, +3 days, +7 days, +14 days, and +28 days relative to seasonal influenza vaccination (16). The sample collection protocol is available in (16); sequencing, as well as data preprocessing for read quality and inference of clonal clusters is described in (14).
Briefly, samples were re-sequenced from mRNA using 5'RACE with unique molecular identifiers (UMIs) and sequenced using the Illumina MiSeq platform. Using pRESTO v0.4 (17), low quality bases in reads (Phred score < 5) were masked with Ns, paired-end reads were matched and assembled, and assembled sequences with the same UMI were collapsed into a consensus sequence (14). Following preprocessing, V(D)J assignment was performed using For HLP17/19 phylogenetic analysis it is necessary for all sequences within a particular clone to be placed in a codon alignment with the correct reading frame. To achieve this, sequences from each clone were multiple aligned using the IMGT numbering scheme, which was also used to determine the locations of CDRs and FWRs (18). It is possible that BCRs will accumulate insertion mutations relative to their germline precursor over the course of affinity maturation. In this case, Change-O removes insertions in its IMGT sequence alignment in order to preserve numbering. However, frame-preserving insertions occasionally occur mid-codon. Removing these insertions brings together nucleotides that do not actually form an observed codon in the sequence, creating a potentially false codon substitution. We detected these events by comparing input sequences to IMGT aligned sequences in Change-O files, and masking them using "N" nucleotides. We then trimmed all gap-only columns in IMGT alignments within each clone. This functionality, as well as converting Change-O files to IgPhyML inputs, is now included in the BuildTrees tool as part of Change-O v0.4.3+ (13).
Sequences from both years within the same patient were pooled together in the Age dataset, resulting in a single repertoire sample per subject. From both datasets, lineages with fewer than 2 unique sequences (not including the predicted germline) were discarded. Due to the large number of sequences per sample in the Vaccine dataset, and the computational complexity of our parameter estimation procedure, these repertoires were subsampled at each subject and time point combination. Sequences from each sample were subsampled without replacement until either the sample contained at least 3,000 unique sequences within clones of at least 2 unique sequences (occasionally this resulted in some datasets having up to 3,001 sequences), or all sequences were used. This cutoff was chosen because all samples from subject hu420139 and 420IV, and 5/10 samples from PGP1 had >3000 unique sequences in non-singleton clones.

Section S2: Simulating under the HLP19 model
We performed multiple simulation analyses to validate our parameter estimation procedure under a fully context-dependent version of the HLP19 model (Sections S3, S4, and S6). These were performed under different parameter values, tree topologies, and initial sequences as detailed in their respective files, but they share a common means of simulating sequence changes along a given tree. These processes require a starting tree topology, branch lengths, a germline sequence, and substitution parameters to be specified in advance. For a given lineage tree, the starting germline progenitor sequence is specified as either a random naïve B cell sequence (Sections S3 and S4) or the previously identified germline sequence in the empirical data (see Section S6).
For each branch in the tree we simulated the number of nucleotide substitutions by drawing from a Poisson distribution whose mean equals the branch length divided by three and multiplied by the number of codons in the germline sequence. Branch lengths were divided by three because branches lengths estimated under all substitution models used in this paper are in units of expected codon substitutions per site, while the simulation procedure operates on single nucleotides. For a given sequence we then calculated the probability of all single substitutions under the HLP19 Q matrix (Section S1) based on their full nucleotide context (rather than the mean field approximation; described by parameters h WRC , h GYW , h WA , h TW , h SYC , and h GRS ), whether the substitution was synonymous or non-synonymous (described by parameters w FWR and w CDR depending on whether the site was in the FWRs or CDRs), and whether the substitution was a transition or transversion (described by k). This process began at the specified germline progenitor and continued down the tree to the tips. We then masked the CDR3 of the germline sequence to mimic the behavior of the CreateGermlines prediction program, which is unable to reliably predict germline D segment and N/P regions (13). We also removed the CDR3 regions of all sequences in each dataset to avoid potential issues generated with our inability to properly predict this region. This process repeated for all clonal lineage trees in a dataset. These simulations were performed using custom Perl scripts, which are available at https://bitbucket.org/kleinstein/projects.

Issues with the simulation approach in Hoehn et al 2017
Hoehn by calibrating the Q matrix of the HLP17 model with its codon frequencies so that branch lengths could be interpreted as substitutions per site (discussed in Section S1), and then exponentiating this matrix to give the P matrix of substitution probabilities for each codon.
Codon substitutions were then introduced based on these probabilities at each site. However, because Q matrices are calibrated using codon frequencies, the degree to which branch lengths can be accurately interpreted as substitutions per site depends on how accurately the codon frequencies used reflect the composition of the sequences being simulated. For the HLP17 model, these frequencies are estimated through maximum likelihood as free parameters -they are not necessarily tied to the composition of the sequences. This results in branch lengths that less accurately reflect the expected substitutions per site than standard models (this inaccuracy is confirmed in Section S3). Unfortunately, because the simulations performed in that paper used calibrated, exponentiated Q matrices, this inaccuracy was present in the simulated data itself, leading to simulated datasets with fewer substitutions than expected given their branch lengths. This

Section S3: Performance of GY94, HLP17, and HLP19 substitution models
The goal of this simulation analysis is to compare the parameter estimation performance of the GY94 (5, 6), HLP17 (1), and HLP19 substitution models. Simulations were performed on two trees with identical topologies and total tree length (i.e. expected number of substitutions), but different branch length distributions (split and long; see Figure S3a below). We created artificial datasets by simulating sequences using 200 copies of either split or long tree topologies.
Simulations were performed as described in Section S2. In these analyses, we selected a random naïve heavy chain sequence from a dataset of two healthy individuals (7) for each tree in each dataset. We further used two sets of substitution parameters. In the first, named hot, and h GRS = -0.6. Overall, we created 50 datasets for each combination of tree topology and substitution model.

Figure S3a
: Trees used in simulation analyses described in Section S3. Repertoires specified as "split" used the tree in a, and repertoires specified as "long" used the tree in b. Branch lengths are in codon substitutions per site. All branches with unlabeled lengths are equal to zero. "Germline" is the germline progenitor sequence of the lineage, while Sequence 1 and Sequence 2 are the data sequences produced by the simulation process.  were set to their empirical estimates across all lineages within the repertoire using a CF3X4 model (19). As in Section S3, we selected a random heavy chain sequence from a dataset of naïve B cells from two healthy individuals (7) to act as the germline sequence for each tree in each dataset. The CDR3 region was removed for all sequences after simulation. Simulations were then performed using the procedure outlined in Section S2, using substitution parameters k within the repertoire using a CF3X4 model (19). We repeated this process both for each lineage individually, and for each repertoire using a repertoire-wide model. For individual trees, all parameters were estimated for each individual lineage.
Two types of simulations were performed. In the first simulation, parameters were identical among lineages, and w FWR = 0.5, w CDR = 0.7. We compared the accuracy of parameter estimates at the repertoire level versus the mean parameters of individual lineages containing at least 2, 10, and 30 sequences ( Table S4b). In the second simulation, the k and h parameters were identical among lineages, but w parameters were allowed to vary. This is motivated by the expectation that the strength of selection may vary among lineages but the mutational biases of SHM (i.e. hot-spots and cold-spots) are expected to be constant. For each lineage, the k and h parameters were identical to previous simulations, but w CDR was drawn from a gamma (shape=10, scale=0.07) distribution (mean = 0.7), and w FWR was drawn from a gamma (shape=10, scale=0.05) distribution (mean = 0.5). These are shown in Figure S4a. The results of these simulation analyses are shown in Table S4c. We performed 50 repetitions of each type of simulation.    Table S4c: Performance of repertoire-wide and individual estimates in simulations where w varies across lineages. In the first eight rows, the mean of the true parameter distribution ( Figure  S4a; dotted lines) was compared to the repertoire-wide estimate and mean individual estimate obtained from lineages with sequence counts at or above 2, 10 and 30 sequences. The last twelve rows show how repertoire-wide and individual estimates compared to the true parameter values of individual lineages (Figure S4a; distributions). Namely, for all lineages containing at least the specified number of sequences, that lineage's true parameter value was compared to the repertoire-wide estimate (using all lineages regardless of size), and the estimate obtained from that lineage individually. The mean of bias, variance, and mean squared error (MSE) across all comparisons in all repetitions is reported in the last three columns. The lowest (i.e. best) value for each comparison shown in bold. Numbers are rounded to two significant digits. Data were generated from a simulated repertoire using tree topologies from subject 97 in the Age dataset and identical parameters among lineages. Note that this figure is the same data as in Figure 1, but plotted across its full range.

Section S5: Comparison of model fit on empirical datasets
The Akaike information criterion (AIC) is a means of scoring the quality of model fit to data that compares the number of freely estimated parameters to the maximum log likelihood calculated using the model (20). Lower AIC values indicate a better model fit. Here, we calculate AIC for GY94 (5, 6), HLP17 (1), and HLP19 (Section S1) models when our parameter estimation procedure (Methods: Phylogenetic model parameter and topology estimation) was applied to each subject of the Age dataset ( To make AIC values comparable among the three models, we altered the HLP17 and HLP19 models slightly by multiplying the partial likelihood of each possible codon at the root by the frequency of that codon (π), as is typically done for reversible models (21). For GY94 these frequencies were the empirical estimates of codon frequencies under a CF3X4 model for each lineage within the repertoire, for HLP17 these frequencies were estimated by ML using the CF3X4 model and shared across all lineages within the repertoire, while for HLP19 these frequencies were estimated from the root sequence of each lineage (π g for FWR or CDR, Section S1).  Table S5: AIC values (rounded to the nearest integer) and w CDR estimates of all subjects in the Age dataset using the GY94, HLP17 and HLP19 substitution models, modified as detailed in Section S5.

Section S6: Simulation analyses based on empirical datasets
The goal of these simulation analyses is to test whether differences in the underlying tree topology and branch lengths of samples within the Age and Vaccine dataset could be responsible for observed trends between substitution model parameters and age in the Age dataset and time in the Vaccine dataset. As described in Methods: Phylogenetic model parameter and topology estimation, we first estimated maximum likelihood tree topologies, branch lengths, and shared values of w, and k for all lineages within each repertoire under the GY94 model. Codon base frequencies were set to their empirical estimates across all lineages within the repertoire using a CF3X4 model (19). For each lineage tree within each repertoire, we used the predicted germline sequence (see Repertoire dataset processing in Section S1) of the empirical data as the progenitor of each lineage. Masked codons (specified by 'NNN') were replaced with codons drawn from a random uniform distribution. These were primarily in the CDR3 region which was removed after simulations. Simulations were then performed using the procedure outlined in Section S2, using substitution parameters k = 2, w FWR = 0.5, w CDR = 0.7, h WRC = 4, h GYW = 6, h WA = 3, h TW = 1, h SYC = -0.6, and h GRS = -0.6. These were chosen to be similar to those estimated from the Age dataset (Figure 2). We simulated 20 repertoire datasets for each sample in both datasets using this procedure and removed the CDR3 regions of all simulated sequences. We then repeated the parameter estimation procedure in Methods: Phylogenetic model parameter and topology estimation by first estimating tree topologies, branch lengths, and shared values of w, and k for all lineages within each repertoire under the GY94 model, and then estimating substitution parameters across all lineages using a repertoire-wide HLP19 model. For GY94 parameter estimation, codon base frequencies were set to their empirical estimates across all lineages within the repertoire using a CF3X4 model (19).
We first investigated whether our simulations could reproduce the observed negative relationship between age, sex, and estimates of h WA . For each repetition of our Age dataset of 27 subjects, we fitted a multiple linear regression model with age and sex as interaction variables against substitution rate biases of WA motifs (i.e. h WA ). We compared the slope coefficients estimated from each of our simulated datasets to those observed in our empirical data. For all 20 repetitions the h WA slope coefficents for males and females were closer to zero than their respective empirical estimates (Figure S6a).
We next investigated whether our simulations could reproduce the behavior of w CDR over the time course of influenza vaccination in our Vaccine dataset. As shown in Figure S6c, estimates of w CDR varied fairly constantly around a mean slightly above the true value (0.7) for the entire time course for all three subjects. For subjects 420IV and hu420139, we calculated the fold change in w CDR at day +7 compared to the pre-vaccine sample (-1 hour) for each simulation, and compared it to the empirical estimate. No simulation reproduced the observed fold change at day +7 in either subject ( Figure S6b). As expected given its low sequence count, PGP1 day +14 had wide variation in w CDR estimate between repetitions. Figure S6c confirms that the simulation procedure reliably reproduced differences in mean tree length among repertoire datasets with little variation among repetitions.
We then tested whether our results in either the Age or Vaccine dataset could reproduce our observed negative relationship between w and mean tree length. For each repetition of our Age and Vaccine datasets, we fit a log-linear regression model in which either w CDR or w FWR are modelled against the natural logarithm of mean tree length. We compared the slope coefficients estimated from each of our simulated datasets to those observed in our empirical data. For all 20 repetitions, slopes of both w parameters from both the Age and Vaccine dataset were far lower than any slope parameters observed in the empirical Age or Vaccine datasets (Figure S6d).
These results are further shown in Figure S6e, which shows these simulation results plotted as regressions and scatterplots.     Coefficients are colored by region (i.e. whether they represent w CDR or w FWR ). The true value for all simulations w CDR = 0.7 and w FWR = 0.5. The y axis is scaled identically between the two plots.

Section S7: Additional simulations using an empirical SHM model
As an additional validation, we also performed simulations based on the Age dataset using the S5F model within SHazaM (15). Simulations were performed under a modified version of Alakazam v0.2.11 (13) and SHazaM v1.9.0 (15) which are noted below and are available at: https://bitbucket.org/kleinstein/projects. These simulations for each clone were performed as follows: 1. For a given set of clonal sequences of at least two unique sequences, maximum parsimony lineage tree topology and branch lengths were estimated using dnapars from PHYLIP v3.697 (22), implemented in Alakazam v0.2.11 (13).  (13). Mutations are added to the starting sequence based on their relative probability under the S5F model (15). Starting from the germline sequence at the root node of the tree, this process is repeated for each descendent node in the tree, with the number of mutations added between ancestor and child nodes equal to the branch length between them.
5. This process is repeated for all clones within a subject in the Age dataset.
6. These simulated datasets are converted into an IgPhyML-readable format using BuildTrees.py in Change-O v0.4.5 (13). We then repeat our parameter estimation procedure (see Methods:Phylogenetic model parameter and topology estimation) by first estimating maximum likelihood tree topologies, branch lengths, and repertoire-wide substitution parameters using the GY94 model, and then using these tree topologies to estimate maximum likelihood branch lengths and substitution model parameters under the HLP19 model.
For each simulation repetition of all 27 subjects, we fit a log-linear regression model in which either w CDR or w FWR are modelled against ln(mean tree length). We compared the slope coefficients estimated from each of our simulated datasets to those observed in our empirical data. Of the 50 simulation repetitions performed, none showed a slope as negative as that observed between either w CDR or w FWR and mean tree length (Figure S7a). These results are further shown in Figure S7b, which shows these simulation results plotted as regressions and scatterplots. From this we conclude that even under this different simulation framework, which uses a model of SHM not fully captured by the HLP19 model, that the observed trend is not the result of model bias between tree length and w, or selection of germline sequence.
We also compared performance of HLP17 and GY94 models on each of these simulated datasets. We repeated the parameter estimation procedure in Methods: Phylogenetic model parameter and topology estimation by first estimating tree topologies, branch lengths, and shared values of w, and k for all lineages within each repertoire under the GY94 model. For this initial GY94 parameter estimation, codon base frequencies were set to their empirical estimates across all lineages within the repertoire using a CF3X4 model (19). We then estimated substitution parameters across all lineages under repertoire-wide HLP19, HLP17, and GY94 models with separate w parameters for FWR and CDRs (w CDR and w FWR , respectively). Our results largely reproduce those of Section S3. As expected, both HLP19 w estimates are roughly centered at 1 (mean w CDR = 1.01 , mean w FWR = 1.01), while HLP17 tended to overestimate w CDR (mean w CDR = 1.11, mean w FWR = 1.0). GY94 tended to overestimate w CDR even more than HLP17, and underestimate w FWR (mean w CDR = 1.29, mean w FWR = 0.88). Overall, these confirm that HLP19 estimates of w are not severely affected by alternative models of SHM context sensitivity, while estimates using HLP17 and GY94 models are more biased.

Figure S7a
: Slope coefficients of log-linear regressions between HLP19 estimated w (w CDR =purple, w FWR =orange) and mean tree length among 50 simulation repetitions of the Age dataset. Vertical dashed lines show the values observed in their respective empirical datasets.

Figure S7b
: Log-linear regressions between HLP19 estimated w (w CDR =left panel, w FWR =right panel) and mean tree length among 50 simulation repetitions of the Age dataset. These represent the same data as in Figure S7a. The Y axis is scaled identically in the two plots.  Section S9: Estimates of w CDR for lineages of different sizes Figure S9: Estimates of w CDR for lineages of different sizes from the same repertoire (subject 420IV, day +7 post-vaccination). Samples on the x-axis show approximate quartiles of sequence count in the repertoire, with larger lineages on the right and smaller lineages on the left. Lineages within their respective size constraints were grouped together and w CDR was estimated for each group using the HLP19 model and the same parameter estimation procedure described in Methods: Phylogenetic model parameter and topology estimation. These results show that lineages with more sequences tended to have lower w CDR estimates compared to lineages with fewer sequences, though this trend is not monotonic. Sequence counts represent number of unique sequences in a lineage before the CDR3 region is masked.