Lineage motifs as developmental modules for control of cell type proportions

Analysis (LMA)


In brief
Biological tissues require fine-tuned cell type proportions for optimal function, but how this process is regulated remains poorly understood.Tran et al. suggest that lineage motifs reflect modular developmental programs that could constrain variation in cell type proportions.

INTRODUCTION
Most tissues comprise multiple specialized cell types that appear in appropriate proportions to support proper tissue-level functions.In many cases, cell type proportions vary spatially within the tissue.For example, the center of the primate retina is cone-dense, allowing for high visual acuity, while the periphery is rod-dense, enabling greater sensitivity in low light conditions. 1 Cell type proportions also vary between species.For instance, the ratio of rod and cone photoreceptors varies depending on the visual needs associated with the lifestyle of each species. 2 Tissue development thus faces the fundamental challenges of (1) generating cell types in correct proportions, and (2) facilitating spatial and evolutionary changes in those proportions. 3,4ne prevalent mechanism for specifying cell type proportions occurs through regulating cell fate differentiation.As progenitor cells undergo successive rounds of cell division, they progressively become restricted in their fate potential, eventually committing to terminal cell fates.This process can be described in terms of a collection of cell states and the rates at which cells in each state transition to other states, i.e., a cell state transition map 5 (Figures 1A and 1B).In some cases, like the nematode C. elegans, cell state transitions are deterministic, producing a stereotyped lineage tree in all individuals. 6owever, in most other organisms, one cannot infer a quantitative cell state transition map from any single lineage tree due to variability.For example, in the mammalian retina, individual progenitor cells can give rise to a wide distribution of cell numbers and types with no apparent fixed ratios between different types.This observation prompted investigators to initially suggest a stochastic view of cell fate determination. 7,87][18][19][20][21][22] Cell state transition dynamics can also integrate extrinsic signals, developmental time, and stochastic ''noise'' with internal progenitor states. 23,24Thus, even in well-studied systems such as the retina, it remains a major challenge to quantitatively elucidate cell state transition maps.
Different cell state transition maps can generate distinct distributions of cell fates on lineage trees.One simple transition map comprises a multipotent progenitor that can directly and probabilistically differentiate into multiple terminal fates (Figure 1A).A system employing such a direct, memoryless transition map would not exhibit fate correlations between related cells.Alternatively, a more complex transition map could involve the probabilistic generation of various types of committed progenitors, each predetermined to give rise to an invariant set of descendant cell types (Figure 1B).In this case, each type of progenitor would produce a characteristic distribution of descendant cell fates, introducing fate correlations on lineage trees.These fate correlations represent lineage motifs that reflect otherwise hidden progenitor states.Furthermore, based on what motifs are used to specify cell types in developing tissues, this could in turn limit variation in overall cell type proportions (Figure 1C).
Recently, new methods have begun to allow for lineage tree reconstruction at scale.Long-term in toto live imaging allows direct tracking of dividing progenitor cells. 257][28][29][30][31][32] These advances provoke the question of how fully resolved lineage trees with endpoint cell fates can be used to infer cell state transition maps.
To address this challenge, we introduce Lineage Motif Analysis (LMA), a computational approach for inferring statistically overrepresented patterns of cell fates on lineage trees.LMA is based on the principle of motif detection, which has been used to identify the building blocks of complex regulatory networks, 33 DNA sequences, 34,35 and other biological features, 36,37 but has not to our knowledge been applied to understand cell fate differentiation.As a ''bottom-up,'' data-driven approach, LMA does not require specific assumptions about underlying molecular mechanisms and can be applied to diverse systems for which sufficient cell lineage information is available.Biologically, motifs could be generated by progenitors intrinsically programmed to autonomously give rise to specific patterns of descendant cell fates.They could also reflect more complex cell state transition maps involving extrinsic cues and cell-cell signaling that generate correlated cell fate patterns on lineage trees.
Here, we first define LMA and demonstrate how accurately it performs using simulated datasets.We then identify lineage motifs in published zebrafish and rat retina development datasets, as well as a dataset of early mouse embryonic development.These results reveal spatial and temporal differences in cell fate determination across different progenitors.Further, the appearance of shared retinal motifs across different species suggests that motifs may be evolutionarily conserved features of development.Computationally, we explore how various dataset characteristics affect motif identification.We demonstrate how the use of lineage motifs facilitates adaptive variation in retinal cell type composition and show that this theory is consistent with known variation in vertebrate retinal cell type proportions.Together, these results support LMA as a broadly useful tool to understand cell fate differentiation.

DESIGN
A previous study analyzed sister cell fate correlations by comparing the frequency of two-cell clones with that predicted by random association of cell types given their observed proportions. 24Another study analyzed triplet fate correlations by comparing the frequency of triplet patterns with that observed in simulated lineage trees using a stochastic model where each starting progenitor can self-renew or differentiate into all possible cell types within the dataset under a set of probabilities. 38These studies provide evidence for fate correlations between related cells.However, a framework that can be recursively applied to any lineage tree dataset to systematically identify lineage motifs of varying size remains lacking.
We first simulated a dataset of lineage trees with two terminal cell fates (Figure 2A; STAR Methods).We then applied LMA to analyze the tree dataset, starting by enumerating all possible doublet and triplet cell fate patterns (with varying fate composition and order of fate differentiation) and counting the number of times each occurred within the observed trees (Figure 2B).Then, we compared these counts with those expected in a ''null'' model without fate correlations.This can be done by randomly shuffling the cell fates at the leaves of the lineage trees to generate resampled trees, followed by counting the number of times each pattern occurs across the resampled trees.We then repeat the resampling process many times.Since the arrangement of cell types in the resampled trees are randomized, the average of counts obtained within the null model represents the expected count if there is no relationship between lineage and cell fate.To identify larger motifs that span more than one cell division, the resampling was done in a manner that preserves the frequency of sub-patterns within each pattern (STAR Methods).
For each pattern, we computed a Z score to quantify the degree of over-representation, as well as a false discovery rate (FDR)-adjusted p value 39,40 to measure significance (STAR Methods).In the identified lineage motifs, higher over-representation can be interpreted as stronger intrinsic commitment of a given progenitor toward generating a particular fate pattern.Alternatively, it could represent the strength of extrinsic interaction that generates a particular fate pattern.Finally, anti-motifs, defined as patterns that are underrepresented in the observed trees, were identified using the same approach.
LMA is distinct from a related approach termed Kin Correlation Analysis (KCA).KCA infers cell state transition dynamics from lineage trees and endpoint cell state datasets but is mainly applicable to systems governed by Markovian dynamics, in which sister cell transitions are independent of one another. 41,42o demonstrate that LMA can recover lineage motifs that reflect progenitor states in cell state transition maps, we simulated lineage tree datasets using either a competence progression model (Figure 3A) or a binary fate model (Figure S1A).We used differentiation probabilities that generate roughly equal cell type proportions in the overall dataset (Figures 3B and  S1B).Applying LMA to both datasets, we found that the resulting motifs reflected the structure of the generative model and captured multiple levels of progenitor commitment over time.For example, in trees generated using a competence progression model (Figures S2A-S2C), where cell fates A through F are generated progressively over time, only symmetric doublet patterns, such as (F,F), were statistically overrepresented within all possible doublet patterns (Figure 3C).
We next analyzed triplet patterns, in which a single progenitor divides to produce a terminal cell, X, and a second progenitor cell that divides once more to produce a doublet of terminal cells, Y and Z, producing a triplet denoted as (X,(Y,Z)).Only triplet patterns including two sequential levels of progenitor commitment, such as (E,(F,F)), were significantly overrepresented (Figure 3D).
LMA can be scaled up to analyze larger asymmetric patterns.Given a reasonable number of trees (500 total), the motifs successfully captured up to 5 levels of the competence progression model.Similar to the triplet results, the significant higher-order motifs exclusively involved sequentially generated cell fates.As motif size grows larger, the size of the dataset required for detection also increases (Figure 3E).Together, these results confirm that LMA can be used to recursively identify lineage motifs in large patterns.
We also analyzed trees generated using a binary fate model in which progenitors make binary choices which restrict their fate potential over time (Figures S2D-S2G).The doublet and quartet motifs reflect the structure of the generative model as expected (Figures S1C and S1D).However, no octet patterns were significantly over-or underrepresented (Figures S1E and S1F).Taken together, these results indicate that LMA is capable of recursively identifying lineage motifs of multiple sizes in different models of development and is especially powerful when applied to the competence progression dynamics, likely due to the lower number of possible patterns per level of progenitor commitment.
Having demonstrated that LMA can recover motifs in lineage trees generated using an intrinsic program, we next sought to demonstrate that LMA could do so in trees generated using an extrinsic program.More specifically, we considered a simplified model of the classic developmental mechanism of lateral inhibition, in which cells of one fate inhibit similar fates in their neighbors [43][44][45] (Figures S3A-S3C).Our model assumes a two-dimensional grid of progenitors, which self-renew or differentiate into two cell fates, A or B, each of which inhibits differentiation of its neighbors into its own fate.The inhibitory effects of multiple neighbors are assumed to combine additively (Figure S3B).As expected, applying LMA to a null dataset generated without lateral inhibition revealed no significantly overrepresented doublet patterns (Figure S3D).By contrast, symmetric sister doublets (A,A) and (B,B) were underrepresented in the lateral inhibition model, whereas the asymmetric sister doublet (A,B) was overrepresented.These results show that extrinsic developmental programs can generate signatures of fate correlation on lineage trees, which can be reliably detected using LMA.
To enable the identification of lineage motifs across diverse developmental contexts, we created a Python package, termed ''linmo.''The package is available on a GitHub repository (https://github.com/labowitz/linmo),which includes supporting documentation and tutorials for processing the following lineage tree datasets analyzed here.(B) The LMA workflow consists of the following steps.First, all possible cell fate patterns are enumerated.Second, the occurrence of each pattern within the observed lineage trees is counted (triplet pattern ''E'' is shown here as an example).Third, the cell fate labels at the leaves of the trees are randomly shuffled to obtain a resampled set of trees with no fate correlations.This process is then repeated across many resamples.To identify the higher-order motifs that span multiple cell divisions, the shuffling process is done in a manner that preserves sub-pattern frequency (STAR Methods).The occurrence of each cell fate pattern is then counted for each resample.Fourth, the count in the observed lineage trees is compared with the distribution of counts across resamples, whose average is approximately equal to the expected count if there were no fate correlations.Finally, overrepresented patterns are classified as lineage motifs, which represent possible committed progenitors or extrinsic interactions.

LMA reveals spatial organization of zebrafish retina development
Retina development provides a well-studied example of cell fate diversification.It involves generation of a conserved set of terminal cell fates across diverse vertebrate species.At the same time, it also exhibits substantial spatial and interspecies variation in cell type proportions, 1 making it an ideal target tissue for LMA.Therefore, we examined a zebrafish retina development dataset spanning 32 to 72 h post fertilization (hpf), 46 during which progenitors terminally differentiate to form major neuronal and glial cell types, including ganglion (G), amacrine (A), bipolar (B), photoreceptor (P), horizontal (H), and M€ uller glia (M) (Figure 4A).He et al. used time-lapse confocal microscopy in reporter zebrafish lines to track every cell division event for 60 retinal progenitors spanning the nasal-temporal axis.Their data supported previous work showing that a wave of differentiation starts in the nasal region and gradually progresses to the temporal region. 47,48Clonal cell type composition was generally observed to be variable, with weak fate correlations between related cells.A key exception, however, was the frequent appearance of symmetric terminal pairs of photoreceptor, bipolar, and horizontal cells.We sought to identify lineage motifs and characterize how their frequency varies across spatial regions in the zebrafish retina.Therefore, we partitioned lineage trees based on the progenitor spatial location and applied LMA, beginning with doublet patterns, representing the terminal cell division.We found that the (H,H), (B,B), and (P,P) doublet patterns had significantly higher observed counts in the lineage trees, compared with the distribution of counts across resamples and expected count, in a similar manner across the three spatial regions (Figures 4B-4D).Therefore, these doublet patterns are statistically overrepresented in the dataset and represent lineage motifs (Figure 4E).The exception was a lack of (H,H) and (B,B) doublets in the nasal region, likely because those cell types were only present at very low levels in this region (Figures 4D and 4E).These results were consistent with key findings from He et al., while extending the analysis to assess regional variation.
LMA also found motifs not previously identified in the He et al. study and revealed how their frequency varies across space.For example, even though amacrine and bipolar cells appear at similar frequencies across all three retinal regions, the (A,B) doublet was specifically overrepresented in the nasal region (Figures 4D and 4E).Also, doublets comprising one P cell and all other cell types were generally underrepresented across all regions, constituting anti-motifs.We also searched for higher-order motifs that involve multiple cell divisions but found that no patterns were significantly over-or underrepresented, possibly due to the limited size of the dataset (Figure S4).Overall, the observed motif profile suggests that amacrine and bipolar cells frequently share a common progenitor at the terminal cell division, specifically in the nasal region of the zebrafish retina, whereas photoreceptor and non-photoreceptor cells do not share a common progenitor at the terminal cell division in all regions.

Shared retinal lineage motifs across species suggest conservation of cell fate determination
Are retinal lineage motifs conserved between different species?To address this question, we analyzed a dataset of post-natal rat retinal progenitor cells grown in vitro at clonal density, consisting of 129 lineage trees with at least 3 cells. 38During post-natal development, rat retinal progenitor cells gave rise to mostly rod cells (R), some bipolar and amacrine cells (respectively, B and A), and few M€ uller glia (M) (Figure 5A).In this work, the authors showed that a stochastic model based on independent fate decisions could explain the observed frequencies of most triplet patterns.However, some triplets may be generated by fatecommitted progenitors that give rise to sets of correlated cell fates.
Applying LMA to this rat retina dataset confirmed some of these conclusions, such as over-representation of (B,(A,B)) triplets (Figures 5C and 5E).However, it also revealed additional features of rat retinal development.For example, using LMA, we found that (A,B), (B,M), and (A,A) doublets were overrepresented, whereas (B,R) doublets were underrepresented (Figures 5B and 5D).Correcting for sub-pattern frequencies in the triplet analysis revealed that the apparent over-representation of the (R,(A,A)) triplet in the previous study 38 could be entirely explained by the (A,A) doublet motif frequency.This highlights the importance of the recursive nature of LMA.
Because this dataset excluded two-cell lineages, this could potentially introduce biases in three-cell motifs.Therefore, we analyzed cell type proportions in triplets and compared this with those across all other cells (Table S1).We found that there are no obvious differences in cell type proportions between triplet and non-triplet populations, suggesting that the lack of two-cell lineages in the dataset does not substantially bias the triplet motifs detected here.
We next compared the motif profile between zebrafish and rat retina.Because the time period analyzed in these datasets is different and involves the generation of different cell types, we limited this analysis specifically to cell types that are shared between the analyzed datasets (i.e., amacrine, bipolar, and M€ uller glia).Notably, the (A,B) and (A,A) motifs are observed in both species, suggesting that the committed progenitors that these motifs possibly represent are at least partially evolutionarily conserved.In contrast, the (B,B) motif appears specifically in the zebrafish retina, whereas the (B,M) motif appears specifically in the rat retina.Overall, these data suggest that cell fate allocation in retina across species can occur in a biased and evolutionarily conserved manner, in which amacrine and bipolar cells share a common progenitor at the terminal cell division.At the same time, other aspects of cell fate differentiation may be more species-specific.For example, bipolar and M€ uller glia tend to share a common progenitor in rat, but not zebrafish, retina at the terminal cell division.More generally, these results provide a case example for how LMA can be used to assess the evolution of cell fate differentiation.

Computational simulations reveal how various dataset characteristics affect motif identification
To what degree the limited size of available lineage tree datasets, coupled with sampling variation, affects the accuracy of motif identification is unclear.Therefore, we simulated lineage tree datasets using a stochastic model, using set division and fate probabilities based on the Gomes et al. dataset (Figure S5A), while varying dataset size.We also added varying levels of sister fate correlation, quantified as conditional probabilities, into the model.For example, the conditional probability of a rod cell also having a rod sister, P(R sister | R), would be the same as the overall frequency of rod cells, P(R), if no sister fate correlations were present.However, this conditional probability could be increased or decreased to reflect sister fate correlation or anti-correlation, respectively.To further explore how differences in absolute cell type frequency impact motif identification, we allowed sister fate correlations for either rod or M€ uller glia fates, which are present at 75% or 3%, respectively.Analysis of the simulated lineage tree datasets revealed several conclusions.First, no motifs were detected when lineage trees were generated without sister fate correlations, across all tested dataset sizes (Figures S5B and S5C).This indicates that our thresholds for significance are stringent and ensures the absence of false positive motifs.Second, fate correlations are more strongly detected in larger datasets (Figures S5D and  S5E).Conversely, they often go undetected within small datasets even if sister cell fates are completely correlated or anticorrelated.This suggests that analysis of small datasets is likely to be affected by high false-negative rates.Third, fate correlations are more strongly detected when the conditional probability deviates further from overall cell type frequency (Figures S5D  and S5E).Fourth, fate correlations are more strongly detected for more abundant cell types.Overall, this analysis suggests that the fate correlations detected in both zebrafish and rat retina were very strong, since those datasets were relatively small (60 and 129 trees, respectively).However, there may also be weak fate correlations that were not detected as motifs due to limited dataset size.

LMA reveals temporal differences in cell commitment during early mouse development
Early embryonic development features conserved cell types across mammals and spatially restricted cell fate specification, making it an ideal system to apply LMA.We therefore used LMA to analyze a dataset of early mouse embryo development. 49n a previous study, Morris et al. used time-lapse confocal microscopy to trace individual progenitor cells starting at the 8-to 16-cell division within 20 mouse blastocysts until their final fate is known at the late blastocyst stage ($E4.5).Beginning at the 16-cell stage, progenitors can either be located inside, contributing to the inner cell mass, or be located outside along the periphery (Figure 6A).(For shorthand, we will refer to the progenitors at the 16-cell stage simply as ''inside'' and ''outside'' progenitors.)During the next two rounds of cell divisions, outside progenitors can become internalized, adding to the inner cell mass, or continue to remain outside, eventually differentiating into trophectoderm (T).Cells within the inner cell mass either undergo apoptosis (A) or further differentiate into either epiblast (E) or primitive endoderm (P) fates.The authors found that inside progenitors are biased to give rise to epiblast, whereas outside progenitors that internalize later are biased to give rise to primitive endoderm.However, it remained unclear whether individual progenitors give rise to sets of correlated cell fates in the final few divisions prior to fate assignment.
We therefore partitioned lineage trees into those generated from inside or outside progenitors and applied LMA to both sets.We found that most doublet patterns (90%) within both types of progenitors are either motifs or anti-motifs (Figure S6).For example, symmetric sister pairs such as (P,P), (E,E), and the apoptotic doublet (A,A) were overrepresented among descendants of both inside and outside progenitors.(T,T) was also overrepresented among outside progenitors (Figure S6A).These results suggest that by E4.5, most cells have already committed to one of the three lineages before the previous cell division and therefore produce symmetric doublets.We also observed doublet motifs comprised multiple cell types, such as (A,P), (A,E), and (E,P), which were overrepresented in trees from outside progenitors while underrepresented in trees from inside progenitors.Trophectoderm, unlike the other cell fates, was part of all anti-motifs among outside progenitors.Overall, the weaker motif signatures for inside progenitors suggest less commitment compared with the strong, and usually symmetric, doublet motifs among the descendants of the outside progenitor cells (Figure S6C).
Analyzing higher-order motifs among outside progenitors, we strikingly observed both triplet motifs with multiple cell fates, such as (A,(P,P)) and (T,(A,P)), and the homogeneous triplet motif (P,(P,P)), suggesting the existence of committed progenitors at least two generations earlier (Figure 6B).In contrast, all triplet patterns for inside progenitors did not significantly deviate from null expectations, suggesting that they remain uncommitted toward defined fates (Figure 6C).Taken together, these results suggest that some outside progenitors may commit to give rise to defined groups of cell types at least two cell divisions before blastocyst formation, while inside progenitors remain plastic and uncommitted toward certain fates (Figure 6D).

Lineage motifs facilitate adaptive variation in cell type frequencies
The fitness landscape over cell type frequency space could in principle have peaks of high fitness, plateaus of relatively constant fitness, or valleys of low fitness.Inspired by the concept of Pareto optimality in evolutionary trade-offs, 3,4,50 we asked whether it is possible to structure the cell state transition map in such a way that allows cell type frequencies to disproportionately populate the high fitness, or more adaptive, regimes.We reasoned that lineage motifs could address this problem.Mathematically, lineage motifs represent a linear transformation from a set of motif frequencies to a set of cell type frequencies.If most cell fate decisions resulted in generation of cell types as motifs, then a developing tissue could indirectly control the frequencies of individual cell types by specifying the frequency of each motif (Figure 1C).S1.
More precisely, we can describe the conversion from motif frequencies to cell type distributions as a linear transformation: zðsÞ = X 3 yðsÞ + eðsÞ.Here, zðsÞ is a vector whose components represent the counts of each cell type in position/species s, X is a non-negative integer matrix describing how many cells of each type (rows) are produced by each motif (columns), yðsÞ denotes the motif frequencies in position/species s, and eðsÞ represents the number of additional cells of each type in position/species s that cannot be explained through the motifs (Figure 7A).
To understand how motifs constrain cell type frequencies, we first constructed a set of hypothetical motif matrices X 0 , X 1 , X 2 , each reflecting a different motif structure.X 0 is a diagonal matrix representing the null model in which each column trivially corresponds to a single cell type.In contrast, X 1 and X 2 contain  exclusively doublet or triplet motifs, respectively, where multiple cell types are generated together (Figure 7B).For each motif matrix, we simulated datasets by randomly choosing frequencies for each of the motifs present within each matrix (STAR Methods).For simplicity, we initially assumed e = 0 and constrained cell type frequencies to sum to a constant, P i z i = const, to reflect the limited total capacity of the tissue.
We then analyzed the range of cell type distributions produced by each matrix.In this framework, each randomly chosen y(s) would generate a particular z(s), which corresponds to one individual with a particular set of cell type frequencies.Under the null model, cell type frequencies spanned the full space, as expected.By contrast, the other two models restricted cell type frequencies to limited subspaces.
Although motifs can constrain cell type distributions in general, it remained unclear whether the specific motifs observed in the rat retina dataset would be consistent with the distributions of retinal cell types independently observed in different species.Addressing this question requires (1) defining the motif-accessible space of cell type frequencies permitted by the observed rat retina motifs, and (2) determining whether independently measured retinal cell type proportions from other vertebrate species lie within that space.
To define the motif-accessible frequency space, we first need to determine the lower and upper bounds for cell type frequencies.We set the lower bounds at e 3 , the cell type counts for all cells in the rat retina dataset born outside of a motif (STAR Methods; Figure S7A).We set the upper bounds by constraining the total number of cells to be the same as in the rat retina dataset, Using these constraints, we simulated datasets containing randomly chosen frequencies for each of the observed rat retina motifs in X 3 , or as a control, the null model, X 0 .The motif model accessed only a subset of the space of type proportions spanned by the null model (Figure 7C).Within this subspace, the motif model showed higher density of fate distributions corresponding to moderate levels of both amacrine and bipolar cells and low levels of M€ uller glia.Bipolar cells and M€ uller glia exhibited a reduced maximum proportion relative to the null model, consistent with the observation that both of these cell types are generated with other cell types in the rat retina motifs.
We compared the datasets generated using the motif or null model with independent measurements of retinal cell type proportions across multiple vertebrate species. 2,51Strikingly, this analysis revealed that all of the independently measured fate distributions of vertebrate retina lie within, or very close to, the subspace accessed by the motif model.To understand how the structure of the motif matrix impacts the resulting cell type distributions, we repeated this analysis omitting the (A,A) motif from the motif matrix X 3 (Figure S7B).This resulted in a smaller subspace achieved by the motif model, specifically lowering the maximum proportion of A cells from 62.8% to 46.0% (Fig- ure S7C).This perturbed model failed to capture the empirical cell type distributions, indicating that the (A,A) motif is required to explain variation in cell type proportions across vertebrate retina.Taken together, these results are consistent with the notion that motifs identified in lineage trees of rat retina could facilitate evolutionary variation in retinal cell type proportions across vertebrates.They further show that the range of cell type proportion space achieved using the motif model can be expanded or constrained by respectively increasing or decreasing the number of different motifs in the model.

DISCUSSION
Producing cell types in optimal ratios is essential for tissue function.In many contexts, these proportions are established during development, when intrinsically committed progenitors or extrinsic interactions generate sets of terminal cell types.Increasing recent attention to the role of lineage in development 52 and the emergence of new methods for reconstructing lineage trees 53,54 provoke the question of how one can infer committed progenitors or extrinsic interactions based on the arrangement of descendant cell fates on lineage trees.
In this work, we introduce a general computational approach, LMA, based on statistical resampling of lineage trees.Using simulations, we demonstrated that LMA can be recursively applied to uncover fate correlations in large patterns that span multiple cell divisions.By applying this framework to three biological datasets, we identified fate correlations, some of which validate known fate patterns.In the retina, motifs can recur across space, or appear specifically in different tissue regions.The presence of shared motifs across zebrafish and rat retina suggests evolutionary conservation of retinal cell fate determination.In early mouse development, inside progenitors at the 16-cell stage appear plastic and uncommitted toward certain fates compared with outside progenitors, when analyzing their last two cell divisions before blastocyst formation.Finally, we showed that the rat retina motifs, if utilized in different frequencies, could explain variation in cell type frequencies across several vertebrate species.Lineage motifs thus provide a useful and biologically meaningful lens through which we can analyze cell fate differentiation.
Lineage motifs could be regarded simply as the consequence of a differentiation process that requires the cells to pass through intermediate states of partial fate commitment.However, this explanation still leaves open the question of why certain commitment states have been selected for during evolution.One potential answer is that lineage motifs play functional roles in controlling cell type distributions.Developing organisms could use various regulatory strategies including intrinsic transcription factors or extrinsic signals as ''knobs'' to modulate motif frequencies.Since lineage motifs represent groups of cell types (B) Cell type distributions were simulated by randomly varying the frequencies of motifs using three example motif matrices (X 0 , X 1 , X 2 ).(C) Cell type distributions were simulated using a null model (X 0 ) or the empirical motif matrix based on the rat retina motifs (X 3 ) in Figure 5.The independently measured cell type proportions of mouse, rabbit, monkey, and chick retinas from Masland 2 and Yamagata et al. 51 were overlaid on the ternary plot.See also Figure S7.
produced in fixed stoichiometric ratios, this mechanism could restrict variation of cell type proportions during development or evolution to physiologically adaptive regimes.In the future, we anticipate greater availability of high-quality lineage datasets, which should allow more complete tabulation of motifs across different tissue contexts.These data should thus enable more stringent tests of the model proposed here.
A second potential role for lineage motifs could be to create spatially localized neighborhoods of interacting cell types to implement specific functions.In the context of the retina, particular types of interneurons must be synaptically connected.For example, in crossover inhibition, OFF bipolar cells receive input from ON amacrine cells, which are depolarized by ON bipolar cells at light onset. 55A neural circuit of these cell types in close spatial proximity could be ensured by regulating the generation of these cell types through a lineage motif, such as the (B,(A,B)) motif observed in the rat retina dataset (Figure 5E).Consistent with this hypothesis, recent work has shown that specific synapses develop preferentially among sister excitatory neurons in the mouse neocortex. 56ineage motifs can be compared with other methods for analyzing cell fate differentiation, such as pseudotime, where single cells are densely profiled throughout time to obtain a population-level branching continuum of cell states. 53A previous study using pseudotime inference suggested that molecularly defined subpopulations of retinal progenitors give rise to different sets of cell types. 21In particular, neurogenic early stage progenitors give rise to ganglion, amacrine, and horizontal cells, Otx2+ late-stage progenitors give rise to bipolar and rod cells, and other late-stage progenitors give rise to M€ uller glia.However, in our analysis of both the zebrafish and rat retina, we observe progenitors that are biased to form a sister pair of one amacrine and one bipolar cell.In the rat retina, we also observe progenitors that are biased to form a sister pair of one bipolar cell and one M€ uller glia.Therefore, individual progenitors during development can generate lineage patterns that deviate from the population-level trajectories inferred using pseudotime. 57In future work, pseudotime trajectories could be refined by using dynamic information learned through lineage motifs.
Looking forward, LMA should be especially useful for contexts that have systematic spatial or cross-species variation in cell type composition.Deeper tree reconstructions could enable the analysis of developmental hyper-motifs, representing higher level correlations between constituent motifs. 58Analyzing how signaling or transcription factor dynamics are correlated with the generation of motifs will reveal how this process is regulated during development.Overall, by decomposing complex lineage trees into their functional building blocks, lineage motifs should help provide insight into longstanding questions in development and evolution.
Limitations of the study Because progenitor states are not directly observed in the datasets analyzed here, we cannot make claims about the exact dynamics that regulate how progenitors differentiate over time to give rise to lineage motifs.Incomplete identification of cell types due to the use of limited numbers of markers in the experimental studies analyzed here could prevent discovery of more complex motifs.Finally, the largest limitation has to do with the limited size of existing datasets.Although we have identified motifs and anti-motifs in three different datasets in this work, it is likely that not all fate correlations will recur with high enough significance to be classified as a motif.Programs with weaker fate correlations and datasets of limited size can hinder motif identification, as explored in Figures S5D and  S5E.In the small retina datasets analyzed here, rare lineage combinations could be falsely underrepresented (i.e., absent) or overrepresented in the captured data due to limited sampling size.Future work leveraging lineage recording systems should allow the production of much larger spatially resolved lineage tree datasets that could be analyzed with the approaches introduced here.

METHOD DETAILS
Lineage tree resampling and motif identification NEWICK-formatted lineage trees were first sorted to have doublet and quartet patterns arranged in alphabetical order according to their cell type annotations.All patterns were then aligned in order of earliest to latest born cells.For example, before alignment, triplet patterns could be present in the raw lineage tree data as ((X,X),X) or (X,(X,X)), and were therefore aligned to match the latter format in all cases.A similar procedure was followed for higher-order patterns, like asymmetric quartets, quintets, sextets, and septets.
All cell types and cell type patterns were then enumerated and counted for the number of occurrences within the lineage trees.The datasets were then appropriately resampled according to the type of motifs to be identified.For doublet motif identification, each cell type in the lineage tree dataset was replaced by a random cell type drawing from a list of all cell types within the dataset.This procedure maintains tree topology and overall cell type proportions but eliminates fate correlations between related cells.Our results were not sensitive to replacing with vs. without replacement.To detect larger motifs, it is necessary to control not only for overall cell type frequencies but also for the frequencies of any ''sub-patterns'' within the pattern of interest.For example, a triplet pattern comprising a sister cell doublet and their common cousin could appear over-represented solely because the sister doublet is itself a motif.To account for this, the lineage trees were resampled in a manner that preserves sub-pattern frequency, by drawing from a pool of similar sub-patterns across all trees.Therefore, for triplet motif identification, each singlet and doublet in the lineage tree dataset was respectively replaced by a random singlet or doublet drawing from a list of all singlets or doublets in the dataset.In this way, the overall frequencies of singlets and doublets remains the same across the resampled dataset while eliminating fate correlations between particular singlets and doublets.For quartet motif identification, each doublet in the lineage tree dataset was replaced by a random doublet drawing from a list of all doublets in the dataset.A similar procedure was followed for increasingly larger patterns.
The occurrences of each pattern were counted for each resampled dataset, then used to calculate an average number of occurrences and standard deviation across all resamples.The average and standard deviation were then used to calculate a z-score as follows: where x is the observed count in the original set of lineage trees, x is the average count across all resamples, and s is the standard deviation across all resamples.For plotting, the expected count of each pattern was calculated by multiplying the marginal probabilities of observing each of the two sub-patterns by the total number of that type of pattern across the entire dataset.Additionally, if the sub-patterns were not identical, the expected number was multiplied by two.For example, the expected number of the triplet (A,(B,C)) would be calculated as P(A) * P((B,C)) * total number of triplets * 2, and the expected number of the quartet ((A,B),(A,B)) would be calculated as P((A,B)) * P((A,B)) * total number of quartets.The null z-scores were calculated by repeating the same resampling procedure above for randomly chosen resampled datasets.

Construction of synthetic lineage tree datasets
To generate the example lineage trees shown in Figure 2, lineage trees were simulated using a one-step model in which a progenitor can self-renew or differentiate using probabilities of 10 or 90% respectively.For all differentiated cells, one of three possible fates were assigned with equal probability: blue, green, or a committed progenitor.Finally, two cell divisions were simulated for all committed progenitors, such that each gave rise to a triplet of (green, (green, blue)) with 100% probability.Cell divisions and fate differentiation were repeatedly simulated for all progenitors present within the tree until all cells reached terminal fates.
To test the recursive nature and accuracy of LMA in Figures 3 and S1, synthetic lineage tree datasets were simulated using a competence progression model or binary fate model of development.Each tree started as an 'a' or 'i' progenitor for each respective model, and a cell division was simulated producing two descendant cells, whose fates were chosen probabilistically based on the transition probabilities of the parental progenitor type.Cell divisions and fate differentiation were repeatedly simulated for all progenitors present within the tree until all cells reached terminal fates (A-F or A-H for each respective model).
To test LMA on lineage trees generated using an extrinsic model of development in Figure S3, synthetic lineage tree datasets were generated by populating a grid with 61 x 61 progenitor cells.A cell division was simulated for a randomly chosen progenitor, producing two descendant cells.One of the descendent cells remains at the same position in the grid, while the other cell is randomly placed in the immediately adjacent up, down, left, or right position, shifting all other cells within the column or row in an outwards manner.Fates were then chosen for each of the descendent cells probabilistically based on the transition probabilities as defined in Figure S3B.Cell divisions and fate differentiation were repeatedly simulated for all progenitors present within the grid in a similar manner until all cells reached terminal fates.For the extrinsic model, P(A)* was calculated as (40 -10n A ) * SF, where n A is the number of neighboring A cells with 4-connectivity, SF is 80 / ((40 -10n A ) + (40 -10n B )), and n B is the number of neighboring B cells with 4-connectivity.
To test various dataset characteristics on motif identification, synthetic lineage tree datasets were generated by simulating selfrenewal and differentiation in a progenitor cell, using probabilities as defined in Figure S5A.To account for rod sister fate correlations, each doublet of terminally differentiated cells was first assigned one fate based on the standard set of fate probabilities.If the assigned fate was indeed a rod cell, then the fate probability of the remaining cell within the terminal doublet was modified to be the conditional probability being tested.Cell divisions and fate differentiation were repeatedly simulated for all progenitors present within the tree until all cells reached terminal fates.This process was repeated in a similar manner to incorporate M€ uller glia sister fate correlations.

Simulation of cell type proportions with input motif matrices
Cell type proportions were first simulated using input motif matrices (Figure 7B) by choosing random frequencies for each motif and taking sets of cell types that were of total size 100 cells (for X 0 and X 1 ) or 99 cells (for X 2 ).For the motif transformation using the motifs measured in the rat retina dataset, e 3 was computed as z 3 À X 3 y 3 where z 3 is the total number of A, B, and M cell types in the dataset, X 3 is the empirical motif matrix based on the observed motifs, [((B,(A,B)), (A,A), (A,B), and (B,M)], and y 3 is the number of occurrences of each motif within the dataset (Figure S7A).Similarly, e 4 was computed as z 4 À X 4 y 4 where z 4 is the total number of A, B, and M cell types in the dataset, X 4 is the empirical motif matrix based on the observed motifs, [((B,(A,B)), (A,B), and (B,M)], and y 4 is the number of occurrences of each motif within the dataset (Figure S7B).The total number of 113 A, B, and M cells across the entire dataset was used for P i ðz 3 Þ i and P i ðz 4 Þ i .R cells were omitted from this analysis because no rat retinal motifs contained R cells.

QUANTIFICATION AND STATISTICAL ANALYSIS
For resampling lineage trees, 10 4 datasets were used to generate counts for each pattern across the resamples that resemble a normal distribution.For the motif transformation, 10 5 datasets were generated to sample the possible space of cell type proportions.The p-value for all patterns in the paper was calculated by (1) determining whether the observed count is higher or lower than the average across the resamples (the null distribution), (2) counting the number of resamples that have counts at least as extreme as the observed counts, and (3) dividing this by the total number of resamples to obtain a one-sided p-value.P-values were adjusted to the total number of patterns analyzed using the Benjamini/Hochberg correction with false discovery rate (a) = 0.05 and a two-stage linear step-up procedure with estimation of the number of true hypotheses. 39,40DITIONAL RESOURCES GitHub repository: https://github.com/labowitz/linmolinmo package documentation: https://labowitz.github.io/linmo/

20 CFigure 1 .
Figure1.Cell type proportions can be controlled using partially stochastic cell state transition maps that specify defined groups of cell types as motifs (A) A completely stochastic cell state transition map where a multipotent progenitor can self-renew or give rise to different fates in a memoryless manner.Lineage trees (only triplets shown) generated under this transition map would not exhibit fate correlations between related cells.(B) A partially stochastic cell state transition map where a multipotent progenitor can self-renew or give rise to different types of committed progenitors.The committed progenitors differentiate, and each gives rise to a defined set of cell types (motif A or B).Lineage trees generated under this transition map would exhibit fate correlations between related cells, representative of the committed progenitors present within the transition map.(C) In tissues that specify cell types solely by modulating the frequency of motif A and B, variation in cell type proportions is capped such that a cell type can be at most twice as abundant as the other type.

Figure 2 .
Figure 2. Lineage Motif Analysis identifies fate correlations in lineage trees by statistical resampling(A) Lineage trees with two cell types were simulated (STAR Methods).(B) The LMA workflow consists of the following steps.First, all possible cell fate patterns are enumerated.Second, the occurrence of each pattern within the observed lineage trees is counted (triplet pattern ''E'' is shown here as an example).Third, the cell fate labels at the leaves of the trees are randomly shuffled to obtain a resampled set of trees with no fate correlations.This process is then repeated across many resamples.To identify the higher-order motifs that span multiple cell divisions, the shuffling process is done in a manner that preserves sub-pattern frequency (STAR Methods).The occurrence of each cell fate pattern is then counted for each resample.Fourth, the count in the observed lineage trees is compared with the distribution of counts across resamples, whose average is approximately equal to the expected count if there were no fate correlations.Finally, overrepresented patterns are classified as lineage motifs, which represent possible committed progenitors or extrinsic interactions.

Figure 3 .
Figure 3. Lineage motifs reflect sequential progenitor states in a competence progression model (A) Lineage trees were simulated using a competence progression model.(B) Cell type proportions in 500 simulated lineage trees.(C) Deviation score for doublet patterns.Null Z scores were calculated by comparing a random resample dataset with the rest of the resample datasets.10 datasets containing 500 simulated trees each were used, with the standard deviation across the datasets plotted as error bars (** and *** represent adjusted p value < 0.005 and < 0.0005, respectively).(D) Deviation score for triplet patterns.(E) Deviation score for select patterns that reflect sequential differentiation of cell types using datasets of varying size.Shading indicates 95% confidence interval across 10 datasets for each point.See also Figures S1-S3.

Figure 4 .Figure 5 .
Photoreceptor (P) (D) Deviation score for doublet patterns in observed lineage trees.Null Z scores were calculated by comparing a random resample dataset with the rest of the resample datasets.(E) Deviation score for triplet patterns in the observed lineage trees.See also Figure S5 and Table

Figure 6 .
Figure 6.Triplet lineage motifs in mouse blastocyst development suggest that ''outside'' progenitors initiate fate commitment earlier than ''inside'' progenitors (A) Schematic of early mouse embryo development and the cell types involved.(B) Counts for triplet patterns in the observed mouse blastocyst trees from Morris et al. 49 in the outside progenitors and across 10,000 resamples (* and ** represent adjusted p value < 0.05 and < 0.005, respectively).The expected count was calculated analytically (STAR Methods).(C) Counts for triplet patterns in the observed mouse blastocyst trees in the inside progenitors and across 10,000 resamples.(D) Deviation score for triplet patterns in the outside and inside progenitors.Triplet patterns with an observed and expected count of 0 were omitted from the analysis.See also Figure S6.

Figure 7 .
Figure 7. Motifs can facilitate optimal variation in cell type frequencies between species (A) The zðsÞ = X Ã yðsÞ + eðsÞ matrix equation describes the linear transformation from motif frequencies to cell type distributions.(B) Cell type distributions were simulated by randomly varying the frequencies of motifs using three example motif matrices (X 0 , X 1 , X 2 ).(C) Cell type distributions were simulated using a null model (X 0 ) or the empirical motif matrix based on the rat retina motifs (X 3 ) in Figure5.The independently measured cell type proportions of mouse, rabbit, monkey, and chick retinas from Masland 2 and Yamagata et al.51 were overlaid on the ternary plot.See also FigureS7.