Quantifying intrinsic and extrinsic control of single-cell fates in cancer and stem/progenitor cell pedigrees with competing risks analysis

The molecular control of cell fate and behaviour is a central theme in biology. Inherent heterogeneity within cell populations requires that control of cell fate is studied at the single-cell level. Time-lapse imaging and single-cell tracking are powerful technologies for acquiring cell lifetime data, allowing quantification of how cell-intrinsic and extrinsic factors control single-cell fates over time. However, cell lifetime data contain complex features. Competing cell fates, censoring, and the possible inter-dependence of competing fates, currently present challenges to modelling cell lifetime data. Thus far such features are largely ignored, resulting in loss of data and introducing a source of bias. Here we show that competing risks and concordance statistics, previously applied to clinical data and the study of genetic influences on life events in twins, respectively, can be used to quantify intrinsic and extrinsic control of single-cell fates. Using these statistics we demonstrate that 1) breast cancer cell fate after chemotherapy is dependent on p53 genotype; 2) granulocyte macrophage progenitors and their differentiated progeny have concordant fates; and 3) cytokines promote self-renewal of cardiac mesenchymal stem cells by symmetric divisions. Therefore, competing risks and concordance statistics provide a robust and unbiased approach for evaluating hypotheses at the single-cell level.


Application of competing risks statistics to cell lifetime data
There is a substantial amount of literature describing the application of competing risks survival analysis to clinical trial data [1][2][3][4][5] . This has been driven by the need for tractable and rigorous methods to analyse patient-lifetime data, since the results of such analyses inform epidemiological-based health policy. While there are several commonly used survival statistics used by clinicians, including Kaplan-Meier analysis and Cox regression, it has been clearly demonstrated that such approaches will yield severely biased results in the presence of competing risks 4,6 .
Competing risks (CR) are defined as any mutually exclusive fate outcomes that may occur during an individual's lifetime. An individual's lifetime is said to be 'right censored' if none of the competing outcomes are observed during the observation period. In a clinical trial, competing risks occur when there is more than one clinical endpoint for a patient (e.g., response to therapy, death from unrelated causes). Importantly, clinicians utilise CR analysis to quantify which treatment conditions may increase the probability of patient survival, or may lead to a higher probability of death. The scope of this paper does not include a detailed review of CR statistics, and therefore readers are encouraged to read the reviews highlighted above to understand the fundamental importance of these statistics in the clinical context.
In this manuscript, we adapt CR statistics to analyse cell lifetime data which are analogous to patient-lifetime data. An individual cell may have mutually exclusive fate outcomes (e.g. division and death); and a cell's lifetime may be right-censored (e.g. because it was lost during tracking or because its final fate was not observed before the end of the observation period).
In addition to using CR analysis to determine the effect of different treatment conditions on patient outcomes, Scheike et al. also developed CR concordance statistics to study the effect of inheritance on lifetime events in monozygotic and dizygotic twins 7 . For example, it was found that there was strong concordance in lifetime events such as menopause and breast cancer in monozygotic, but not dizygotic twins. Similarly, in this manuscript CR concordance statistics are applied to study the effect of inheritance on fate outcomes in cellular kin, i.e. the equivalent of monozygotic and dizygotic twins. Interestingly, single cell pedigrees provide an opportunity to study the effect of inheritance on fate outcomes over multiple generations, i.e. as genetic similarity progressively diverges. Such analyses provide an opportunity to determine the strength of latent determinants of cell fate by studying concordance in siblings, 1 st cousins, and 2 nd cousins, respectively.
The following sections describe the CR statistical methods that are applied in this manuscript. These include a) estimation of empirical probability of a competing risk; b) testing group difference; c) competing risks regression; and d) cell clustering and concordance analysis.
All of the described CR statistics are implemented using R-studio statistical software. For a detailed explanation of competing risks statistics and implementation in R see Scheike et al. 8 , as well as the R packages required for implementation of CR statistics including timereg 9 , survival, and mets. All of the analysis code and data presented in this manuscript are available at the following GitHub repository https://github.com/Jamcor/crpaper. Any questions regarding code, analysis methods, and data format may be directed to the corresponding author.

a) Estimation of empirical probability of a competing risk
The cumulative incidence for competing risks was first described by Gray 10 . Cumulative incidence is an empirical estimate (non-parametric) of the probability ( ) that a cell fate outcome (e.g. division or death) has occurred by a specified time or cell age( ). The cumulative incidence function (CIF) for event of type in group is defined as: where is the event time for individual in group , and indicates the event type . Observed experimental outcomes and are random variates. The cuminc function in R was used to estimate empirical cumulative incidence functions.

b) Testing group difference
To test if there is a difference between experimental groups indexed by ∈ {1, … , } Gray developed a K-sample test for comparing the cumulative incidence of a competing risk. The null hypothesis for this test states that groups are independent and identically distributed ( 0 : 1 = 2 = ⋯ = ), though risks need not be independent as previously posited 11 . For example, we used this test to determine the effect of p53 mutation on the resistance of breast cancer cell lines to chemotherapy (Supplementary Text 2); competing event types were division ( = 1) and death ( = 2), and the index = 1 … denoted the chemotherapy treatment groups (Dox and Nut) and genotype (WT or MUT p53). The ksample function in R was used to test for group differences as described above.

c) Competing risks regression (CRR)
The development of non-parametric and semi-parametric CRR models followed the method described by Schieke and Zhang 12 . The goal was to select CRR models that are parsimonious (lowest number of estimated coefficients) and give rise to monotonically increasing CIFs. Therefore, semi-parametric models were preferable because fewer coefficients require estimation compared to non-parametric models. The development of semi-parametric CRR models was an iterative approach that required trial and error to select an optimal link function for the model. The comp.risk function was used to generate CRR models. Scheike and Zhang 8 described the development of flexible CRR models using the comp.risk function (contained within the timereg package). This manuscript applied the same approach for developing CRR models to analyse the effect of intrinsic and extrinsic factors on single-cell fate outcomes.
Briefly, semi-parametric models were fitted by 1) commencing with a nonparametric model and an arbitrary link function; 2) testing each covariate for timeinvariant effects (using the Kolmogorov-Smirnov test included within the output of the comp.risk function); 3) if non-parametric covariates were time invariant (p>>0.05) they were substituted with parametric covariates (see below); and 4) steps 2-3 were repeated using different link functions (proportional, additive, Fine and Gray) to find the link function that maximised the number of parametric terms; while checking that the CIF is monotonically increasing (sometimes the additive link function gave rise to decreasing CIFs). For readers with a background in statistics, a mathematical description of this procedure is provided below.
Fine and Gray developed a proportional hazards model to quantify the effect of one or more covariates on competing risks 13 . For cause = 1, the cumulative incidence function is conditional on covariates, , a vector denoting one or more cell extrinsic or intrinsic covariates that may influence cell fate: Scheike developed a goodness-of-fit procedure based on a statistical test for the hypothesis that non-parametric coefficients for the regression model are constant over time. Starting with a fully non-parametric model where = , the number of estimated coefficients and model complexity is reduced iteratively by substituting the ℎ nonparametric term ( ) with a parametric term (multiplicative) or (additive) if ( ) is considered to be constant over time. A constant (time invariant) non-parametric effect is identified by testing the null hypothesis by the Kolmogorov-Smirnov test ( ( ) = ; multiplicative model or ( ) = ; additive model). The output of the R function comp.risk (timereg package, version 1.7.0, 9 ) displayed the result of this test as the 'Test for time invariant effects', where the probability value is given by p-value H_0:constant effect. If the test is non-significant (p>0.05) one assume that the null hypothesis is true, and the model can be simplified by assuming a parametric term for covariate j.
The significance of non-parametric covariates was tested using the Supremumtest of significance and was output by the comp.risk function under the heading 'Test for non-significant effects' (i.e. p<0.05 taken as a significant effect). Parametric covariates, their standard error and level of significance are output if the model has parametric terms.
The goodness-of-fit of the semi-parametric CRR model was also inspected by comparison with the non-parametric model, and the empirical Gray subdistribution for those covariates where possible (cuminc function, cmprsk R package) 10 .
Competing risks regression (CRR) models are constructed using R shorthand notation that include specific terms. Firstly, a baseline CIF is indicated by the presence of a number one (1) to the right of a tilde (~). To the right of the baseline CIF any number of terms may be included in the model separated by a plus sign (+). Parametric terms are identified by wrapping them with const(). E.g. if A was a parametric term it would be included as const(A) in the model. Models can also include R shorthand notation to represent the nth order interactions ( = 1,2, 3):
After construction of the model terms the comp.risk function requires selection of a link function (as described above) and specification of the fate outcomes being evaluated (specified by a coded cause number). Link functions may either additive (additive), proportional (prop), or a Fine and Gray (fg). In all the models presented here the fate outcomes (causes) are coded as 0 for right-censored, 1 for division, and 2 for death.

d) Cell clustering and concordance analysis
Scheike at al. proposed a cross-odds ratio (COR) function to measure the association of cause-specific event times within a cluster 7 . In the cellular context, a cluster is defined as any pair of cells that share a common ancestor and are separated by the same familial distance, i.e. are in the same generation ( Figure S1). For example, clusters included sister pairs, mother and daughter pairs, and cousins (1 st , 2 nd , 3 rd , etc). Importantly, when evaluating concordance for kinship pairs multiple counting of the same cell was prevented by randomly sampling kinship clusters without counting the same cell twice.
The COR was used to determine if a cell's fate is correlated with the fate of the other cell within the cluster (i.e. does the fate outcome for one sibling in a sibling cluster depend on the fate of its sibling). Therefore, the COR can provide evidence that a cell's fate is influenced by heritable determinants of cell fate. If cell fates are dependent within a cluster, a significant value of the cross odds ratio indicates whether these fates are concordant (symmetric) or discordant (asymmetric).
The odds of event A is defined as the probability of event A divided by the probability of its complement: The conditional odds of event A given B is defined And the cross odds ratio is

Development of semi-parametric models to study BC cell line chemotherapy resistance
The comp.risk function from the timereg package 12 in R was used to fit nonparametric and semi-parametric regression models to BC cell lifetime data. Development of semi-parametric CRR models was desirable because they estimated coefficients (one for each covariate) that described how the probability of a cell fate outcome was modified by a covariate 12 .

a) Development of semi-parametric CRR models
Briefly, semi-parametric models were fitted by 1) selecting a link function; 2) testing each covariate for time-invariant effects (using the Kolmogorov-Smirnov test); and 3) estimating parameters for time-invariant effects and empirical coefficients. The link function scaled the empirical baseline CIF to link it to other CIFs using estimated parametric and empirical coefficients. Constant coefficients were estimated for covariates that passed the time invariance test (p>0.05), otherwise an empirical function was used to model time-variant effects. For example, a semi-parametric CIF for division was estimated using an additive link function, effects of the covariates WT, WT:Nut, and WT:Dox were represented by constant coefficients (Table S1-S3). Effects of Nut and Dox were estimated using empirical functions because they failed the test for time invariance (p<0.05). Goodness-of-fit was assessed by comparing semi-parametric and non-parametric CIFs (Fig. 2c- where const( ) is used to denote parametric covariates. The estimated coefficients for division are shown in Table S3.
This model was then used to obtain CIFs for any factor, or combination of factors, by adding their effects to the baseline CIF. For example, non-parametric CIFs for division and death of WT cells exposed to Nut were predicted by substituting WT = 1, Dox = 0, Nut = 1, WT:Dox = 0 (representing the interaction between WT and Dox), and WT:Nut = 1 into the model.

CR and concordance probability analysis of GMP data
To study concordance in the fate of cellular kin in GMP and their differentiated progeny we applied CR concordance probability statistics (see Text 1c).
First we constructed a non-parametric CRR model to test for the effects of growth factor (GF) treatment on the probability of division and death for GMPs treated with either MCSF or GCSF. The non-parametric CRR model is shown in equation (S3.1).

Equation S3.1
( ℎ)~1 + + : + In this model we found that the GF term had no effect on division or death of GMPs until they differentiated into either macrophage or granulocyte progeny. Therefore, the GF term was excluded from the model.
We then investigated concordance in fate for sibling, mother-daughter, 1 st cousin, and 2 nd cousin clusters by constructing CRR models that included a term for each kinship cluster. See Table 2 in the main text for a description of the CRR models used. The cross-odds ratio (COR) was calculated using the cor.cif (contained within the mets R package) function using the symmetry condition (sym=1). The pseudo-code below shows the implementation of the cor.cif function in R.
cor.cif(cif = cif.model.division, data = tempdata, cause1 = 1, cause2 = 1, sym = 1) where cif was the output from the CRR models (Table 2, main text), data contained cell lifetimes for division, death, and right-censoring. cause1 and cause2 were the fate outcomes of interest for each cell in the kinship cluster, respectively. For example, cause1=1 and cause2=1 allowed for the cor.cif function to compute the COR for sibling 1 division and sibling 2 division. The same method was used to estimate COR for mother-daughter pairs, 1 st cousins, and 2 nd cousins.
Yule's Q was used to quantify concordance in cell fate for undifferentiated (GFP -) GMPs. The function Yule from the R package psych was used. To determine correlation in time to division for GFPcells the ICC and PCC were calculated. The function ICCest from the package ICC and the function cor from the package stats were used. The results for each of these tests are shown in Table S4. While these tests are routinely used to quantify association in cell fate outcomes, as shown in Table S4 they exclude large amounts of cell lifetime data.

Factorial experimental design to test effects of cytokines on cCFU-F growth dynamics
To The effect of cytokines and their interactions on cell population data (CFU frequency and population doublings) was calculated from their contrasts and ANOVA as described in Montgomery's text 14 . CR regression analysis in section was used to estimate the effect of growth factors and their higher order interaction on the CIF for division or death. The regression model uses R shorthand notation to represent the nth order factor interactions ( = 1,2, 3):

a) Factorial analysis of colony formation and proliferation data
Main effects are defined as a change in response (e.g. cell number, colony number, rate of division or death) as a consequence of a change in the level of a factor (e.g. an increase in FGF from 0ng/ml to 4ng/ml). As outlined above a two-level, full factorial design experiment measures the response and effect of all factor combinations. The main effect of factor A is calculated by subtracting the responses where factor A is absent from the responses where factor A is present. Interaction effects (e.g. interaction of factor A and B) are defined as the average difference between the effect of A at the high level of B and the effect of A at the low level of B. Therefore main and interaction effects were calculated using contrasts which can be defined for k factors using the formula, and y ij is the response of the ith cytokine combination and the jth replicate and ̅ is the average response. Error sum of squares may be calculated from the difference between total sum of squares and sum of squares of effects. The F distribution and statistic were used to test for significance in treatment effects.

= … 2
Where each effect has a single degree of freedom and the mean squared error 2 = / 2 ( − 1) has 2 ( − 1) degrees of freedom. The null hypothesis (of no effect) was rejected if the probability of the effect was less than 0.05. Table S5) were tracked for a period of 96 hours. For generation 0, the CR regression model for division is shown in equation (S5.1).

Cells exposed to the conditions outlined in a factorial design experiment (Supplementary
The R output for this non-parametric CR regression model (using the additive link function) is summarised in Supplementary Table S6 and Supplementary Table S7. The effect of single growth factors on cells (generation 0) is shown in Supplementary Figure S5a as a quantile plot. These effects are consistent with empirical CIFs (produced using cuminc) for separate experimental groups with various growth factor combinations as shown in Supplementary Figure S5b.

Supplementary
A similar approach was used to construct a semi-parametric CRR model for cells whose birth was observed (generation > 0):

division CIF (generation > 0)~1 + const(TGF) + const(PDGF) + const(FGF)
In this model the Fine and Gray link function was used. All model parameters were found to be time invariant (data not shown), however, higher order interactions were not included in this model because they were not significant (data not shown). The model showed that FGF and PDGF had significant effects on division of generation > 0 cells, while TGF did not (Supplementary Table S9). The CIFs for this model are shown in Supplementary Figure S5c. The results from this model showed that TGF-β1 was the only growth factor that increased the rate of cell death (Supplementary Figure S5d and Table S10).

b) Interaction effect of Pdgfra-GFP expression and cytokine stimulation on division and death
This section describes the development of CRR models used to study the interaction of Pdgfra-GFP expression with cytokine treatment. Firstly, the fidelity of GFP as a reporter for PDGFRα expression and mitogenic activity was tested using a CRR model that included a term for GFP expression, as shown in equation (S5.5). GFP expression was measured at the end of a cell's lifetime, i.e. before division, death, or right-censoring (this is referred to as GFPAtDeath in the model below). GFP expression was a continuous variable that represent raw measurements of pixel intensity (arbitrary fluorescent units) after image processing (see Materials and Methods). The model also included terms for treatment with TGF-β1, PDGF or FGF.
Supplementary Table S12 | Results from the semi-parametric model (shown in equation (S5.5)) describing the effect of cytokines and GFP expression (at the end of a cell's lifetime) on cCFU-F division for generation > 0 cells. A similar model was developed to investigate the effect of GFP expression as measured at cell birth (GFPAtBirth), shown in equation (S5.6).
Supplementary Table S13 | Results from the semi-parametric model (shown in equation (S5.6)) describing the effect of cytokines and GFP expression (at cell birth) on cCFU-F division for generation > 0 cells. We also developed a CRR model used to quantify effects of GFP expression and cytokine treatment on cell death, shown in equation (S5.7).

Equation S5.7
death CIF ~1 + const(TGF) + const(PDGF) + const(FGF) + const(GFPAtDeath) + const(GFPAtDeath * TGF) + const(GFPAtDeath * PDGF) + const(GFPAtDeath * FGF) The results produced by this model showed that nuclear GFP intensity at the end of a cell's lifetime was not related to the rate of cell death (Supplementary Table S14). In order to classify cells as GFP + and GFPwe established a threshold for GFP intensity below which there was no effect of PDGF. Using this method, a threshold of 100 fluorescent units defined PDGFRα + (GFP + ) and PDGFRα -(GFP -) cell subsets. We fitted a non-parametric CR regression model, shown in equation (S5.8), using this classification for PDGFRα expression. In this model, GFP is a categorical variable (GFP=0 represented GFPcells and GFP=1 represented GFP + cells).

Equation S5.8 division CIF ~ 1 + (GFP + PDGF) 2
The terms in the model were tested for non-significant effects (Supplementary  Table S15) and time-invariance with respect to the covariates GFP, PDGF and PDGF:GFP (Supplementary Table S16) which enabled the construction of a semiparametric model, shown in equation (S5.9).

Equation S5.9
division CIF ~ 1 + (const(GFP) + const(PDGF)) 2 The semi-parametric model showed that GFP and PDGF alone did not have significant effects on the probability of division, however, the interaction GFP:PDGF was significant (Supplementary Table S17). The CIFs derived from this model are shown in Figure 4d.

c) Effect of Pdgfra-GFP expression and cytokine treatment on renewal of GFP + cells
To study self-renewal of GFP + cells we divided possible fate outcomes into four CR categories.
These categories were used to identify the fate outcome of interest in the CRR model shown in equation (S5.10).

division CIF ~ 1 + const(PDGF) + const(FGF)
The CIF for GFP + cell division given PDGF and FGF (CR category 1) in generation > 0 was modelled using the single factors PDGF and FGF, and the Fine and Gray link function. Simulated CIFs are shown in Figure 4f (green line) and results for parametric terms are shown in Supplementary Table S18.
The CIF for PDGFRα -(GFP -) cell division (CR category 2) in generation > 0 was modelled using the single factor FGF (knowing that PDGF had not effect on GFPcells) and the Fine and Grey link function. The model is shown in equation (S5.11).

d) Generation of GFP + and GFPdaughters from GFP + and GFPmothers
To study the renewal of GFP + and GFPcells, mothers and daughters were gated for their GFP status at the end of their lifetime (i.e. before death, division, and rightcensoring). Using these classifications, mother-daughter fate outcomes were divided into the following categories: 1) GFP + daughters derived from GFP + mothers 2) GFPdaughters derived from GFP + mothers 3) GFP + daughters derived from GFPmothers 4) GFPdaughters derived from GFPmothers These CR categories were used to identify fate outcomes of interest in a CRR model that included terms for PDGF and FGF, shown in equation (S5.12).

Equation S5.12
division CIF ~ 1 + const(PDGF) + const(FGF) The Fine and Gray link function was used in this model. FGF and PDGF promoted renewal of GFP + daughters from GFP + mothers (category 1). The simulated CIF for the generation of GFP + daughters from GFP + mothers is shown in Figure 4H (green lines). Results for parametric terms are shown in Supplementary Table S19.
Both PDGF and FGF did not increase the probability of generation of GFPdaughters from GFP + mothers (category 2). The simulated CIF for the generation of GFPdaughters from GFP + mothers is shown in Figure 4h (black lines). Results for parametric terms are shown in Supplementary Table S19. bFGF and PDGF did not promote renewal of GFPdaughters from GFPmothers (category 3). The simulated CIF for the generation of GFPdaughters from GFP + mothers is shown in Figure S7c (green lines). Results for parametric terms are shown in Supplementary Table S19. Both PDGF and FGF did not increase the probability of generation of GFP + daughters from GFPmothers (category 4). The simulated CIFs for the generation of GFP + and GFPdaughters from GFPmothers are shown in Fig. S8c (black lines), and results for parametric terms are shown in Supplementary Table S19. The cross-odds ratio (COR) was calculated using the cor.cif (contained within the mets R package) function using the symmetry condition (sym=1). The pseudo-code below shows the implementation of the cor.cif function in R.

Supplementary
cor.cif(cif = cif.model.division, data = tempdata, cause1 = 1, cause2 = 1, sym = 1) where cif was the output from the CRR model above, data contained cell lifetimes for division, death, and right-censoring. cause1 and cause2 were the fate outcomes of interest for each sibling, respectively. For example, cause1=1 and cause2=1 allowed for the cor.cif function to compute the COR for GFP + divisions, as defined in the categories described in Text 5c.
Below is the output from the cor.cif function, which showed that GFP + siblings had significantly symmetric fates (COR>>1, p<<0.001). In order to visualise the symmetry in GFP + cell fates the probandwise concordance probability was calculated using the concordance function (contained within the mets R package). Conditional and unconditional probabilities are plotted in Figure 5e.

f) Flow cytometry analysis of MSC surface markers
Flow cytometry analysis of MSC surface markers was used to test for expression of canonical MSC markers. This is a routine quality control assay for MSC cultures, as described by the International Society for Cell Therapy (ISCT) MSCs. Table S22 shows that cCFU-F cultured in SFM expressed MSC surface markers indicative of a high quality MSC culture. Furthermore, cCFU-F did not lose expression of the stem cell marker SCA-1 nor gain expression of the endothelial marker CD31.

Permutation and randomisation tests for cCFU-F sibling cell fate outcomes
Permutation tests were used to quantify similarity in cCFU-F sibling cell cycle times. To test for similarity in sibling cell cycle times we calculated the mean difference in cycle time between sibling cell pairs. For each sibling pair the difference in cycle time was calculated, and then averaged. This average was compared to the mean difference in cycle time between randomly sampled cell pairs. 10,000 random permutations of the data set were generated using Monte Carlo simulation by randomly sampling cell pairs (sibling pair label shuffling, without replacement) from the pooled cell population. For each random permutation the mean difference in cycle time across all cell pairs was calculated. A two-tailed, unpaired Student's t-test (α=0.05) was used to determine if the observed mean difference in sibling cycle times was significantly different to that of random permutations.

Supplementary Figures
Supplementary Figure S1 | Single cell pedigrees showing kinship clustering. Kinship clusters are pairs of related cells within a pedigree including mother-daughter clusters, sibling clusters, 1 st cousin clusters, and 2 nd cousin clusters. Kinship clusters at different depths within a single cell pedigree have different degrees of familial and temporal separation.