A dynamical systems treatment of transcriptomic trajectories in hematopoiesis

ABSTRACT Inspired by Waddington's illustration of an epigenetic landscape, cell-fate transitions have been envisioned as bifurcating dynamical systems, wherein exogenous signaling dynamics couple to the enormously complex signaling and transcriptional machinery of a cell to elicit qualitative transitions in its collective state. Single-cell RNA sequencing (scRNA-seq), which measures the distributions of possible transcriptional states in large populations of differentiating cells, provides an alternate view, in which development is marked by the variations of a myriad of genes. Here, we present a mathematical formalism for rigorously evaluating, from a dynamical systems perspective, whether scRNA-seq trajectories display statistical signatures consistent with bifurcations and, as a case study, pinpoint regions of multistability along the neutrophil branch of hematopoeitic differentiation. Additionally, we leverage the geometric features of linear instability to identify the low-dimensional phase plane in gene expression space within which the multistability unfolds, highlighting novel genetic players that are crucial for neutrophil differentiation. Broadly, we show that a dynamical systems treatment of scRNA-seq data provides mechanistic insights into the high-dimensional processes of cellular differentiation, taking a step toward systematic construction of mathematical models for transcriptomic dynamics.

... In this model, δg is the deviation of the gene expression vector, ⃗ g from the fixed point, ⃗ g * . (Right) Snapshots of a collection of particles at steady state following the dynamical process defined by x˙ =−2x 3 + x/2 + a and uniformly sampled noise for a = −1 (left), a = 0 (center) and a = 1 (right).         Appendix S1 1 The relationship between the Jacobian and the Covariance Here, we outline the methodological framework that enables characterizing cell-fate transitions directly from scRNA-seq snapshots of a cell's transcriptomic state. A scRNA-seq measurement yields a transcriptomic matrix, where each row is a different cell and each column is a different gene (Fig. S1, left). This data is often visualized via dimensionality reduction algorithms, that reduce the 25, 000 dimensional gene space to two or three axes of variation, and sorted via parametric curve fitting tools, that show how the cells vary as a function of a control parameter, such as developmental time (pseudotime). Thus, one can compute statistics, such as the covariance C of the genes at a given pseudotime window.
Assuming that the underlying biochemical processes (1) are stochastic and Markovian and (2) occur at significantly faster timescales (seconds to minutes) than the timescales over which transitions in cellular fates are observed (hours to days), then the local time evolution of a cell's transcriptomic profile is controlled by a single matrix, the Jacobian (J), where J ij = ∂ġ i /∂g j is the effect of the amount of gene j on the dynamics of gene i (Fig. S1, center). While J, in general, changes with pseudotime, it relates to the covariance of gene expression at that pseudotime C through the continuous-time Lyapunov equation [4], where D is the expected noise amplitude for individual genes and their interactions (derivation in Methods: Continuous time Lyapunov equation for transcriptomic matrices) [5]. An important result from this relationship is that in the vincinity of bifurcations, the most salient properties of J, corresponding to its eigen-decomposition, are inferrable from the eigendecomposition of C.
We demonstrate the intuition behind the Eqn. S1 using a one-dimensional toy-model (Fig. S1, right). The slope of the potential function, drawn in red, provides the deterministic features of the system's dynamics. Parameter regimes (a ≪ 0 and a ≫ 0) where the potential has highly convex curvature exhibit stable fixed points, while parameter regimes near the bifurcation (a ∼ 0), that have much flatter curvature, exhibit instability. Stochastic simulations of the system (drawn as open circles in Fig. S1-color corresponds to the value of the control parameter) demonstrate that owing to the reduction in curvature of the underlying potential, the data is spread maximally near the bifurcation, and narrows on either side of it.
This simple one-dimensional toy model captures the essence of the ideas used in this paper. If a complex high-dimensional dynamical system undergoes a bifurcation, then in its vicinity there must be, by definition, some direction in the highdimensional space with greatly enhanced fluctuations. Thus bifurcations, and regions of multistability, can be located by finding the points along a developmental trajectory in transcriptomic space where the covariance eigenvalue spectrum is dominated by a single principle mode. Moreover, the direction of those fluctuations (the corresponding covariance eigenvector) is equivalient to the soft direction along which the system bifurcates (the corresponding eigenvector of the Jacobian), even in the 25, 000 dimension transcriptomic space.

Methodological relationship to Dynamical Network Biomarkers
Chen et al. [6], developed the concept of a dynamical network biomarker (DNB), a group of genes that drive a critical transition and are detectable from high dimensional gene expression datasets. In particular, they define an indicator function where SD d is the average standard deviation of genes in the DNB, P CC d is the average correlation coefficient between genes in the DNB, and P CC o is the average correlation coefficient between genes in the DNB and genes outside the DNB [6]. At a critical state transition, or bifurcation, I is predicted to diverge, because SD d and |P CC d | become large, while |P CC o | becomes small. Mathematically, the genes in the DNB correspond to those that have non-zero weight in the direction of the transition, i.e., ⃗ p i d ̸ = 0, where ⃗ p d is the principal eigenvector of the Jacobian, while genes outside of the DNB have ⃗ p i d = 0. This prediction is qualitatively similar, but not the same as Eqn. 2. In particular, while both SD d and ω 1 increase at a bifurcation, they are not equivalent, as SD d measures the variance of each individual gene, while ω 1 measures the variance across all genes, and therefore accounts for corrections to the total variance due to covariances between genes in the network. Therefore, for bifurcation detection, we focus solely on ω 1 , instead of incorporating correlations into the indicator as in Eqn. S2.
As for determining which gene relationships are critical for the bifurcation, we take a similar approach to Refs. [6,7], in focusing on the correlations that approach ±1 at the bifurcation. This is justified via Eqn. 4, which yields that R ij → ±1 if ⃗ p i d ̸ = 0 and ⃗ p j d ̸ = 0. Interestingly, while we derived Eqn. 4 via the eigendecomposition of the covariance matrix, Refs. [6,7] derived the same result form the covariance matrix itself, providing additional support to this method.

Bifurcations possibilities from two mutually inhibiting genes
At steady state, Eqn. 5 satisfies the quintic polynomial which, depending on the parameter values, can have one real solution that is an attractor (e.g., if m 1,2 = 1 and k D = 1) or three real solutions, two attractors (nodes) and one repellor (saddle) (e.g., m 1,2 = 1, k D = 1/3). By examining the null clines, it can be deduced that varying m 1 , while fixing τ and m 2 can yield a saddle-node bifurcation, as Eqn. S4 moves vertically while Eqn. S5 does not, allowing for either node to merge with the saddle (Fig. S2A). Conversely, varying k D , while fixing m 1,2 and m 2 , can yield a pitchfork bifurcation, as both null clines move, such that above the bifurcation value, all three real solutions remain (Fig. S2B). Solving Eqn. S3 computationally via the Python function numpy.roots and plotting the real solutions ( Fig. S2C-D) yields the bifurcations used in Fig. 2 and Fig. S5 [8].

Resampling principal eigenvalue
Given the transcriptomic matrix . . , G nc,i } and G i,j is the expression of the j th gene in the i th cell, we generate a null sample G null by drawing each of its entries G null i,j randomly, with replacement, from ⃗ g i . In Fig. S10Fig. 2,Fig. 3, we compute the principal covariance eigenvalue ω null 1 for each of n s = 20 samples, and compare this null distribution against the principal covariance eigenvalue of the data ω 1 . This resampling technique has little impact on ω 1 for unimodal distributions as the scale of ω 1 is still determined by the system's noise (Fig. S3 left and  right), but significantly decreases ω 1 for multimodal distributions (Fig. S3 center) since the structure of the multimodality is scrambled; thus we found it was an effective method for determining if a spike in ω 1 is due to multimodality or increased noise.

Noise induced transitions
To determine if a non-bifurcating noise-induced transition model [9,10] could yield a similar covariance eigenvalue signature to a bifurcation, we ran the 102 gene network model ( Fig. 2A) in a regime of the dynamical system that had two fixed-points (m 1,2 = 1, k D = 1/3) at varying noise scales s (see Fig. S2 and Eqn. 22 for details). To ensure a transition, we initialized all cells to populate the fixed point with higher g 1 . We found that for low noise values (1/s ≤ 0.01) the cells stayed near their initial fixed point, yielding a unimodal distribution for g 1 (Fig. S6A) and low principal covariance eigenvalue (Fig. S6B) while for high noise values (1/s ≥ 0.02) the cells visited both fixed points, yielding a bimodal distribution for g 1 , and a high principal covariance eigenvalue.

Effect of small errors
To better understand why the difference between ω 1 and its corresponding null was significantly more apparent at τ m than τ d (Fig. 3C), we exmained how small errors in the model parameters effect bifurcations. Specifically, we simulated the GRN model (Eqn. 5) with different amounts of error in other parameters. For the saddle-node bifurcation, in which m 1 is varied while τ D and m 2 remain fixed, we perturbed m 2 by small amounts from its bifurcation value m 2c = 3. We found (Fig. S8A) that the bifurcation was still largely detectable, and its eigenvalue still well distinguished from its null (Fig. S8B), at these small errors. For the pitchfork bifurcation, in which k D is varied while m 1 and m 2 remain fixed, we perturbed m 1 by small amounts from its bifurcation value of m 1c = 1. In this case, we found that the small perturbations biased the bifurcation toward one of the branches (Fig. S8C). This bias significantly reduces the difference ω 1 and its corresponding null (Fig. S8D). Our analysis suggests that small errors in the one-to-many bifurcating dynamical systems that appears present at τ D may prevent it from being easily detectable, even when similar sized errors do not obscure the one-to-one bifurcation at ω 1 . links_state_to_fate_during_differentiation. The algorithms to generate these values are described in detail in Ref. [3] (Supplementary Materials) and recapitulated here for completeness. Given the full in-vitro hematopoiesis transcriptomic matrix (all cells and all genes), the SPRING positions in Fig. 3A plot were generated using the following procedure.
2. The top 50 principal components (PC) of the filtered transcriptomic matrix were computed.
3. 40,000 of the cells were selected randomly, and a k-nearest-neighbors (KNN) graph between those cells was constructed using the top 50 PC of the filtered transcriptomic matrix and k=4.
4. X-Y positions of these 40,000 cells were generated using the ForceAtlas2 algorithm with 500 steps [11]. Cells were annotated with their cell types (cluster annotation in Fig. 3A) based on their position in the SPRING plot and their expression (terminal cell fates) or lack of expression (pluripotent) of pre-selected marker genes. Specifically the marker genes used to determine if cells were neutrophils were S100a9, Itgb2l, Elane, Fcnb, Mpo, Prtn3, S100a6, S100a8, Lcn2, and Lrg1.
Neutrophil pseudotime rank was then determined by smoothly interpolating between cells in the pluripotent and neutrophil clusters. The interpolation method used throughout this procedure is an iterative, diffusive process defined as where ⃗ x i is a vector quantity defined for cell i, X = {⃗ x 1 , ⃗ x 2 , . . . , ⃗ x nc } is the matrix of this quantity for all cells, K k (i) are the cell indices of the k nearest neighbors of cell i, n > 0 is the number of iterations, and b is the neighbor weight (low b and high n both yield high diffusion) [12]. The pseudotime ranking procedure is: 1. Cells are identified to be part of the neutrophil trajectory  (g) Cells were considered part of the neutrophil trajectory ifĉ i > Q 0.6 (ĉ) where Q 0.6 (ĉ) is the 60 th percentile ofĉ. To test if the neutrophil bifurcation characterization was dependent on the choice of pseudotime algorithm, we used the Slingshot algorithm [2] to compute the pseudotime of each cell for its trajectory from the undifferentiated cluster to each of the terminal fate clusters. The input to Slingshot were the cells' cluster labels and their SPRING coordinates, and the output was a probability, or weight, that a cell belonged to each undifferentiated-to-terminal-fate trajectory, as well as its pseudotime along that trajectory. In Fig. S11A, we show the pseudotime of all cells that had weight > 0 for belonging to the trajectory that led from undifferentiated cells toward neutrophils. Unlike the pseudotime method described in Section 7.1, the origin of the trajectory does not coincide with the earliest sequenced cells, as time of sequencing and clonal barcode data could not be input to Slingshot. Nevertheless, we obtain a clear bifurcation signature in the principal covariance eigenvalue (Fig. S11B) at the point where promyelocyte gene expression decreases to 0 and myelocyte marker gene expression become maximal (Fig. S11C). This result supports our belief that the bifurcation characterization does not depend on the specific pseudotime calculation.

Determining the eigenvectors for analysis
In order to analyze the neutrophil trajectory in a native-space, we chose eigenvectors that were characteristic of the dynamics. Since τ m coincides with a well defined eigenvalue peak in the neutrophil trajectory, it was natural to use ⃗ s(τ m ) to aid in visualizing the trajectory and further probe mechanisms. However, τ = 0 and τ d coincide with transition points between states (Fig. 5B), and mark the beginning of specific dynamics (i.e., the eigenvalue remaining constant, or increasing), and it the lower correlation on the edges of the blocks in Fig. 5B suggests that the eigenvectors at those points had not equiilibrated to their new positions. Therefore, we defineτ 0 andτ d as the pseudotime bins with the eigenvector closest to the eigenvector at all other pseudotimes in that range, i.e.,τ 0 = arg min and use these pseudotimes for downstream analysis.

Identifying clusters via Gaussian Mixture Models
As the distribution of gene expression projected onto ⃗ s(τ m ) exhibited bimodality (Fig. 5E, Fig. S12B), we used a Gaussian Mixture Model to separate the two modes. Specifically, we fit G(τ m ), the normalized gene expression matrix at τ m to a two component Gaussian Mixture Model using the mixture.GaussianMixture function from the Python package scikitlearn with n components = 2 and all other parameters set to their default [13]. We then used the predict function of our trained model to generate cluster labels for cells at all pseudotimes. We found that cells were predicted to belong to the same cluster (GMM-a) for τ ≲ τ d (purple in Fig. 5F and Fig. S14). For τ ≳ τ d , cells were split between the two clusters (red and blue in Fig. 5F and Fig. S14).