Tipping points in epithelial-mesenchymal lineages from single-cell transcriptomics data

Understanding cell fate decision-making during complex biological processes is an open challenge that is now aided by high-resolution single-cell sequencing technologies. Specifically, it remains challenging to identify and characterize transition states corresponding to “tipping points” whereby cells commit to new cell states. Here, we present a computational method that takes advantage of single-cell transcriptomics data to infer the stability and gene regulatory networks (GRNs) along cell lineages. Our method uses the unspliced and spliced counts from single-cell RNA sequencing data and cell ordering along lineage trajectories to train an RNA splicing multivariate model, from which cell-state stability along the lineage is inferred based on spectral analysis of the model’s Jacobian matrix. Moreover, the model infers the RNA cross-species interactions resulting in GRNs and their variation along the cell lineage. When applied to epithelial-mesenchymal transition in ovarian and lung cancer-derived cell lines, our model predicts a saddle-node transition between the epithelial and mesenchymal states passing through an unstable, intermediate cell state. Furthermore, we show that the underlying GRN controlling epithelial-mesenchymal transition rearranges during the transition, resulting in denser and less modular networks in the intermediate state. Overall, our method represents a flexible tool to study cell lineages with a combination of theory-driven modeling and single-cell transcriptomics data.


Jacobian of the interacting RNA splicing model
Starting from eqs. (1), we rewrite the interacting RNA splicing model in terms of the functions  !,  !defined as: Where we have assumed  !=  = 1 as explained in the main text,  is number of interacting RNA species, and the coefficients  !" quantify the regulation of species  in species .The Jacobian matrix is defined as: Therefore, for a system of  interacting species, the Jacobian is a 2 × 2 matrix whereby the top right quadrant, which corresponds to the derivative of the unspliced force field with respect to spliced species (highlighted in red box), assumes the interpretation of a gene regulatory network.

Toggle switch
We generalize a minimal toggle switch design with two species (x, y) that mutually inhibit each other to include splicing dynamics while conserving the model's bistability.The temporal dynamics of unspliced and spliced versions of x and y ( & ,  ' ,  & ,  ' ) is modeled via stochastic ordinary differential equations: Where  is a basal production rate,  is a splicing rate coefficient, and  % ,  ) are Hill coefficients that determine the strength of feedback regulation.To simplify the model,  and  are assumed to be equal for the two species.Furthermore, the degradation/dilution coefficient for spliced species  & ,  ' is assumed to be 1, which can be achieved by rescaling time.Finally, the noise  () is the product of an amplitude term  and a random variable sampled from a normal distribution with standard deviation √∆, where ∆ is the timestep used for numerical solution.To induce a saddle node bifurcation, the parameter  % is gradually increased with time starting from  % = 2 to  % = 15.To generate the in silico dataset, we simulated  = 1000 trajectories of eqs.(1) using Euler-Maruyama starting from  = 0 (i.e.,  % = 2) until  = 200 (i.e.,  % = 15).From each independent simulation, a "cell" is sampled by taking the values of ( & ,  ' ,  & ,  ' ) at a randomly selected point along the trajectory.All other parameter values are fixed to  = 1,  ) = 6,  = 1,  = 1.All code was generated by the authors.

EMT circuit
We simulate epithelial-mesenchymal transition (EMT) with the EMT tristable circuit proposed by Tian and collaborators, which we previously generalized to include RNA splicing dynamics.This model uses ODEs to describe the dynamics of unspliced and spliced levels for: miR-34, miR-200, ZEB, SNAIL, external TGF-beta signal, cellular TGF-beta, E-cadherin, and Ncadherin.For each species, the unspliced equation includes basal transcription rate, feedback regulation from other species, and loss due to splicing; the spliced equation includes production due to splicing and degradation/dilution: Snail mRNA ( -,  -) and protein ( .,  .): miR-34 ( +/ ,  +/ ): Zeb mRNA ( 1 ,  1 ) and protein ( 0 ,  0 ): Parameter values were taken from the original model.The splicing rate coefficient () is assumed to be rescaled to unit value.The system of ODEs was simulated numerically using Euler-Maruyama algorithm.All code was generated by the authors.

Trifurcating circuit
We simulate  = 2000 cells in the trifurcating circuit consisting of 8 mutually regulating genes using the BoolODE package.All circuit and simulation parameter were fixed to their original value.The output of BoolODE consists of two gene expression matrices of size 8x2000 corresponding to two layers of expression, interpreted here as the unspliced and spliced RNA counts, and the simulation time associated to each cell.

Calculation and visualization of instability score
The instability score of a single cell along a lineage is defined as the average number of positive eigenvalues associated with the cell.Since a cell can appear in multiple bins along pseudotime axis (see previous section), the score averages the number of positive eigenvalues in each pseudotime bin where the cell is found.When multiple lineages are present in the dataset (see Fig. 3), the score is first computed within each individual lineage and then averaged if the cell belongs to multiple lineages.Finally, the instability score is plotted and overlayed on RNA velocity using scVelo's velocity_embedding_stream() function.

Data preparation and clustering
The unspliced and spliced scRNA-seq RNA counts for both OVCA420 and A549 datasets were taken from Cook and collaborators, and later filtered and normalized following standard procedures using scanpy.The n=50 top genes were used for Jacobian inference and downstream GRN analysis.Diffusion pseudotime for the OVCA420 dataset was computed using scanpy's built-in diffusion pseudotime function with default parameters, whereby the root cell for pseudotime calculation (i.e., the cell at pseudotime t=0) was set as a randomly chosen cell from the cell group sampled at t=0 days.Velocity pseudotime for the A549 dataset was computed using scVelo's built in function with default parameters.

Gene selection criteria
Scanpy's highly variable feature function was used to select the top 50 differentially expressed genes for both the OVCA420 and A549 cell line datasets.The set of top 50 genes used in Figure 4 are presented in supplementary table 2. Furthermore, the number of differentially expressed genes selected via scanpy's highly variable feature function was varied (N=25, 50, 75, 100) to test the method robustness (supplementary figure S5).Finally, we test the method using a usercurated list of epithelial and mesenchymal genes recovered from literature.First, we used scanpy's highly variable feature function to select the top 2000 genes in the dataset to ensure that low-count genes were filtered out.Then, we subset the dataset and kept only the epithelial and mesenchymal genes, resulting in 24 epithelial genes and 60 mesenchymal genes (supplementary table 3).

Inference of Jacobian matrix and gene regulatory network
The input for Jacobian inference includes the unspliced and spliced RNA counts as well as the pseudotime value for each cell defined in the previous section.First, cells were binned within intervals along the pseudotime coordinate.A width parameter (w) sets how wide the pseudotime bins are, whereas an increment parameter (inc) sets the forward step taken along the pseudotime axis after each inference iteration.Considering that diffusion pseudotime is typically bound in [0,1], we selected w=0.1, inc=0.05,leading to 20 equally spaced points along the lineage.We stress that a "correct" value for these parameters cannot be established a priori, and it is the result of trial and error.Second, all cells within each bin are used to infer the Jacobian matrix and gene regulatory network following the strategy highlighted in the main method section "Identification of tipping points along lineage".

Comparison with Dynamo
While our method computes an average Jacobian matrix within a specified reaction coordinate window (diffusion pseudotime in the case of Fig. 4), Dynamo reconstructs a Jacobian matrix associated to individual cells.To set up a fair comparison, we average the Jacobian matrix calculated by Dynamo over the same pseudotime inference window.Initially, we run Dynamo's Vectorfield function to compute a cell-specific Jacobian matrix.Then, we group cells based on their position along the diffusion pseudotime axis, and average Dynamo's cell-specific Jacobian matrices.We compare the methods based on different criteria for gene set selection.First, we let Dynamo's preprocessing pipeline select the genes for Jacobian inference; second, we filter the dataset object with the genes used in our calculation before running Dynamo's Vectorfield function.Third, we provide the gene set used in our calculation as input for Dynamo's Vectorfield function.In all three cases, the stability along the EMT lineage is evaluated based on both (1) The number of positive eigenvalues and (2) the value of the largest eigenvalue.

Community detection in gene regulatory network
GRN graph objects were generated using the NetworkX package.GRN communities were calculated using NetworkX's built-in functions girvan_newman() and greedy_modularity_communities() using default parameters.

Code availability
All scripts were generated in by the authors and are available at https://github.com/barcenasmanuel/Detecting_bifurcations_in_scRNA_seq_data.