DTFLOW: Inference and Visualization of Single-cell Pseudo-temporal Trajectories Using Diffusion Propagation

One of the major challenges in single-cell data analysis is the determination of cellular developmental trajectories using single-cell data. Although substantial studies have been conducted in recent years, more effective methods are still strongly needed to infer the developmental processes accurately. In this work we devise a new method, named DTFLOW, for determining the pseudo-temporal trajectories with multiple branches. This method consists of two major steps: namely a new dimension reduction method (i.e. Bhattacharyya kernel feature decomposition (BKFD)) and a novel approach, named Reverse Searching on kNN Graph (RSKG), to identify the underlying multi-branching processes of cellular differentiations. In BKFD we first establish a stationary distribution for each cell to represent the transition of cellular developmental states based on the random walk with restart algorithm and then propose a new distance metric for calculating pseudo-times of single-cells by introducing the Bhattacharyya kernel matrix. The effectiveness of DTFLOW is rigorously examined by using four single-cell datasets. We compare the efficiency of the new method with two state-of-the-art methods. Simulation results suggest that our proposed method has superior accuracy and strong robustness properties for constructing pseudo-time trajectories. Availability: DTFLOW is implemented in Python and available at https://github.com/statway/DTFLOW.

is a measure of similarity between two probability distributions. Although it has been 219 widely used in engineering and statistical sciences, this metric is a measure of 220 divergence, and does not satisfy the triangle inequality in the inner product space. Thus, 221 it is not appropriate to use this metric to calculate the distances between cells. In this 222 work, we introduce a new distance metric to measure the distance between two cells. 223 From equation (11), we obtain the distance of two cells ݅ and ݆ as 224  neighbours. In addition, we use 248 to store the reverse index ordering based on . We also use a nested list 249 prop-groups to store the candidate sub-branches/groups and a nested list sub-branches 250 to store the determined sub-branches. Initially these two nested lists are empty. The 251 major steps of this algorithm are described in Supplementary Algorithm 2 together with 252 prop-groups reaches , and the intersected list will be moved from prop-groups 267 to sub-branches to become a determined branch; otherwise, the elements of and 268 the intersected list in prop-groups will be assigned to the branches in sub-branches that 269 are closer to them. 270 Our method ensures that all the end points of sub-branches can be connected within 271 the kNN graph, and all pseudo-times of single cells in the previous sub-branch are less 272 than those in the subsequent sub-branch(es) along the development process. Thus this 273 new algorithm can provide more accurate inference results than the existing methods. 274

275
In this section, we evaluate the robustness and accuracy of our proposed algorithm 276 DTFLOW for the inference of psuedo-time ordering using single-cell data. We apply 277 DTFLOW to analyse four real scRNA-seq datasets, namely the mouse embryo dataset 278 [40], myeloid progenitor MARS-seq dataset [41], female gonad scRNA-seq dataset 279 [42], and microwell-seq data [43]. This work does not include any work for the 280 pre-processing of experimental data. We use the datasets with the same input (namely 281 the same genes and same single-cells) from the published papers directly. 282

Mouse embryo single-cell dataset 283
The high-throughput reverse transcription polymerase chain reaction (RT-PCR) dataset 284 [40] describes the early-stages of the developmental process for mouse embryo. This 285 dataset includes the expression levels of 48 selected genes in 438 single cells at seven 286 different developmental stages, namely from the 1-cell zygote stage to the 64-cell 287 blastocyst stage. We first apply DTFLOW to project the 48-dimensional gene 288 expression data into the two-dimensional feature space by using the BKFD algorithm. 289 We also test the influence of the minimal cell number ݊ required for forming a 299 sub-branch. When we set a small value (i.e.

݊ 1 1
), the individual cells in the lineage 300 process is divided into 5 sub-branches in Figure 2C. There are two bifurcation points 301 that separate cells into two distinct sub-branches along the differentiation process. 302 Figure 2C shows that the main lineage trajectory contains two major branches and one 303 of them further differentiates into two smaller branches. It also suggests that cell 304 differentiation does not occur in the early stages, but cells in the 32-cell stage 305 differentiate distinctly into trophectoderm (TE) and inner cell mass (ICM). 306 Subsequently, cells in the ICM stage further differentiate into epiblast (EPI) and 307 primitive endoderm (PE) in the 64-cell stage. After the second bifurcating event, the 308 embryo cells are divided into three distinct types: namely TE, PE, and EPI. However, if 309 we use a relatively large value of , the single cells will form only three 310 sub-branches with the first bifurcation event occurred in Figure 2D.  for forming sub-branches, 397 respectively. The first sub-branch in Figure 4C contains only a small number of cells. 398 DTFLOW ensures that the pseudo-time of each cell in the initial branch is less than that 399 of any other cells in the following sub-branches. Then cells differentiate into three 400 different terminal branches. Sub-branch two corresponds to the erythroid evolutionary 401 branch, sub-branch three is formed by cells within clusters 11 and 19, and sub-branch 402 four corresponds to the GMP branch. However, when a larger value of ݊ is used, 403 sub-branches two and three merge together and form one large sub-branch in Figure  404

4D. 405
We next compare the branching detection results of DTFLOW, Scanpy and 406 Monocle2. Supplementary Figures S6C and S7C show the branching inference results 407 of Scanpy and Monocle2, respectively. Figure S6C suggests that Scanpy is also able to 408 identify 4 sub-branches. However, the root cell identified by Scanpy is not in the initial 409 group, which is unreasonable for the developmental process. Although Monocle2 410 successfully estimates 12 states in Figure S7C in the sampled set with that of the corresponding cells in the whole dataset by using the 417 Spearman rank correlation coefficient. We conduct 50 repeated tests to measure the 418 robustness property of each method. Figure 5A shows that the robustness property of 419 DTFOLW and Scanpy is better than Monocle2. In addition, the variance of correlation 420 coefficients obtained by DTFLOW is smaller than that of Scanpy, which suggest that 421 the performance of DTFLOW is more stable than the two published methods.  Figure 5B gives the robustness property of these three 451 methods. Numerical results suggest that that the robustness properties of DTFOLW are 452 better than those of Scanpy and Monocle2. 453 reasonably. Figure 7B suggests that PCA cannot distinguish differentiation stages very 459 well. Although tSNE ( Figure 7C) and UMAP (Figure Figure 7D) can separate 460 different cell types clearly for this dataset, the distance intervals of different cells types 461 are relatively large, which cannot be used to indicate cellular developmental processes 462

Comparison of dimensional reduction algorithms
properly. 463 After the successful applications of our proposed algorithm to three relatively small 464 datasets, we next show the effectiveness of our method to a large dataset. We apply our 465 proposed method BKFD to a microwell-seq data that contains 51252 cells and 25912 466 genes [43]. After the data pre-processing, the dataset is reduced to 40210 cells with 100 467 approximate principal components [46]. Based on this dataset, we use four methods for 468 dimensional reduction. Figure 8 shows the visualization results of this dataset with 469 eight major cell clusters. It suggests that BKFD can capture the developmental 470 trajectories in a better way in Figure 8A. In addition, tSNE and UMAP can also 471 distinguish different cell types clearly in Figures 8C and 8D, respectively. However, 472 PCA cannot show good visualization results with distinguishable cell clusters for this 473 dataset in Figure 8B. The RWR algorithm is a popular method to estimate the global similarity between 494 a particular node point with other node points in the graph structure. We use this 495 method to transform the data of each node point to a stationary discrete distribution. algorithm by using four datasets, the values of these parameters may vary from dataset 501 to dataset. In addition, BKFD uses the same restart probability for all the nodes, and this 502 may limit the effectiveness of random walk [47]. It is still a challenge to express each 503 cell by a distribution vector in a better way, which needs to be studied in the future. 504 The continuously topological structure of cellular developmental processes can be 505 analyzed by using the nearest-neighbour graph, which lays the basis of the DTFLOW 506 algorithm. The kNN graph describes the similarities between a cell and its neighbour 507 cells, and has been used twice in the proposed method, namely the definition of 508 transition probability matrix, which leads to the low-dimensional visualization via the 509  Step 2: Compute the nearest neighbours for each cell, get a nearest neighbour graph 677 structure, and then transform the dataset into a Markov transition matrix ; C.
Step 3: 678 Using the random walk with restart method to get a diffusion matrix in which each 679 cell is represented by a discrete distribution vector; . Step