CellTracer: a comprehensive database to dissect the causative multilevel interplay contributing to cell development trajectories

Abstract
During the complex process of tumour development, the unique destiny of cells is driven by the fine-tuning of multilevel features such as gene expression, network regulation and pathway activation. The dynamic formation of the tumour microenvironment influences the therapeutic response and clinical outcome. Thus, characterizing the developmental landscape and identifying driver features at multiple levels will help us understand the pathological development of disease in individual cell populations and further contribute to precision medicine. Here, we describe a database, CellTracer (http://bio-bigdata.hrbmu.edu.cn/CellTracer), which aims to dissect the causative multilevel interplay contributing to cell development trajectories. CellTracer consists of the gene expression profiles of 1 941 552 cells from 222 single-cell datasets and provides the development trajectories of different cell populations exhibiting diverse behaviours. By using CellTracer, users can explore the significant alterations in molecular events and causative multilevel crosstalk among genes, biological contexts, cell characteristics and clinical treatments along distinct cell development trajectories. CellTracer also provides 12 flexible tools to retrieve and analyse gene expression, cell cluster distribution, cell development trajectories, cell-state variations and their relationship under different conditions. Collectively, CellTracer will provide comprehensive insights for investigating the causative multilevel interplay contributing to cell development trajectories and serve as a foundational resource for biomarker discovery and therapeutic exploration within the tumour microenvironment.

functions were used to conduct cell clustering and visualization, respectively. The resolution parameter of the FindClusters function was varied from 0.1 to 0.9 (in steps of 0.1) to provide clustering results at several resolutions, with higher resolution values yielding more cell clusters.
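The resolution sweep described above can be sketched as follows. This is a minimal Python illustration, not the R/Seurat code used to build the database: `resolution_grid` is a hypothetical helper that only generates the parameter values one would pass to a clustering call.

```python
def resolution_grid(start=0.1, stop=0.9, step=0.1):
    """Return evenly spaced resolution values from start to stop, inclusive.

    Each value is computed from an integer multiple of the step and then
    rounded, which avoids the drift of repeatedly adding 0.1 in floating point.
    """
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


# Hypothetical usage: one clustering run per resolution value.
resolutions = resolution_grid()
print(resolutions)
```

Building the grid from integer multiples keeps the values exact to one decimal place, so results can be stored and looked up by resolution without near-miss keys such as 0.30000000000000004.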

Cell type annotation
CellTracer annotated cell types using two strategies: (i) the original cell type annotation was used when it was provided by the original data source; (ii) the CELLiD method described by DISCO was applied to annotate the different cell types (6). In this step, the cell type of each cluster was determined using reference cell type marker genes and the R code of the CELLiD method (https://github.com/JinmiaoChenLab/DISCO_manuscript/blob/master/CELLiD.R). The cell marker annotations were collected and combined from DISCO (6) and CellMarker (7).
DISCO is a database of deeply integrated scRNA-seq data covering 107 tissues/cell lines/organoids and 158 diseases. CellMarker, one of our previous works, provides manually curated markers of diverse cell types. Both databases collect cell marker annotations for disease and healthy scRNA-seq data. To annotate diverse cell types comprehensively, we integrated the cell markers from DISCO and CellMarker into a single cell type annotation reference. For each cell cluster, this combined reference (covering, for example, normal cells, malignant cells and disease cells) was used as input for CELLiD.
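The idea of merging marker references and assigning a label per cluster can be illustrated with a small sketch. This is not the CELLiD algorithm: it is a simplified marker-overlap score chosen only for illustration, and the function names and toy marker tables are assumptions rather than content from DISCO or CellMarker.

```python
def merge_marker_references(*references):
    """Union the marker sets of each cell type across several references."""
    merged = {}
    for reference in references:
        for cell_type, genes in reference.items():
            merged.setdefault(cell_type, set()).update(genes)
    return merged


def annotate_cluster(cluster_genes, markers):
    """Return the cell type whose markers overlap most with the cluster genes.

    The score is the fraction of a cell type's markers found among the
    cluster's genes. Ties keep the first best type, and None is returned
    when no marker matches at all.
    """
    cluster_genes = set(cluster_genes)
    best_type, best_score = None, 0.0
    for cell_type, genes in markers.items():
        score = len(cluster_genes & genes) / len(genes) if genes else 0.0
        if score > best_score:
            best_type, best_score = cell_type, score
    return best_type


# Toy marker tables standing in for two reference sources (illustrative only).
source_a = {"T cell": {"CD3D", "CD3E"}, "B cell": {"CD79A"}}
source_b = {"T cell": {"CD2"}, "B cell": {"MS4A1", "CD79A"}}
markers = merge_marker_references(source_a, source_b)

print(annotate_cluster(["CD3D", "CD2", "IL7R"], markers))
print(annotate_cluster(["MS4A1", "CD79A", "CD19"], markers))
```

Taking the union of marker sets means a cell type described in only one source is still represented in the combined reference, which is the motivation the text gives for integrating both databases.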

Cell development trajectories construction
The Monocle 2 package (v2.18.0) (8) was used to calculate pseudotime and cell states and to construct cell development trajectories. Monocle 2 works with both relative expression data and count-based measures but, in general, works best with transcript count data (http://cole-trapnell-lab.github.io/monocle-release/docs/). We therefore used gene counts as the input matrix for Monocle 2 and set the expressionFamily parameter to negbinomial.size(), which specifies a negative binomial distribution. The Monocle 2 object was built from Seurat's processing results, which contain the quality-filtered cells and their metadata.
Genes with an average expression greater than 0.1 were used for principal component analysis (PCA), and the top 20 principal components were used for cell clustering. Feature genes for cell ordering were then selected (q-value < 0.01) from the differentially expressed genes of clusters or cell types. To visualize trajectories in 2D space, the DDRTree algorithm was used for dimensionality reduction. The trajectory analysis results of Monocle 3 (v1.2.9) (9) were also included in CellTracer. Compared with Monocle 2, Monocle 3 replaces the DDRTree dimensionality reduction method with UMAP, which can better reflect the structure of high-dimensional data. The most important difference between the two versions is that the DDRTree-based method assumes that trajectories are connected into a single tree-like structure, whereas Monocle 3 can learn multiple, disjoint graphs.
As with Monocle 2, the top 20 principal components calculated by PCA were used for cell clustering and feature genes (q-value < 0.01) were selected for cell ordering. Finally, the pseudotime trajectory was visualized in the 2D UMAP space. For each dataset, CellTracer provides trajectory analyses using Monocle 2 and Monocle 3 for different cell types, such as malignant cells, immune cells and stromal cells.
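The two gene-selection thresholds mentioned above (average expression greater than 0.1 for PCA, q-value < 0.01 for cell ordering) can be sketched as simple filters. This is a Python illustration under stated assumptions, not the Monocle code itself: the function names, gene names and values are hypothetical, and the q-values are assumed to come from a differential expression test run elsewhere.

```python
from statistics import mean


def expressed_genes(counts, min_mean=0.1):
    """Return genes whose mean count across cells is strictly above min_mean."""
    return sorted(g for g, values in counts.items() if mean(values) > min_mean)


def ordering_genes(q_values, max_q=0.01):
    """Return genes whose differential-expression q-value is below max_q."""
    return sorted(g for g, q in q_values.items() if q < max_q)


# Toy inputs (hypothetical gene names and values, for illustration only).
counts = {
    "GENE_A": [0, 1, 2, 1],  # mean 1.0, kept
    "GENE_B": [0, 0, 0, 0],  # mean 0.0, dropped
    "GENE_C": [3, 0, 4, 1],  # mean 2.0, kept
}
q_values = {"GENE_A": 0.001, "GENE_C": 0.2}

pca_genes = expressed_genes(counts)      # candidates for PCA and clustering
order_genes = ordering_genes(q_values)   # candidates for cell ordering
print(pca_genes)
print(order_genes)
```

Keeping the two filters separate mirrors the text, where the expression threshold feeds PCA and clustering while the q-value threshold selects the genes used to order cells along the trajectory.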

Functional annotation data collection
To dissect the functional activation status and state transitions of individual cell populations, CellTracer collected functional gene sets including Gene Ontology (GO) (10), biological pathways (11), hallmarks (12) and cellular states (13). The gene set variation analysis (GSVA) method (14)

Database construction
CellTracer is freely accessible at http://bio-bigdata.hrbmu.edu.cn/CellTracer/. The CellTracer web server was built with JavaServer Pages and deployed on the ... All data processing and statistical analyses were performed using R (v4.2.1, https://cloud.r-project.org/).