A novel approach for the analysis of single-cell RNA sequencing identifies TMEM14B as a novel poor prognostic marker in hepatocellular carcinoma

A fundamental goal in cancer-associated genome sequencing is to identify the key genes. Protein–protein interactions (PPIs) play a crucially important role in this goal. Here, human reference interactome (HuRI) map was generated and 64,006 PPIs involving 9094 proteins were identified. Here, we developed a physical link and co-expression combinatory network construction (PLACE) method for genes of interest, which provides a rapid way to analyze genome sequencing datasets. Next, Kaplan‒Meier survival analysis, CCK8 assays, scratch wound assays and Transwell assays were applied to confirm the results. In this study, we selected single-cell sequencing data from patients with hepatocellular carcinoma (HCC) in GSE149614. The PLACE method constructs a protein connection network for genes of interest, and a large fraction (80%) of the genes (screened by the PLACE method) were associated with survival. Then, PLACE discovered that transmembrane protein 14B (TMEM14B) was the most significant prognostic key gene, and target genes of TMEM14B were predicted. The TMEM14B-target gene regulatory network was constructed by PLACE. We also detected that TMEM14B-knockdown inhibited proliferation and migration. The results demonstrate that we proposed a new effective method for identifying key genes. The PLACE method can be used widely and make outstanding contributions to the tumor research field.

Since the 1970s, increasingly efficient cancer prognosis detection methods and therapeutic approaches have been developed [1][2][3] , and the list of cancer genes has been growing steadily 4 . There are a large number of differentially expressed genes (DEGs) between cancer tissues and paired adjacent noncancerous tissues, and the key cancer genes often arise from the DEGs, but it is unrealistic to conduct a study on each DEG [5][6][7][8] . Fortunately, technological and computational advances in genomics and interactomics have made it possible to screen key genes within human cancer cells 9 .
There are many genome sequencing analysis methods to screen key cancer genes. These methods have the same problem: (1) the accuracy of key gene screening methods needs to be improved 10,11 . (2) Objective regulatory networks for key genes are lacking. There is an urgent need for new key gene screening approaches. The protein-protein interactions (PPIs) are defined as physical links between proteins [12][13][14] . It is well known that PPIs provide an objective basis, and PPIs could be utilized to screen genes that are consistently associated with survival [15][16][17][18] . PPIs have been studied for many years and have been utilized in diverse fields of medicine, such as diagnostics, with a wide range of applications. The revolution brought about by the advent of PPIs has changed the face of human molecular and disease research [19][20][21] , and it has brought great convenience to human cancer research 22 . PPI networks have vital relationships with gene regulation and function and provide a new way to OPEN 1 Fig. 1. HCC cells (T) and hepatocytes (N) were selected from GSE149614 (samples of liver cancer patients) of the GEO database. The data were processed with the Seurat package 39 . We calculated the number of gene types (nFeature 39 ) presented in the sample, total gene expression (nCount 39 ) and the percentage of reads in the mitochondrial genome (percent.mt 39 ), and distinct differences in Figure 1. The scheme of PLACE analysis. First, obtain sequencing data from public databases, then select samples of interest from the obtained data, then identified differential genes in obtained data, construct physical linkage networks and co-expression networks by PLCAE, and combine the two networks, Finally, the obtained results are validated by other databases and in vitro assays. www.nature.com/scientificreports/ gene expression levels between T and N were found ( Fig. 2A). We next calculated a subset of features that exhibited high cell-to-cell intercellular variation in the dataset (the top 2000 variable genes) (Fig. 2B). Among the 2000 variable genes, we identified 15 principal components, which allowed easy exploration of the primary sources of heterogeneity in a dataset (Fig. 2C). Then, we created an expression matrix of cell-by-gene and conducted dimensionality reduction by T-distributed stochastic neighbor embedding (tSNE) to visualize and explore these datasets (Fig. 2D). We used SingleR to predict and annotate cell type 40 , and then the cell type was confirmed using canonical markers (hepatocyte and HCC cell-ALB, endothelial cell-CD34, stem cell-EPCAM, stromal  cell-NGFR, B cell-MS4A1, T cell-GNLY, CD3E and CD8A, NK cell-KLRD1, monocyte-CD14 and FCGR3A, macrophage-CD68) (Table S1) (Fig. 2E). Finally, all cell types were identified and annotated: NK cells, hepatocytes and HCC cells, monocytes, macrophages, stem cells, endothelial cells, stromal cells, T cells, and B cells (Fig. 2F). We retained hepatocytes and HCC cells (Fig. 2G) and calculated nFeature, nCount and percent.mt presented separately in hepatocytes and HCC cells and each sample of hepatocytes and HCC cells for further analysis. (Fig. 2H,I).

DEGs between HCC cells and hepatocytes.
Hepatocytes and HCC cells were divided into two groups: HCC cells (T) and hepatocytes (N) (Fig. 3A). Besides we calculated nFeature, nCount and percent.mt separately in N and T (Fig. 3B). We then identified 1618 DEGs by comparing cells from HCC with those from hepatocyte (Table S2). We next analyzed DEGs by enrichment in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and hallmarks. Hallmark analysis showed that DNA repair, peroxisomes, MYC targets V1, and oxidative phosphorylation were activated (Fig. 3C), and the KEGG results indicated that the DEGs activated in group T were mainly enriched in the oxidative phosphorylation pathway (Fig. 3D). Therefore, follow-up work was performed to help us identify the key genes among the candidate genes.
Screening candidate genes from DEGs using the PLACE method. The ultimate gene regulatory network requires both physical links and co-expression, as we mentioned earlier. Thus, we analyzed and counted the proportion of DEGs that were significantly correlated (physical links and co-expression) with the target gene in all DEGs. We recalculated the level 1, level 2 and level 3 counts of each DEG of interest using the PLACE method, which has been described in the Methods section. The genes were arranged in descending order by the number of level 1, level 2 and level 3 genes (Table S3). We screened the top10 candidate genes that had the greatest number of PPIs. The expression levels of 10 genes between N and T had significant difference. TMEM14B, ERGIC3, JAGN1, EBP, UBE2I, GJB1, IER3IP1, TIMMDC1, YIF1A and AIG1 were highly expressed in HCC cells (Table 1, Fig. 3E). Finally as an example, PLACE constructed a new network of TMEM14B containing PPIs and co-expression (Fig. 3F), and we verified the relationship of TMEM14B-TMEM14C and TMEM14B-NUFDAB1 by the STRING protein interaction database with the TCGA database (Fig. 3G,H).

Validation of candidate genes on survival benefit.
To further verify whether the previously candidate genes can regulate tumor development and thus affect survival, we calculated p values for different survival data of each gene by The Cancer Genome Atlas-Liver hepatocellular carcinoma (TCGA-LIHC). Among them, TMEM14B, ERGIC3, JAGN1, BE2I, IER3IP1, TIMMDC1, YIF1A and AIG1 were negatively associated with the overall survival ( Fig. 4A), TMEM14B, ERGIC3, UBE2I, IER3IP1 and TIMMDC1 were associated with the disease-specific survival. TMEM14B, ERGIC3, JAGN1 and TIMMDC1 were associated with the disease-free interval. TMEM14B, ERGIC3, UBE2I, IER3IP1 and TIMMDC1 were associated with the progression-free interval ( Fig. 4B). A large fraction (80%) of the genes (screened by PLACE method) were associated with survival.
Construction and validation of the TMEM14B regulatory network. In the above results, we identified and verified the DEG TMEM14B, which was closely related to the survival of tumor patients. Here, TMEM14B-related genes (Table S4) were screened from the 1618 DEGs by the PLACE method. We next analyzed these TMEM14B-related genes by Hallmark. We then found that DNA repair genes, MYC target V1 genes and oxidative phosphorylation genes were enriched in the TMEM14B-related genes (Fig. 5A). Meanwhile, PLACE was used to construct the TMEM14B regulatory network ( Fig. 5B-D). To further verify whether TMEM14B could regulate DNA repair genes, MYC targets V1 genes and oxidative phosphorylation genes, the interrelationships among the genes (TMEM14B-DNA repair genes, TMEM14B-MYC targets V1 and TMEM14B-oxidative phosphorylation) were validated by TCGA-LIHC. We discovered that 8 of the 11 TMEM14B-DNA repair gene interactions were detected by our method and confirmed by the Pearson test in the TCGA-LIHC cohort, 34 of the 46 TMEM14B-MYC target V1 gene interactions were detected by our method and confirmed by the Pearson test in the TCGA-LIHC cohort, and 11 of the 15 TMEM14B-oxidative phosphorylation gene interactions were detected by our method and confirmed by the Pearson test in the TCGA-LIHC cohort ( Fig. 5E-G, Tables S5-S7).
To confirm the carcinogenic role of TMEM14B, we knocked down TMEM14B in HepG2 and MHCC-LM3 cells using siRNA. Cell proliferation was evaluated using a CCK-8 assay at 24 h, 48 h and 72 h. The results showed that TMEM14B knockdown inhibited the proliferation of HepG2 and LM3 cells (Fig. 6A-D). Cell migration was evaluated using Transwell and scratch assays. TMEM14B knockdown inhibited the migration of HepG2 and LM3 cells (Fig. 6E-L). This result emphasized another advantage of the PLACE method: we can construct a PPI and co-expression network for each protein that is useful for studying genes of interest.

Discussion
In the present study, we proposed a new method, PLACE. In the PLACE method, the input of PPI interactions, expression matrix and potential gene list were needed, and then the co-expression and physical link (level 1, level 2, level 3) network of each potential gene was accordingly constructed. After sorting by PLACE, potential genes that ranked in the top 5, 10, 20, 30, 40 or 50 of the list were selected as key genes. PPI interactions stem from computational prediction, from knowledge about intricate connections and information transfer between molecules within organisms, and from interactions aggregated from other primary databases 41,42 . There are several published databases of PPIs, such as The Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database and BioGRID database 16,43 . For these databases, while comprehensive, no uniform standard definitions were used for the PPIs. Therefore, these databases were not used in our study, but the Human Reference Interactome database (HuRI) was used. Benefiting from the Center for Cancer Systems Biology at Dana-Farber Cancer Institute, a human "all-by-all" reference interactome map of human binary protein interactions was successfully constructed. Currently, 64,006 PPIs involving 9094 proteins have been identified using the Y2H assay 22 . The Y2H assay is the least laborious, low-cost, high-precision direct PPI screening method available to date 44 . PLACE can further dissect key genes based on HuRI PPI interaction data.
The expression matrix and potential gene list from a total of 13,736 cells (10,672 cancer tissue-derived cells and 3064 paired adjacent noncancerous tissue-derived cells) were picked from the scRNA-seq, and we annotated hepatocytes and HCC cells using canonical markers, such as ALB 45 . The exclusion of other cell types by design implied that our results have no bearing for immune cells, stromal cells, etc., so we only focused on the hepatocytes and HCC cells themselves. Then, we identified a number of genes that were differentially expressed between cancer tissues and paired adjacent noncancerous tissues.
In this article, the PPI interactions, expression matrix and potential gene list were processed using PLACE. Next, TMEM14B was identified as the most significant prognostic key gene. Survival is the key to prognosis for tumors; thus, we thought that differentially expressed key genes strongly correlated with survival determine the different prognoses of cancer patients 46,47 , so we analyzed the correlation between gene expression level and survival to evaluate the importance of a gene.
In the present study, TMEM14B regulatory hallmarks, such as DNA repair, MYC targets V1 and oxidative phosphorylation, were found by analyzing the GSE149614 dataset in PLACE, and the results were proven by TCGA. For TMEM14B, biological experiments were conducted, indicating its critical role in the pathogenesis of multiple carcinomas. In conclusion, we have found a new method for discovering critical genes. The role of TMEM14B in tumors is not clear, and this study revealed its prognostic role and regulatory network in HCC for the first time. The results proved that PLACE makes it possible to accurately connect key genes to the regulatory pathway.

Materials and methods
PLACE method. The PPI network was constructed using the HuRI-Union dataset. The PPIs in HuRI were identified by yeast two-hybrid (Y2H) assay or curated literature. For ease of use, we redefined 3 relationships (between any two proteins A and B). Level 1: Proteins A and B in direct contact and interaction-protein A-protein B; level 2: Proteins A and B in indirect contact with an interval of protein X-protein A-protein X-protein B; level 3: Proteins A and B in indirect contact with an interval of two proteins X1 and X2-protein A-protein X1-protein X2-protein B. We calculated the level 1 counts, level 2 counts and level 3 counts for each DEG. Apart from this, we then examined the relationship between each DEG, and Pearson's coefficient was calculated for all genes. We retained the level 1 counts, level 2 counts and level 3 counts based on correlation values r > 0.5 and p < 0.05, and the network was visualized with Cytoscape software. The genes were arranged in descending order by the number of level 1, level 2 and level 3 genes. Data processing. We downloaded GSE149614 scRNA-seq submitted by Yiming Lu et al. from the Gene Expression Omnibus database 48 . A total of 13,736 cells (10,672 cancer tissue-derived cells and 3064 paired adjacent noncancerous tissue-derived cells) were selected from the scRNA-seq. Single-cell RNA sequencing data analysis: dimensionality reduction and clustering. After preliminary screening of 13,736 cells (10,672 cancer tissue-derived cells and 3064 paired adjacent noncancerous tissue-derived cells), the cutoff criteria iare that the percentage of mitochondria is less than 20%, and the expression matrix of cells was processed using R software (Seurat package). Following data normalization (Normal- Single-cell RNA sequencing data analysis: cell type annotation. The cell types were annotated according to a sample reference dataset (HumanPrimaryCellAtlasData) with known labels given via the SingleR package, which assigns these labels to cells from GSE149614 based on the similarity of their expression profiles and confirmed according to the list of marker genes (Table S1). We visualized the marker genes in clustering plots by the FeaturePlot function.
Single-cell RNA sequencing data analysis: biomarker genes that showed differential expression between cancer cell-derived hepatocytes and HCC cells. Hepatocytes and HCC cells were selected from the pool of single cells (subset Function). We performed differential gene expression analyses on cancer tissue-derived cells and paired adjacent noncancerous tissue-derived cells. Differentially expressed genes (DEGs) were then identified by differential gene expression analysis. The Wilcoxon test (adjusted P value < 0.05) and a log e (FC) greater than 0.25 were used to test for significance 39 .  Wound-healing assay. A culture insert (Ibidi, Munich, Germany) was used to generate a wound of 500 μm.
The insert was placed on 24-well plates; then, 3 × 10 5 cells were seeded in each culture insert and incubated for 24 h. After removing the culture insert, the cells were allowed to grow in medium without FBS for 24 h. The original area and migration area were measured using ImageJ software, and the wound closure rates are shown according to the ratio of the migration area to the original area. Each treatment was performed in triplicate wells, and three independent experiments were repeated.
Transwell assay. Transwell migration assays were performed using a 6.5-mm transwell insert with an 8.0μm pore polycarbonate membrane (Merck Millipore, Burlington, MA, United States). A total of 300 μl of cell suspension containing 3 × 10 5 cells without FBS was added to the upper chamber, and 800 μl of medium containing 10% FBS was added to the lower chamber. After incubation for 24 h, cells in the lower chamber were fixed with 4% paraformaldehyde for 15 min and stained with crystal violet for 15 min. Images of each chamber were captured randomly for cell counting. Three independent experiments were repeated.