Network analyses elucidate the role of SMYD3 in esophageal squamous cell carcinoma

SMYD3 is a member of the SET and myeloid‐Nervy‐DEAF‐1 (MYND) domain‐containing protein family of methyltransferases, which are known to play critical roles in carcinogenesis. Expression of SMYD3 is elevated in various cancers, including esophageal squamous cell carcinoma (ESCC), and is correlated with the survival time of patients with ESCC. Here, we dissect gene expression data, from a previously described KYSE150 ESCC cell line in which SMYD3 had been knocked down, by integration with the protein–protein interaction (PPI) network, to find the new potential biological roles of SMYD3 and subsequent target genes. By construction of a specific PPI network, differentially expressed genes (DEGs), following SMYD3 knockdown, were identified as interacting with thousands of neighboring proteins. Enrichment analyses from the DAVID Functional Annotation Chart found significant Gene Ontology (GO) terms associated with transcription activities, which were closely related to SMYD3 function. For example, YAP1 and GATA3 might be a target gene for SMYD3 to regulate transcription. Enrichment annotation of the total DEG PPI network by GO ‘Biological Process’ generated a connected functional map and found 532 significant terms, including known and potential biological roles of SMYD3 protein, such as expression regulation, signal transduction, cell cycle, cell metastasis, and invasion. Subcellular localization analyses found that DEGs and their interacting proteins were distributed in multiple layers, which might reflect the intricate biological processes at the spatial level. Our analysis of the PPI network has provided important clues for future detection of the biological roles and mechanisms, as well as the target genes of SMYD3.

SMYD3 is a member of the SET and myeloid-Nervy-DEAF-1 (MYND) domain-containing protein family of methyltransferases, which are known to play critical roles in carcinogenesis. Expression of SMYD3 is elevated in various cancers, including esophageal squamous cell carcinoma (ESCC), and is correlated with the survival time of patients with ESCC. Here, we dissect gene expression data, from a previously described KYSE150 ESCC cell line in which SMYD3 had been knocked down, by integration with the proteinprotein interaction (PPI) network, to find the new potential biological roles of SMYD3 and subsequent target genes. By construction of a specific PPI network, differentially expressed genes (DEGs), following SMYD3 knockdown, were identified as interacting with thousands of neighboring proteins. Enrichment analyses from the DAVID Functional Annotation Chart found significant Gene Ontology (GO) terms associated with transcription activities, which were closely related to SMYD3 function. For example, YAP1 and GATA3 might be a target gene for SMYD3 to regulate transcription. Enrichment annotation of the total DEG PPI network by GO 'Biological Process' generated a connected functional map and found 532 significant terms, including known and potential biological roles of SMYD3 protein, such as expression regulation, signal transduction, cell cycle, cell metastasis, and invasion. Subcellular localization analyses found that DEGs and their interacting proteins were distributed in multiple layers, which might reflect the intricate biological processes at the spatial level. Our analysis of the PPI network has provided important clues for future detection of the biological roles and mechanisms, as well as the target genes of SMYD3.
Methylation of histone proteins plays a pivotal role in the regulation of a wide range of biological processes. SET and myeloid-Nervy-DEAF-1 (MYND) domaincontaining protein (SMYD) is a methyltransferase family that includes SMYD1, SMYD2, SMYD3, SMYD4, and SMYD5 and has been found to play critical roles in human carcinogenesis. Altered expression of SMYD3 is associated with the progression of several solid tumors, including bladder cancer [1], glioma [2], gastric cancer [3], and prostate cancer [4].
Sarris et al. [5] found a significant correlation between elevated expression of SMYD3 and the incidence of both hepatocellular carcinoma and colorectal cancer. Several trials have explored the effects of SMYD3 overexpression on proliferation, viability, cancer cell migration, and invasion [6][7][8][9][10]. The series of elegant experiments suggested that SMYD3 could serve as a potential biomarker for clinically aggressive disease and an attractive therapeutic target. The biological roles for SMYD3 are numerous and far beyond its transactivation activity. Thus, more attention directed toward other roles remains necessary.
Network-based analysis of protein-protein interactions (PPI) refers to the association among protein molecules and the study of these associations from the aspects of biochemistry, signal transduction, and biomolecular networks. Proteins do not work alone, interacting with other proteins in the biological context of specific functions [11]. In recent years, integrated analysis of largescale gene expression data, or other high-throughput data, with the PPI network has received great attention to elucidate biological processes [12]. Integration analyses with the PPI network provide a number of applications, such as protein interaction prediction, disease candidate genes identification, protein function prediction, functional protein module identification, protein complex, and drug target identification [13].
In our previous study, we found that SMYD3 expression is frequently upregulated in human esophageal squamous cell carcinoma (ESCC) tissues, correlating with overall survival of patients with ESCC [14]. RNAimediated knockdown of SMYD3 suppressed ESCC cell proliferation, migration, and invasion in vitro and inhibited local tumor invasion in vivo [14]. Future analyses should be carried out beyond the mere listing of differentially expressed genes (DEGs) after microarray experiments or other high-throughput experiments. It would be helpful to explain the biological roles or biological phenotype of target genes, especially when more and more spatial or temporal interactions between proteins are obtained from the public databases. In this study, we analyzed DEGs, following SMYD3 knockdown in ESCC cells, by applying PPI network analysis.

Materials and methods
Expression of SMYD3 in ESCC from TCGA data Expression data of esophageal carcinoma (TCGA_ESCA_ exp_HiSeq-2015-02-24) were downloaded from TCGA (https://cancergenome.nih.gov/), which contained level 3 expression of 89 cases of ESCC by RNA-seq. The X-tile program 3.6.1 [15], to define the optimal cutoff point for the expression level for SMYD3, was used to classify patients with ESCC into two groups, high and low SMYD3 expression, following the Kaplan-Meier survival analysis and log-rank test by SPSS 13.0 software (version 13.0; SPSS, Inc., Chicago, IL, USA).
The expression profile and differentially expressed genes SMYD3 was knocked down as described in our previous study [14]. Briefly, shRNA sequences targeting SMYD3 were ligated into the pGLV3/H1/GFP/+ Puro vector and transfected into the KYSE150 ESCC cell line. KYSE150 cells were transfected with an empty plasmid as a control. SMYD3 knockdown was confirmed by both QRT-PCR and western blot, and the mRNA expression profile was analyzed with the GeneChipÒ PrimeView TM Human Gene Expression Array (Affymetrix, Santa Clara, CA, USA). Raw data were treated by normalization and log transformation. Both raw and treated expression data have been submitted to the NCBI GEO database (http://www.ncbi. nlm.nih.gov/geo/) and were assigned as GSE85419. A threshold of twofold change was set for DEGs in this study.

Construction of the PPI network
Currently known human PPI data are available from the newest releases of the following databases: HPRD (http://www. hprd.org/), BioGRID (http://thebiogrid.org/), DIP (http:// dip.doe-mbi.ucla.edu/dip/Main.cgi), and IntAct (http:// www.ebi.ac.uk/intact/). These physical protein interactions were collected from public reports of both low-throughput and high-throughput experimental results, providing high confidence for the following analyses, such as disease researches integrated with the human PPI network [16,17].
At first, we manually integrated the data to obtain a unique dataset of interactions for the Homo sapiens species. These unique PPI data were considered as a curated parental PPI network, containing 18 644 unique proteins and 199 411 interactions, which was applied for new or child PPI network construction. Cytoscape software has been widely applied for visualization, data integration, and analyses of PPI networks, as it provides update plugins to meet the needs of large-scale data analyses [18]. In Cytoscape, for visualization as graphs, PPI networks are presented with nodes as proteins and the edges as their interactions. First, the total DEGs, downregulated DEGs, and upregulated DEGs were mapped to the parental PPI network, and extracted their first class interacting neighbors to construct three PPI subnetworks. To increase the reliability and reduce the unnecessary connections, the network reconstruction was limited to the first interacting protein neighbors of these DEGs. Second, to detect the axis of SMYD3-neighboring proteins, SMYD3 was used as the query node to construct a SMYD3-central PPI network.
Third, a subnetwork was created to detect the internal interactions between DEGs by mapping all DEGs to the parental PPI network.

Analyses of PPI network topological parameters
The analyses of multiple topological parameters of networks were carried out by NetworkAnalyzer [19]. The network topological parameters are important characters to understand the organization of complex networks, such as the PPI network [20]. The degree is defined as the number of one node's directly interacting protein neighbors in the PPI network. One of most important network topological characteristics is that the degree distribution follows a scale-free power law distribution for many large networks, such as the PPI network or social network. In this study, the power law of distribution of node degrees was analyzed by the method in our previous report [21].
The three new PPI subnetworks were treated as undirected in this study. The degree distribution P(k) of a large-scale network is defined as the fraction of nodes in the network with degree k. The pattern of their dependencies can be visualized by fitting a line on the node degree distribution data. NetworkAnalyzer calculates the positive coordinate value for fitting the line where the power law curve of the form y = bx a . R 2 value is a statistical measure of the linearity of the curve fit and used to quantify the fit to the power line. When the fit is good, the R 2 value is very close to 1. Moreover, many topological parameters were also analyzed and shown, as they indicate the network properties.

Gene functional enrichment analyses
The Functional Annotation Chart in DAVID (http://da vid.abcc.ncifcrf.gov/) is able to examine the significance of gene-term enrichment by the application of a modified Fisher's exact test [22]. The chart covers 40 annotation categories, including GO terms, protein functional domains, pathways, sequence features, disease associations, homology, gene functional summaries, and literature. Terms from the Functional Annotation Chart with P < 0.05 were visualized by Enrichment Map [23], which organizes the enriched overlapped gene sets into a network.

Generation of the functional annotation map
The total DEG PPI network was annotated by Gene Ontology (GO) for the mining of enriched GO 'Biological Process' terms using the ClueGO plugin, which creates a functionally and hierarchical organized GO/pathway term network [24]. The enriched annotated terms with a P-value < 0.001 were defined as significant. To visualize the relationship between GO terms, a kappa score, reflecting the overlapping number between terms, was set to 0.3 as the threshold.

Subcellular classes of the total DEG PPI network
The subcellular localization for proteins in the total DEG PPI network was obtained from the 'GENE-ONTOLOGY annotation file' in the HPRD database, which was curated and imported into the network as a node attribute. For the localizations of several proteins that were annotated by multiple locations, including proteins that might translocate into the nucleus (e.g., both in cytoplasm and in nucleus), these localizations were merged (e.g., cytoplasm/nucleus). Cerebral program is able to separate the nodes in the total DEG PPI network into multiple layers according to their subcellular localization remaining their interactions, generating a pathway-like graph [25].

Correlation of SMYD3 with survival of patients with ESCC
Although we previously found that SMYD3 is overexpressed in patients with ESCC from China, we applied another dataset as a validation cohort. By X-tile, patients with ESCC were classified into two groups according to the expression level of SMYD3 (P < 0.001; Fig. 1A). Compared with patients with high SYMD3 expression, ESCC patients with lower SYMD3 had a longer survival time (P = 0.014), consistent with our previous report (Fig. 1B).

The construction of three DEG PPI networks
Using twofold as the threshold, we identified 238 DEGs (85 upregulated genes and 153 downregulated genes) from the mRNA expression microarray of SMYD3-knockdown ESCC cells. In order to provide a landscape of what and how the DEGs participate in cellular biological activities, we constructed a full view of their interaction proteins to shed light on their functions, as well as the biological role of SMYD3 in ESCC. The PPI datasets from several acknowledged interaction databases collected from the literature provide original and credible data for performing research. By mapping the total, downregulated, and upregulated DEGs to the parental PPI network and extracting their first interacting neighbors, three sub-PPI networks were generated.
As shown in Fig downregulated DEGs have interacting proteins to form a sub-PPI network composed of 2963 nodes and 58 833 edges (Fig. 2B). Third, 85 upregulated proteins formed a PPI subnetwork containing 2176 nodes and 37 238 edges (Fig. 2C). These three PPI subnetworks suggested that the knockdown of SMYD3 tremendously disturbs the protein activities in ESCC, as more than 200 DEGs, resulting from SMYD3 knockdown, were capable of broadly influencing biological processes through the interactions with thousands of other proteins.
Currently, 23 SMYD3-interacting proteins have been identified. Nevertheless, the expression levels of these 23 proteins were not significantly changed, except for the gene mesoderm-specific transcript (MEST), which was downregulated in the mRNA profile of SMYD3 knockdown in ESCC (Fig. 2D). The DEG-DEG interactions were acquired from the parental PPI network to detect their internal interactions. The DEG-DEG network contained 77 nodes (48 downregulated and eight upregulated DEGs) and 120 edges, which also included a large module containing 59 DEGs (39 downregulations and 20 upregulations; Fig. 2E).

Network topological characteristics analyses
Based on the specific distinguishing principles (e.g., power law distribution of node degree), a real biological network, such as the PPI network, is significantly discriminated from random networks [26,27]. For the total, downregulated, and upregulated DEG networks, the node degree distributions approximately showed a power law distribution, with an R 2 = 0.854, 0.831, and 0.845, respectively ( Fig. 3A-C). These results indicated the three DEG PPI subnetworks constructed in this study were real complex biological networks characterized scale-free [28]. It also suggested that a small number of important proteins act as hub nodes with a large amount of interactions. Three topological metrics proposed to understand the structure of a complex network, specifically network density, network centralization, and clustering coefficient, are shown in Table 1. Several other important network characteristics, for example, average clustering coefficient distribution, closeness centrality, neighborhood connectivity distribution, and topological coefficients, are indicated in Fig. 3D-G.

Functional annotation map of DEGs
To gain a full view of the functions and categories of the DEGs, the DEGs were annotated by Functional Annotation Chart and visualized by Enrichment Map. As shown in Fig. 4, one node represents one functional annotation term. The more significant the enriched term, the larger it is. Nodes from the same kind of functional category are shown in the same color. The edge width, defined by the overlap coefficient between the enriched terms (overlap coefficient cutoff was set as 0.6), is wider when more of the same genes overlap in two nodes.

Generation of functional annotation map
Cellular activities might be affected by the DEGs through cascades of interactions in the network to perform their multiple biological functions. To find potential cellular activities perturbed by SMYD3 protein through its DEGs and their interacting proteins, GO 'Biological Process' enrichment analyses of the total DEG PPI network were performed. We generated 532 enriched GO terms to construct a functional annotation map, in which the nodes were no longer proteins, but rather their enriched GO terms, with the edges suggesting significant overlapping of enriched proteins between two GO terms (Fig. 5).
Interestingly, many groups of GO terms possibly associated with SMYD3 functions were found, as indicated by capitalized Roman numerals. As SMYD3 plays important roles in histone modification, a group of gene expression regulation-associated terms were identified, such as 'gene expression', 'regulation of gene expression', 'positive regulation of transcription, DNA-dependent', and 'transcription from RNA polymerase II promoter'. The second interesting result was that many of the proteins from the total DEG PPI subnetwork participated in signal regulation or signal transduction, for example, 'signal transduction', 'intracellular signal transduction', 'regulation of cell  biological roles of SMYD3 in promoting cancer cell metastasis and invasion were also found: 'cell-substrate adhesion', 'cell-matrix adhesion', and 'positive regulation of cell adhesion'. The interesting significant important GO terms are listed in Table 2.

Subcellular layers of proteins in the PPI subnetwork
The proper subcellular localization of the proteins is extremely crucial because the appropriate location provides the physiological context for their functions, such as signal transduction, transcription regulation, protein modification, and complex formation. Cerebral program could array nodes in the PPI network into different subcellular layers maintaining their interactions. The total DEG PPI network was separated into 10 layers in the following percentages: secreted (6.5%), membrane (13.2%), cytoskeleton (0.5%), cytoskeleton/ cytoplasm (0.3%), cytoplasm (26.4%), secreted/nucleus (0.8%), membrane/nucleus (0.3%), cytoskeleton/nucleus (0.45%), cytoplasm/nucleus (28.6%), and nucleus (23%; Fig. 6A). These results suggest the proteins in the total DEG PPI network distributed from extracellular to intracellular till nucleus. There are currently 23 SMYD3-interacting proteins reported and annotated in the PPI database. The most recognized SMYD3 function is histone methyltransferase, suggesting that this protein is mainly localized in the nucleus. To detect whether there were any possibilities for SMYD3 and its interacting proteins play roles in the nucleus, the proteins of the SMYD3-central PPI network were also arrayed according to subcellular locations. Many SMYD3 interacting proteins, such as E2H2, HDAC1, NFYB, and POLR2A, were located in the nucleus, suggesting that SMYD3 might form functional protein complexes with these proteins to regulate gene expression (Fig. 6B). These results might reflect the intricate biological processes at the spatial level.

Discussion
Globally, esophageal cancer is the eighth most common malignancy, as well as the sixth most common fatal cancer worldwide. Esophageal cancer has two histological types, adenocarcinoma and squamous cell carcinoma; the latter is among the four most common causes of death in China [29]. More and more research has illustrated that systems biology approaches, such as network-based methods, can be successfully applied to elucidate the molecular mechanisms of diseases [30,31].
SMYD3 protein interacts with H3K4Me3-modified histone tails, which facilitates its recruitment to the core promoter of transcriptionally active genes [32]. To explore the potential roles or functions of SMYD3, a systems approach was applied by integrating public protein interaction data with DEGs resulting from SMYD3 knockdown to provide a full view. As shown in the three (downregulated, upregulated, and total DEGs) PPI subnetworks generated, thousands of proteins interact with the DEGs. This suggests that SMYD3 affects the expression of other proteins, and its knockdown impacts on cellular activities through the perturbation of the cellular protein network. Of the 23 SMYD3 directly interacting proteins, only the expression of MEST is downregulated. MEST is an imprinted gene with a hypermethylation promoter and is associated with cell invasion, as well as being a risk factor for cervical cancer and hepatocellular carcinoma [33][34][35]. It is presumed that the knockdown of SMYD3 might impact on the expression of its directly interacting protein MEST.
The wide coverage of the Functional Annotation Chart provides a powerful tool to facilitate large-scale gene function analysis from a network viewpoint. These functional terms are presumed to be significantly mediated by SMYD3 through its DEGs and could also be applied to explore the molecular roles of SMYD3 in ESCC tumor initiation and growth. The two terms of GO:0003682 of 'chromatin binding' and GO:0000979 of 'RNA polymerase II core promoter sequence-specific DNA binding' have three repeated genes of GATA3, TP53, and YAP1 with fold changes of À2.71, À2.02, and À2.08, respectively, following knockdown of SMYD3 in ESCC. Yes-associated protein 1, a key gene in the Hippo signaling pathway, is a crucial regulator pervasively activated in human malignancies [36]. High levels of nuclear YAP1 are correlated with increased chromosome instability and aneuploidy in hepatocellular carcinoma [37]. In breast   between cis-regulatory elements [38]. In this light, our data suggest that YAP1 and GATA3 are important target genes for future analysis on the impact of SMYD3mediated regulation of tumor-associated genes.
To better understand the biological roles of the DEGs through the interactions with their protein partners, the total DEG PPI network was subjected to functional enrichment annotation by GO, which was also illustrated by a network. We show that the total DEG PPI network, perturbed by the knockdown of SMDY3, involves various biological activities, including the acknowledged and potential functions of SMDY3. Interestingly, the functional enrichment annotation map contains several cell motility-related GO terms, for example, 'cell-substrate adhesion' and 'cell-matrix adhesion', indicating a function for SMYD3 in cancer cell metastasis. Direct evidence for SMDY3 participation in cancer cell metastasis has been found in our previous report from ESCC [14], as well as in bladder and colon cancer in vitro and in vivo [39], suggesting that SMYD3 is one of the key players stimulating migration and invasiveness of these cancer cells. On the other hand, SMYD3 is able to regulate cell signal transduction, cell cycle, and various biological effects, except the well-known gene expression regulation, through a cascade of PPIs. The direct role for SMYD3 in the regulation of signal pathways has been reported, as SMYD3 mediates the methylation of MAP3K2 at lysine 260, which potentiates activation of the Ras/Raf/MEK/ERK signaling module [40]. These results provide critical clues to explore the multiple functions of SMYD3 in the future.
Studies indicate that classical signaling pathways are composed of a series of genes or proteins, each linked by the order involved in signal transduction and response [41]. We presume that the many canonical and noncanonical signals are also transduced by sequential protein interactions, arrayed in the proper layers. On the other hand, the roles or functions of the protein might vary according to its subcellular localization. For example, proteins located in the plasma membrane are primarily involved in cell adhesion, cytoskeleton, and cell signaling, whereas in the nucleus, proteins are mainly involved in transcription, ribosomal assembly, or chromatin remodeling [42]. SMYD3 protein distributes both in cytoplasm and in nucleus or translocates from the cytoplasm into the nucleus, enabling its multiple functions in different subcellular localizations [43]. Based on subcellular localization information, a pathway-like view of total DEG PPI network was created, displaying the cellular locations of proteins and making it easier to understand the direction of information flow.
In summary, knockdown of SMYD3 causes the altered expression of its target genes, indicated by the PPI network, to directly or indirectly affect the signaling of extracellular membrane-cytoskeleton/cytoplasm-nucleus cascades, causing the altered expression of other DEGs, and consequently cause alterations in cell cycling, signal transduction, invasion, and metastasis.

Data Accessibility
Both raw and treated expression data described in this study have been submitted to the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/) and assigned the accession number GSE85419.