Modeling and analyzing gene co-expression in hepatocellular carcinoma using actor-semiotic networks and centrality signatures.

Primary hepatocellular carcinoma (HCC) is currently the fifth most common malignancy and the third most common cause of cancer mortality worldwide. Because of its high prevalence in developing nations, there have been numerous efforts made in the molecular characterization of primary HCC. However, a better understanding into the pathology of HCC required software-assisted network modeling and analysis. In this paper, the author presented his first attempt in exploring the biological implication of gene co-expression in HCC using actor-semiotic network modeling and analysis. The network was first constructed by integrating inter-actor relationships, e.g. gene co-expression, microRNA-to-gene, and protein interactions, with semiotic relationships, e.g. gene-to-Gene Ontology Process. Topological features that are highly discriminative of the HCC phenotype were identified by visual inspection. Finally, the author devised a graph signature-based analysis method to supplement the network exploration.


Introduction
Primary hepatocellular carcinoma (HCC) is the fi fth most common malignancy and the third most common cause of cancer mortality worldwide with one million new cases diagnosed annually. Its prevalence is much higher in developing nations than in industrialized nations. At present, 80% of the HCC cases came from the East Asia and the sub-Saharan Africa with China accounting for nearly 55% of them [1]. For this reason, there have been numerous efforts made in the molecular characterization of primary HCC. As a result, there is a rich repository of genomic and proteomic data available for public access [2]. To uncover the biology hidden within such a large volume of data will require softwareassisted network modeling and analysis (reviewed in [3]). In recent years, attempts to characterize disease phenotypes by integrative network modeling and analysis have been made. For example, Tuck et al. [4] retrieved the human gene regulatory network from the TRANSFAC ® database and integrated it with the transcription factor-to-target genes co-expression network derived from multiple microarrays. They then demonstrated that node degree measures are a feasible discriminator of oncology types. Chuang et al. [5] characterized proteomic sub-networks as the biomarkers for discriminating between metastatic and non-metastatic breast cancer. They demonstrated that the protein sub-networks identifi ed are highly discriminative of metastasis and some of the genes underscored by statistical inference methods were found to be member nodes of those sub-networks. These studies demonstrated the effectiveness of network modeling and analysis. This paper presents the author's fi rst attempt in exploring the biological implication of gene coexpression in HCC using actor-semiotic network modeling. The rationale was that a complex network requires context or metadata to be comprehensible. Without which, no human user would be able to unpack the information content within, let alone making biological deductions. The proposed actorsemiotic network is similar to the actor-network [6] frequently used for modeling healthcare systems. Actor-network theory models the human community as a network of heterogeneous actor-semiotic interactions. The actors are human participants, human organizations, and material objects. The semiotics is the human ideas, concept, and policies. In molecular biology, the actors are the bio-molecules and the sub-cellular components. The semiotics is the human understanding of biology. Its abstraction is the ontologies on biological processes, molecular function, and cellular phenotypes.
Because the topology of an actor-semiotic network is determined by the combination of inter-actor and semiotic relationships, there should be visually identifi able topological features that are highly discriminative of the HCC phenotype. To achieve this, the author employed visual inspection and, in addition, a graph signaturebased analysis method to supplement network exploration. This method fi rst summarized the local topology of every node in the network as a signature vector and then projected the vectors onto a two-dimensional scatterplot for further exploration.

Visual analysis
Using NetMap Decision Director™, an actorsemiotic network G (|V| = 9313; |E| = 49,393) was being constructed. G was a union of all the actor and semiotic nodes and edges described in section 6.2. The bio-molecules and the sub-cellular components within G were represented by the actor nodes whereas the biological context of G was represented by the semiotic nodes (see Appendix A.1). The pairwise interactions between bio-molecules or between bio-molecules and sub-cellular components were represented by the inter-actor edges. The ontological relationships between actor and semiotic nodes were represented by the semiotic edges.
A smaller network G' (|V| = 1668; |E| = 2473) was derived from G as a result of node mapping (see Appendix A.2 and Fig. 1). To test whether G' contained a set of nested networks, it was decomposed to G d using NetMap™. The nested networks observed in G d are discrete clusters. Let C k be one of the clusters, then G d = {C k } where 0 Ͻ k Յ 32 (Fig. 2). Each cluster is a network that is not connected to any other clusters nor does it share any of its member nodes with other clusters, such that Although their size |V| ranged from 2 to 1536, only one cluster had a |V| of 1536. The rest had a |V| that ranged from 2 to 7. Among the inter-actor edges in the small clusters (1 Ͻ |V| Ͻ 8), only 10 were of the Coexpression_ HCC subtype and 24 were of the Coexpression_liver subtype. It showed that most co-expressed genes, whether in the normal hepatocyte or in HCC, are highly inter-connected. Eight of the small clusters contained only semiotic edges. For the small clusters that contained at least one inter-actor edge, the semiotic nodes indicated that the protein-coding genes within each cluster shared the same biological process or molecular function (Fig. 2).
From the largest cluster G e (|V| = 1536; |E| = 2367) in G d , the largest connected component G e ' (|V| = 1371; |E| = 1120) was extracted (Fig. 1). G e ' was comprised of inter-connected emergent groups and liaison nodes (Fig. 3). An emergent group is a sub-network in which its member nodes are more inter-connected within than without. A liaison node is a node shared by multiple emergent groups. The emergent groups were localized in the top half and the liaison nodes in the lower half of G e '. Of the 41 emergent groups, six of them had a |V| larger than 25. The semiotic edges within each emergent group indicated that it belongs to a specifi c biological process showing that the topology of the integrated co-expression and protein interaction network in HCC is partially modular. In a sense, each emergent group is similar to the Complex Biological Module proposed by Zotenko et al. [30]. Some emergent groups, e.g. groups 4 and 5, are directly linked to one another suggesting that the coupling between their corresponding biological processes could be hard-wired. Some, e.g. groups 2 and 3, are connected via liaison nodes, MAPK1 and MIRN217, suggesting that the coupling between their corresponding biological processes could be switch-dependent.

Network exploration using graph signatures
The eccentricity and radiality centralities were found to give identical rankings. The same was also observed with the HITS-Authority and HITS-Hub centralities. Therefore the radiality and the HITS-Hub centralities were excluded from the signature vector of each node. After the signature vectors were computed and scaled, the scatterplot shows that there are two clusters of nodes, each representing a different range of signature vectors (Fig. 4). Nodes within the emergent groups were found in the upper cluster and liaison nodes were found in the lower cluster.
The six nodes at the left-extremity (x-range = [−1661.93, −1617.66]; y-range = [−74.14, 61.57]; Fig. 4) of the lower cluster have signatures that contained the top 5% ranking in closeness, current-fl ow betweenness, current-fl ow closeness, and shortestpath betweenness centralities. Three of these nodes GGA3, IPO7 and RAN are members of emergent group 2 (Fig. 3). They are involved in intracellular traffi cking. Another two, CTGF and CYR61 are liaison nodes involved in angiogenesis. The last one, MAPK1 is also a liaison node which is an amplifi er shared by multiple signal transduction pathways. The four nodes   Figure 2. The rank scores for the seven centrality types in each bar chart are arranged (from left to right) in this order: Degree, Closeness, Current Flow-Betweenness, Current Flow-Closeness, Eccentricity, HITS-Authority, and Shortest Path-Betweenness. A lower rank score means a higher node ranking for a particular centrality type.   Fig. 4) have signatures that contain the bottom 10% ranking in closeness, current-fl ow closeness, eccentricity, and HITS-authority centralities, and the top 10% ranking in degree, current-flow betweenness, and shortest-path betweenness centralities. They are COX17, NDUFS1, and DLD. Each of these nodes was a junction to two small subsets of nodes with each subset containing a maximum of three nodes. Yet at least one node within each subset was connected to another two nodes without. COX17 and NDUFS1 are subunits of the mitochondrial electron transport chain with the latter being another subunit of the ubiquinone complex. DLD is a subunit of the pyruvate dehydrogenase complex. The semiotic node at the top corner of the upper cluster (x = [225.65]; y = [1223.89]; Fig. 4) is GO:0050672 (negative regulation of lymphocyte proliferation). It had a signature that contain the bottom 10% ranking in degree, current-fl ow betweenness, and shortestpath betweenness centralities, the top 10% ranking in closeness and eccentricity centrality, and the top 5% ranking in HITS-authority centrality.
In summary, the ranking of all centralities decreases as one moves to the right end of the x-axis in the scatterplot. On the other hand, the node ranking on degree, current-fl ow betweenness, and shortest-path betweenness centralities increase as one moves to the lower end of the y-axis but at the same time, the rankings on closeness, currentfl ow closeness, eccentricity, and HITS-authority centralities decrease. The rank score of those nodes mentioned in this paper are tabulated in Table 1.

Inference of HCC Biology
Based on the visual exploration of network G e ' and the inspection of the scatterplot, the author deduced several hypotheses on the molecular pathology of HCC as described in the following sections. Since cell cycle events have been well studied in recent years, emergent group 3 was used to demonstrate that the actor-semiotic network is a model consistent with the current knowledge on cell proliferation. MicroRNAs have recently been discovered as new players in regulating oncogenic signal transduction. In section 3.2, the author hypothesized the infl uence of MIRN18A on angiogenesis in HCC and how this could contribute to tumor invasiveness. Also gaining attention lately is the role of intracellular trafficking in establishing the malignant phenotype. In section 3.3, the author hypothesized the possible effect of nuclear export disruption on growth factor-induced gene regulation.

De-synchronized cell cycle phases
The semiotic nodes in emergent group 3 indicated that it contains exclusively cell cycle genes (Fig. 3). Their co-expression was found only in HCC and could be a result of replication stress. Within this emergent group, UBE2C has the highest node degree centrality. Of interest, UBE2C up-regulation has frequently been observed in a variety of malignancies including HCC [2,7]. UBE2C was found to link with three semiotic nodes, GO:0007051 (spindle organization), GO:0008054 (cyclin catabolism), and GO:0031536 (positive regulation of mitotic exit), as compared to only one or two seen among its co-expressed neighbours. Hence UBE2C is functionally more diverse but still operates exclusively within the cell cycle. This agrees with the consensus that UBE2C, an E2-ubiquitin conjugating enzyme, is a subunit of the anaphase promoting complex (APC/C) which mediates substrate ordering [8]. Substrate ordering refers to the proper sequence of protein ubiquitination that ensures the orderly degradation of different proteins during cell cycle progression. It has been known that APC/C inactivation is mediated by UBE2C auto-ubiquitination, a result of UBE2C up-regulation [9]. If this up-regulation is persistent in HCC, APC/C inactivation could be prolonged beyond the S phase. One probable effect would be the reduction in cyclin catabolism which could lead to a shortened G1 phase and a prolonged S phase due to the disruption in DNA replication [10]. With the loss of substrate ordering, the author hypothesized that the cell cycle phasing would be de-synchronized. Molecular events that are S phase specifi c could co-exist with those in the G2 and M phases. Eventually, mitotic exit could be delayed or even failed.

Abnormal angiogenesis
CYR61 (CCN1) and CTGF (CCN2) were found to co-express with TGFB1 in HCC only. Both belong to the CCN family of immediate early genes activated by TGFβ1 [11] and by hypoxia [12]. Previous work suggested that CYR61 induces endothelial cell proliferation, cell adhesion, and angiogenesis through the activation of integrin (ITGAV-ITGB3 complex) expression [13]. CTGF induces the secretion of collagen and fi bronectin which form the scaffolding of the extracellular matrix, a step crucial to the formation of a The actor nodes are listed in the alphabetical order of their gene symbols. The semiotic nodes are listed in the alphanumerical order of the Gene Ontology ID. The rank score for each centrality type ranges from 1 to 1372. A lower rank score means a higher node ranking for a particular centrality type.
neo-vasculature [14]. That explained why it is directly linked to COL6A1 and COL6A3 in emergent group 7. As shown in the lower right inset of Figure 3, CTGF is a predicted target of MIRN18A. This microRNA gene, which has been found to express in some Japanese HCC patients, is both liver-and tumor-specifi c [15]. The author hypothesized that the expression of MIRN18A in HCC could lead to matrix instability due to the reduced translation of CTGF transcripts. The dynamics of angiogenesis could therefore be altered if the molecular abundance of CYR61 is higher than that of CTGF. One consequence could be excessive endothelial cell migration and proliferation but inadequate cell anchorage due to an unstable extracellular matrix and hence poor tubular formation. Tumor vasculature is known to be structurally chaotic with excessive leakage [16] and MIRN18A expression could be a contributing factor. This may enhance HCC metastasis in two ways. The fi rst could be enhanced tissue invasion by MMPs induced by the hepatitis B viral oncoprotein HBX in malignant cells [17]. The second could be the intravasation of malignant cells into the neovasculature but also rapid extravasation to the surrounding tissue because of vascular leakage.

Disrupted nuclear transport
IPO7 and RAN were found to co-express not only with each other but also with nine other proteincoding genes (emergent group 2; Fig. 3). The semiotic nodes indicated that they are all involved in intracellular traffi cking. Their co-expression occurred only in normal hepatocytes suggesting that intracellular traffi cking could be aberrant in HCC. One possible cause could be the disruption of nucleocytoplasmic traffi cking by HBX. Specifi cally, HBX disrupts nuclear export by sequestrating the export receptor XPO. Furthermore, the nuclear import and export processes require the GTPase protein RAN. It controls the interaction of XPO and of the importin receptor IPO7 with their target proteins [18]. If the majority of the XPO in HCC is being inactivated by HBX, it is possible that there will be a surplus of RAN available for mediating nuclear import by IPO7.
Recent fi ndings revealed that many growth factors, e.g. CTGF, CYR61, EGF, FGF, IFNG, and their cell surface receptors can be endocytosed, then imported into the nucleus by importin receptors, and eventually exported by exportin receptors (reviewed in [19]). Within the nucleus, they interact with various transcription factors, e.g. E2F1 and STAT3, or co-regulators [20]. Apart from regulating the transcription of specifi c target genes, they could also be involved in DNA replication [21] and repair [22], and RNA metabolism [23]. Therefore the author hypothesized that the HBX-induced imbalance between nuclear import and export volumes could prolong growth factor activities inside the nucleus. Already, there have been studies suggesting that, at least for FGFs and EGFs, prolonged nuclear localization is correlated with cancer progression, resistance to radiotherapy and consequently poor prognosis [24].

Strength and limitations of network analysis
Network analytics is very suited to biomedical research where high informational granularity and connectivity between objects are required for knowledge inference. However, the scale of the network often presents a cognitive challenge to the analyst. This limitation is partly moderated with the use of NetMap™ which allows the analyst to downsize a large network (|V| Ͼ 5000; |E| Ͼ 5000) by excluding nodes and edges selectively and then extract any sub-networks for further analysis. The 2D-projection of graph signatures further moderates the challenge of scale by providing a visual summary on the surrounding topology of every node in the form of a scatterplot. Using the latter as a guide, the analyst can then prioritize the nodes that need to be inspected fi rst. At present, the author is testing this approach with networks that contained human disease terms [25] and cellular quiescence phenotypes [26] as semiotic nodes to see if one can discover more insights into the molecular pathology of HCC.

Biological implication of node centrality
There have been several views on how node centralities signify the biological essentiality of a protein. The fi rst view took degree centrality as the primary indicator of biological essentiality because high degree protein nodes, also known as hubs, are essential for maintaining network connectivity [27]. The second view argued that shortest-path betweenness centrality is a better indicator of essentiality [28]. This view suggested that bottleneck proteins linked to multiple protein hubs are also biologically essential. The positive correlation between node degree and biological essentiality has been confirmed recently [29,30] but the original rationale has been challenged [30]. Zotenko et al.'s [30] proposition was that the hubs are essential because they form modules in which the member proteins are highly inter-connected and share a common biological function. They named the module as Essential Complex Biological Module (ECOBIM) because it is enriched in essential proteins. Furthermore, the authors demonstrated that current flow betweenness and shortest-path betweenness centralities are better indicators of connectivity, thus supporting the second view. So far, the above hypotheses were deduced from the yeast protein interaction network [27,28,30] and the human disease gene network [29] but how do they contribute to the current understanding of cancer biology?
The fi rst view seemed to agree with the recent suggestion that it could take three mutated genes or fewer to induce early stage malignancy [31] since some well studied cancer genes, e.g. APC, TP53, PTEN, and CDKN2A, have a degree centrality greater than 20 (see Fig. 2 in [32]). Further supporting evidence is that these genes are known to associate with familial cancers [33]. However, Goh et al. [29] demonstrated that the vast majority of disease genes do not encode proteins high in degree centrality and therefore are not essential except for diseases that are fatal in utero. If applied to oncology, that will suggest that carcinogenesis does not necessarily involve genes (or proteins) of high degree centralities.
In the network G e ', the author observed that protein-coding genes that rank within the top 2% in degree centrality are not necessarily highly ranked in betweenness centralities. The best example comes from emergent group 1 in which RPL6, RPL9, RPL14, RPL15 and RPL31 rank within the top 2% in degree centrality but rank below 149th in current-fl ow centrality and rank below 100th in shortest-path betweenness centrality (Table 1). These genes are essential because RNA biosynthesis is fundamental to viability. The deletion of any one gene will affect the connectivity within the emergent group 1 more than without. This observation is in agreement with Zotenko et al.'s view. On the other hand, genes that rank within the top 10% in degree centrality and also within the top 5% in closeness, current-fl ow closeness, current-fl ow betweenness, and shortest-path betweenness centralities, are involved in signal transduction or intracellular traffi cking suggesting that they could be the key drivers of disease progression if not carcinogenesis. Some of these proteins, e.g. CXCR4, RAN and IPO, are not only nodes within individual emergent groups but are also connected to liaison nodes and nodes of other emergent groups. Furthermore, a few signal transduction proteins, e.g. CXCR4 and MAPK1, have degree centralities that rank within the top 2% and their current-fl ow betweenness and shortest-path betweenness centralities ranking within the top 1%. They are likely to be signaling hubs [32]. Therefore, genes involved in HCC can have a high degree centrality but they can also serve as bottleneck proteins to multiple emergent groups. This deduction further refi nes Goh et al.'s proposition.
Thus far, none of the microRNA nodes found in G e ' are emergent group nodes but are liaison nodes. Their degree centralities rank between 252nd to 1314th with a median ranking of 569th. If projecting from Goh et al.'s and Zotenko et al.'s proposition, microRNAs are non-essential implying that their deletion may not be lethal but can contribute to abnormalities. Of the 15 microRNAs in G e ', four of them rank within the top 3% in closeness centrality. They are MIRN148A, MIRN148B, MIRN217, and MIRN375. The fi rst two also rank within the top 2% in eccentricity. In addition, MIRN217 rank within the top 2% in shortest-path betweenness centrality and MIRN375 rank within the top 3% in current-fl ow betweenness and shortest-path betweenness centralities (Table 1). They share the common topological feature of being connected to liaison nodes on one side and emergent group nodes on the other side. Their ranking in the betweenness centralities seems to depend on the number of interaction partners and the node degree of each interaction partner. Based on the visualized topology and centrality rankings, it is reasonable to hypothesize that microRNAs which target signal transduction proteins or transcription factors of high degree, closeness, and betweenness centralities will exert the highest impact on the regulation of gene expression. This deduction seemed to agree with Cui et al.'s [34] proposition that the expression of the output layer genes in the signaling network is heavily regulated by microRNAs. Because the signal transduction network is inter-connected with the gene regulatory network [35], some proteins at the output layer could be bottlenecks that bridge the two networks and therefore are most likely to have high degree centralities as well as betweenness centralities.

Conclusion
The use of actor-semiotic network modeling and analysis does provide insight into the pathology of HCC. Although the inclusion of semiotic nodes increases the size of a network, they are useful for identifying discrete clusters or emergent groups that serve a particular biological process or a set of inter-related molecular functions. The provisions of network decomposition and sub-network extraction functionalities by NetMap™ facilitated the 'top down' exploration of a large graph. The use of graph signatures further facilitated network exploration by providing a summary of node topologies in a form of a scatterplot.

Data sources
Gene expression data. The gene co-expression profi les of HCC and normal hepatocytes were obtained from Gamberoni et al. [36] which was derived from the original dataset published by Chen et al. [37]. A set of co-expressed genes from each sample set (normal hepatocyte or HCC) was extracted based on their Pearson's correlation coeffi cients (PCC Ն 0.86). This level of correlation, according to the random matrix theory, should be adequate for differentiating between the true co-expression modules and random noise [38].
MicroRNA expression data. The microRNA expression data of HCC and adjacent normal hepatocytes was published by Murakami et al. [15]. The predicted microRNA target genes were curated from three publications [39][40][41].
Gene Ontology. The three categories of GO-Component, Process, and Function, were obtained from the Gene Ontology Consortium [42].
Human proteome data. The canonical human proteomic interaction data was obtained from the BioGrid version 2.0.36 [43]. This was integrated with the Hepatitis B-to-human proteomic interaction data obtained from the NCBI Gene RIF.

Data-to-network mapping
A relational database was constructed for storing the above datasets. Data for the edges were stored in four tables with each storing data of a specifi c edge type. The mapping of data to nodes and edges was done with the use of NetMap Decision Director™. The actor nodes are GO Component, Gene, MIRNA, and Protein. The semiotic nodes are GO Process and GO Function. The semiotic edges are of the type Gene_To_GO (Process or Function). Inter-actor edge types are Gene_To_GO (Component), Gene_To_Gene, miRNA_To_Gene, and Gene_To_Protein. Gene_To_Gene has two subtypes: Coexpression_HCC and Coexpres-sion_Liver. Gene_To_Protein also has two subtypes: Human_Protein_ Interaction and HBV_ Human_Interaction.

Network visualization and interactivity
The visualization for the networks described in this paper was generated with the use of NetMap™. The software also allows the analyst to (1) decompose a large graph into a set of discrete clusters; (2) extract the largest cluster and identify its largest connected component; (3) decompose the largest connected component to inter-connecting emergent groups; (4) navigate from point-to-point within each network; and (5) search nodes by Gene Symbols or GO identifi ers.

Emergent groups
The identifi cation of emergent groups was completed by a proprietary pattern recognition algorithm embedded in NetMap™. These groups are so named because they emerge out of a given set of pairwise relationships. Hence, in a biological or social network, emergent groups are network structures that emerge out of local interactions [44]. The NetMap™ algorithm was employed to examine the topology and the edge types of the relevant network and emergent group nodes were identifi ed based on three criteria: Given an emergent group C e (V e , E e ), • |V e | Ͼ 2 • E e = V e × V e such that |E e | Ͼ 2.
• Each node v ∈ V e has at least 50% of its edges connected to other nodes within C e .
Under these criteria, C e often appears as a subnetwork of high curvature which is the local density of triangular relations. Given that the curvature of a node, curv(v), is defi ned as: where curv(v) = [0, 1], t is the number of triangles, and n is the number of neighbours to node v [45], curv(v)→1 in C e .

Centrality measures
Node centralities are metrics for measuring the connectivity pattern of a node in relation to its surrounding neighbours. In this study, nine types of node centralities were calculated using CentiBiN [46]. They are closeness, current-fl ow betweenness, current fl ow closeness, degree, eccentricity, HITS-authority, HITS-hub, radiality, and shortestpath betweenness centralities. The rationale behind each measure can be found in [47].

Signature vectors
After computing each node centrality type, the nodes were ranked in the descending order of their centrality values. The node with the highest value for, say degree centrality, would be assigned a rank score of 1. Hence the lower is the rank score, the higher is the node ranking for a certain centrality type. This step generated a column vector R = [c i ] for each centrality type in which each entry c i is the rank score for node i. The iteration of the previous step generated a set of column vectors S = (R 0 , R 1 , …, R j ) which formed the matrix M = [c ij ] in which each entry c ij is the rank score for node i of the centrality type j. The node i can be an actor or a semiotic node. The signature vector V i for node i is defi ned as V i = (c i0 , c i1 , …, c ij ) which is the row i of M.