Introduction

Head and Neck Squamous Cell Carcinoma (HNSCC) is the sixth most common cancer worldwide, with more than 600000 new cases per year1. Among these, the oral tongue squamous cell carcinoma (OTSCC) is the most prevalent cancer, with high incidence of metastasis to the lymph nodes of the neck2,3 being responsible for a decrease in the overall survival rates by nearly 50%4,5,6.

Even though considerable research efforts, so far there is as yet no clear consensus about the genetic alterations that underlie nodal metastasis and the metastatic process itself.

As a strategy to identify major patterns of expression, next generation sequencing (NGS) has been used to explore not only the genetic heterogeneity and gene expression of diverse types of cancer, but also those aspects related to tumor progression. In the present work we have examined the applicability of high throughput gene expression analyses as a resource to investigate major alterations in expression patterns of tumor cells as they increasingly acquire a metastatic phenotype. Besides, such an approach can reveal novel biomarkers of OTSCC.

To that end, we used as study model SCC-9 primary tumor (ATCC CRL-1629) as well as 4 cell lines derived previously established by Agostini and collaborators7: ZsG, SCC9-transduced with a green fluorescent protein; and 3 metastatic cell generations carrying the fluorescent protein. The cells displaying increasing invasive properties were referred to as LN1, LN2 and LN37,8.

Frequently, the differentially expressed genes (DEGs) are selected by either t-test or post-test corrections, and subsequent analyses are made based on one of these selections. However, the quality of the results depends on the number of genes analyzed, based on cut-offs of fold change and statistical significance9. In the present study, we compared the impact of the standard statistics cut-offs on the analyses. In this manner we initially identified the DEGs using the student’s t-test (p-values) and its false discovery rate (FDR) correction (q-values). Then, all DEGs were divided according to the standard p- and q-values, setting the cut-off α = 0.05. DEGs were sorted as protein coding genes (PCG) and non-coding genes (NCG). Using the PCG, we obtained the related KEGG pathways and classified them according to the 8 consensual hallmarks of cancer: (i) auto-sustained proliferative signaling, (ii) ability to evade growth suppressors, (iii) mechanisms to resist cell death, (iv) enabling of replicative immortality, (v) angiogenesis induction, (vi) invasion and metastasis capacity, (vii) shift of energy metabolism, and (viii) evasion of immune destruction10. We found 71 KEGG pathways related to other cancer types and chronic diseases. Based on the assumption that individual genes may take part in more than one pathway, the approach involving the hallmarks allowed us acknowledged alternative functions for each gene considered.

Furthermore, in undertaking to extend the observations pertaining to the individual contributions (absolute and relative) to invasion and metastasis, of genes sorted according to the hallmarks of cancer and including the two extra classes proposed by us, we matched the KEGG pathways associated to each hallmark against the 77 KEGG pathways of the invasion and metastasis category.

The results indicated that the PCGs of DEGs involved in angiogenesis and immune destruction evasion displayed the highest contribution to metastasis in OTSCC for both p- and q-values data. In contrast, PCGs related to energy metabolism and other cancer types represented less the relative contributions to invasiveness. Energy metabolism was the second hallmark with highest number of shared genes with invasion and metastasis, while evading immune destruction was the first. The hallmark “other cancer types” had the lowest contributions to progression towards metastasis for both p- and q-values data. The results highlighted a strong correlation between the data analyses of uncorrected (Student’s t–test [p-values]) or corrected (FDR [q-values]) and their patterns of contributions to invasiveness.

The comparative analyses were applied to all the 11 types of gene regulation that DEGs possibly display, referred as clusters of gene expression (CoGE): (i) exclusive to parental cell line, (ii) exclusive to derived cell line, (iii) continuum (similar expression values), (iv) exclusive down regulated in parental cell line, (v) exclusive up regulated in parental cell line, (vi) exclusive down regulated in derived cell line, (vii) exclusive up regulated in derived cell line, (viii) common down regulated, (ix) common up regulated, (x) common down regulated in parental and up regulated in derived cell lines and (xi) common up regulated in parental and down regulated in derived cell lines. By comparing those clusters, we found 26 DEGs sequentially altered in the OTSCC model. Of these, 15 were down regulated; 1 was up regulated and 10 displayed only slight modifications of expression (continuum). Also, CoGE analyses showed differences between proliferation, metabolism and the mechanisms related to promotion of the metastatic process. This set of 26 genes may constitute biomarkers of OTSCC metastasis.

Results

The total amount of partial RNA sequences exceeded 100 million reads with 50–250 bp for each lineage. The experimental design, number of reads and number of human genes mapped are shown in Supplementary Table 1.

Tongue cancer differentially expressed genes represent almost all of the reported human KEGG pathways

In order to understand main changes on gene expression along with the metastatic OTSCC progression we used the software Cufflinks to map the reads of the known human genes. The relative abundance metric parameter FPKM (Fragments Per Kilobase of exon per Million reads sequenced) was used to represent the value of gene expression on each dataset analyzed. To detect DEGs, we applied the Student’s t-test (p-values), and then, the FDR correction (q-values). The quality of the results depends on the amount of genes analyzed, which in turn is based on cut-offs and statistical significance9. This justified the use of both data, p- and q-values. We compared the expression of genes between the parental versus its derived cell line. Regarding the p-value, 9169 DEGs were found between SCC9 vs. ZsG; 11597 between ZsG vs. LN1, 5011 between LN1 vs. LN2, and 8572 between LN2 vs. LN3. Regarding the q-values, 6728 DEGs were found between SCC9 vs. ZsG; 9874 between ZsG vs. LN1, 284 between LN1 vs. LN2 and 5579 between LN2 vs. LN3 (Supplementary Table 2). Hereafter, only the identified DEGs will be discussed.

Expression ratios were obtained between parental and derived cell lines. Based on that, a ranking of all DEGs was produced (Supplementary Table 3). SCC9 cell line was used as the reference for ZsG transformed cells, however the following comparisons of transformed cell lines (TCLs) (ZsG vs. LN1, LN1 vs. LN2 and LN2 vs. LN3) will be discussed. Down and up regulated genes were identified for each comparison (Fig. 1A), and by matching those differentially expressed genes, we could classify them into 11-different clusters of gene expression (CoGE). These strategy allowed the identification of regulatory patterns among parental-derived changes on gene expression: (i) exclusive parental genes (FPKM > 0 in parental cell line and FPKM = 0 in derived cell line); (ii) exclusive derived genes (FPKM > 0 in derived cell line and FPKM = 0 in parental cell line); (iii) continuum (referring to those genes whose expression did not change significantly between parental and derived cell line displaying FPKM ratio 0.8 ≥ x ≥ 1.2. Differences of 0.1 are common even in technical replicates, but differences of 0.3 are considered as significant variations. With that in mind, the value of 1 ± 0.2 was set as the limit for the continuum cluster. A similar approach was used for human neoplasms11); (iv) exclusive parental down regulated (FPKM ratio < 0.8); (v) exclusive parental up regulated (FPKM ratio > 1,2); (vi) exclusive derived down regulated (FPKM ratio < 0.8); (vii) exclusive derived up regulated (FPKM ratio > 1.2); (viii) common down regulated; (ix) common up regulated; (x) common parental down regulated and derived up regulated; and (xi) common parental up regulated and derived down regulated (Table 1 and Fig. 1B).

Figure 1
figure 1

Regulation of the differentially expressed genes. (A) Exclusive, down regulated, continuum and up regulated for p- and q- values. (B) Clusters of gene expression after comparison of the regulated genes from parental and its derived cell lines. In white, DEGs without regulation of expression (exclusive to parental or to derived cell lines and continuum), in blue, down regulated DEGs and in red, up regulated DEGs; light purple, intersections between down regulated and up regulates DEGs.

Table 1 Number of protein coding genes (PCG), KEGG pathways (Path) and non-coding genes (NCG) differentially expressed for each cluster of gene expression between parent-derived cell lines of tongue metastatic progression, for p- and q-values.

Using STRING12, we further categorized the DEGs for both p-values and q-values into protein coding genes (PCG) (Supplementary Tables 4A and 4B, p- and q-values, respectively) and non-coding genes (NCG), (Supplementary Tables 5A and 5B, p- and q-values, respectively). Also, STRING allowed the enrichment of the interactome of all differentially expressed PCG with KEGG pathways by assigning the molecular or biochemical pathways related to them (Table 1 and Supplementary Table 6). Supplementary Tables 6A (p- values) and 6B (q- values) show the KEGG pathways and genes related to each hallmark of cancer, related to each CoGE.

Table 1 reveals that there was no correlation between the number of PCG and KEGG pathways, for both p- and q-values. Concerning the KEGG pathways, the highest was 284 (corresponding to LN1 vs. LN2, exclusive parental down regulated, for both p- and q-values). This observation raised the question of how many KEGG pathways are related to the cDNA of the human genome. Using all the 35238 annotated genes, we followed the same procedures, obtaining 292 KEGG pathways related to 7858 PCG (Supplementary Table 6C). With this approach almost all the pathways related to the human genome were included in our analysis. However, two of these were not detected: D-arginine and D-ornithine metabolism, and fatty acid elongation in mitochondria. The same strategy was applied to all 3717 DEGs found in our analyses of q-values (Supplementary Table 6D).

KEGG pathways and protein-coding genes related to ‘Energetic metabolism’ and ‘Invasion and metastasis’ hallmarks were the most represented

All 292 KEGG pathways of the human genome were distributed according to the 8 hallmarks of cancer: (i) auto-sustained proliferative signaling, (ii) ability to evade growth suppressors, (iii) mechanisms to resist cell death, (iv) enable replicative immortality, (v) angiogenesis induction, (vi) invasion and metastasis capacity, (vii) shift of energy metabolism and (viii) evasion of immune destruction10, using the PubMed database by manual curation. Because (ix) other cancer types and (x) chronic diseases we related to 71 KEGG pathways, these were added to the 8 hallmarks of cancer. Energy metabolism was the hallmark with the highest number of KEGG pathways (123), followed by invasion and metastasis (77), chronic diseases (52), proliferative signaling (39), resisting cell death and evading immune destruction (37), angiogenesis (24), evading growth suppressors (22), other cancer types (21) and replicative immortality (18).

The characteristic chosen to compare all other hallmarks was invasion and metastasis. Of the 77 KEGG pathways related to invasion and metastasis, 25 were shared with energy metabolism, 22 with immune destruction evasion, 18 with proliferative signaling, 13 with cell death resistance and angiogenesis, 11 with growth suppression evasion and 3 with replicative immortality. Also, other cancer types and chronic diseases pathways were compared to those belonging to the invasion and metastasis hallmark. We found 1 and 0 shared pathways respectively, as depicted in Fig. 2A, upper left panel.

Figure 2
figure 2

Contributions of each hallmark of cancer to invasion and metastasis, based on the KEGG pathways from protein coding genes. The intersections show the number of KEGG pathways (upper left panels) or genes (upper right panels). (A) All 292 KEGG pathways of the human genome (upper left panel) or genes related to 35238 human cDNAs annotated in Ensemble (upper right panel) and the distribution of the differentially expressed genes in all-3 comparisons of transformed cell lines (bottom panels, ZsG vs. LN1 left, LN1 vs. LN2 middle and LN2 vs. LN3 right), for p-values; (B) all 292 KEGG pathways of the human genome (upper left panel) or related to 3717 differentially expressed genes after FDR correction found in this study (upper right panel) and the distribution of the differentially expressed genes in all 3 comparisons of transformed cell lines (bottom panels, ZsG vs. LN1 left, LN1 vs. LN2 middle and LN2 vs. LN3 right) for q-values. In upper left panels were highlighted (in red) the changes of the number of KEGG pathways found in the transformed cell lines comparisons.

When the same comparison was carried out for p-values, a similar distribution was found except for an exclusive pathway of energy metabolism: 97 out of 98, when LN2 and LN3 were compared (Fig. 2A, upper panel, in red). For q-values, 23 of 24 exclusive pathways were assigned to the category of resistance to cell death when LN2 and LN3 were compared. There was a decrease in energy metabolism exclusive pathways, for the comparisons LN1 vs. LN2 (97 out of 98) and LN2 vs. LN3 (95 out of 98); and 51 out of 52 chronic diseases pathways when LN2 vs. LN3 were compared (Fig. 2B, upper panel, in red).

In order to evaluate the absolute contributions of the genes associated to invasion and metastasis, the same approach was followed, using the related DEGs of each KEGG pathway identified by STRING. First, we compared all 35238 human genes against invasion and metastasis. The highest number of PCG was related to invasion and metastasis (3712) in which 1542 were shared with evasion of immune destruction followed by 1536 with energy metabolism, 1310 with proliferative signaling, 1126 with resistance to cell death, 1081 with angiogenesis, 1028 with chronic diseases, 854 with evading growth suppressors, 590 with replicative immortality and 460 with other cancer types, with which it shared only one KEGG pathway (Fig. 2A, upper right panel).

We used the same approach for p-values, finding for ZsG vs. LN1, 2206 DEGs related to invasion and metastasis. For LN1 vs. LN2, there were 2026 DEGs and for LN2 vs. LN3, 1745 DEGs (Fig. 2A, bottom panels right, middle and left, respectively). All 3 TCLs comparisons displayed the same pattern of contributions to invasion and metastasis.

The same analyses were carried out for q-values. For that we used all 3717 DEGs found in our outcomes. The results are shown in Fig. 2B. The left upper panel displays the DEGs related to KEGG pathways and the upper right panel displays the DEGs associated to each hallmark of cancer. Of those, 2187 DEGs were related to invasion and metastasis. The related DEGs to invasion and metastasis are shown in Fig. 2B, where ZsG vs. LN1 displayed 1977, LN1 vs. LN2 1.511, and LN2 vs. LN3, 1030 (bottom panels, right, middle and left, respectively).

‘Angiogenesis’ and ‘Evading immune destruction’ are the main hallmarks related to metastasis

Next we enquired the relative DEGs contributions of the hallmarks to invasion and metastasis. Accordingly, we compared all DEGs of each hallmark against those of invasion and metastasis, and looked for gene redundancy. This was carried out by matching each of the 11-CoGE to each of the 8 + 2 hallmarks of cancer. Panels 1 and 3 depicted in Table 2 (p- and q-values, respectively) contain the gene number of each CoGE related to each 8 + 2 hallmarks. See also Supplementary Tables 7A (p-values) and 7B (q-values), which show the KEGG pathways and their related gene IDs for each CoGE. In addition, panels 2 and 4 of Table 2 show the percentage contributions of each hallmark to invasion and metastasis, distributed into each CoGE. This information allowed us to recognize which hallmark was the closest to the invasive process. The number of genes for each hallmark is described in section 2.

Table 2 Number of differentially expressed genes (DEGs) for each comparison between the transformed cell lines in 11-clusters of gene expression into the 8 + 2 hallmarks of cancer (panels 1 and 3) and percentages of the genetic contribution of each hallmark of cancer to the invasion and metastasis process (panels 2 and 4).

Summary of cell line comparisons:

ZsG vs. LN1

The exclusive LN1 up regulated CoGE within the invasion and metastasis hallmark displayed the highest number of DEGs (500 for p-values and 494 for q-values). The lowest number of DEGs was related to other cancer types and replicative immortality. The hallmarks with highest contributions to invasion and metastasis were angiogenesis, evasion of immune destruction and evasion of growth suppressors, whereas the lowest contributions were energy metabolism and other cancer types. Results suggested that ZsG cells displayed the highest proliferative capacity and resembled “neurodegenerative diseases” more closely than LN1 cells. In contrast, LN1 cells were matched to other cancer types besides having a different cytoskeleton regulation and interactions with the extracellular matrix (ECM) than ZsG cells (Supplementary Tables 7A and 7B and Table 2).

LN1 vs. LN2

The CoGE displaying the highest number of DEGs was exclusive in LN1 up regulated concerning the invasion and metastasis hallmark for both p- and q-values, represented by 658 and 850 genes, respectively. The lowest numbers of DEGs were related to other cancer types and replicative immortality, displaying some CoGE without DEGs. The highest hallmarks contributions were evading immune destruction, angiogenesis, proliferative signaling and resisting cell death, while the lowest corresponded to energy metabolism and other cancer types. We concluded based on the KEGG pathways of each CoGE, that both LN1 and LN2 cell lines had features associated with the invasive process although resorting to different mechanisms, in which LN1 cells used inflammatory and proliferative processes to become invasive, whereas LN2 cells relied on strategies to become refractory to the immune system (Supplementary Tables 7A and 7B and Table 2).

LN2 vs. LN3

The CoGE exclusive LN3 up regulated displayed the highest number of DEGs for p-values (645) and q-values (638). The lowest numbers of DEGs were related to other cancer types and replicative immortality, with some CoGE without DEGs. Following invasion and metastasis, the hallmarks most represented were evading immune destruction, angiogenesis, evading growth suppressors and proliferative signaling in that order. Those with the least contributions were energy metabolism and other cancer types. Based on the KEGG pathways for each CoGE, we suggest that LN2 cells, rather than LN3 cells, showed a remarkable similarity to other cancer types, while the LN3 cell line was closer to high invasive capacity through angiogenesis stimulation and ability to avoid the immune system (Supplementary Tables 7A and 7B and Table 2).

In order to identify the global contributions of each hallmark of cancer to invasion and metastasis, we obtained the mean of percentages of 11-CoGE and ranked them accordingly (Fig. 3). We found that the processes most closely related to invasion and metastasis were angiogenesis, evading immune destruction, evading growth suppressors and proliferative signaling, with genetic contributions from more than 68% for p-values and almost 48% for q-values. Energy metabolism and other cancer types were less relevant, with contributions of almost 50% and 37% of their DEGs (p- and q-values comparisons, respectively) (Table 2).

Figure 3
figure 3

Individual contributions and percentages, (in parenthesis), of each hallmark to the invasive process. The total percentages are the mean of the 11-percentages of the clusters of gene expression (CoGE) for each hallmark, for (A) p-values, and (B) q-values.

MYH14, ANGPTL4, ENPP1 and PPARD are possible OTSCC biomarkers and potential targets for interference studies

Next we looked for specific genes in our model trying to pinpoint novel OTSCC or metastasis biomarkers. Accordingly, we checked 11-CoGE of each cell line comparison, finding 26 common DEGs displaying the same type of regulation (Table 3) along with the OTSCC model of increasing metastatic potential. They were sorted as 15-downregulated, 10-continuum and 1 up regulated. These genes were classified using Panther® software to analyze whether they were eligible as biomarkers for OTSCC and/or for metastasis.

Table 3 Common DEGs displaying altered expression and their regulation into our invasive progression model of tongue cancer and their FPKM expression mean for each cell line.

First, we concentrated on the 15-downregulated genes, and found that only angiopoitein related protein 4 (ANGPTL4) was associated to biological adhesion acting as a signaling molecule. Amiloride-sensitive sodium channel subunit alpha (SCNN1A) and Ras-related protein Rab-17 (RAB17) participate in biological regulation. Myosin 14 (MYH14), which acts as G-protein modulator by way of the actin binding motor protein and as a cell junction protein, as well as RAB17, are related to cellular component organization and biogenesis. Regarding cellular processes, six genes, solute carrier family 28 member 3 (SLC28A3), gap-junction alpha-5 protein (GJA5), netrin receptor UNC5B (UNC5B), ANGPTL4, MYH14, and RAB17 were detected. Five genes were associated to developmental processes, namely ANGPTL4, MYH14, UNC5B, transcription cofactor vestigial-like protein 1 (VGLL1) and FYVE, RhoGEF and PH domain-containing protein 3 (FGD3), which acts as guanyl-nucleotide exchange factor. Four genes were related to cellular localization, SCNN1A, SLC28A3, MYH14 and RAB17. Concerning metabolic processes, we detected lipase member H (LIPH), which acts as esterase, phospholipase and storage protein. Two genes were related to multicellular organismal process, SCNN1A and MYH14. LIPH was associated to cell proliferation and motility. Within the group of 15-downregulated genes, ANGPTL4 is a classical biomarker for metastasis. Five genes were not found in the Panther database. Therefore, we used Gene Ontology to find out their biological functions. Two of them had receptor functions, plexin domain containing 2 (PLXDC2) and neuromedin U (NMU), that acts as neuromedin U receptor binding. Other two genes with binding properties, radical S-adenosyl methionine domain containing 2 (RSAD2) acting as a self-association protein and as iron-sulfur cluster binding, and coxsackie virus and adenovirus receptor (CXADR) identical protein binding and integrin binding. Finally, sciellin (SCEL) takes part in the assembly or regulation of proteins in the cornified envelope.

The single up regulated gene, ENPP1, was classified as part of the metabolic process, being a nucleotide phosphatase and pyrophosphatase enzyme.

Inspection of the ten continuum genes revealed one gene related to cellular component organization or biogenesis, syntaxin-6 (STX6), which acts as a SNARE protein. Four genes were related to cellular processes, signal recognition particle receptor subunit beta (SRPRB), peroxisome proliferator-activated receptor delta (PPARD), GPI ethanolamine phosphate transferase 2 (PIGG) and STX6. Four genes were classified within the localization group: charged multivesicular body protein 6 (CHMP6) that acts as transfer/carrier protein; transmembrane emp24 domain containing protein 2 (TMED2), which acts as transfer/carrier protein and as vesicle coat protein; SRPRB and STX6. Also, 2 genes related to metabolic process, PPARD and PIGG were found. Only one gene was related to multicellular organismal process, PPARD, a member of the proliferator-activated receptor family PPAR involved in the development of several chronic diseases. Four genes were not found in the Panther database, so we used Gene Ontology to search for their biological functions. Two were related to cell death, death associated protein kinase 3 (DAPK3), which displays protein homodimerization activity and transferase activity of phosphorus-containing groups; and second mitochondria-derived activator of caspase (SMAC/DIABLO), which activates caspases by binding to inhibitor of apoptosis proteins. Two genes were related to transport functions, sodium/potassium/calcium exchanger 6 mitochondrial (SLC8B1) acting as a cation transporter, and transient receptor potential cation channel subfamily C member 4 associated protein (TRPC4AP), related to phosphatase binding.

To better understand the relationship between the above 26 highlighted genes, we investigated whether they interacted with each other using STRING. Four genes RAB17, FGD3 (down regulated), STX6 and CHMP6 (continuum) (Fig. 4) were found to be linked.

Figure 4
figure 4

Interactome showing the contributions of the 26-common DEGs displaying altered expression of the metastatic model of OTSSC to the consensual biomarkers associated to epithelial-mesenchymal transition. Highlighted circles represent the common DEGs displaying altered-expression detected in our transcriptome analysis. Highlighted circles in blue are down regulated genes, purple, continuum genes and red, up regulated genes.

Based on the most common biomarkers of squamous cell carcinoma reviewed by Scanlon et al.13, we selected all the expressed genes related to epithelial-mesenchymal transition (EMT) in our OTSCC model (Supplementary Table S7). In order to understand the relationship between those genes, we used STRING, which displayed 5 sub-clusters, grouping as (i) cadherins, (ii) laminins, collagen and integrins, (iii) integrins and laminins, and two clusters of collagen (iv, v) (Fig. 4). Interestingly, when we analyzed those genes grouping with our list of 26 highlighted genes, we found that the up regulated gene ENPP1 clustered with a collagen sub-cluster; ANGPTL4, a down regulated gene grouped with the integrins of a sub-cluster of laminins and integrins. MYH14, another down regulated gene, grouped with the sub-cluster of cadherins. Finally, PPARD, a gene of the continuum group, clustered with fibronectin and the sub-cluster of laminins, also exhibited higher affinity for the transcription factors SNAIL1 and SNAIL2, TWIST and LEF1, the most important transcription factors of the EMT of HNSCC13. Therefore, those 4 genes stand out as potential targets for oral cancer therapy.

To validate some of the common DEGs displaying altered expression, we selected 4-down regulated, 1-continuum and 1-up regulated genes (Table 3, highlighted genes). We carried out RT-PCR and plotted the results relative to Ct (threshold cycle), as well as to the transcriptomic data (FPKM) (Fig. 5). The results showed that down regulated genes enhance their Ct values along with the metastatic progression. This means that less aggressive cells display higher expression, whereas the most aggressive stages exhibited a lower degree of expression; in contrast, up regulated gene ENPP1 displayed a continuum pattern, as well as PPARD, a continuum gene.

Figure 5
figure 5

Expression profiles of the selected common DEGs displaying altered-expression for each cell line, of 3 independent experiments. (A) RT-PCR data and (B) FPKM data. Turkey’s multiple comparisons test, ****p < 0.0001; ***p < 0.001; **p < 0.01; *p < 0.05.

Clinical data is consistent with our 4 potential therapeutic targets expression

In order to highlight the expression of these 4 potential therapeutic genes in patients of head and neck cancer, data from The Cancer Genome Atlas (TCGA, https://cancergenome.nih.gov/) of 248 clinical tumors were used to support our observations. In this regard, we analyzed the 4 genes proposed as possible molecular targets (ANGPTL4, MYH14, PPARD and ENPP1, Fig. 6, red lines) in terms of levels of gene expression: low, continuum and high. Considering that metastasis is the major factor in cancer lethality, we found a remarkable correlation between up and down regulation of selected genes and the survival rate of cancer patients. Low expression of genes ANGPTL4 and MYH14 correlates with high lethality (Fig. 6A and B, in red). Nevertheless, the up regulation of gene ENPP1 correlates with high lethality as shown in Fig. 6C, in red. In addition, gene PPARD whose expression did not change (continuum) appears to have no correlation with the survival rate plots (Fig. 6D, in red).

Figure 6
figure 6

Clinical data of head and neck cancer from TCGA (n = 248), relative to proposed molecular targets. Low expression of (A) MYH14 and (B) ANGPTL4 (red line) and high expression (black line); (C) low and high expression (red lines) of PPARD; and high (red line) and low expression (black line) of ENPP1, in alive patients. Dotted gray lines represent gene expression of dead patients.

Discussion

Here, we sequenced and analyzed the transcriptomic data of 5 cell lines of OTSCC, characterized by their progressively increasing invasive capacity. The sequencing was made for 3 independent experiments for each cell line. Similar studies14,15 have reported 21000 expressed genes. In our screening we found 28000 genes. The significant difference may be ascribed to the approach employed in the present work, namely pooling together seven data sets obtained from the biological and technical replicates (Supplementary Table 1). One way to validate the consistency of our datasets was to rank the genes according to the level of expression (FPKM) and to compare whether the top 10 most expressed genes were comparable taking into account both for the independent experiments and the experimental replicates (7 datasets). We found a high correlation that showed 6 highly expressed genes in LN1 and LN2 cell lines, 7 in LN3 cells and 8 in SCC9 and ZSG cell lines, respectively (data not shown). For all cell lines and their replicates microRNA6723 was the most expressed gene in each one of the 35 datasets. Also, we found that among the top 10 expressed genes the calcium binding proteins S100A6 and S100A9 proteins were included. S100A9 protein were associated to chronic inflammation in hypoxia response16, a possible mechanism that OTSCC induce to develop more aggressive stages.

In order to detect the differentially expressed genes (DEGs), we compared the parental cells with its derived cell line applying the student’s t-test (p-values to the expression values). We found more than 5.000 DEGs. Then, we corrected those data by applying the FDR correction (q-values) to minimize the type I error. We found values between 284 and 9.874 (Supplementary Table 2). Usually transcriptomic analyses are based on p-values. However, many reports are based on q-values to reduce the number of genes to be analyzed functionally. The problems involved in this kind of analyses are the cut-offs based on fold change and statistical significance9. Thus, depending on the aim of the work, one has to establish a compromise between the significance of the functional attributes of the gene expression and the number of genes analyzed. On the other hand, many works in the literature have “cleaned” their data by eliminating many non-coding genes (NCG) that have not part of specific biochemical pathways17. In the same way, some parameters that take into account the most differentially expressed genes, assume that the complexity of the cells could be downscaled to a small number of genes. In this work, we used all DEGs for p- and q-values and compare them in order to show the relative importance of this correction, holding a comprehensive and more robust analysis of the complex cell biological systems.

By comparing DEGs between the parental and its derived cell line, we found 11-CoGE (Supplementary Tables 36). They display all the possible types of expression regulation for each gene and for each comparison of the transformed cell lines. Establishing CoGE is an interesting tool to find the common DEGs and their expression properties between the comparisons. With this approach, new OTSCC biomarkers were identified (Table 3). In addition, this strategy evidenced expression features that may have been acquired or lost between the compared cells. In other words, CoGE can reveal in greater detail those phenotypic traits of cell lines that display individual or common characteristics that fall within the general bracket of malignancy. Usually, the analysis tools displays patterns of gene expression, which include some of the 11-CoGE described here but not all of them11,18,19,20. In this report we have provided a comprehensive view of all gene regulatory possibilities when considering transcriptomics, as related to tumor progression.

DEGs for p- and q-values were enriched for the PCG with their related KEGG pathways (Table 1). The analysis allowed the observation that the number of genes considered did not necessarily produce a proportional number of pathways. This could be interpreted as meaning that genes may be endowed with multiple functions promoting in tumor cells highly plastic networks. An excess of PCGs resulting in fewer pathways could indicate a high functional redundancy. Accordingly, we found 284 KEGG pathways related to LN1 vs. LN2, exclusive DEGs up regulated in LN1 cells, for both p- and q-values. The groups comprising the commonly expressed genes (down or up regulated) had lower number of PCG and KEGG pathways. However, when analyzing the number of KEGG pathways related to the human genome, we used all-35238 annotated cDNAs following the same procedures, which yielded 292 KEGG pathways (Fig. 2, Supplementary Table 6C).

In order to find out the KEGG pathways related to the 8 hallmarks of cancer described by Hanahan and Weinberg in 201110, we classified each KEGG pathway into each characteristic proposed for cancer by performing manual curation of the data in the literature. Additionally, we found also 71 KEGG pathways related to other cancer types and chronic diseases. For this reason we added these two categories to our analyses. We observed that energy metabolism was the hallmark that included the highest number of related KEGG pathways. Biological processes related to metabolism are frequently found as most altered pathways in large scale analyses21,22,23,24. After metabolism, the other hallmarks categories followed: invasion and metastasis, chronic diseases, proliferative signaling, resisting cell death and evading immune destruction, angiogenesis, evading growth suppressors, other cancer types and replicative immortality (Supplementary Table 6).

As our OTSCC model was developed by selecting a gradient of increasing metastatic potential, we used invasion and metastasis as a gold reference to weight the contributions of other hallmarks. Consequently, we compared the KEGG pathways falling into this category to KEGG pathways in the other hallmarks (Fig. 2 and Supplementary Table 7). Among these, 25 were shared with energy metabolism, 22 with immune destruction evasion, 18 with proliferative signaling; 13 with resistance to cell death and angiogenesis, 11 with growth suppression evasion; 3 with replicative immortality, 1 with other cancer types and 0 with chronic diseases.

Consistently, we used the related DEGs to each KEGG pathway, searching for the genetic contributions of each hallmark to invasion and metastasis. We found that the principal inputs of the hallmarks corresponded to those with highest number of pathways and DEGs, namely evading immune destruction (22 pathways, 1542 genes) and energy metabolism (25 pathways, 1536 genes) (Fig. 2). In the same way, the less represented KEGG pathways and DEGs were found to display the lowest contributions, relative to replicative immortality and other cancer types. Interestingly, chronic diseases, a hallmark with no shared pathways with invasion and metastasis, had the fifth higher contribution of DEGs to the invasive process (0 pathways and 1028 shared genes). These data show that considering exclusively the number of pathways could be misleading.

All comparisons of DEGs related to KEGG pathways were carried out for each CoGE. Analyzing the same parameters (pathways and genes) we found clues about the specific biological features of each cell line. It must be mentioned, however, that we have no information as to whether those PCG are actually translated. Notwithstanding, preliminary proteomic data, have confirmed that the transcriptomic data parallel the implied mechanisms as show by the pathways analyses (Cesari IM et al., in preparation). It can be consider the possibility that the genes could also be post-transcriptionally and post-translationally regulated. Taken together the most important findings were that ZsG elements were closest to neurodegenerative diseases and also displayed more features of proliferative cells than LN1 cells. Conversely, LN1 elements were more similar to other cancer types, besides having different cytoskeleton regulation and interactions with the extracellular matrix (ECM) (Supplementary Tables 7A and 7B and Table 2). This follows the observation that up regulated transcripts in cancer are down regulated in central nervous system (CNS) diseases and vice versa20. Indeed, another report revealed that up regulated genes in CNS disorders coded for low abundant proteins, and that the opposite occurred in cancer25. These observations may support the idea that the highest similarity with neurodegenerative illnesses occurs in the less aggressive cell lines. Furthermore, it is known that common features of metastasis involve MMPs genes, transcription factors, cyclooxygenases, chemokines, etc.6,26, especially those related to ECM interactions.

Based on those comparisons we can infer that LN1 and LN2 use different mechanisms to become metastatic; LN1 resorts to inflammation and proliferation and LN2 to immune system evasion. Indeed, inflammation was associated with amplification of the signaling loops that favor the metastatic cascade27,28,29. This is in agreement with previous reports showing that gene silencing was associated with tumor progression and metastasis. A point in case is MTA2 (metastasis tumor-associated protein 2) in glioma, in which it has been shown that proliferation and metastasis were inhibited30, while this gene was found to be upregulated in nasopharyngeal cancer31. In addition, the immune system can promote either activation or suppression of tumor growth, in a process known as “immunoediting”32. Some cancer cells present tumor antigens that lead to their elimination by the immune system. Alternatively during the process of immunoediting, they can lose those antigens due to either random genetic instability or in response to immune-induced inflammation32,33. Finally, LN2 cells were more similar the tumor cells classified as “other cancer types” than LN3 cells, whereas LN3 could induce angiogenesis and were capable to evade the immune system (Supplementary Table 7). Both mechanisms were already discussed for ZsG and LN2 cells.

The next step consisted in looking for the relative contribution of DEGs for each hallmark of cancer to invasion and metastasis. Table 2 showed that angiogenesis and evading immune destruction DEGs were the most representative. Regarding the number of genes, angiogenesis and evasion of immune destruction were the first and the fifth hallmarks with highest contributions, respectively. Conversely energy metabolism and other cancer types were those with lowest contributions of DEGs, being the second and the last hallmarks with higher number of genes, respectively. This means that evading immune destruction is a hallmark highly associated to invasion and metastasis in OTSCC due to the number of genes and its contribution to that process. In contrast, energy metabolism, the hallmark that displayed the highest number of KEGG pathways and the second highest number of gene contributions to invasion and metastasis, was less related to the invasive process (Fig. 3). These observations suggest that cancer therapies should target those genes involved in immune system evasion, angiogenesis and/or growth suppressors avoidance. It follows that although metabolism contains a high number of gene contributions, they may not be the most susceptible targets for an efficient therapy.

After evaluating the global behavior of the gene expression and its contributions to invasion, we identified common DEGs displaying altered-expression that could become biomarkers of OTSCC or metastasis. To accomplish that, we compared the 11 CoGE, and found 26 genes: 15 were down regulated, 10 were continuum and one up regulated (Table 3).

Of the subgroup of down regulated genes, we found 3 genes described as biomarkers of HNSCC (RAB1734, NMU35 and ANGPTL436), and one of EMT (CXADR37). Of these, CXADR was reported to be down regulated38, agreeing with our results; no expression data for RAB17 was found although in our results we found to be down regulated; ANGPTL4 was reported to be overexpressed39 in contrast to our result showing it was down regulated; NMU protein was proposed as biomarker, although in our data measuring RNA/cDNA levels, this gene was down regulated. Other 4 genes have expression data in HNSCC, namely VGGL140, SCEL41, SCNN1A42 and UNC5B43. In previous works all of them were found to be down regulated, in agreement with our analysis. Finally, 7 common DEGs which had not been reported before in association with HNSCC expression were found to be down regulated. These were MYH14, RSAD2, SLC28A3, LIPH, GJA5, PLXDC2, and FGD3. Interestingly, mutations for MYH14 have been described in HNSCC44 which were correlated to a negative regulatory activity of metastasis45. Incidentally, all 7 genes have been associated to cancer or metastasis and had been noted for their high level of expression46,47,48,49,50,51,52,53, even though PLXDC2 was down regulated in vulvar squamous cell carcinoma (VSCC). This was associated with unfavorable prognosis54.

In agreement with data obtained from renal carcinoma we found that DIABLO belonged to the continuum group55. Curiously this gene was found to be up regulated in cervical cancer56, in tumors of colorectal carcinoma patients57, and in gastrointestinal cancer58. In addition, TRPC4AP, another member of the continuum CoGE group did not display any type of regulation in mouse fibroblasts NIH-3T3 when induced by adenovirus early region 1 A protein (E1A) oncogene11, whereas it was found down regulated in a murine model of aggressive OSCC59. In the same way, PPARD, CHMP6 and TMED2 were found to be either down regulated59,60,61, or up regulated61,62,63 depending on the treatment and the cell line types. Only DAPK3 and STX6 had been reported to be expressed in HNSCC, being down64 and up regulated65. SRPRB was described as down regulated in peripheral blood cells of a melanoma patient when compared with healthy primary melanocyte cells66, as well as in breast cancer patients after 4 cycles of chemotherapy67. There are no data in cancer studies concerning SLC8B1 (encoding NCLX protein) and PIGG. However, SLC8B1 is known to play a key role in cellular and mitochondrial Ca2+ homeostasis and thereby, it is implicated in cell Ca2+ regulation, oxidative phosphorylation, hormonal secretion, synaptic transmission and apoptosis68,69,70. Moreover, PIGG is involved in ethanolamine phosphate transference and its mutations and deletions were reported to be associated with intellectual disorders, hypotonia and early-onset seizures71. Other members of PIGG’s family, such as classes U (PIGU), T (PIGT) and X (PIGX) are oncogenic, being overexpressed in bladder cancer72 and breast cancer cell lines73,74, suggesting a possible role in cancer development related to PIGG.

With regards to the up regulated genes the only one found was ENPP1 whose expression is stimulated by estrogen in stromal cells from normal human endometrium75. ENPP1 loss has been found in ovary cell lines occurring even without genomic deletion. The silencing of this gene can be attributed to hyper methylation of the connective tissue growth factor (CTGF/CCN2) promoter, that inhibits the expression of several genes76. Also, it was shown that ENPP1 is a potential facilitator of breast cancer bone metastasis, with high levels of both mRNA and protein synthesis77, occurring in a chromosomal region reported to be amplified in breast cancer78. The opposite was found in ovarian cell lines. Likewise it was shown that loss of microRNA-27b contributed to breast cancer stem cell generation by activating ENPP1. Clinical data suggest ENPP1 expression in primary breast cancer tissues is associated with malignant potential and response to chemotherapy79. In HNSCC, it was reported that ENPP1 gene was activated by anti-inflammatory stimuli80.

Our observations on common DEGs displaying altered expression corroborate the findings regarding classical biomarkers of invasion, such as RAB17, NMU, ANGPTL4, CXADR and ENPP1, and also allowed the proposal for novel biomarkers of OTSCC metastasis, such as PIGG and SCL8B1 (Table 3). Furthermore, we found 24 of 26-common DEGs displaying altered expression related to different processes in many types of cancer, strengthening our analysis strategy. For instance, NMU gene that encodes a HNSCC biomarker was found to be down regulated at the RNA/cDNA level in our work, leading us to consider the occurrence of post-transcriptional and/or post-translational modifications. Quite probably, a certain number of genes that we have found to be differentially expressed may not synthesize proteins. Similarly the proteome profile may not be deduced from the transcriptomic data, due to many factors pertaining to the transcription process81. Our approach suggests a set of pathways, out of many possible ones, that OTSCC could possibly undertake to become more aggressive.

When we compared those 26 sequentially altered genes with traditional biomarkers for OSCC, we found that ANGPTL4, MYH14, ENPP1 and PPARD interact with important subsets of genes involved in EMT (Fig. 4). Collagen and integrins are important components of the ECM, and actively participate in the invasive process82. MYH14, ANGPTL4 and ENPP1 clustered with those genes, as observed in the interactome. The transcription factors SNAIL1, SNAIL2, TWIST and LEF-1 promote EMT in HNSSC13 and PPARD interacted with them. PPARD is activated by LEF-183. This represents an interesting approach to define ANGPTL4, MYH14, ENPP1 and PPARD as novel HNSCC biomarkers. Moreover, we analyzed 248 clinical data of HNSCC from the TCGA and we found that expression levels of these genes in live patients correlate with our findings (Fig. 6).

Validation of the transcriptome was attempted by carrying out RT-PCR for 6 of the common DEGs displaying altered expression. Among them, 4 were down regulated in our model, and displayed a pattern that corroborated the transcriptomic data. Therefore, RSAD2, VGLL1, SCEL and ANGPTL4 could represent biomarkers of oral metastasis. On the other hand, ENPP1, an up regulated gene, did not display the same pattern as that of the RNA-Seq data. PPARD, a gene belonging the continuum CoGE, consistently did not change its expression amongst the metastatic progression (Table 3 and Fig. 5).

Finally, the reasoning used here for large-scale RNA-Seq analyses using all the DEGs with or without corrections (p- or q-values), in order to have a comprehensive and robust view of the complex cell system biology. No single analysis pipeline can be used in all cases84. The pipeline took into account PCG and KEGG pathways related to them. This allowed us to classify the DEGs related pathways into the hallmarks of cancer and to establish their contributions to any specific process or characteristic. In our case, we found that evading immune destruction and angiogenesis were the most related to invasion and metastasis. We propose that cancer treatments should be directed against those genes rather than metabolic or generic ones. Our approach can be used for other cellular and cancer related processes and diseases, being an interesting tool to highlight nodal genes or set of proteins on which to base new therapies. Lastly, genes such as ANGPTL4, MYH14, PPARD and ENPP1 might constitute interesting molecular targets for OTSCC treatment trials.

Material and Methods

Cell lines

Cell lines SCC9, and transformed ZsG, LN1, LN2 and LN3 were a kind gift from Agostini and collaborators. For details of model development, see reference7.

Cell culture

Cells were cultured in Ham’s F12 medium (DMEM/F12; Invitrogen, USA) supplemented with 10% fetal bovine serum (FBS) and hydrocortisone 400 ng/ml (Sigma-Aldrich, USA). 1.1 × 106 cells of SCC9, ZsG and LN1; 1.5 × 106 cells of LN2 and 2 × 106 cells of LN3 were transferred to 60.1 cm2 Petri dishes, for 48 hours, using an incubator series 8000 water-jacketed CO2 (Thermo Scientific), in humidity atmosphere of 5% of CO2. Three independent biological replicates of each cell line were used for the transcriptomic analysis. The cell lines were genotyped and tested free for Mycoplasma sp. infection.

RNA extraction

Total RNA of ~6 × 106 cells in 3 independent biological experiments for all 5 cell lines was extracted using RNeasy kit (Quiagen®), according to the manufacturer instructions. Quality and purity of the samples were quantified using Nanodrop ND1000 (Thermofisher Scientific).

Library preparation and sequencing

Libraries were prepared using 4 μg of total RNA from each sample, strictly following the instructions of the TruSeq RNA Sample kit v2 (Illumina®). Seven technical replicates were obtained for all cell lines, from three biological independent experiments. Each library was uniquely identified using specific barcodes. The quality of library preparations was assessed using DNA 1000 kit for Bioanalyzer (Agilent®). Libraries were subsequently quantified by qPCR using Library quantification kit for Illumina (Kapa Biosystems®). 10 pM of sample libraries were distributed in 5 lanes of a flow cell, using TruSeq PE Cluster kit v3 - cBot – HS (Illumina®). A 100 × 100 Paired End run was carried out in an Illumina HiSeq2500® platform, using TruSeq™ SBS Kit v3 - HS - 200 cycles. Samples were multiplexed using Nextera kit, on which sequence adaptors were added into samples to differentiate and demultiplexed after the sequencing process.

Data analysis

CASAVA® tool was used to make the base calling, obtaining the FASTQ sequences for experimental and biological replicates. Using the FASTQ files, gMAP® was employed to align our reads against the human genome v.38 (Ensembl), producing a GFF file and the Cufflinks tools align the coordinates of each read in GFF format to produce the frequency of each gene, expressed in Fragments Per Kilobase of exon per Million reads sequenced (FPKM). A FPKM ranking was obtained and these data were compared using Excel (Microsoft Corporation®). The first step consisted of the comparison between parental vs. its derived cell line, gene by gene, using Student’s t-test, to obtain the differentially expressed genes (DEGs). False Discovery Rate (FDR) correction was made, generating a q-value, reducing type I error. A ratio between the derived and parental cell lines was obtained, to determine the type of regulation for each gene, and to establish the clusters of gene expression (CoGE). STRING® free software (http://string-db.org)12 was used to classify between protein coding genes (PCG) and non-coding genes, and to obtain the KEGG pathways related to each PCG. Panther® (http://pantherdb.org) was used to determine the biological process of specific genes, based in gene ontology.

RNA extraction and cDNA synthesis for validation experiments

Total RNA was isolated from oral cancer cells using TRIzol reagent (Invitrogen) according to the manufacturer’s instructions. Total RNA was quantified spectrophotometrically using Nanodrop ND1000 (Thermofisher Scientific) and 1 μg was treated with 1 unit of RNase-free DNase for 30 min at 37 °C. Reactions were stopped by adding 1 μl of 20 mM EDTA and heating for 10 min at 65 °C. Synthesis of cDNA was performed using the DNase-treated RNA according to a High Capacity cDNA Reverse Transcription Kit (Applied Biosystems).

Real Time-PCR

Gene expression analysis was performed using a 7500 Real-Time PCR (Applied Biosystems) and power SYBR-Green PCR master mix (Applied Biosystems). The sequences of the primers used were: RSAD2 forward GCGTTGCGGGGAAACGAA reverse AGCGCCGGCCGTTTATC; VGLL1 forward GGACATCAGCAGCGTAGTGG reverse CTCTGACTCGAGGGGGTCAA; SCEL forward TTGCAACCTGGCGGTTCATT reverse ACACCTGGTTCCCTCTTCTTCT; ANGPTL4 forward CTCTCTGGAGGCTGGTGGTT reverse TGTGGGATGGAGCGGAAGTA; ENPP1 forward CTATGGACGTGGGGGAGGAG reverse TAGGTGTTGGGGTCCTTGGC; and PPARD forward GTGGCTTCTGCTCACCAACA reverse CATCGTCTGGGTCTGAACGC. The comparative Ct method was used to contrast changes in gene expression levels. β–actin was used as an endogenous control.

Statistical analyses

For transcriptomic data, statistical analyses were performed using Excel (Microsoft Corporation®), statistical significance was determined by student’s t- test and false discovery rate (FDR) correction, both of them with α = 0.05. The results were expressed as means ± S.E.M for n independent experiments. Statistical significance was determined by two-way ANOVA, with α = 0,05. For transcriptomic data validation, Prism 7 (GraphPad Software) for Mac was used. The results are expressed as means ± S.D. of 3 independent experiments. Two way ANOVA and Tukey’s multiple comparisons test were done7,26.