Dys-regulated Gene Expression Networks by Meta-Analysis of Microarray Data on Oral Squamous Cell Carcinoma

Background: Oral squamous cell carcinoma (OSCC) is the sixth most common type of carcinoma worldwide. Development of OSCC is a multi-step process involving genes related to cell cycle, growth control, apoptosis, DNA damage response and other cellular regulators. The pathogenic pathways involved in this tumor are mostly unknown and therefore a better characterization of OSCC gene expression profile would represent a considerable advance. The availability of publicly available gene expression datasets has opened up new challenges especially for the integration of data generated by different research groups and different array platforms with the purpose of obtaining new insights on the biological process investigated. Results: In this work we performed a meta-analysis on four microarray and four datasets of gene expression data on OSCC in order to evaluate the degree of agreement of the biological results obtained by these different studies and to identify common regulatory pathways that could be responsible of tumor growth. Sixteen dys-regulated pathways implicated in OSCC were mined out from the four published datasets, and most importantly three pathways were first reported. Those regulatory pathways and biological processes which are significantly enriched have been investigated by means of literatures and meanwhile, four genes of the maximally altered pathways, ECM-receptor interaction, were validated and identified by qRT-PCR as a possible candidate of aggressiveness of OSCC. Conclusion: we have developed a robust method for analyzing pathways altered in OSCC using three expression array data sets. Although this method has been used on four microarray studies, it can be extended to multiple expression array data sets. This study sets a stage for the further discovery of the


Introduction
OSCC is the sixth most common type of carcinoma worldwide.It has been estimated that 30,000 new cases and 7800 OSCC-related deaths occur in the United States annually [1].The incidence rates are higher in males and blacks.Tobacco products and alcohol are major etiological factors, and the rate of consumption of these products is believed to partially account for the difference in incidence rate according to race and sex [2].Development of OSCC is a multi-step process involving genes related to cell cycle, growth control, apoptosis, DNA damage response and other cellular regulators [3].Understanding the genetic processes and biological pathways involved in the development of OSCC might lead to valuable information that might improve disease classification, early detection and diagnosis, as well as therapeutic planning and drug development [4][5].Microarrays represent a promising tool that makes it possible to explore the expression profile of thousands of genes simultaneously, at the RNA level [6][7].In the literatures, there are several microarray studies on OSCCs with promising findings [8][9][10].
Today, gene chip databases such as Gene Expression Omnibus (GEO) have been exponentially growing, but scientists are facing a series of problems on how to high-efficiently and comprehensively deal with those booming datasets of microarray data.In spite of high-throughput microarray technology, a large-scale and high-repetition experiment for individual research object is both unrealistic and undesirable due to the high-price.Furthermore, types of gene chip (cDNA / oligonucleotide chip, long probe / short probe chips), microarray platforms (Affmetrix, Aglient chips etc), models of gene chip(e.g.Affymetrix HU-U133A , HU-U95A), laboratories, the number of samples, experimental samplings, experimental operators, experimental processes etc exist inevitably differences, which results in various results from the same experimental object in different studies and even contradictory conclusions [11].We investigated almost all papers published in recent years on OSCC depending on microarray technology (listed in supplementary Table 1), and the following a few points were concluded：1.Numerous papers focused on exploring the differentially expressional genes (DEGs)and biomarkers of OSCC, nevertheless, there is a discordance among the studies, e.g.Gino MA et al (2004) [12] analyzed gene expression in 41 OSCC specimens with non-metastasis versus 13 normal tissue specimens, 1658 over-expressional and 1232 low-expressional genes were detected respectively, while Shimada and colleagues [13] identified only16 up-regulated and 0 down-regulated genes in 4 oral tongue SCC specimens; 2. With respect to biomarkers of OSCC, it is hardly easy to find a same marker-genes for OSCC in published papers; 3. It is little known about reports of dys-regulated pathways on OSCC.
To address the aforementioned problems, we investigated the three databases, NCBI GEO (http://www.ncbi.nlm.nih.gov/geo/),EBI( http://www.ebi.ac.uk/microarray-as/ae/) and Oncomine (http://www.oncoming.org).We only collect the microarray data sets on OSCC with the tumor samples >10, as a result, 4 data of studies which meet our requirements were shown in Table 1.A novel meta-analysis method was developed in R and MATLAB language for this project.This method allows for a meta-analysis of multiple sets of microarray data and can query multiple pathways simultaneously in determining the significance of each pathway for OSCC.We used our approch to study the dys-regulated pathways in OSCC relative to normal tissue using the data from the profiling studies available on the NCBI GEO database.We also validated some genes in maximally altered pathway using qRT-PCR.Therefore, we believe that those results are more correlative with clinical outcomes with patterns of gene expression on a genomic scale, and may yield a more accurate predictor of tumor behavior.It is also significant to understand the mechanisms of tumorigenesis and application of tumor clinical outcomes.

Results:
We used four data sets [1-4 Materials &Methods] for this study.These studies used the Affymetrix microarray platforms for the gene expression analysis.We used the Unigene IDs to map the different genes across the studies.The data were downloaded from the GEO and EMBI-EBI database.The four data sets used contained primary-OSCC versus normal tissue samples.Those data were normalized by MAS5.0 and filtered .Table 2 summarizes the results of probes of each type of microarray processed in the four studies.In case of the loss of information in meta-analysis, three methods including average, median, and randomization were used to integrate p-value of multiple probes into one.Furthermore, unlike other reported meta-analysis which is only targeting to common genes across different platforms, in this study, those gene lists that any two or three studies contained, or all common gene lists across the four studies were further adopted respectively.The gene lists obtained from each study were then merged as described in Materials and Methods (Figure 1) to arrive at a meta-gene list.This list was then used for pathway analysis.Table 3 shows the three types of common genes.The supplementary Table 2 lists the top 16 dysregulated pathways, which include extracellular matrix (ECM)-receptor interaction (KEGG), Cell cycle(KEGG), Hs_Matrix_Metalloproteinases (GeneMapp ), Platelet Amyloid Precursor Protein Pathway(BioCarta), Hs_Cell_cycle_KEGG(GeneMapp), Proteasome Complex (BioCarta) and so on.
The pathway that was found to be maximally altered was the ECM-receptor interaction pathway (shown in Figure 2), which consists of a complex mixture of structural and functional macromolecules and serves an important role in tissue and organ morphogenesis and in the maintenance of cell and tissue structure and function.Specific interactions between cells and the ECM are mediated by transmembrane molecules, mainly integrins and perhaps also proteoglycans, CD36, or other cell-surface-associated components.These interactions lead to a direct or indirect control of cellular activities such as adhesion, migration, differentiation, proliferation, and apoptosis.In addition, integrins function as mechanoreceptors and provide a force-transmitting physical link between the ECM and the cytoskeleton.Inte-grins are a family of glycosylated, heterodimeric transmembrane adhesion receptors that consist of noncovalently bound alpha-and beta-subunits.
Validation of the ECM-receptor interaction pathway on the transcript level.To validate the results of analysis, the expression of a few key genes in the ECM-receptor interaction pathway that were significantly dys-regulated in all four data sets was determined by qRT-PCR analysis using unmatched normal (n = 3) and primary-OSCC samples (n = 3)(three times biological repeats).The details of the regulation of the individual genes are shown in Table 4. From the Table 4, the 4 genes, COL5A2, COL1A2, COL4A1 and SPP1 within ECM-receptor interaction pathway were upregulated in the primary-OSCC relative to normal tissue.

Discussion
The current study represents the first effort to simultaneously compare the transcriptional profiles of highly enriched primary -OSCC and normal tissue from a wide variety of patients, using modern microarray technology, which enhanced our ability to identify genes and pathways that are dys-regulated at the transcriptional level.This analysis provides critical insights into the differences between normal and primary-OSCC that may be used for the development of targeted therapies, and tools for assessing the impact of therapy on the OSCC.
There are at least two very important highlights in current study.The common genes were only targeted in other meta-analysises, which will be possible to result in a flaw that is likely to lose genes, thus the results may be not accurate, e.g. the early Affymetrix HG_U95Av2 oligonucleotide arrays represents approximately 10500 unique genes (12625 probes), while HG_U133 plus 2.0 Affymetrix Gene Chips contains 47000 genes (54675 probes).In our analysis, the three gene data sets generated from the four studies (described in Materials and Methods) were treated using the same ways., There is phenomenon that multiple probe-sets are assigned to the same gene in the Affymetrix array.
Usually, there are three methods which were used in the reviewed published papers to integrate the expression values of multiple probes into one.They are the mean, median, and random value of the expression values [18][19][20][21][22].But using any kind of single method would produce statistical bias due to possible data loss.To avoid this issue, we have selected the above three methods.In addition to this, the other pipeline of our analysis is same.Finally, on the basis of three approaches to integrate the multiple probes value into one and the three gene data sets mentioned above, the 16 top common dys-regulated pathways (all p-values <0.05) were detected (shown in supplementary Table 2).As a result the outcome is more reliable and robust.
Single-gene analysis may miss important effects on pathways.Cellular processes often affect sets of genes acting in concert.An increase of 20% in all genes encoding members of a metabolic pathway may dramatically alter the flux through the pathway and may be more important than a 20-fold increase in a single gene [23][24], developed a method called gene set enrichment analysis (GSEA) ,which ranks gene sets by assigning scores according to their enrichment in a particular microarray data set.According to "module"-based approaches (sets of genes that are biologically related [25],which require the definition of a threshold in terms of fold change, or statistical significance, in ex-pression and, therefore, do not incorporate all the genes in the identification of dys-regulated gene sets or pathways. However, in current study, we have developed a method similar to GSEA to carry out meta-analysis of altered pathways from multiple microarray data sets.We show how this approach can be used to identify critical pathways in primary-OSCC progression.To validate the results of analysis, the 4 genes (COL5A2, COL1A2, COL4A1 and SPP1) in the ECM-receptor interaction pathway to be maximally altered were chosen to validated by real-time q-PCR analysis, the results indicated that the 4 genes are dys-regulated in primary-OSCC relative to normal tissue.
By investigating data from our designed experiments, we found out the most significant sixteen pathways which were dys-regulated in primary-OSCC cancer, like ECM-receptor interaction, cell cycle pathways etc.In our discussion, the dys-regulated pathways were divided into four groups The dys-regulated pathways in the first groups are associated with OSCC in the previous studies.Platelets contain both amyloid precursor protein (APP) and amyloid -beta peptide (Ab) and probably contribute greater than 90% of circulating APP.According to S. Ko et al.,(2007), an increase of APP expression in OSCC and related OSCC cell lines have been demonstrated [22].The APP expression is involved in the proliferation and carcinogenesis of OSCC.The degree of APP expression could serve as an invaluable marker for oral carcinogenesis [27].By analyzing the pathways in the second group, those pathways, ECM-receptor interaction (KEGG) Focal adhesion (KEGG) cell cycle (KEGG) Ubiquitin mediated proteolysis(KEGG) (which is not belonging to the second group) proteasome (KEGG), are found to be linked one by one in KEGG database.These pathways suggest that one upstream pathway dys-regulated would lead to the downstream pathways dys-regulated.ECM-receptor interaction pathway, which has a very close relationship with Hs_Matrix_Metalloproteinases (GeneMAPP), is maximally altered in current study.
As shown in the Figure 2, ECM-receptor interaction pathway mainly comprises ECMs, integrins and proteoglycan.
The ECM plays an important role in controlling of cellular proliferation, differentiation, and apoptosis as well as in providing nutritional support for the tumor cells.For some normal cells the ECM remodeling mechanism for migration through tissue barriers is also considered to be important [28].ECM-degrading matrix metalloproteinases ( MMPs) secreted by malignant tumor cells including OSCC have been suggested to play an essential role in the processes of tumor invasion and metastasis [29][30].Over-expression of several of these MMPs, either in tumor cells or in tumor-associated fibroblasts, has been linked with increased invasive and metastatic behavior in many different types of cancers including OSCC [31][32][33].Especially MMP-2 and MMP-9 are present in human OSCC, and that the activated MMP-2 could be a main enzymatic activity of gelatinolysis in OSCC.Interaction of tumor cells and stromal cells seems to play an important role in the invasion and metastasis of human OSCC.Examination of activated forms of MMP-2 and MMP-9 in OSCC is very important to understanding the tumor malignant characteristics [34][35][36].In addition, Sadie A et al's.,(2001)data illustrated the potential importance of tumor cell-derived membrane type 1 metalloprotease (MT1-MMP) in progression of OSCC cells [37].In current study, MMP-2, MMP-9 and MT1-MMP were mapped into Hs_Matrix_Metalloproteinases (GeneMAPP) (the Figure not shown) and dys-regulated, which indicates that ours results of analysis are consistent with overview of literature.Similarly, integrin expression in oral squamous cell carcinoma shows considerable variation both between tumours and within different areas of the same tumour [41][42][43].High stromal versican (an extracellular matrix proteoglycan) expression may thus be considered an independent and adverse prognostic marker in OSCC [40].
Cell cycle, cell communication and focal adhesion implicated in OSCC have been reported in relevant literatures [41][42][43].Ikebe, T et al. (1998) confirmed that proteasomes were involved in the invasiveness of oral squamous cell carcinoma (SCC) cells [44].Bile acid biosynthesis, glycerolipid metabolism and arginine and proline metabolism were also reported to relate with OSCC [45][46].However, there is unknown about the relationships between OSCC with phenylalanine metabolism, angiotensin-converting enzyme 2 regulates heart function, and regulators of Bone Mineralization pathway, which would be considered as novel dys-regulated pathways implicated in primary-OSCC.
In conclusion, we have developed a robust method for analyzing pathways altered in OSCC using four expression array data sets published.Compared with similar meta-analysis reported, there are at least two very important highlights in current study, the common genes were only targeted in other meta-analysis, which will be possible to result in a flaw that is likely to lose genes, the results may be not accurate.However, the genes across different arrays (common genes in ≥ two, three and four arrays) were selected as the target genes (detailed description in the manuscript).In addition, one of the three methods, random, mean and median, is usually used to integrate several p-values of multiple probe IDs corresponding to a gene to one p-value, whereas all three methods were applied in our meta-analysis.Thus the outcome is more reliable and robust.
The gene signature of the top dysregulated pathway, ECM-receptor interaction, when validated on an independent data set, was found to correctly classify all the primary-OSCC samples.The some genes, such as Col5A2 and SPP1as biomarkers, in the pathway were also able to distinguish samples with a good prognosis.In addition, three pathways previously not reported, phenylalanine metabolism, angiotensin-converting enzyme 2 regulates heart function, and regulators of Bone Mineralization, are detected to be implicated in OSCC.Although this method has been used on four microarray studies, it can be extended to multiple expression array data sets.This study sets a stage for the further discovery of the basic mechanisms that underlie a diseased state and would help in identifying critical nodes in the path-way that can be targeted for diagnosis and therapeutic intervention.

Materials and Methods
Framework Figure 1 illustrates the framework of our method.The project started with collection of databases, followed by meta-analysis, then ended with PCR validation.The process of detail is as followed.

Data Collection and
, where N(P),N(A), N(M) stands for the number of P ,A, M respectively, if the value ≥ 20% , the corresponding probe ID would be excluded .

Integration of datasets.
To overcome the bias of microarray technology e.g.technical and pathophysiological reproducibility, we developed a meta-analysis method to integrate the 4 different microarray datasets.Firstly, equidirectional two-sided P value for all probe IDs in each study was calculated using Mann-Whitney U test [47] by comparing primary-OSCC and normal tissue samples.Secondly, in general, multiple probe sets, called "sibling probe-set", were mapped to a single UniGene ID, i.e. one same gene UniGene ID has multiple expression values, after Mann-Whitney U testing, and a UniGene ID was assigned to multiple p values.We designed a pipeline offering three options to integrate the multiple p values, the "randomized ", "averaged" "median" value of the multiple P values was, respectively, corresponded to one UniGene ID.But, in some cases, one probe ID matched to more than one UniGene IDs.In such cases, those UniGene IDs were assigned with the same P value.In addition, those probe IDs not mapping to any UniGene ID were, of course, dropped out as they were considered to have non-transcriptome biological meanings.Thirdly, the 2 www.ncbi.nlm.nih.gov/geo/ 3 http://www.ebi.ac.uk/microarray-as/ae/ 1 http://www.affymetrix.com/common UniGene IDs in the four studies would be chosen, in case of loss of gene information, the three sets of data being composed of the UniGene IDs would be chosen and analyzed.The first set was the data being composed of the common UniGene IDs in all four studies (= 4 studies, later called '4' common gene list.The second set was the data being composed of the common UniGene IDs at least crossing three studies (≥3 studies, later called '3_4' ) in all four studies.The third set was the data being composed of the common UniGene IDs at least crossing two studies (≥2 studies, later called '2_4' ) in all four studies.All these three sets of data would be analyzed respectively in this article.
Then the P value of UniGene IDs of three sets of data were integrated by the Liptak-Stouffer's Z combination method [48][49][50][51]] where k = 2, 3, or 4, the number of the studies; Φ-1(.) is the inverse function of the standard normal cumulative density function; Pi,g is the P value of the UniGene IDs g in ith study; and wi = 1, the weight of the ith study.An overall twosided p-value can then be calculated from the cumulative standard normal distribution: )) ( RNA Isolation and Quantitative Real-time PCR Analysis.Total RNA was isolated from OSCC specimens using RISO™ RNA Isolation Reagent (Biomics, China).First-strand cDNA was synthesized from 1 μg total RNA using Thermoscript reverse transcriptase PCR system according to the manufacturer's instructions (Invitrogen, USA).Target genes were amplified by quantitative real-time PCR in the MyiQ real-time PCR detection system (Bio-Rad, USA), using EzOmicsTM SYBR qPCR Kit (Biomics, China) according to the manufacturer's guidelines.The PCR cycling conditions used were as follows: initial denaturation of one cycle at 95 °C for 10 min, followed by amplification 40 cycles at 95 °C for 20 s, 55 °C for 30 s, and 72 °C for 30 s during which the fluorescence data were collected.To verify that the used primer pair produced only a single product, a dissociation protocol was added after thermocycling, determining dissociation of the PCR products from 55°C to 95°C.The ΔCt method was used for relative quantification of target genes.The gene-specific primer pairs (and product size) for the gene analyzed were shown in supplementary Table 3.

Figures:
Figure 1: General scheme of our approach in this study.Workflow is represented by boxes and arrows.

where
Pg is the combination P value of UniGene g across the datasets, Φ(.) is the standard normal cumulative density function.Dysregulation of pathway.To obtain the dysregulation of a pahway P, we used a procedure similar to that used in Setlur et al.[52].Biocarta4, KEGG5 and GeneMAPP6 databases were used as pathway references for the analysis.Therefore, a pathway score S Pg is the integrative P-value of UniGene g in pathway P. Then randomly chose genes to generated the same size of pathway P and computed a new score S﹡ten thousand times iteratively, following the above formula.The dysregulation of pathway P was valued by the frequency of scores S﹡ larger S.Dys-pathway validation.A Clinical Samples.A total of patients with histologically confirmed oral squamous cell carcinoma (OSCC) who underwent surgical resection at the Sichuan Cancer Hospital from 1998 to 2002 were enrolled in this study.Noncancerous oral tissues adjacent to the tumor (2-3 cm) as well as noncancerous oral mucosa specimens from trauma or extraction of teeth patients who neither smoked nor drank served as normal control tissues.The patients had not received either chemotherapy or radiation therapy before operation could be considered.Primary tumor samples and noncancerous oral tissues were obtained from patients (n = 10) during surgery and immediately snap-frozen in liquid nitrogen and stored at -80°C until RNA extraction.The frozen tissue samples were sectioned and mounted on glass slides.The slides were stained with hematoxylin and eosin for histopathological examinations.These frozen tissues were cut into 4-μm-thick sections and used for RNA isolation and cDNA synthesis.All the tumor samples had been classified as OSCC.The histology and cellular composition of tissue were confirmed by the scientists at SI-CHUAN Cancer Hospital before RNA extraction was performed.All protocols and consent forms were approved by the hospital's institutional review board.And informed consent was obtained from patients or legal guardians (as appropriate).

Figure 2 .
Figure 2. Visualization of molecular interaction and reaction networks in the KEGG database.Red color represents dys-regulation in primary-OSCC.
Pre-progression.In this study, the datasets, representing primary-OSCC versus normal tissue in Gene Expression Omnibus (GEO) 2 and European Molecular Biology Institute-European Bioinformatics Institute (EMBI-EBI)3, were collected, and as a result, four microarray expression datasets arrived for the requirements (shown in Table1).Array data files with a.CEL extension were normalized using the Affymetrix Microarray Suite (MAS) Version 5.0 algorithm and annotated with the Annotation Merge Files by Affymetrix® Expression Console™ (version 1.1)1.For filter , (a) data sets were excluded if the probe ID was marked as AFFX internal control; or (b) a matrix were composed of each probe ID and its corresponding sample CALL (CALL: P, present; A, absent, M, medium), followed with a conversion of this matrix, P =1; A=0; M=0.5, multiple values were assigned to each probeID , we

Table 1 :
Characteristics of the three datasets chosen

Table 1 .
The information of probe data in the four studies.The columns from second to third column shows, respectively, the number of transcripts of each type of gene chip and the number of probes filtered by the CALL value (referring to Materials and Methods);'Multiple UniGeneNum'in the fourth column means those filtered probes certainly corresponding to UniGene IDs, some of which exhibit multiple probes mapping to one UniGene ID and vice versa; the last column presents that the number of UniGene IDs having unique correspondence between UniGene ID and expression value of probe(referring to Materials and Methods)

Table 3 .
The number of common genes and responding pathways in meta-analysis.Notes: 4_2, 4_3 or 4 in the first column indicates the three gene lists being contained in ≥two(later called '2_4' common gene list), ≥three(later called '3_4' common gene list) or four (later called '4' common gene list)studies respectively; the second column shows the number of genes in '2_4', '3_4', '4' common gene list; the third column presents the number of those genes which can be mapped to pathways; the last three columns illustrate the number of pathways were mapped by three common gene lists in three databases , KEGG, Biocarta and GeneMap databases.

Table 4 :
The result of RT-qPCR of 4 genes of ECM-receptor interaction/Cell Communication pathway.Notes: ΔCT=target gene-reference gene.N/A means the expression lower than the scope of the inspection of RT-qPCR