Journal of Biometrics & Biostatistics

Invasive breast cancer is a heterogeneous disease. Analysing a single gene, or one group of specific gene expression profiles, may not be enough to understand the molecular activities in cancer cells. A method that allows similar up- and down-regulated gene expression profiles to be compared is therefore needed. The main purpose of our work is to sort the extremely high and low expressed genes and to extract, compare and cluster them. Expression profiles of 598 samples of invasive breast cancer and 48 samples of normal tissue were analysed to create a new algorithm called Extreme Gene Expression Family (EGEF). The EGEF algorithm sorted, grouped and compared the highest and the lowest expressed genes (n = 100). Hierarchical clustering revealed dense and light memberships of gene families. The resulting analysis allows us to predict which genes would show similar expression signatures in invasive breast cancer and to recognize specific biological activities and processes. The EGEF algorithm can be used to detect expression signatures in other cancers and biological processes.


Introduction
The influence of changes in gene expression on the development of cancer is still not well understood [1]. Analysis of gene expression data in cancer studies is an effective route to global cancer profiling, tumor classification, identification of tumor-specific molecular markers and pathway exploration [2]. Gene expression profiling with next-generation sequencing techniques has arisen as a powerful approach to study the cancer transcriptome [3]. This approach is valuable for identifying novel biological mechanisms that are aberrant in cancer cells; moreover, it clarifies our understanding of known pathways in proteomics and the metabolome [4][5][6]. Next-generation sequencing has had a big impact on cancer genomics through resequencing, analyzing and comparing the matched tumor and normal genomes of a single patient [7]. These techniques have supplied large amounts of DNA sequencing data, especially for cancer studies. cDNA and oligonucleotide microarray technologies have increased the rate of discovery of genetic interactions by simultaneously observing thousands of genes in a single experiment [4,8]. The gene expression approach is also valuable for identifying new mechanisms in the regulation, expression and production of proteins, and for clarifying known pathways in proteomics and the metabolome [5,9,10]. The cancer genomics field has been profoundly influenced by next-generation sequencing technology, which has enormously sped up the pace of discovery while markedly decreasing the cost of data production [7]. Korbel first showed that paired-end reads from next-generation sequencing platforms can be mapped to the genome and analyzed to determine putative structural variation [11]. However, this immense amount of gene expression data needs to be digested and turned into sensible results about the genomics of cancer. Several tools have been developed for this purpose. Oncomine is one of the most actively used web tools for the statistical analysis of cancer gene expression. COPA [12,13] and GTI [14] are other statistical methods for cancer gene expression.
Here we introduce an algorithm, Extreme Gene Expression Family (EGEF), developed using the expression profiles of patients with invasive breast cancer in order to identify signatures that are characteristic of this cancer type. The main purpose of the algorithm is to find the highest and the lowest expressed genes and the correlations among them, specifically the genes that are coexpressed in invasive breast cancer cells. The coexpression signatures of genes may elucidate novel mechanisms for the underlying biological processes in invasive breast cancer. The algorithm also allows us to detect the genes involved in tumorigenesis and their sparse membership within the cancer.

Data processing
Expression data for 17814 genes from 598 invasive breast cancer samples and 48 normal breast tissues were downloaded from TCGA and supplied to the EGEF algorithm. Figure 1 shows the main steps in the workflow [19][20][21][22].

EGEF algorithm
The highest (HE) and the lowest (LE) gene expression signatures are discovered by the EGEF algorithm (Supplementary 1 and 2). The EGEF script was written in the R statistical environment for fast and reliable data mining. EGEF sorts the 17814 gene expression profiles of the 598 samples and distinguishes extreme genes based on expression level. We searched for the top and bottom genes in sets of 3, 5, 10, 20, 50 and 100. Hierarchical clustering and heat mapping of the gene expression profiles were done with HCE 3.0, and the biological functions of the genes were examined with MSigDB [9,10].

EGEF Algorithm steps (process)
1-Prepare a table in which each column represents a patient and each row gives the expression value of a gene.
2-Sort the expression values of the genes for each patient in ascending and descending order.
3-Retrieve a window-N from the table. Window-N is a sub-table including the first N rows of the table. N can take values such as 1, 3, 5, 10, 20, 50 and 100.
4-Determine and prepare a list of the genes covered by window-N.
5-Calculate the frequency of the genes across patients.
Senol's Frequency (Gene) = The number of patients in which the gene is located within window-N (the first N rows)
6-Calculate the ratio of the genes. A higher ratio indicates the gene's activity across all patients.
Amina's Ratio (Gene) = Senol's Frequency (Gene) / Total number of patients
7-Establish the extremely expressed gene family, which consists of the first n genes from window-N ranked by their Amina's ratio.
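The steps listed above can be sketched in a few lines of code. The authors state that their script was written in R but do not show it, so the following is only an illustrative Python sketch under stated assumptions: the function name `egef_family`, the nested-dictionary table layout and the toy genes G1 to G4 with their values are all hypothetical and do not come from the paper's data.

```python
from collections import Counter


def egef_family(table, window_n, family_size, highest=True):
    """Illustrative sketch of the EGEF steps (not the authors' R script).

    table: dict mapping patient id -> dict mapping gene -> expression value.
    Returns (Senol's frequency, Amina's ratio, extreme gene family).
    """
    frequency = Counter()
    for expression in table.values():
        # Step 2: sort this patient's genes by expression value.
        ranked = sorted(expression, key=expression.get, reverse=highest)
        # Steps 3-4: window-N holds the first N genes of the sorted list.
        window = ranked[:window_n]
        # Step 5: count the patients in which each gene falls inside window-N.
        frequency.update(window)
    total_patients = len(table)
    # Step 6: ratio = frequency / total number of patients.
    ratio = {gene: count / total_patients for gene, count in frequency.items()}
    # Step 7: the family is the first n genes ranked by ratio
    # (ties broken alphabetically, which is an assumption of this sketch).
    family = sorted(ratio, key=lambda g: (-ratio[g], g))[:family_size]
    return frequency, ratio, family


# Hypothetical toy table: three patients, four placeholder genes.
toy_table = {
    "P1": {"G1": 5.0, "G2": 1.0, "G3": -2.0, "G4": 3.0},
    "P2": {"G1": 4.5, "G2": 2.5, "G3": -1.0, "G4": -3.0},
    "P3": {"G1": 3.0, "G2": 6.0, "G3": -4.0, "G4": 2.0},
}

top_freq, top_ratio, top_family = egef_family(toy_table, window_n=2, family_size=2)
print(top_family)  # ['G1', 'G2']
```

Calling the same function with `highest=False` sorts in ascending order and so yields the lowest expressed family instead.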

Implementation of EGEF algorithm
1-Prepare a table consisting of columns indicating patients and rows specifying the expression values of genes. The algorithm starts by preparing a table in which each column represents a patient and each row gives the expression value of a gene. To make this clear, four genes were selected at random and a table was prepared with real data (Table 1).
2-Sort the expression values of the genes for each patient in ascending and descending order. The second step of the algorithm is to sort the expression values of the genes for each patient. The sorting is done in two ways, in ascending and in descending order (Table 2), and yields the lowest and the highest expressed genes [26][27][28][29].
3-Retrieve a window-N from the table. After the data are sorted, a window-N is retrieved from the table. Window-N is a sub-table that includes the first N rows of the table (Table 3). N can take values such as 1, 3, 5, 10, 20, 50 and 100. Window-N, with N adjustable, is used to define the extreme gene families.
5-Calculate the frequency of the genes across patients.
Senol's Frequency (Gene)w-N = The number of patients in which the gene is located within window-N
For example, counting the patients in which FCGR3A is located within window-50 gives Senol's Frequency (FCGR3A)w-50 = 589.
6-Calculate the ratio of the genes. The frequency is used to find the ratio of each gene: Senol's frequency is divided by the total number of patients. The resulting ratio, Amina's Ratio, is shown below with an example. A higher ratio indicates the gene's intensive activity across all patients [29][30][31].
Amina's Ratio (Gene)w-N = Senol's Frequency (Gene)w-N / Total number of patients
Example: Senol's frequency of FCGR3A is 589 and the total number of patients is 598, so Amina's Ratio (FCGR3A)w-50 = 589/598 = 0.98.
7-Establish the extremely expressed gene family, consisting of the first n genes from window-N ranked by their Amina's ratio.

The algorithm is used for preparing tables in ascending and descending order to find the top and bottom expressed genes. The table shows how the genes are listed according to their expression value.
After all steps have been completed the algorithm has shown the
The table shows Window-20. The expression values are sorted in descending order. P: Patient

Similar algorithms
Similar algorithms that search for gene expression patterns, together with related clustering methods, are listed according to their target and performance (Table 5).

Algorithm | Types | Reference
Agglomerative HCT | Investigate any correlation among discriminator genes in hereditary breast cancer | [11]
E-cast | Uses a dynamic threshold | [12]
Non-HCT (CAST) | Clustering gene expression patterns | [
Cheng and Church | Biclustering of expression data | [21]
Plaid | A tool for exploratory analysis of multivariate data | [22]
BiMax | Sharing compatible expression patterns across subsets of samples | [23]
xMOTIFs | A conserved gene expression motif | [24]
OPSM | Capturing the general tendency of gene expressions across a subset of conditions | [25]
Spectral MEQPSO | Global convergence towards an optimal solution | [26]
ISA | Overlapping transcription modules | [27]
Table 5: Related clustering algorithms.

EGEF analysis for top and bottom expressed genes
A new value, termed Senol's frequency, was created to denote the number of patients in which a gene falls within window-N. According to Senol's frequency, the highest and the lowest expressed genes of invasive breast cancer were found. The frequencies calculated for the highest and lowest expressed genes in window-50 are given in Table 6 and Table 7, respectively. In addition, the 100 genes in window-100 are available in Supplementary 1. For example, Senol's frequency (FCGR3A)w-50 = 589 means that FCGR3A lies at the top of window-50: in 589 out of 598 patients, the gene is among the 50 highest expressed genes. Amina's ratio provides information about the expression activity of a particular gene in breast cancer. For example, if Senol's frequency for FCGR3A is 589 and the total number of patients is 598, then Amina's Ratio (FCGR3A)w-50 is 0.98. After application of the EGEF algorithm, the extreme HE and LE genes are grouped as top and bottom window-N members, N = 3, 5, 10, 25, 50, 100 (Tables 6 and 7).
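As a quick check of the arithmetic in the FCGR3A example, the ratio can be recomputed from the two reported numbers. This is a minimal sketch: the function name `amina_ratio` is hypothetical, and only the values 589 and 598 are taken from the text.

```python
def amina_ratio(senol_frequency, total_patients):
    """Amina's ratio = Senol's frequency / total number of patients."""
    if total_patients <= 0:
        raise ValueError("total_patients must be positive")
    return senol_frequency / total_patients


# Reported values for FCGR3A in window-50.
fcgr3a_ratio = amina_ratio(589, 598)
print(round(fcgr3a_ratio, 2))  # 0.98
```

The unrounded value is about 0.985, so the reported 0.98 corresponds to rounding to two decimals.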

The algorithm-selected genes compared with control data
Based on the frequency, the patients' gene expression was compared with the control data in order to observe differences in expression activity. Only 25 of the 100 extreme highest and lowest expressed genes are presented in this paper (Figure 2). For example, the average ASPN expression is 7.21 in the cancer cells but much lower, -0.03, in the control (Figure 2). The same pattern is seen for all of the highly expressed genes: the genes selected by the algorithm in the high expressed group show a large difference in expression between patient and control samples. However, the comparison of low expressed genes in patient and control samples shows almost no difference. For example, AHSG is among the lowest expressed genes in nearly all patients (Senol's frequency = 593), and its expression level changes only from -6.29 in cancer cells to -6.59 in control cells (Figure 2).
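The contrast between the two groups can be made concrete by subtracting the control average from the cancer average for the two genes quoted above. This is only a sketch using the reported averages; the helper name `expression_shift` is hypothetical, and no per-sample data are assumed.

```python
# Reported average expression (cancer, control) for the two example genes.
reported_means = {
    "ASPN": (7.21, -0.03),   # highly expressed group
    "AHSG": (-6.29, -6.59),  # low expressed group
}


def expression_shift(cancer_mean, control_mean):
    """Difference between cancer and control averages, rounded to 2 decimals."""
    return round(cancer_mean - control_mean, 2)


shifts = {gene: expression_shift(*means) for gene, means in reported_means.items()}
print(shifts)
```

For these two genes the shift is about 7.24 for ASPN and about 0.30 for AHSG, which matches the qualitative statement that only the high expressed group differs strongly from the control.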

Involvement of the selected genes in tumorigenesis and cellular activity
The 100 extreme high and low expressed genes are categorized according to their biological features (Supplementary 4) so that we can predict which mechanisms are potentially activated in the breast cancer cell. Based on the function of each gene, we searched for its potential involvement in tumorigenesis (Table 8). For the highly expressed genes we observed the following relation between the size of window-N and involvement in tumorigenesis: the smaller the window-N, the stronger the relation with tumorigenesis, and the larger the window-N, the weaker the relation. If N is 3, 5 or 10, tumorigenesis involvement is 100%. If N is 20 or 50, the involvement is 84% and 77%, respectively (Table 8). The low expressed genes show different percentages. The highest tumorigenesis involvement, 100%, is detected in window-N = 3, after which the relation to tumorigenesis decreases: for N = 5, 10, 20 and 50 the percentage is 60%, 60%, 65% and 62%, respectively (Table 8).
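One way to picture how such percentages could be tabulated per window is sketched below. The paper does not spell out the exact denominator it used, so this sketch assumes that involvement is simply the share of the first N family members that carry a tumorigenesis annotation. The function name `involvement_percent`, the placeholder genes and the annotation set are all hypothetical.

```python
def involvement_percent(ranked_genes, related_genes, window_sizes):
    """Percentage of the first N ranked genes flagged as tumorigenesis-related.

    Assumption of this sketch: the denominator is the number of genes
    actually present in the window.
    """
    result = {}
    for n in window_sizes:
        family = ranked_genes[:n]
        if not family:
            continue  # skip empty windows instead of dividing by zero
        hits = sum(1 for gene in family if gene in related_genes)
        result[n] = 100.0 * hits / len(family)
    return result


# Hypothetical ranking (by Senol's frequency) and hypothetical annotation.
ranked = ["G1", "G2", "G3", "G4", "G5", "G6", "G7", "G8", "G9", "G10"]
related = {"G1", "G2", "G3", "G5", "G8"}

print(involvement_percent(ranked, related, [3, 5, 10]))
# {3: 100.0, 5: 80.0, 10: 50.0}
```

In this toy ranking the percentage falls as the window widens, which is the same qualitative trend reported for the highly expressed genes.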

Clustering of the EGEF-selected genes
The 50 extremely high expressed genes and the 50 extremely low expressed genes are clustered separately and shown in heat maps (Figures 3 and 4). The Pearson correlation is used to calculate the correlation of gene expression. The heat maps show the clustering of, and correlation among, the 50 extreme high and low genes [31][32][33][34].

The genes are selected according to the Window-N, 3, 5, 10, 25 and 50. The last column belongs to Window-50 and the last two rows represent the 49th and 50th top genes and their frequency. Bold genes stand for tumorigenesis involvement.
The genes are selected according to the Window-N, 3, 5, 10, 25 and 50. The last column belongs to Window-50 and the last two rows represent the 49th and 50th bottom genes and their frequency. Bold genes stand for tumorigenesis involvement.
According to Senol's frequency, (A) shows the percentage of the top expressed genes that are related to tumorigenesis. (B) shows the percentage of the bottom expressed genes that are related to tumorigenesis.
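The Pearson correlation mentioned for the heat maps can be sketched as follows. The clustering itself was done in HCE 3.0, so this is not the authors' code: the function names, the use of 1 - r as a clustering distance and the toy expression vectors are assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two expression vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


def correlation_distance(x, y):
    """A common choice of distance for hierarchical clustering: 1 - r."""
    return 1.0 - pearson(x, y)


# Hypothetical expression values of three placeholder genes in four patients.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

Genes with a small correlation distance would be placed close together in a hierarchical clustering, while anticorrelated genes such as `gene_a` and `gene_c` end up far apart.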

Discussion
Changes in global gene expression help us to better understand the biological activities that drive carcinogenesis. The EGEF algorithm sorted, grouped and compared the highest and the lowest expressed genes (n = 100). The resulting analysis allows us to predict which genes would show similar expression signatures in invasive breast cancer, allowing us to recognize specific biological activities and processes. The EGEF algorithm can be used to detect expression signatures in other cancers and biological processes. In the future, the results of the EGEF algorithm can be correlated with clinical parameters in order to find potential new targets for drug treatment. Most algorithms focus on finding outlier expressed genes, oncogenes or tumor suppressor genes, whereas the EGEF algorithm points out tumorigenesis-related genes and the partner genes that help a cell to convert to a cancer cell. Clinical and genomic work on cancer needs a new perspective on the heterogeneity of cancer development and clinical treatment. The new algorithm takes a different approach from previous ones, which only target abnormally expressed genes: the main goal of the EGEF algorithm is to find tumorigenesis-related genes, their family members and the strength of the relations within the family. If we change our view of the problem, we might be able to find new solutions or ways to target therapy.

4-Determine and prepare a list of genes which are covered by the window-N. The fourth step is to determine and prepare a list of the genes covered by window-N. The step is explained by two examples: the first with N = 3 and 598 patients, the second with N = 5 and 598 patients; both are given in Table

Figure 1: EGEF Algorithm Process. The data are downloaded and supplied to the EGEF algorithm. The algorithm ends with the frequency and ratio of each gene. According to the frequency and ratio, Extreme Gene Expression Families are produced.

Figure 2: Comparison of Cancer and Control Gene Expression. Table A compares the 25 highest expressed genes in cancer and control; their averages are 5A2 and 0.48, respectively. Table B compares the 25 lowest expressed genes in cancer and control; the averages are almost the same, -4.98. While the average of the high expressed genes is much higher in breast cancer than in the control, the average of the low expressed genes is almost the same.

Figure 3: EGEF high expressed genes. The heat map is generated by HCE 3.0 and clustered according to Pearson correlation. The figure shows the relations of the 50 extreme gene expressions to each other. There is clear diversity among the high expressed genes. Nevertheless, the number of genes that are expressed together is very high.

Figure 4: EGEF low expressed genes. The figure is generated by HCE 3.0 and clustered according to Pearson correlation. The heat map shows the clustering and relations of the 50 low expressed genes. AHSG, TYR, FABP1, RPS4Y1 and NTS show strong correlation among the low expressed genes in the cancer. The figure clearly shows the strongly and weakly correlated genes.

Table 1: Preparing a table, with an example. based on preparing tables. For each gene, the expression values of the n patients are written out and prepared for the statistical analyses. The expression values of 17814 genes from 598 invasive breast cancer patients are tabulated in that order.

Table 2: Ascending and descending order of the genes.

Table 4: Genes included by window-3 and window-5. applied to the algorithm by using Window-N. The patients' top 3 and top 5 genes are listed, respectively. By changing N in the window, different or common top and bottom expressed genes are detected.

Table 7: The low expressed genes and their Senol's frequency.