An Optimization-Driven Analysis Pipeline to Uncover Biomarkers and Signaling Paths: Cervix Cancer

Establishing how a series of potentially important genes might relate to each other is relevant to understand the origin and evolution of illnesses, such as cancer. High-throughput biological experiments have played a critical role in providing information in this regard. A special challenge, however, is that of trying to conciliate information from separate microarray experiments to build a potential genetic signaling path. This work proposes a two-step analysis pipeline, based on optimization, to approach meta-analysis aiming to build a proxy for a genetic signaling path.


Introduction
Technology advancement has accelerated the capability to generate large amounts of biological data. The capability to translate these data into usable knowledge has, however, grown at a much slower rate. The technologies used to generate these data are often rendered obsolete by newer ones before the data already available are fully analyzed and taken to their full potential for biological and medical advancement. Microarrays constitute a technology of this sort: one used to generate a large number of experiments, many of which will be greatly under-utilized. The analysis of microarrays, however, still holds a large potential for the discovery of genetic biomarkers for all types of cancer, as well as elicit their signaling pathways. Extracting this kind of knowledge from microarray experiments has historically been considered challenging, largely due to two main difficulties: (i) the use of incommensurable units across different experiments, and (ii) the lack of analysis techniques that converge to a consistent set of biomarkers. These two difficulties propagate uncertainty to the task of determining a reliable signaling pathway. To this end, this work proposes a two-step pipeline that involves (1) a meta-analysis strategy, based on multiple-criteria optimization, which circumvents both of the main difficulties described previously to detect highly differentially expressed genes; and (2) a method, based on integer programming to find the most correlated path among the genes from the previous step. The central hypothesis is that there is a strong signal of relative expression in microarrays that is effectively discoverable through mathematical optimization.
It is critical that the detection of genetic cancer biomarkers through meta-analysis can be carried out faster, more consistently, and more accurately in order to shorten the lead-time from data generation to data interpretation and knowledge application. The simultaneous meta-analysis of multiple experiments via optimization and the subsequent identification of the highest correlated genetic path described in this work offer these capabilities. Microarray data already in repositories can be readily analyzed and, prospectively, new high-throughput biological technologies could be fully utilized earlier in the fight against cancer. The gap between raw data and applicable biomedical/medical knowledge can be reduced significantly; especially when considering that historic biological data will now be able to be brought into perspective to design new experiments and focus on more precise aspects of exploration.

Method
The proposed analysis pipeline has two sequential stages: (1) Meta-analysis for detection of highly differentially expressed genes and (2) finding the most correlated path. These are explained next.

Stage 1: Meta-Analysis for Detection of Highly Differentially Expressed Genes
Meta-analysis involves the joint study of multiple databases to obtain conclusions that apply across all of them. Meta-analysis can help detect potential genetic cancer biomarkers through the study of microarray databases. However, to this end, a series of difficulties are apparent: (a) Microarray experiments that are publicly available use different technologies, platforms and, most of the times, different scales. Incommensurability renders many meta-analyses efforts unfeasible [1] due to the inability to make comparisons across all experiments of interest. Even when the same units are used, often time, data normalization is required for comparability. (b) There is not an efficient, systematic method to carry out meta-analysis. Most of the studies analyze one particular database and try to generalize the results to other databases or analyze several databases separately and try to make sense of all the independent results [2]. (c) The issue of having a large number of measurements and genes generally results in large number of significant genes that must be validated [3]. (d) Meta-analysis of microarrays-and of high throughput biological experiments in general-is a laborious process that is often outpaced by the development of technology to generate ever-larger data sets. That is, data generation capabilities are larger and grow faster than our abilities to make sense and translate these data into usable knowledge. (e) Large repositories of public data generated through costly microarray experiments could go underanalyzed and underutilized in the fight for cancer when the researchers' attention shifts to the next high-throughput technology. The problem of making sense of large quantities of data, however, will persist.

Multiple Criteria Optimization
Multiple Criteria Optimization (MCO) is a field from Engineering Mathematics that deals with making decisions in the presence of multiple performance measures in conflict, i.e., decisions where optimizing one criterion results in moving away from optimality in at least another criterion. Because of the presence of conflict, an MCO problem does not find a single best solution but rather a set of best compromising solutions in light of the performance measures under analysis. The best compromises define solutions called Pareto-Efficient (or simply Efficient, for short) that define the Efficient Frontier of the MCO problem at hand. A typical multiple criteria optimization with two conflicting performance measures (objectives), PMs, can be visualized as in Figure 1. In this figure, a set of seven candidate points, characterized by their values on both performance measures, are shown. The performance measure represented in the x-axis is to be maximized while the performance measure in the y-axis is to be minimized in this example. The problem is to find those candidate points that dominate all of the other points in both performance measures. In the face of conflict, this will result in a group of candidates in the southeast extreme of the set in Figure 1, solutions 3 and 5. These are Pareto-efficient solutions and, when all of them are accounted for, they integrate the Efficient Frontier of the MCO problem. In this example, it can be noted that among efficient solutions, an improvement in one performance measure can only come strictly at the detriment of another one: moving from solution 5 to solution 3 will result in an improvement in the performance measure associated to the vertical direction, but in a loss in the performance measure associated to the horizontal direction. Note that the general problem involves at least two performance measures to be optimized, where only the case with two performance measures has a convenient graphical representation. An MCO problem, however, can include as many dimensions (or performance measures) as necessary.
The general mathematical formulation of an unconstrained MCO problem is as follows: The MCO problem in (1) can be discretized onto a set with | | points in the space of the decision variables so as to define particular solutions , = 1,2, … , | |) which can, in turn, be evaluated in the performance measures to result in values . That is, the combination of values for the decision variables evaluated in the objective function. The illustrative example in Figure 1 follows this discretization with = 2 performance measures and | | = 7 solutions.
The MCO formulation under such discretization is, then as follows: The solutions to (2) are, then, the Pareto-efficient solutions of the discretized MCO problem. Considering formulation (2), a particular combination with evaluations will yield a Pareto-Efficient solution to (2) if and only if no other solution exists that meets two conditions, from this point on called Pareto-optimality conditions: Conditions (1) and (2) imply that no other solution dominates the solution under evaluation, , in all performance measures simultaneously. In previous publications [3,4] our group has demonstrated that if a set of candidate solutions evaluated by multiple performance measures is available, it is possible to determine a series of best compromises between all criteria through a technique called Data Envelopment Analysis (DEA). The idea behind DEA is to use an optimization model to compute a relative efficiency score for each particular solution with respect to the rest of the candidate solutions. The resulting best compromises, identified through their efficiency score, form the envelope of the solution set, therefore the name Data Envelopment Analysis. These solutions are indeed Pareto-efficient solutions of the problem under analysis.
The DEA linear programming formulations proposed by Banks, Charnes, and Cooper [5] are shown below: , , , where μ and ν are column vectors containing multipliers to be optimally determined together with scalar variables and in the first case and together with and in the second case; and are column vectors containing the values of the jth combination of performance measures to be minimized and maximized respectively; and ε is a scalar usually set to a value of 1 × 10 −6 . (4) is called the BCC Output Oriented Model. Both models are applied to each of the n candidate solutions. A particular solution with an objective function score of 1 (i.e., an efficiency score of 1) using both formulations is in the envelope of the set and is considered to be an efficient solution to the MCO problem. The BCC model is just one of many possible DEA formulations, albeit a very powerful one. This model's mathematical linear nature provides it with the capability of finding efficient solutions associated with the data set under analysis through a series of piece-wise linear segments. Nonlinear behavior is, then, approached with tractability and with the certainty that at least the efficient solutions lying in the convex part of the frontier are being found. Figure 2 shows an MCO problem solved through with DEA, specifically with the BCC model. DEA has several advantages including: (i) computational efficiency owing to its linear optimization structure; (ii) objectivity and consistency of results, which follows from not requiring the adjustment of parameters or assigning weights to the different performance measures by the user, and (iii) capability of analyzing several microarray experiments with incommensurate units. Appendix A discusses the volcano plot, a widely used tool to detect differentially expressed genes, to illustrate how the analyst can bias the results. On the other hand, one limitation of DEA is that of depending on a series of local linear approximations, as shown in Figure 2. Every time that a linear segment is superimposed over the set under analysis, there are genes lying in the nonconvex part of the set frontier that escape detection.

Model (3) is called the BCC Input Oriented Model and Model
These genes could be potential biomarkers, however. In order to circumvent the said disadvantage, the authors proposed that DEA be applied successively 10 times, each time removing the genes found in a particular iteration from the set for subsequent analyses. This strategy results in 10 frontiers, as seen in Figure 3. The number of efficient frontiers is, admittedly, an arbitrary number at this point, thus further refinement is necessary in this aspect.  At the end of Stage 1, the analyst is left with a set of differentially expressed genes that can be investigated to establish their role in the condition or illness under study, cancer in this case. This set of genes in the proposed method, however, will be used to determine how these are maximally correlated in Stage 2.

Stage 2: Finding the Most Correlated Path
It is proposed that the most correlated path among the list of candidate genes identified in the previous stage can be found optimally. To this end, the optimization problem identified in the literature as the Travelling Salesman Problem (TSP), is introduced here as a viable model.
The TSP is generally stated as follows: a salesman needs to visit n cities and needs to minimize the travel distance starting and finishing in the city of origin. Each city must be visited only once. The solution, then, is a tour. In n cities, there is a total of n! tours. If a particular city of origin is selected a priori, then the number of tours is (n-1)!. In our case, the objective is to find the tour among n genes of interest that maximizes the sum of the absolute values of pairwise correlations. This tour would then be interpreted as a surrogate for a biological pathway, defined as "a series of actions among molecules in a cell" [6], and more specifically for a genetic signaling pathway. A biological pathway "can provide clues about what goes wrong when a disease strikes." [6].
As a first approximation, it is proposed that the absolute values of linear correlation coefficients computed among a list of genes of potential biomarkers be used to construct networks such as the one presented in Figure 4, where the TSP can be readily applied. The idea of using a linear statistical correlation is, indeed, widely used in the literature as a means to uncover genetic coexpression. This information, in turn, should help cancer researchers in understanding the disease as well as look for targeted treatments. The paper by Kumari et al. [7] has studied different coexpression measurements, recommending to carry out a preliminary study to determine the most appropriate one for different objectives. It is, then, convenient at this point to resort to the use of the Pearson correlation coefficient as a starting point in this work. The TSP can, indeed, be understood as an optimization problem. Consider that cij represents the cost of traveling from city i to city j and let yij be a binary variable, indicating whether or not the salesman travels from city i to city j. Additionally let us define flow variables xij on each arc (i,j) and assume that the salesman has n-1 units available at node 1, which is arbitrarily selected as a "source node", and he must deliver 1 unit to each of the other nodes [7]. The optimization model is as follows: , ∈ (5a) Following the description in [8], let A' = {(i,j): yij =1} and let A'' ={(i,j): xij >0}. The constraints (5b) and (5c) imply that exactly one arc of A' leaves and enters any node i; therefore, A' is the union of node disjoint cycles containing all of the nodes of N. In general, any integer solution satisfying (5b) and (5c) will be a union of disjoint cycles; if any such solution contains more than once cycle; they are referred to as subtours, since they pass through only a subset of nodes.
In constraint (5d) N is an nxm matrix, called the node-arc incidence matrix of the minimum cost flow problem. Each column Nij in the matrix corresponds to the variable xij. The column Nij has a +1 in the ith row, a −1 in the jth row; the rest of its entries are zero. Constraint (5d) ensures that A" is connected since we need to send 1 unit of flow from node 1 to every other node via arcs in A". The forcing constraints (5e) imply that A" is a subset A'. These conditions imply that the arc set A' is connected and thus cannot contain subtours [8].
The TSP is known to be a hard problem to solve to optimality; however, with a manageable number of entities (nodes) optimality is well within reach. In our group's experience it has been possible to obtain the optimal TSP tour with a list of up to 100 genes in less than 1 hour of computing time in a personal computer. The Branch and Bound-an exact algorithm-was used to this end, as coded in Matlab. An exact algorithm is defined as one capable to arrive to a global optimal solution-provided that one exists-with certainty. Although it is also possible to use heuristics to approach the TSP, it must be understood that a heuristic method by definition does not provide certainty on arriving to a global optimal solution.
Referring back to Figure 4, it should be now apparent that in n genes associated to the nodes in the network, it is possible to obtain pairwise correlations to connect all genes among them resulting in a fully connected network. This network, in turn, can be mathematically translated into formulation (5a)-(5g) to identify the most correlated path. Thus, at the end of this stage, the most correlated path among all candidate genes from Stage 1, will be available as a proxy for a signalling path. The application of this two-stage analysis pipeline is demonstrated next in the context of cervix cancer.

Stage 1
In order to demonstrate the proposed analysis pipelines, this section presents results in cervix cancer previously published in [3]. The database used for this study was introduced in [9] and contained 8 healthy tissues and 25 cervical cancer tissues, all of them with expression level readings for 10,692 genes from a cDNA microarray. The list of 28 potential biomarkers after applying DEA is shown in Table 1. The genes in this list were cross validated for agreement in the direction of expression change in an independent database associated to [10]. As described previously, these genes were extracted from the first 10 frontiers of the analysis. The role of the selected genes in cancer was previously discussed in a previous publication of our group [3]. The fourth column of Table 1 summarizes the types of cancer that where the particular genes were found to be involved following such results.

Stage 2
Correlation is used in this project as a proxy for inhibitory or excitatory behavior between differences in the expression levels of two genes. As a first step, the linear correlation values between potential biomarkers are obtained. The following step was to arrange the correlation values in a matrix. To construct this matrix, first the differences between control and cancer tissues had to be calculated for each gene. Then, the absolute values of the correlation coefficients were calculated among each pair of genes based on these differences and stored in the said matrix. The absolute correlation values were consequently associated to the arcs in a fully connected graph with nodes representing potential biomarker genes. The resulting graph made possible the use of the formulation of the TSP. The optimal solution to this particular TSP is the tour among the genes of interest with the largest possible correlation, or similarly, the most correlated cyclic path as shown in Figure 5. It must be recalled at this point that there are a total of 28! ≈ 3.04 × 10 ways in which a cyclic path can be drawn among the 28 genes.
The TSP formulation allows a wide range of analyses. In this case, the idea was to test the stability of the TSP solutions. In order to do so, TSP solutions were obtained using increasing numbers of potential biomarkers in the list of genes presented in Table 1 following the increasing order of the efficient frontier in which these were found. Starting with five genes, each time five more genes were introduced until the list was depleted on each case. Path segments that persisted across both databases were identified. Furthermore, path segments that persisted along the entire study were deemed the most stable. The results of this study were then matched against known biological pathways publicly available in the Kyoto Encyclopedia of Genes and Genomes (KEGG) [49]. A python script was written to make this process more efficient. This script is provided in Appendix B. Table 2 summarizes the results for each progressive analysis that introduced five genes at a time. As shown in Table 2, (LRPPRC with MTHFD2) and (RPN1 with COX17) are adjacent in the correlated cyclic path when the optimal solution is obtained for 25 and 28 genes. In addition, gene S100A14 is adjacent to TSPO when the optimal solution for 5, 15, 20, and 25 genes is found.  Table 2. Adjacent genes in the solutions for the correlated cyclic path found adding five genes at a time.

Number of Genes
Adjacent Genes 5 (CRABP2 with PRSS2) and (S100A14 with TSPO) 10 (PIK3R1 with MAP4) and (GADD45GIP1 with ICOSLG) 15 (SSX1 with BUD31), (ICOSLG with SNTG1), and (S100A14 with TSPO) 20 (LRPPRC with C5orf22) and (S100A14 with TSPO) 25 (S100A14 with TSPO), (SSX1 with GRIA3), (LRPPRC with MTHFD2), (RAD21 with BUD31), and (RPN1 with COX17) 28 (LRPPRC with MTHFD2) and (RPN1 with COX17) A search for biological pathways in KEGG databases was conducted, however not every gene could be linked to a pathway. When comparing the known biological pathways with the obtained optimal solutions, for database GSE 7803 [9] and GSE 9750 [10], the only genes that appeared adjacent in the correlated cyclic path for both were COX17 with RPN1 and the only KEGG pathway common to both has the identifier 01100 that corresponds to the collection of Methabolic pathways. On the other hand, for database GSE 7803 [9], medium correlation was observed between genes HNRPA3 with BUD31, and both gene products are present in KEGG pathway 03040 that corresponds to the splisosome. For database GSE 9750 [10], PLCE1 is adjacent to PIK3R1, both gene products share the following KEGG pathways: 04012 that corresponds to the ErbB signaling pathway, 04015 Ras signaling pathway, 04015 Rap1 signaling pathway, 04066 HIF-1 signaling pathway, 04070 Phosphatidylinositol signaling system, 04370 VEGF signaling pathway, 04650 Natural killer cell mediated cytotoxicity, 04660 T cell receptor signaling pathway, 04664 Fc epsilon RI signaling pathway, 04666 Fc gamma R-mediated phagocytosis, 04670 Leukocyte transendothelial migration, 04722 Neurotrophin signaling pathway, 04750 Inflammatory mediator regulation of TRP channels, 04919 Thyroid hormone signaling pathway, 05169 Epstein-Barr virus infection, 05200 Pathways in cancer, 05200 Pathways in cancer, 05214 Glioma, 05223 Non-small cell lung cancer, and the 05231 KEGG pathway that corresponds to Choline metabolism in cancer. In cancer there are chromosomal physical changes that produce gains or losses of certain genes. To explore if the position of the genes in the cyclic path could also provide information about these chromosomal changes, the location of each gene was consider (this information was obtained from [50]). This information is listed in Table 3. All chromosomes in Table 3 have been reported as having changes in cervical cancer, in regions close to the ones where the selected genes belong. It is interesting to note that some of the genes that are neighbors in the cyclic path are also neighbors in their genetic localization.
HIPK1, NUCKS1, SMG7, and CRABP2 are all in chromosome 1, the first two genes of the list are adjacent in the cyclic path and the others are scatter through the cycle. Reported changes in chromosome 1 in cervical cancer include: gains in the 1p region [51][52][53], increment on the 1q32.1-32.2 genes expression [44], aneusomy of the chromosome [54] among others.
Three genes are in chromosome 2, HNRNPA3, LRPPRC and MTHFD2. There are several changes in chromosome 2 related to cervical cancer, for example reduced expression of genes in 2p has been reported [55], it has also been reported that deletions of the 2q33-q37 are common in cervical carcinoma [56] as well as loss of heterozygosity at 2q35-q37.1 [57].
COX17, RNP1, MAP4, and SMC4 (separated by three genes from the group), and ATP2C (adjacent to SMC4) are all in chromosome 3. Changes in chromosome 3 have been extensively reported for cervical cancer. Gain of chromosome 3q has been reported in pre-cancer and cancer of the cervix (these are some of the reports: [58][59][60][61]) while loss of 3p12-p14 has also been observed [62] and loss of heterozygosity on chromosome 3p has been also reported in this cancer [55].
The results suggest that the chromosomal gains and losses known for cervical cancer could include bigger regions. It is clear that true experimental validation is critical to further support the results of the proposed pipeline analysis at this point. It is also important, however, to notice its potential for biological discovery. Every time that a biological pathway is discovered, it basically is a problem of selecting a path by systematically choosing pairs of genes with scientific basis. If a mathematical point of view is adopted, this practice implies that the solution is built heuristically as opposed to optimally. This insight has important implications for the adoption of optimization methods in Medicine and Biology.

Conclusions
This work proposes a pipeline analysis based on optimization to facilitate the discovery of genetic signaling paths related to cancer and also could provide information about expanded chromosomal regions that are compromised for cases to be studied. In this instance, the method was applied to cervix cancer. The potential of the proposed method is significant if the detection of a biological pathway is understood as a combinatorial problem similar to the Traveling Salesman Problem, for which an optimal solution exists. If positively verified, this point of view could also imply that current biological pathways might have room for improvement to fully capture the signal in microarray experiments, and thus open the possibility of further discovery in the understanding-and fight-against cancer.

Acknowledgments
This work was made possible thanks to the National Institutes of Health (NIH) MARC Grant: T36-GM-095335 Bioinformatics Programs at Minority Institutions. It was also partially supported by BioSEI UPRM grant 330103080301and PRLSAMP. MCO does not require the adjustment of parameters by the users to detect differentially expressed genes, thus it preserves the objectivity and the consistent convergence of the analysis. As a comparative example, a tool like the volcano plot ( Figure A1), would require for the user to define different cutoff values to select different genes, thereby biasing the analysis. Figure A1. A volcano plot. Two cutoff values must be set by the user to decide upon the significance of different genes: one for fold change (x-axis) and one for p-value (y-axis).

Author Contributions
In addition to the evident result of choosing different sets of genes when the analyst picks different sets of cutoff values, this decision also greatly affects the number of genes that are deemed to change their expression significantly, as shown in Table A1.

Appendix B
A Python script was written to gather information from KEGG. This is provided in its three parts below.

Conflicts of Interest
The authors declare no conflict of interest.