New comparative genomics approach reveals a conserved health span signature across species.

Environmental and genetic interventions extend health span in a range of organisms by triggering changes in different specific but complementary pathways. We investigated the gene expression changes that occur across species when health span is extended via different interventions. To perform this comparison using heterogeneous datasets from different measurement platforms and organisms, we developed a novel non-parametric methodology that can assess the statistical significance of overlaps between ranked lists of genes and estimate the number of genes with a common expression profile. By comparing genetic and environmental interventions that consistently lead to increased health span in invertebrates and vertebrates, we built a conserved health span signature and described how such a signature depends on tissue type. Furthermore, we examined the relationship between calorie restriction and resveratrol administration and, for the first time, identified common gene and pathway changes in calorie restriction and resveratrol in both invertebrates and mammals. Our approach can thus be used to explore and better define the relationships between highly complex biological phenomena, in this case those that affect health and longevity.

Table 1 shows the contingency table describing the comparison of the top m genes across two experimental conditions. A and B are the two experimental conditions, N is the total number of genes, m is the number of genes selected from each experiment (typically the top m genes from the ranked list are selected), and k is the number of genes in the intersection.
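Table 1: Contingency table for the comparison of the top m genes in conditions A and B (Ā and B̄ denote the genes outside the top m of the respective list).

$$\begin{array}{c|cc|c} & A & \bar{A} & \text{Total} \\ \hline B & k & m-k & m \\ \bar{B} & m-k & N-2m+k & N-m \\ \hline \text{Total} & m & N-m & N \end{array}$$

Since the margins of Table 1 are fixed, the probability of observing k genes in the intersection of two lists generated by randomly choosing two sets of m genes out of a total of N is given by the hypergeometric distribution:

$$P(X = k) = \frac{\binom{m}{k}\binom{N-m}{m-k}}{\binom{N}{m}} \tag{1}$$

If k* is the observed number of genes in the intersection, we want to compute the probability of observing at least k* genes in the intersection when the two lists are randomly generated. This probability is given by:

$$P(X \ge k^*) = \sum_{k=k^*}^{m} P(X = k) \tag{2}$$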
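As an illustration of the tail probability described above, the following minimal Python sketch evaluates the hypergeometric probability of an overlap of exactly k genes and sums it over all overlaps of at least the observed size. The function names `overlap_pmf` and `overlap_pvalue` are hypothetical and chosen only for this example; the sketch assumes both lists contain the same N genes and that the same number m of top genes is taken from each.

```python
from math import comb


def overlap_pmf(k, m, N):
    """Probability that two random sets of m genes, drawn from N, share exactly k genes."""
    # Overlaps outside the feasible range have probability zero.
    if k < 0 or k > m or m - k > N - m:
        return 0.0
    # Exact integer arithmetic; the final division yields a float.
    return comb(m, k) * comb(N - m, m - k) / comb(N, m)


def overlap_pvalue(k_obs, m, N):
    """Probability of observing at least k_obs shared genes by chance (upper tail)."""
    return sum(overlap_pmf(k, m, N) for k in range(k_obs, m + 1))
```

Calling `overlap_pvalue(k_obs, m, N)` with the observed intersection size gives the value that is compared against the significance level α. Exact binomial coefficients are used here for clarity; for very large N a library routine for the hypergeometric survival function may be preferable.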

Statistical significance of adding genes
When we increase the number of genes we compare in the two lists, i.e. the value of m, the number of genes in the intersection is bound to increase as well. It is possible that, although the intersection is still significant, the number of genes added to the intersection is close to what we would expect by chance. To test for this, we need to compute the probability of the observed increase under random selection. Let k_1 be the number of genes in the intersection when we select m_1 genes from the two lists, and k_2 the number of genes in the intersection when we select m_2 genes. The probability of observing the values in Table 2 is given by Equation (3), and the probability of observing an intersection larger than or equal to k* is then given by Equation (4). So for a given ∆k = k_2 − k_1 it is possible to use Equation (4) to test for significance of the intersection between the two lists for the added ∆m = m_2 − m_1 genes in the ranked list. Specifically, adding ∆m genes leads to a significant increase in the intersection between the two lists if P(X ≥ ∆k) ≤ α, where α is the significance level.

List comparison algorithm
The top-ranking m genes in the two experimental conditions are compared by computing the two probabilities in Equations (2) and (4). The value of m is increased in steps of ∆m until either of the two probabilities exceeds a set significance level α. In the analysis of both simulated and experimental data in this paper we used ∆m = 100 and α = 0.05. In the experimental datasets, the algorithm was applied separately to up-regulated and down-regulated genes. For the down-regulated genes, genes were ranked in increasing order of fold-change, i.e. from the largest negative fold-change to the smallest negative fold-change.
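The stepping procedure can be sketched in Python roughly as follows. This is an illustrative sketch, not the authors' implementation, and it makes several assumptions: the names `estimate_m` and `added_pvalue` are hypothetical; the returned estimate is taken to be the last value of m at which every test was still significant; and the test on the added genes (Equation (4)) is not implemented here, so it is exposed only as an optional user-supplied callable with a made-up signature.

```python
from math import comb


def overlap_pvalue(k_obs, m, N):
    """Hypergeometric upper tail: chance of at least k_obs shared genes among two top-m lists."""
    total = comb(N, m)
    return sum(comb(m, k) * comb(N - m, m - k) for k in range(k_obs, m + 1)) / total


def estimate_m(list_a, list_b, step=100, alpha=0.05, added_pvalue=None):
    """Grow m in steps until a test is no longer significant; return the last significant m.

    added_pvalue is an optional, hypothetical hook for the added-genes test:
    it is called as added_pvalue(delta_k, m1, m2, k1, N) and must return a probability.
    """
    N = len(list_a)
    best_m = 0              # assumption: 0 means no significant overlap was found
    prev_m, prev_k = 0, 0
    m = step
    while m <= N:
        k = len(set(list_a[:m]) & set(list_b[:m]))
        # Overlap test on the full top-m lists.
        if overlap_pvalue(k, m, N) > alpha:
            break
        # Optional test on the genes added since the previous step.
        if added_pvalue is not None and prev_m > 0:
            if added_pvalue(k - prev_k, prev_m, m, prev_k, N) > alpha:
                break
        best_m = m
        prev_m, prev_k = m, k
        m += step
    return best_m
```

For up-regulated genes both lists would be ranked by decreasing fold-change; for down-regulated genes, by increasing fold-change, as described above.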

Algorithm performance
We evaluated the performance of the gene list comparison method on simulated data generated as follows. For any given value of m and k, we took a ranked list of N = 10,000 genes (list A) and randomly selected a set of k genes among the top m genes. These k genes were distributed randomly across the top m ranks of list B. All of the remaining genes in list A were randomly distributed among the remaining ranks of list B, from 1 to N, that had not already been occupied by the initial k genes. This procedure guarantees that the two lists have at least k genes in common among the top m genes. We then applied the list comparison algorithm as described in Section 4 for different values of m and k to empirically estimate the value of m. For each value of m and k we computed the percentage of times, out of 100 simulations, that the estimate of m was equal to the original value used to generate the data. Figure 1 shows this percentage as a function of k/m for different values of m/N, from 4% to 15%. When just 20% of the genes are in common among the top 4% of the total number of genes in the lists, the algorithm estimated the correct value of m over 80% of the time, and reached 100% when 40% or more genes were in common. As we increased the value of m in the simulations, the value of k/m needed to achieve a percentage of correct estimation of 80% or more increased. This is expected because the statistical significance of the overlap decreases for increasing values of m.

Figure 1: Performance of the list comparison algorithm on simulated data. Two lists with a known number k of common genes among the top m genes in the rank-ordered list were analyzed using the list comparison algorithm. The y-axis corresponds to the percentage of times the correct value of m was estimated out of 100 simulations.
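The data-generation step of the simulation can be sketched in Python along these lines. This is a minimal sketch under stated assumptions: genes are represented by integer identifiers 0 to N − 1, list A is simply these identifiers in rank order, and the function name `simulate_lists` and its `rng` parameter are hypothetical, introduced only for this example.

```python
import random


def simulate_lists(N, m, k, rng=None):
    """Build two ranked lists sharing at least k genes among their top m ranks."""
    rng = rng or random.Random()
    list_a = list(range(N))                  # gene IDs in rank order for list A
    shared = rng.sample(list_a[:m], k)       # k genes drawn from the top m of A
    top_slots = rng.sample(range(m), k)      # k distinct ranks within the top m of B

    list_b = [None] * N
    for gene, slot in zip(shared, top_slots):
        list_b[slot] = gene

    # Scatter every remaining gene of A over the ranks of B that are still free.
    shared_set = set(shared)
    rest = [g for g in list_a if g not in shared_set]
    rng.shuffle(rest)
    free_slots = [i for i in range(N) if list_b[i] is None]
    for gene, slot in zip(rest, free_slots):
        list_b[slot] = gene
    return list_a, list_b
```

Because the remaining genes are placed at random, some of them can also land in the top m ranks of list B, which is why the procedure guarantees at least k, rather than exactly k, common genes. Passing a seeded `random.Random` instance makes a run reproducible.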