Controlling the false discovery rate in behavior genetics research

https://doi.org/10.1016/S0166-4328(01)00297-2

Abstract

The screening of many endpoints when comparing groups from different strains, searching for some statistically significant difference, raises the multiple comparisons problem in its most severe form. Using the 0.05 level to decide which of the many endpoints’ differences are statistically significant, the probability of finding a difference to be significant even though it is not real increases far beyond 0.05. The traditional approach to this problem has been to control the probability of making even one such error, the Bonferroni procedure being the most familiar procedure achieving such control. However, the loss of power incurred by such control has led many practitioners to neglect multiplicity control altogether. The False Discovery Rate (FDR), suggested by Benjamini and Hochberg [J Royal Stat Soc Ser B 57 (1995) 289], offers a new and different point of view on the error in multiple comparisons, one that represents a compromise. The FDR is the expected proportion of false discoveries among the discoveries, and controlling the FDR goes a long way towards controlling the increased error from multiplicity while losing less of the ability to discover real differences. In this paper we demonstrate the problem in two studies: the study of exploratory behavior [Behav Brain Res (2001)], and the study of the interaction of strain differences with laboratory environment [Science 284 (1999) 1670]. We explain the FDR criterion, and present two simple procedures that control it. We demonstrate their increased power when applied in the above two studies.

Introduction

A quantifiable description of mouse behavior should promote the mapping of the mouse genome by characterizing the repertoires of inbred strains, congenic lines, knockouts, transgenic lines, and populations obtained by selective breeding. The need for such characterization has resulted in the design of batteries of behavioral and physiological tests. Such studies never consist of a single pre-specified measure compared between two strains of mice. Instead, they develop and explore many characteristics, also called behavioral endpoints, trying to identify those endpoints for which there is a significant strain difference. We estimate that at the time of writing the working list of behavioral endpoints is about 100 endpoints long, and keeps growing.

The screening of many endpoints when comparing groups from different strains, searching for some statistically significant difference, raises the multiple comparisons problem in its most severe form. The search is conducted by testing each hypothesis of no strain difference in some endpoint at some declared level of statistical significance, say 0.05. Detecting such a difference as ‘statistically significant’ amounts to making a statistical discovery. But when screening such a large family of hypotheses simultaneously, the probability of making a false discovery may increase far beyond the declared 0.05 level. If 100 endpoints are compared in a study, assuming there are few real trait differences between the strains, and if no action is taken, the average number of errors per study will be a little less than 100×0.05, i.e. close to 5. This will be the case whether the endpoints are statistically independent or not.

The traditional approach in multiple hypotheses testing to tackling this increased probability of making false discoveries has been to control the probability of making even one false discovery: the control of the familywise error-rate, as it is called in statistical jargon. The books by Hochberg and Tamhane [9], Westfall and Young [15] and Hsu [10] all reflect this tradition. Controlling this error-rate at some level α requires each of the m tests of the endpoints to be conducted at a lower level. In the Bonferroni procedure, for example, α/m has to be used.
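As a minimal sketch of the Bonferroni rule just described (not code from the paper, and with hypothetical P-values): each of the m endpoints is tested at level α/m.

```python
# Sketch of the Bonferroni procedure: with m endpoints and familywise
# level alpha, an endpoint is declared significant only if its p-value
# is at most alpha / m. The p-values below are hypothetical.

def bonferroni_reject(p_values, alpha=0.05):
    """Return indices of hypotheses rejected at familywise level alpha."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]

p_values = [0.0001, 0.004, 0.012, 0.03, 0.20]
print(bonferroni_reject(p_values, alpha=0.05))  # → [0, 1]
```

With five endpoints each test is conducted at 0.05/5 = 0.01, so the endpoints with P-values 0.012 and 0.03, significant on their own, are no longer declared significant. This is exactly the loss of power discussed next.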

The Bonferroni procedure is just an example, as more powerful procedures that control the probability of making even one false discovery are now available for many multiple comparison problems. Many of the newer procedures are as flexible as the Bonferroni, making use of the P-values only. For a recent review see [9]. Still, there is a fundamental drawback to this traditional approach: the probability of discovering a real strain difference in an endpoint (the power) is greatly reduced when screening a large family of potential endpoints. The loss of power incurred in large problems (even with the newer procedures) has led many practitioners to neglect multiplicity control altogether. While multiplicity adjustment is mandatory in psychological research, most medical journals do not require an analysis of the effect of multiplicity on the statistical conclusions, the leading New England Journal of Medicine being among the few that do.

In genetic research, the need for multiplicity control has been debated heavily. In QTL analysis, the debate resulted in a compromise: allow the probability of making even one false discovery to be as high as one half in order to increase power, then follow the original study with a more limited confirmatory study to ensure better protection against false discoveries (see [14] for background and further references). This strategy has been advocated elsewhere as a way to deal with the consequences of multiplicity in smaller studies, and can be quite effective. Nevertheless it has shortcomings: it is usually not possible to quantify the properties of the discoveries made in the follow-up study; and it turns out to be wasteful if no multiplicity adjustment is offered at the first stage. Another unfortunate practical problem is that occasionally the second stage is not performed at all. In very large and costly studies all three problems tend to appear.

It should be emphasized that the recent trend away from hypotheses testing towards confidence statements does not solve the multiplicity problem. In most analyses a decision about the statistical significance is reached by looking whether zero difference is included in the confidence interval or not—taking us back to the same multiplicity problem.

The False Discovery Rate (FDR) is a new and different point of view on how the errors in multiple comparisons could be considered [3]. The FDR is the expected proportion of false discoveries among the discoveries. In this paper we explain this notion and discuss some simple procedures that control the FDR. We stress the importance of controlling for the treacherous effect of multiplicity, while not being overly conservative.

Section snippets

Two motivating examples

In a separate paper in this issue, Drai et al. [8] propose to study the open field behavior of mice using the approach developed in the study of rats. They describe an effort to augment the commonly used measures of the open field test with a set of new ethologically relevant parameters. These parameters, which can be measured automatically and efficiently, reveal a natural structure that involves motivation, navigation, spatial memory and learning. Some 17 such parameters are identified in

The false discovery rate criterion (FDR)

Consider the case of m endpoints being compared between two strains, or more generally any family of m null hypotheses being tested in a study. Some tested null hypotheses of no difference may be true—possibly even all—meaning no difference exists between the two strains in the corresponding endpoints. Other hypotheses of no difference may be false—meaning real differences exist—and we wish to discover these real differences as statistically significant, granting us with statistical

Two FDR controlling procedures

Benjamini and Hochberg provide in [2] a simple stepwise procedure (BH) that controls the FDR when the test statistics are statistically independent. This procedure has lately been shown to control the FDR when the test statistics are positively correlated as well. The procedure makes use of the observed significance levels (the P-values) only. It is available in SAS (where it is called the FDR procedure), but once the P-values are available from any statistical software, the extra calculation can
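The BH step-up procedure referenced above can be sketched in a few lines, a hypothetical illustration rather than the paper's own analysis: sort the m P-values, find the largest rank k with p(k) ≤ (k/m)·q, and declare the k smallest P-values significant.

```python
# Sketch of the Benjamini-Hochberg (BH) step-up procedure at FDR level q:
# sort the p-values, find the largest rank k with p_(k) <= (k/m) * q,
# and reject the hypotheses with the k smallest p-values.
# The p-values below are hypothetical.

def bh_reject(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ranks 1..m
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank  # step-up: keep the LARGEST rank passing its threshold
    return sorted(order[:k])

p_values = [0.005, 0.009, 0.018, 0.022, 0.4]
print(bh_reject(p_values, q=0.05))  # → [0, 1, 2, 3]
```

On these illustrative P-values BH rejects four hypotheses, whereas Bonferroni at 0.05/5 = 0.01 would reject only two, which is the gain in power the paper emphasizes.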

Discussion

It is clear that the multiple comparisons problem has to be addressed in the comparison of behavioral endpoints between strains of mice. This is especially important in any automated screening tool that is designed for the discovery of genetic differences, as in the study of exploratory behavior [8]. In that study 10 differences would be found significant if no multiplicity adjustment were taken, six if the traditional Bonferroni were used. Two endpoints were added to the six significant

Acknowledgements

This study is part of the project ‘Phenotyping mouse exploratory behavior’ supported by NIH 1 R01 NS40234-01.

