Introduction

Various scientific disciplines regularly come across ‘outliers’ in their data (e.g., Aguinis et al. 2013; Bakker and Wicherts 2014). Outliers are commonly defined as observations that differ markedly from the majority of other cases in a sample (Barnett 1978; Barnett and Lewis 1994; Grubbs 1969; Hawkins 1980; Orr et al. 1991; Osborne and Overbay 2004; Rousseeuw and Hubert 2011). Such divergence could be due to, for example, measurement or coding error, sampling from the wrong population, exceptional circumstances, or a poor fit of the statistical model. Outliers matter, and there are several famous cases where even a single case heavily impacted the results (e.g., Hollenbeck et al. 2006). For example, the journal Acta Crystallographica A had a stable impact factor of around 2.4 up until 2009, when it reached an impact factor of 49.9 due to a single paper with 5624 citations (Dimitrov et al. 2010). There is a very large body of literature on outlier handling, and an entire field in statistical pattern recognition is devoted to outlier detection and handling (e.g., Ritter and Gallegos 1997). The purpose of this short report is therefore not to review the importance and impact of outliers or all possible methods for dealing with them (for in-depth reviews see, for example: Aguinis et al. 2013; Barnett and Lewis 1994; Dixon 1953; Grubbs 1969; Hawkins 1980; Osborne and Overbay 2004; Rousseeuw and Hubert 2011; Shiffler 1988). Rather, we wish to examine the degree to which outlier removal decisions impact statistical conclusions in the context of endocrinological studies, specifically those dealing with testosterone (T).

Outliers seem especially relevant for endocrinological studies, since hormonal data, such as T, typically do not conform to a normal distribution. For example, due to its skewed distribution, T data usually contain multiple outliers when outliers are defined against a normal distribution (Fig. 1 below; Stanton 2011: Fig. 1). Inclusion or exclusion of these outliers could matter, especially given the typically modest sample sizes in this research area. We therefore investigated to what extent removing outliers leads to different statistical conclusions based on statistical significance (p = .05). Our analyses focused on T, since it is among the most frequently studied hormones and we had sufficient T data available.

Fig. 1 Histogram of 433 testosterone samples

First, we reviewed outlier handling for all articles in Psychoneuroendocrinology, Hormones and Behavior, and Biological Psychology from January 1st 2010 to May 1st 2015 that included some measure of T levels in humans (ESM Tables 1-2-3). In 43 out of 133 papers (32.33 %) outlier handling was reported, and this mostly consisted of removal (35 out of 43). Occasionally winsorizing was used, which involves reassigning the values of extreme cases to, for example, the 95th percentile score (Ghosh and Vogt 2012; Hastings Jr et al. 1947). However, the most commonly applied rule for defining an outlier (24 out of 43 papers) was based on a case lying a given number of standard deviations (SD) away from the raw mean.
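To make these two approaches concrete, the following is a minimal R sketch; the simulated data and the 95th-percentile and 2.5 SD cutoffs are illustrative assumptions, not values taken from the reviewed papers.

# Illustrative sketch of winsorizing and an SD-based removal rule;
# the simulated data and cutoffs are assumptions for demonstration only.
set.seed(42)
t_values <- rgamma(100, shape = 2, rate = 0.02)  # skewed, 'testosterone-like' values

# Winsorizing: reassign values above the 95th percentile to that percentile
p95 <- quantile(t_values, 0.95)
t_winsorized <- pmin(t_values, p95)

# SD rule: flag cases lying 2.5 SD or more from the raw mean, then remove them
z_scores   <- (t_values - mean(t_values)) / sd(t_values)
is_outlier <- abs(z_scores) >= 2.5
t_removed  <- t_values[!is_outlier]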

Table 1 The % of significant independent samples t-tests (at p = .05) with and without outlier removal for a given sample size (No outlier α, Outlier α). % divergent refers to the percentage of tests where the statistical conclusions differed, i.e. one test significant but another one not (at p = .05), as compared to all statistically significant entries. Median refers to the median (absolute) difference in p value where one test was significant but another one was not, and 2.5 % - 97.5 % give the corresponding percentiles. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on N = 433 testosterone samples
Table 2 The % of significant independent samples t-tests (at p = .05) with and without outlier removal for a given sample size (No outlier α, Outlier α). % divergent refers to the percentage of tests where the statistical conclusions differed, i.e. one test significant but another one not (at p = .05), as compared to all statistically significant entries. Median refers to the median difference in p value where one test was significant but another one was not, and 2.5 % - 97.5 % give the corresponding percentiles. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on a theoretical gamma distribution
Table 3 The % of significant tests (at p = .05) for the interaction in a Repeated Measures ANOVA (within-between) with and without outlier removal for a given sample size (No outlier α, Outlier α). % divergent refers to the percentage of tests where the statistical conclusions differed, i.e. one test significant but another one not (at p = .05), as compared to all statistically significant entries. Median refers to the median (absolute) difference in p value where one test was significant but another one was not, and 2.5 % - 97.5 % give the corresponding percentiles. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on N = 433 testosterone samples

It is thus clear that outliers are often reported and that a commonly applied procedure is to remove them. Here we investigated how often statistical conclusions based on ‘p = .05 significance’ diverged, i.e., how often a test with outlier exclusion yielded a statistically significant result whereas the test with outlier inclusion did not, or vice versa. We also assessed by how much p values differed when these statistical conclusions differed. Finally, we investigated how many tests were significant when reporting whichever result was statistically significant (either the test with inclusion or the test with exclusion of outliers). To this end, we simulated an independent samples t-test design and a pre-post repeated measures ANOVA design with varying sample sizes (30 to 100) and outlier removal rules (2.5 SD and 3 SD). Simulations were performed in duplicate on real hormonal data and on a theoretical gamma distribution of T data.

Methods

Data

T values were obtained from 433 saliva samples of men (seven unpublished studies by the authors, as well as one published study: van der Meij et al. 2015). All samples were collected in a similar fashion via the same protocol (for details: van der Meij et al. 2015) and were analyzed by the laboratory of Biological Psychology at the Dresden University of Technology. T was determined via an expanded range salivary T enzyme-immunoassay kit (cat. no. 1-2402) from Salimetrics (Suffolk, UK). The intra- and inter-assay coefficients of variation were below 10 % and 12 %, respectively. No outliers were removed from the database (based on a 2.5 and 3 SD rule there would be 13 and 9 outliers, respectively).

Simulations

Simulations Based on Actual Data – Independent Samples t-Test Scenario

Using R 3.0.2 (R Development Core Team 2008; packages used: Meyer et al. 2015; Wickham 2011, 2009, 2007; see ESM 4), we constructed a customized script to examine how often the inclusion or exclusion of an outlier would lead to different statistical conclusions at p = .05. Moreover, we investigated by how much these statistical conclusions would differ in terms of p value. The annotated script and data are included in ESM 4. The script was run in duplicate (Run 1 and 2) and worked as follows. First, it drew a sample of n (with replacement, n = 30–100) from our data (N = 433 T measurements; Fig. 1). Second, the script log transformed the data (log10) (e.g., Dabbs et al. 1995; Feldman et al. 2002; Pollet et al. 2013; van der Meij et al. 2008). Third, it generated two arbitrary conditions of equal size by dividing the data into two groups. Fourth, the script performed two independent samples t-tests with the two previously constructed groups as the independent variable: one t-test without removing any outliers and one t-test with all outliers removed. Outliers were defined on the raw data as lying 2.5 SD or more, or 3 SD or more, away from the mean. Independent samples t-tests were corrected for unequal variances (Welch’s t-test, see Ruxton 2006; Zimmerman 2004). All p values were two-tailed and we used a 5 % significance threshold (p = .05). Fifth, the script ran steps one to four 100,000 times. Each of those 100,000 cases can be interpreted as a different study (albeit constrained in that they were sampled from our N = 433 with replacement). Sixth, it counted the number of significant t-tests with and without outlier removal. The script also calculated in how many of the cases the statistical conclusions based on significance differed: a test with outlier exclusion yielded a statistically significant result but the test with outlier inclusion did not, or vice versa (at p = .05). The result is thus a 100,000 (no. of cases) × 2 matrix (outlier exclusion or inclusion). To calculate the % divergence, the script summed all the cases for which the statistical conclusions differed, i.e., one row entry in the matrix was significant but the other was not (at the p = .05 level). This sum was divided by the total number of significant tests and multiplied by a hundred to obtain a percentage (%). This measure could theoretically range between 0 % and 100 %. Zero percent divergence meant perfect correspondence in statistical conclusions, i.e., if a case was significant when excluding outliers, then this case was always also significant when including outliers, and vice versa. One hundred percent divergence meant zero correspondence in statistical conclusions, i.e., if a case was significant when excluding outliers it was never significant when including outliers, and vice versa (at p = .05). Finally, the script calculated how much the p values differed in absolute terms for all cases in which the statistical conclusions based on significance differed. We reported the median of those absolute differences in p values as well as their 2.5 and 97.5 percentiles. We also reported how many of those 100,000 entries were statistically significant at p = .05 if either the test with inclusion or the test with exclusion of outliers was selected (note that this also counts the cases where both entries are significant).
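In outline, steps one to six reduce to something like the following R sketch. This is our simplified reconstruction for illustration, not the annotated ESM 4 script itself: the stand-in data are simulated rather than the 433 real measurements, and the iteration count is reduced.

# Simplified reconstruction of the t-test simulation (not the ESM 4 script).
set.seed(1)
t_values <- rgamma(433, shape = 2, rate = 0.02)  # stand-in for the 433 real samples

simulate_once <- function(data, n = 30, sd_rule = 2.5) {
  x     <- sample(data, n, replace = TRUE)       # step 1: resample with replacement
  log_x <- log10(x)                              # step 2: log10 transform
  group <- rep(1:2, length.out = n)              # step 3: two arbitrary equal-sized groups
  p_incl <- t.test(log_x ~ group)$p.value        # step 4a: Welch's t-test, outliers included
  keep   <- abs(x - mean(x)) < sd_rule * sd(x)   # outliers defined on the raw data
  p_excl <- t.test(log_x[keep] ~ group[keep])$p.value  # step 4b: outliers removed
  c(incl = p_incl, excl = p_excl)
}

# Steps 5-6: repeat, then compute % divergence among all significant entries
res <- t(replicate(10000, simulate_once(t_values)))  # 100,000 in the actual study
sig <- res < .05
divergent <- xor(sig[, "incl"], sig[, "excl"])       # one entry significant, the other not
100 * sum(divergent) / sum(sig)                      # % divergent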

Simulations Based on Actual Data – Repeated Measures ANOVA Scenario (RM-ANOVA)

As with the independent samples t-test scenario, using R 3.0.2 (R Development Core Team 2008) we constructed a customized script to examine how often the inclusion or exclusion of an outlier would lead to different statistical conclusions at p = .05 (ESM 4) for an RM-ANOVA design. Our model corresponds to a pre-post design with participants being assigned to one of two conditions. The script was run in duplicate (Run 1 and 2) and worked as follows. First, it drew a sample of n (30–80) with replacement (Time 1; T1) from our data (N = 433 T measurements; Fig. 1). Second, the script used a Cholesky transformation matrix to generate correlated Time 2 (T2) data for each T1 data point (Stevenson 2013). To this end, we first sampled T2 data from our sample (N = 433) with replacement, as we did for T1. The matrix with T1 and T2 can then be transformed to obtain correlated values, with a given correlation between T1 and T2. The matrix algebra is described at length in Golub and Van Loan (2012). In R, the command ‘chol(matrix)’ provides the transformation, and such a Cholesky transformation is an established way to create correlated data for simulations in R (for Matlab: Van den Berg n.d.; for SAS: SAS 2009; UCLA: Statistical Consulting Group 2016); a minimal sketch follows this paragraph. We chose to let this procedure create T2 values that correlated with T1 at ρ = .5, since in a sample of our unpublished data T1 and T2 were correlated at roughly this value (r(69) = .551). However, the correlation can vary substantially across experimental designs, due to the time between measurements (samples closer together in time correlate more strongly). These variations are partly accounted for in our analyses, as the actual (sample) correlation for a single simulation can differ substantially from the population-level correlation. Third, two arbitrary conditions of equal size were generated by arbitrarily assigning each ‘participant’ to one of these conditions (each participant had a T1 and a T2 data point). Fourth, the script performed two Repeated Measures ANOVAs: one with inclusion of outliers and one with exclusion of outliers (based on a 2.5 and 3 SD outlier removal rule, see the t-test scenario above). For both scenarios, the script calculated the p value (two-tailed) of the interaction effect (between-within interaction: condition*time). Fifth, this process was reiterated 100,000 times. Sixth, it counted the number of significant interaction tests with and without outlier removal. The script also calculated in how many of the cases the statistical conclusions based on significance differed: a test with outlier exclusion yielded a statistically significant result but the test with outlier inclusion did not, or vice versa (at p = .05). As with the independent samples t-test scenario, we thus examined a 100,000 (no. of cases) × 2 matrix (outlier exclusion or inclusion). To calculate the % divergence, the script summed all the cases for which the statistical conclusions differed, i.e., one row entry in the matrix was significant but the other was not (at the p = .05 level). This sum was divided by the total number of significant tests and multiplied by a hundred to obtain a percentage (%). This measure could theoretically range between 0 % and 100 %. Zero percent divergence meant perfect correspondence in statistical conclusions, i.e., if a case was significant when excluding outliers, then this case was always also significant when including outliers, and vice versa. One hundred percent divergence meant zero correspondence in statistical conclusions, i.e., if a case was significant when excluding outliers it was never significant when including outliers, and vice versa (at p = .05). Finally, the script calculated how much the p values differed in absolute terms for all cases in which the statistical conclusions based on significance differed. We reported the median of those absolute differences in p values as well as their 2.5 and 97.5 percentiles. As with the independent samples t-test scenario, we also reported how many of those 100,000 entries were statistically significant at p = .05 if either the test with inclusion or the test with exclusion of outliers was selected (note that this also counts the cases where both entries are significant).
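A minimal sketch of the correlation-inducing step (ρ = .5) might look as follows. This is our own illustration of the ‘chol(matrix)’ approach, not the ESM 4 code; it shows only the Cholesky step on standardized columns, so an actual script would still need to map the values back to the measurement scale.

# Generating T2 values correlated (rho = .5) with T1 via a Cholesky factor;
# illustrative only -- the columns are standardized here.
set.seed(2)
t_values <- rgamma(433, shape = 2, rate = 0.02)  # stand-in for the real data
n  <- 40
t1 <- sample(t_values, n, replace = TRUE)
t2 <- sample(t_values, n, replace = TRUE)        # independent draw, near-zero correlation

rho   <- 0.5
sigma <- matrix(c(1, rho, rho, 1), nrow = 2)     # target correlation matrix
R     <- chol(sigma)                             # upper-triangular Cholesky factor
xy    <- scale(cbind(t1, t2)) %*% R              # impose the target correlation
cor(xy)[1, 2]                                    # sample r scatters around .5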

Simulations Based on Theoretical Data

A drawback of this resampling procedure is that it is based on actual data and thus inherently limited to those observed values. We therefore also sought to describe the distribution of T values and run simulations drawing from a theoretical distribution rather than the actual values. Given the shape of the T distribution (Fig. 1), we examined several candidate distributions (normal, lognormal, Weibull, and gamma; MASS package in R, Ripley et al. 2015). A gamma distribution was the best fit to the data based on Kolmogorov-Smirnov tests (‘ks.test’ in R: ks.test(gamma) = 0.05, p = .15; ks.test(normal) = 1, p < .0001; ks.test(Weibull) = 0.086, p = .003; ks.test(lognormal) = 0.069, p = .034). Given that the ks.test(gamma) was not statistically significant, we retained the null hypothesis that our observed distribution resembled the theoretical distribution. However, it is important to note that the Kolmogorov-Smirnov test is a (very) conservative test for larger samples, so candidate distributions other than the gamma could have been unfairly rejected.

We thus ran the scripts described in the two preceding sections, except that instead of drawing from the 433 T values the script drew samples from the fitted gamma distribution. Note that this theoretical distribution was itself estimated from our sample; the ‘true’ distribution of male hormone samples could therefore still look different from the one we modeled.
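In outline, the fitting and sampling steps can be sketched as follows. The ‘fitdistr’ calls mirror the MASS-based approach described above, but the stand-in data are simulated, so the resulting estimates are assumptions rather than our reported fits.

# Fit a candidate distribution with MASS and check fit via a Kolmogorov-Smirnov
# test; the stand-in data are simulated, not the 433 real measurements.
library(MASS)

set.seed(3)
t_values <- rgamma(433, shape = 2, rate = 0.02)  # stand-in for the real data

fit_gamma <- fitdistr(t_values, "gamma")         # maximum-likelihood fit
ks.test(t_values, "pgamma",
        shape = fit_gamma$estimate["shape"],
        rate  = fit_gamma$estimate["rate"])

# The simulation scripts then draw from the fitted gamma distribution
# instead of resampling the 433 observed values:
x <- rgamma(30,
            shape = fit_gamma$estimate["shape"],
            rate  = fit_gamma$estimate["rate"])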

Results

The key simulation results are summarized in Tables 1-4. To illustrate how the tables should be interpreted, we walk through the first row of Run 1 in Table 1. With a sample size of n = 30, actual data, and a 2.5 SD outlier removal rule, the first simulation (Run 1) found a base rate significance of 4.85 % when outliers were included (No outlier α) and 4.72 % when all outliers were excluded (Outlier α). In 49.37 % of the significant cases, including outliers led to a statistically significant result whereas outlier exclusion led to a statistically non-significant result, or vice versa (at p = .05; see the table caption and the description in the Methods section). When the statistical conclusions differed, they differed by around .06 in p value (median estimate, 95 % range: .01 to .26; see the table caption). Figure 2 plots these divergent estimates (with the upper 2.5 % tail not shown, as the maximum absolute difference in p value is .84 (!)).

Table 4 The number of significant tests (at p = .05) for the interaction in a Repeated Measures ANOVA (within-between) with and without outlier removal for a given sample size (No outlier α, Outlier α). % divergent refers to the percentage of tests where the statistical conclusions differed, i.e. one test significant but another one not (at p = .05), as compared to all statistically significant entries. Median refers to the median difference in p value where one test was significant but another one was not, and 2.5 % - 97.5 % give the corresponding percentiles. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on a theoretical gamma distribution
Fig. 2 Illustration of divergence: histogram of absolute differences in p values when including or excluding outliers (Run 1, 2.5 SD rule, n = 30, actual data) for cases where the statistical conclusions differ at p = .05. The dashed line is the median estimate. Note: the upper 2.5 % of the data is not shown (the maximum absolute difference is .84). This figure illustrates just one entry in Table 1; code to draw histograms for the other entries in Tables 1-4 is included in ESM 4

Across all simulations, the worst case scenario suggests that in over 50 % of the significant cases including outliers led to a statistically significant result whereas outlier exclusion led to a statistically non-significant result, or vice versa, and when these results did differ, the median difference in p values was .07 (e.g., p = .04 vs. p = .11), depending on statistical test, n, outlier removal rule, and Run no. (see Tables 1-4 for the results). However, do note that for an individual case the difference could be substantially more than .07. If we take the upper estimate of the range, it could be as much as .37 in p value. That would imply a shift from, for example, p = .04 to p = .41. The best case scenario, in contrast, suggests that the conclusions might vary in only 7 % of the cases, and when these cases do vary, they do so by .01 in p value.

Independent Samples t-Test

For the independent samples t-test scenario based on actual data, in between 32 % and 55 % of the significant cases a test with outlier exclusion yielded a statistically significant result whereas the test with outlier inclusion did not, or vice versa (at p = .05, depending on n, outlier removal rule, and Run no.). When these statistical conclusions did differ, they differed by between .05 and .07 in terms of p value (overall range: .01 to .37). The simulations based on the theoretical distribution showed less divergence than those based on the actual data: statistical conclusions based on inclusion versus exclusion of outliers differed in between 15 % and 45 % of the significant cases, and when they did differ, they differed by between .05 and .07 in terms of p value (overall range: .01 to .23, depending on n, outlier removal rule, and Run no.).

RM-ANOVA

Depending on sample size, outlier removal rule, and Run no., for simulations based on actual data, in between 15 % and 28 % of significant cases a test with outlier exclusion yielded a statistically significant result whereas the test with outlier inclusion did not, or vice versa (at p = .05). For simulations based on the theoretical distribution, the statistical conclusions diverged in between 7 % and 20 % of significant cases. For those cases where the statistical conclusions diverged, the median divergence in p values was around .02–.03 for simulations based on actual data, and around .01–.02 for simulations based on the theoretical distribution. The corresponding ranges (2.5 to 97.5 percentile) were <.01 to .09 for simulations based on actual data and <.01 to .07 for simulations based on the theoretical distribution.

Robustness and Sample Size

Results were generally quite stable between runs. The lowest correlations between runs were found for the RM-ANOVA scenario based on theoretical data (for proportions of divergent tests: r = .85 for the 2.5 SD outlier removal rule, r = .92 for the 3 SD outlier removal rule). For the remaining scenarios, the correlations between Run 1 and Run 2 for the proportions of divergent tests were >.94.

Results also showed that the proportion of divergent tests was larger in larger samples than in smaller ones. For the independent samples t-test scenario the % of divergent tests increased with sample size. For example, with a 3 SD outlier removal rule, 32 % of the tests diverged with a sample size of 30, while this was around 47 % for a sample size of 100 (see Table 1). The median divergence in p value remained relatively stable at around .05–.06. For the RM-ANOVA simulations based on actual data, increasing sample size also led to a higher proportion of tests with divergent conclusions (Table 3: 3 SD outlier removal rule: from 15 % to 21 %). A potential explanation for these findings is that outliers are more likely to occur in larger samples. In these larger samples there are thus more outliers to potentially remove, so inclusion or exclusion of outliers can matter more often.

Furthermore, the statistical conclusions based on significance differed more often for the 2.5 SD outlier removal rule than for the 3 SD outlier removal rule (Tables 1-4). For example, with the 2.5 SD outlier removal rule and t-tests based on actual data, the conclusions differed in 44 % to 54 % of the cases (depending on n and Run no.), while for the 3 SD outlier removal rule the statistical conclusions differed in 32 % to 48 % of the cases. A potential explanation for these findings is that more cases are labeled outliers under the 2.5 SD rule than under the 3 SD rule, so inclusion or exclusion of outliers can matter more often. The corresponding median estimates for divergence in p value did not vary substantially between the 2.5 SD and 3 SD rules. For example, for both rules the median estimates for divergence were between .05 and .07 in p value for t-tests based on actual data.

Base Rate Significance

We also obtained baseline significance rates for each set of 100,000 tests. For the independent samples t-tests we found significance rates close to 5 % across all sample sizes (Tables 1-2: 4680 to 5104 out of 100,000 tests were statistically significant). However, the baseline significance rate was considerably higher for the RM-ANOVA scenario (Tables 3-4: 5884 to 6924 out of 100,000 tests were statistically significant). This 5.9 % to 6.9 % base rate suggests that Type I errors were inflated. A potential cause of this inflation is that the normality assumption of the RM-ANOVA was violated. This is suggested by the finding that some skew remained in our data even after log10 transforming the raw T scores (skewness statistic: −.505; one-sample Kolmogorov-Smirnov test for log10(T) values: .07, p = .03).

When reporting whichever test led to a statistically significant result (either the test with inclusion or the test with exclusion of outliers), between 5.15 % and 6.89 % of the independent samples t-tests were statistically significant (Tables 5-6; see description in the Methods section), and for the repeated measures ANOVA design this was between 6.32 % and 7.62 % of the tests (Tables 7-8).
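In terms of the p-value matrix from the t-test simulation sketch above (assuming a matrix ‘res’ with columns for the tests with outliers included and excluded), this ‘either significant’ base rate is simply:

# Base rate when reporting whichever of the two tests is significant;
# assumes `res` from the earlier t-test simulation sketch.
either_sig <- res[, "incl"] < .05 | res[, "excl"] < .05
100 * mean(either_sig)   # % of entries counted as significant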

Table 5 Base rate significance of independent samples t-test when counting either test significant, i.e. one with or without outlier removal. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on N = 433 testosterone samples
Table 6 Base rate significance of independent samples t-test when counting either test significant, i.e. one with or without outlier removal. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on a theoretical gamma distribution
Table 7 Base rate significance of Repeated-Measures ANOVA (within-between interaction) when counting either test significant, i.e. one with or without outlier removal. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on N = 433 testosterone samples
Table 8 Base rate significance of Repeated-Measures ANOVA (within-between interaction) when counting either test significant, i.e. one with or without outlier removal. Results are presented for two runs and for outlier removal based on a 2.5 SD and 3 SD rule. These simulations were based on a theoretical gamma distribution

Discussion and Conclusion

Our results show that statistical conclusions based on significance can differ often and substantially depending on whether outliers are included or excluded. Our simulations showed that in between 7 % and 54 % of significant cases a test with outlier exclusion yielded a statistically significant result whereas the test with outlier inclusion did not, or vice versa, depending on the statistical test, sample size, and outlier criterion. Our simulations should thus be taken as an illustration of how conclusions can diverge in endocrinological research.

Moreover, our results illustrate that inclusion or exclusion of outliers offers one option to ‘p-hack’ (Simmons et al. 2011). In a relatively high percentage of cases, researchers are thus able to shift statistical conclusions from non-significant to significant (or vice versa) by including or excluding outliers. Our results also showed that between 5.15 % and 7.62 % of the tests were statistically significant when reporting whichever test yielded a statistically significant result (with either inclusion or exclusion of outliers). In absolute terms this alpha inflation could be seen as small. However, when the statistical conclusions based on significance were at odds with one another, the difference in p value between the two conclusions was substantial in some cases. Consequently, outlier handling could be an (additional) important reason why some findings are difficult to replicate in behavioral endocrinology (e.g., oxytocin: Nave et al. 2015). Thus, researchers, reviewers, and editors need to address outlier handling in hormonal data. However, if one is determined to obtain significant results, other routes also exist, such as running multiple tests (without correction) and only focusing on significant tests (Gelman and Loken 2013). Future research is necessary to determine the relative opportunities to p-hack offered by various questionable research practices; modeling several such practices, including flexibility in outlier removal, would be an interesting avenue for future research, for example via p-curves (Simonsohn et al. 2014).

How could we deal with these issues concerning outlier removal? We believe that a multitude of solutions exist. Firstly, we strongly recommend reporting the results both with outlier exclusion and with outlier inclusion; if the conclusions differ strongly, then caution is warranted (Kruskal 1960). Ideally this would be done graphically, so that one can really see the consequence of a given outlier. Secondly, researchers should consider using non-parametric criteria for outlier detection: for example, absolute deviations from the median rather than the mean (Leys et al. 2013), or the interquartile range (Tukey 1977); a minimal sketch of both criteria follows this paragraph. A third option would be to not remove any ‘outliers’ at all (unless they are caused by a measurement error) and analyze the data with bootstrapping (Davison and Hinkley 1997), since in bootstrapping outliers receive proportionally less weight. Alternatively, robust statistics can be used (Huber 2011; Rousseeuw and Hubert 2011; Rousseeuw and Leroy 2005). Fourth, setting a pre-defined criterion for outlier removal is also helpful. All of these recommendations will have a greater impact if researchers make the raw data available, so that all outlier handling is open, transparent, and reproducible (Nosek et al. 2015). Finally, in the longer run, it would be a strong step forward if we could combine baseline salivary data from various sources to generate population-based reference values, as has been done for height (e.g., Wikland et al. 2002). Such a reference database could then meaningfully inform decisions on whether or not to consider certain values extreme.
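For illustration, minimal R versions of these two non-parametric criteria are given below; the 2.5 and 1.5 cutoffs are conventional choices used here as assumptions rather than recommendations.

# Median-absolute-deviation rule (cf. Leys et al. 2013) and Tukey's fences;
# the cutoffs (2.5 MADs, 1.5 x IQR) are conventional, illustrative choices.
set.seed(7)
x <- rgamma(100, shape = 2, rate = 0.02)

# MAD rule: flag cases more than 2.5 scaled MADs from the median
mad_outlier <- abs(x - median(x)) / mad(x) > 2.5   # mad() scales by 1.4826 by default

# Tukey's fences: flag cases beyond 1.5 x IQR outside the quartiles
q <- quantile(x, c(.25, .75))
tukey_outlier <- x < q[1] - 1.5 * IQR(x) | x > q[2] + 1.5 * IQR(x)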

In our review, we showed that the most common outlier detection rule was based on a 2.5 or 3 SD difference from the mean. However, this is possibly not the best way to identify outliers in T data, as these data are skewed. Consequently, an SD-based outlier removal rule detects more outliers in skewed data than in normally distributed data. As an example, in our dataset (n = 433), a 3 SD rule would allow removing up to 2.1 % of the data (9 cases, all in the upper tail). This is roughly a fifteen-fold increase: for a standard normal distribution only about 0.1 % of cases should lie beyond this criterion (0.6 cases, upper tail). It is important to reiterate that in certain cases outlier removal may be fully warranted, for example due to a measurement error. However, blindly applying an outlier removal rule that assumes a normal distribution will label some cases as outliers even though they do not necessarily deviate from the ‘true’ T distribution.
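The expected count under normality follows directly from the upper-tail probability beyond 3 SD:

# Expected number of cases beyond +3 SD in a normal sample of n = 433
(1 - pnorm(3)) * 433   # about 0.58 cases (roughly 0.1 % of the sample), versus 9 observed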

There are some limitations to our study. For example, we limited ourselves to a single hormone, testosterone. Similarly, our starting point was 433 T samples, and our simulations are therefore inherently limited by that sample. We also focused on a p = .05 criterion, rather than modeling the reduction of effect sizes. The reason for doing so is that p values, and a 5 % cutoff, still feature prominently in the literature (e.g., Bakker et al. 2012; Button et al. 2013; Ioannidis et al. 2014) and that many researchers, reviewers, and editors decide that findings are important if p ≤ .05 (e.g., in psychophysics: Hoekstra et al. 2006; also see: Head et al. 2015; Leggett et al. 2013).

In conclusion, we believe that outlier handling is an issue that needs to be addressed in hormonal research since our simulations illustrate how statistical conclusions can vary substantially based on inclusion or exclusion of outliers. A first, achievable, step would be to encourage further openness in how results would differ if outliers are excluded or included. A second step would be to create guidelines for both reporting and removing outliers in hormonal data (e.g., Aguinis and Edwards 2014; Aguinis et al. 2013). We hope our paper can serve as a first stepping-stone to a debate on outliers and their impact on statistical conclusions in the field of hormones and behavior.