The reporting of p values, confidence intervals and statistical significance in Preventive Veterinary Medicine (1997–2017)

Locksley L. McV. Messam; Hsin-Yi Weng; Nicole W. Y. Rosenberger; Zhi Hao Tan; Stephanie D. M. Payet; Mahishi Santbakshsing

doi:10.7717/peerj.12453

Locksley L. McV. Messam¹, Hsin-Yi Weng², Nicole W. Y. Rosenberger¹, Zhi Hao Tan¹, Stephanie D. M. Payet¹, Mahishi Santbakshsing¹

¹Section: Herd Health and Animal Husbandry, University College Dublin, School of Veterinary Medicine, Dublin, Leinster, Ireland

²Department of Comparative Pathobiology, College of Veterinary Medicine, Purdue University, West Lafayette, Indiana, USA

Published: 2021-11-24
Accepted: 2021-10-18
Received: 2020-12-07

Academic Editor: Sharif Aly

Subject Areas: Veterinary Medicine, Epidemiology, Statistics
Keywords: p Values, Confidence intervals, Statistical significance, Null hypothesis significance testing, NHST, Veterinary epidemiology, Veterinary medicine

Copyright: © 2021 Messam et al.
Licence: This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ) and either DOI or URL of the article must be cited.

Cite this article: Messam LLM, Weng H, Rosenberger NWY, Tan ZH, Payet SDM, Santbakshsing M. 2021. The reporting of p values, confidence intervals and statistical significance in Preventive Veterinary Medicine (1997–2017) PeerJ 9:e12453 https://doi.org/10.7717/peerj.12453

The authors have chosen to make the review history of this article public.

Abstract

Background

Despite much discussion in the epidemiologic literature surrounding the use of null hypothesis significance testing (NHST) for inferences, the reporting practices of veterinary researchers have not been examined. We conducted a survey of articles published in Preventive Veterinary Medicine, a leading veterinary epidemiology journal, aimed at (a) estimating the frequency of reporting p values, confidence intervals and statistical significance between 1997 and 2017, (b) determining whether this varies by article section and (c) determining whether this varies over time.

Methods

We used systematic cluster sampling to select 985 original research articles from issues published in March, June, September and December of each year of the study period. Using the survey data analysis menu in Stata, we estimated overall and yearly proportions of article sections (abstracts, results-texts, results-tables and discussions) reporting p values, confidence intervals and statistical significance. Additionally, we estimated the proportion of p values less than 0.05 reported in each section, the proportion of article sections in which p values were reported as inequalities, and the proportion of article sections in which confidence intervals were interpreted as if they were significance tests. Finally, we used Generalised Estimating Equations to estimate prevalence odds ratios and 95% confidence intervals, comparing the occurrence of each of the above-mentioned reporting elements in one article section relative to another.

Results

Over the 20-year period, for every 100 published manuscripts, 31 abstracts (95% CI [28–35]), 65 results-texts (95% CI [61–68]), 23 sets of results-tables (95% CI [20–27]) and 59 discussion sections (95% CI [56–63]) reported statistical significance at least once. Only in the case of results-tables were the numbers reporting p values (48; 95% CI [44–51]) and confidence intervals (44; 95% CI [41–48]) higher than those reporting statistical significance. We also found that a substantial proportion of p values were reported as inequalities and most were less than 0.05. The odds of a p value being less than 0.05 (OR = 4.5; 95% CI [2.3–9.0]) or being reported as an inequality (OR = 3.2; 95% CI [1.3–7.6]) were higher in the abstracts than in the results-texts. Additionally, when confidence intervals were interpreted, on most occasions they were used as surrogates for significance tests. Overall, no time trends in reporting were observed for any of the three reporting elements over the study period.

Conclusions

Despite the availability of superior approaches to statistical inference and abundant criticism of NHST in the epidemiologic literature, NHST is by far the most common means of inference in articles published in Preventive Veterinary Medicine. This pattern did not change substantially between 1997 and 2017.

Introduction

In March 2016, the American Statistical Association (ASA) issued a statement on the use of p values and statistical significance in scientific articles (Wasserstein & Lazar, 2016). This was in response to continued and widespread misuse of null hypothesis significance testing (NHST) in the quantitative sciences (Wasserstein & Lazar, 2016). NHST is typically characterised by the use of a p value cut-off (usually 0.05) for distinguishing between scientifically important and non-important results, notwithstanding the lack of a firm scientific basis for this (Gill, 1999; Goodman, 2008; Goodman, 2016; Rothman, 1978; Sterne & Davey Smith, 2001). Consequently, p values are often incorrectly reported as inequalities (e.g., “p < 0.05” or “p < 0.01”) (Goodman, 2008), reinforcing the erroneous impression that their precise magnitude is irrelevant and that use of the particular cut-off has merit. These practices ultimately degrade inferences based on estimated parameters into (statistically) significant and non-significant categories without any consideration of the magnitudes of the estimates themselves (Nuzzo, 2014) or of how they compare to previous estimates obtained by other researchers (Poole et al., 2003). Increasingly, over the last 40 years, authors from various fields have commented on and criticised the afore-mentioned practices (Fidler et al., 2004a; Gill, 1999; Goodman, 2008; Goodman, 1999; Johnson, 1999; Nuzzo, 2014; Rothman, 1978; Rothman, 1986; Savitz, 1993; Sterne & Davey Smith, 2001; Trafimow, 2014; Trafimow & Marks, 2015; Utts, 1988), and attempts have been made to improve statistical inference and reporting in various disciplines (Fidler et al., 2004a; Gardner & Altman, 1988; International Committee of Medical Journal Editors, 1988; Wilkinson, 1999).

Within the discipline of epidemiology, there has been notable concern and objection to the use of NHST, with editorials and commentaries published in a number of sub-speciality journals (Feinstein, 1998; Lang, Rothman & Cann, 1998; Lash, 2017; Poole, 1987; Stang, Poole & Kuss, 2010) as well as attempts made to quantify its use and impact on the field (Fidler et al., 2004b; Holman et al., 2001; Perneger & Combescure, 2017; Pocock et al., 2004; Savitz, Tolo & Poole, 1994; Stang et al., 2017). In the past, one major epidemiology journal strongly discouraged the use of p values (Lang, Rothman & Cann, 1998) and most recently, an epidemiologist co-authored a call for abandonment of the use of the concept of statistical significance (Amrhein, Greenland & McShane, 2019). Specifically, they asked for the discontinuation of the conventional application of the p value = 0.05 as a bright-line test for whether a particular result supports or refutes a scientific hypothesis. This request was endorsed by over 800 fellow scientists (Amrhein, Greenland & McShane, 2019).

Confidence interval use in the biomedical sciences has also been criticised because its weaknesses are analogous to those of p values: the habitual 95% confidence level is similarly arbitrary and without a sound basis. The correspondence between confidence intervals and p values is often exploited to subtly indicate whether a result is statistically significant or not, resulting in the same inferential shortcomings (Feinstein, 1998). Additionally, unlike Bayesian credible intervals, confidence intervals cannot be used to express the probability that the estimated parameter is located within their limits. Epidemiologists, while critical of this misuse of confidence intervals, have nevertheless tended to encourage their reporting alongside p values (Witte, Thomas & Langholz, 1995) or in lieu of them, correctly emphasising that confidence intervals can provide all the information that p values provide and, more importantly, are useful as a means of conveying the precision of estimated parameters (Naimi & Whitcomb, 2020; Lang, Rothman & Cann, 1998; Poole, 2001). While the latter is not true in all cases (Morey et al., 2016), for common epidemiologic measures (odds, risk and rate ratios, risk and rate differences) the width of the confidence interval is directly proportional to the standard error of the estimate and is therefore a legitimate measure of its precision (Greenland et al., 2016; Naimi & Whitcomb, 2020). Confidence intervals also communicate the precision of an estimate in a manner distinct from its magnitude. This is in contrast to the p value, which is dependent on both the magnitude and precision of the estimate in a manner that the reader cannot disentangle. Yet a more fundamental problem in using a p value’s magnitude for inference, in disciplines where parameter measurement is important, is that, unlike a confidence interval, it does not provide an answer to research questions in the metric of the parameter of interest. This point is not new (Poole, Lanes & Rothman, 1984) and was alluded to in the ASA’s statement (Wasserstein & Lazar, 2016).
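The proportionality between interval width and standard error mentioned above can be made explicit. The following is a sketch for a ratio measure θ (e.g., an odds ratio), assuming a Wald-type interval computed on the log scale; the symbols θ and SE are introduced here for illustration and do not appear in the original text.

```latex
\[
\text{95\% CI for } \theta:\quad
\exp\!\bigl(\ln\hat\theta \pm 1.96\,\widehat{SE}(\ln\hat\theta)\bigr),
\qquad
\ln(\text{upper limit}) - \ln(\text{lower limit})
  = 2 \times 1.96 \times \widehat{SE}(\ln\hat\theta)
\]
```

Under this assumption, the width of the interval on the log scale depends only on the standard error and not on the magnitude of the point estimate.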

Notwithstanding attention paid to NHST by epidemiologists in human-related fields, veterinary epidemiologists have been silent on the matter. While it may be postulated that NHST is as widespread in veterinary medical research as it is in other fields, we have not found evidence of any attempts to document this in the veterinary epidemiologic literature or, for that matter, any commentaries suggesting that this is an area of concern to veterinary epidemiologists.

The goal of this project is two-fold: first, to report on the frequency of use of the reporting elements (p values, confidence intervals and statistical significance) for inference in articles published in Preventive Veterinary Medicine, a leading journal of veterinary epidemiology; second, to determine whether the use of these reporting elements varies (a) between article sections and (b) over time. While investigating this, we consider the form in which p values are reported, the use and interpretation of confidence intervals, as well as the context in which statistical significance is used. We focus primarily on the patterns and frequency of use of these reporting elements in each article section as a way of documenting the relative importance of NHST to inferences communicated in different parts of a manuscript, as well as in the manuscript overall. We examine publications from 1997 to 2017 in order to identify any notable secular trends in reporting and to include substantial portions of the previous three decades in which there has been notably increased commentary and major published empirical research on the use of NHST by epidemiologists (Chavalarias et al., 2016; Cristea & Ioannidis, 2018; Fidler et al., 2004b; Greenland et al., 2016; Holman et al., 2001; Lang, Rothman & Cann, 1998; Lash, 2017; Poole, 2001; Rothman, Greenland & Lash, 2008; Savitz, Tolo & Poole, 1994; Stang et al., 2017; Stang, Poole & Kuss, 2010; Wasserstein & Lazar, 2016).

Materials and Methods

Sampling frame

The sampling frame for the study consisted of all original research articles published in Preventive Veterinary Medicine (https://www.journals.elsevier.com/preventive-veterinary-medicine) from January 1, 1997 to September 30, 2017 inclusive.

Sampling

We applied a 1-in-3 systematic cluster sampling approach with each cluster being a month of publication. First, we randomly selected 1 month from among the first 3 months of the year (January to March) and thereafter selected every third month for study inclusion. March was randomly selected as the starting point and thus, for each year of the study period, we sampled all original articles published in March, June, September and December. When for a given year, there was no journal issue in one of the chosen months, the preceding month was selected for study inclusion.
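The month selection described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated; the function names are hypothetical, and the behaviour when the preceding month also lacks an issue is not specified in the text, so it is not handled here.

```python
import random


def random_start(interval=3, seed=None):
    """Pick the starting month at random from the first `interval` months."""
    return random.Random(seed).randint(1, interval)


def systematic_months(start, interval=3):
    """Return every `interval`-th calendar month (1-12), beginning at `start`."""
    return list(range(start, 13, interval))


def resolve_month(month, months_with_issue):
    """Use the chosen month if an issue was published, else the preceding month.

    Assumption: only one step back is taken, as the text describes no further rule.
    """
    return month if month in months_with_issue else month - 1


# March was the randomly selected start in the study
print(systematic_months(3))
```

With March as the start, the sketch yields March, June, September and December, matching the months sampled in each year of the study period.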

Search procedure

Articles within sampled clusters were downloaded in pdf format prior to being searched. Every article’s title, year and month of publication and volume number were noted. Each article was then searched using the following search terms: “p” for p values, “CI C.I. CL C.L. confidence” for confidence intervals and “significance significant significantly” for statistical significance. For searching, we used the “Full Reader Search” function in Adobe Reader DC version 10 along with the “Match Any of the words” and “Whole words only” sub-menus. In each article, we searched the abstract, results-text, results-table(s) (including footnotes), and discussion for one example of each reporting element without regard to the presence of the others. Each search result was binary (i.e. Yes/No). If results-texts and discussions were written as one section, the search result was recorded separately under both sections. Searches were performed by co-authors MS, SP, NR and ZHT for March, June, September and December publications, respectively.
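The whole-word, match-any search can be approximated programmatically. The sketch below is not the Adobe Reader procedure itself; it assumes case-insensitive matching and treats dotted abbreviations such as "C.I." as the same word as "CI". The function and dictionary names are hypothetical.

```python
import re

# Search terms taken from the text; lower-cased and with periods removed
SEARCH_TERMS = {
    "p_value": {"p"},
    "confidence_interval": {"ci", "cl", "confidence"},
    "significance": {"significance", "significant", "significantly"},
}


def find_reporting_elements(section_text):
    """Return a Yes/No (True/False) flag per reporting element for one section."""
    # Tokens are runs of letters and periods, so "C.I." stays a single token
    tokens = re.findall(r"[A-Za-z.]+", section_text)
    # Dropping periods maps "C.I." to "ci"; whole-word matching is by set lookup
    words = {token.replace(".", "").lower() for token in tokens}
    return {name: bool(words & terms) for name, terms in SEARCH_TERMS.items()}


print(find_reporting_elements("The association was significant (p = 0.03)."))
```

Because matching is on whole words, "insignificant" is not counted, whereas "non-significant" is, since the hyphen separates it into two tokens. As in the original search, any standalone letter "p" is flagged, which is one reason false positives had to be corrected during detailed review.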

Detailed review of reporting practices

Based on initial search results, a subsample consisting of articles in which the reporting elements were found, was selected for detailed review. The articles were divided into 24 strata defined jointly by the reporting element (singly: p values only, confidence intervals only etc. or in combination: p values and confidence intervals, confidence intervals and statistical significance etc.) and the article section (e.g., abstract, results-text, etc.) containing them. We excluded articles with Bayesian analyses but no frequentist content and articles in which p values and confidence intervals were reported as part of model selection or model fitting criteria, except when the model selection or model fitting was the study objective.

From strata containing 30 or more articles, we randomly sampled 20 percent for further review. When the number of articles in a stratum was fewer than 30, five articles (or all articles, if fewer than five) were chosen. Random samples were generated using the IBM SPSS Statistics for Windows (Version 24.0; IBM Corp., Armonk, NY, USA) Random Number Generator with fixed seeds. Each article was independently reviewed by both LM and HYW with discrepancies resolved by consensus following discussion.
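The stratum-level sampling rule can be written as a short sketch. The study used the SPSS random number generator; Python's `random` module is substituted here purely for illustration, and rounding the 20 percent up to a whole article is an assumption, as the text does not state a rounding rule.

```python
import random


def subsample_size(n_articles):
    """Number of articles to review from a stratum of `n_articles`."""
    if n_articles >= 30:
        # 20 percent of the stratum; rounding up is an assumption
        return (n_articles + 4) // 5
    # Fewer than 30 articles: take five, or all if fewer than five exist
    return min(5, n_articles)


def draw_subsample(articles, seed):
    """Draw a reproducible simple random sample from one stratum."""
    rng = random.Random(seed)  # fixed seed, mirroring the fixed seeds in the text
    return rng.sample(articles, subsample_size(len(articles)))


print(subsample_size(100), subsample_size(29), subsample_size(3))
```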

Review of the form in which p values are reported

For each article section in which a p value was reported, we noted if the exact value (p =…) or an inequality (p <, >, ≥, or ≤…) was written. When an inequality involving a number less than or equal to 0.001 (e.g., p < 0.001) was reported, this was considered the exact value, as we assumed the reporting of the inequality was likely a result of the limitations of the statistical programme. Additionally, in the abstracts, results-texts and results-tables we counted the number of p values with magnitudes (a) less than and (b) greater than or equal to 0.05. Whenever “p <” or “p ≤” was reported along with a number greater than 0.05 (e.g., “p < 0.2”), this was considered greater than or equal to 0.05. To avoid duplication, p values reported in results-texts that were used to refer to p values in results-tables were not counted.
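The classification rules in this subsection can be summarised as a sketch. The review itself was done by reading the articles, so the parsing below is hypothetical. Two cases are not covered by the text and are labelled as assumptions in the code: "p ≤ 0.05" is counted as below 0.05, and a ">" or "≥" paired with a number below 0.05 is left undetermined.

```python
import re

# Matches strings such as "p = 0.03", "P<0.001" or "p ≥ 0.2"
P_PATTERN = re.compile(r"\bp\s*([<>≤≥=])\s*(\d*\.?\d+)", re.IGNORECASE)


def classify_p_value(report):
    """Return (form, magnitude) for one reported p value."""
    match = P_PATTERN.search(report)
    if match is None:
        raise ValueError(f"no p value found in {report!r}")
    operator, value = match.group(1), float(match.group(2))

    # Inequalities involving a number <= 0.001 were treated as exact values
    form = "exact" if operator == "=" or value <= 0.001 else "inequality"

    if operator == "=":
        below = value < 0.05
    elif operator in ("<", "≤"):
        # "p < 0.2" counts as >= 0.05; "p ≤ 0.05" as below is an assumption
        below = value <= 0.05
    elif value >= 0.05:
        below = False
    else:
        return form, "undetermined"  # e.g. "p > 0.01"; not covered by the text
    return form, "below_0.05" if below else "at_or_above_0.05"


print(classify_p_value("p < 0.2"))
```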

Review of the usage and interpretation of confidence intervals

Where confidence intervals were reported in an article section, we noted (a) whether or not NHST was the analytic goal; (b) if any of the reported confidence intervals were interpreted; and if so, (c) what proportion were interpreted as if they were used to perform a significance test. Results were categorised as all, some or none of them. We deemed that a confidence interval was used to perform a significance test, if the accompanying point estimate was described as “significant”, “statistically significant” (or their negatives) in the absence of a p value, or it was stated whether it included the null value (e.g., 1, for ratio measures).
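The reading of a confidence interval as a significance test amounts to checking whether the interval contains the null value. A minimal sketch of that check follows; the function name is hypothetical, and the example limits are those of the abstract-versus-results-text odds ratio reported in this article's abstract.

```python
def excludes_null(lower, upper, null_value=1.0):
    """True if the interval [lower, upper] does not contain the null value.

    The default null of 1 suits ratio measures; use 0 for difference measures.
    """
    return not (lower <= null_value <= upper)


# OR = 4.5; 95% CI [2.3-9.0]
print(excludes_null(2.3, 9.0))
```

This check discards the interval's width and location, which is why the authors recorded it as a surrogate significance test rather than as an interpretation of the estimate's magnitude or precision.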

We also noted whether the confidence intervals were reported for only commonly used epidemiologic measures, only for other measures, or for both. We considered the following, common epidemiologic measures: Measures of disease occurrence (incidence rates and proportions, prevalence, and mortality), measures of association (odds, risk, rate and prevalence ratios and risk differences), sensitivity, specificity, and median survival time.

Review of the reporting of statistical significance

Whenever “significant”, “significance” or “significantly” (hereafter “significant…”) were found in a section, we noted whether they were used to indicate statistically significant study results, statistically non-significant study results or both. Statistically non-significant results were considered those which were described by a sentence expressing the converse of statistical significance. Whenever “significant…” did not refer to study results, they were excluded from consideration.

Data quality verification

During detailed review of articles, we noted and corrected all occasions in which the appearance of a reporting element was erroneously recorded (false positives). To verify the absence of each of the three reporting elements, we randomly selected 30 sections from each of the four article sections that did not contain the respective reporting element (thus 12 × 30 = 360 article sections in total). For instance, to verify the absence of p values from abstracts, we randomly chose a sample of 30 abstracts from among all abstracts that were found to not contain any p values during the initial search. We noted all occasions in which the absence of a reporting element was erroneously recorded (false negatives) during the initial search and corrected the errors. Verification was done by LM and HYW, by repeating the initial search procedure used to identify reporting elements.

Data Analysis

Estimation of article section-specific proportions

The unit of analysis was the individual article section (abstract, results-text, results-table(s) or discussion). Data were analysed in Stata Statistical Software (Release 14; StataCorp LP., College Station, TX, USA). Overall and annual estimates along with associated 95% confidence intervals (CIs) for proportions of article sections containing each combination of the reporting elements were calculated using the survey data analysis menu. We then plotted article section-specific yearly point estimates to visualize any trends over the study period.
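To illustrate how a proportion and its interval can account for cluster sampling, the sketch below uses a linearised ratio estimator with months as the sampling units and a normal-approximation interval. It is a simplified stand-in, not the Stata procedure: the software's survey estimators may differ in details such as degrees of freedom and interval transformation. The function name and the input counts are hypothetical.

```python
import math


def cluster_proportion(clusters, z=1.96):
    """Estimate a proportion and a 95% CI from cluster-level counts.

    `clusters` is a list of (n_sections, n_reporting) pairs, one per sampled month.
    """
    m = len(clusters)
    total_n = sum(n for n, _ in clusters)
    total_y = sum(y for _, y in clusters)
    p = total_y / total_n
    # Linearised variance of a ratio estimator, clusters as sampling units
    residual_ss = sum((y - p * n) ** 2 for n, y in clusters)
    variance = (m / (m - 1)) * residual_ss / total_n ** 2
    half_width = z * math.sqrt(variance)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical counts for three sampled months
estimate, lower, upper = cluster_proportion([(10, 3), (20, 8), (15, 4)])
print(round(estimate, 3), round(lower, 3), round(upper, 3))
```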

Detailed review

P values

For abstracts, results-texts, results-tables and discussions in which p values were reported, we estimated the proportions in which all, some and none of the p values were written as inequalities as opposed to in exact form. Weighted averages of the proportion of p values that were less than 0.05 were also estimated for each section.

Confidence intervals

For abstracts, results-texts and discussions in which confidence intervals were reported, we estimated the proportions in which the confidence intervals were reported for only the commonly used epidemiologic measures. Additionally, for the subset of these articles in which NHST was possible, we also estimated the proportion of abstracts, results-texts and discussions in which all, some and none of the confidence intervals were interpreted as significance tests.

Statistical significance

For abstracts and discussions in which “significant…” referred to study results, we estimated the proportions of the sections in which the term was used to refer to (a) only statistically significant results, (b) only statistically non-significant results and (c) both.

Comparisons between sections

Finally, we estimated prevalence odds ratios (ORs) and 95% CIs comparing the odds of a given reporting element occurring in one section relative to another section. For this we used Generalised Estimating Equations with a logit link function, binomial distribution, and exchangeable correlation structure (Twisk, 2003), to account for the dependency of sections within articles.
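The marginal model behind these comparisons can be written out as a sketch. The notation is introduced here for illustration and is an assumption about the coding: i indexes articles, j indexes sections within an article, Y is 1 if the reporting element is present, and x is 1 for the section of interest and 0 for the reference section.

```latex
\begin{align*}
\operatorname{logit}\,\Pr(Y_{ij} = 1) &= \beta_0 + \beta_1 x_{ij}, \\
\operatorname{Corr}(Y_{ij}, Y_{ik}) &= \alpha \quad (j \neq k), \\
\widehat{\mathrm{OR}} &= \exp(\hat\beta_1), \qquad
95\%\ \mathrm{CI} = \exp\bigl(\hat\beta_1 \pm 1.96\,\widehat{SE}(\hat\beta_1)\bigr)
\end{align*}
```

The second line expresses the exchangeable working correlation: any two sections of the same article are assumed to share a common correlation α.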

Results

Nine hundred and eighty-five articles were searched initially. Of these, 839 (85%) had an abstract, results-text, results-table, and/or discussion section containing at least one reporting element. Four hundred and fifty-seven article sections containing different combinations of reporting elements were included in the detailed review. The average false positive and false negative error rates for the three reporting elements across the four sections during initial searches were 5% and 1.5%, respectively.

Overall reporting

Over the study period, the proportions of abstracts (31%; 95% CI [28–35]), results-texts (65%; 95% CI [61–68]), results-tables (23%; 95% CI [20–27]) and discussions (59%; 95% CI [56–63]) reporting “significant…” exceeded the proportions reporting p values and the proportions reporting confidence intervals in every section except results-tables (Fig. 1). The proportions of sections reporting confidence intervals and p values were similar, except among results-texts where the proportion reporting p values was substantially higher (Fig. 1). The proportions of results-texts containing p values and “significant…” (separately) were higher than the proportions of other article sections containing these reporting elements (Fig. 1). Confidence intervals were reported in a substantially higher proportion of results-tables (44%; 95% CI [41–48]) than in other sections (Fig. 1). Overall, the odds of reporting “significant…” were substantially higher than both the odds of reporting confidence intervals and the odds of reporting p values in abstracts, results-texts and discussions (Fig. 2). Only in results-tables were the odds of reporting p values or confidence intervals higher than the odds of reporting statistical significance (Fig. 2). The odds of reporting p values were roughly equal to the odds of reporting confidence intervals in all sections except results-texts, where they were higher (OR = 2.5; 95% CI [2.1–2.9]) (Fig. 2).

Figure 1: Proportions (%) of abstracts, results-texts, results-tables and discussion sections reporting p values, confidence intervals or instances of statistical significance.
Point estimates and 95% confidence intervals (CIs) for the proportions (%) of abstracts, results-texts, results-tables and discussion sections reporting at least one p value, confidence interval or instance of statistical significance (regardless of the presence of the others) in original research articles published in Preventive Veterinary Medicine (1997–2017). In each abstract, results-text, results-table and discussion, “p”, “CI C.I. CL C.L. confidence” and “significance significant significantly” indicated the presence of p values, confidence intervals and statistical significance, respectively.

DOI: 10.7717/peerj.12453/fig-1

Figure 2: Odds ratios and 95% confidence intervals comparing abstracts, results-texts, results-tables and discussion sections reporting p values, confidence intervals or statistical significance.
Odds ratios and 95% confidence intervals (CIs) for the comparison of the proportions of abstracts, results-texts, results-tables and discussion sections reporting at least one p value (P), confidence interval (CI) or instance of statistical significance (SS) (regardless of the presence of the others) in original research articles published in Preventive Veterinary Medicine (1997–2017).

DOI: 10.7717/peerj.12453/fig-2

In each section, there was substantial yearly fluctuation in the proportions containing reporting elements over the study period, with average yearly changes of at least 5% (Figs. 3A–3D). Nevertheless, no section showed any net tendency towards either increases or decreases in reporting p values, confidence intervals or “significant…” over the study period. Overall, yearly proportions reporting “significant…” were consistently higher than those reporting p values and confidence intervals in both the abstracts (Fig. 3A) and discussions (Fig. 3D).

Figure 3: Annual proportions of (A) abstracts, (B) results-texts, (C) results-tables and (D) discussion sections reporting p values (•), confidence intervals (Δ) and statistical significance (+).
Plots of the yearly proportions (%) of (A) abstracts, (B) results-texts, (C) results-tables and (D) discussion sections reporting p values (•), confidence intervals (Δ) and statistical significance (+) in original research articles published in Preventive Veterinary Medicine (1997–2017). In each abstract, results-text, results-table and discussion, “p”, “CI C.I. CL C.L. confidence” and “significance significant significantly” indicated the presence of p values, confidence intervals and statistical significance, respectively.

DOI: 10.7717/peerj.12453/fig-3

Among sections in which only one reporting element was present, the pattern was similar to the overall pattern described above, with the proportions of sections reporting only “significant…” exceeding the proportions of sections reporting only p values and only confidence intervals, except among results-tables (Fig. 4).

Figure 4: Proportions of abstracts, results-texts, results-tables and discussion sections reporting combinations of p values, confidence intervals and statistical significance.
Point estimates and 95% confidence intervals for the proportion (%) of abstracts, results-texts, results-tables and discussion sections reporting various combinations of p values (P), confidence intervals (CI) and statistical significance (SS) in original research articles published in Preventive Veterinary Medicine (1997–2017).

DOI: 10.7717/peerj.12453/fig-4

When combinations of the reporting elements were investigated, the proportions of abstracts, results-texts and discussions reporting p values and “significant…” together were higher than the proportions reporting other combinations in the corresponding sections (Fig. 4). Among results-tables, the proportion reporting p values and confidence intervals together was higher than the proportions reporting other combinations (Fig. 4). However, p values and confidence intervals were reported together in fewer abstracts, results-texts and discussion sections than any other combination of reporting elements that included p values.