Reporting and misreporting of sex differences in the biological sciences

As part of an initiative to improve rigor and reproducibility in biomedical research, the U.S. National Institutes of Health now requires the consideration of sex as a biological variable in preclinical studies. This new policy has been interpreted by some as a call to compare males and females with each other. Researchers testing for sex differences may not be trained to do so, however, increasing risk for misinterpretation of results. Using a list of recently published articles curated by Woitowich et al. (eLife, 2020; 9:e56344), we examined reports of sex differences and non-differences across nine biological disciplines. Sex differences were claimed in the majority of the 147 articles we analyzed; however, statistical evidence supporting those differences was often missing. For example, when a sex-specific effect of a manipulation was claimed, authors usually had not tested statistically whether females and males responded differently. Thus, sex-specific effects may be over-reported. In contrast, we also encountered practices that could mask sex differences, such as pooling the sexes without first testing for a difference. Our findings support the need for continuing efforts to train researchers how to test for and report sex differences in order to promote rigor and reproducibility in biomedical research.


Introduction
Historically, biomedical research has not considered sex as a biological variable (SABV).

Including only one sex in preclinical studies, or not reporting sex at all, is a widespread issue (Sugimoto et [...] representation of male non-human animals and cells, the policy was intended not only to ameliorate health inequities but to improve rigor and reproducibility in biomedical research [...] place, but also because any less-than-rigorous test for sex differences creates risk for misinterpretation of results and dissemination of misinformation to other scientists and to the public (Maney, 2016). In other words, simply calling for the sexes to be compared is not enough if researchers are not trained to do so; if SABV is implemented haphazardly, it has the potential to decrease, rather than increase, rigor and reproducibility.

In this study, our goal was to analyze recently published articles to determine how often sex differences are being reported and what statistical evidence is most often used to support findings of difference. To conduct this assessment, we leveraged the collection of articles originally curated by Woitowich et al. (2020) for their analysis of the extent to which SABV is being implemented. Their original list, which was itself generated using criteria developed by [...] disciplines and 34 scholarly journals. Of those, Woitowich et al. identified 151 articles that included females and males and that analyzed data disaggregated by sex or with sex as a fixed factor or covariate. Working with that list of 151 articles, we asked the following questions for each: First, was a sex difference reported? If so, what statistical approaches were used to support the claim? We focused in particular on studies with factorial designs in which the authors reported that the effect of one factor, for example treatment, depended on sex. Next, we asked whether data from males and females were kept separate throughout the article, and if [...] sex difference was reported in 83 articles, or 56%. Of the articles reporting a sex difference, 41 (49%) mentioned that result in the title or the abstract. Thus, in our sample of articles in which data were reported by sex, a sex difference was reported in more than half of the articles and in half of those, the difference was treated as a major finding. In 44% of articles, a sex difference was neither stated nor implied.
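The percentages in this paragraph follow directly from the article counts. The short sketch below recomputes them as a sanity check. It assumes the denominator is the 147 articles named in the abstract (the 151 listed articles minus the four exclusions described in the Methods); the helper name `pct` is hypothetical and is not part of the authors' analysis.

```python
def pct(count, total):
    """Return count/total as a percentage rounded to the nearest whole number."""
    return round(100 * count / total)

# Assumption: 151 articles on the curated list minus 4 excluded articles.
total_articles = 151 - 4
reported_difference = 83      # articles reporting a sex difference
in_title_or_abstract = 41     # of those, articles mentioning it in title/abstract

print(pct(reported_difference, total_articles))                   # 56
print(pct(in_title_or_abstract, reported_difference))             # 49
print(pct(total_articles - reported_difference, total_articles))  # 44
```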

These results are broken down by discipline in Fig. 1B. The sexes were most commonly compared in the field of Endocrinology (93%) and least often in the field of Neuroscience (33%). When sex differences were found in the field of Endocrinology, however, they were reported in the title or abstract only 32% of the time. In the field of Reproduction, the sexes were compared 89% of the time and in 100% of those cases, a sex difference was mentioned in the title or abstract. Sex differences were least likely to be emphasized in the title or abstract in the fields of General Biology and Neuroscience (11% each).

Although a sex difference was claimed in a majority of articles (56%), not all of these differences were supported with statistical evidence. In nearly a third of the articles reporting a sex difference, or 24/83 articles, the sexes were never actually compared statistically. In these cases, the authors claimed that the sexes responded differentially to a treatment when the effect of treatment was not statistically compared across sex. This issue is explored in more detail under Question 2, below. Finally, we noted at least five articles in which the authors claimed that [...]

For each article, we asked whether it contained a study with a factorial design in which sex was one of the factors. This design is common when researchers are interested in testing whether the sexes respond differently to a manipulation such as a drug treatment (Fig. 2A). Below, we use the term "treatment" to refer to any non-sex factor in a factorial design. Such factors were not limited to treatment, however; they also included variables such as genotype, [...]

[...] Table S3. In this hypothetical dataset, there was a significant effect of treatment only in females. Some authors would claim that the treatment had a "sex-specific" effect without testing statistically whether the response to treatment depended on sex. In this example, it does not (see Maney, 2016; Nieuwenhuis et al., 2011). (C) The river plot shows the proportion of articles with a factorial design and the analysis strategy for those. The width of each stream is proportional to the number of articles represented in that stream. (D) The percentage of articles with a factorial design is plotted for each discipline. Only a minority tested for an interaction. (E) The percentage of articles reporting a sex-specific effect is plotted for each discipline. Only a minority reported a significant interaction. (F) Testing for an interaction was less common in articles claiming the presence of a sex-specific effect than in articles claiming the absence of such an effect.

[...] explicitly tested for interactions between sex and other factors in only 26 of the 91 articles (29%). Testing for interactions varied by discipline (Fig. 2D). Authors were most likely to test for and report the results of interactions in the field of Behavioral Physiology (54% of relevant articles) and least likely in the fields of Physiology (0%) and Reproduction (0%).

Of the studies with a factorial design, 58% reported that the sexes responded differently to one or more other factors. The language used to state these conclusions often included the phrase "sex difference" but could also include "sex-specific effect" or that a treatment had an effect "in males but not females" or vice versa. Of the 52 articles containing such conclusions, the authors presented statistics showing a significant interaction, in other words appropriate evidence that females and males responded differently, in only 15 (29%). In one of those articles, the authors presented statistical evidence that the interaction was non-significant, yet claimed a sex-specific effect nonetheless. In an additional five articles, the authors mentioned [...]

Among the articles in which the sexes were pooled, the authors did so without testing for a sex difference more than half of the time (51%; Fig. 3B). When authors did test for a sex difference before pooling, they sometimes found a significant difference yet pooled the sexes [...] were least likely to have tested for a sex difference before pooling (0%) and most likely to do so in Pharmacology (80%). Pooling after finding a significant difference was most common in the field of Neuroscience (40% of articles that pooled).
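To make the alternative to untested pooling concrete, the sketch below compares the sexes and reports both an effect size and a p value before any data are combined. It is an illustration only: the permutation test is a technique swapped in here for convenience, not the method used in any of the analyzed articles, and the function names and measurements are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

def permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p value for a difference in group means."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    observed = abs(mean(a) - mean(b))
    combined = list(a) + list(b)
    na, nb = len(a), len(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(combined)          # reassign the sex labels at random
        diff = abs(sum(combined[:na]) / na - sum(combined[na:]) / nb)
        if diff >= observed - 1e-12:   # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)

def report_before_pooling(females, males):
    """Summarize the sex comparison that would accompany a pooled analysis."""
    return {"d": cohens_d(females, males), "p": permutation_p(females, males)}

# Made-up measurements, for illustration only.
females = [12, 13, 14, 15, 16, 12, 13, 14, 15, 16]
males = [10, 11, 12, 13, 14, 10, 11, 12, 13, 14]
summary = report_before_pooling(females, males)
print(f"d = {summary['d']:.2f}, p = {summary['p']:.3f}")
```

In this made-up example the sexes differ clearly, so pooling them would hide a difference of more than one standard deviation. Reporting the effect size alongside the p value lets readers judge the decision to pool for themselves.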

Question 4: Was the term "gender" used for non-human animals?

To refer to the categorical variable comprising male/female or man/woman (all were binary), the term "sex" was used exclusively in 69% of the articles (Fig. 4). "Gender" was used exclusively in 9%, and both "sex" and "gender" were used in 19%. When both terms were used, they usually seemed to be used interchangeably. In 4% of the articles, neither term was used.

Of the articles in which the term "gender" was used, 20% of the time it referred to non-human animals, such as mice, rats, and pigs. In one case, both "sex" and "gender" were used to refer to non-human animals in the title. In another case, "gender" was used to refer to human [...]

In the set of articles analyzed here, sex differences were claimed in a majority and were often highlighted in the title or abstract. We therefore found little evidence that researchers, at least those who comply with NIH guidelines, are uninterested in sex differences. Conversely, our finding could indicate that researchers interested in sex differences are primarily the ones following NIH guidelines.

Testing for interactions in a factorial design
Testing whether the sexes respond differently to a treatment requires statistical comparison between the two effects, which is typically done by testing for a sex × treatment interaction. In our analysis, however, tests for interactions were missing 71% of the time (Fig. 2C, D). In these cases, the most common method for detecting differential effects of treatment [...] (Cumming, 2012). This error, and the frequency with which it is made, has been covered in [...] Reproduction, for which we found that authors never tested for interactions.
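The distinction between "significant in one sex only" and "the sexes responded differently" can be illustrated with a small sketch. It compares the treatment effect within each sex and then tests the difference between those two effects, which is the interaction contrast in a 2 × 2 design. The sketch makes several assumptions: it uses a large-sample normal approximation in place of the ANOVA interaction term that authors would typically report, and the data, group means, and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def effect(treated, control):
    """Return the treated-minus-control mean difference and its squared standard error."""
    diff = mean(treated) - mean(control)
    se2 = variance(treated) / len(treated) + variance(control) / len(control)
    return diff, se2

def two_sided_p(estimate, se2):
    """Two-sided p value from a normal approximation (an assumption of this sketch)."""
    z = estimate / sqrt(se2)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def interaction_test(f_treated, f_control, m_treated, m_control):
    """Test the treatment effect in each sex and the difference between the two effects."""
    d_f, se2_f = effect(f_treated, f_control)
    d_m, se2_m = effect(m_treated, m_control)
    return {
        "p_female": two_sided_p(d_f, se2_f),
        "p_male": two_sided_p(d_m, se2_m),
        # The interaction contrast is the difference between the two effects.
        "p_interaction": two_sided_p(d_f - d_m, se2_f + se2_m),
    }

def make_group(group_mean):
    """Build a made-up group of 20 values centered on group_mean."""
    spread = [-2, -1, 0, 1, 2] * 4
    return [group_mean + x for x in spread]

# Hypothetical data: a larger treatment effect in females than in males.
result = interaction_test(
    f_treated=make_group(11.2), f_control=make_group(10.0),
    m_treated=make_group(10.6), m_control=make_group(10.0),
)
for name, p in result.items():
    print(f"{name}: {p:.3f}")
```

With these made-up numbers the treatment effect is significant in females and not in males, yet the interaction is not significant. The data therefore do not support a claim that the sexes responded differently.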

Statements such as the following, usually made without statistical evidence, were common: "The treatment increased expression of gene X in a sex-dependent manner"; "Our results demonstrate that deletion of gene X produces a male-specific increase in the behavior"; "Our findings indicate that females are more sensitive to the drug than males". In some of these cases, the terms "sex-specific", "sex-dependent" or "sexual dimorphism" were used in the title of the article despite a lack of statistical evidence supporting the claim. In many of these articles, some of which stated that finding a sex difference was the major goal of the study, the sexes were not statistically compared at all. Thus, a lack of statistical evidence for sex-specific effects did not prevent authors from asserting such effects. In fact, we found that authors failing to test for interactions were far more likely to claim sex-specific effects than not (88% vs. 12%; Fig. 2F); they were also more likely to do so than were authors who did test for interactions (88% vs. 62%; Table S3). Together, these results suggest a bias toward finding sex differences. In the absence of evidence, differences were claimed more often than not. A bias toward finding sex [...] given also that sex differences are often misrepresented to the public (Maney, 2014), it is especially important to base conclusions from preclinical research on solid statistical evidence.

Pooling across sex

The set of articles we analyzed was pre-screened by Woitowich et al. (2020) to include only studies in which sex was considered as a variable. Nonetheless, even in this sample, data were often pooled across sex for some of the analyses (Fig. 3A). In a majority of these articles, authors did not test for a sex difference before pooling (Fig. 3B). Thus, for at least some analyses represented here, the data were not disaggregated by sex, sex was not a factor in those analyses, and we do not know whether there might have been a sex difference. Even [...]

We found that a large majority of studies on non-human animals used "sex" to refer to the categorical variable comprising females and males (Fig. 4). In eight articles, we noted usage of the word "gender" for non-human animals. This usage appears to conflict with current recommendations regarding usage of "gender", that is, gender should refer to socially constructed identities or behaviors rather than biological attributes (Clayton & Tannenbaum, [...]

This study was underpowered for examining these issues within any particular discipline. For most disciplines, fewer than a dozen articles were in our starting sample; for Neuroscience and Reproduction, only nine. As a result, after we coded the articles, some categories contained few or no articles in a given discipline (see Table S3). The within-discipline analyses, particularly the pie charts in Fig. 3B, should therefore be interpreted with caution. Firm conclusions about whether a particular practice is more prevalent in one discipline than another cannot be drawn from the data presented here.

As is the case for any analysis, qualitative or otherwise, our coding was based on our interpretation of the data presentation and wording in the articles. Details of the statistical approach were sometimes left out, leaving the authors' intentions ambiguous. Although our approach was as systematic as possible, a small number of articles may have been coded in a way that did not completely capture those intentions. We believe our sample size, particularly in the overall analyses across disciplines, was sufficient to reveal the important trends.

Conclusion
SABV has been hailed as a game-changing policy that is already bringing previously ignored sex-specific factors to light, particularly for females. In this study, we have shown that a substantial proportion of claimed sex differences, particularly sex-specific effects of [...]

We conducted our analysis using journal articles from a list published by Woitowich et al. (2020). In their study, which was itself based on a study by Beery and Zucker (2011), the authors selected 720 articles from 34 journals in nine biological disciplines. Each discipline was represented by four journals, with the exception of Reproduction, which was represented by two (Table 1). To be included, articles needed to be primary research articles not part of a special [...] analyzed by sex, defined as either that the sexes were kept separate throughout the analysis or that sex was included as a fixed factor or covariate. Of the original 720 articles analyzed, 151 met this criterion. We began our study with this list of 151 articles. Four articles were excluded because they contained data from only one sex, with animals of the other sex used as stimulus animals or to calculate sex ratios. [...] designs in each, and a subset of the articles was discussed between the authors to develop an analysis strategy. All articles were then coded by the second author (DLM). The final strategy consisted of four decision trees (Table S1) used to assign articles to hierarchical categories pertaining to each of four central questions (see below). Each article was assigned to only one final category per question (Table S2). A subset of the articles was independently coded by YGS, and any discrepancies were discussed between the authors until agreement was reached.

Question 1: Was a sex difference reported? Because we were interested in the frequency with which sex differences were found, we first identified articles in which the sexes were explicitly compared. We counted as a comparison any of the following: (1) sex was a fixed [...] and females was presented; (4) the article contained wording suggestive of a comparison, e.g., "males were larger than females". We also included articles with wording suggestive of a sex difference in response to a treatment, for example "the treatment affected males but not females" or "the males responded to treatment, whereas the females did not", or "the treatment had a sex-specific effect". Similarly, we included here articles with language referring to a non-difference, for example "we detected no sex differences in size" or "the response to treatment was similar in males and females." Articles in which sex was included as a covariate for the purposes of controlling for sex, rather than comparing the sexes, were not coded as having compared the sexes (see Beltz et al., 2019). When the sexes were compared but no results of those comparisons, e.g., p values, were reported, that omission was noted and the article was [...] explicitly identified as a fixed factor; we included here all studies comparing across levels of one factor that comprised females and males with each of those levels. In some cases that factor was a manipulation, such as a drug treatment or a gene knockout; these factors also included [...]

For studies with a factorial design, we further coded the authors' strategy of data analysis. First, we noted whether authors tested for an interaction between sex and treatment. We included one study in which the effect of treatment was explicitly compared across sex using a method other than a classic ANOVA (the magnitudes of the differences between treated and control groups were compared across sex). If authors tested statistically whether the effect of treatment depended on sex, we noted the outcome of that test and the interpretation. Articles [...]

[...] Table S2. Any article containing pooled data was coded as pooled, even if some analyses were conducted separately or with sex in the model. For articles that pooled, we further noted whether the authors tested for a sex difference before pooling and, if so, whether p values or effect sizes were reported.

Question 4: Did the authors use the term "sex" or "gender"? We searched the articles for the terms "sex" and "gender" and noted whether the authors used one or the other, both, or neither. Terms such as "sex hormones" or "gender role", which did not refer to sex/gender variables in the study, were excluded from this assessment. For the articles using "gender" we further noted when the term was used for non-human animals.
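The term search described above could be approximated in code along the following lines. This is a minimal sketch, not the authors' procedure, which is not described as automated. The exclusion list contains only the two example phrases given in the text, and the function name `classify_terms` is hypothetical; a real screen would need a longer exclusion list and manual review.

```python
import re

# Only the two example phrases named in the text; a real list would be longer.
EXCLUDED_PHRASES = [r"sex hormones?", r"gender roles?"]

def classify_terms(text):
    """Classify a text as using 'sex', 'gender', 'both', or 'neither'."""
    lowered = text.lower()
    # Remove phrases that do not refer to a sex/gender variable in the study.
    for phrase in EXCLUDED_PHRASES:
        lowered = re.sub(phrase, " ", lowered)
    has_sex = re.search(r"\bsex(es)?\b", lowered) is not None
    has_gender = re.search(r"\bgenders?\b", lowered) is not None
    if has_sex and has_gender:
        return "both"
    if has_sex:
        return "sex"
    if has_gender:
        return "gender"
    return "neither"

print(classify_terms("Mice of both sexes were used."))                      # sex
print(classify_terms("Sex hormones were measured; gender was recorded."))   # gender
print(classify_terms("Sex hormones and gender roles were not assessed."))   # neither
```

A search of this kind only flags candidate usages. Deciding whether "gender" refers to non-human animals still requires reading the passage in context.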

To visualize the data, we used river plots (Weiner, 2017), stacked bar graphs, and pie charts based on formulae and data presented in Table S3.