Introduction

Mounting evidence suggests gender bias in publications and citations of scholars in STEM1,2. Such biases can result in situations where women (or other under-represented minorities) may feel invisible and ignored in men-dominated environments. The feeling of not being part of the community can result in a higher dropout rate among women, a phenomenon known as leaky pipeline3. Leakage in the academic pipeline consequently affects the academic community for generations to come due to a lack of diversity, inclusion, innovation, and role models. Thus, it is of utmost societal importance to accurately identify those biases and devise bottom-up approaches to tackle them.

Gender inequality in academia manifests itself in the production of science and performance outcomes. Unequal division of childcare, parental leave policies, career breaks, limited access to role models and resources can create situations in which women and other minorities show less productivity and performance compared with their men peers. Frequently, these inequalities are exacerbated through formal and informal social relationships, which in turn affect the citation network structure and reinforce existing inequalities.

Academic productivity is often associated with the number of publications throughout a researcher’s career. Previous studies have found that women publish fewer peer-reviewed articles than men4,5, while more recent studies found that the disparity in the productivity of men and women disappears once productivity is compared relative to the scholar’s career length6,7. Women display higher publication rates later in their academic careers, but take up fewer leadership roles5,8. Mueller et al.9 suggest that publication productivity may be a factor that hinders women from advancing within surgery, while Reed and colleagues point out that mid-career assessment of productivity may not be an appropriate measure of leadership skills5.

Beyond disparities in publication and productivity, analyzing citation patterns can help to identify whether gender differences exist in the way scholars award and recognize each others’ works. In other words, while productivity is associated with individual or collaborative efforts, citation is an indication of how these efforts are perceived by the community10. In this sense, one can argue that while the former operates among a small number of collaborators, the latter is related to the social processes that govern the community of scholars at large.

Previous studies have shown that patterns of citation can be different for men and women. This could be explained by intentional decisions, quality differences, or paradigmatic research topics11,12,13. It has been argued that in the most productive countries, articles with women in key author positions receive fewer citations than those with men in the same positions14,15. Moreover, some research concluded that the differences in citation rates between men and women increase with the number of authors per article16. This indicates that women are not only relatively less represented as high-impact key authors, but also that they attract significantly fewer citations for those key positions compared to men. One plausible assumption is that the lack of women in leadership positions causes this accentuated under-representation of women (structural reasons), since the distribution of key authorships follows, by convention, a hierarchical order. In a recent paper, Dworkin and colleagues2 present a case study of citation patterns in top neuroscience journals, finding that papers for which first and last authors are men are over-represented in reference lists, and that the discrepancy is most prominent in the citation behaviors of men and is getting worse over time.

A major methodological obstacle is that simply comparing the number of publications and citations of men and women is misleading. Men and women have different rates of entry in the scientific community for historical reasons, and when combined with other non-academic responsibilities, they may not show similar behavior at the aggregate level. Indeed, recent findings show that when differences in career length are controlled for, men and women scientists have similar rates of publication and citation on average7. However, beyond these insights on a population level, do men and women really receive different recognition for similar work published around the same time? To truly examine the gender differences in citation, one should compare pairs of papers that cover the same topics in a comparable way. Relying only on average performance may hide variations that exist in the data, and drive the community to inaccurate conclusions or inappropriate policies.

In this paper, we focus on analyzing publication and citation patterns in the physics community as one of the core STEM areas where women are exceedingly under-represented, often facing belittling remarks and harassment17,18. Our analyses reveal significant gender disparities in productivity, dropout rate, self-citation, and overall visibility of women in the citation network. More importantly, we examine gender differences not only at the population level, but also at the microscale by comparing pairs of statistically validated similar papers. We find that temporal biases play a central role in gender inequalities, as men benefit from an asymmetrical first-mover advantage and, due to historical biases, there is a disproportionate number of men senior researchers.

Results

We start by describing the dataset we have analyzed and briefly explaining the methodology we have used to build the citation network and the pairs of similar papers. Then, we proceed to study gender disparities, first at the aggregate level and then by comparing pairs of similar papers.

Data description

We study an American Physical Society (APS) dataset from 1893 to 2009, which contains articles’ metadata, the authors’ basic information, and the citations within the papers. The metadata consists of authors’ full names and a unique digital object identifier (DOI) of the publication in a string format. For those names that are repeated in the dataset, we used name disambiguation methods proposed by Sinatra et al.19 to detect unique authors and correctly match authors to publications (see Supplementary Fig. 1). To infer gender from names, we implemented a gender-detection procedure that combines author names with an image-based gender inference technique applied to search results from Google Images20. This combined method results in high accuracy in the gender identification of scholars from different nationalities (see Supplementary Methods). The final dataset consists of 541,448 scholarly articles published over the course of 116 years, categorized into 11 journals. Among those 541,448 papers, we were able to identify the gender of at least one participating author for 375,736 papers. We identified 120,776 gendered names: 17,763 women and 103,013 men. The evolution of the number of authors per year is shown in Fig. 1a.

Fig. 1: Rate of growth of women participation, average publications by career age, dropout rate and annual ratio of men/women self-citations.

a Number of men (blue) and women (orange) authors per year. b Average number of publications by authors' career age. The shaded area shows the standard deviation. c Proportion of men and women authors who drop out compared to the remaining active authors per career age. d Normalized ratio of men/women self-citations computed from (1) during the time period of interest. The horizontal dashed line is the line of equilibrium; data points above the equilibrium line indicate a higher ratio of men's self-citation, and points below the line imply a higher ratio of women's self-citation.

Here, the notion of “gender” refers neither to the sex of the authors nor to the gender that the author self-identifies as. By the word “woman”, we mean an author whose name has a high probability of being assigned to female at birth or being identified as a woman due to facial characteristics. Given this limitation, we can safely argue that these methodologies are in accordance with social constructs and what people perceive as gender in society.

Constructing citation networks and assessing similar pairs

We build the citation networks by considering each paper as a node and making a link from paper i to paper j if i includes a citation to j. We measure the similarity between two papers using the bibliographic coupling strength21,22; that is, the number of publications that both papers cite. Two papers that cover similar topics in a comparable way are assumed to include a similar set of outgoing citations. However, within subfields there is usually a handful of classic publications that are cited in most works, so their inclusion in two different papers may not indicate actual similarity, but a citation convention. To avoid such shortcomings of naive bibliographic coupling, and guarantee the significance of the overlapping set of citations, we apply a statistical test based on the hypergeometric distribution. This test controls for the incoming citations of the commonly cited papers and checks whether the size of the common set of citations is so large that it cannot be explained by randomness. The problem of identifying similar papers to assess gender disparities has also been approached recently using machine learning techniques23.
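As an illustration of the coupling step described above, the following minimal sketch computes the bibliographic coupling strength of two papers and a hypergeometric tail probability for their overlap. It is a simplified stand-in, not the procedure detailed in Methods: it assumes every paper in the pool is equally likely to be cited, so it does not control for the incoming citations of the commonly cited papers. The function names, the pool size `n_total`, and the paper identifiers are hypothetical.

```python
from math import comb

def coupling_strength(refs_i, refs_j):
    """Bibliographic coupling strength: number of references shared by two papers."""
    return len(set(refs_i) & set(refs_j))

def hypergeom_pvalue(n_total, refs_i, refs_j):
    """P(overlap >= observed) if paper j picked its references uniformly at random.

    n_total is the assumed number of papers that could have been cited.
    """
    a, b = len(set(refs_i)), len(set(refs_j))
    k = coupling_strength(refs_i, refs_j)
    # Upper tail of the hypergeometric distribution, summed term by term.
    tail = sum(comb(a, x) * comb(n_total - a, b - x)
               for x in range(k, min(a, b) + 1))
    return tail / comb(n_total, b)

# Toy example with made-up paper identifiers.
refs_i = {"p1", "p2", "p3", "p4"}
refs_j = {"p2", "p3", "p4", "p9"}
k = coupling_strength(refs_i, refs_j)
p = hypergeom_pvalue(1000, refs_i, refs_j)
print(k, p)
```

A small tail probability indicates that the shared references are unlikely to arise by chance under this uniform null model; a null model weighted by in-degree would be needed to discount widely cited classics.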

To explore gender disparities, we select pairs of similar papers written by men and women primary authors, respectively. Then, we compare the future incoming citations to each paper of the pair. This comparison allows us to detect potential inequalities in the citation patterns. We have summarized this methodology in the diagram of Fig. 2 and provided all the technical details in Methods.

Fig. 2: Assessing similar pairs.

We use bibliographic coupling and hypergeometric statistical tests to select couples of similar papers based on their outgoing citation activities. Then we compare their respective popularity (incoming citations). Each node and each arrow represent a paper and a citation respectively, whereas each dashed arrow represents a potential citation that is missing. The pair of papers being assessed (i and j) are shown in blue and orange, the papers cited by them in yellow, and the papers that cite them in green. The black arrow at the bottom represents a timeline showing the publication times of the papers.

Aggregate gender disparity trends

To characterize the gender disparities at the aggregate level, we first analyze the aspects of scientific production that depend primarily on individual choices and ability: in particular, productivity, dropout rate, and self-citations. Then, we discuss authorship order, which depends on the internal organization of research groups. Finally, we study the behavior of the scientific community as a whole by comparing the citations received by men and women.

Productivity

We define productivity as the number of publications that scholars produce during their career. In physics, we observe that women have a lower average number of publications compared to men across all their career ages (Fig. 1b). While the publication gap narrows in the first two years of an author’s career, we observe a sudden increase in the gap from the second to the eleventh year. After this point, the publication gap starts decreasing again. These fluctuations in publication productivity can be associated with, among other things, the disproportionate family responsibilities that women have to take on compared with men24. For the aggregate results, see the productivity distributions by gender in Supplementary Fig. 4.

Although a researcher’s productivity can be considered to be determined mainly by individual skills, the collaborative nature of scientific work makes it dependent on external factors such as other team members or departmental organization. Likewise, these factors, together with other aspects like social perception or family responsibilities, affect women’s motivation to keep working in academia, potentially leading to the leaky pipeline phenomenon. To quantify this phenomenon, in the next section we explore the differences in dropout rates between men and women.

Dropout rate

We compute dropout as a lack of publication activity for at least five years to distinguish the authors who are active in publishing from those who have dropped out. We investigate the ratio of dropout scholars at each career age compared to the number of active scholars by gender. Figure 1c shows that women authors have a higher dropout ratio throughout their whole career. The largest gaps appear in the early career years, with a 2.28% difference between men and women in the first year and a 2.26% difference in the sixth year. The dropout rates of authors who leave academia after their first year (career age 0) are not shown in Fig. 1c. This career age presents the highest dropout rates, with 39.94% for men authors and 47.55% for women authors.
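The dropout ratio per career age can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used for Fig. 1c: each author is reduced to a hypothetical pair (first publication year, last publication year), career age is counted in years since the first publication, and authors whose silence before the end of the dataset is shorter than five years are treated as still active.

```python
from collections import Counter

LAST_YEAR = 2009   # final year covered by the dataset
GAP = 5            # years of inactivity that count as dropout

def dropout_ratio_by_career_age(careers):
    """careers: list of (first_pub_year, last_pub_year), one entry per author.

    Returns {career_age: dropouts at that age / authors active at that age}.
    """
    dropped = Counter()
    active = Counter()
    for first, last in careers:
        final_age = last - first
        for age in range(final_age + 1):
            active[age] += 1
        # Authors with fewer than GAP silent years cannot yet be called dropouts.
        if LAST_YEAR - last >= GAP:
            dropped[final_age] += 1
    return {age: dropped[age] / active[age] for age in sorted(active)}

# Made-up careers for illustration only.
careers = [(1990, 1990), (1990, 1995), (2000, 2008), (1985, 1990)]
ratios = dropout_ratio_by_career_age(careers)
print(ratios)
```

Running the function separately on the men and women authors would give the two curves being compared. The sketch ignores authors who resume publishing after a long pause, which a full analysis would need to handle.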

Self-citation

Self-citation refers to cases where authors cite their own previous works. Self-citations increase the total citation count and the visibility of scholars25,26,27, potentially enhancing academic promotion and attention. We have measured the relative number of self-citations by all men and women authors with the following metric (r) to study the difference in self-citation ratios between the two genders over time25:

$$r=\frac{\dfrac{\%\,\text{men's self-citations}}{\%\,\text{men's citations}}}{\dfrac{\%\,\text{women's self-citations}}{\%\,\text{women's citations}}}$$
(1)

Figure 1d shows the temporal evolution of the ratio r. This result shows that women tend to cite themselves less than men and that this trend is consistent over the years (See Supplementary Table 2 for more details). Consequently, women’s visibility in the citation network is partly penalized by the higher ratio of men citing their own previous works.
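A minimal sketch of the ratio r for a single year is given below. It assumes that each percentage in Eq. (1) is a gender's share of the corresponding yearly total (self-citations or all citations); under that reading, r reduces to the men's self-citation rate divided by the women's. The function name and the counts are hypothetical.

```python
def self_citation_ratio(men_self, men_total, women_self, women_total):
    """Ratio r of Eq. (1), with each percentage read as a share of the yearly total."""
    pct_men_self = men_self / (men_self + women_self)
    pct_women_self = women_self / (men_self + women_self)
    pct_men_cit = men_total / (men_total + women_total)
    pct_women_cit = women_total / (men_total + women_total)
    return (pct_men_self / pct_men_cit) / (pct_women_self / pct_women_cit)

# Invented counts: men self-cite in 15% of their citations, women in 10%.
r = self_citation_ratio(men_self=300, men_total=2000,
                        women_self=30, women_total=300)
print(r)
```

Values of r above 1 correspond to points above the equilibrium line in Fig. 1d, and values below 1 to points beneath it.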

Another fundamental factor that affects an author’s visibility is the position in which her name appears in the list of authors. This position depends on how the whole research group is organized and, crucially, in most cases it depends on the perceived level of contribution of each collaborator.

Authorship order analysis

In the majority of the scientific fields, including physics, the authorship order indicates relative contribution and seniority by putting emphasis on the first, the last, and the second positions28,29. In order to compare the positions of authors, we first discard those papers for which authorship order is alphabetical. For this purpose, we perform a string comparison of the last names of the contributing authors and consider them to be in alphabetical order if the paper has at least four authors and all of them follow this order. Around 3.54% of the papers can be considered as alphabetically ordered; in Supplementary Table 3 we detail their fraction by PACS subfield (Physics and Astronomy Classification Scheme). After discarding those papers from the analysis, we study the authorship order in each publication and compare the proportion of women and men in each position of the author list (first, second, middle and last). We perform this comparison using a two-proportion z-test (see Methods). If there is only one author in a paper, we consider her the first author. Middle authors are those between second and last in papers with more than three authors.
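The alphabetical filter and the proportion comparison can be sketched as below. This is an illustrative sketch only: the case-insensitive string comparison, the pooled-variance form of the two-proportion z-test, and the example counts are assumptions, and the exact test specification is the one given in Methods.

```python
from math import sqrt, erfc

def is_alphabetical(last_names, min_authors=4):
    """True if a paper has at least min_authors authors listed in alphabetical order."""
    if len(last_names) < min_authors:
        return False
    keys = [name.casefold() for name in last_names]
    return keys == sorted(keys)

def two_proportion_z(x1, n1, x2, n2):
    """Two-proportion z-test with pooled variance; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))

# Invented counts: women in 100 of 1000 last-author slots
# versus 150 of 1000 slots in a reference group.
z, p_value = two_proportion_z(100, 1000, 150, 1000)
print(is_alphabetical(["Adams", "Baker", "Chen", "Diaz"]), z, p_value)
```

A negative z here means the first proportion is lower than the second, which is the direction one would look for when testing under-representation in a given position.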

The results show that there are more women than expected by chance in the first, second and middle author positions, and they are heavily under-represented as last authors (see Supplementary Table 4). The last author in physics papers is usually the most senior member of the team, so this trend can be explained by the later and slower rate of arrival of women, combined with their higher dropout rate throughout their career. This is in line with previous findings that women feature only rarely as the last authors in leading journals30.

While the authorship order reflects how a researcher’s coworkers perceive her contribution, the collective perception of the scientific community regarding the importance of a paper is manifested in the citations of papers. In the following sections we will thoroughly compare the relative popularity of publications led by women and men.

Citation centrality analysis

The flow of citations determines the visibility and recognition of papers both locally and globally. To measure the local influence of papers we use the in-degree metric, and to measure the global influence, we use the PageRank centrality. Our aim is to verify if the visibility of papers written by women is proportionate to what we expect from their overall population size. To do that, we focus on the ranking of the nodes according to their respective centrality.

Understanding ranking centrality is important for three reasons. First, the authors of papers in top ranks gain more visibility for themselves and those central papers influence future citation patterns31,32,33. Second, the visibility of papers in top ranks is being exacerbated by algorithmic tools such as Google Scholar. Third, since citation networks follow a heavy-tailed distribution, those in top ranks stabilize their ranking position and give few opportunities for other papers to catch up34. Because of these network effects, it is important to study how minorities are represented in top network centrality ranks.

We assigned a gender to each paper by labeling it based on its first author. Then, we analyzed the top h% in-degree/PageRank centrality of the papers. Figure 3a suggests that papers written by women have significantly lower in-degree and PageRank centrality than expected from their overall proportion. Women-led publications are substantially under-represented in the top 20%, 30%, and 40% of the ranking, and the deviation between the observed and the expected proportions likewise increases in the highest rank positions. While in-degree and PageRank follow a similar trend as expected, the proportion of women with high PageRank centrality is even lower when compared to the in-degree centrality. This suggests not only that papers written by women receive less attention but also that they are disadvantaged in terms of their position within the entire citation network. Statistical tests confirm these findings (see Supplementary Table 5).
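The top-h% comparison can be illustrated with the short sketch below, which uses in-degree only; the same function would apply to PageRank scores computed separately. The input format, the gender labels "M" and "W", the rounding of the cutoff, and the arbitrary handling of tied ranks are all assumptions made for illustration.

```python
def women_share_in_top(papers, h):
    """Share of woman-led papers among the top h% of papers by centrality.

    papers: list of (centrality, first_author_gender) with gender "M" or "W".
    """
    ranked = sorted(papers, key=lambda p: p[0], reverse=True)
    cutoff = max(1, round(len(ranked) * h / 100))
    top = ranked[:cutoff]
    return sum(1 for _, gender in top if gender == "W") / len(top)

# Invented in-degrees; women lead 3 of the 10 papers overall.
papers = [(50, "M"), (40, "M"), (30, "M"), (20, "W"), (10, "M"),
          (8, "M"), (5, "W"), (3, "M"), (2, "W"), (1, "M")]
for h in (20, 40, 100):
    print(h, women_share_in_top(papers, h))
```

Comparing each value with the share at h = 100 (the overall proportion, shown as the dotted line in Fig. 3a) indicates whether women-led papers are under-represented at a given rank threshold.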

Fig. 3: Women author proportions in degree and PageRank centrality, evolution of centrality difference by year and relationship between time of publication and citation.

a Proportion of publications with a woman primary author per top h% of degree (black) and PageRank centrality (red). The dotted horizontal line signifies the proportion of women primary authors in the observed samples. b Citation and temporal differences between man–woman pairs of papers with validated similarity. The colors indicate the quadrant each pair belongs to (black—quadrant 1, red—quadrant 2, green—quadrant 3, and purple—quadrant 4). c Heat map showing the probability anomaly of the joint probability distribution of citation and temporal differences computed with equation (2). d Centrality differences of similar man-man pairs and similar man–woman pairs over the years. The two papers within each pair are published no more than 3 years apart, and the publication year of the pair is defined by the year of the latter paper. The lines are the mean values and the shaded areas the standard errors. The evolution of the distribution as a whole is shown in Supplementary Fig. 7 as a percentile plot.

So far, the global gender analysis points towards a notable disparity in productivity and citation of men and women. This could be partly due to historical reasons, to the cumulative advantage that early arrival confers to men, as well as to the high dropout rate of women7. The slower rate of arrival of women (see Fig. 1a) may also play a relevant role. Together, these factors affect women’s global visibility. The question that arises from these global results is, are scholars intentionally ignoring (and therefore, under-citing) research works led by women? To explore this possibility, in the following section we study pairs of papers written by men and women that are statistically validated twins, and measure the citations that each paper receives.

Pair-wise citation analysis

We identified statistically validated pairs of similar papers (one with a man as first author and the other with a woman) using the methodology described in Methods and summarized in Supplementary Fig. 2. Then, we computed the difference in the number of citations each member of the pair receives. The overall expectation is that similar pairs of papers should have a similar number of incoming citations on average. The first sign of gender bias that we have found is that, within similar pairs of man–woman papers, men get more citations in 45% of the pairs, women in 39%, and in 16% they receive the same number of citations. We performed binomial tests against the null hypothesis that men and women should be equally likely to get citations within each similar pair and obtained a strong rejection (p-value ≈ 0).
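The binomial test can be sketched as an exact two-sided sign test in which tied pairs are discarded. Whether ties were excluded in the original analysis is an assumption, and the counts below are invented solely to mirror the reported 45% / 39% / 16% split; the number of pairs in the actual data is not stated here.

```python
from math import comb

def sign_test_pvalue(men_wins, women_wins):
    """Exact two-sided binomial test of P(man-led paper wins) = 0.5, ties excluded."""
    n = men_wins + women_wins
    k = max(men_wins, women_wins)
    # Probability of a split at least this lopsided, doubled for two sides.
    tail = sum(comb(n, x) for x in range(k, n + 1))
    return min(1.0, 2 * tail / 2 ** n)

# Hypothetical counts out of 10,000 pairs.
men_wins, women_wins, ties = 4500, 3900, 1600
p_value = sign_test_pvalue(men_wins, women_wins)
print(p_value)
```

With several thousand decisive pairs, even a six-point gap yields a vanishingly small p-value, whereas the same split over a few dozen pairs would not be significant.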

To quantify men’s advantage, we computed the average citation difference between the man-led and the woman-led paper of each pair. Then we normalized it using the standard deviation of men’s and women’s citations to obtain Cohen’s d, a measure of effect size for the difference of means. We evaluated the significance of these differences using z-tests (see Methods). As shown in Table 1, men’s average citation count is significantly higher than women’s both in aggregate and when we consider each PACS subfield separately to control for potential differences in the citation biases per subfield. We obtained similar results by controlling for journal instead of subfield (see Supplementary Note 1 and Supplementary Table 10). We performed analogous analyses for last authors, finding consistent results for most subfields and journals (see Table 2 and Supplementary Table 12). The only noteworthy difference appears in PACS 80 (Interdisciplinary Physics & Related Studies), where women get more citations on average as first authors.
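The effect size computation can be sketched as follows. The sketch assumes that the normalizing standard deviation is the root mean square of the two groups' sample standard deviations, which is one common convention for Cohen's d; the precise definition used for Table 1 is the one in Methods. The citation counts are invented.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(men_citations, women_citations):
    """Mean paired citation difference divided by a pooled standard deviation.

    The two lists are aligned: position k holds the two papers of pair k.
    """
    diffs = [m - w for m, w in zip(men_citations, women_citations)]
    pooled_sd = sqrt((variance(men_citations) + variance(women_citations)) / 2)
    return mean(diffs) / pooled_sd

# Invented citation counts for four similar pairs.
men = [10, 12, 8, 14]
women = [9, 10, 8, 11]
d = cohens_d(men, women)
print(d)
```

Positive values indicate that the man-led papers receive more citations on average.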

Table 1 Differences in received citations among similar pairs of publications labeled by their first-author gender.
Table 2 Differences in received citations among similar pairs of publications labeled by their last-author gender.

It is known that the publication time of a paper influences its citation count, and previous studies1,35 have used different strategies to control for it. To check whether the temporal difference between two papers is responsible for the citation disparity for women (an older paper has had more time to accumulate citations), we add a maximum 3-year difference restriction between two similar papers and redo the citation difference analyses. Tables 1 and 2 show that when the time constraint is applied, the citation difference between two similar publications decreases significantly (see Supplementary Tables 11 and 13 for the journal-wise analyses). The effect is stronger for first than for last authors. The subfield Interdisciplinary Physics & Related Studies (PACS 80) presents an anomalous behavior, as women have the citation advantage as first authors while men have it as last authors. In contrast to the rest of the subfields, this advantage increases after applying the time constraint.

However, citations have a very heterogeneous distribution, with a tiny fraction of papers gathering a huge number of citations, so these discrepancies may be caused by a few papers written by women with many citations. To mitigate the influence of such outliers, we have performed analogous tests for the difference of medians. In particular, we have used the Wilcoxon test to quantify the significance of the difference and the rank biserial correlation (rc)36 to estimate its effect size. The rc metric ranges from −1, when women have more citations in every pair, to +1, when men do. The results, presented in Supplementary Tables 14 and 15, show that the apparent advantage of women in PACS 80 (and in PACS 00, General Physics) after applying the time constraint was mostly driven by outliers, as rc is positive in all cases; although, consistent with the previous analyses, it is smaller when the time constraint is applied.
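The rank biserial correlation for paired differences can be sketched as below, using the matched-pairs form (sum of positive ranks minus sum of negative ranks, divided by the total rank sum). That this is the exact variant behind the reported rc values is an assumption. Zero differences are dropped and tied magnitudes receive average ranks, as in the usual Wilcoxon signed-rank procedure.

```python
def rank_biserial(diffs):
    """Matched-pairs rank biserial correlation of paired differences (men minus women)."""
    nonzero = sorted((d for d in diffs if d != 0), key=abs)
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(nonzero):
        j = i
        while j < len(nonzero) and abs(nonzero[j]) == abs(nonzero[i]):
            j += 1
        average_rank = (i + 1 + j) / 2   # mean of the ranks i+1, ..., j
        for k in range(i, j):
            ranks[k] = average_rank
        i = j
    positive = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    negative = sum(r for r, d in zip(ranks, nonzero) if d < 0)
    return (positive - negative) / (positive + negative)

# Invented citation differences for six pairs.
rc = rank_biserial([3, -1, 2, 0, 5, -2])
print(rc)
```

Because the statistic depends only on ranks, a single pair with an extreme citation difference moves it far less than it moves a difference of means. The function is undefined when every difference is zero.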

Throughout these analyses, we have seen that the gender disparity within similar man–woman pairs is small (small effect sizes), but significant (p-values close to 0). However, we should be cautious when interpreting those p-values. The statistical tests rely on the assumption of independent samples, but in our methodology one paper can be part of several statistical twins, so those pairs would not be independent. The independence violation results in narrower standard errors and, in turn, lower p-values. Nevertheless, the consistency of the gender asymmetries should not be underestimated.

The temporal dimension is fundamental when comparing citation counts, as the first-mover advantage plays a crucial role in scientific success37. Within similar man–woman pairs, the man’s paper is published first in 47.7% of the pairs, the woman’s paper in 41.3%, and approximately at the same time (the same year) in 11.0% of the pairs. These results point to a clear first-mover advantage by men.

First-mover advantage within similar pairs of papers

Given the above results, we now seek to confirm whether the time of publication is a main driver for the citation disparity and whether the first-mover advantage in publication affects men-led papers and women-led papers similarly. We define Δt = Ym − Yf as the year difference between the publication dates of man–woman pairs of similar papers and ΔC = cm − cf as their citation difference. We plotted the year difference Δt against the citation difference ΔC in Fig. 3b. We likewise produced ten analogous plots after categorizing the data into subfields by PACS number (shown in Supplementary Fig. 5) to control for variations between subfields. Note that for this analysis we impose no time restriction between the publication times of the two papers of each pair.

To verify that the disparity in citations is caused by the first-mover advantage, we first need to test whether a first-mover advantage in fact exists. If that is the case, when a man publishes first (Δt < 0) he should get more citations (ΔC > 0) on average, but when a woman publishes first (Δt > 0) she is the one who should get more citations (ΔC < 0) on average; that is, in Fig. 3b, quadrants Q2 and Q4 should be more populated than expected if we treated Δt and ΔC as independent random variables. Equivalently, we should observe a negative correlation between Δt and ΔC.

To test this hypothesis, we compared the empirical joint probability distribution of Δt and ΔC (Pemp(Δt, ΔC)) with the one that we would obtain if they were independent variables (Pnull(Δt, ΔC) = p(Δt)p(ΔC)) by computing the probability anomaly as:

$$P_{\mathrm{diff}}(\Delta_t,\Delta_C)=\frac{P_{\mathrm{emp}}(\Delta_t,\Delta_C)-P_{\mathrm{null}}(\Delta_t,\Delta_C)}{P_{\mathrm{null}}(\Delta_t,\Delta_C)}$$
(2)

The resulting values of Pdiff(Δt, ΔC) are shown in Fig. 3c and, as can be observed, they support the hypothesis of the first-mover advantage, since Q2 and Q4 present positive anomalies while Q1 and Q3 present negative ones. It is worth emphasizing that a positive (resp. negative) anomaly indicates higher (resp. lower) density of points with respect to a situation of no correlation between Δt and ΔC. To quantify this trend we computed the Pearson and Spearman correlations between Δt and ΔC, obtaining −0.13 and −0.34, respectively.
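The probability anomaly of Eq. (2) can be sketched for discrete values of Δt and ΔC as follows. The sketch assumes that the marginals are estimated from the same sample and reports only cells that contain at least one pair; any binning used for the heat map in Fig. 3c is not reproduced. The toy pairs are invented.

```python
from collections import Counter

def probability_anomaly(pairs):
    """pairs: list of (delta_t, delta_c). Returns {(delta_t, delta_c): P_diff}."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_t = Counter(dt for dt, _ in pairs)
    marginal_c = Counter(dc for _, dc in pairs)
    anomaly = {}
    for (dt, dc), count in joint.items():
        p_emp = count / n
        # Null model: delta_t and delta_c are independent.
        p_null = (marginal_t[dt] / n) * (marginal_c[dc] / n)
        anomaly[(dt, dc)] = (p_emp - p_null) / p_null
    return anomaly

# Toy data: the earlier paper usually collects more citations.
pairs = [(-1, 1), (-1, 1), (1, -1), (1, -1), (-1, -1), (1, 1)]
anomaly = probability_anomaly(pairs)
print(anomaly)
```

In this toy sample the cells with Δt < 0, ΔC > 0 and Δt > 0, ΔC < 0 are over-populated relative to independence, which is the signature of a first-mover advantage described above.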

Once the existence of the first-mover advantage has been confirmed, we need to test whether there exists an asymmetry in the relative advantage that men and women obtain when they publish first. If there is no asymmetry, the average number of citations that a woman obtains by publishing a certain number of years ahead of a man should be comparable to the number of citations that a man obtains in the equivalent situation.

To verify this, we compared the citation differences of Q2 with Q4 (pairs where the earlier paper received more citations) and Q1 with Q3 (pairs where the earlier paper received fewer citations) for each temporal difference; in other words, we compared the average absolute value \(|{\Delta }_{C}|\) of points from Q2 with the average \(|{\Delta }_{C}|\) of points from Q4 for each \(|{\Delta }_{t}|=1,2,\ldots\) separately (analogously for Q1 and Q3). To perform these comparisons, we used z-tests for difference of means for each year difference (see Methods). The results of the tests for the whole dataset, shown in Table 3, indicate that men have an asymmetric advantage, gaining comparatively more citations when they publish first. We obtain similar results for each subfield (see Supplementary Table 16). The exceptions are General Physics (PACS 00) and Interdisciplinary Physics & Related Studies (PACS 80), where women get an asymmetric advantage.
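The Q2 versus Q4 comparison can be sketched as below. The sketch assumes an unpooled (Welch-style) z-statistic for the difference of means and skips year gaps with fewer than two pairs in either quadrant; the exact test specification is the one given in Methods. The toy pairs are invented.

```python
from collections import defaultdict
from math import sqrt, erfc
from statistics import mean, variance

def first_mover_asymmetry(pairs):
    """pairs: list of (delta_t, delta_c). Compares |delta_c| in Q2 and Q4 per |delta_t|.

    Returns {year_gap: (z, two-sided p-value)}; z > 0 means men gain more.
    """
    q2, q4 = defaultdict(list), defaultdict(list)
    for dt, dc in pairs:
        if dt < 0 and dc > 0:      # man published first and leads in citations
            q2[-dt].append(dc)
        elif dt > 0 and dc < 0:    # woman published first and leads in citations
            q4[dt].append(-dc)
    results = {}
    for gap in sorted(set(q2) & set(q4)):
        a, b = q2[gap], q4[gap]
        if len(a) < 2 or len(b) < 2:
            continue
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        z = (mean(a) - mean(b)) / se
        results[gap] = (z, erfc(abs(z) / sqrt(2)))
    return results

# Toy data for a one-year gap.
pairs = [(-1, 4), (-1, 6), (-1, 5), (1, -2), (1, -3), (1, -1)]
results = first_mover_asymmetry(pairs)
print(results)
```

The same function applies to the Q1 versus Q3 comparison after flipping the sign conditions on ΔC.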

Table 3 Statistical tests of gender asymmetry in the first-mover advantage.

Researcher seniority as a temporal advantage

While we have verified that the first-mover advantage plays a relevant role in the citation disparities between genders at the microscopic level, the differences between similar pairs, even if significant, are fairly small. Therefore, the temporal advantage gained by individual papers published earlier than their statistical twins may not be enough to explain the visibility differences manifested in the centrality rankings shown in Fig. 3a. As mentioned above, there are group-level temporal disparities that should also be taken into account: women’s delayed arrival, their slower rate of arrival, and their higher dropout rate, captured in Fig. 1.

These factors can have dramatic effects on the distribution of seniority of researchers (see Fig. 4a), which is another potential source of inequality. As a researcher progresses through her career, she not only gathers citations, but also recognition, which in turn attracts more citations. As we observe in Fig. 4b, the proportion of male to female authors increases with career age, indicating a strong gender bias in the seniority distribution. This bias in the proportion of senior researchers is transferred to the ranking of centrality of papers (see Fig. 4c), which shows, on the one hand, that the higher ranks are occupied on average by older researchers, and on the other hand, that the average age of women authors is consistently lower throughout all ranks.

Fig. 4: Seniority distribution of researchers by gender.

a Number of men and women authors by their career age. b Proportion of men to women by career age. c Average age of men and women authors of papers in each top h% of degree centrality (number of citations). The inset shows the same result zooming in on the higher ranks.

This thorough analysis indicates that temporal advantages are critical factors in the emergence of gender inequalities. From the individual’s perspective, researchers that publish a result earlier gain the first-mover advantage. Men publish earlier more frequently and obtain an asymmetrical advantage when they do so. At the population level, historical disadvantages driven by the late arrival and higher dropout rate of women cause a deficit of female senior researchers, which may explain women’s low visibility in the citation network.

Historical trend in citation

Finally, we hypothesize that the physics community might have been less receptive to the contribution of women in the past compared to the present. To test this hypothesis, we measure the temporal evolution of the centrality differences (ΔC) between man–woman pairs by year and limit the publication time difference between the two papers to a trailing window of 3 years. Then, we compute the mean and standard error of ΔC for all the pairs within each window. For comparison, we perform an analogous computation for random samples of similar man-man pairs. In each time window, we matched the number of sampled man-man pairs with the number of similar man–woman pairs. We repeated the man-man computation 100 times independently and computed the average ΔC and the standard error, which we use as a baseline.
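The windowed computation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names, the representation of a pair as `(year_a, year_b, delta_c)`, and the reading that a pair belongs to a window when both of its papers fall inside it are all assumptions. The baseline returns the mean and standard error across the repeated draws, which is one possible reading of the text.

```python
import random
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Mean and standard error of the mean (SEM is 0.0 for a single value)."""
    n = len(values)
    sem = stdev(values) / sqrt(n) if n > 1 else 0.0
    return mean(values), sem


def pairs_in_window(pairs, end_year, width=3):
    """Delta-C of pairs whose two papers both fall in (end_year - width, end_year]."""
    start = end_year - width
    return [dc for (ya, yb, dc) in pairs
            if start < ya <= end_year and start < yb <= end_year]


def baseline(mm_pairs, n_samples, end_year, width=3, repeats=100, seed=0):
    """Mean and SEM of Delta-C over repeated random draws of man-man pairs.

    n_samples is the number of man-woman pairs in the same window, so that
    both groups are compared at equal sample size.
    """
    rng = random.Random(seed)
    pool = pairs_in_window(mm_pairs, end_year, width)
    k = min(n_samples, len(pool))
    if k == 0:
        return None  # no man-man pairs available in this window
    draws = [mean(rng.sample(pool, k)) for _ in range(repeats)]
    return mean_and_sem(draws)


# Hypothetical toy data: (year of paper a, year of paper b, Delta-C).
mw_pairs = [(2000, 2001, 2.0), (2001, 2001, 4.0), (1995, 2001, 10.0)]
window = pairs_in_window(mw_pairs, 2001)
print(mean_and_sem(window))
```

The third toy pair is excluded because its two papers are more than 3 years apart.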

Figure 3d shows the citation differences for man–woman pairs of similar papers over the years compared with the baseline given by man-man pairs of papers. The earlier man–woman pairs seem to present a higher disparity favoring men than later pairs, whereas the ΔC values for man-man pairs throughout the years are, as expected, consistently located around zero. After all, the similar man-man pairs were chosen randomly and there is no reason for one paper of the pair to have a higher or lower citation count than the other. The early fluctuations in Fig. 3d are due to sample size (see Supplementary Fig. 6), and the negative peak of 2002 is caused by an extremely influential paper led by a woman that laid many of the theoretical foundations of the subfield of Network Science38. To measure the decreasing trend in the man–woman pairs, we ran a Mann–Whitney U test comparing the ΔC of man–woman pairs published before and after 1995, obtaining a p-value of \(1.78\times 1{0}^{-58}\). Hence, as hypothesized, the man–woman pairs published before 1995 show a significant disparity favoring men when compared to those published after 1995. We obtained qualitatively similar results when we performed the computation considering only the citations received up to 5 years after publication for each paper.

Discussion

The primary objective of this research was to identify gender disparities in physics focusing on five topics of interest: productivity, author order analysis, self-citation analysis, and the comparison of citations for pairs of similar papers. Our study thus contributes to the current body of literature by comprehensively analyzing the citation patterns of men and women in physics. We assembled information about all papers published by the American Physical Society from 1894 to 2009. Using a technique that combines name and image recognition, we inferred the gender of the primary authors of papers and, to study potential gender biases, we looked for statistically significant differences in the citation patterns of papers written by men and women primary authors.

Despite all the efforts to avoid any biases in our analysis, some caveats should be considered. We have combined name and image inference to identify the gender of the scholars. Even with this careful examination, we cannot infer the gender of authors who have only initials as their first names. Another caveat is related to ethnicity, as we cannot identify the majority of Asian names originating from Korea and China20 (see Supplementary Table 1). However, we can safely argue that this lack of gender identification likely affects both genders similarly. Another sensitive step of our data processing pipeline is name disambiguation, used to identify all the papers published by a given author. Although we have used various criteria to disambiguate names, there still might be errors in identifying unique authors and these errors may affect minorities, which have lower numbers of instances in the data. There are other factors that can affect citation and may not be determined by assessing similar papers. For example, papers that are novel and ground-breaking or interdisciplinary in their nature may contain citations from outside physics that make them less similar to other established papers, and those are likely not being adequately assessed in our analysis. In this case, we acknowledge that the focus of our analysis is on those scholars who work predominantly on mainstream physics.

The academic community tends to evaluate scientists based on the behavior of the majority, which in physics is predominantly the behavior of white, Western men. This evaluation, at its core, is problematic and can cause discrimination against other groups that are historically, socially, or politically discriminated against. In such cases, more attention and care should be given to women and other minorities who are more likely to suffer from such historical disadvantages. Once the system moves towards a more diverse representation, its core values will no longer be determined by only one type of majority.

The structure of the citation network can influence the future citations and recognition that papers receive. Through reading papers, scholars often follow cited papers to read and cite previous works. If papers written by women are under-represented in influential positions of the citation network, this will affect their future visibility even if they are cited adequately compared to their statistical twins. This phenomenon, also known as success-breeds-success39, in addition to cumulative advantages and the first-mover advantage37, can be consequential for the success and recognition of scholars, their visibility33, future success, and the scientific community’s perception of their work40.

Science, at its core, is a collaborative process. Through collaboration and research visits, scientists meet, ideas spread, and the foundations are laid for future collaborations. Mobility hugely impacts the centrality of scholars in their collaboration networks41. There are implicit factors that can indirectly affect the participation of women in scientific collaborations. For example, geographical distance is more likely to affect women due to their family responsibilities, restrictions on travel during pregnancy, and breastfeeding, to name a few reasons. Women might not be welcomed in certain social events that are predominantly preferred by men or for those with no family responsibilities. Lack of chemistry or shyness in interacting with another gender might also make women less likely to be invited for research visits and collaborations. We note that women are not the only group who suffer from geographical restrictions, as other forms of discrimination or simply high traveling costs can affect the collaboration of scholars from Muslim and developing countries.

Diversity has a crucial role in shaping and spreading new ideas. For example, one can safely argue that many recent publications that aim to understand the inequality and biases in academia and other social domains are directly related to the boost in participation of women and minorities. However, it is also known that despite their contributions to innovative research, minorities do not reap the benefits of their innovation when compared with majorities42. In future work, intersectional inequalities should be studied at large scale by considering the intersection of multiple disadvantaged categories such as gender, ethnicity, and race.

Conclusion

In sum, we found that despite the rise of women’s participation in physics in recent years, the rate of entry of new women into the field is still much slower than for men. Women tend to be less productive than men in their mid-career, and they tend to have a higher dropout rate over their academic careers. Moreover, in agreement with previous works, we found that men tend to cite their own previous works with more frequency than women, penalizing the visibility of women and their potential for academic promotion. This disparity in visibility is also manifested in the under-representation of women at the top ranks of both degree and PageRank centrality of the citation network, which implies a disadvantage on both a local scale (lower number of citations) and a global scale (peripheral location within the network).

When assessing pairs of similar papers, we found that the first-mover advantage drives the citation disparity significantly. These results combined suggest that the overall disparity in the citation network is a result of cumulative advantages and the first-mover effect that men have in physics. This cumulative advantage could create implicit biases that should be tackled by appropriate policies that foster the participation of women and other minorities.

Methods

Assessing similar pairs of papers

The main objective of this paper is to compare pairs of similar papers in an unbiased fashion. The similarity analysis is based on the concept of bibliographic coupling strength Nij of pairs of articles (i, j), which is defined as the number of common articles cited by both i and j21,22. To overcome the shortcomings of the most commonly used normalized versions of Nij (the Jaccard index and fractional counting, described in Supplementary Methods), we identify couples of similar papers by looking both at the outgoing references of the pair and the incoming citations of the articles they cite. In particular, we perform a statistical test using the hypergeometric distribution as a null model and detect pairs of papers whose set of common outgoing citations has a very low probability of having been generated by chance43,44. In Supplementary Fig. 2 we present a diagram of this methodology, which is explained below in detail.

First, the citation network is built for each physics subfield (the first two digits of PACS), and then each paper in the citation network is further labeled by the gender of its primary author. After establishing the citation network, two sets \({S}_{A}^{k}\) and \({S}_{B}^{k}\) are defined: \({S}_{B}^{k}\) includes all articles that are cited k times, and \({S}_{A}^{k}\) includes all articles that cite any element in \({S}_{B}^{k}\). Notice that each publication may belong to one set, to the other or to both.
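As an illustration of this step, the sketch below builds the two sets from a list of citation edges. The function name `build_sets`, the `(citing, cited)` edge format, and the toy identifiers are assumptions made for the example; the subfield filtering and gender labeling are omitted.

```python
from collections import defaultdict


def build_sets(citations, k):
    """Build S_A^k and S_B^k from (citing, cited) edges of one subfield.

    Returns (s_a, s_b, refs), where refs[i] is the set of elements of
    S_B^k cited by paper i, so that d_i = len(refs[i]).
    """
    edges = set(citations)  # ignore duplicated edges
    in_degree = defaultdict(int)
    for _citing, cited in edges:
        in_degree[cited] += 1

    # S_B^k: all articles cited exactly k times.
    s_b = {paper for paper, deg in in_degree.items() if deg == k}

    # S_A^k: all articles citing at least one element of S_B^k.
    refs = defaultdict(set)
    for citing, cited in edges:
        if cited in s_b:
            refs[citing].add(cited)
    return set(refs), s_b, dict(refs)


# Hypothetical toy network: paper "x" is cited twice and also cites "z".
toy_edges = [("a", "x"), ("b", "x"), ("a", "y"), ("c", "y"),
             ("b", "z"), ("x", "z")]
s_a, s_b, refs = build_sets(toy_edges, k=2)
print(sorted(s_a), sorted(s_b))
```

In the toy network, paper `x` ends up in both sets, which is the situation the text refers to.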

Then, we build all possible pairs \(i,j\in {S}_{A}^{k}\). In order to quantify the similarity between i and j, we compute the probability of i and j both referencing a certain number of publications using the hypergeometric distribution:

$$H(X| {N}_{B}^{k},{d}_{i},{d}_{j})=\frac{\left(\begin{array}{c}{d}_{i}\\ X\end{array}\right)\left(\begin{array}{c}{N}_{B}^{k}-{d}_{i}\\ {d}_{j}-X\end{array}\right)}{\left(\begin{array}{c}{N}_{B}^{k}\\ {d}_{j}\end{array}\right)}$$
(3)

where \({N}_{B}^{k}=| {S}_{B}^{k}|\) and \({d}_{i}\), \({d}_{j}\) are the numbers of elements in \({S}_{B}^{k}\) that publications i and j cite, respectively. Supplementary Fig. 2 shows a diagram that illustrates the meaning of these variables. Notice that if \({d}_{i}\) and \({d}_{j}\) are interchanged, the value of H remains the same. Finally, X is the number of overlapping citations. The term \(\left({{N}_{B}^{k}}\atop{{d}_{j}}\right)\) corresponds to all the possible ways of choosing \({d}_{j}\) publications from the set \({S}_{B}^{k}\); \(\left({{d}_{i}}\atop{X}\right)\) denotes the number of ways one can choose exactly X publications from the \({d}_{i}\) papers that i cites; and \(\left({{N}_{B}^{k}-{d}_{i}}\atop{{d}_{j}-X}\right)\) is the number of ways the \({d}_{j}-X\) papers cited by j and not by i can be chosen from the \({N}_{B}^{k}-{d}_{i}\) elements of \({S}_{B}^{k}\) that i does not cite. Intuitively, this hypergeometric distribution can be understood as an urn model with \({N}_{B}^{k}\) balls, such that \({d}_{i}\) of them are good balls and the rest are bad balls. H is then the probability of obtaining exactly X good balls when retrieving \({d}_{j}\) balls from this urn without replacement.
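Equation (3) can be evaluated directly with exact binomial coefficients. The following is a minimal sketch using only the Python standard library; the function name `hypergeom_pmf` and the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, n_b, d_i, d_j):
    """H(X | N_B^k, d_i, d_j) from Eq. (3).

    Probability that exactly x of the d_j references of paper j fall among
    the d_i references of paper i, when both are drawn from n_b candidates.
    """
    if x < 0 or x > min(d_i, d_j) or d_j - x > n_b - d_i:
        return 0.0  # impossible overlap size
    return comb(d_i, x) * comb(n_b - d_i, d_j - x) / comb(n_b, d_j)


# Hypothetical example: 10 candidate references, i cites 3, j cites 4.
print(hypergeom_pmf(2, n_b=10, d_i=3, d_j=4))
# Interchanging d_i and d_j leaves the value unchanged.
print(hypergeom_pmf(2, n_b=10, d_i=4, d_j=3))
```

Using integer binomials avoids rounding in the intermediate terms; only the final division is done in floating point.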

Now, if i and j have actually cited \({N}_{ij}^{k}\) common papers of in-degree k, the cumulative probability of \(X < {N}_{ij}^{k}\) provides a measure of how probable it is that the size of their set of overlapping citations can be explained by randomness:

$${p}_{ij}(k)=\mathop{\sum }\limits_{X=0}^{{N}_{ij}^{k}-1}H\left(X| {N}_{B}^{k},{d}_{i},{d}_{j}\right)$$
(4)

The higher \({p}_{ij}(k)\) is, the less probable it is that an overlap of size \({N}_{ij}^{k}\) is due to chance. Therefore, we devise a measure of similarity as follows:

$${q}_{ij}(k)=1-{p}_{ij}(k)$$
(5)

Notice that \({q}_{ij}(k)\) is the probability that two randomly selected papers i and j have a bibliographic coupling strength towards articles in \({S}_{B}^{k}\) greater than or equal to \({N}_{ij}^{k}\). This computation is repeated for all k and the different values of \({q}_{ij}(k)\) are stored. The similarity of the couple (i, j) is measured by the minimum \({q}_{ij}(k)\) over all possible values of k:

$${{q}_{ij}\left(k\right)}_{\min }=\mathop{\min }\limits_{k}\{{q}_{ij}(k)\}$$
(6)

Publications i and j are considered similar if \({{q}_{ij}(k)}_{\min } < {p}^{* }\), where \({p}^{* }\) is a threshold value. We have chosen a threshold of \({p}^{* }=0.001\), which provides a good balance between similarity sensitivity and sample size. In Supplementary Methods and Supplementary Fig. 3, we detail the criteria for adopting this threshold.
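Equations (4) to (6) and the threshold test can be put together in a short sketch. The names `q_value`, `min_q`, and `is_similar`, as well as the dictionary that maps each k to the tuple of observed quantities, are assumptions made for illustration. The sketch sums the upper tail of the hypergeometric distribution directly, which is mathematically equal to 1 − p but avoids the loss of precision that the subtraction causes when q is very small.

```python
from math import comb


def q_value(n_common, n_b, d_i, d_j):
    """q_ij(k) = P(X >= n_common) under the hypergeometric null model."""
    upper_tail = sum(comb(d_i, x) * comb(n_b - d_i, d_j - x)
                     for x in range(n_common, min(d_i, d_j) + 1))
    return upper_tail / comb(n_b, d_j)


def min_q(per_k):
    """Minimum q_ij(k) over k; per_k maps k -> (N_ij^k, N_B^k, d_i, d_j)."""
    return min(q_value(*values) for values in per_k.values())


def is_similar(per_k, p_star=0.001):
    """Pair (i, j) is declared similar if the minimum q is below p*."""
    return min_q(per_k) < p_star


# Hypothetical pair: at k = 2 the two papers share both of their references
# among 1000 rarely cited candidates; at k = 50 the overlap is unremarkable.
per_k = {2: (2, 1000, 2, 2), 50: (2, 10, 3, 4)}
print(min_q(per_k), is_similar(per_k))
```

The toy pair passes the threshold through its k = 2 overlap alone, which mirrors the niche-area case discussed in the text.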

We take the maximum similarity (minimum qij(k)) across k values because similarity can be manifested in the reference lists in very different ways. For example, two similar papers of a niche area could share just one or two references that almost no other publication cites. In the other extreme, two similar papers of an interdisciplinary or generalist field could share many widely cited references, so the probability that they were included only due to their popularity is very low. Both of these situations would lead to high similarity values. One would present a high similarity (low qij(k)) only for low k, while the other would do so only for high k. Since the qij(k) are p-values, the statistical significance of each of them should be tested independently. By only testing the minimum we are not disregarding the remaining k, as there may be other k for which qij(k) is low enough to pass the p* threshold. Instead, following Ciotti et al.44, we are simplifying the analysis, as for two papers to be considered similar, it is enough for one qij(k) to pass the p* test.

To verify the accuracy of our approach, we manually inspected several pairs of papers with validated similarity measurements. For this test, we set a low threshold value, \({p}^{* }=1{0}^{-6}\), and applied a constraint of maximum publication year difference of 3 years. We validated the similarity between the two papers through the inspection of keywords, titles, and citation activities. For instance, papers45 and46, with \({{q}_{ij}(k)}_{\min }=1.0056\times 1{0}^{-8}\), present some connection between their main ideas and share a common author. Additionally, a large proportion of their citation activities align. Another similar pair is formed by articles47 and48 with \({{q}_{ij}(k)}_{\min }=4.0735\times 1{0}^{-12}\), which show extremely similar citation activities and deal with similar topics. As a final example,49 and50, with \({{q}_{ij}(k)}_{\min }=2.5139\times 1{0}^{-7}\), share topic, citation activities, and a collaborating author. It is worth emphasizing that, due to the highly restrictive \({p}^{* }\), some of these statistically validated pairs of similar papers share a common author, which is a strong verification of our algorithm.

In a nutshell, the hypergeometric probability testing compares how significant the overlapping outflow of citations is for two papers compared to what we expect from the in-degree and out-degree of the citation network. Using this technique, we are able to compare papers that are inherently similar in their subject field by not only comparing their overlapping references, but also accounting for variations in the citations received by each reference. Since we control both for the outgoing citations of the pair and the incoming citations of the commonly cited papers, the comparison is robust and unbiased.

Authorship order two-proportion z-test

We denote the total populations of men and women as \({N}_{m}\) and \({N}_{w}\), and the total numbers of men and women first authors as \({n}_{m}\) and \({n}_{w}\), respectively. We further define \({p}_{m}=\frac{{n}_{m}}{{N}_{m}},{p}_{w}=\frac{{n}_{w}}{{N}_{w}},p=\frac{{n}_{m}+{n}_{w}}{{N}_{m}+{N}_{w}}\) and perform the two-proportion z-test as follows:

$$z=\frac{{p}_{m}-{p}_{w}}{\sqrt{p(1-p)\left(\frac{1}{{N}_{m}}+\frac{1}{{N}_{w}}\right)}}$$
(7)
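Equation (7) translates directly into code. The sketch below is only an illustration of the formula; the function name and the counts in the example are hypothetical and do not come from the data.

```python
from math import sqrt


def two_proportion_z(n_m, big_n_m, n_w, big_n_w):
    """Two-proportion z-statistic of Eq. (7).

    n_m, n_w: numbers of men and women first authors.
    big_n_m, big_n_w: total populations of men and women (N_m, N_w).
    """
    p_m = n_m / big_n_m
    p_w = n_w / big_n_w
    p = (n_m + n_w) / (big_n_m + big_n_w)  # pooled proportion
    return (p_m - p_w) / sqrt(p * (1 - p) * (1 / big_n_m + 1 / big_n_w))


# Hypothetical counts: 60 of 100 men and 40 of 100 women are first authors.
print(two_proportion_z(60, 100, 40, 100))
```

A positive value means that the proportion of first authors is higher among men than among women.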

Calculating differences in received citations

Let \({N}_{mw}\) denote the cardinality of the set of all pairs (m, w), where m and w denote publications by a man and a woman primary author, respectively, that share at least one reference, and let \(M({p}^{* })\) be the subset of all similar pairs validated under \({p}^{* }\). Let \({c}_{m}\) and \({c}_{w}\) indicate the number of citations received by m and w; the average citation difference \({c}_{d}\) can then be computed as

$${c}_{d}({p}^{* })=\frac{1}{| M({p}^{* })| }\mathop{\sum }\limits_{x=1}^{| M({p}^{* })| }{\left({c}_{m}-{c}_{w}\right)}_{x}$$
(8)

where x denotes the index of pairs \((m,w)\in M({p}^{* })\). Since we are interested in comparing pairs of papers, we normalize this average difference to obtain Cohen’s \({d}_{{\rm{avg}}}\), a widely used measure of effect size for difference of means51 (we actually use the unbiased version of Cohen’s d, Hedges’ g, but we will keep the d notation to emphasize its interpretation as average difference):

$$d({p}^{* })=\frac{{c}_{d}({p}^{* })}{\sqrt{\frac{{\sigma }_{{c}_{w}}^{2}+{\sigma }_{{c}_{m}}^{2}}{2}}}$$
(9)

To assess the significance of this difference we perform a difference of means z-test with \({H}_{0}:{\bar{c}}_{m}={\bar{c}}_{w}\), with the z-statistic defined as

$$z=\frac{{\bar{c}}_{m}-{\bar{c}}_{w}}{\sqrt{\frac{{\sigma }_{{c}_{m}}^{2}}{| M({p}^{* })| }+\frac{{\sigma }_{{c}_{w}}^{2}}{| M({p}^{* })| }}}$$
(10)

Hence, a positive z-score indicates that papers by men display a higher degree centrality (more citations) than expected under the null hypothesis.
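Equations (8) to (10) can be computed together from the two citation lists of the validated pairs. The sketch below is a minimal illustration: the function name and the toy counts are hypothetical, the use of sample variances is an assumption, and the small-sample correction that turns Cohen's d into Hedges' g is not applied.

```python
from math import sqrt
from statistics import mean, variance


def citation_effect(c_m, c_w):
    """Return (c_d, d, z) for paired citation counts.

    c_m[x] and c_w[x] are the citations received by the man-led and the
    woman-led paper of the x-th similar pair.
    """
    n = len(c_m)
    if n != len(c_w) or n < 2:
        raise ValueError("need at least two pairs of equal-length lists")
    c_d = mean(m - w for m, w in zip(c_m, c_w))        # Eq. (8)
    var_m, var_w = variance(c_m), variance(c_w)        # sample variances (assumption)
    d = c_d / sqrt((var_w + var_m) / 2)                # Eq. (9), uncorrected
    z = (mean(c_m) - mean(c_w)) / sqrt(var_m / n + var_w / n)  # Eq. (10)
    return c_d, d, z


# Hypothetical citation counts for three similar man-woman pairs.
print(citation_effect([10, 12, 14], [8, 10, 12]))
```

If both lists have zero variance, the denominators vanish and the function raises `ZeroDivisionError`; a production version would need to handle that case explicitly.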

Computing temporal citation differences

We compared the citation differences of Q2 with Q4 (pairs where the earlier paper received more citations) and Q1 with Q3 (pairs where the earlier paper received fewer citations) for each temporal difference; in other words, we compared the average absolute value \(|{\Delta }_{C}|\) of points from Q2 with the average \(|{\Delta }_{C}|\) of points from Q4 for each \(|{\Delta }_{t}|=1,2,\ldots\) separately (analogously for Q1 and Q3). To perform these comparisons, we used z-tests for difference of means for each year difference:

$$z=\frac{\overline{| {\Delta }_{C}^{{Q}_{i}}| }-\overline{| {\Delta }_{C}^{{Q}_{j}}| }}{\sqrt{\frac{{\sigma }_{{Q}_{i}}^{2}}{N({Q}_{i})}+\frac{{\sigma }_{{Q}_{j}}^{2}}{N({Q}_{j})}}}$$
(11)

In this test, we evaluate the mean (\(\overline{| {\Delta }_{C}^{{Q}_{i}}| }\)) and the standard deviation (\({\sigma }_{{Q}_{i}}\)) of \(|{\Delta }_{C}|\) for two subsets of quadrants \({Q}_{i}\) and \({Q}_{j}\). \(N({Q}_{i})\) is the number of data points in quadrant i (number of similar pairs). We run the z-test for (i, j) = (1, 3) and (i, j) = (2, 4).
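The quadrant comparison of Eq. (11) can be sketched as follows. The sign conventions are assumptions: each pair is a point \(({\Delta }_{t},{\Delta }_{C})\) with \({\Delta }_{t}={t}_{m}-{t}_{w}\) and \({\Delta }_{C}={c}_{m}-{c}_{w}\), and quadrants follow the usual counterclockwise numbering, which is consistent with Q2 and Q4 containing the pairs where the earlier paper received more citations. Function names and toy points are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def quadrant(dt, dc):
    """Standard quadrant number of the point (dt, dc); None on an axis."""
    if dt > 0 and dc > 0:
        return 1
    if dt < 0 and dc > 0:
        return 2
    if dt < 0 and dc < 0:
        return 3
    if dt > 0 and dc < 0:
        return 4
    return None


def quadrant_z(points, q_i, q_j, abs_dt):
    """z-statistic of Eq. (11) comparing |Delta_C| in Q_i and Q_j at fixed |Delta_t|."""
    a = [abs(dc) for dt, dc in points if abs(dt) == abs_dt and quadrant(dt, dc) == q_i]
    b = [abs(dc) for dt, dc in points if abs(dt) == abs_dt and quadrant(dt, dc) == q_j]
    if len(a) < 2 or len(b) < 2:
        return None  # not enough points to estimate a variance
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0:
        return None  # degenerate case: no spread in either quadrant
    return (mean(a) - mean(b)) / denom


# Hypothetical points with a 1-year publication gap.
toy = [(-1, 4), (-1, 6), (1, -1), (1, -3), (2, 5)]
print(quadrant_z(toy, 2, 4, abs_dt=1))
```

Under these conventions, a positive value for (2, 4) means that the advantage of publishing first is larger when the earlier paper is led by a man.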