Estimating the size of fields in biomedical sciences

ABSTRACT Scientific research output has increased exponentially over the past few decades, but not equally across all fields of study, and we lack clear methods for estimating the size of any given field of research. Understanding how fields grow, change, and are organized is essential to understanding how human resources are allocated to the investigation of scientific problems. In this study, we estimated the size of certain biomedical fields from the number of unique author names appearing in field-relevant publications in the PubMed database. Focusing on microbiology, where the size of fields is often associated with those who work on a particular microbe, we found large differences in the size of its subfields. We found that plotting the number of unique investigators as a function of time can show changes consistent with growing or shrinking fields. In general, the number of unique author names associated with a particular microbe correlated with the number of disease cases attributed to that microbe, suggesting that the microbiology field workforce is deployed in a manner consistent with the medical importance of the microbe in question. We propose that unique author counts can be used to measure the size of the workforce in any given field, analyze the overlap of the workforce between fields, and compare how the workforce correlates to available research funds and the public health burden of a field.
IMPORTANCE Science and its individual fields are growing at spectacular rates along with the number of papers being generated each year. However, we lack methods to investigate the size of these fields, often relying on anecdotal knowledge of which fields are “hot topics” or oversaturated. Thus, we developed a bibliometric method analyzing authorship information from PubMed to estimate the size of fields based on unique author counts. Our major finding is that unique author counts serve as an efficient measurement of the size of a given field. Additionally, the size of a biomedical science field correlates to its public health burden when compared to case numbers. This method allows us to compare growth rates, workforce distribution, and the allocation of resources between fields to understand how scientific fields self-regulate. These insights can, in turn, help guide policymaking, for example, in funding allocation, to ensure fields are not neglected.

Today, science is organized into fields, and many fields have several subfields. We will use the fields of microbiology and immunology as examples, as these are the ones we are most familiar with. Both microbiology and immunology are subfields of the larger field of biomedical sciences, which is, in turn, a subfield of biology (1, 2). Hence, fields constitute subgroupings of scientists working on discrete problems, often within fuzzy epistemic boundaries. Fields are also the sociological units by which science is organized since they constitute a major source of friends and personal contacts for scientists (3). Herrera et al. used network analysis to map the connectivity and size among fields within physics (3). Chavalarias and Cointet developed a method to infer phylomemetic patterns from the published literature and used its density to ascertain the growth and decline of fields (4). The question of field size is important for understanding how human resources are invested in areas of science and could be significant for scientific progress given that large fields may stymie the development of new ideas and concepts (5).
Despite the importance of field organization to science, we could find little or no information on the size of fields. Although scientists anecdotally know that some fields are larger than others, remarkably little has been done to quantitate the size of fields. This is an important problem because the size of a field is a measure of how much human capital is devoted to a particular problem. For example, when the COVID-19 pandemic began in 2019, it may have been useful to know the number of scientists with experience in coronavirus or viral vaccines as this would have provided a measure of human resources initially available to confront such a threat. Knowing the size of fields is also important when considering the efficient allocation of scarce resources. However, estimating the size of fields is not an easy task. Scientists move between fields and often engage in cross-field research, creating fuzzy boundaries that defy easy categorization. Assuming one can identify a measure for field size, there are other obstacles to accurate enumeration. First, many scientists who publish in this journal work on various problems simultaneously and, therefore, may belong to more than one field at any given time. In this regard, the work of these authors could fit within the fields of immunology or microbiology, or their interface, depending on how their contributions are assessed. Second, fields evolve with time, with some increasing in size and others shrinking. In this study, we approach the problem of field size with a bibliometric method in which unique author names are associated with a subject, and we have developed software that allows one to estimate the size of fields. We use microbiology to explore this topic since this is a subfield of biomedical sciences that is sectioned by the microbes studied (2).
We hypothesize that the size of a given field can be estimated by counting the total unique authors in published articles of the said field. In this study, we estimated the size of subfields within the field of microbiology by counting the number of unique names identified with specific microbes and found a large variation. We anticipate that the number of papers associated with a topic is also a measure of the size of a field but predict that counting authors rather than papers will provide a better estimate of field size since each individual is unique, and fields are composed of people, not papers. For example, a publication describing a microbe interacting with a macrophage could belong to the fields of microbiology, immunology, and cell biology, while an author of the paper is a unique entity who is potentially traceable by name or ORCID. Furthermore, focusing on individuals mitigates confounders arising from differences in laboratory productivity. For example, a field composed of 100 laboratories that each produces one paper per year is larger than a field composed of one laboratory that produces 100 papers per year, but this distinction could not be made by only counting the research output.
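The 100-laboratory example above can be made concrete with a minimal sketch. The `Paper` record, the two-author laboratories, and the placeholder names are assumptions introduced purely for illustration; they are not data or code from this study.

```python
from collections import namedtuple

# Hypothetical record type: one paper with its author byline.
Paper = namedtuple("Paper", ["pmid", "authors"])

def paper_count(papers):
    """Field size measured as research output."""
    return len(papers)

def unique_author_count(papers):
    """Field size measured as the number of distinct author names."""
    return len({name for paper in papers for name in paper.authors})

# Field A: 100 laboratories, one paper each, two illustrative authors per lab.
field_a = [Paper(i, (f"PI {i}", f"Trainee {i}")) for i in range(100)]
# Field B: one laboratory publishing 100 papers with the same two authors.
field_b = [Paper(i, ("PI 0", "Trainee 0")) for i in range(100)]

print(paper_count(field_a), unique_author_count(field_a))  # 100 200
print(paper_count(field_b), unique_author_count(field_b))  # 100 2
```

Both toy fields have identical paper counts, so only the author-based measure separates them.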

MATERIALS AND METHODS
A given field was denoted by a search term, which was used to query the PubMed database (https://pubmed.ncbi.nlm.nih.gov/). Basic PubMed searches were used, in which the query is matched to the title, abstract, and keywords of articles. All retrieved publications were then downloaded using Entrez Direct, and a list of all authors was taken from the recorded author bylines (6). Unique names were determined by the first and last names documented in the PubMed database. Data were only collected for articles with a publication date and authors with a recorded first and last name entry. Funding information was collected from NIH RePORTER (https://reporter.nih.gov/) with search terms matching those of PubMed queries. Case burden information was gathered from literature and CDC national reporting (7, 8). Author information was parsed and analyzed in R (version 4.3.0) (9).
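The counting rules described above (skip articles without a publication date, skip authors lacking a first or last name, and treat the first and last name pair as the identity) can be sketched as follows. The study's analysis was done in R; this Python version is only an illustrative sketch, and the record layout (`year`, `authors`, `first`, `last`) and the names in the toy data are assumptions, not the actual PubMed export format.

```python
from collections import defaultdict

def unique_authors_per_year(records):
    """Map publication year -> set of unique 'Last, First' author names."""
    by_year = defaultdict(set)
    for rec in records:
        year = rec.get("year")
        if year is None:
            continue  # article has no publication date: excluded
        for author in rec.get("authors", []):
            first = (author.get("first") or "").strip()
            last = (author.get("last") or "").strip()
            if first and last:  # both name parts are required
                by_year[year].add(f"{last}, {first}")
    return dict(by_year)

# Toy records with made-up names, standing in for parsed PubMed results.
records = [
    {"year": 2020, "authors": [{"first": "Ada", "last": "Example"},
                               {"first": "Ben", "last": "Sample"}]},
    {"year": 2020, "authors": [{"first": "Ada", "last": "Example"},
                               {"first": "", "last": "Consortium"}]},
    {"year": None, "authors": [{"first": "Cy", "last": "Undated"}]},
    {"year": 2021, "authors": [{"first": "Ben", "last": "Sample"}]},
]

field_size = {y: len(names) for y, names in unique_authors_per_year(records).items()}
print(field_size)  # {2020: 2, 2021: 1}
```

An author who appears on several papers in the same year is counted once for that year, which is what distinguishes this measure from a byline tally.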

Estimating field size by counting unique authors
We first estimated the size of given fields by counting all unique authors that appear in PubMed-deposited articles. We found that each of these fields had a general upward trend in the number of unique authors over time and that the trends for individual species searches closely resembled those of the genus-level searches (Fig. 1A; Tables 1 and 3). We next measured field size by the total number of papers published within a given field each year and found similar overall trends (Fig. 1B). To investigate how closely the number of unique authors in a field correlated to the number of papers published in any given year, we directly compared unique authors and total papers. We found that there is a strong overall correlation (0.98 via Pearson and Spearman) between authors and publications, with the strength of the correlation in individual fields showing more variation (Fig. 1C and 2). While the number of unique authors in a field is clearly reliant on publications being generated in the said field, we believe that author counts give a more detailed picture of a field's growth. Counting authors accounts for trainees, collaborations, etc., which would otherwise be missed by counting only the publications themselves.
To ensure that our workflow accurately captured a snapshot of a field, we compared the known publication records of three individual authors against the results of the automated search. As expected, we found that the Entrez Direct PubMed search was a reliable method for returning relevant articles (Table 2). We also noted that query formation is an essential step in the analysis as even slight differences in a query can drastically alter the list of returned articles, especially when dealing with larger fields (Fig. S6B). Finally, we investigated fitting both linear and exponential growth models to each of the various analyzed fields and found that exponential growth better explains almost all of them (Fig. S1). There are clear exceptions to this, though, such as the simian vacuolating virus 40 (SV40) field, which experiences both growth and decline with time.
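For readers unfamiliar with the two coefficients reported above, the following self-contained sketch shows how Pearson and Spearman correlations between yearly paper counts and unique author counts can be computed. The yearly values are made-up toy numbers, and the function names are our own; this is not the analysis code used in the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy yearly counts for one hypothetical field.
papers_per_year = [10, 20, 40, 80]
authors_per_year = [35, 60, 150, 260]

print(round(pearson(papers_per_year, authors_per_year), 3))
print(round(spearman(papers_per_year, authors_per_year), 3))
```

Pearson measures how linear the relationship is, whereas Spearman only asks whether the two series rise and fall together, so reporting both guards against a few very large years dominating the result.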
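One simple way to compare linear and exponential growth, consistent with the description of fitting linear regressions to either the raw or the log-transformed field sizes, is sketched below. The series is synthetic and exactly exponential by construction, and `linear_fit` is a hypothetical helper written for this illustration. Note that R-squared values computed on raw and log scales are not strictly comparable, so this should be read as a rough diagnostic only.

```python
from math import exp, log

def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot

# Synthetic field: 100 authors in year 0, growing at a rate of 0.2 per year.
years = list(range(10))
sizes = [100 * exp(0.2 * t) for t in years]

# Linear model: size regressed directly on year.
_, _, r2_linear = linear_fit(years, sizes)
# Exponential model: log(size) regressed on year; the slope is the growth rate.
_, growth_rate, r2_exponential = linear_fit(years, [log(s) for s in sizes])

print(round(growth_rate, 3), r2_exponential > r2_linear)
```

A field such as SV40, which grows and then declines, would fit neither model well, which is one way such exceptions become visible.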

Field size correlates to disease burden
The number of cases attributed to a particular disease is a measure of the importance of that disease to society. Knowing the size of the fields working on specific diseases is a measure of the resources invested in studying a disease, which provides a mechanism for evaluating the consistency of resource allocations. We first investigated whether the total amount of funding for a given pathogen correlated to the size of a given field. We found that, generally, the increase in funding outpaced the increase in authors for all six of our fungal fields, resulting in higher funding in dollars per author for the fields over time (Fig. 3A; Table 3). This trend remained largely the same after adjusting for inflation. Histoplasma represented an exception to this trend, seeing a large spike in funding after the year 1999. There was no sudden outbreak of histoplasmosis at this time; however, the NIH budget was roughly doubled between 1998 and 2003 before flattening again, which may account for this sudden burst in Histoplasma funding (10, 11).
Scientists mostly study pathogens and diseases of interest to them and for which resources are allocated, so we next investigated whether fields naturally stabilize to sizes reflecting the case burden of the disease. The WHO recently released a list of fungal pathogen priorities, and we compared the field size to the relative priority of each fungal disease. As expected, a larger proportion of the workforce is committed to critical priority pathogens, though the high-priority field is experiencing rapid growth (Fig. 3B and C).
To investigate more closely, we compared the size of fields working on various fungal diseases to the burden of disease caused by that fungus. Unfortunately, current reporting of fungal diseases is limited and not standardized. However, we were able to obtain reliable estimates of various fungal disease diagnoses for 2 years and compare the case burden to the size of fields. We found that, in both years, the number of authors reporting on a field had a slight positive correlation with case burden (Fig. 3D). Interestingly, when we compared funding per author to case burden, we found no strong correlation between the two (Fig. 3E).
We next investigated whether these trends held true outside of fungal diseases by expanding our search to 10 major human diseases. Broadly, we observed similar results to the fungal fields: the size of each field continually increased, funding per author was generally consistent, and neither field size nor funding correlated strongly with diagnosed case burden (Fig. 4). Several spikes in funding are noticeable after major events involving particular diseases, such as the 2018 salmonellosis outbreak and the 2016 Zika outbreak. For the most part, however, the workforce attributed to each category has been steadily increasing. The largest effects on the workforce, as one would expect, seem to follow outbreaks of diseases uncommon or previously unseen in the US (Zika outbreak in 2016, novel coronavirus in 2019, etc.).
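The dollars-per-author measure and its inflation adjustment reduce to simple arithmetic, sketched here under stated assumptions. All funding totals, author counts, and index values below are invented for illustration (they are not real NIH RePORTER or CPI figures), and `dollars_per_author` is a hypothetical helper rather than a function from the study's software.

```python
def dollars_per_author(funding, authors, cpi, base_year):
    """Return {year: (nominal, real)} funding per unique author.

    'real' rescales each year's dollars to base_year dollars using a
    price index: real = nominal * cpi[base_year] / cpi[year].
    """
    result = {}
    for year in sorted(funding):
        nominal = funding[year] / authors[year]
        real = nominal * cpi[base_year] / cpi[year]
        result[year] = (nominal, real)
    return result

# Illustrative values only.
funding = {2000: 1_000_000, 2020: 3_000_000}   # total dollars awarded to a field
authors = {2000: 100, 2020: 200}               # unique authors in that field
cpi = {2000: 100.0, 2020: 125.0}               # made-up price index

per_author = dollars_per_author(funding, authors, cpi, base_year=2020)
print(per_author[2000])  # (10000.0, 12500.0)
print(per_author[2020])  # (15000.0, 15000.0)
```

In this toy case funding triples while authors double, so dollars per author rise in both nominal and inflation-adjusted terms, although the adjusted increase is smaller.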

Unique authors mirror activity and use of model organisms and methods
One would expect the number of unique authors in a field to mirror the use of a specific model organism or pathogen, and outdated or early versions of methods to be less likely to be used by new investigators. Thus, we analyzed several model organisms (Saccharomyces cerevisiae, Danio rerio, Drosophila melanogaster, and Caenorhabditis elegans), three pathogens we would not expect to continue increasing in size given advances in virology and vaccines (SV40, poliovirus, and Variola major), a field which experienced multiple popularity spikes from discoveries decades apart (phage), and three recent gene editing methods with overlapping userbases: zinc finger nucleases (ZFNs), transcription activator-like effector nucleases (TALENs), and clustered regularly interspaced short palindromic repeats (CRISPR). We observed expected patterns of field size according to use and prevalence over time for each category (Fig. 5).

FIG 2 Size of field comparison between author and paper count. We compared paper count to unique author count as estimates for the size of fields. Overall, the two correlate well, suggesting that the paper could be used as a broad surrogate measure of field size.

TABLE 2, footnote a: We analyzed the research corpus of three individual fungal researchers comparing manual or automatic assignment of these papers to a given topic: Cryptococcus neoformans. In each sample, manual assignment and automatic assignment of these papers fully agreed. For example, investigator 1 has 16 total published articles. We manually attributed 12 of those papers to the cryptococcal field, and the automated search attributed the same 12 papers to that field.

DISCUSSION
Understanding field size and organization is essential to effectively understand the structure of our scientific workforce. There is some evidence that scientists favor working on specific topics, which raises the concern that chasing "hot topics" could neglect essential basic science research (12). We hoped to develop a measurement of field size that could help quantify these differences and determine how and where our scientific work effort is distributed. To compile our analysis, we used Entrez Direct searches of the PubMed database for each query. These searches match queries to text in the title, abstract, and keywords of articles while ignoring text in the introduction, results, and discussions. Thus, it is unlikely that our results would be contaminated by the inclusion of search terms in background text, and it is likely that searches return papers that are actually associated with a given query. The use of unique authors in a field differs from merely counting total publications in that it offers additional dimensions of analysis. Publication counts reflect the research output of a field as a whole, but knowing how that output is distributed across different laboratories and working environments gives a more granular image of the research landscape. Diversity of ideas and research groups is essential to creative science and fostering new ideas, whereas fields dominated by fewer smaller lab groups may find themselves stymied by dogmatic hierarchies. Measuring unique authors also avoids certain pitfalls associated with paper number-driven analyses. First, it allows us to control for differences in output between individual authors and fields. Since we are looking to measure effort and workforce rather than productivity, a single author who outputs 10 articles per year should be weighted the same as an author who outputs 15. Conversely, a paper with 10 authors implies a larger workforce behind it than a paper with 2. Thus, quantifying effort and personnel according to papers does not allow for granularity in terms of personnel and may explain why we sometimes observed differences in field size when comparing total paper counts to unique author counts.
We propose that counting unique authors provides an estimate of the size and activity of a given field. The overall number of papers published in a field each year trended strongly with the total number of unique authors across many, but not all, of the analyzed fields. We included searches of both genus and genus species to ensure the specificity of search terms, for example, determining whether searching "Aspergillus fumigatus" would return "Aspergillus niger" papers. We also compared searches of individual known authors and specific paper topics to ensure sub-selections of papers would be returned correctly. Across each test, PubMed returned the correct corpus of articles.
Once validated, we applied regression analyses to the size of each field over time to determine growth rates and dynamics. We found that exponential growth seemed to best describe the changing size of fields overall, though each field had its own unique patterns, and some have even begun to decline. Exponential growth cannot be sustained indefinitely, so we expect every field size to eventually plateau. However, it is interesting that just about every field experienced exponential growth, even those without heavily publicized outbreaks.
NIH RePORTER and CDC Required Reporting data are limited in the context of these diseases, but we were able to identify a slight trend of increasing funding per author over time. When comparing the size of a given field to the public health burden of its respective disease, we noticed a slight positive correlation across time and irrespective of whether the focus was on global cases or limited to the US. We observed a similar trend when comparing the WHO priority list of fungal pathogens, noting that critical fungal pathogens on average had a larger proportion of the workforce, followed by high priority, followed by medium priority. This analysis is reassuring in that the allocated labor force appears to be efficiently focusing on microbes with high disease burden while not neglecting microbes with lower-burden diseases. This trend persisted even when comparing fungal diseases to more globally significant human diseases such as HIV, malaria, and tuberculosis. These larger fields experienced similar growth over time, but the distribution of funding remained similar to the fungal fields, further reassuring us that resources and workforce are distributed across fields of research rather than disproportionately flocking to perceived "hot topic" research.
Our analysis also provides insight into the growth and decline of fields over time. For example, the use and popularity of the S. cerevisiae system as a model for eukaryotic cell biology rose rapidly in the 1970s-1980s and then stabilized, presumably as mammalian cell systems matured to allow comparable types of experimentation. In 1990, the estimated size of the four model organisms was S. cerevisiae > D. melanogaster > C. elegans, while zebrafish had not yet emerged as a major model organism. Three decades later, there are more authors associated with zebrafish than the other model organisms, and S. cerevisiae has moved to second place. We interpret this sequence as reflecting the fact that for eukaryotic cell biology, molecular techniques for studying deep questions were available first in yeast, then invertebrate flies, and most recently in vertebrate zebrafish. In virology, SV40, a polyoma virus that can cause tumors, was a major early experimental system used to understand how viruses caused cancer, but its popularity has declined as investigators have moved to other systems. Poliovirus was a major medical problem before the introduction of effective vaccines in the 1950s, and the number of authors associated with papers on this virus remained relatively stable until recent years when the virus resurfaced with new strains and a decline in vaccination in some regions. Similarly, smallpox was a scourge in the past but was eradicated in 1977; author numbers showed a steady decline around 1990, followed by a resurgence of interest with concerns about its potential as a biological weapon and increased interest given the human pathogenic potential of other poxviruses. Hence, for both model systems and three major viruses, the trends in author numbers can be associated with historical developments in their fields.
Our analysis has several caveats and limitations. First, there is no complete standardization of the author name format in PubMed, meaning that the same author may appear and be counted twice in our analysis if, for example, their name appears as last first and last first-initial on separate publications. We were able to identify and correct this issue for authors we personally knew, but it would be impossible to parse out initials from more common names (e.g., Smith and Nguyen). The increasing prevalence of ORCIDs may fix this problem in the future. We also caution that field size estimated from author numbers is likely to be an upper-bound estimate since not everyone who authors a paper in an area is necessarily doing research in the area in question. For example, the large increase in investigators authoring phage and CRISPR papers likely reflects the usefulness of those systems for a variety of high-throughput experiments, rather than an increase in the numbers of individuals working on phage biology or CRISPR machinery. Database selection is another important consideration. While PubMed is the de facto leader in medical microbiology, other fields wishing to use this system may require access to alternate databases. Additionally, investigators must take care to properly format search queries as even very small changes can yield drastically different results. For example, there is a difference of several orders of magnitude in the size of "bacteria" compared to "bacteriology." Finally, in considering the temporal changes in fields, we caution that the number of journals published and indexed in PubMed has increased over the years, which could have increased the number of unique authors.
In summary, we propose a bibliometric method for estimating the size of scientific fields and apply it to microbiology and several subfields. The availability of large open bibliometric databases combined with software tools for their analysis provides the means to study problems that were previously inaccessible, and there are many questions that can be explored using these approaches. We note that there is little scholarship on this topic, and our analysis should be considered a starting step in the exploration of a complex topic of great importance to humanity. When applied to the subfield of medical mycology, the results show differences in the size of fields that correlate with the medical importance of a particular fungus. We are hopeful that this approach provides a useful tool for sociologists of science and policymakers for studying the structure of scientific fields.
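The name-format caveat can be illustrated with a short sketch. Collapsing every name to last name plus first initial merges the "last first" and "last first-initial" variants of one person, but it also conflates different people who share a common surname and initial, which is why it is not a general fix. The key function and the example names are hypothetical and chosen only to illustrate the trade-off.

```python
from collections import defaultdict

def last_initial_key(first, last):
    """Hypothetical normalization: lowercase last name + first initial."""
    return f"{last.strip().lower()} {first.strip()[:1].lower()}"

def group_by_key(names):
    """Group (first, last) name pairs that collapse to the same key."""
    groups = defaultdict(set)
    for first, last in names:
        groups[last_initial_key(first, last)].add((first, last))
    return dict(groups)

# Made-up bylines: suppose 'Jane Smith' and 'J Smith' are the same person,
# while 'John Smith' is someone else.
names = [("Jane", "Smith"), ("J", "Smith"), ("John", "Smith"), ("Linh", "Nguyen")]

exact_count = len(set(names))      # counts exact name strings
groups = group_by_key(names)
collapsed_count = len(groups)      # counts last name + initial keys

print(exact_count, collapsed_count)  # 4 2
```

Under the stated assumption there are three real people: exact matching overcounts (4) and initial-based collapsing undercounts (2), so a persistent identifier such as ORCID is needed to recover the true number.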

FIG 1 Estimating the size of fields by author count. (A) The size of six significant fungal fields was calculated by searching PubMed for either just the genus (red) name or the genus and species (blue) name. (B) The size of these same six fields calculated by total papers per year. (C) The number of unique authors in each field each year was compared to the total number of publications. An overall positive correlation was observed but varied by field. Species designations are as follows: Aspergillus fumigatus, Blastomyces dermatitidis, Candida albicans, Coccidioides immitis, Cryptococcus neoformans, and Histoplasma capsulatum.

FIG 3 NIH funding and health burden according to field size. (A) We report the number of US dollars ($) awarded to a field compared to the total authors within that field over the course of all available NIH RePORTER data and adjusted for inflation according to the average annual consumer price index (CPI) for all urban consumers. (B and C) Size of fields compared to global health burden of pathogen. We determined the size of fields for every fungal pathogen recently categorized by the WHO in terms of criticality. Critical and high-priority fungal fields are larger than medium, but all categories are growing steadily in size. (B) A combined regression for unique author analysis of all fungal diseases of a given category, pooled together. (C) Individual author analyses for each fungal pathogen within a category. (D) Workforce compared to public health burden of given diseases. In both global and US-specific diseases, the size of a field correlates to disease burden. (E) Funding per author compared to diagnosed case burden of various fungal diseases. The numbers in panels D and E depict Spearman correlation coefficients.

FIG 4 Size and funding of major human diseases. (A) Size of major human disease fields counted by unique authors. (B) Funding per author for significant human disease fields adjusted (solid) and unadjusted (dashed) for inflation. (C) Size of major disease fields compared to case burden in the US during 2018. (D) Funding per author compared to diagnosed case burden of significant human diseases. Adjusted R-squared values are shown for linear model regressions.

FIG 5 Growth and shrinking of fields. We analyzed four model organisms, four fields of research, and three iterations of gene editing technology to investigate how unique author count correlates to fields growing and changing over time. (A) We found that the size of model organism fields correlated well with their increased use. Interestingly, the Drosophila field eventually plateaued and was supplanted by the zebrafish. (B) We found that the authors in the SV40 field correlated well with its use in generating cancer models. Polio and smallpox mostly remained steady at lower numbers of unique authors per year. (C) Phage research experienced a steep increase in recent popularity. (D) TALENs, ZFNs, and CRISPR are three competing gene editing technologies, with clear field preferences. The popularity of each field is clearly visualized by author count with peaks and valleys expectedly following the development of each new technique.

FIG 6 Growth analysis. (A) Comparisons of linear and exponential fits on growth rates of each scientific field analyzed in this paper. Linear regressions were performed on either the raw data or the log-transformed field sizes to compare the two models. Exponential regressions fit better overall, though clear exceptions exist. Detailed information is included in Table 4. (B) Differences in results and comparisons based on differences in initial search queries. Terms like mycology vs fungus capture fewer articles, and care must be taken to craft a proper query.

TABLE 1
Size of fields in 2022 as indicated by unique author counts

TABLE 2
Manual and automatic testing of PubMed searches

TABLE 3
Raw data of case burden, funding, and field size for the fields analyzed in this manuscript

Research Article mSystems January 2024 Volume 9 Issue 1 10.1128/msystems.00652-23


TABLE 4
Statistics for linear regression models featured in Fig. 6A using linear and exponential growth rate models for sizes of fields