Task specialization across research careers

Research careers are typically envisioned as a single path in which a scientist starts as a member of a team working under the guidance of one or more experienced scientists and, if they are successful, ends with the individual leading their own research group and training future generations of scientists. Here we study the author contribution statements of published research papers in order to explore possible biases and disparities in career trajectories in science. We used Bayesian networks to train a prediction model based on a dataset of 70,694 publications from PLoS journals, which included 347,136 distinct authors and their associated contribution statements. This model was used to predict the contributions of 222,925 authors in 6,236,239 publications, and to apply a robust archetypal analysis to profile scientists across four career stages: junior, early-career, mid-career and late-career. All three of the archetypes we found - leader, specialized, and supporting - were encountered for early-career and mid-career researchers. Junior researchers displayed only two archetypes (specialized and supporting), as did late-career researchers (leader and supporting). Scientists assigned to the leader and specialized archetypes tended to have longer careers than those assigned to the supporting archetype. We also observed consistent gender bias at all stages: the majority of male scientists belonged to the leader archetype, while the larger proportion of women belonged to the specialized archetype, especially for early-career and mid-career researchers.


Sample-size estimation
• You should state whether an appropriate sample size was computed when the study was being designed
• You should state the statistical method of sample size computation and any required assumptions
• If no explicit power analysis was used, you should describe how you decided what sample (replicate) size (number) to use

Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:

Replicates
• You should report how often each experiment was performed
• You should include a definition of biological versus technical replication
• The data obtained should be provided and sufficient information should be provided to indicate the number of independent biological and/or technical replicates
• If you encountered any outliers, you should describe how these were handled
• Criteria for exclusion/inclusion of data should be clearly stated
• High-throughput sequence data should be uploaded before submission, with a private link for reviewers provided (these are available from both GEO and ArrayExpress)

Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:

The paper uses a dataset of publications from PLoS journals. We identified authors using the author name disambiguation algorithm of Caron and van Eck (2014), which is referenced in the manuscript. We then compared the productivity ratios of these authors with those of all authors identified via the same algorithm in Web of Science from 1980 to 2018 (the last update of the database at the time).
We noted that our sample of authors is not representative of all authors in Web of Science, nor of those in Life and Biomedical Sciences. For this reason we opted to apply our model only to authors included in the first dataset. Furthermore, throughout the paper we refrain from claiming that our findings are universal, and we keep our further analyses at a descriptive level (comparisons with career length, gender, productivity and impact, and author order).

Statistical reporting
• Statistical analysis methods should be described and justified
• Raw data should be presented in figures whenever informative to do so (typically when N per group is less than 10)
• For each experiment, you should identify the statistical tests used, exact values of N, definitions of center, methods of multiple test correction, and dispersion and precision measures (e.g., mean, median, SD, SEM, confidence intervals); and, for the major substantive results, a measure of effect size (e.g., Pearson's r, Cohen's d)
• Report exact p-values wherever possible alongside the summary statistics and 95% confidence intervals. These should be reported for all key questions and not only when the p-value is less than 0.05.

Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission: (For large datasets, or papers with a very large number of statistical tests, you may upload a single table file with tests, Ns, etc., with reference to sections in the manuscript.)

Group allocation
• Indicate how samples were allocated into experimental groups (in the case of clinical studies, please specify allocation to treatment method); if randomization was used, please also state if restricted randomization was applied
• Indicate if masking was used during group allocation, data collection and/or data analysis

Please outline where this information can be found within the submission (e.g., sections or figure legends), or explain why this information doesn't apply to your submission:

Additional data files ("source data")
• We encourage you to upload relevant additional data files, such as numerical data that are represented as a graph in a figure, or as a summary table
• Where provided, these should be in the most useful format, and they can be uploaded as "Source data" files linked to a main figure or table
• Include model definition files including the full list of parameters used
• Include code used for data analysis (e.g., R, MatLab)
• Avoid stating that data files are "available upon request"

Please indicate the figures or tables for which source data files have been provided:

We do not report p-values or statistical significance tests in our results; we did, however, cross-validate the results of our modelling. We report the classification error ratios in Table 3, page 18. These are discussed and explained in line 126 and lines 454-466.
Regarding the archetypal analysis, a thorough description is provided in lines 467 to 511, including decisions that could affect the findings, specifically the choice of aggregation method from publications to authors (lines 468 to 484).
A full description of the datasets is provided in lines 338 to 413. Furthermore, the complete datasets are provided at http://doi.org/10.5281/zenodo.3891055. Permission has been granted, as some of the data shared are under proprietary copyright terms.
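
For readers unfamiliar with the cross-validated classification error mentioned in the responses above, the following is a minimal sketch of how such a figure can be computed. It is not the code used in the paper: the lookup classifier is a simple stand-in for the Bayesian network model, and the `(feature, label)` schema, the function names and the contiguous, unshuffled fold assignment are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

# Hypothetical toy schema: (feature, label), e.g. (author position, contribution).
Example = Tuple[Hashable, Hashable]


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_lookup(train: Sequence[Example]) -> Callable[[Hashable], Hashable]:
    """Stand-in classifier (NOT a Bayesian network): predict the most common
    label seen for a feature, or the overall majority label if unseen."""
    by_feature: Dict[Hashable, Counter] = {}
    overall: Counter = Counter()
    for feature, label in train:
        by_feature.setdefault(feature, Counter())[label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]

    def predict(feature: Hashable) -> Hashable:
        counts = by_feature.get(feature)
        return counts.most_common(1)[0][0] if counts else default

    return predict


def cross_validated_error(data: Sequence[Example], k: int = 5) -> float:
    """Mean held-out misclassification rate across k folds (assumes 1 < k <= len(data))."""
    errors = []
    for fold in k_fold_indices(len(data), k):
        held_out = set(fold)
        train = [ex for i, ex in enumerate(data) if i not in held_out]
        test = [data[i] for i in fold]
        predict = train_lookup(train)
        wrong = sum(1 for feature, label in test if predict(feature) != label)
        errors.append(wrong / len(test))
    return sum(errors) / len(errors)
```

Each fold is held out once while the model is fitted on the remaining folds, so the reported error reflects predictions on data the model has not seen. In practice the records would usually be shuffled before being split into folds.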