Gender-Based Homophily in Research: A Large-Scale Study of Man-Woman Collaboration

We examined the male-female collaboration practices of all internationally visible Polish university professors (N = 25,463) based on their Scopus-indexed publications from 2009-2018 (158,743 journal articles). We merged a national registry of 99,535 scientists (with full administrative and biographical data) with the Scopus publication database, using probabilistic and deterministic record linkage. Our unique biographical, administrative, publication, and citation database (The Polish Science Observatory) included all professors with at least a doctoral degree employed in 85 research-involved universities. We determined what we term an individual publication portfolio for every professor, and we examined the respective impacts of biological age, academic position, academic discipline, average journal prestige, and type of institution on the same-sex collaboration ratio. The gender homophily principle (publishing predominantly with scientists of the same sex) was found to apply to male scientists but not to females. The majority of male scientists collaborate solely with males; most female scientists, in contrast, do not collaborate with females at all. Across all age groups studied, all-female collaboration is marginal, while all-male collaboration is pervasive. Gender homophily in research-intensive institutions proved stronger for males than for females. Finally, we used a multi-dimensional fractional logit regression model to estimate the impact of gender and other individual-level and institutional-level independent variables on gender homophily in research collaboration.

to 29% fewer citations for work published in the most influential journals (as shown for publications from the PubMed database of 3,233 recipients of prestigious fellowships in life sciences in the U.S.: Lerchenmueller et al., 2019, p. 4).
Furthermore, gender-based homophily in citations exists in all disciplines, as a study of the citation data of seven million articles published in 2008-2016 shows: citers disproportionately cite authors of their own gender, with male scientists disproportionately citing other male scientists, possibly leading to a "perpetual disparity" in citations in favor of men, as men represent about 70% of all authorships (Ghiasi, Mongeon, Sugimoto, & Larivière, 2018, p. 1520). Moreover, recent research based on a sample of CVs of U.S. economists reports that gender influences the attribution of credit for group work, that is, co-authorship matters differently for tenure for men and women, with women being less likely to receive tenure the more they co-author (Sarsons, Gërxhani, Reuben, & Schram, 2021). This differential attribution of credit contributes to the gender promotion gap (Fell & König, 2016; Abramo, D'Angelo, & Rosati, 2015). Furthermore, the gender citation gap persists: even though female scientists may publish more in journals with higher impact factors than their male peers, their work may receive lower recognition (fewer citations) from the scientific community (as Ghiasi, Larivière, & Sugimoto, 2015, have shown for female engineers, using a sample of 680,000 articles from 2008-2013, and Maliniak et al., 2013, for top journals in international relations).

Female scientists and competition
Of the various approaches to studying the "increasing and persistent" gender gap (Huang et al., 2020, p. 3) and "pervasive" gender hierarchies (Fox, 2020, p. 1001) in science, an approach centered on competition is especially relevant in the context of homophilous and heterophilous collaboration patterns. There have been ongoing discussions in experimental and personnel economics (often with laboratory-based evidence) about whether women are deterred by competition in some areas of science (and in some workplaces more generally; Flory et al., 2014; Dargnies, 2012). A systematic shying away from competition could have implications not only for the distribution of females across academic disciplines and their sub-disciplines but also for team formation in research collaboration, the prestige level of the journals selected for publishing, and authorship composition. Laboratory experiments show that women may shy away from competition and men may embrace it, with gender implications for publishing in top academic journals, where competition is stiff and the risk of rejection high (Sonnert & Holton, 1996; Kwiek, 2021). Women are extremely underrepresented in top journals in some disciplines, such as mathematics (Mihaljević-Brandt et al., 2016, p. 19), and they can self-select into lower-ranked journals (Mayer & Rathmann, 2018). Gender differences in the propensity to choose competitive environments (in our case, highly selective journals) are reported to be driven by gender differences in confidence and preferences for entering and performing in a competition (Niederle & Vesterlund, 2007, pp. 1098-1100). In their study of all full professors in psychology in Germany, Mayer & Rathmann (2018, pp. 1674-1676) show that in top journal publications, there are considerably more men with a high publication output, as well as considerably fewer men with a low publication output.
Gender differences in choices over competition may be driven partly by men preferring competitive to non-competitive settings and by a significantly stronger aversion to competitive workplaces among women compared to men (Flory et al., 2014). Not surprisingly, male scientists over-cite (King et al., 2017; Maliniak et al., 2013), are better represented in top journals, and have higher visibility in science (Maddi et al., 2019).
Academic norms or expectations of conventional behavior may also matter: there may be a common social practice, particularly in male-dominated disciplines of science, that "holds women up to more scrutiny than men" (Gupta, Poulsen, & Villeval, 2013, p. 16). Sonnert and Holton (1996, p. 69), in their study of gender disparities in career patterns of especially promising scientists, conclude that women might be seen as socialized to be less competitive "so that they choose their own niche rather than enter the fray with numerous competitors working on the same topic," often feeling they are "under the magnifying glass." Male scientists may be "more aggressive, combative and self-promoting in their pursuit of career success, and so they achieve higher visibility" (Sonnert & Holton, 1996, p. 67). Social norms may thus influence publishing patterns, including, for instance, predominantly same-sex publishing for male scientists, especially in more traditional societies such as Poland.
At the same time, in more firmly male-dominated disciplines (such as physics and astronomy, engineering, and computer science, in the Polish case), female scientists may feel more intense performance pressure due to their high visibility among the overwhelming majority of male scientists and to carrying the burden of representing women in these disciplines. They may have to work "twice as hard to prove their competence," with all their actions being public, as Kanter (1977, p. 973) suggested in her classic study of the role of male-female proportions in workplace settings. Being less competitively inclined in an increasingly competitive environment of global science may hurt female scientists, especially in their early careers, at the individual level of obtaining tenure, salary increases, and research funding (Van den Besselaar & Sandström, 2015; Sarsons et al., 2021; Kwiek, 2018a). In Polish academia, the list of disciplines where female participation approaches or exceeds 50% goes beyond the social sciences and humanities to include business, economics, and econometrics; agricultural and biological sciences; medicine; chemistry; biochemistry, genetics and molecular biology; and psychology (see Table 16 in Data Appendices). Out of the 24 ASJC Scopus disciplines studied in this paper, female representation reaches at least 50% in 10.

Gender homophily in research collaboration defined
The literature investigating gender homophily in academic publishing is based on research on selected institutions (e.g., McDowell & Smith, 1992), selected disciplines (predominantly economics, as in Boschini & Sjögren, 2007, or McDowell, Singell, & Stater, 2006), and large-scale bibliometric data (see Wang et al., 2019, who examined 252,413 papers with 807,588 authorships from the JSTOR corpus, or Ghiasi et al., 2015, who studied approximately one million Web of Science authorships in engineering).
Most recent bibliometric studies on gender differences in research collaboration patterns suggest that men tend to co-author with men and women with women, leading to the research theme of "gender homophily" in science (Potthoff & Zimmermann, 2017; Lerchenmueller et al., 2019; Kegen, 2013; Wang et al., 2019; Boschini & Sjögren, 2007). At the same time, however, collaboration in research, traditionally operationalized as co-authored publications, influences career progress. Excessive gender homophily among women, while supportive for early-career female researchers, may also harm their careers. This is especially relevant for particularly able female scientists publishing in high-impact journals (as Lerchenmueller et al., 2019, show with powerful empirical evidence). Women may place themselves at a disadvantage when collaborating disproportionately with other women because, for example, "women tend to be part of less resource-rich and influential networks or because women's work may receive less attention than men's, likely harming career progress" (Lerchenmueller et al., 2019, p. 3). This is not the case in Poland, though, as we shall demonstrate, since the Polish female scientists studied tend to avoid publishing exclusively with other female scientists at all levels of their careers and for all age groups.
As mentioned, the homophily principle maintains that "similarity breeds connection" and personal networks are homogeneous with regard to sociodemographic, behavioral, and intrapersonal characteristics. Homophily is known to "limit people's social worlds" (McPherson, Smith-Lovin, & Cook, 2001, p. 415). According to this principle, contact between similar people occurs at a higher rate than among dissimilar people; in other words, "birds of a feather flock together" (McPherson et al., 2001, p. 417). Thus, males should co-author with males in a disproportionate fashion, while females should co-author disproportionately with females, across countries, disciplines, and institutions.
Homophily in general (including the gender-based homophily examined in this research) is reported to simplify communication, enhance the predictability of behavior, entail reciprocity in collaboration, and increase trust between collaborating parties (McPherson et al., 2001, p. 435; Kegen, 2013, p. 63). As Kegen (2013, p. 65) notes, while the behavior of collaborators might be more predictable and collaboration potentially less costly, gender homophily might also exclude women from informal networks. Furthermore, embeddedness in academic social networks, especially informal networks, is crucial both for doing research and for achieving a career. "Networks matter. Producing high-quality work is not sufficient for research to gain the attention of the widest number of scholars or have the greatest impact" (Maliniak et al., 2013, p. 918).
If homophily means "the tendency of people to choose to interact with similar others" (McPherson et al., 2001, p. 435), then gender-based homophily in this research means Polish male scientists disproportionately co-authoring with other male scientists, and Polish female scientists co-authoring disproportionately with other female scientists. Recent research tends to indicate that female scientists exhibit stronger gender homophily than male scientists (Jadidi et al., 2018): females are reported to collaborate more often with females than males with males (Kegen, 2013; Lerchenmueller et al., 2019). Evidence from co-authorship patterns in economics indicates that team formation in academic publishing is not gender-neutral: rather, there is powerful gender sorting in team formation (Boschini & Sjögren, 2007). However, the practices of collaboration between males and females differ across disciplines (Maddi et al., 2019); the patterns of international research collaboration differ cross-nationally (see Kwiek, 2020a, on 28 European countries) and between genders intra-nationally (see Kwiek, 2020b, and Roszka, 2020, on Poland).

Hypotheses of this research
Following a comprehensive literature review and based on prior in-depth knowledge of the Polish academic science system, we have formulated seven research questions leading to seven hypotheses (which are presented in Table 1, along with the results of our research).
The Polish science and higher education systems have been studied intensively. For instance, Kulczycki and colleagues examined the funding system (Kulczycki, Korzeń, & Korytkowski, 2017), and Bieliński and Tomczyńska (2018) studied the various manifestations of the ethos of science and showed how Poland is moving away from Michael Polanyi's "republic of science". Feldy and Kowalczyk (2020) studied how scientists view the system of financing science, and Kulczycki and Korytkowski (2020) examined changing publication patterns in Poland. Furthermore, higher education reforms (e.g., Shaw, 2019; Antonowicz, Kulczycki, & Budzanowska, 2020; Kwiek, 2012), international research collaboration (Kwiek, 2020b), and high research productivity (Kwiek, 2018b) have been examined. Gender disparity in Polish science, however, has rarely been studied, and gender collaboration patterns, including gender homophily, have not been examined except by Kwiek and Roszka (2020), who studied international research collaboration by gender and showed that male scientists dominate in this collaboration type at each level of intensity, with significant cross-disciplinary differences (Nielsen, 2016, came to similar conclusions in his study of a Danish university). Siemienska (2007) examined the gender research productivity gap with reference to the cultural capital of faculty members.
Finally, Kosmulski (2015) analyzed the productivity and impact of male and female scientists in the period 1975-2014, based on a limited set of authors bearing one of the 26 most popular "-ski" or "-cki" names, showing that male scientists generally have higher productivity and impact than female scientists, except in biochemistry, where their productivity and impact are almost equal.

Table 1 Research hypotheses and results (summary).

| Research Question | Hypothesis | Result |
| --- | --- | --- |
| RQ1. What is the relationship between gender and same-sex collaboration? | Hypothesis 1. We would expect that the same-sex collaboration ratio is higher for female than for male scientists. | Not confirmed |
| RQ2. What is the relationship between gender, same-sex collaboration, and age? | Hypothesis 2. We would expect that the same-sex collaboration ratio decreases with age for both male and female scientists. | Confirmed for male scientists only |
| RQ3. What is the relationship between gender, same-sex collaboration, and academic position? | Hypothesis 3. We would anticipate that the same-sex collaboration ratio decreases with academic position for both male and female scientists. | Confirmed for male scientists only |
| RQ4. What is the relationship between gender, same-sex collaboration, and academic disciplines? | Hypothesis 4. We would anticipate that the same-sex collaboration ratio is higher in male-dominated academic disciplines. | Confirmed |
| RQ5. What is the relationship between gender, same-sex collaboration, and institutional research intensity? | Hypothesis 5. We would expect that the same-sex collaboration ratio is higher in research-intensive universities. | Confirmed for male scientists only |
| RQ6. What is the relationship between gender, gender-defined research collaboration type, and journal prestige? | Hypothesis 6. We would expect that the journal prestige level of mixed-sex publications is higher than that of same-sex publications for both male and female scientists. | Confirmed |
| RQ7. What is the impact of gender and other individual-level and institutional-level independent variables on gender homophily in research collaboration? | Hypothesis 7. In a fractional logit regression model, we would anticipate that individual-level independent variables are more influential than institutional-level independent variables in predicting the same-sex collaboration ratio. | Not confirmed |

Dataset
Two large databases of different natures were merged: Database I was an official national administrative and biographical register of all Polish academic scientists; Database II was the Scopus database. The two were merged to create "The Polish Science Observatory," which was maintained and periodically updated by the two authors (a short description of the database is presented in Kwiek and Roszka, 2020). The main steps in merging the biographical and administrative dataset ("The Polish Science") with the publication and citation database (Scopus) are graphically shown in Fig. 1. Database I (created by the OPI National Research Institute) comprised 99,535 scientists employed in the Polish science sector as of November 21, 2017. Only scientists with at least a doctoral degree (70,272) and employed in the higher education sector were selected for further analysis (54,448 or 54.70% of all scientists, all working at 85 universities of various types). The data used were both demographic (gender and date of birth) and professional (the highest degree awarded; award date of Ph.D., habilitation, and full professorship; and institutional affiliation), with each scientist identified by a unique ID. Database II included 169,775 names from the 85 institutions whose publications for the decade analyzed (2009-2018) were included in the database, and 384,736 Scopus-indexed publications. Authors in Database II were defined by their institutional affiliations, Scopus documents, and individual Scopus IDs. Scopus uses a sophisticated author-matching algorithm to precisely identify publications by the same author; gender is not captured in Scopus Author Profiles (Elsevier, 2020, p. 119). We did not reconstruct the full publishing careers of Polish scientists (as in Huang et al., 2020) but only their output in the last decade, when their Scopus publications increased markedly.
We have identified authors with their different individual IDs in the two databases and provided them with a new ID in the new "Observatory" database. Probabilistic methods of data integration were used (Fellegi & Sunter, 1969; Herzog, Scheuren, & Winkler, 2007; Enamorado, Fifield, & Imai, 2019). Separately within each of the 85 universities, the first and last names in each record of Database I were compared with those in each record of Database II using the Jaro-Winkler string distance (with values from 0 to 1, where 1 indicates identical strings; see Jaro, 1989; Winkler, 1990). Pairs of strings with a distance greater than 0.94 were considered identical (signified by 2) (see Table 2). Pairs with a distance greater than 0.88 but less than 0.94 were considered similar (signified by 1), while those with a distance less than 0.88 were considered disparate (signified by 0). Next, using an expectation maximization algorithm (Enamorado et al., 2019), the posterior probability that a given pair of records belongs to the same unit was estimated. If the probability was greater than 0.85, the pair was considered to be part of the same unit (as suggested by Harron et al., 2017). The computation was made using the fastLink R package (version 0.6.0).
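The three-level name comparison described above can be sketched as follows. This is a minimal illustration in Python rather than the fastLink implementation the study used: the function names are hypothetical, the prefix weight of 0.1 over at most four characters is the conventional Winkler setting and is assumed here, and the treatment of scores exactly equal to 0.94 or 0.88 (which the text leaves open) is an arbitrary choice. The expectation maximization step that turns agreement levels into posterior match probabilities is not shown.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: 1.0 for identical strings, 0.0 for no matching characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * prefix_weight * (1.0 - sim)


def agreement_level(name_a: str, name_b: str) -> int:
    """Return 2 (identical), 1 (similar) or 0 (disparate) using the 0.94 / 0.88 cut-offs."""
    sim = jaro_winkler(name_a.lower(), name_b.lower())
    if sim > 0.94:
        return 2
    if sim > 0.88:  # assumption: a score of exactly 0.94 is treated as "similar"
        return 1
    return 0        # assumption: a score of exactly 0.88 is treated as "disparate"
```

With these cut-offs, a one-letter variant at the end of a long surname (for example, "Kowalski" against "Kowalsky") still counts as identical, whereas a substitution in a short surname ("Nowak" against "Novak") falls into the similar band.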
By employing a probabilistic approach to the merging of the data sets, it was possible to estimate the uncertainty of the process and thus assess the quality of the new integrated database by calculating the percentage of records incorrectly classified as matches (false discovery rate, FDR) and as non-matches (false negative rate, FNR). Deduplication procedures were applied to the raw integrated author database, as 38,750 records referred to 32,937 unique authors. For duplicated records, a clerical review was performed (Herzog et al., 2007). Manual verification of duplicate records revealed that 1,207 records (12.15% of the duplicated records and 3.11% of all integrated records) were incorrectly assigned to the same person. These records were deleted from the integrated database. The integrated database used in our research ultimately included 32,937 unique authors of publications, including 25,463 authors of journal articles.
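To illustrate how posterior match probabilities allow linkage quality to be quantified, the sketch below computes posterior-based estimates of the FDR and FNR for a set of candidate pairs. It is an assumption-laden illustration, not the study's computation: the estimators shown (expected false matches among declared matches, and expected missed matches relative to the expected number of true matches) are one common formulation in Fellegi-Sunter style linkage, the function name is hypothetical, and the posterior values in the usage line are made up.

```python
from typing import Sequence, Tuple


def linkage_error_rates(posteriors: Sequence[float],
                        threshold: float = 0.85) -> Tuple[float, float]:
    """Estimate (FDR, FNR) from posterior match probabilities.

    Pairs with a posterior above `threshold` are declared matches, mirroring
    the 0.85 cut-off reported in the text.
    """
    declared = [p for p in posteriors if p > threshold]
    rejected = [p for p in posteriors if p <= threshold]
    # FDR: expected share of declared matches that are in fact non-matches.
    fdr = sum(1.0 - p for p in declared) / len(declared) if declared else 0.0
    # FNR: expected true matches left among rejected pairs, relative to the
    # expected total number of true matches (assumed to be the sum of posteriors).
    expected_true = sum(posteriors)
    fnr = sum(rejected) / expected_true if expected_true > 0 else 0.0
    return fdr, fnr


# Hypothetical posteriors for four candidate record pairs.
fdr, fnr = linkage_error_rates([0.99, 0.90, 0.50, 0.10])
print(f"FDR = {fdr:.3f}, FNR = {fnr:.3f}")
```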
Finally, Database II also contained metadata on 384,736 publications published in 2009-2018. Of these, 377,886 papers had up to 100 authors, and 230,007 were written by the authors included in Database I (we used deterministic record linkage at this stage of data integration). Subsequently, only articles published in journals were selected for further analysis, which reduced the number of papers in the database to 158,743 articles.

Limitations
Our research has some limitations and possible biases (e.g., selection bias) as a result of the database construction procedures we employ: we select only internationally visible authors, that is, authors with Scopus-indexed publications. The selection of a different database (for instance, Web of Science or the Polish Scientific Bibliography, PBN), a different period (other than 2009-2018), a different publication format (other than articles in journals), or a different language (other than English) might lead to different results.
The date of reference for the data derived from Database I ("The Polish Science") was November 21, 2017, and for the data derived from the Scopus database was the whole decade studied. There are five simplifying assumptions. (1) The paper examines a decade of individual publishing output. While the actual publishing period may, in fact, be shorter than a decade for younger scientists, it may be only the most recent decade of the long-term publishing activities of older scientists. (2) Journal percentile ranks as provided by Scopus are deemed stable within this decade, even though they may fluctuate over the period studied. (3) We assume that Polish scientists were not changing institutions (between 75 research-involved and 10 research-intensive institutions) in the decade studied, as the mobility within the Polish higher education system is very low. (4) We regard Polish scientists who were assistant, associate, and full professors on the date of reference as keeping these positions for the whole decade studied, while these positions are the highest ranks achieved in the study period. (5) We use an internationally recognizable tripartite division of academic positions into assistant, associate, and full professors, even though in fact we use two Polish academic degrees (doctorates and habilitations) and a Polish academic title (professorship). In this sense, our "academic positions" are proxies for Polish "academic degrees and titles". However, all scientists in our sample have their doctoral degrees (and therefore must be at least assistant professors). All scientists with habilitation degrees receive the position of associate professor within three years, and all full professors have their professorship titles.
While biological age, academic position, and employment type and institution were defined as of November 21, 2017, the variables derived from the Scopus database were constructed to show mean values for the decade of 2009-2018, in which they may have differed from year to year. A limitation is that the values for 2017 for some variables and the mean values for the decade of 2009-2018 are lumped together. Clearly, even the binary classification of male-dominated disciplines valid for 2017 may have been different in the previous years, especially for disciplines close to the threshold value of 50% (for instance, HUM, SOC, and ECON, with 49.8%, 49.8%, and 49.1% of female scientists, respectively; see Table 16 in Data Appendices).
This means that longitudinal studies (year by year) and cohort studies (by consecutive cohorts of scientists) were not possible because of data limitations. Actually, the dominant disciplines ascribed to scientists, individual publication portfolios, gender composition of disciplines, and average publication prestige were constructed for the decade of 2009-2018. For instance, a single observation was a male who in 2017 was 60 years old and was employed full-time as an associate professor in a research-intensive university. He was publishing in ASJC Physics and Astronomy, and his individual Scopus-indexed research output for the decade studied (2009-2018) was strictly defined in terms of publication numbers (20 Scopus-indexed journal articles), the gender composition of his co-authorships (60% all-male, 35% mixed-sex, 5% solo), and the individual average publication prestige expressed in percentile ranks (Scopus 85th percentile rank). While acknowledging this limitation, we must stress that Polish scientists publish far too little per year to use Scopus data from a single year (e.g., 2017 only). We assume that a decade of publishing provides a good overview of individual publishing patterns captured within individual publication portfolios.

Methods
As in our previous work on gender disparities in international research collaboration (Kwiek and Roszka, 2020), every Polish scientist represented in our integrated database was ascribed to one of 27 ASJC disciplines at the two-digit level (following Abramo, Aksnes, & D'Angelo, 2020, who determined the dominant Web of Science subject category for each scientist they studied). A given paper can have one or multiple disciplinary classifications (see the ASJC discipline codes used, as described in Table 3, which presents the variables used in this analysis). The dominant discipline for each scientist is the mode, that is, the most frequently occurring discipline among the classifications of that scientist's papers (when no single mode occurred, the dominant discipline was randomly selected). All Polish scientists were defined by their gender, discipline, and publications (solo, all-male, all-female, or mixed-sex). Every ASJC discipline is characterized by its proportions of male and female scientists. However, the GEN, NEURO, and NURS disciplines did not meet an arbitrary minimum threshold of 50 scientists per discipline and were omitted from further analysis.
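The assignment of a dominant discipline can be sketched as a mode computation with random tie-breaking. The sketch below is illustrative only: the function name is hypothetical, each classification of a multi-discipline paper is assumed to count once toward the tally, and the random choice is assumed to be restricted to the tied disciplines, since the text does not spell out either detail.

```python
import random
from collections import Counter
from typing import List, Optional


def dominant_discipline(paper_disciplines: List[List[str]],
                        rng: Optional[random.Random] = None) -> Optional[str]:
    """Return the most frequent ASJC code across a scientist's papers.

    `paper_disciplines` holds one list of ASJC codes per paper. Ties are
    broken at random among the tied codes (an assumption of this sketch).
    """
    rng = rng or random.Random()
    counts = Counter(code for paper in paper_disciplines for code in paper)
    if not counts:
        return None  # no classified papers, so no dominant discipline
    top = max(counts.values())
    modes = sorted(code for code, n in counts.items() if n == top)
    return modes[0] if len(modes) == 1 else rng.choice(modes)


# Hypothetical portfolio: three papers, one of them classified in two disciplines.
portfolio = [["CHEM"], ["CHEM", "BIO"], ["ENG"]]
print(dominant_discipline(portfolio))
```

Passing a seeded `random.Random` instance makes the tie-breaking reproducible across runs.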
In the present research, in which the unit of analysis was an individual scientist, every scientist had solo or collaborative articles. Collaborative articles include same-sex and mixed-sex articles. Collaborative articles with authors included in our database are defined in terms of the gender of the authors. Of the Polish scientists included in the integrated database of 54,448 scientists, 100% had their gender defined in the original administrative database. In contrast, there are Polish co-authors outside of our database (e.g., affiliated with the Polish Academy of Sciences) and international co-authors of publications with Polish co-authors whose gender is not defined.
Regarding international collaborators of Polish authors and their gender, we analyzed 158,743 articles with individual EIDs (Scopus individual publication IDs). There were 15,149 articles (9.54%) written solely by female scientists, 39,089 (24.62%) written solely by male scientists, 78,419 (49.40%) written in mixed female-male collaboration, and 18,109 (11.41%) solo-written articles. There were 7,979 articles (5.03%) for which only the gender of Polish co-authors was known.
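The classification of articles by the gender composition of their authors, and the shares reported above, can be reproduced with a short sketch. The function names and the "partly unknown" label are hypothetical; in particular, treating any article with at least one author of undetermined gender as a separate category is an assumption standing in for the group of articles for which only the gender of the Polish co-authors was known. The counts are those reported in the text, with the stated total of 158,743 articles as the denominator.

```python
from typing import Dict, List, Optional


def collaboration_type(author_genders: List[Optional[str]]) -> str:
    """Classify one article from its authors' genders ("F", "M", or None if unknown)."""
    if not author_genders:
        raise ValueError("an article needs at least one author")
    if len(author_genders) == 1:
        return "solo"
    if any(g is None for g in author_genders):
        return "partly unknown"  # assumption: kept apart from the four known types
    genders = set(author_genders)
    if genders == {"F"}:
        return "all-female"
    if genders == {"M"}:
        return "all-male"
    return "mixed-sex"


def shares(counts: Dict[str, int], total: int) -> Dict[str, float]:
    """Percentage share of each article type, rounded to two decimals."""
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Article counts as reported in the text.
REPORTED_COUNTS = {
    "all-female": 15_149,
    "all-male": 39_089,
    "mixed-sex": 78_419,
    "solo": 18_109,
    "partly unknown": 7_979,
}

for label, pct in shares(REPORTED_COUNTS, total=158_743).items():
    print(f"{label}: {pct:.2f}%")
```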
To determine the gender of the international co-authors, we used another dataset at our disposal: a dataset of 27.4 million publications published in the same period (2009-2018) in the OECD area and indexed in Scopus. Our "OECD" dataset includes all metadata about all publications produced in the study period in 1,674 research-active institutions located in 40 OECD economies (the threshold we used was 3,000 Scopus-indexed articles published in the past 10 years). Specifically, we used a subset of our OECD dataset of authors (with 11,087,392 individual Scopus IDs). In the next step, we used the genderizeR R package to estimate the gender of the authors in our OECD dataset (see Wais, 2016, on the various gender determination methods, including via the R package).
GenderizeR was previously used for gender prediction by Topaz and Sen (2016), who studied gender representation on the editorial boards of 435 journals in the mathematical sciences; by Fell and König (2016), who studied gender differences in co-authorship among 4,234 industrial-organizational psychologists; and by Huang et al. (2020), who examined gender inequality in the academic careers of 7.9 million Web of Science authors. Finally, Wang et al. (2019) also used the R package to study gender-based homophily in JSTOR publication data. Genderize.io provides a count of the number of times that a given first name appears in the corpus and the corresponding probability of each gender (which is either male or female). To establish optimal values of the gender prediction indicators, the thresholds for the probability and count values can be adjusted.
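The role of the probability and count thresholds can be illustrated with a small decision rule. This is a sketch, not the genderizeR API: the function name and argument names are hypothetical, the default probability cut-off of 0.85 is the value the study reports using, and the default minimum count of 1 is an arbitrary placeholder, since the count threshold applied in the study is not stated.

```python
from typing import Optional


def assign_gender(predicted: Optional[str], probability: float, count: int,
                  min_probability: float = 0.85,
                  min_count: int = 1) -> Optional[str]:
    """Accept a first-name gender prediction only if it is sufficiently reliable.

    `predicted` is "male", "female", or None when the name is not in the corpus;
    `probability` and `count` are the values returned for that first name.
    Returns the accepted gender, or None if the prediction is rejected.
    """
    if predicted not in ("male", "female"):
        return None  # the name could not be classified at all
    if probability < min_probability or count < min_count:
        return None  # classified, but below the reliability thresholds
    return predicted


# Hypothetical predictions: (gender, probability, count).
print(assign_gender("female", 0.98, 1200))
print(assign_gender("male", 0.70, 300))
print(assign_gender(None, 0.0, 0))
```

Raising either threshold trades coverage for accuracy: fewer authors receive a gender, but the assignments that remain are more reliable.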
Using the R package, the gender of 7,640,123 of our OECD authors (individual Scopus IDs) was estimated with a probability greater than or equal to 0.85. Of the 11,087,392 authors in the data at our disposal, the genderizeR algorithm was unable to estimate the gender of 2,521,150 (22.74%), including a large number of authors from Japan and South Korea, with whom Poland collaborates only marginally. Of the 8,566,242 authors whose gender the algorithm estimated, gender was estimated with a probability lower than 0.85 in 926,119 cases (10.81%). In the next step, using individual Scopus IDs, "The Polish Science Observatory" and the "OECD" datasets were merged to determine the gender of the international collaborators of Polish authors. Out of 164,908 international collaborators, we were able to determine the gender of 83,702 (or 50.75%). Our reference database for estimating the gender of co-authors was restricted to 1,674 research-intensive OECD universities; consequently, we were not able to estimate the gender of collaborators from non-research-intensive universities in the OECD area or from non-OECD universities.
Next, using an individual scientist as the unit of analysis, we calculated the proportion of same-sex publications among collaborative articles within the individual publication portfolio of every Polish scientist in the sample. Thus, for all scientists, male and female, within their collaborative articles only, we determined what we termed the same-sex collaboration ratio (for male scientists collaborating only with male scientists, the ratio is 1). Analogously, a ratio of 0 is equivalent to conducting no same-sex collaboration: the scientist collaborates only with the other gender, i.e., there are only mixed-sex publications in the scientist's individual publication portfolio. The ratio does not take into account the different availability of male and female colleagues within each discipline.
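The same-sex collaboration ratio defined above can be sketched in a few lines. The function name and article-type labels are hypothetical, and returning `None` for a scientist with no collaborative articles is an assumption, as the text does not say how such portfolios are handled. The usage example takes the illustrative portfolio from the Limitations section (20 articles: 60% all-male, 35% mixed-sex, 5% solo, i.e., 12, 7, and 1 articles).

```python
from typing import List, Optional


def same_sex_ratio(article_types: List[str], own_gender: str) -> Optional[float]:
    """Share of same-sex articles among a scientist's collaborative articles.

    `article_types` lists one label per article: "solo", "all-male",
    "all-female", or "mixed-sex"; `own_gender` is "M" or "F". Solo articles
    are excluded because the ratio is defined on collaborative articles only.
    """
    same_label = "all-male" if own_gender == "M" else "all-female"
    same_sex = sum(1 for t in article_types if t == same_label)
    mixed = sum(1 for t in article_types if t == "mixed-sex")
    collaborative = same_sex + mixed
    if collaborative == 0:
        return None  # assumption: the ratio is undefined without collaboration
    return same_sex / collaborative


# The illustrative male physicist: 12 all-male, 7 mixed-sex, 1 solo article.
portfolio = ["all-male"] * 12 + ["mixed-sex"] * 7 + ["solo"]
print(round(same_sex_ratio(portfolio, "M"), 3))  # 12 / 19, about 0.632
```

Because the solo article drops out of the denominator, the ratio for this portfolio (12 of 19 collaborative articles) is slightly higher than the 60% share of all-male articles in the full portfolio.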
The availability, or the gender composition of each discipline, includes both their numbers and their percentages. As the sample section (3.3) shows in detail (see Table 16 in Data Appendices), there are more than 1,000 female scientists in only three disciplines (AGRI, BIO, and MED) and more than 500 in six disciplines (the three above and CHEM, ENG, and ENVIR), whereas there are more than 1,000 male scientists in three disciplines, and more than 500 in 12 disciplines.
The gender composition of the 24 disciplines studied would be a serious limitation of the independent variable as long as we assumed that scientists collaborate only within their disciplines. However, scientists in our dataset collaborate both within disciplines and across them, which can be seen from the disciplinary statistics pertaining to individual publication portfolios by ASJC discipline and to authorship combinations by ASJC discipline for individual papers. Traditionally, especially in academic profession surveys (see Kwiek, 2019), scientists who are strongly embedded in their disciplines are identified, for instance, by checking the discipline of their doctoral dissertations; based on our dataset, in contrast, we examine scientists and their collaborative publications to which disciplines are ascribed via a Scopus indexing system. The availability of male or female colleagues within a discipline seems to matter much less in current settings, given the increasing large-scale cross-disciplinary collaboration (as evidenced by differing dominant ASJC disciplines ascribed to collaborating scientists). Table 3 provides a short description of variables ("Observatory" means The Polish Science Observatory database).

Table 4
The median of the same-sex collaboration ratio by gender.

Sample
The characteristics of the sample (N = 25,463; 14,886 males and 10,577 females, 58.5% and 41.5%) are presented in Table 16 in Data Appendices: about half of the scientists are middle-aged (in the 40-54 age bracket; 49.7%), and over half of them are assistant professors (56.0%). Column percentages enable the analysis of the gender distribution by major age groups, academic positions, and disciplines (by type: STEM and non-STEM, female-dominated and male-dominated). Row percentages enable the analysis of how male and female scientists are distributed according to a given feature. About half of the scientists work in female-dominated disciplines and about half in male-dominated disciplines (49.8% and 50.2%); however, females in female-dominated disciplines form a weaker majority (54.6%) than males in male-dominated disciplines (71.4%). All assistant professors hold doctoral degrees, all associate professors hold habilitations, and all full professors hold professorship titles.
The 25,463 scientists in our integrated database had at least a single article in the Scopus database in the period 2009-2018; the database therefore includes all internationally visible Polish academic scientists (on the skewed distribution of research productivity of Polish scientists, see Kwiek, 2018b; on the upper 10%, termed top performers, who produce about half of all publications across 11 European systems, see Kwiek, 2016). Additionally, our sample includes the international collaborators of Polish authors, whose gender was determined using the algorithm described in the Data and Methods subsection (164,908 international co-authors). The differentiated proportions of female scientists can also be examined by academic discipline. While female scientists are especially underrepresented in the four disciplines of computer science (COMP 16.5%), engineering (ENG 14.9%), physics and astronomy (PHYS 16.6%), and mathematics (MATHS 25.2%), the number of male and female scientists is almost equal in arts and humanities (HUM) and social sciences (SOC).

The same-sex collaboration ratio by gender
Hypothesis 1. We would expect that the same-sex collaboration ratio is higher for female than for male scientists (not confirmed).
Gender homophily in publishing, measured by the same-sex collaboration ratio, falls within the range of 0 (no same-sex collaborative articles among collaborative articles in the individual publication portfolio) to 1 (exclusively same-sex collaborative articles among collaborative articles in the portfolio). The median ratio for males is more than three times that for females (0.500 compared with 0.153, Table 4). For the whole national sample, the median ratio is 0.333, meaning that for at least 50% of authors, same-sex collaboration (males with males, females with females) accounts for at least 33.3% of their collaborative articles. The Mann-Whitney test shows the gender difference to be statistically significant at the 0.05 level. Thus, Hypothesis 1 is not confirmed.
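The ratio and its median by gender can be illustrated with a minimal Python sketch. It is not the authors' code: the function name, the data layout, and the four portfolios are hypothetical. It assumes that the gender of every co-author is known and that a publication counts as same-sex only when all co-authors share the focal scientist's gender.

```python
from statistics import median

def same_sex_ratio(author_gender, papers):
    """Share of collaborative papers written only with same-gender co-authors.

    `papers` is a list of lists; each inner list holds the genders of the
    co-authors on one paper (focal author excluded). Solo papers (empty
    lists) are ignored. Returns None if there are no collaborative papers.
    """
    collaborative = [p for p in papers if p]
    if not collaborative:
        return None
    same_sex = sum(all(g == author_gender for g in p) for p in collaborative)
    return same_sex / len(collaborative)

# Hypothetical individual publication portfolios: (gender, co-author genders).
portfolios = {
    "A": ("M", [["M"], ["M", "M"], ["F", "M"], []]),
    "B": ("F", [["M"], ["F"], ["M", "F"], ["M"]]),
    "C": ("M", [["M"], ["F"]]),
    "D": ("F", [["M"], ["M"]]),
}

ratios = {name: same_sex_ratio(g, papers) for name, (g, papers) in portfolios.items()}

# Median ratio by gender, the statistic reported in Table 4.
median_by_gender = {
    gender: median(
        ratios[name] for name, (g, _) in portfolios.items() if g == gender
    )
    for gender in ("M", "F")
}
```

Because the scientist is the unit of analysis, a scientist with two collaborative papers weighs as much in the median as one with two hundred.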

The same-sex collaboration ratio by age and academic position
Hypothesis 2. We would expect that the same-sex collaboration ratio decreases with age for both male and female scientists (confirmed for males but not for females).
Hypothesis 3. We would anticipate that the same-sex collaboration ratio decreases with academic position for both male and female scientists (confirmed for males but not for females).
Before analyzing the effect of age and academic position, we examined the level of correlation between these two variables since, in many academic systems, seniority is a significant predictor of career advancement. The boxplots in Fig. 2 divide the data into quartiles and show the median, which is higher for each subsequent academic position. The boxes enclose the middle 50% of the data (for instance, across all disciplines, half of full professors are aged about 60). Outliers are located predominantly above the boxes, showing the presence of older scientists within the three academic positions rather than younger ones. There is a clear interdependence between age and academic position, as the average age increases with the three consecutive academic positions adopted in this paper (assistant professor, associate professor, and full professor) across all 24 ASJC disciplines. Also, the observed average age for each of the three stages of an academic career is similar among all the disciplines. This empirical observation is confirmed by the formal Kruskal-Wallis test, in which we tested the null hypothesis that the average age is the same at every stage of the academic career: for each discipline, we reject the null hypothesis at a significance level of 0.001 (Table 5). However, both variables emerge as important in previous literature, and therefore their joint impact will be studied below.
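For readers unfamiliar with the Kruskal-Wallis test, the statistic can be sketched in a few lines of Python. This is an illustration only: the ages below are invented, the function names are ours, and the sketch omits the tie correction that statistical packages normally apply.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of samples."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

# Hypothetical ages for one discipline, by academic position.
assistant = [31, 35, 38, 42]
associate = [44, 48, 51, 55]
full = [58, 60, 63, 67]
h = kruskal_wallis_h([assistant, associate, full])
```

Under the null hypothesis, H is approximately chi-square distributed with k - 1 degrees of freedom (here k = 3 positions), so a large H leads to rejecting the hypothesis that age is distributed identically across positions.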
For the purposes of examining the same-sex collaboration ratio by age group, we divided our sample into these three categories: young scientists (aged 39 and younger), middle-aged scientists (aged 40-54) and older scientists (aged 55 and older), of which middle-aged scientists are the largest age group (45.79%) (Table 6). The proportion of males and females is almost equal among young scientists, but females are less than 30% of older scientists (see % column). Table 7 shows the distribution of the median value of the same-sex collaboration ratio by gender and age group. The median ratio for males slightly decreases with age (by 6 percentage points). In contrast, the same median ratio for females substantially increases with age (by 18 percentage points). While the ratio for females triples with age, it is still very low compared with that of males (the difference being 35 percentage points).

Fig. 3. The distribution of the same-sex collaboration ratio by gender. The gray area is the overall distribution for both genders.
The difference in collaboration patterns for young scientists by gender is interesting in view of previous literature about gender patterns of research collaboration. This strand of literature suggests that women tend to co-author with women (Potthoff & Zimmermann, 2017; Wang et al., 2019; Lerchenmueller et al., 2019), although this is not true in the Polish case. While half of young male scientists write at least 54% of their papers in collaboration with males, the same indicator for females is nine times lower (6.3%). Young males tend to collaborate with males, and young females tend not to collaborate with females. While 50% of young female scientists are characterized by the same-sex collaboration ratio at the level of 0.06, in the case of older females, the median ratio quadruples to 0.24: older females still tend to collaborate primarily with males. For all age groups (see the Total line in Table 7), the difference by gender in Polish science is substantial: while the median same-sex collaboration ratio for males is 0.5, the median for females is more than three times lower (0.15). (These results will be confirmed in a fractional logit regression analysis in section 4.6.) What is clear in the two panels in Fig. 3 is the predominance of extreme values (0 for no same-sex collaboration and 1 for exclusively same-sex collaboration) in individual publication portfolios. The total number of extreme values (0 and 1) is similar for both genders. The majority of collaborations are mixed-sex collaborations. A substantial proportion of collaborating male scientists (left panel, right peak) co-published mostly with males in the decade studied; a substantial proportion of collaborating female scientists (right panel, left peak), in contrast, tended not to co-author with females in the same period.
The distribution of the same-sex collaboration ratio for females is the mirror image of that for males. Apart from the two extreme values of 1 and 0, the distribution of the ratio in question for males is basically uniform. For females, a gradual decline in the ratio is clearly observed. Comparing the extremes, there are more females without same-sex collaboration than males for the same collaboration type; there are about three times more males who collaborate only with males compared with females who collaborate only with females.
When we examine academic positions, in a similar vein, the same-sex collaboration ratio by males decreases with the highest academic position reached ( Table 8 ). In contrast, the same ratio for females increases with academic positions, although its level is still very low for all females. While the median ratio level for females increases two and a half times when we move up the academic ladder, it is still much lower compared with that of males. While 50% of female assistant professors are characterized by the same-sex collaboration ratio at the level of 0.105, for female full professors, the ratio increases to 0.250. See a graphical summary for age groups and academic positions in Fig. 4 . The gender difference in collaboration patterns can be studied in more detail using boxplots and violin plots combined. The gender difference by age group ( Fig. 5 ) closely resembles the gender difference by academic position ( Fig. 6 ). Female scientists consistently, across the three age groups and across the three academic positions, tend not to collaborate with other females (compare the shapes for Ratio = 1, i.e., females collaborating only with females, across the age groups and academic positions for females). Note that the median shown in boxplots is much lower for each group for females than for males, and it increases for females with age; it is also much lower for female assistant and associate professors and lower for female full professors.
Inverse proportionality in collaboration patterns between males and females is visible for each age group and each academic position. In terms of within-sex variation, male scientists are more differentiated than female scientists (compare the height of the boxes in the two columns) for each age group and each academic position studied. Females, and especially young females and female assistant professors, tend not to collaborate with other females. As can be seen from Figs. 5 and 6 , generally, conclusions from a study of age groups resemble conclusions from a study of academic positions.
While above, we have studied three broad age groups, below, we focus on biological age as a numerical variable. The year-by-year approach illustrated by regression lines in Fig. 7 generally confirms the two opposite trends for both genders, at least until the age of 60 for males and for all ages for females. Interestingly, the generally downward trend in the same-sex collaboration ratio for male scientists is reversed for those aged 60 and above: the ratio for the oldest males increases. In contrast, for female scientists, the damped growth characteristic of all ages until about 60 turns into exponential growth for the oldest female scientists (a cut-off point of 70 is used, the standard retirement age for full professors). The dots in Fig. 7 represent the median value of the same-sex collaboration ratio for each year of age. Relatively high variation of median values for very young male scientists and no variation for very young female scientists (see the respective dots in both panels) is caused by the low numbers of scientists in these age groups. Thus, Hypotheses 2 and 3 are confirmed for males but not for females.

The same-sex collaboration ratio by academic discipline
Hypothesis 4. We would anticipate that the same-sex collaboration ratio is higher in male-dominated academic disciplines (confirmed).
First, we examined the correlation level between the mean same-sex collaboration ratio (ranging from 0 to 1) and the percentage of male scientists within the discipline (see Fig. 8). The correlation between the two variables is weak (r = 0.228, R² = 0.052); however, as the percentage of males increases, so does the mean same-sex ratio (see the red regression line). The disciplines fall into two categories: female-dominated (left of the vertical dotted line indicating 50%) and male-dominated (right of the line and on the line, by our definition; see Table 1 for the variables). The bubble size reflects the number of scientists. In five disciplines (CHEM, ENVIR, ECON, SOC, and HUM), the percentage of men is very close to 50%. The highest mean same-sex collaboration ratio is not correlated with the male and female distribution within a discipline: it is equally high for physics and astronomy (PHYS) and computer science (COMP), in which male participation exceeds 80%, as it is for pharmacology, toxicology, and pharmaceutics (PHARM) and biochemistry, genetics, and molecular biology (BIO), with male participation in the 30-40% range. At the same time, among the five gender-balanced disciplines (those with close to 50% male participation), social sciences, arts and humanities, and economics, econometrics, and finance (SOC, HUM, and ECON) exhibit a mean same-sex ratio of around 0.5, while chemistry (CHEM) exhibits a ratio of around 0.7.
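The link between the two reported figures is simple: in a regression with a single predictor, R² is the square of Pearson's r, and 0.228² ≈ 0.052. The sketch below shows an unweighted Pearson correlation in Python; the discipline-level values are invented for illustration, and whether the original analysis weighted disciplines by size is not stated in the text.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Unweighted Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical discipline-level data: % male scientists, mean same-sex ratio.
pct_male = [30, 45, 50, 70, 85]
mean_ratio = [0.70, 0.50, 0.55, 0.60, 0.72]

r = pearson_r(pct_male, mean_ratio)
r_squared = r ** 2  # share of variance explained in a one-predictor regression

# Consistency check on the values reported in the text.
reported_r = 0.228
reported_r_squared = round(reported_r ** 2, 3)
```

An R² of about 0.05 means that the gender composition of a discipline accounts for only about 5% of the between-discipline variance in the mean ratio.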
The same-sex collaboration ratio differs vastly by discipline and by gender. Previous research shows that as the fraction of female researchers in a discipline increases, women increasingly tend to publish with other women; also, the propensity of men to co-author with women is higher in disciplines with more women (Boschini & Sjögren, 2007, p. 339). A good way to visualize gender differences in the median same-sex collaboration ratio is through a heat map (the color palette in Table 9 changes from light blue for low values to deep blue for high values). In the case of COMP, ENG, and MATH, with their high overrepresentation of male scientists, the ratio for males is extremely high (and the median values reach the level of 1 or almost 1). That is to say, at least half of male scientists in these disciplines collaborate only with males. In COMP, ENER, ENG, HEALTH, PHYS, and VET, at least half of females do not collaborate with females at all (and the median values reach the level of 0 or almost 0). In contrast, in disciplines such as PHARM, PSYCH, and SOC, the median value for females is significantly higher than for males. The median level by ASJC discipline is also shown graphically in boxplots in Fig. 9. Thus, Hypothesis 4 is confirmed.

The same-sex collaboration ratio by institutional type
Hypothesis 5. We would expect that the same-sex collaboration ratio is higher in research-intensive universities (confirmed for males but not for females).
Previous literature indicates differences in gender homophily in research collaboration not only by discipline but also by institution. Therefore, we will test whether the same-sex collaboration ratio also differs by institutional type: we contrast the 10 research-intensive institutions with 75 other institutions in the national system. The 10 institutions are the IDUB (or "Excellence Initiative-Research University") institutions, which were selected for additional research funding for the 2020-2026 period. The IDUB institutions include both top Polish universities and polytechnic institutes (similar results were achieved for the top 10 institutions in terms of publication numbers overall and publication numbers within the Scopus 90th-99th journal percentiles).
For male scientists employed in the IDUB institutions, the ratio is high: the proportion of articles published only with males by the upper 50% of male scientists is at least 60% and is larger than the overall ratio for males in the system (see the Total line in Table 10: 50%). For female scientists, in contrast, the same proportion in the IDUB institutions is more than four times lower and is even lower than the overall ratio for females in the system. In other words, we reach the somewhat surprising conclusion that for males, the proportion of all-male collaboration in individual publication portfolios is higher in research-intensive institutions than the already high proportion for all institutions, while for females, the proportion of all-female collaboration is lower in research-intensive institutions than the already low proportion for all institutions.

Fig. 9. The same-sex collaboration ratio: distribution by discipline and gender.
In the Polish academic science system as a whole, the same-sex collaboration ratio for males is more than three times that for females (a finding which is confirmed by a fractional logit regression analysis in Section 4.6 below). Fig. 10 shows the gender difference in the median same-sex collaboration ratio by institutional type and gender in more detail using boxplots and violin plots combined.

Table 9
The median same-sex collaboration ratio by discipline and gender (shading: from the highest ratio in dark blue to the lowest ratio in light blue).

Table 10
The median of the same-sex collaboration ratio by institutional type and gender.

The distribution of the median ratio for females is basically the same in both institutional types, and the within-sex variation is much higher for males than for females, as indicated by the height of the boxplots. The difference between the median values for males and females is much larger in the case of research-intensive institutions; the median value for males is much higher in these institutions, as it is for females. This effectively means that in research-intensive institutions (see the top IDUB panel in Fig. 10), males as well as females are more likely to collaborate with males. Gender homophily is thus stronger for males and weaker for females in research-intensive institutions. In other institutions (see the bottom panel), the number of males collaborating exclusively with males and the number of males collaborating exclusively with females are equal; the number of females collaborating exclusively with males and the number of females collaborating exclusively with females are similar in both institutional types (see the large base on which the two right columns rest for female scientists in both panels). Thus, Hypothesis 5 is confirmed for males but not confirmed for females.

Fig. 10. The same-sex collaboration ratio: distribution by institutional type and gender (boxplots and violin plots combined).

Table 11
The median prestige level distribution (by percentile from 0-99, with the 99th percentile being the highest) of publications by major gender collaboration type and gender.

The same-sex collaboration ratio by journal prestige
Hypothesis 6. We would expect that the journal prestige level of mixed-sex publications is higher than that of same-sex publications for both male and female scientists (confirmed).
Both the quantity and quality of output in academia are relatively easily measured (with all standard limitations) using the Scopus database as articles are published in journals of different ranks. The scientists in our sample have their own unique individual publication portfolios with publications, translatable into average individual prestige via Scopus citation metrics. The prestige of each article in this portfolio is derived from the prestige of the journal in which it was published and is defined by the percentile rank ascribed annually to each academic journal within its ASJC discipline. Top journals, including the Journal of Informetrics and Scientometrics, are usually ranked in the upper 5% of journals.
Importantly, the citation-based percentile ranking system used by Scopus is being systematically used in Poland, for instance in a complicated system of indicators used first to select (in 2019) and then to additionally finance (in 2020-2026) 10 research-intensive Polish universities. We used the measure of average prestige, which represents the median prestige value for all publications written by a given scientist in the study period of 2009-2018 for three categories of publications (same-sex, mixed-sex, and solo publications). For journals for which the Scopus database did not ascribe a percentile rank, we have ascribed the percentile rank of 0; Scopus ascribes percentiles to journals in the 25th to 99th percentile range, with the highest rank being the 99th percentile.
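The prestige measure described above can be sketched in Python as follows. The function name, the category labels, and the example portfolio are hypothetical; the only rules taken from the text are the use of the median per collaboration type and the substitution of 0 for journals without a Scopus percentile.

```python
from collections import defaultdict
from statistics import median

def median_prestige_by_type(publications):
    """Median journal percentile per collaboration type for one scientist.

    `publications` is a list of (collaboration_type, percentile) pairs,
    where percentile is an integer rank or None when Scopus ascribes none.
    Missing percentiles are replaced by 0, as described in the text.
    """
    by_type = defaultdict(list)
    for collab_type, percentile in publications:
        by_type[collab_type].append(0 if percentile is None else percentile)
    return {t: median(values) for t, values in by_type.items()}

# Hypothetical portfolio of one scientist.
portfolio = [
    ("mixed-sex", 88),
    ("mixed-sex", 61),
    ("same-sex", 74),
    ("same-sex", None),  # journal without a Scopus percentile rank
    ("solo", 40),
]

prestige = median_prestige_by_type(portfolio)
```

Assigning 0 to unranked journals pulls the median down for scientists who publish often in such outlets, which is worth keeping in mind when comparing collaboration types.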
The median prestige level (in a range of 0-99) for all Polish publications written in same-sex and mixed-sex collaboration by gender does not differ much (Table 11): the median values for all-male publications and all-female publications by gender are almost identical (59.17 and 58.00, respectively). Also, the median value for mixed-sex collaborations does not differ significantly by gender. Both males and females, on average, regardless of the collaboration type, publish in journals with relatively low prestige. Articles written in mixed-sex collaboration are, on average, published in more prestigious journals than those written in same-sex collaboration and in much more prestigious journals than solo articles (see the Total line in Table 11).

Fig. 11. The prestige level distribution of publications (by Scopus percentile rank from 0-99, with the 99th percentile being the highest in prestige) by major collaboration type, gender, and discipline.
The distribution of the median journal prestige level by discipline and collaboration type (mixed, same-sex, and solo publications, separately for males and females) shows both common patterns and substantial variations. Generally, for each ASJC discipline ( Table 12 ), solo research is characterized by the lowest prestige level. BIO, CHEM, ENER, and PHARM belong to disciplines with the highest median prestige level, regardless of the collaboration type. Both mixed-sex and same-sex collaborations have higher average prestige levels than do solo articles.
The differences in prestige level by gender are as follows: for mixed-sex collaborations, they are marginal, but for same-sex collaboration, they are substantial (compare the same-sex collaboration columns for males and females in Table 12). Male-only collaborations have higher median prestige than do female-only collaborations, and this pattern is characteristic of a large number of disciplines. Males collaborating with males, on average, publish in more prestigious journals than do females collaborating with females. Solo research by females exhibits lower median prestige levels than does solo research by males in all but nine disciplines (including BIO, CHEMENG, ENER, ENG, MATER, MED, and PHARM). The median prestige level by ASJC discipline and gender is also shown graphically in the boxplots in Fig. 11 to go beyond the median values and to highlight intra-disciplinary cross-gender variability, with three separate panels for the three gender-defined collaboration types. Thus, Hypothesis 6 is confirmed.

A modeling approach: A fractional logit regression model
Hypothesis 7. In a fractional logit regression model, we would anticipate that individual-level independent variables are more influential than institutional-level independent variables in predicting the same-sex collaboration ratio (not confirmed).
Finally, we move from descriptive statistics and two-dimensional analysis to modeling, and we use a regression model for a fractional dependent variable: a fractional logit regression model (Papke & Wooldridge, 1996), designed for variables bounded between zero and one (as with our dependent variable: the same-sex collaboration ratio). Linear models to examine how a set of explanatory variables influences a given proportion or fractional response variable are not appropriate here (Ramalho, Ramalho, & Murteira, 2011, p. 19). In this model, no special data adjustments are needed for the extreme values of zero and one.
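To make the estimator concrete, the following Python sketch fits a fractional logit by maximizing the Bernoulli quasi-log-likelihood, with the conditional mean modeled as E[y | x] = 1 / (1 + exp(-xβ)). It is a toy illustration under stated assumptions, not the authors' implementation: the data are invented, there is a single binary predictor, plain gradient ascent replaces the optimizer of a statistical package, and no robust standard errors are computed.

```python
from math import exp

def sigmoid(z):
    """Logistic function mapping the linear predictor to the (0, 1) interval."""
    return 1.0 / (1.0 + exp(-z))

def fit_fractional_logit(X, y, lr=1.0, steps=5000):
    """Maximize sum(y*log(p) + (1-y)*log(1-p)) by gradient ascent.

    X: list of rows, each starting with 1.0 for the intercept.
    y: fractional responses in [0, 1]; exact zeros and ones are allowed.
    """
    n, k = len(X), len(X[0])
    beta = [0.0] * k
    for _ in range(steps):
        grad = [0.0] * k
        for row, target in zip(X, y):
            p = sigmoid(sum(b * v for b, v in zip(beta, row)))
            for j in range(k):
                grad[j] += (target - p) * row[j]  # score contribution
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

# Hypothetical same-sex collaboration ratios; second column: 1.0 = male.
X = [[1.0, 0.0]] * 4 + [[1.0, 1.0]] * 4
y = [0.0, 0.1, 0.2, 0.5,   # females, mean 0.2
     0.2, 0.5, 0.7, 1.0]   # males, mean 0.6

beta = fit_fractional_logit(X, y)
fitted_female = sigmoid(beta[0])
fitted_male = sigmoid(beta[0] + beta[1])
odds_ratio_male = exp(beta[1])
```

With a single binary predictor the fitted means reproduce the group means of the observed ratios, and the responses equal to exactly 0 and 1 enter the likelihood without any transformation, which is the property the paragraph above refers to.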
In our case, we have 24 ASJC disciplines represented in our 85 research-involved institutions. The number of employees and the percentage of female scientists vary in each of them; each discipline in each institution is either male-dominated (i.e., with exactly or more than 50% male scientists) or female-dominated (i.e., with more than 50% female scientists). We also have a set of 10 highly research-intensive institutions (termed IDUB institutions) and one containing the rest of them. Individual scientists are embedded in their institutions and in their disciplines, and both institutions and disciplines have their specific patterns of cross-gender collaboration. In some disciplines and institutions, same-sex collaboration is more prevalent than in others. For the sake of clarity, here is an example: a single observation here is not a male mathematician with individual features only, such as biological age and academic position. This male mathematician is also embedded in a highly research-intensive institution (variable: IDUB type) employing 2,000 teaching and research faculty (variable: number of employees) and publishing in the discipline of mathematics (variable: STEM discipline), which is male-dominated (variable: male-dominated discipline). Furthermore, in this institution, customarily, male mathematicians tend to have the habit of publishing with their male rather than female colleagues.
In the regression model, we also include the mean individual publication prestige percentile, which requires a clear explanation as it differs from the prestige attached to an individual article (and it may be considered to be a consequence of the collaboration rather than a cause of the collaboration). Collaboration, as defined in this paper, is considered to be a product rather than a process: collaboration between two scientists is viewed only through the proxy of the paper they co-authored and published. By our definition, every scientist in our dataset has his or her own, clearly defined mean individual publication prestige percentile, which is determined by the entirety of their Scopus-indexed publication output from the decade studied (each article is linked to its source or journal, with a clear Scopus-calculated highest journal percentile). Consequently, the mean individual publication prestige percentile for each scientist (with the 99th percentile being the highest) is an individual-level predictor: it is a proxy for the average prestige of their maximally decade-long publication history. It is higher for scientists publishing exclusively in top journals and lower for those publishing in a combination of top and second-tier journals or in second-tier journals only, as ranked by Scopus. As our dependent variable is fractional (ranging from zero to one), we estimate a fractional logit regression model. We estimate odds ratios for conducting same-sex collaboration in journal publishing, i.e., publishing with scientists of the same sex. We calculate the same-sex collaboration ratio as the share of same-sex collaborative articles among all collaborative articles in each scientist's individual publication portfolio. Using the fractional logit approach, we estimated the probability of conducting same-sex collaboration.

Table 12
The median prestige level for publications (by Scopus percentile ranks from 0-99, with the 99th percentile being the highest in prestige) by major collaboration type, gender, and discipline (shading: from the highest median prestige level in dark blue to the lowest median prestige level in light blue).
The first type of independent variable captures scientists' individual demographic, biographical, and bibliometric characteristics: gender, biological age, mean individual publication prestige level within the study period of 10 years (or less), current academic position, and the type of the dominant Scopus-defined ASJC discipline (STEM or non-STEM). The second type of independent variable captures three major institutional characteristics: a binary variable indicating employment in an IDUB or non-IDUB institution (being employed full-time in one of the 10 highly research-intensive institutions or not), the number of scientists employed in the author's institution (in FTEs in 2018), and publishing in a male-dominated discipline or not. The absence of collinearity among the independent variables was confirmed through an analysis of VIF coefficients (see Table 17 in Data Appendices). Although the correlation table of independent variables shows a pairwise correlation of moderate strength in some cases (e.g., IDUB and the number of employees; full professor and age), the vector of independent variables is not characterized by significant collinearity, as indicated by the VIF coefficients (Table 13). The correlation between these pairs is largely controlled by other variables in the model.
The distribution of residuals in our dataset was not normal (the K-S normality test statistic is equal to 0.104, with a p-value less than 0.001). Normality of the residuals would allow standard statistical inference on the model properties, since the usual significance tests assume it. To address this departure from the assumptions, robust standard errors were estimated, and the significance tests for individual coefficients in the model were based on these estimates. A further step in the analysis of the residuals distribution indicates the lack of influential observations (the standardized residuals stay within ±3 standard deviations). Consequently, the conclusions drawn from our model are valid. Table 13 presents the results of a fractional logit model. All coefficients are statistically significant and have a significant impact on the same-sex collaboration ratio. However, the parameters of the model thus estimated cannot be interpreted naturally. As Long and Freese (2006) show, for a natural and transparent interpretation of the influence of independent variables on the same-sex collaboration ratio, average marginal effects should be estimated for the model (see Table 15). The greatest influence on the ratio is exerted by gender and publishing in male-dominated disciplines. Being a male scientist increases the ratio on average by 0.2 (which means that male scientists on average have a same-sex collaboration ratio 20 percentage points higher than female scientists, ceteris paribus). Publishing in a male-dominated discipline increases this ratio on average by 0.167. Working in a highly research-intensive (IDUB type) institution increases the ratio on average by 0.048, and the influence of age is relatively weak: every ten years of age results in an average increase of the ratio by 0.01.
Being a full professor decreases the ratio on average by 5.6 p.p., and being an associate professor decreases it by 1.4 p.p. Publishing mainly in STEM disciplines also decreases the ratio, on average by 3.9 p.p. The weakest, though still significant, predictors of the same-sex collaboration ratio are journal prestige and the number of employees in the scientist's institution.
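How an average marginal effect is obtained from logit coefficients can be sketched as follows. This is one common definition, offered as an illustration rather than as the exact procedure behind Table 15; the coefficients, the data rows, and the function names are hypothetical. For a binary predictor the effect is the mean difference in predicted ratios when the variable is switched from 0 to 1 for every observation; for a continuous predictor it is the mean of the derivative β·p·(1 − p).

```python
from math import exp, log

def sigmoid(z):
    """Logistic function mapping the linear predictor to the (0, 1) interval."""
    return 1.0 / (1.0 + exp(-z))

def predict(beta, row):
    """Predicted ratio for one observation."""
    return sigmoid(sum(b * v for b, v in zip(beta, row)))

def ame_binary(X, beta, j):
    """Mean change in prediction when column j is set to 1 versus 0."""
    total = 0.0
    for row in X:
        on = list(row)
        off = list(row)
        on[j], off[j] = 1.0, 0.0
        total += predict(beta, on) - predict(beta, off)
    return total / len(X)

def ame_continuous(X, beta, j):
    """Mean derivative of the prediction with respect to column j."""
    total = 0.0
    for row in X:
        p = predict(beta, row)
        total += beta[j] * p * (1.0 - p)
    return total / len(X)

# Hypothetical coefficients: intercept and a "male" dummy, chosen so that
# predicted ratios are 0.2 for females and 0.6 for males.
beta = [log(0.25), log(6.0)]
X = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 1.0]]

ame_male = ame_binary(X, beta, 1)
```

An effect computed this way is read directly on the scale of the ratio, which is why the text can state it in percentage points, whereas the raw coefficients are on the log-odds scale.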
Thus, Hypothesis 7 is not confirmed: gender homophily is influenced by both individual-level and institutional-level predictors, rather than by individual-level ones alone. The two most influential predictors of the same-sex collaboration ratio are both gender-related: being male and working in a male-dominated discipline. They are followed by being employed in a research-intensive university.
As R² is relatively low (R² = 0.159), there are other, unknown predictors of conducting same-sex collaboration in academic publishing. Other predictors are known predominantly from qualitative literature about research collaboration (see, especially, Sonnert & Holton, 1996; Lerchenmueller et al., 2019); however, the data needed to use them as explanatory variables in our model are not available. Other predictors determining gender homophily in research and in gender network structures more generally include individual, disciplinary, institutional, cultural, political, social, and economic factors, as previous national qualitative and quantitative studies indicate (see, e.g., McPherson et al., 2001; McDowell & Smith, 1992, who included 178 US male and female economists with their full publication records; Ibarra, 1992, who conducted 79 interviews with employees of an advertising firm; Ibarra, 1997, who conducted 63 interviews with middle-level US managers; Kegen, 2013, who studied two German Excellence Initiative institutions; and Feenay & Bernal, 2010, who conducted a survey of scientists and engineers from 151 US research universities). Especially useful seem to be the various studies from systems known for their high female participation in academic science (Halevi, 2019; see, especially, the cross-national data in Elsevier, 2020; Elsevier, 2018).
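The link between a fractional logit model and an average marginal effect can be illustrated with a minimal sketch. It is not the estimation code used for this study: the data are synthetic, the model has a single binary regressor (a hypothetical "male" indicator) instead of the full set of predictors, robust standard errors are omitted, and the function names are assumptions introduced only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic link mapping a linear predictor to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_fractional_logit(x, y, iterations=25):
    """Newton-Raphson fit of E[y | x] = sigmoid(b0 + b1 * x), with y in [0, 1].

    Maximizes the Bernoulli quasi-likelihood for a fractional outcome.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p              # score for the intercept
            g1 += (yi - p) * xi       # score for the slope
            h00 += w                  # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic same-sex collaboration ratios (assumed values, not study data).
male = [1, 1, 1, 1, 0, 0, 0, 0]
ratio = [0.5, 0.8, 0.6, 0.7, 0.1, 0.2, 0.0, 0.3]

b0, b1 = fit_fractional_logit(male, ratio)

# Average marginal effect of the binary regressor: the mean difference between
# the predicted ratio with male = 1 and with male = 0. With no other
# regressors, this difference is the same for every observation.
ame = sum(sigmoid(b0 + b1) - sigmoid(b0) for _ in male) / len(male)
print(round(ame, 3))
```

In this toy setting, the average marginal effect equals the difference between the two group means of the ratio; it is expressed on the scale of the ratio itself, which is why such effects can be read in percentage points while the raw logit coefficients cannot.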

Summary of findings, discussion, and conclusions
Our research differs from previous studies in several respects. First, we examined every internationally publishing Polish male and female scientist and the entirety of internationally visible (Scopus-indexed) Polish academic knowledge production for a decade (2009-2018). Second, owing to the characteristics of the database used, we had 100% gender determination for all scientists in the system (rather than probability thresholds in gender determination). Third, we defined what we termed the "individual publication portfolio" for every Polish scientist to examine the same-sex collaboration ratio at the level of the individual scientist (the idea of the individual publication portfolio can be used for other research questions: see our research on gender disparities in international research collaboration in Kwiek & Roszka, 2020). Fourth, our unit of analysis was the gender-defined individual scientist rather than the individual publication, with his or her specific distribution of male/female authorships.
Finally, and most importantly, we used a comprehensive, fully integrated biographical, administrative, publication, and citation database ("The Polish Science Observatory" database, which we constructed by merging the national registry of all 99,935 Polish scientists with the Scopus dataset comprising all their publications in 2009-2018). Our sample (N = 25,463) included all the university professors holding at least a doctoral degree and employed in 85 research-involved universities, grouped into 27 disciplines with all their Scopus-indexed publications (158,743 articles).
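The idea of computing a same-sex collaboration ratio from an individual publication portfolio can be sketched as follows. This is a minimal illustration rather than the procedure applied to the Observatory database: the `Article` structure and the `same_sex_ratio` function are hypothetical, and the sketch assumes that solo-authored articles are excluded from the denominator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Article:
    # Genders of all authors on the article, including the focal scientist.
    author_genders: List[str]


def same_sex_ratio(own_gender: str, portfolio: List[Article]) -> Optional[float]:
    """Share of co-authored articles written exclusively with same-gender authors.

    Assumption: solo articles do not enter the denominator.
    Returns None when the portfolio contains no co-authored article.
    """
    collaborative = [a for a in portfolio if len(a.author_genders) > 1]
    if not collaborative:
        return None
    same_sex = sum(
        all(g == own_gender for g in a.author_genders) for a in collaborative
    )
    return same_sex / len(collaborative)


# Hypothetical portfolio of one female scientist.
portfolio = [
    Article(["F", "M"]),       # mixed-sex collaboration
    Article(["F", "F", "F"]),  # same-sex collaboration
    Article(["F"]),            # solo article, ignored under the assumption above
    Article(["F", "M", "M"]),  # mixed-sex collaboration
    Article(["F", "M"]),       # mixed-sex collaboration
]

ratio = same_sex_ratio("F", portfolio)
print(ratio)  # 0.25: one of the four co-authored articles is all-female
```

Because the ratio is computed per scientist, the unit of analysis is the individual rather than the publication.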
While most previous literature highlights that women are much more likely to have a female than a male co-author (in three top economic journals, Boschini & Sjögren, 2007; in life sciences, Holman & Morandin, 2019; among computer scientists, Jadidi et al., 2018; and among industrial-organizational psychologists, Fell & König, 2016), or a female rather than a male collaborator in research projects (Lerchenmueller et al., 2019), leading to excessive gender homophily in female publishing, our findings, which are based on a large national sample, do not support this gender disparity in collaboration patterns. These findings may indicate the uniqueness of massively expanded and relatively gender-balanced Central European science systems, with female participation in the academic labor force reaching 50%, testifying to the significance of proportions, or minorities and majorities, for academic life (Kanter, 1977).
Having an integrated dataset at our disposal, we were able to examine the same-sex collaboration ratio across several dimensions, previously usually either studied separately and with only a selection of our variables or studied based on small datasets. This research goes beyond traditional bibliometric studies of gender-based homophily in research collaboration by combining the following: (1) biographical and administrative data routinely inaccessible to large-scale studies, namely, the biological age of all scientists (rather than a proxy of first publication) and the three stages of their academic careers (assistant, associate, and full professorships), as well as (2) data that is routinely accessible in bibliometric studies, such as journal prestige, academic disciplines, and institutional type.
Previous research tended to be restricted either (1) by focusing on selected institutions (Kegen, 2013) or selected disciplines (McDowell & Smith, 1992; Lerchenmueller et al., 2019; Fell & König, 2016; Maddi et al., 2019), sometimes with disciplines represented by their top journals (Potthoff & Zimmermann, 2017; Boschini & Sjögren, 2007), or (2) by being large in scale but focused solely on bibliometric data (Huang et al., 2020; Wang et al., 2019; Ghiasi et al., 2015; Larivière et al., 2013). This research, in contrast, reveals the opportunities that large-scale, comprehensive national databases may provide (such databases are currently available for Norway and Italy; see Abramo, Aksnes, & D'Angelo, 2020, who compared the performance of Norwegian and Italian professors and verified the feasibility of applying their "research efficiency indicator" to the two systems). Although our "Observatory" database is not an example of the Current Research Information System (CRIS) as a data source as recently defined by Sivertsen (2019), new Polish databases (such as POL-on 2.0 and PBN, which are national registries of all higher education institutions, publications, and scientists) are moving in the CRIS direction.
Our results show that in the Polish academic science system as a whole, the same-sex collaboration ratio for males is more than three times that for females (a finding which is confirmed by a fractional logit regression analysis: it is on average 20 percentage points higher for males). The ratio for females to collaborate with females and for males to collaborate with males (or gender homophily in publishing patterns) showed clear patterns in accordance with biological age and academic seniority: across all age groups, female scientists tend to collaborate with male scientists, and male scientists also tend to collaborate with male scientists. All-female collaboration, often discussed in literature (Boschini & Sjögren, 2007; McDowell & Smith, 1992; McDowell, Singell, & Stater, 2006), is marginal, and all-male collaboration is pervasive. The gender patterns in publishing are stable not only across age groups but also across academic positions. Both males and females, on average, regardless of the gender-defined collaboration type (same-sex, mixed-sex, or solo publications), publish in journals of relatively low prestige. However, articles written in mixed-sex collaboration are, on average, published in more prestigious journals than those written in same-sex collaboration (which is consistent with previous literature; see Campbell, Mehtani, Dozier, & Rinehart, 2013; see the prestige economy of top journals in Kwiek, 2021).
The difference in collaboration patterns for young scientists (an age group with equal participation of males and females in our sample) by gender is especially interesting in view of previous literature about gender patterns of research collaboration. Previous research suggests that women tend to co-author with women (Potthoff & Zimmermann, 2017; Wang et al., 2019; Lerchenmueller et al., 2019), which is not the Polish case.
While half of young male scientists write at least 54% of their papers in collaboration with males, the same indicator for females (writing with females) is nine times lower (6.3%). So, while young males tend to collaborate with males, young females tend not to collaborate with females. While 50% of young female scientists are characterized by a same-sex collaboration ratio of 0.06 or lower, in the case of older females, the median ratio quadruples to 0.24: older females still tend to collaborate primarily with males. For all age groups, the difference by gender in Polish science is substantial: while the median same-sex collaboration ratio for males is 0.5, the median for females is only 0.15. This finding is not in line with previous research, which generally shows that female scientists exhibit stronger gender homophily than male scientists (Jadidi et al., 2018) and that females collaborate more often with females than males with males (Kegen, 2013). Gender homophily in team formation (Boschini & Sjögren, 2007) in Poland seems to occur with male scientists but not with female scientists. One explanation might be along the lines of the gender and competition theme introduced in Section 2: younger females, feeling more "under the magnifying glass" and being less "aggressive, combative, and self-promoting" in seeking higher visibility (Sonnert & Holton, 1996, pp. 67-69), tend to co-author with males rather than with other females because males are viewed as more deeply embedded in science. Academic norms may also be viewed as influencing publishing patterns, including, for instance, predominantly same-sex publishing for young male scientists and predominantly mixed-sex publishing for young female scientists (considering that the availability of male and female colleagues under 40 in the system is similar).
To verify this hypothesis further, we would need the biological age and academic position of all of the international co-authors, not only the Polish ones, which could be represented only via different proxies.
The gender difference in the same-sex collaboration ratio by age group closely resembles the gender difference by academic position. Male scientists in general tend to collaborate with males; female scientists consistently, across the three age groups and across the three academic positions, tend not to collaborate with other females. Inverse proportionality in collaboration between the two genders is characteristic of each age group and each academic position.
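The comparison of median same-sex collaboration ratios across gender and age groups can be sketched in a few lines. The records below are synthetic values chosen only to illustrate the computation, not data from our sample, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median

# (gender, age_group, same_sex_ratio): synthetic illustrative records.
records = [
    ("M", "young", 0.60), ("M", "young", 0.50), ("M", "young", 0.40),
    ("F", "young", 0.00), ("F", "young", 0.10), ("F", "young", 0.05),
    ("M", "older", 0.50), ("M", "older", 0.45), ("M", "older", 0.55),
    ("F", "older", 0.20), ("F", "older", 0.30), ("F", "older", 0.25),
]


def median_ratio_by_group(rows):
    """Median same-sex collaboration ratio for each (gender, age group) cell."""
    groups = defaultdict(list)
    for gender, age_group, ratio in rows:
        groups[(gender, age_group)].append(ratio)
    return {key: median(values) for key, values in groups.items()}


medians = median_ratio_by_group(records)
for key in sorted(medians):
    print(key, medians[key])
```

The median is used rather than the mean because individual ratios are bounded by 0 and 1 and heavily skewed, with many scientists at exactly 0 or 1.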
The year-by-year approach we used generally confirms opposite trends for the two genders: the downward trend in the same-sex collaboration ratio for male scientists stands in sharp contrast to the upward trend for female scientists. In the specific Polish case, age and academic positions are strongly correlated as the principle of "up or out" has not been operative in the system for at least three decades.
We have examined the same-sex collaboration ratio across all disciplines. Unlike most previous studies, we compared male-dominated disciplines with female-dominated disciplines. Our research supports the finding from previous research that as the fraction of female researchers in a discipline increases, females increasingly tend to co-author with other females (Boschini & Sjögren, 2007). In the case of the male-dominated fields of computer science, engineering, and mathematics, the same-sex collaboration ratio for males is prodigious: at least half of male scientists in these disciplines collaborate exclusively with males. In computer science, engineering, health professions, and physics and astronomy, at least half of females do not collaborate with females at all. In contrast, in several female-dominated disciplines (e.g., social sciences and psychology), the median value of same-sex collaboration for females is significantly higher than for males.
The same-sex collaboration ratio also differs by institutional type. We contrasted 10 highly research-intensive institutions with 75 other institutions. An interesting conclusion is that for males, the proportion of all-male collaboration in individual publication portfolios is higher in research-intensive institutions than the already high proportion for all institutions, while for females, the proportion of all-female collaboration is lower in research-intensive institutions than the already low proportion for all institutions. Males in research-intensive institutions are even more likely to collaborate with males, and females are even less likely to collaborate with females. Gender homophily in research-intensive institutions is thus stronger for males and weaker for females than in the rest of the higher education system, which might suggest that a stronger institutional research focus generally induces collaboration with male scientists.
Finally, using a fractional logistic regression approach, we estimated the strength and direction of predictors of conducting same-sex collaboration. The model showed that the same-sex collaboration ratio for male scientists is on average 20 percentage points higher than that for female scientists and that working in a male-dominated discipline increases the ratio by 16.7 p.p.; age slightly increases the ratio (which is in line with our findings from two-dimensional analyses for females but not for males). Being employed in a highly research-intensive institution also increases the ratio, by 4.8 p.p. Finally, individual-level factors do not emerge from this model as more influential than institutional-level factors: both types of factors matter. Being male and working in a male-dominated discipline are the two most influential predictors of the same-sex collaboration ratio, followed by working in a research-intensive university.
Male-female collaboration practices in research were tested against the homophily principle: our findings indicate that similarity indeed breeds connection between individual scientists and structures academic publishing ties. However, in the Polish case, this is true only for male scientists. Gender-based homophily has substantial implications for academic careers, with the citation measure being increasingly used as a "reward currency in science" that often underlies decisions on all major aspects of an academic career (Ghiasi et al., p. 1519). While forming collaborative research teams (perhaps more intuitively and resulting from the dominant social norms in academia rather than from solid individual publishing strategies), Polish female scientists tend not to publish with other females and seem to prefer male co-authors (perhaps viewing male scientists as hubs in collaborative science attracting more attention, with power and resources such as the supervision of doctoral students). This is the case not only for young female scientists (who may have had predominantly male mentors in their doctoral programs), but for all age groups. Paradoxically, this may in time contribute to reducing the gender productivity, citation, and promotion gaps in Polish science, since previous global literature suggests that these gaps may widen if females excessively co-author and form professional networks with females, especially in male-dominated disciplines (Maliniak et al., 2013). However, more detailed studies, especially those based on surveys and interviews, would be needed to exclude the possibility that female scientists feel coerced to add male co-authors as part of the global problem of authorship manipulation in academic research (Fong & Wilhite, 2017).
Future research avenues would include, first, moving to a global study: from our "The Polish Science Observatory" dataset to our parallel "The OECD Science Observatory" dataset (with a complete set of metadata pertaining to all 19.3 million articles produced in the same study period of 2009-2018 in 1,674 research-active institutions located in 40 OECD economies, with their 8.7 million unique authors). A global account would allow examining gender-based homophily from a comparative cross-national perspective. Second, future research should also include moving from a cross-sectional study in which "individual publication portfolios" come from a single decade to a longitudinal study in which the portfolios come, for instance, from the 1990s, 2000s, and 2010s, and are compared, thereby revealing cross-national and global trends in man-woman collaboration over time and from a more historical perspective.

Author contributions
Marek Kwiek: Conceived and designed the analysis; Collected the data; Contributed data or analysis tools; Performed the analysis; Wrote the paper.
Wojciech Roszka: Conceived and designed the analysis; Contributed data or analysis tools; Performed the analysis; Wrote the paper.

Table 16
Structure of the sample, all Polish internationally visible university professors, by gender, age group, academic position, and discipline (by type: ASJC, STEM and non-STEM, female-dominated and male-dominated), presented with column and row percentages (young scientists are those aged 39 and younger, middle-aged those aged 40-54, and older those aged 55 and over).