The fraternal birth-order effect as a statistical artefact: convergent evidence from probability calculus, simulated data, and multiverse meta-analysis

The fraternal-birth order effect (FBOE) is a research claim which states that each older brother increases the odds of homosexual orientation in men via an immunoreactivity process known as the maternal immune hypothesis. Importantly, older sisters supposedly either do not affect these odds, or affect them to a lesser extent. Consequently, the fraternal birth-order effect predicts that the association between the number of older brothers and homosexual orientation in men is greater in magnitude than any association between the number of older sisters and homosexual orientation. This difference in magnitude represents the main theoretical estimand of the FBOE. In addition, no comparable effects should be observable among homosexual vs heterosexual women. Here, we triangulate the empirical foundations of the FBOE from three distinct, informative perspectives, complementing each other: first, drawing on basic probability calculus, we deduce mathematically that the body of statistical evidence used to make inferences about the main theoretical estimand of the FBOE rests on incorrect statistical reasoning. In particular, we show that throughout the literature researchers ascribe to the false assumptions that effects of family size should be adjusted for and that this could be achieved through the use of ratio variables. Second, using a data-simulation approach, we demonstrate that by using currently recommended statistical practices, researchers are bound to frequently draw incorrect conclusions. And third, we re-examine the empirical evidence of the fraternal birth-order effect in men and women by using a novel specification-curve and multiverse approach to meta-analysis (64 male and 17 female samples, N = 2,778,998). When analyzed correctly, the specific association between the number of older brothers and homosexual orientation is small, heterogenous in magnitude, and apparently not specific to men. In addition, existing research evidence seems to be exaggerated by small-study effects.


S1.1 Jones and Blanchard (1998)
The very first meta-analysis (Jones & Blanchard, 1998) introduced two new measures of relative birth order, the fraternal and sororal indices. The fraternal index (FI) can be defined as an individual's number of older brothers divided by their number of older brothers plus the number of younger brothers; i.e., the proportion of an individual's older brothers relative to the individual's total number of brothers. By analogy, the sororal index (SI) is equivalent to an individual's proportion of older sisters relative to their total number of sisters.
With respect to the notation introduced in the main text, the FI and SI for the th individual are given by FI = #OB #All Brothers and SI = #OS #All Sisters . Jones and Blanchard (1998) stated that, assuming the absence of the FBOE, the mean FI and SI should be 0.5 for both homosexual and heterosexual men. Hence, under the FBOE, the mean FI in homosexual men should shift towards 1, while the mean FI in heterosexual men should shift towards 0. But the magnitude of this shift towards 0 in heterosexual men should be only a fraction of the magnitude of the shift towards 1 in homosexual men, due to homosexual men making up only approximately 2% of the male population (Cantor et al., 2002). Jones and Blanchard (1998) observed this pattern of shifts in the FIs across 872 homosexual and 2,115 heterosexual men taken from nine samples (see Table 5 in the main text for samples which were included). Such a group difference in the mean FIs would not only be expected under the FBOE, but also under a more general older sibling effect. That is, if homosexual men had a greater number of both older brothers and sisters than heterosexual men. In order to show that the difference between homosexual and heterosexual men in the mean FI is driven solely by by a greater number of older brothers, but not by a greater number of older siblings overall, one could compare the mean paired differences, the difference-in-difference, of FI-SI between groups. Jones and Blanchard (1998) found this difference-in-difference to be 0.036. Given the reported summary statistic of (3165) = 2.01, we were able to construct an approximate 95% confidence interval CI of [.001, .072], which includes a wide range of compatible values.
The samples included in Jones and Blanchard (1998) did not correspond to the complete set of studies which had already been published by 1998 (the year of publication). At least three separate samples from two studies (Blanchard & Bogaert, 1998;Bogaert et al., 1997) were omitted. Both of these studies were co-authored by Blanchard, thus raising the question as to why they were not included in the meta-analysis.
Since then, a number of further studies Bozkurt et al., 2015;Gómez-Gil et al., 2011;Schagen et al., 2012;Skorska et al., 2020;VanderLaan & Vasey, 2011;Vasey & VanderLaan, 2007) reporting fraternal and/or sororal indices (or some modified version thereof) have been published. However, none of these compared the mean paired difference of the fraternal and sororal indices between homosexual and heterosexual men. An update to Jones and Blanchard's meta-analysis is impeded by the non-reporting of standard deviations for FIs and SIs by Jones and Blanchard (1998), as well as the non-reporting of correlations between fraternal and sororal indices in all samples mentioned in the main text (Table 5). Beyond that, the fraternal and sororal indices once again are ratios. This complicates their statistical analysis to an unnecessary degree (see our discussion about the use of ratios in the main text in Part I).
Here, we also present t tests for both the between-group difference in the mean FI and the difference in the mean FI-SI for each of the 18,000 replicates in the simulation study of the main text in Part II. As can be seen in Figure S1, the FI behaved just like the MROB, the MPOB, the OBOR, and the other models and measures, which do not adjust for the number of older siblings (see Figure 2 in the main text). Used as a criterion for the existence of the FBOE, the mean FI would bring about false-positive decisions at a much greater frequency (close to 100% for some conditions) than expected with a significance level of 5%. Figure S1. Results of the difference between the fraternal index (FI; top) and the paired difference of fraternal and sororal indices (FI-SI; bottom) across all 18,000 simulations of homosexual and heterosexual participants (see main text for the description of the simulation in Part II). Points depict the means of the differences (horizontal axis). The endpoints of the error bars depict the .025 and .975 quantiles of the distributions of differences. The column Sig. indicates the proportion of differences with p < .05 (twotailed).

S1.2 Blanchard (2004)
The second meta-analysis (Blanchard, 2004) contained five additional samples (see Table 5 in the main text). Instead of contrasting FIs and SIs (Jones & Blanchard, 1998), Blanchard (2004) treated the samples as equally weighted single observations and used the mean number of older brothers, older sisters, younger brothers, and younger sisters in each sample as predictor variables in a logistic regression model, with the sexual orientation of the groups as the outcome variable. The p value of the change in the logistic regression's deviance statistic associated with the removal of each predictor separately was used for assessing the effect of each predictor. Only the removal of the mean number of older brothers from the model yielded a statistically significant change in the model's deviance. Hence, Blanchard (2004) concluded that the number of older brothers must have increased the odds of homosexual orientation. Blanchard (2004) reported neither regression coefficients nor shared the data used in this meta-analysis. We reconstructed Blanchard's (2004) analysis using the data of the 14 samples in (Blanchard, 2004) as reported in the appendix of the third meta-analysis (Blanchard, 2018a).
The file blanchard_2018_append.csv contains the total number of older brothers, older sisters, younger sisters, and younger brothers for the 30 samples as reported by in the third meta-analysis (Blanchard, 2018a). Note that the data reported by Blanchard (2018a) are slightly different from the data we used in our meta-analyses in the main text, as we excluded bisexual individuals whenever possible.
We next fitted the same models as described by Blanchard (2004). First, the baseline model predicting the samples' sexual orientation from the mean number of older brothers, older sisters, younger brothers, and younger sisters. Notice that R returns a warning, indicating that some fitted values are 0 or 1.
mod_base <-glm( group~ob_avg+os_avg+yb_avg+ys_avg, family = binomial(link = "logit"), data = blanchard_2004 ) Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred We ignore this issue for now, since our goal was to reproduce the analyses. Next, Blanchard (2004) defined a second model omitting the mean number of older brothers from the list of predictors and compared this second model to the baseline model using the p value associated with the change in the deviance from the baseline model.  The difference between the two models' deviance statistics is almost identical to that reported by Blanchard (2004). Here, we observe a value of 11.77, Blanchard reported a value of 11.88. Blanchard (2004) did not report regression coefficients for any of the models. An analysis of deviance reveals very little about the size and direction of the association between outcome and predictors. The regression coefficients for the baseline model were: Call: glm(formula = group ~ ob_avg + os_avg + yb_avg + ys_avg, family = binomial(li nk = "logit"), data = blanchard_2004) According to the above results, the estimated regression coefficient associated with the mean number of older brothers (ob_avg) is 142 lnORs, which corresponds to an OR of 4.08 × 10 61 . An estimate this large clearly indicates that entire analysis cannot be trusted.

S1.3 Blanchard and VanderLaan (2015)
In the third meta-analysis, Blanchard and Vanderlaan (2015) aggregated 14 samples, which were neither collected nor analysed by themselves or any of their collaborators. The study was described as a "simple meta-analysis that would be as free from our own potential unconscious biases as possible" (Blanchard & VanderLaan, 2015, p. 1506). Yet, this metaanalysis excluded the sample from a national cohort study with more than 2 million participants (Frisch & Hviid, 2006), which assessed sexual orientation via the registered sex of a person's spouse (i.e., used same-sex marriage as an indicator of homosexual orientation). Notably, said study (Frisch & Hviid, 2006) concluded to have found no evidence for a greater number of older brothers in homosexual men, compared to heterosexual men. Blanchard and VanderLaan (2015) reported a positive sign for 13 out of the 14 comparisons between homosexual and heterosexual men in the included studies, meaning that the average number of older brothers was greater for homosexual than heterosexual men. Testing this number against the null hypothesis of an equal probability of positive and negative signs for these comparisons yielded a p-value of .002, which is well below the significance threshold of .05. For the respective average number of older sisters, only 10 out of 13 comparisons had a positive sign (only 13 samples reported the mean number of older sisters). This number was not statistically significant, p = .097. Based on this pattern of statistically significant and statistically not significant effects, Blanchard and VanderLaan (2015) concluded that homosexual men, on average, had more older brothers than heterosexual men, whereas no such difference existed with respect to the average number of older sisters. However, the lack of statistical significance does not imply the absence of an (even similar-sized) effect with respect to older sisters here. It is yet another example of the "the-difference-between-significant-and-not-significant-is-not-itself-statisticallysignificant"-fallacy (Gelman & Stern, 2006). Observing that one effect is significantly different from the point-null effect and that a second effect is not significantly different from the point-null effect, does not necessarily indicate that the two effects are themselves significantly different from each other. If anything, Blanchard and VanderLaan (2015) provided evidence for a more general older sibling effect (i.e., homosexual men have more older brothers and sisters than heterosexual men; the exact same point had already been raised by Gelman and Stern, 2006, who used the analysis in Blanchard and Bogaert, 1996a, to illustrate this fallacy). Even if Blanchard and VanderLaan's interpretation of the sign test had been correct, their analysis would have provided no information about the magnitude of the supposed greate number of older brothers in homosexual men. Furthermore, the application of a sign test implies that each study should be treated as containing the same amount of information (i.e., each study carries the same weight), an assumption which cannot be justified when aggregating over differently-sized studies (Hedges & Olkin, 1985). A properly weighted meta-analysis could have led to a vastly different conclusion.

S1.4 Blanchard, Krupp, VanderLaan, Vasey, and Zucker (2020)
For the sixth meta-analysis, Blanchard et al. (2020) put forth yet another measure for quantifying the supposed greater number of older brothers, originally proposed by Khovanova (2020). The authors' implementation of Khovanova's suggestion may be described in two steps. First, the sampling space is restricted to men from two-son sibships (i.e., these men reported either a single older brother or a single younger brother and any number of sisters). Second, the following odds ratio is computed and subjected to the usual inferential machinery: where #Second and #First, refer to the number of second-and first-born sons. We refer to this OR as the second-to-first-born OR. Over a set of 14 samples, the authors reported a mean OR 21 of 1.38. Thus, there seemed to be an excess of second-born sons among homosexual men, who have exactly one brother but any number of sisters. This was in turn interpreted as evidence for the FBOE. Again, this result carries little information about the FBOE, as it fails to account for the number of older sisters (i.e., the older sibling effect).
Moreover, and in line with Parts I and II of the main text, the second-to-first-born OR cannot distinguish between a genuine older brother effect and a more general older sibling effect. This can be seen clearly from the results obtained by fitting the OR 21 to the simulated data reported in the main text ( Figure S2; See main text for the description of simulation study in Part II).

S2 Explanations of Equations in Part I
Equation 1 states that if all of the siblings reported by homosexual and heterosexual participants are treated as the sample, the OBOR is given by the odds of observing an older brother among all siblings of homosexual individuals divided by the odds of observing an older brother among all siblings of heterosexual individuals. Per definition (e.g., Blitzstein & Hwang, 2019, p. 53), the odds of observing an older brother in either group are given by the probability of observing an older brother among all siblings divided by the probability of not observing an older brother among all siblings (i.e., the probability of observing an older sisters, a younger brother or a younger sister).

Equation 2
is just an algebraic transformation of equation 1 (see Supplement S3 for a derivation). It states that if the odds of observing an older brother among all of the siblings of homosexual individuals are greater than those same odds for heterosexual individuals, then the probability of observing an older brother among the siblings of homosexual individuals must also be greater than the same probability for heterosexual individuals, and vice versa.
The equation in-between equation 2 and equation 3 (has no number) states that in addition to classifying the siblings in both groups (the homosexual and the heterosexual group) into whether they are older brothers or not, we can also classify them according to whether they are older siblings or not. That is, among the siblings reported by the homosexual and heterosexual participants, respecitvely, we can further distinguish between siblings who are older and siblings who are younger than the participant. All older brothers can be found among the group of older siblings, whereas it is impossible for an older brother to be among the younger siblings.
Equation 3 states that, since there are no older brothers among the younger siblings reported by the homosexual and heterosexual individuals, respectively, the probability of observing an older brother among younger siblings is 0. Thus, the probability of observing an older brother among all siblings of homosexual and heterosexual individuals, respectively, can be decomposed into the probability of observing an older brother among the older siblings multiplied by the probability of observing an older sibling in either group (homosexual or heterosexual) in the first place.
Equation 4 combines the statements from equations 1 to 3. The inequality in the bottom row shows that one can have an OBOR > 1 even if the proportion of older brothers is smaller in the homosexual group as compared to the heterosexual group if the proportion of older siblings (i.e., brothers and sisters) is sufficiently greater in the homosexual group as opposed to the heterosexual group.

S3 Converting Inequalities About Odds Ratios to Inequalities About Probabilities/Proportions
Let , , be disjunct events defined on a sample space. The conditional probabilities of event given either or are denoted by ( | ) and (  And > 1 can be rewritten as: Substituting , , by , , in equation 1 of the main text, equation 2 of the main text follows immediately.

S4 Additional Researcher Degrees of Freedom in Primary Studies
Having a number of different measures in one's statistical toolbox is by far not the only degree of freedom available to researchers in the field (Gelman & Loken, 2013;Simmons et al., 2011). It is not difficult to obtain a statistically significant result under these conditions (Gelman & Loken, 2013;Simmons et al., 2011). Importantly, it does not even have to be the case that researchers try out various analysis plans (cf. Blanchard, 2018b) and only report results that turned out to be statistically significant. All that is required is a set of different hypothetically justifiable analysis plans from which researchers can choose in light of the data. Two additional examples of inconsistent data analytic decisions which have been made in primary studies on the FBOE may clarify this point.
Some studies considered a moderating effect of handedness on the excess of older brothers (Blanchard, 2008;Blanchard et al., 2006;Skorska et al., 2020), while others did not even mention handedness. Thus, when analysing data, the theoretical hypothesis of the fraternal birth order effect could be mapped at least onto two different statistical hypothesis tests: one test that does not include handedness as a moderator and another test that does. If the test which does not consider handedness turns out to be statistically significant, then this would be interpreted as supporting the theoretical hypothesis of the FBOE. If the statistical test turns out significant in only one handedness-group but not in the other, then this result could also be reconciled with the FBOE (Blanchard, 2008; not to mention the large number of ways, which may be employed for classifying handedness).
Having found a statistically significant effect, one can post-hoc always draw on biological theory or evolutionary psychology and put forth a potent and coherent explanation as to why an effect may be present in one handedness group, but not the other. For instance, the presence of an effect in right-handers, but not left-handers, could be ascribed to differences in the level of testosterone in utero between left-and right-handed individuals. Yet, such an approach is not identical to, and differs in significant ways from, using statistical tests for pre-experimentally derived hypotheses, employing pre-registered analysis plans that make the substantive case that the hypothesis has actually been derived beforehand.
As a second example, some studies either excluded a large number of participants (Blanchard & Lippa, 2007) or explained away not significant findings by supposing that a so called "stopping rule" obfuscated the effect of the FBOE (Zucker et al., 2007). A stopping rule entails that, due to societal or environmental factors, parents cease to reproduce; for instance, after a certain number of offspring or combination of offspring-sexes is obtained (Blanchard & Lippa, 2007). It is argued that if such a stopping rule is at play, the FBOE might not show as there are not enough younger brothers who are homosexual to detect the effect (Blanchard & Lippa, 2007).
Both the moderating effect of handedness and the presence of a stopping rule are plausible explanations. However, the inconsistency with which these covariates are included throughout the entire field, the fact that many of these explanations have been put forth post-hoc, and that there are studies that have made adjustments to their analysis plans after the primary analyses did not return a statistically significant result (Blanchard et al., 1995(Blanchard et al., , 1996Blanchard & Lippa, 2007) leave the impression that many of the decisions pertaining to the mapping of the theoretical hypothesis of the FBOE and the implied excess of older brothers in homosexual men onto statistical hypotheses were at least data driven if not motivated by a desire to obtain statistically significant results.

S5.1 Methods
We used the R packages simstudy (version 0.2.1; Goldfeld & Wujciak-Jens, 2021), doRNG (version 1.8.2; Gaujoux, 2021), doParallel (version 1.0.16; Microsoft Corporation & Weston, 2021a), and foreach (version 1.5.1; Microsoft Corporation & Weston, 2021b) to generate random draws from a four-dimensional Poisson distribution with a specified correlation matrix, wherein each draw represented a simulated participant. The four dimensions served as the variables #OB, #OS, #YB, and #YS. We chose the Poisson distribution over other discrete distributions because it provides a simple way of generating count data. Other distributions may well provide better approximations to real-world sibling data; however, we are not trying to approximate or model the distribution of sibling counts as they are encountered in real data. Rather, we aim to demonstrate the problems that emerge when applying inadequate methods and decision rules to the kind of count data inevitably encountered in FBOE research. The odds of homosexual orientation for participants with no older brothers were fixed at 0.02 (this is a baseline estimate that occurs throughout the FBOE literature; e.g., Cantor et al., 2002). Each older brother increased these odds by a factor of (1 + θ) with θ ∈ (-1,∞). Thus, individuals' simulated odds of homosexual orientation were given by 0.02(1 + θ) #OB .
We investigated the choices of -0.33, 0.33, and 0 as possible values for θ, with 0 representing no effect of older brothers on homosexual orientation at all, -0.33 a decrease in the odds of homosexual orientation per each older brother, and 0.33 a true increase in the odds of homosexual orientation per each older brother. A value of θ = 0.33 corresponds to the 33% increase in the odds of homosexual orientation per older brother reported throughout the literature (Blanchard, 2001;Blanchard et al., 1996); a value of θ = 0 corresponds to a scenario in which the theoretical estimand is 0 and thus, there is no distinct association between the number of older brothers and sexual orientation and a value of θ = -0.33 corresponds to a scenario in which the number of older brothers decreases the odds of homosexual orientation.
The Poisson distribution was parametrised by a vector containing four elements: The mean numbers of older brothers, older sisters, younger brothers, and younger sisters, respectively (i.e., the rates at which these events occur). This vector of means (or rates) was determined by multiplying the mean number of all siblings, μ, by the vector where .515 again represented the population estimate for the probability of male birth (Grech & Mamo, 2020), and π denoted the proportion of older siblings in a given sample.
For instance, for an average number of siblings of μ = 3 and equal probability that each sibling is an older or a younger sibling (i.e., π = .5), would be given by By plugging in different values for π, the simulation could generate an older sibling effect independently of the specific older brother effect θ. This way, we could flexibly adjust the magnitude of θ, while being able to distinguish it from a more general older sibling effect. There are a multitude of alternative ways for implementing different older brothers and older sibling effects in a simulation: This is just one of them. For the homosexual sample, the study employed three different values for π, namely π = .5, π = .6, and π = .7, while for the heterosexual sample π was fixed at .5. That is, we simulated differences in the number (or proportion) of older siblings by increasing π in the homosexual group, assuming that the proportion of older siblings is equal to the proportion of younger siblings among homosexual individuals. We considered only scenarios in which the proportion of older siblings is greater in the homosexual versus the heterosexual sample, because the greater number (or proportion) of older siblings (as opposed to just a greater number of older brothers) represents the alternative model consistent with much of the FBOE literature. Other scenarios, wherein there are fewer older siblings in homosexual as opposed to heterosexual men, may be interesting, but would correspond to a model of homosexual men having fewer older siblings than heterosexual men. This model does not compete with the FBOE model and therefore is irrelevant.
The median of the mean numbers of all siblings for homosexual participants across the 45 samples in Blanchard (2018aBlanchard ( , 2018b, and Blanchard et al. (2021) was 2.45. In the equal condition of the simulation study, this value served as the mean number of all siblings (i.e., μ = 2.45) for both the homosexual and heterosexual group. In the unequal condition, the mean number of siblings for each group were taken from the Mismatch 2 sample in Blanchard (2014), where the mean number of siblings in the homosexual group was μ = 2.19, and the mean number of siblings in the heterosexual group was μ = 3.31. Blanchard used the Mismatch 2 sample to demonstrate the inability of tests for mean differences and logistic regression to detect an older brother effect and to promote the use of the MROB and MPOB.
As mentioned in the introduction in the main text, it is claimed that due to a positive correlation between the number of older brothers and older sisters, homosexual men should also have more older sisters than heterosexual men (Blanchard, 2014;Blanchard & Klassen, 1997;Jones & Blanchard, 1998). However, to our knowledge, no estimate of the magnitude, or an explanation of the precise nature, of the correlation, that would be required to bring about a difference in the number of older sisters, has been put forth. We nevertheless address this purported correlation and incorporated our best guess of such an estimate, by imposing the correlation matrix of the four sibling types found in the male participants of Tran et al. (2019) onto our simulated participants (Table S2; see https://osf.io/3wnhu/ for data). This study consisted of 1,779 men, who provided complete information on the number of each sibling type (see the original publication for full description of the sample). That is, the simulation assumed equal correlation structures in both homosexual and heterosexual participants, as we are unaware of any claims in the FBOE literature that these structures should differ between homosexual and heterosexual men. Thus, there were 18 possible combinations of θ, μ and π. For each combination, we fitted 10 different models (see Table 3 in the main text).

S5.2 Further Evaluation and Sample-Size Considerations
The performance of the models was primarily assessed with respect to inferring the state of θ with null-hypothesis significance testing (NHST), as it is the case throughout the FBOE literature. Only Models 1, 6, 7, and 9 in Table 3 in the main text could be interpreted as providing an estimate of θ (by exponentiating the estimates and subtracting 1). It thus makes little sense to report conventional quantities such as bias, (root) mean-squared errors, or coverage probabilities, which are used to assess how well the coefficients of interest of the models in Table 3 perform in estimating θ, as most of the model coefficients of interest in Table 3 are not intended to estimate θ or used in that manner. For instance, it would not be meaningful to determine the relative frequency (i.e., the coverage probability) with which the natural logarithm (ln) of the factor (1+θ) (i.e., the odds ratio for being homosexual per additional older brother) is inside the 95% CI for the lnOBOR (Model 4), as the lnOBOR is an entirely different unknown parameter than ln(1+ θ). Say, lnOBOR = ln(1+η), where η is the proportional increase in the odds of observing an older brother in the homosexual versus heterosexual group, whereas θ is the proportional increase in the odds of being homosexual per older brother. It is evident that η and θ are entirely different parameters.
Owing to the finite number of observations (1,000 replications per condition), the observed false-positive rates for the regression coefficients could differ from the underlying real false-positive rate. We defined the margin of error for an acceptable true false-positive rate of .05 as follows: Falsely rejected null-hypotheses can be modelled as Bernoulli random variables with probability of success equal to .05. Thus, the sum of 1,000 Bernoulli random variables (i.e., the number of falsely rejected null-hypotheses) is a binomial random variable. Suppose a test has a false-positive rate of 5%. The probability of observing 67 successes or less in 1000 trials (i.e., a false-positive rate of 6.7%) is .99. Hence, we regarded an observed false-positive rate of greater than 5%, but smaller than 6.7% as compatible with a true false-positive rate of 5%. Sample sizes were determined by carrying out the simulation several times, increasing the sample size on each iteration until at least 80% of the regression coefficients for the predictor #OB in Model 9 returned a p < .05 (two-tailed) given θ = 0.33. In other words, we increased the sample size until the coefficient of interest in Model 9 had a minimum approximate power of 80% or more. Conditional on the alternative hypothesis (here: θ = 0.33) being true, a test is equivalent to a Bernoulli random variable with probability of success equal to the test's power. Thus, the probability of observing 829 successes or less in 1,000 trials (i.e., a success rate of 82.9%) is .99. It should then be safe to conclude that for an observed rate of correctly rejected null hypotheses of 829 across 1,000 simulation replications, the hypothesis tests had a power of at least 80%. A sample size of 700 participants per group seemed to be more than sufficient for this purpose (see results below). The rationale for tuning the sample size such that Model 9 had adequate power was that this model not only adjusted for the number of older siblings but also provided a direct estimate for θ (i.e., the older brother effect in our simulation study) and was thus the most easily interpretable of the models.

S5.3 Models
Model 1 in Table 3 in the main text predicted the logit of the probability of being homosexual, using the raw number of older brothers and older sisters as predictors. Blanchard (2014) warned against using this model, stating that differences in the total number of siblings between homosexual and heterosexual participants (i.e., heterosexual participants having more siblings in general) may lead to the non-detection of a genuine association between the number of older brothers and sexual orientation as conveyed by the regression coefficient of #OB. Models 2 and 3 used the MROB and MROS (Model 2) or the MPOB and MPOS (Model 3), respectively, instead of the raw numbers of older brothers and older sisters as predictors in a logistic regression model, as recommended by Blanchard (2014) and Blanchard et al. (2020).
The regression coefficients for the predictor Hom in Models 4 and 5 correspond to the natural logarithms (ln) of the OBOR (Model 4) and the OSOR (Model 5) when regarded as generalized linear models at the level of the reported siblings. Specifically, Model 4 predicted the logit probability of the th sibling reported by the th participant being an older brother (for Model 5: older sister), using only the homosexual orientation of the th participant as a predictor (abbreviated as "Hom"), with a value of "0" meaning no homosexual orientation and a value of "1" representing homosexual orientation. In Model 5, exp(β Hom ) was not intended to capture the older-brother effect (θ) itself, but to assess a positive, but weaker, OSOR (as compared to the OBOR), which should be observable in the presence of an isolated effect of older brothers on sexual orientation, presumably due to the positive correlation between older brothers and older sisters (Blanchard, 2018c;Blanchard et al., 2021).
Models 6 and 7 both included the raw number of older brothers as one predictor, but differed with respect to the second predictor, i.e., the variable whose confounding effect should be adjusted for. Model 6 adjusted for the number of all siblings, #All, and Model 7 adjusted for the number of siblings who are not older brothers (i.e., "other siblings," #Other). These are the models implied by the claimed necessity of having to adjust for #All and #Other (as opposed to the MPOB and the MROB), respectively. Models 8 through 10 adjusted for the number of older siblings, #Older (Frisch & Hviid, 2006;Gelman & Stern, 2006;Zietsch, 2018). Model 8 corresponds to the one suggested by Gelman and Stern (2006) and included the difference between the number of older brothers and older sisters as the predictor of interest, while adjusting for #Older. The regression coefficient of interest, β #OB−#OS , quantifies the particular association between older brothers and homosexual orientation. Model 9 contains instead the number of older brothers as the predictor of interest.
Finally, the regression coefficient for Hom in Model 10 is equivalent to the natural logarithm (ln) of the odds ratio of older brothers to older sisters introduced in Equation 6, with the sampling frame being restricted to older siblings only. Model 10 predicted the logit of the probability of the th older sibling reported by the th participant being an older brother using the homosexual orientation (coded as 0 and 1) of the th participant.
Thus, just like Models 8 and 9, Model 10 adjusts for the number of older siblings, but is defined on the level of the reported siblings, rather than the level of participants.

S5.4 Summary Plots of Models Fitted to Simulated Data
The full of the seven logistic regression models predicting the probability of homosexual orientation from the simulation study are displayed as error plots in Figure S3. These plots are different from Figures 1 and 2 in the main text, in that they also convey summaries of the regression coefficients of the predictors that the models in Table 3 (main text) were intended to control for. As noted in the main text (Footnote 5), the regression coefficients for the MROB and MPOB were consistently greater in magnitude than the corresponding regression coefficients for the MROS and MPOS when an older brother effect was present. This suggests that interpreting the two conjointly (for instance, by considering their difference) may yield decisions with a false-positive rate of 5%. Figure S3. The error plots show the average log odds ratios (lnOR) over 1,000 simulations of the seven logistic regression models predicting the probability of homosexual orientation for each combination of , and . The column indicates whether the mean number of all siblings was equal at = 2.25 for both the homo-and heterosexual participants or whether this mean was unequal, i.e., the mean number of all siblings was greater in heterosexual group ( = 3.31) as opposed to the homosexual group ( = 2.19). The column denotes the proportion of older siblings in the homosexual group. For the heterosexual group, was fixed at 0.5. The column denotes the presence of a positive ( = 0.33), a negative ( = -0.33) or the absence ( = 0) of an effect of older brothers on the odds of homosexual orientation. The endpoints of the error bars correspond to the .025 and .975 quantiles of the distribution of simulated coefficients. The column "% p < .05" indicates the percentage of simulation replications that yielded a two-tailed p-value of less than .05.

S6.2 Results of the Fixed-Effect Model Analyses
In the specification curve and multiverse analyses, when including GDY as a predictor, the fixed-effect summary estimates for the 52 non-GDY and the 12 GDY samples were lnOR = 0.  Figure S4 presents a rainforest plot of the fixed-effect meta-analysis of all male samples lnOR effect sizes. Figure S4. Rainforest plot of the fixed-effect meta-analysis of all male sample lnOR effect sizes. In this plot, the sample weights are both encoded in the degree of saturation and the thickness of the shaded regions, with heavier effect sizes being more saturated and thicker. The widths of the shaded regions correspond to 95% confidence intervals (CIs); the point estimates are represented by the ticks inside these intervals. The teardropshape encodes the relative likelihood function of the mean of a normal distribution conditional on the standard deviation being equal to the observed standard error (i.e., the lnORs are assumed to be distributed normally) and its horizontal mirror image over the range of the 95% CI. The teardrop shape encodes the likelihood of true values over the range of the 95% CI, given the observed estimate and assuming the lnORs to be normally distributed with known variance.

S7 Distribution of S Values Under the Null Model
Technically, if a test statistic is continuous (as opposed to discrete) and the null (or test) hypothesis refers to a single parameter value (e.g., 0 : β = β 0 ), the associated p value is distributed uniformly over the unit interval upon repeated sampling if the null model is true.
Let be distributed uniformly over the unit interval, i.e., ∼ Unif(0,1), let = log 2 (1/ ), and denote by and the cumulative distribution functions (cdf) of and , respectively. Then which describes the cdf of an exponential distribution with rate parameter equal to ln(2); Exponential{ = ln(2)}. Thus, under the null model and given a continuous test statistic, values follow an exponential distribution.

S8 Parametric Bootstrapping Plots
A modified approach to the parametric bootstrapping procedure described in Voracek et al. (2019) was carried out to also explore its sensitivity to different degrees of effect size heterogeneity. For each of the 64 male samples, a random value from a normal distribution with mean equal to zero and variance equal to 2 + τ 2 was drawn. Three τ values were used: (1) τ = 0, corresponding to a fixed-effect model; (2) τ = 0.13, corresponding to the REML-estimate observed across all 64 male samples; (3) τ = 0.21, corresponding to the upper bound of the 95% CI for the REML-estimate. We computed the 1,638 meta-analytic specifications for each set of simulated effect sizes, and repeated this procedure 1,000 times. The resulting 0.025 and 0.975 quantiles of the estimated meta-analytic means for each of the 1,638 specification were sorted in ascending order. Plotting each specification's quantiles and connecting them with a line resulted in a 95% parametric bootstrap confidence region for specification curves under the null model. Figure S5 displays the results of these simulations for both male and female samples.
The observed specification curve (red line) for male samples ( Figure S5, top row) is mostly located outside the bootstrapped confidence bands created by the simulated data under the three conditions of effect size heterogeneity considered here. This suggests a mismatch between the data and the respective models which entail an effect of lnOR = 0. This is in line with the results obtained thus far -conditional on all assumptions about the data generating process being met (most importantly, the assumption that the observed effect sizes are a random selection of all effect sizes; e.g., Rafi & Greenland, 2020) the effect does not appear to be zero (or in the immediate vicinity of zero).
In comparing the female specification curve to its respective confidence bands, it is easy to see that most of the meta-analytic effects are well inside the confidence bands for the null model which assumes an ln of equal to 0. Thus, the meta-analytic summary effects of the specifications of female samples do not appear unusual given the null model.
Note that it would be a mistake to conclude that (a) "the effect is present in male samples but not in female samples" or (b) "the effect is greater in male as opposed to female samples". As far as (a) is concerned, the confidence bands could theoretically be made arbitrarily narrow if enough female samples were available (no effect is exactly zero). As far as (b) is concerned: Judgments about the size of a difference between two effects actually require that this difference be assessed -otherwise one would fall victim to the "thedifference-between-significant-and-not-significant-is-not-itself-statistically-significant"fallacy (Gelman & Stern, 2006; the inferential specification curves are in fact equivalent to hypothesis tests for each observed specification summary effect). Figure S5. Inferential specification curve plots for the male (top) and female (bottom) samples and three values of ;. The plots show the observed specification curves from Figures 5 and 6 (red lines) in the main text for male and female samples. The lower and upper bounds of the grey area correspond to the .025 and .975 quantiles of 1,000 simulations of each specification obtained by drawing effect size estimates from a normal distribution with mean equal to 0, heterogeneity equal to ; (= 0, 0.13, 0.19), and standard deviation equal to the observed standard error. Observed estimates outside this area are taken as evidence against a true effect of lnOR = 0.

S9 Specification Curve Analysis over the Difference of Male and Female Samples
Here we present the results of specification-curve multiverse meta-analyses obtained by comparing the estimated meta-analytic means for each of the 1,638 specifications of male samples to the fixed-and random-effects mean for all 17 female samples. To this end, we added all 17 female samples to each of the 814 unique subsets of male samples and computed both fixed-and random-effects meta-analyses using sex (male versus female) as a predictor. The resulting 1,628 regression coefficients, (i.e., the difference in the mean effect between male and female samples for each specification), the 95% CIs and the p values were extracted. Figure S6 shows the corresponding specification-curve (see Part III in the main text for description of specification-curve plot).
The estimated mean differences between the male subsets and all 17 female samples ranged from lnOR = -0.36 to lnOR = 0.86, with an inter-quartile range of -0.01 to 0.08. 69% of these differences were greater than 0. Of these, 4.3% were accompanied by 95% CIs which did not include 0 (i.e., had a two-tailed p value of less than .05). Overall, 3.3% of specifications returned p < .05. Figure S6. Specification-curve for the difference between male and female samples. Figure S7 provides the inferential specification-curve plots, Figure S8 the kernel-density estimates of the S values. Figure S7. Inferential specification-curve plots for the difference between male and female samples and three values of τ.
The kernel density estimate of the of -values shown in Figure S8 appears to be reasonably well approximatedby an exponential distribution with rate parameter = ln(2), as judged by the visual comparison of the observed values to the 1,638 random draws from an Exponential{ln(2)} distribution as well as the probability density function. Figure S8. Kernel density estimate of the S values for the difference between male and female samples.