Sex differences in variability across nations in reading, mathematics and science: a meta-analytic extension of Baye and Monseur (2016)

A recent study by Baye and Monseur (Large Scale Assess Educ 4:1–16, 2016) using large, international educational data sets suggests that the “greater male variation hypothesis” is well supported. Males are often over-represented at the tails of the ability distribution despite similarity in measures of central tendency and the gradual closing of the attainment gap relative to females. In this study, we replicate and expand Baye and Monseur’s work, and explore greater male variability by country using meta-analysis and meta-regression. While we broadly confirm that variability is greater for males internationally, we find that there is significant heterogeneity between countries, and that much of this can be quantified using variables applicable across these assessments (such as test, year, male–female effect size, mean country score and Global Gender Gap Indicators). While it is still not possible to draw any causal conclusions regarding why males are more varied than females in academic assessments, it is possible to show that some national-level variables affect the magnitude of this variation. Results and suggestions for further work are discussed.

phenomenon which appears somewhat paradoxical. Baye and Monseur (2016) suggested that this may be due to the way in which sex differences have been historically examined, focussing on mean results which assume homogeneity of variance across the achievement distribution. In a study using international assessment data, they demonstrated that the magnitude of the sex differences in achievement across literacy, mathematics and science varied across the range of results, and that the largest differences are seen at the extreme tails of the distribution. Girls tended to outperform boys at both tails of the distribution on reading measures, and in the lower percentiles of mathematics and science, while boys outperformed girls in the higher percentiles of mathematics and science. While the differences at the top of the distribution were of note, they called attention to the fact that inequities in the lower percentiles of the distribution were much more striking. Baye and Monseur (2016) also examined the variance ratios of boys and girls on these assessments and found that in 93% of cases, variances for boys were higher. The finding of greater male variances in assessments is not in and of itself original and has been noted in studies for many decades (although rarely as a core focus). The "greater male variability" hypothesis in fact has its roots in the 19th century (Ellis 1894). However, if we are to understand differences between the sexes at different points of the distribution, we must attempt to determine how their respective distributions differ. It is the issue of differences in variability, not average performance, that the rest of this paper attempts to address, building on earlier work.

Male and female variability
Differences in the spread of scores between males and females have been noted in educational assessments for a long time, although often with contrasting findings. Maccoby and Jacklin (1974) showed that males were more variable than females in mathematical and spatial abilities, whereas variances showed parity in verbal measures. Feingold (1992) found larger male variances in the domains of general reasoning, mechanical reasoning, abstract reasoning, quantitative and spatial abilities, perceptual speed, memory and on verbal test batteries. Strand et al. (2006) found similar patterns in the domains of verbal, quantitative and non-verbal reasoning on a representative sample of 11-year-olds in the UK, with greater male variances ranging between 7 and 17%. Similar results on U.S. students were found by Lohman and Lakin (2009) and later, Lakin (2013). IQ scores have also been shown to reflect the same pattern (Johnson et al. 2008). Finally, assessments of non-cognitive and behavioural domains such as creativity (He et al. 2013; Karowski et al. 2016), sensation seeking (Cross et al. 2011), personality (Borkenau et al. 2013) and aggression (Archer and Mehdikhani 2003) appear subject to the effect. Combining these findings with the work reported earlier from Baye and Monseur, and given that the above represents only a fraction of reported findings, one can see why many consider greater male variability to be ubiquitous.
Yet despite the volume of work related to differences in variances between the sexes, there has been little systematic attempt to explain this phenomenon (either partially or in its entirety). This is likely in part due to the contention that studies on sex differences in abilities tend to bring with them. Feingold (1992) noted that the explanation for greater male variability has become a polarised nature versus nurture debate. As a result, many empirical papers avoid proposing an explanation. Johnson et al. (2008) point out that although results have often seemed clear, studies are often attacked on methodological grounds pertaining to sample size, representativeness, sample selectivity and age, amongst other things. While it is not our intent to repeat the full history of the greater male variability hypothesis (see Johnson et al. for an in-depth review), we will briefly consider some of the proposed explanations for this effect.

Explanations for greater male variability
As Feingold claimed, arguments regarding biological innateness are often invoked for theories of sex differences in cognitive and behavioural domains. Early theories (Ounsted and Taylor 1972) focused on the Y chromosome, claiming that differences in gene expression resulted in slower development and expressed more harmful as well as more beneficial traits, which would presumably lead to more variability in males. Gualtieri and Hicks (1985) suggested such differences could emerge from differences in the uterine environment, making males more differentially susceptible to physical and psychological disorders over the lifespan.
Evolutionary theories suggest that ancient adaptive mechanisms produced greater male variability to enhance survival in ancestral environments and that they are still in operation today. Evolutionary theories are based on sexual selection theory and parental investment theories (see Archer and Mehdikhani 2003 for a comprehensive review) and they would ultimately result in males showing greater variation across a range of traits in order to ensure reproductive fitness. Hill (2017) proposed two mathematical models simulating how one sex could have become more variable over evolutionary time if one sex in our ancestral past (presumably females in the case of Homo sapiens, although Hill makes no explicit assumption) is more selective of the other for the purposes of mating, and that this greater variability will be independent of other measures of central tendency. Hill also suggested that in circumstances where the selective sex is no longer being as selective, greater variability in the selected sex may in fact decline over successive generations. No direct test of this latter hypothesis has been made, however.
While many support the biological and evolutionary basis for greater male variability, there are some shortfalls in this interpretation, as well as additional potential explanations as to why males are perhaps more variable. Miller (2001) claimed that susceptibility to defects resulting from prenatal conditions would only explain why males are overrepresented in the lower, not the higher tail of a distribution. As early as 1922, Hollingworth argued for an explanation based on gender roles, claiming that male employment, compared to the more restricted home role of women, allowed them the opportunity for greater diversification in education and environmental experiences. Noddings (1992) highlighted the issue of conformity, claiming that while most girls worked hard enough to avoid being in the bottom of the distribution in class, brighter girls are often pressured into not demonstrating the full extent of their abilities. Ceci et al. (2009) argued that biological accounts of differences in quantitative fields between the sexes are largely inconsistent and suggested that female preferences were a better explanation of underrepresentation in some professions. Critics of the evolutionary perspective also argue that if this phenomenon resulted from innate, evolved mechanisms, invariance of this effect across cultures would be expected. Several previous studies indicate that some nations show greater male variation, others greater female variation and many show homogeneity of variance (Feingold 1994). Feingold went on to attribute heterogeneity in his data to social and cultural factors rather than any innate biological mechanism. Feingold (1992) also argued that national test norms alone may not be sufficiently generalizable to afford definitive proof of a biological origin of greater male variability.
However, more recent studies using international assessments such as PISA, PIRLS and TIMSS do seem to suggest that variability is greater for males in the domains of reading and mathematics across cultures (Baye and Monseur 2016;Machin and Pekkarinen 2008).
There has been some suggestion that elements of test design may also play a role in magnifying sex differences in terms of measures of central tendency and variances. Spelke (2005) claimed that supposed differences in ability, particularly in mathematics and science, resulted largely from item and test biases favouring males, and that research generally fails to support the greater male variability hypothesis in these domains. Lakin (2013) supports this to an extent, suggesting that changes to Cognitive Ability Tests (specifically, the introduction of new quantitative reasoning items with a lesser verbal load) may have been responsible for shifting more males into the upper echelons of the distribution compared with earlier versions of the assessment. Strand et al., however, found few substantive sex differences related to item difficulties in non-verbal and verbal batteries and suggested that test construction was unlikely to be the root cause of differences in variability. They made a tentative suggestion that a speed-accuracy trade-off favouring boys may account for some of the variability differences in quantitative domains, but cautiously noted that previous research has mirrored these effects in untimed assessments (such as Feingold 1992). Lakin also noted that the consistent trend of increasing variance ratios between cognitive ability tests at grades 4 and 7 is likely to be something more systematic than simple test design and potentially reflects changes to society in terms of educational opportunity and personal educational preferences. Arguments focussing purely on test construction and procedure are thus hard to substantiate in the current literature.
Machin and Pekkarinen (2008) highlighted a compositional effect of sex differences in central tendency and distribution of scores. In their analysis of TIMSS and PIRLS data in 15-year-olds, they noted that greater male variance in maths was attributable to overrepresentation of males in the higher part of the test distribution, with males outperforming females on average. In reading, male overrepresentation was largely at the bottom of the distribution, with females outperforming males on average. Indeed, Nowell and Hedges (1998) found a correlation of 0.74 between variance ratios and male-female effect sizes. Baye and Monseur found a smaller overall correlation of 0.42. However, they noted that the strength of the relationship varied by the point in the distribution. At the 5th percentile, the correlation was 0.50. At the 95th percentile, this had declined to 0.31. These results seem to suggest that variability for males increases in line with superior female performance, particularly at the lower end of the distribution. While Feingold's work (1994) failed to show a consistent greater male variance in international test scores, this could be attributable to the methodology. He conducted a meta-analysis by searching the literature for reading, mathematics and spatial measures, an approach that carries many issues, including differing tests, differing test administrations and issues of representation. Baye and Monseur (2016), using more recently available international assessments (PISA, PIRLS and TIMSS), found different results, suggesting that greater male variability was effectively universal. They found that variances (on average) were 15% greater for males in reading, 12% greater in maths and 14% greater in science. Even using Feingold's (1994) conservative criterion that any ratio falling between 0.90 and 1.10 does not represent evidence of greater variance, Baye and Monseur's work is suggestive of greater male variability.
The advantage of using these international assessments is that they are designed to be internationally comparable, with representative samples of children selected in each country and administered in a standardised fashion. This helps remove potentially confounding factors that may impact on assessment results.

The current study
However, Baye and Monseur's work leaves many questions unanswered. How similar are countries to each other in terms of variance ratios, and are there some that are much more male biased than others? If countries vary in terms of male and female variances, are there any recorded factors that may account for this? Baye and Monseur did make some attempt to look at differences between primary and secondary school measures, as well as by IEA and OECD membership, but beyond this, no systematic heterogeneity analysis was conducted. Yet analysing heterogeneity is important and can be revealing. Furthermore, this international data could be linked to cross-country metrics that may elucidate meaningful patterns of variation. For example, Borkenau et al. (2013) showed that differences across countries in variances in personality were significantly linked to national measures of gender inequality and human development. Given earlier suggestions by Hollingworth (1922) that variances favouring males are largely due to gender roles, and later works (Ceci et al. 2009;Lakin 2013) suggesting that societal practices and female choice are likely to have a major impact on variance ratios, international indices of societal development, particularly forms of gender inequality, are potential sources that could be used to explain any cross-national heterogeneity. To our knowledge, this has not been examined in the context of large-scale international assessments.
In this study, we attempt to answer these questions and extend our knowledge surrounding the nature of greater male variability. We examined the same data sets used by Baye and Monseur, with the addition of more recent test administrations from years 2015 and 2016, to (1) replicate their findings using meta-analysis, (2) determine if greater male variability is homogenous both within and between countries and (3) quantify any meaningful sources of heterogeneity. For the purposes of the third aim, we link these data to international metrics on human progress (Human Development Index) and male-female participation in education, labour forces and politics (Global Gender Gap Index) as well as examining test specific factors such as grade, test, OECD membership, the size of the male-female difference at the mean and national means.

Data sources
Data from three major international assessments were selected to allow an examination of variance ratios across countries: OECD PISA (Programme for International Student Assessment; 2000, 2006, 2012, 2015), IEA PIRLS (Progress in International Reading Literacy Study; 2001, 2006) and IEA TIMSS (Trends in International Mathematics and Science Study; 1995, 1999, 2007, 2015). These were selected due to having multiple testing points over time and having a wide coverage of countries across the globe. All data is freely available from the OECD website (http://www.pisa.oecd.org) and IEA Study Data Repository (http://rms.iea-dpc.org). Methodological information is available in the technical reports on each survey (Adams and Wu 2002; Martin et al. 2000, 2003, 2004, 2007, 2016; Kelly 1996, 1997; Mullis 1996, 2012; OECD 2005, 2009a, 2014, 2016; Olson et al. 2008).
International data on Human Development was also collected where available for each country. The Human Development Index (HDI) is made up of four sub-factors: expected years of schooling for children of school entry age, mean years of schooling for adults aged 25 and above, life expectancy and gross national income per capita (GNI). This data is freely available from the United Nations Development Programme website (http://hdr.undp.org/en/data).
International data on gender inequality was also gathered from the Global Gender Gap project. The Global Gender Gap Index (GGGI) is made up of four sub-factors: economic participation, educational attainment, health and survival, and political empowerment. Each factor represents an outcome and is measured on a scale of 0 to 1, where a score of 1 would represent parity between males and females. Data is freely available from the World Economic Forum's website (http://reports.weforum.org).

Sample
Data from each country surveyed within each of the assessments was included in this analysis. For the purposes of this study, we used measures from three content areas: literacy, maths literacy and science literacy. In total, we included 564 cases for literacy, 1054 cases for mathematics literacy and 991 cases for science literacy gathered from over 100 nations worldwide (where each case represents a national test occurrence within a given year and within a specific content area). In terms of population size across all cases, in mathematics literacy it consists of 2,507,046 males and 2,512,273 females, for reading 1,471,698 males and 1,486,578 females and for science literacy 2,512,559 males and 2,515,645 females. It should be noted that for science literacy, we did not use data from TIMSS Advanced as these measures focussed on concepts from Physics only.

Data calculations
Statistics were calculated by generating means and standard deviations for males and females within each country for each measure within each assessment. These were calculated using each of the five plausible values within each database and aggregated according to the methodologies supplied by the OECD and IEA in their analyses manuals (OECD 2009b;Martin et al. 2016). Standard errors for these statistics were calculated using replicate weights within each database (80 Fay weights in PISA and 75 JK2 replicates in PIRLS and TIMSS). SPSS (V22; IBM Corp 2013) was used to calculate these statistics (see OECD 2009b; Martin et al. 2016 for technical details regarding the SPSS macros used to compute these statistics).
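The aggregation steps described above can be illustrated with a short Python sketch. It is not the SPSS macro code used in the study; it assumes the conventions commonly documented for these surveys: a Fay factor of 0.5 for the PISA replicates, a plain sum of squared deviations for JK2 replicates, and a final standard error that adds the imputation variance between plausible values to the average sampling variance. All function names are hypothetical.

```python
from statistics import mean, variance


def fay_brr_variance(full_estimate, replicate_estimates, fay_k=0.5):
    """Sampling variance from Fay-adjusted BRR replicates (assumed k = 0.5)."""
    g = len(replicate_estimates)
    squared_dev = sum((r - full_estimate) ** 2 for r in replicate_estimates)
    return squared_dev / (g * (1.0 - fay_k) ** 2)


def jk2_variance(full_estimate, replicate_estimates):
    """Sampling variance from JK2 jackknife replicates (sum of squared deviations)."""
    return sum((r - full_estimate) ** 2 for r in replicate_estimates)


def combine_plausible_values(pv_estimates, pv_sampling_variances):
    """Combine one statistic computed on each plausible value.

    Returns the final estimate (mean over plausible values) and its
    standard error (sampling variance plus imputation variance).
    """
    m = len(pv_estimates)
    estimate = mean(pv_estimates)
    sampling_var = mean(pv_sampling_variances)
    imputation_var = variance(pv_estimates)  # uses an m - 1 denominator
    total_var = sampling_var + (1.0 + 1.0 / m) * imputation_var
    return estimate, total_var ** 0.5
```

Averaging the sampling variance over all plausible values is one convention; some survey manuals instead take it from the first plausible value only, so the official analysis manuals remain the reference for exact replication.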
Variances were calculated from the standard deviations. The ratio of male to female variances was taken by dividing the male variance by the female variance. A variance ratio greater than one would indicate that the male variance is higher than the female variance. Variance ratios are a common method of examining variability between the sexes (see Hedges and Friedman 1993; Baye and Monseur 2016). In keeping with previous authors (Hedges and Friedman 1993; Katzman and Alliger 1992), but not Baye and Monseur (2016), ratios were logarithmically transformed to increase precision of the estimates and to avoid overestimation, as it ensures a normal distribution. Assuming that the log of the variances follows a normal distribution, the variances of these ratios were then calculated.
As we are examining variance ratios by country, some of the data points were combined for the purposes of the analysis. Countries such as Italy, Spain, Canada and the United States often report data for sub-regions but not consistently over assessments. These were collapsed for the purposes of this study. Where a nation has national and regional data within a given test administration, the subnational data points were used. China and the United Kingdom also report at the level of autonomous states (England, Scotland, Northern Ireland, Taipei, Macao, Shanghai and Hong Kong). Countries falling into these states are denoted in the table but are not considered separately for aggregation. Assessments were considered together regardless of whether they were done in the primary or secondary years.
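The variance-ratio calculation described above can be sketched in Python as follows. The expression for the sampling variance of the log ratio is an assumption of this sketch: it is the usual large-sample approximation under normality, 2/(n_m − 1) + 2/(n_f − 1), and may differ from the form actually applied in the study (for example, if design-based standard errors were used instead). The function name and arguments are hypothetical.

```python
import math


def log_variance_ratio(sd_male, sd_female, n_male, n_female):
    """Return the variance ratio, its natural log, and an approximate
    sampling variance of the log ratio.

    The variance term assumes normally distributed scores and simple
    random sampling (an assumption, not taken from the text).
    """
    vr = (sd_male ** 2) / (sd_female ** 2)   # male variance / female variance
    ln_vr = math.log(vr)                     # > 0 means greater male variance
    var_ln_vr = 2.0 / (n_male - 1) + 2.0 / (n_female - 1)
    return vr, ln_vr, var_ln_vr


# Illustrative values only, not data from the study.
vr, ln_vr, var_ln_vr = log_variance_ratio(100.0, 95.0, 2501, 2501)
```

Because the assessments use complex sample designs, the nominal sample sizes in this approximation would overstate precision; the sketch is intended only to show the direction and scale of the transformation.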

Meta-analysis
To examine the overall size of the variance ratio and to meaningfully quantify heterogeneity, meta-analyses were conducted using Comprehensive Meta-Analysis Version 3 (Borenstein et al. 2013). Many traditional analyses assume that effect size parameters are fixed and relatively homogenous. In this study, we are not assuming homogeneity of these parameters and are thus implementing a random effects model, assuming that effect size parameters are randomly sampled. The use of a random effects model is appropriate where heterogeneity is expected. In this study, we examined heterogeneity by country, whether the countries were OECD member states, test and grade.
Heterogeneity is examined by calculating Q statistics, which can be used to test for equality of effect sizes within and between analysis categories. Q statistics follow a chi-square distribution with k − 1 degrees of freedom, where k is the number of effect sizes (Hedges and Olkin 1985). While significant Q statistics can detect the presence of heterogeneity, they are not indicative of its magnitude. They are also sensitive to sample size (Hardy and Thompson 1998; Higgins and Thompson 2002), and heterogeneity is generally expected when analysing large numbers of studies (Higgins 2008).
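As a minimal sketch of the heterogeneity test and random effects pooling described above, the Python below computes Q as the inverse-variance weighted sum of squared deviations from the fixed-effect mean, following the standard Hedges and Olkin form. The between-study variance uses the DerSimonian-Laird method-of-moments estimator, which is an assumption here: the text does not state which estimator the software applied. Function names are hypothetical.

```python
import math


def q_statistic(effects, variances):
    """Return (Q, degrees of freedom, fixed-effect weighted mean)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * t for w, t in zip(weights, effects)) / sum(weights)
    q = sum(w * (t - pooled) ** 2 for w, t in zip(weights, effects))
    return q, len(effects) - 1, pooled


def random_effects_mean(effects, variances):
    """Return (pooled mean, standard error, tau squared) under a
    random effects model with a DerSimonian-Laird tau squared."""
    q, df, _ = q_statistic(effects, variances)
    weights = [1.0 / v for v in variances]
    if df == 0:
        tau2 = 0.0  # a single case carries no information on heterogeneity
    else:
        c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
        tau2 = max(0.0, (q - df) / c)
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * t for w, t in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    return pooled, se, tau2
```

When the effects are log variance ratios, `math.exp(pooled)` returns the pooled ratio on its original scale, which is how results are reported in the tables.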
The mean of the log variance ratios, standard errors and confidence intervals for each country were then calculated (and presented in their un-transformed format for ease of understanding). For each country, we also tabulated the proportion of studies where (1) the variances were significantly larger for males, (2) the variances were larger for males but not significantly so, (3) the variances were greater for females but not significantly so and finally (4) the variances were significantly greater for females.
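The four-way tabulation described above could be implemented along the following lines. This is a sketch under assumptions: significance is judged by whether a 95% normal-theory confidence interval for the log variance ratio excludes zero, and the category labels and function names are hypothetical rather than taken from the study.

```python
import math
from collections import Counter


def classify_case(ln_vr, var_ln_vr, z_crit=1.96):
    """Assign one case to one of the four categories using a
    confidence interval on the log variance ratio."""
    se = math.sqrt(var_ln_vr)
    lower = ln_vr - z_crit * se
    upper = ln_vr + z_crit * se
    if lower > 0:
        return "male, significant"
    if ln_vr > 0:
        return "male, not significant"
    if upper < 0:
        return "female, significant"
    # A log ratio of exactly zero also lands here (an arbitrary choice).
    return "female, not significant"


def category_proportions(cases):
    """cases: iterable of (ln_vr, var_ln_vr) pairs for one country."""
    labels = [classify_case(ln_vr, var) for ln_vr, var in cases]
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}
```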

Meta-regression
Meta-regression was used to explore and quantify potential sources of heterogeneity. We recorded the mean test score for each country in each year and calculated a weighted effect size of the gender difference between male and female means, as previous work has suggested that this effect size is related to the variance ratio (Baye and Monseur 2016). This was taken as the female mean subtracted from the male mean (a negative score therefore suggests higher scores for females). Using SPSS, this was converted into a standardised effect size (Hedges' g), calculated as the effect size d multiplied by the correction factor J (which corrects for small sample sizes).
Additional moderators were derived from test administrations. Previous researchers (discussed earlier) have suggested that some differences may result from test design. As such, the test type, year, test grade and OECD membership were included as moderators to determine if these had a substantial impact on heterogeneity. Baye and Monseur (2016) found small differences in variance ratios between these variables and thus they may be contributing to some of the heterogeneity. Alongside these, the sub-factors of the HDI and the GGGI were included to see if other country-level contributing factors could account for variation across countries. As consistent data for both these indices is only available from 2006, meta-regression was performed only on cases from test administrations from 2006 onwards.
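The male-female effect size described above can be sketched in Python. The sketch uses the standard definitions: d is the mean difference divided by the pooled standard deviation, and J = 1 − 3/(4·df − 1) with df = n_m + n_f − 2. Whether the study applied exactly this pooled standard deviation, and how survey weights entered the calculation, is not stated, so both are assumptions here. The function name is hypothetical.

```python
import math


def hedges_g(mean_male, mean_female, sd_male, sd_female, n_male, n_female):
    """Standardised male-minus-female difference with the small-sample
    correction factor J. Negative values indicate higher female scores."""
    df = n_male + n_female - 2
    pooled_sd = math.sqrt(
        ((n_male - 1) * sd_male ** 2 + (n_female - 1) * sd_female ** 2) / df
    )
    d = (mean_male - mean_female) / pooled_sd
    j = 1.0 - 3.0 / (4.0 * df - 1.0)
    return d * j
```

With samples of several thousand students per country, J is very close to 1, so g and d are practically identical in this setting.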

Results
Analysis of each content domain is presented separately. Countries with only one or two data points are included in the analysis although conclusions about the stability of their variance ratios must be treated cautiously. Variance ratios and their confidence intervals are presented in their un-transformed form for ease of interpretation. The percentage of cases that have a variance ratio below (significantly and non-significantly) and above (significantly and non-significantly) 1, with ratios above 1 representing greater male variance, are also presented. Q statistics and their significance are also reported for each nation. Table 1 shows the results for this analysis on international mathematics literacy data sources. Each of the 102 individual participating nations is listed in alphabetical order.

Mathematics literacy
For mathematics literacy, variance ratios across nations range between 0.96 (Algeria) and 1.43 (Saudi Arabia), the average being 1.12. Data from 102 nations (with Chinese districts administered separately) clearly show that in mathematics, fewer than 6% of recorded cases show larger variances for females. Almost 61% show significantly larger variances for males than females and less than 1% of ratios are significantly female biased. In 91 countries, the variances are significantly larger for males than females. In only one country (Algeria) was the opposite pattern found (and with no evidence of heterogeneity), although this result is not significant. In 36 nations (35%), there is no significant evidence of heterogeneity. Heterogeneity is present in the remaining 65% of cases, however, and is present overall. Figure 1 demonstrates the ratios and 95% CIs graphically (in order from smallest to largest). As is evident, while many significantly differ from 1.00 and countries vary considerably, few countries significantly vary from each other in the domain of mathematics literacy. Table 2 shows the results of the same analysis on international measures of reading. Note that while many countries are common to both assessments, this is not true of all of them.

Reading
In reading, around 95% of all assessments taken had wider variances for males than for females (almost 79% significantly so). Ratios range from 0.96 (Algeria) to 1.75 (Saudi Arabia). Only 4% of all cases were female biased (< 1% significantly so). The average across countries was 1.16. Only two countries have a wider variance for females and these are each based on only one assessment point (Algeria and Belize). Across 87 countries, the variances are significantly male biased at the 5% level. Only 34 nations (37%) however don't show significant heterogeneity in their Q scores, and of these, 14 are single case nations where the figure can't be calculated. As the remaining 63% show significant heterogeneity, the data cannot be considered homogenous. Figure 2 demonstrates the ratios and 95% CIs graphically (in order from smallest to largest). As is evident, while many significantly differ from 1.00 and countries vary considerably, few countries have ratios that significantly vary from each other in the domain of reading. Those that do differ significantly from each other tend to be positioned at the tails of the distribution.

Science literacy
For science literacy, we conducted the same analysis as in the previous two cognitive domains (Table 3). In science literacy, around 95% of all assessments taken had wider variances for males than for females (almost 69% significantly so). Ratios range from 0.96 (Algeria) to 1.48 (Saudi Arabia). Less than 4% of all cases were female biased (< 1% significantly so). The average across all countries was 1.13. Only two countries have a wider variance for females. Across 86 countries, the variances are significantly male biased at the 5% level. Forty-four nations (48%) don't show significant heterogeneity in their Q scores, and of these, 2 are single case nations where the figure can't be calculated. While less heterogeneous than the other two content domains, as 52% are showing significant heterogeneity, the data cannot be considered homogenous. Figure 3 demonstrates the ratios and 95% CIs graphically (in order from smallest to largest). As is evident, while many significantly differ from 1.00 and countries vary considerably, few countries significantly vary from each other in the domain of science literacy.

Meta-regression
While meta-analysis gives us an approximation of overall ratios and points to the presence of heterogeneity, on its own it does not advance our understanding of where the heterogeneity is coming from. A novel approach to attempt to apportion the variance attributable to known sources of heterogeneity is to use a form of linear regression often termed meta-regression. This procedure produces outputs recognisable as regression coefficients for covariates and an amount of variance explained analogous to the traditional R² value. Table 4 illustrates a predictive model of variance ratios across countries which is built from the following covariates: year, test, being an OECD country, mean score for each country, the average male-female effect size (calculated as Hedges' g), GGGI economic participation, GGGI educational attainment, GGGI health and survival, GGGI political empowerment, expected years of schooling, mean years of schooling, life expectancy and GNI. Academic grade could not be considered in the model for mathematics literacy and reading as it was collinear with test. The referent categories were PISA and non-OECD. Due to availability of matched HDI and GGGI variables, only cases from tests administered from 2006 onwards are included in the analysis. Table 4 shows the results of this analysis.
These covariates predicted 31% of heterogeneity in Mathematics Literacy, 46% of the heterogeneity in Science Literacy and 54% of the heterogeneity in Reading. Many of the factors included in the model explain significant amounts of variance in effect sizes; however, this varies by domain. By far the most significant predictor is the size of the gender difference in scores (across all three domains). As the gap becomes larger in favour of females, the variance for males increases. The mean score of the country is statistically significant for reading and science literacy but has a very small, positive impact. The same can be said for the test year in mathematics literacy. There are small and significant effects for the tests (with TIMSS and PIRLS showing slightly less male variance) but this is harder to interpret, as it is confounded by age. HDI indicators seem to have little impact on variance ratios, although GNI has a very small positive but statistically significant effect on mathematics literacy and science literacy. GGGI indicators, however, have a stronger, negative impact on national variance ratios. Countries with higher Economic Participation for women have ratios favouring females across all domains. Better Educational Attainment for women, however, significantly increases the ratios in favour of males in mathematics literacy and science literacy. Increased political empowerment for women also seems to increase variances for females in literacy.
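To make the mechanics of meta-regression concrete, the following Python sketch fits a weighted least squares line of log variance ratios on a single moderator and computes an R² analogue as the proportional reduction in between-study variance. It is a deliberate simplification: the model reported here has many covariates and was fitted in dedicated software, whereas this sketch takes one moderator and treats the between-study variance (tau squared) as already estimated. Function names are hypothetical.

```python
def weighted_meta_regression(effects, variances, moderator, tau2=0.0):
    """Weighted least squares fit of effects on one moderator.

    Weights are 1 / (within-study variance + tau2). Returns
    (intercept, slope).
    """
    weights = [1.0 / (v + tau2) for v in variances]
    total_w = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, moderator)) / total_w
    y_bar = sum(w * y for w, y in zip(weights, effects)) / total_w
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, moderator))
    sxy = sum(
        w * (x - x_bar) * (y - y_bar)
        for w, x, y in zip(weights, moderator, effects)
    )
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def r2_analog(tau2_null, tau2_model):
    """Share of between-study variance explained by the covariates,
    floored at zero."""
    if tau2_null <= 0:
        return 0.0
    return max(0.0, 1.0 - tau2_model / tau2_null)
```

For example, a slope below zero for the male-female effect size would correspond to the pattern reported in this study, where larger female advantages accompany larger male variance ratios.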

Discussion
Results broadly confirm the previous works of Baye and Monseur (2016) and suggest that male variances are greater than female variances internationally. This was largely expected as, although the methodology differed, most of the data used in this study was the same. Baye and Monseur showed variances for males were greater by 15% in reading, 12% in maths and 14% in science. Our results indicated that these ratios are 16%, 12% and 13% respectively, and suggest that the inclusion of more recent international surveys has not altered them substantively. Similarly, the correlation between male-female effect sizes and variance ratios was in line with those found by previous authors, with superior female performance increasing the gap in variance between the sexes. As such, we can broadly support the findings of past research and conclude that over the studied period, male variances in the domains of reading, mathematics literacy and science literacy are almost universally greater. However, these results suggest that we can take this conclusion a step further. Feingold (1994) suggested that a difference of about 10% in variance ratios should be considered a substantive difference. Tables 1, 2 and 3 clearly show that for most countries engaging with PISA, PIRLS and TIMSS assessments, male variances are greater by often more than this threshold in all three domains. There are no geographical areas in this study that show significantly greater female variances. It would seem therefore that the question currently should no longer be, do male and female variances differ, but by how much more varied are males compared with females?
While males show greater variance in over 95% of cases, there is significant heterogeneity in these results, both within and between countries. While we can say with confidence that males are more varied and can generate a fairly precise estimate of a global average, we cannot come to an absolute value for each country individually and must contend with a large amount of dispersion. This dispersion is telling, however, and shows not only that countries differ (significantly in some cases, as is evident in Figs. 1, 2 and 3) but that they vary internally as well. There is a significant amount of heterogeneity across these data in most countries examined in this study, which requires explaining.
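For readers less familiar with how heterogeneity of this kind is quantified, the following sketch computes Cochran's Q, the I-squared statistic and a DerSimonian-Laird estimate of between-study variance from a set of effect sizes (for example, log variance ratios) and their sampling variances. The DerSimonian-Laird estimator is used here purely for illustration; the text above does not state which estimator was applied, and the function name and input values are hypothetical.

```python
def heterogeneity(effects, variances):
    """Return (pooled, Q, tau2, I2) for a fixed-effect pooling of `effects`.

    effects   : effect sizes, e.g. log variance ratios (hypothetical inputs)
    variances : their sampling variances
    """
    weights = [1.0 / v for v in variances]          # inverse-variance weights
    sum_w = sum(weights)
    pooled = sum(w * y for w, y in zip(weights, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # DerSimonian-Laird moment estimator of between-study variance.
    c = sum_w - sum(w * w for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # I2: share of total variation attributed to heterogeneity, not chance.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, q, tau2, i2


# Illustrative values only (not estimates from this study).
pooled, q, tau2, i2 = heterogeneity([0.05, 0.15, 0.25, 0.10],
                                    [0.002, 0.003, 0.002, 0.004])
print(f"pooled = {pooled:.3f}, Q = {q:.2f}, tau2 = {tau2:.4f}, I2 = {i2:.0%}")
```

A large I-squared indicates that most of the observed dispersion reflects genuine differences between assessments or countries rather than sampling error, which is the situation described above.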
Our meta-regression within each domain has gone some way towards explaining the heterogeneity observed in the dataset: close to half for reading and science literacy and about a third for mathematics literacy. Some of the findings are harder to interpret than others. The variable with the largest impact is the male-female effect size. This is the most substantive factor across all three domains and suggests that as girls increasingly outperform boys, the variability of boys increases. This seems to support earlier works that demonstrated a correlation between effect sizes and variance ratios (Baye and Monseur 2016; Nowell and Hedges 1998). The mean score for the country also has a significant albeit smaller impact in the same direction for science literacy and reading. Countries that perform better on average are therefore more likely to have greater variability for boys.
PISA tests appear to result in slightly more variance for males than TIMSS and PIRLS. Baye and Monseur (2016) found slightly smaller ratios in the primary years across all three domains. As TIMSS and PIRLS assess younger children, it may be that this simply reflects an age or maturity effect. However, we cannot rule out that the tests themselves are causing some of the heterogeneity, or that there may be a compositional effect between the two.
Interestingly, most of the HDI indicators were not significantly predictive of variance ratios across domains. The exception to this appears to be the GNI indicator (an adjusted form of GDP per capita) for mathematics literacy and science literacy but not reading. Reading is a specific skill that requires mastery and is often contingent on home environments for reinforcement. While this is to an extent true of basic mathematical concepts, later mathematics and science are likely tied more strongly to whatever specific curriculum is delivered, and this is largely coordinated at a national level. This may explain why national wealth may impact more upon maths and science as opposed to reading. However, it should be noted that, despite its statistical significance, it has only a minute impact on increasing male variance.
Measures from the Global Gender Gap Index, however, seem to have a larger impact on variance ratios. Increasing female economic participation appears to increase levels of female variance across all three domains. This suggests that actively incorporating more women into the labour force has an impact on educational outputs. Increased political empowerment for women also increases female variances in reading. Increased educational attainment for women has mixed impacts, however. It has a significant effect of increasing male variances in mathematics and science but a non-significant effect of increasing variance for women in reading. Taken together, these findings suggest that cultural practices tied to increasing female participation generally appear to increase variances for females, and that greater male variance in educational outcomes may be practically reduced at the national level. While this study cannot isolate which specific national-level practices are responsible for this, it does lead to interesting further questions regarding the processes underlying male/female variability.
The year of the test also had a very small but statistically significant effect on variance ratios in mathematics literacy. As with the test variable itself, why precisely this should be the case is difficult to rationalise. As mentioned earlier, there could be specific test administrations which have differences that create a small, positive effect. Alternatively, it could be that national educational systems have been adapting educational practices in order to improve their position in international rankings, and that these new practices are impacting upon the spread of scores. From this data alone, we can only speculate on the specifics as to why this may be the case.

Limitations and future work
There are several limitations to the data and the procedure we have used to explore it. First, a meta-analysis of international assessments such as PISA, PIRLS and TIMSS, while it controls for many extraneous variables that cannot be accounted for in a meta-analysis based on a literature search, does limit generalizability to alternative educational assessments. There could be something specific to these assessments that creates this effect. A limitation perhaps related to this applies to the assessments themselves. In PISA, the content being assessed is heavily based in literacy abilities. Even mathematics and science components are rooted in the ability to read, and poor readers are unlikely to achieve if they cannot interpret the questions posed. As is evident from Table 2, the domain with the greatest amount of male variability is reading. As such, it is possible that mathematics and science show comparable overall ratios simply because they are rooted heavily in the ability to read. It is interesting to note that previous works using different assessments have shown greater variabilities in quantitative domains compared to verbal ones (Lakin 2013; Lohman and Lakin 2009). Thus, what this data may be showing is greater variability in reading generally. This is still important and would pose the question 'why are males more variable at reading?', but we must therefore be cautious regarding the conclusions we draw from the mathematics and science domains.
This study tentatively suggests (as do Baye and Monseur 2016) that age may be a factor, and that variability for males increases as candidates get older. To our knowledge, no study specifically examines this, either longitudinally or cross-sectionally (with perhaps the exception of Lakin 2013). Alternatively, attempting to quantify nation-specific factors that could be included in additional regression analyses may be a future avenue worth exploring (particularly considering the impact of GGGI variables on ratios), potentially allowing us to account for more of the heterogeneity in these results.
A final avenue of exploration would be to examine this effect over additional academic assessments. Research historically focuses on core domains of reasoning (Baye and Monseur 2016; Lohman and Lakin 2009; Strand et al. 2006). While this is important, do we get similar patterns across curricular subject examinations (anything from art to zoology), or different modes of assessment (pencil and paper tests compared to practical performance assessments)? These are often studied less, in part due to reasons of sample representation, or the fact that specific subjects are often self-selecting. As it stands from the data and the literature reviewed here, we would expect to see similar patterns across assessments generally. It would be telling if this were not the case. If there are exceptions, what are they and why do they differ?

Implications for theory and policy
From a theoretical perspective, we cannot contribute causal explanations for why males are more variable. Data suggests the effect is almost universal, which, while supportive of biological and evolutionary theories, does not rule out specific cultural, educational, political, social or religious practices. Indeed, the fact that we can quantify substantial variation as dependent on increased female participation in society suggests that, at least in educational outcomes, it is not necessarily the case that males should vary more.
However, without a clear understanding of why males vary more and how this difference is maintained, we acknowledge that a meaningful discussion regarding what can be done to ensure parity is difficult. Increased female participation in the economy, education and political empowerment significantly reduce the size of the discrepancy in variances between males and females across the three educational domains studied here. If these increase, we might expect the variance gap to decrease. Which specific practices within countries are enabling this, however, is not discernible from the existing data, and more comparative, in-depth work within nations (with closer attention to specific educational practices) would be required before specific policy recommendations could be formulated to ensure parity between males and females across the ability distribution.
Differences in the spread of abilities are important for society. If, for example, we want to increase the representation of women in top positions and educational institutions, so that parity between the sexes exists at this level, it is important that males and females are equally represented in the higher percentiles of whatever qualifications or ability metrics constitute the selection processes. Similarly, the large gap in reading ability between boys and girls in the lower percentiles (Baye and Monseur 2016) suggests that some boys are likely to be at a serious disadvantage in later education (and potentially later life outcomes). Whilst implementing measures that strive for parity in the right tail of the distribution is important, we must also be mindful not to neglect the left.
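To illustrate why a modest difference in variance matters for representation in the tails, the sketch below computes the ratio of males to females expected beyond a given cut-off. It is a stylised example under strong assumptions that are ours and not the study's: scores for both sexes are treated as normally distributed, female scores are standardised to mean 0 and standard deviation 1, and group sizes are equal. The function name and the chosen values are hypothetical.

```python
from statistics import NormalDist


def tail_ratio(variance_ratio, cutoff_z, mean_diff=0.0):
    """Expected male-to-female ratio above `cutoff_z` under normality.

    variance_ratio : male variance divided by female variance
    cutoff_z       : cut-off in female standard deviation units
    mean_diff      : male mean minus female mean, in female SD units
    """
    female = NormalDist(mu=0.0, sigma=1.0)
    male = NormalDist(mu=mean_diff, sigma=variance_ratio ** 0.5)
    share_male = 1.0 - male.cdf(cutoff_z)      # proportion of males above cut-off
    share_female = 1.0 - female.cdf(cutoff_z)  # proportion of females above cut-off
    return share_male / share_female


# Illustrative inputs: a 16% variance difference, equal means.
for z in (1.0, 2.0, 3.0):
    print(f"cut-off z = {z}: male/female ratio = {tail_ratio(1.16, z):.2f}")
```

With equal means, the same ratios apply to the lower tail by symmetry, which is consistent with the point that both ends of the distribution deserve attention. A non-zero `mean_diff` can be supplied to explore how a mean difference in either direction shifts the balance in each tail.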

Conclusions
Our analysis seems to suggest that greater male variability is currently universal in internationally comparable assessments implemented over the past decade. However, this effect is far from homogeneous, and there are quantifiable differences that exist over nations. Furthermore, some of this heterogeneity can be attributed to some as yet unspecified practices or policies targeted at increasing male-female equality, general male-female performance as well as potentially the age of candidates and the type of test. Further work, however, is required to examine these factors in more detail, and analyses within nations may be informative to examine more specific practices that can explain national patterns. Comparative work examining high and low scoring GGGI countries may be informative in this endeavour. In doing so, it may be possible to determine if the root cause of these differences in distributions is attributable to some species-universal mechanism or some other social or cultural phenomenon.