A study of stereotype accuracy in the Netherlands: immigrant crime, occupational sex distribution, and provincial income inequality

In this pre-registered study, we gathered two online samples totaling 615 subjects. The first sample was nationally representative with regards to age, sex and education, the second was an online convenience sample with mostly younger people. We measured intelligence (vocabulary and science knowledge, 20 items each) using newly constructed Dutch language tests. We measured stereotypes in three domains: 68 national origin-based immigrant crime rates, 54 occupational sex distributions, and 12 provincial incomes. We additionally measured other covariates such as employment status and political voting behaviors. Results showed substantial stereotype accuracy for each domain. Aggregate (average) stereotype Pearson correlation accuracies were strong: immigrant crime .65, occupations .94, and provincial incomes .85. Results of individual accuracies found there was a weak general factor of stereotype accuracy measures, reflecting a general social perception ability. We found that intelligence moderately but robustly predicted more accurate stereotypes across domains as well as general stereotyping ability (r’s .20, .25, .26, .39, β’s 0.17, 0.25, 0.21, 0.37 from the full regression models). Other variables did not have robust effects across all domains, but had some reliable effects for one or two domains. For immigrant crime rates, we also measured the immigration preferences for the same groups, i.e. whether people would like more or fewer people from these groups. We find that actual crime rates predict net opposition at r = .55, i.e., subjects were more hostile to immigration from origins that had higher crime rates. We examined a rational immigration preference path model where actual crime rates→stereotypes of crime rates→ immigrant preferences. We found that about 84 % of the effect of crime rates was mediated this way, and this result was obtained whether or not one included Muslim % as a covariate in the model. 
Overall, our results support rational models of social perception and policy preferences for immigration.


Introduction
Stereotypes have a poor reputation in science. They are frequently labeled inaccurate, exaggerated, sexist, racist, and harmful (Jussim, 2012;Jussim et al., 2018b). Theories of the harmfulness of stereotypes generally involve claims of causality from social perception to social reality. For instance, the very popular model of stereotype threat involves stress from negative stereotypes being activated to actual performance or behavior in line with the stereotype (Shewach et al., 2019). On the other hand, there are some researchers who pursue a more rationalist approach to stereotypes wherein these are mostly accurate approximations of group differences that ordinary humans form based on their observations of other humans, reports in the media, government statistics and so on. From this perspective, social reality is the main cause of stereotypes, not the other way around. In recent years, social and biomedical science has suffered from the revelations of the replication crisis. It has been found that probably most published research is not replicable when other researchers, or sometimes even the same researchers, try to re-do a given experiment or study (Gordon et al., 2020;Kvarven et al., 2020). These empirical findings of replication failure are mostly congruent with the original findings being false positives, i.e., they reported some association in reality that didn't exist, or at least, is so weak as to be indistinguishable from noise unless one has a quite large sample size (e.g. n > 1,000).
In light of this replication crisis, we highlight the difference between two lines of research involving stereotypes: stereotype threat theory and stereotype accuracy. The first is based on the idea that exposure to stereotypes about one's own group(s) induces stress that lowers performance in various domains, but chiefly on standardized cognitive tests (Spencer et al., 2016). As with most other research in social psychology, the stereotype threat theory has not fared well. In particular, meta-analyses show that larger studies find weaker or no such effects, which suggests that publication bias has produced a misleading picture of the state of the evidence (Shewach et al., 2019). Furthermore, at least one large (n = 2,064 Dutch high school students) pre-registered replication failed to find any evidence of stereotype threat for females and math ability (P. Flore, 2018; P. C. Flore et al., 2018).
On the other hand, stereotype accuracy findings have replicated extremely well, with several recent, large studies finding approximately the same results as earlier small-scale research had reported (?). Thus, publication bias did not seem to affect this literature, in line with Sesardic's conjecture. 1 There are some limitations to the existing research on stereotype accuracy, however. First, most research has been done with North American students, the convenience sample of choice for most social science (Henrich et al., 2010). Thus there is a general need to see if findings replicate with samples from other parts of the world, especially those which are more representative. Second, as far as the authors are aware, all prior research has involved only a single domain of estimates. Thus, studies may have asked about stereotypes of ethno-racial groups, sexes, immigrant groups, or age groups, but none have measured stereotype accuracy for multiple domains at a time. For this reason, it is unknown how accuracy or bias in one domain relates to accuracy in another domain. Are people with more accurate stereotypes about sex differences also more accurate about age differences? Do people who exaggerate group differences in one domain also do so in others? Third, there are so far no known strong predictors of stereotype accuracy, e.g., with r > .30. Prior literature has reported positive results mainly for intelligence, educational attainment, male sex (sometimes), older age (sometimes), and some policy preferences. However, these factors have so far explained a quite meagre percentage of the observed variance, always below 10 %. Thus, it is not clear why some people apparently have accurate social perceptions for some domain (or in general) while others don't.
The current study sought to partially remedy these shortcomings by studying multiple domains of social perception at once, measuring more potential predictors of individual variation in stereotype accuracy, and employing large, Dutch samples that are more representative than the typical student samples. Furthermore, the study's analyses were pre-registered to a large extent, thus giving confidence that the results were not cherry picked among possible method variations. 2

Data
We sought to sample 500 Dutch citizens using the Prolific survey company (https://www.prolific.co/; Palan & Schitter, 2018). Unfortunately, they did not have sufficient subjects to take part in our research as we had planned, and we ended up with 411 subjects with valid data. For this reason we used an additional company, Survee (https://www.survee.dk/), to sample an additional 200 subjects (204 valid subjects obtained; we decided on 200 to have sufficient data to compare the two data sources). We used the same survey for both recruitment services, except that we had to insert extra questions into the survey for Survee because this service did not automatically provide the same metadata as Prolific does (data about employment status and so on). These questions were inserted for consistency, so that we had full coverage of all variables for both data sources. In total, we collected data for 685 subjects, 60 of whom failed our attention checks and were excluded from further analysis, leaving us with 615 valid subjects. See the appendix for details on the attention checks.
Our questionnaire took a median of 22.9 minutes to complete 3 and contained 64-67 questions (3 extra questions for the Survee version). The survey structure was as follows: 1. Description and consent
1 Sesardic's conjecture is that because research that is not friendly to left-wing ideology faces unscientific discrimination in academia, the published research that nonetheless makes it through the filters (ethics approval, grant applications, editorial and peer review) is consequently of higher than average scientific rigor. In his case, he was thinking of research on genetics (behavioral genetics), but the conjecture is more general (Kirkegaard, 2020a; Sesardić, 2005).
2 Preregistration document https://osf.io/8qhmr/.
3 The median is preferable here because some subjects leave the tab with the survey open for hours or even days, causing a large tail. The mean was 2,305 minutes (38.4 hours) with a standard deviation of 23,155. The median absolute deviation was 8.6 minutes.
For measuring intelligence, we decided to measure two aspects: vocabulary and science knowledge. These are both crystallized (i.e. accumulated) aspects of intelligence. The reason for the choice of these is that they are faster to measure and have high loadings on the general factor, making for a more reliable test (Kan et al., 2013). The items in these tests were designed for this survey. The vocabulary test was designed after the English test at https://openpsychometrics.org/tests/VIQT/ (Kirkegaard, In Review). The design is a select-2-from-5 approach. Each item is a list of 5 words and the subject is asked to pick the two synonyms. The science test concerned knowledge questions about various areas of science in typical multiple choice format (choose 1 from 6-10 options). 5 We sought to maximize the number of distractors (false response options) since this reduces the chance of blindly guessing correctly, and thus should increase the factor loading of the item.
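The effect of the response format on blind guessing can be made concrete with a little arithmetic. The sketch below assumes a subject who picks uniformly at random among the response options; the function names are illustrative and are not part of the study materials.

```python
from math import comb

def chance_select_k_from_n(n_options: int, k_correct: int) -> float:
    """Probability of picking exactly the right k options by uniform random guessing."""
    return 1 / comb(n_options, k_correct)

def chance_multiple_choice(n_options: int) -> float:
    """Probability of picking the single correct option by uniform random guessing."""
    return 1 / n_options

# Vocabulary items: pick the 2 synonyms out of 5 words.
vocab_chance = chance_select_k_from_n(5, 2)

# Science items: pick 1 option out of 6 to 10.
science_chances = {n: chance_multiple_choice(n) for n in range(6, 11)}
```

Under this assumption, a select-2-from-5 item is as hard to guess as a 10-option multiple choice item, since there are 10 possible pairs among 5 words.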
The tests are both in Dutch and are freely available to anyone in the supplementary materials for any purpose with no prior permission (public domain). The appendix gives examples of items. The resulting data were analyzed using item response theory. We tried different scoring methods, as set out in the pre-analysis plan. Our primary measure was the single factor item response theory (IRT) model. We additionally scored a 3-factor model, with a general factor and a group factor for each test (vocabulary and science knowledge). We were particularly interested in examining ability tilt effects (Coyle, 2018; Kirkegaard, 2020b), so we wanted an orthogonal tilt factor to include in our regressions. The scores from the 3-factor IRT model did not produce this result, so we tried 3 other methods of creating a tilt score: 1) subtracting the z-scored sum of science items from the z-scored sum of verbal items, 2) subtracting the science knowledge IRT score from the verbal IRT score, both from the 3-factor model, and 3) computing the non-g residuals of the sum scores of the verbal and science scores, and then subtracting the science score from the verbal score.
All of these attempt to quantify the notion of doing relatively better on the verbal part as compared to the science part, while ignoring the overall level of intelligence. Analyzing these scores, we find that the last approach produces appropriately orthogonal scores to g, and we used this for our analysis (V tilt).
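The residual-based tilt score can be sketched as follows. This is a minimal illustration on simulated data, assuming that the non-g residuals come from simple linear regressions of each sum score on a g score; the function names, the simulated values, and the choice of g estimate are assumptions and are not taken from the study's analysis code.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualize(y, x):
    """Residuals of y after a simple OLS regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def verbal_tilt(verbal_sum, science_sum, g):
    """Verbal minus science score after removing g from both."""
    v_res = residualize(verbal_sum, g)
    s_res = residualize(science_sum, g)
    return [v - s for v, s in zip(v_res, s_res)]

# Simulated subjects (hypothetical values, for illustration only).
random.seed(1)
g = [random.gauss(0, 1) for _ in range(500)]
verbal_sum = [10 + 3 * gi + random.gauss(0, 2) for gi in g]
science_sum = [10 + 3 * gi + random.gauss(0, 2) for gi in g]

tilt = verbal_tilt(verbal_sum, science_sum, g)
r_tilt_g = pearson(tilt, g)  # zero up to floating point error
```

Because OLS residuals are uncorrelated with the regressor by construction, the difference of two such residuals is also uncorrelated with g, which is the orthogonality property sought here.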
For the stereotype measurement, we were especially interested in immigrant stereotypes because of prior research on the topic and the general political relevance. We sought two other domains and picked sex differences in employment sectors/jobs, as well as provincial income differences. These were selected so as to be as different from each other as possible. For the immigrant crime data, we picked 68 origin countries from a recent study of immigrant crime rates (Kirkegaard & de Kuijper, 2020). These estimates should be highly reliable, as they are based on public data published by the government, and thus suitable as criterion data (Jussim, 2012). The numbers specifically concern the arrest rates for the groups. For the occupation data, we obtained a list of 54 occupations from CBS (Centraal Bureau voor de Statistiek; Statistics Netherlands). For the provincial data, we obtained the average (mean) disposable income (corrected for household size, excl. student households) for the 12 provinces of the Netherlands. 6 For immigrant preferences data, we reused data from a prior study that concerned the same 68 origins (Kirkegaard & de Kuijper, 2020). The subjects in this prior dataset overlapped with the current ones to some extent, but a prior study found that sample overlap between subjects asked about preference and crime stereotypes did not affect results.
4 Units are part of the same overall question, but involve different units, i.e., countries, or professions, as part of the same question (e.g. grid response type).
5 These were obtained from our current pool of about 250 science knowledge questions that are under development. There were 6 biology, 4 math/statistics, 2 economics, 1 history, 2 psychology/psychiatry, 2 linguistics, 2 physics, and 1 geography questions.
The specific questions were, with English translations:
• Geef met behulp van de slider aan voor hoeveel procent u denkt dat de volgende beroepen door mannen uitgevoerd worden.
-Please use the slider to indicate the percentage of men performing this profession.
-There are many different immigrant groups in the Netherlands. For each of the groups, adjust the slider to your estimation of the crime rate relative to Dutch natives. This means you should adjust the slider to two (2) if you think the crime rate of this group is twice that of natives.

• Besteedbaar inkomen [followed by a slider for each region]
-Disposable income

Stereotype scoring methods
Because the scoring methods of stereotypes are not well known, we provide examples here for illustrative purposes. Suppose we gathered data from 3 raters (A, B, C) who rated 3 target groups (X, Y, Z) on some trait. Table 1 shows example computations of accuracy and bias metrics. Table 1 shows both the raw estimates and the computed metrics for accuracy and biases. The real group means are 10, 20, and 30 on some hypothetical trait. Each rater had some level of accuracy for these differences, shown in their correlational accuracies which are all positive, ranging from .43 to 1.00. The difference between the Pearson and Spearman values is that for Spearman, only the order matters, whereas for Pearson, the relative differences matter, though not the scale. The scale consists of the mean and standard deviation of the ratings. Thus, we see that rater C has perfect accuracy in correlational terms (1.00), but actually his scale is widely off the mark in terms of both central tendency and dispersion. Central tendency and dispersion can be operationalized in different ways, but here we used the (arithmetic) mean and the SD, the most common metrics. C's (implied) estimate of the SD is 1 whereas the true SD is 10, thus earning him an SD error of -9 (i.e., he was 9 too low). Similarly, the raters differ in their estimate of the mean, ranging from 14.0 to 24.0, whereas the true value is 20. Thus, they suffer from mean bias but in different directions, ranging from -6.0 to 4.0. Finally, one can calculate the mean absolute deviation (MAD), which is the all-inclusive measure of accuracy. The various aspects of (in)accuracy may or may not covary. It depends on the structure of errors. If everybody perceives the same signal and is affected by different amounts of random noise, the metrics will tend to be positively related when adjusted for direction (i.e., lower MAD is better, but higher correlations are better). 
However, if the errors are a mix of many influences that differ between people in complex ways, the various metrics may not relate much or could even show opposite relations. For instance, correlation accuracies might be larger for people who overestimate group differences (i.e., have positive SD errors). The appendix provides a set of results from simulated data that illustrate these points.
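The metrics described above can be computed with a few lines of code. The following is a minimal sketch, not the study's scoring code: the estimates for the rater are hypothetical, chosen to reproduce the properties described for rater C (mean 14, SD 1, perfect correlational accuracy), and the function names are illustrative.

```python
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def score_stereotypes(estimates, criterion):
    """Accuracy and bias metrics for one rater; requires a shared scale."""
    errors = [e - c for e, c in zip(estimates, criterion)]
    return {
        "pearson_r": pearson(estimates, criterion),
        "rank_r": pearson(ranks(estimates), ranks(criterion)),  # Spearman
        "mean_error": mean(estimates) - mean(criterion),        # elevation bias
        "sd_error": stdev(estimates) - stdev(criterion),        # dispersion bias
        "mad": mean([abs(d) for d in errors]),                  # mean absolute deviation
    }

criterion = [10, 20, 30]   # true group means for X, Y, Z
rater_c = [13, 14, 15]     # hypothetical estimates consistent with the description of C
metrics_c = score_stereotypes(rater_c, criterion)
```

For these inputs the correlational metrics are perfect while the mean error is -6, the SD error is -9, and the mean absolute deviation is 8, which illustrates how a rater can rank groups correctly and still be far off in scale.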
In the above case, the scale of the estimates and the criterion values is the same. However, if the scale is not the same, then some of these values cannot be used. It does not make sense to compute deviation scores when they are on different scales, and thus, the metrics derived from these also make no sense. Thus, in practice, when scales are not the same, one is limited to using correlational metrics. In practice, these are arguably the most important metrics as well because for decision making, it is mainly the relative differences between groups that matter, less so whether one gets the scale wrong. However, since there are persistent claims that stereotypes exaggerate real differences, one will have to acquire data on the same scale to compute the dispersion error metrics to examine these claims. Central tendency errors are sometimes of interest. If one was estimating crime rates by group, then tendencies to over- or underestimate crime rates in general may be of interest since this would presumably relate to people's preferred policies in the criminal justice system (e.g., if one overestimates/underestimates the prevalence of crime, then one might support more/less funding for policing than is necessary).
The scoring of aggregate stereotypes works the same way as the individual level, except one first aggregates the individual ratings. In the above, the mean was used. Usually, the average estimates will be more accurate than the average of the individual accuracies. However, this is not necessarily the case, and in fact, not the case in the above for the correlational metrics (since C had perfect scores).
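As an illustration of aggregate scoring, the sketch below averages the ratings of three raters per target group and compares the accuracy of that aggregate with the average of the individual accuracies. The ratings are hypothetical and are not the values from Table 1; the variable names are illustrative.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

criterion = [10, 20, 30]  # true values for groups X, Y, Z

# Hypothetical ratings, one list per rater, in group order X, Y, Z.
ratings = {
    "A": [14, 28, 30],
    "B": [25, 10, 28],
    "C": [13, 14, 15],
}

# Aggregate stereotype: the mean rating for each group across raters.
aggregate = [mean(group) for group in zip(*ratings.values())]

aggregate_r = pearson(aggregate, criterion)
individual_rs = {name: pearson(est, criterion) for name, est in ratings.items()}
mean_individual_r = mean(individual_rs.values())
```

With these made-up ratings the aggregate is more accurate than the average individual, which is the usual pattern; as noted, nothing guarantees this for every dataset.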

Results
There are results of interest at two levels of analysis: individual level (personal) and aggregate level (consensual) stereotypes. At the individual level, accuracy is generally weaker, and varies among people. This allows one to examine associations between accuracy (and bias) metrics and other individual variables, as well as examine the distributions of the variables. At the aggregate level, typically the arithmetic mean is used to aggregate the individual stereotypes to a single vector of values, which one then relates to other variables of interest. In this study, we used both approaches, and so the results are necessarily split between these levels of analysis. It is useful to keep in mind that stereotypes may function differently depending on their level. Individual stereotypes will be used to guide the decisions of these individuals as they function as reasonable priors in a Bayesian sense (if they are accurate, that is) (Jussim, 2012;Levin, 1997). At the social level, when stereotypes are largely shared, they will likely result in political decision making being affected, such as policies for dealing with certain groups.
Because of our use of three domains, the results are complicated to present, and we have split them by domain. The sections on occupations and provinces are less detailed since these were not the primary focus of the study.
To decrease false positive rates, we used a more conservative p-value threshold of 0.01. Due to the number of tests done across the full study, this level should still be taken as suggestive of an association pending replication.
Party variables refer to the average votes for that party in the last election and a hypothetical election today (i.e., values can be 0, .5, and 1).

Individual level results
The Survee dataset was intended to be nationally representative, and the descriptive statistics bear this out. The Prolific group was younger (means 29 vs. 42, average age for the country is 42), more male (58 % vs. 47 %, average for the country is about 50 %), much more likely to be students (48 % vs. 18 %), somewhat smarter (0.43 d), and much more likely to vote for left-wing parties (e.g., Green-left/Groenlinks 24 % vs. 9 %, the party received 9.1 % of the vote in the 2017 general election, see appendix). Furthermore, Table 3 provides summary statistics for the voters by party for some variables. Values for variables were calculated by extrapolating to the expected value at complete support for the party (voted for the party in 2017 and would vote for it today). We see that in terms of average intelligence, the social-liberal parties had the highest levels. This confirms previous research using both scales to measure social liberalness and party votes (Carl, 2014, 2015; Deary et al., 2008; Kirkegaard et al., 2017). The ability tilt seems to be mostly related to the average age of the voters, which is unsurprising since these variables are correlated at r = .42 in this dataset, that is, older people do relatively better on verbal tests than on science knowledge.
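One way to read the extrapolation to complete party support is as a simple regression of a variable on the party support score, evaluated at a support of 1. The sketch below makes that assumption explicit; the linear form, the function name, and the data are hypothetical and may differ from the procedure actually used.

```python
from statistics import mean

def expected_at_full_support(support, values):
    """Fit values = a + b * support by OLS and return the prediction at support = 1."""
    mx, my = mean(support), mean(values)
    slope = (sum((s - mx) * (v - my) for s, v in zip(support, values))
             / sum((s - mx) ** 2 for s in support))
    intercept = my - slope * mx
    return intercept + slope * 1.0

# Hypothetical subjects: support for one party (0, .5, or 1) and some trait score.
support = [0, 0, 0.5, 1, 1]
trait = [0, 2, 3, 4, 6]

party_value = expected_at_full_support(support, trait)
```

The advantage of this kind of approach over averaging only the full supporters is that subjects with partial support (a score of .5) still contribute information to the estimate.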

Immigrants and crime rates
Immigrant crime rates were the primary domain of interest for the stereotypes in this study. Overall, there was a fair amount of accuracy in the stereotypes. Figure 1 shows the distributions of four main metrics, and Table 4 shows summary statistics for the metrics. The distribution of correlational accuracy shows the same long tail into negative values as seen in prior studies. It appears that some people provide reverse ratings on purpose, since such large negative values are unlikely to result by chance. Still, this did not affect the mean much, since in this case the mean and median are nearly identical (.33 and .32). In terms of elevation error, the mean and median values disagree (0.15 vs. -0.34). This is due to a long tail of people who grossly overestimated crime levels relative to natives, but the median person (and the majority of persons) actually underestimated non-native levels of crime involvement. With regard to dispersion error, we see the same general distribution pattern, but here even the mean person underestimated true differences in crime rates between groups.
We were particularly interested in the association between intelligence and stereotype accuracy. Figure 2 shows the scatterplots of intelligence against the stereotype accuracy metrics. Each measure tells its own story. For the two main measures of accuracy, we see notable relationships, r = .20 and -.30 (plots 1-2), for correlational accuracy and mean absolute error (lower values better), respectively. In other words, smarter people are better at getting the relative differences right and also better at getting the absolute values right. Looking at the scaling metrics, we see that smarter people tend to underestimate the overall crime rate a bit (plot 3, expected value at high g is negative), but not enough such that they become more inaccurate when we disregard directions (plot 4, expected value at high g is about 0).
With regard to dispersion, we see strong evidence that smarter people underestimate real differences (plot 5, negative values for higher g), and this is enough that they aren't more accurate when we disregard direction of error (plot 6, correlation is near zero, though possibly slightly negative). Expanding to the entire set of predictors, Table 5 gives the correlations between accuracy metrics and all the quantitative metrics. Table 5 reveals a number of findings. First, it can be noted that there are sometimes opposite findings for Pearson r and MAE. For instance, age shows a weak positive relationship with Pearson r accuracy, r = .11, but a stronger positive relationship with MAE, r = .22 (for MAE, negative is better). Hence, it appears to be the case that older persons exaggerate differences (positive SD errors), and while this increases their Pearson r a bit (which ignores scaling errors), it results in overall worse accuracy (by MAE, which takes into account everything). We also see that verbal tilt is related positively to SD errors, and similarly to overall accuracy metrics, but not to the mean errors. Education is interesting in that the pattern is very similar to that for intelligence, yet comparatively weaker. The two major players in terms of immigration stances are the PVV (Party for Freedom, the party of nationalist Geert Wilders), and Groenlinks (green left party). PVV voters tend to exaggerate both overall rates of crime and the differences between groups (positive errors for mean and SD, including absolute variants), while green party voters tend to underestimate the same, but not so much that it causes worse accuracy when direction is ignored (absolute versions are negative, i.e., higher accuracy). However, the predictor variables are correlated, so it is not clear which is an effect of which, or simply due to confounding. Table 6 shows the main regression results from OLS (ordinary least squares). The appendix contains additional results for nationalists and non-nationalists.
The regression models clarify some things. Age is no longer a useful predictor (β = 0.00 in full model), the validity seen previously being due to its association with other predictors included in the model. Most importantly, intelligence was still a notable predictor, and it was about equal with education (β's = 0.17 and 0.15). Males had somewhat better accuracy (β = 0.23 in full model; β denotes the slope for the binary or proportional predictors), as has been found previously. Student status was associated with markedly worse accuracy (β = -0.35), which is interesting. Party voting was mostly unrelated to accuracy, except that voting for PvdD and GL (both are left-wing parties) was associated with markedly worse accuracy, consistent with bivariate results. The full model above contains a larger number of variables without detectable validity. To see to what degree these could be left out without affecting the model validity, we used lasso regression to find a good subset of the variables. Table 7 gives the results. The lasso results replicate most of the findings from OLS, namely that intelligence (β = 0.15), male sex (β = 0.15), and education (β = 0.10) predict accuracy. Curiously, lasso selects a number of variables considered mostly
Since writing the analysis plan for our paper, we became aware of another method for variable selection and comparison, which is arguably more suitable here, namely Bayesian model averaging (BMA) (Goenner, 2004; Hinne et al., 2020). Since our analysis plan included the lasso, we could not switch to this alternative approach, so we instead present it here as an additional approach alongside the lasso. Briefly, BMA works by exploring the model space of possible models. If it is possible to explore all of them, this is done, otherwise, they are sampled at random or explored using an adaptive approach (similar to forward selection). Among the models explored, the best are chosen based on model fit criteria.
Among this subset of best models, the betas are summarized by the mean and SD, weighted by the model fit (best fitting models providing the most weight). This approach results in more stable results than picking a single best model (as lasso does), while still providing interpretable results. We used the BMA package for this using the default settings (Raftery et al., 2020). 7 Results are shown in Table 8. The results are similar to the previous ones. Notable is that 100 % of the best models included intelligence as a predictor (β = 0.19, not much different from r = .20 in bivariate results in Table 5). In contrast to the full regression results (in Table 6), and the lasso results (in Table 7), BMA does not find that education is as important as intelligence, including it in only 37 % of the best models, and assigning it a notably smaller beta (β = 0.04). Student status (β = -0.33) and voting for certain left-wing parties (PvdD and GL) are still strong predictors included in almost all models, while the remaining variables are more sporadic. An exception to this is the FvD party (Forum for Democracy, a nationalist party headed by Thierry Baudet), which had a moderate positive beta (β = 0.17). However, all the parties with notable betas had high variance in the estimates, so the present study is not large enough to estimate their values precisely.
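The general logic of model averaging can be sketched in a few dozen lines. The example below is a simplified illustration on simulated data, not a reproduction of the BMA package or its defaults: it assumes BIC-based approximate model weights, enumerates every subset of predictors, and averages over all of them rather than only over a subset of best models. All names and data are hypothetical.

```python
import math
import random
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(columns, y):
    """OLS with intercept; returns (coefficients incl. intercept, residual sum of squares)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[p] * r[q] for r in rows) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(rows, y)) for p in range(k)]
    beta = solve(xtx, xty)
    rss = sum((yi - sum(bj * rj for bj, rj in zip(beta, r))) ** 2
              for r, yi in zip(rows, y))
    return beta, rss

def bma(predictors, y):
    """Inclusion probability and model-averaged slope for each predictor."""
    names, n, models = list(predictors), len(y), []
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            beta, rss = ols([predictors[v] for v in subset], y)
            bic = n * math.log(rss / n) + (len(subset) + 1) * math.log(n)
            models.append((subset, beta[1:], bic))
    best = min(bic for _, _, bic in models)
    raw = [math.exp(-0.5 * (bic - best)) for _, _, bic in models]  # approx. model weights
    total = sum(raw)
    summary = {v: {"inclusion": 0.0, "beta": 0.0} for v in names}
    for (subset, betas, _), w in zip(models, raw):
        for v, bv in zip(subset, betas):
            summary[v]["inclusion"] += w / total
            summary[v]["beta"] += w / total * bv  # excluded predictors count as 0
    return summary

# Simulated data: accuracy depends on g only; the other predictors are noise.
random.seed(7)
n_subjects = 300
data = {name: [random.gauss(0, 1) for _ in range(n_subjects)]
        for name in ("g", "age", "student")}
accuracy = [0.5 * gi + random.gauss(0, 1) for gi in data["g"]]
result = bma(data, accuracy)
```

In this sketch a real predictor ends up in essentially all of the highly weighted models, while noise predictors receive low inclusion probabilities and averaged slopes close to zero, which mirrors the way inclusion percentages and betas are read in the results here.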

Muslim bias
Of particular interest was potential Muslim bias in the stereotypes (Kirkegaard & Bjerrekaer, 2016). Muslim bias can be thought of conceptually in multiple ways, but all of them involve groups and countries with a higher percentage of Muslims having larger errors in some sense. As detailed in the prior study, there are 3 metrics to use: 1) Muslim error r, 2) Muslim standardized error r, and 3) Muslim elevation error.
In the first, the deviations from criterion values are computed for each estimate, and these are correlated with Muslim % in the groups. In the second, a regression model is fit, the residuals saved, and then correlated with the Muslim % values. The difference here is that the first approach forces the scaling to be on the true scale, whereas the second does not. The third method involves computing the deviations as in the first, but then computing two weighted means, one with Muslim % and one with 1-Muslim % as weights, and then subtracting the second from the first. Empirically, the first and third metrics have been found to be very highly correlated (r = .96 in prior study), but the latter has the advantage of being given in natural units, whereas the correlation is unitless [-1 to 1]. Table 9 shows summary statistics for the metrics and Figure 3 shows the distributions. In terms of direction, all the metrics show the same conclusion, namely that subjects had biases in favor of the more Muslim groups in the sense that they tended to underestimate their criminal involvement (negative values mean that more negative errors are associated with Muslim %, i.e., underestimation bias). The weighted mean metric shows the median error is about -0.52, meaning that the median subject underestimated the relative crime rate by 0.52 for more Muslim groups as compared to the less Muslim groups. In the same way, the median correlation between their errors and Muslim % is -.27. Figure 4 shows the most representative case in the dataset across bias metrics (see the appendix for details of the method). For this case, we see their correlation is -.29 (compared to the median of -.27). Their estimation errors for the Muslim groups are more negative on average. For instance, Tunisia is estimated to have an average crime rate (RR = 1.1), but actually has a high one (RR = 3.5), thus giving a large negative error (-2.4).
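The three bias metrics can be sketched for a single subject as follows. This is an illustration under stated assumptions: the regression in the second metric is taken to be a simple regression of the estimates on the criterion values, and the data are invented for five groups (only the pair 1.1 vs. 3.5 echoes the Tunisia example; the Muslim proportions are made up). Function and variable names are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residuals(y, x):
    """Residuals of a simple OLS regression of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - (my + slope * (a - mx)) for a, b in zip(x, y)]

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def muslim_bias_metrics(estimates, criterion, muslim_share):
    """muslim_share holds proportions in [0, 1] for each origin group."""
    errors = [e - c for e, c in zip(estimates, criterion)]
    non_muslim_share = [1 - m for m in muslim_share]
    return {
        # 1) correlation between raw errors and Muslim share
        "error_r": pearson(errors, muslim_share),
        # 2) scale-free variant: residuals instead of raw errors (assumed regression form)
        "resid_r": pearson(residuals(estimates, criterion), muslim_share),
        # 3) difference between two weighted mean errors, in natural units
        "wmean": weighted_mean(errors, muslim_share)
                 - weighted_mean(errors, non_muslim_share),
    }

# Hypothetical subject rating five origin groups (relative crime rates, natives = 1).
criterion = [1.0, 1.5, 3.5, 3.0, 0.8]
estimates = [1.0, 1.4, 1.1, 1.5, 0.9]
muslim_share = [0.0, 0.1, 0.99, 0.9, 0.02]

bias = muslim_bias_metrics(estimates, criterion, muslim_share)
```

For this made-up subject the first and third metrics are both clearly negative, i.e., the groups with higher Muslim shares are the ones whose crime rates are underestimated.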
Table 10 shows the correlations among the metrics as well as the primary accuracy metrics.
As in the prior study, the error correlation and weighted mean approaches are in near perfect agreement, both in directional and absolute variants (r's .94, and .84), whereas the scale-free residual approach only shows strong correlations with the directional metrics (r's .59 and .60), and near zero with the absolute variants (r's -.13, and .01). As for the correlations with accuracy, there are seemingly paradoxical findings. First, there is a positive correlation between Muslim bias r and Pearson r accuracy, r = .37, but also a positive one with the mean absolute error, r = .42 (again, smaller is better, so one would expect this to be negative). How is this possible? Here it should be recalled that the metrics are directional and the best value is 0, not 1 (or infinite). As such, greater directional Muslim bias values may be related to correlational accuracy; it depends on where the distribution is located. Figure 5 shows the scatterplot of the two variables.
In the left plot, we see that most people are located on the left side (i.e., underestimate Muslim crime rates), and those who overestimate them (on the right side) tend to have higher correlational accuracies, possibly because they exaggerate real differences. To clarify this, one can examine the other primary accuracy metric, in the right plot. Here we see that people with greater errors (in any direction) do show relatively positive Muslim biases. So perhaps those with large Pearson r accuracy attain this by drastically overestimating Muslim crime rates. In fact, the correlation between SD error (i.e., tendency to over- or underestimate true differences) and Pearson accuracy is .19, so this appears not to be (much) the case.
Table 11 shows the correlations with the quantitative predictors, and Figure 6 shows the scatterplots for the relationship to intelligence. Starting with the first metric (bias r), we see that smarter people tend to underestimate Muslim crime rates (plot 1, expected value for high g is negative), and this tendency is strong enough that it results in greater undirectional errors (plot 2, expected value for high g is positive). Interestingly, the same pattern is seen for the second metric (wmean, plots 3-4), but much weaker, especially for the absolute variant. The reason for the discrepancy is not obvious as the metrics are correlated quite strongly, and should ideally quantify the same concept. The weaker third metric is only included here for comparison purposes (plots 5-6), and shows only very weak patterns. Turning to the predictors, we see that the left-wing parties tend to have similar bias patterns to the highly intelligent and highly educated, thus illustrating the connection between voting left-wing and being upper class. The nationalist parties show the opposite patterns. Still, when we look at the undirectional errors (bias r abs, and wmean abs), few variables are related.
It is only intelligence and education, but again, only to the bias r variant, not the wmean variant. It therefore appears that various kinds of people make approximately equal sized errors with estimation of Muslim groups, but differ in their directions of error along a partisan split of globalist vs. nationalist.

Occupations and sex
Next, we turn to the stereotypes about sex differences in occupation. We have data for 54 occupations, thus a similar number of groups as in the prior section (68 origin groups). As before, we score these for accuracy using a variety of metrics. Distributions of select metrics are shown in Figure 7, while Table 12 shows the summary statistics. Overall, we see very high levels of accuracy. The median Pearson r is .75, among the highest values seen for any stereotype study (Jussim et al., 2018a). Unlike the case with crime rates, the rank r is slightly lower than the Pearson r (.73 vs. .75), suggesting that for crime rates, people had some trouble with the scaling, but not the rank orders. For sex differences, this was not the case. The median MAE is 15.37, meaning that the median guess was about 15 % points from the right value. Considering that values can span from 0 to 100 %, and thus a random guess would lead to a median MAE of about 32 (cf. the simulations in the appendix), this is a quite high level of absolute accuracy. The median mean error was about -1.6, meaning that people slightly underestimated the number of men in an occupation, but it was very close to the true average across the occupations (61.6 vs. median guess of 60.0). For dispersion, there is substantial underestimation of sex inequality of the job market, with the median SD being 22.6, while the true value is 30.5, thus a median SD error of -8.00. In relative terms, this is quite large, a 26.2 % underestimation. This finding flies in the face of repeated claims of exaggerated sex stereotypes (Jussim et al., 2018a). Turning to the relationship with intelligence, Figure 8 shows the scatterplots. As with crime rates, we see notable accuracy levels for individual estimates, with correlations between intelligence and Pearson r accuracy of .25, and MAE of -.20 (lower is better).
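The accuracy metrics used above can be illustrated with a minimal sketch for a single rater. The occupation values and guesses are hypothetical, the function names are ours, and the use of the sample SD for the dispersion error is an assumption, since the text does not say which SD variant was used.

```python
import math
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def accuracy_metrics(estimates, truth):
    """Correlational, absolute, and componential accuracy for one rater."""
    errors = [e - t for e, t in zip(estimates, truth)]
    return {
        "pearson_r": pearson_r(estimates, truth),
        "rank_r": pearson_r(ranks(estimates), ranks(truth)),  # Spearman
        "mae": mean(abs(d) for d in errors),           # lower is better
        "mean_error": mean(estimates) - mean(truth),   # elevation bias
        "sd_error": stdev(estimates) - stdev(truth),   # dispersion bias (sample SD assumed)
    }

# Hypothetical % men in five occupations and one rater's guesses
truth = [95.0, 80.0, 50.0, 20.0, 10.0]
guess = [80.0, 70.0, 50.0, 30.0, 25.0]
metrics = accuracy_metrics(guess, truth)
```

In this made-up case the rater gets the rank order exactly right and has no elevation bias, yet still has a nonzero MAE because the guesses are not extreme enough (negative SD error), which is the same pattern described for the occupational stereotypes.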
For the componential errors, we see that smarter people tend to underestimate the proportion of men in the occupations, though not by much, and the tendency is only moderate (r = -.15). The Pearson r accuracies have a number of outlying values. If one removes values below 0, the correlation with intelligence increases to r = .32. The other associations are too small to care about (|r| < .10). Table 13 expands this approach to our OLS regression models with the other predictors.
Intelligence keeps its position as the dominant quantitative predictor (β = 0.25), while we also see a substantial effect (β = -0.60) of being a non-Dutch native speaker. This second finding could be interpreted as a lack of language skill with the survey. However, this predictor was not strong for crime rates (β = -0.09, p > .05), which makes this explanation implausible. No other predictor reaches detectable utility (defined as p < .01, unadjusted for multiple testing), so it appears the model can be substantially simplified. Table 14 gives the results from the lasso regression, while Table 15 gives the results from BMA.
The approaches are mostly in agreement: there are only 2 notable predictors, intelligence and being a non-Dutch native speaker. Lasso oddly includes a political party but with a near-zero beta, so this should probably be seen as a fluke. Our models are not very predictive of variation in this stereotype (full model adj. R² = 6.8 %, compared to 13.0 % for crime rates), despite the inclusion of many demographic and political variables that should capture aspects of feminist ideology, which conceivably would be related to stereotype accuracy in this domain.

Provincial incomes
Finally, subjects estimated the mean incomes of the 12 provinces of the Netherlands. Table 16 gives the summary statistics of the accuracy metrics, while Figure 9 gives the distributions.
In correlational terms, there is substantial accuracy, with a median Pearson r of .62, and rank r of .61. The means, however, are quite reduced (both r's .46), in agreement with the long tail towards -1 seen in the distributions. This tail is curious, as it seems unlikely some people are purposefully filling out the questionnaire in reverse of their real beliefs, as was seen previously with immigrant stereotypes (Kirkegaard & Bjerrekaer, 2016). The median mean error was close to 0 (1,192, or about 4 % off the true mean value of 28,808), indicating that the scale was understood by the subjects despite its somewhat technical nature (disposable income). Interestingly, the median SD error was positive (1,787), showing that subjects overestimated real differences. In relative terms, this effect is very large: the median estimated SD was 3,343 but the true value was only 1,555, so the estimate was 115 % too large. Apparently, the public believes provincial inequality in disposable income to be much larger than it really is. Moving on to prediction, Figure 10 shows the scatterplots with intelligence.
The scatterplots reveal the presence of outlying values. Presumably, these represent people who grossly misunderstood the task and somehow gave results opposite of reality. These people are mostly clustered among below average intelligence subjects, giving rise to a correlation of .26 with intelligence. If subjects with Pearson r accuracy below 0 are removed, the correlation is reduced to .18. This is in contrast with the occupational stereotypes, where removing outliers (below 0) resulted in a stronger correlation (from .25 to .32). Looking at the plots also reveals a cluster of people with mean errors around -26k. These are people who filled in very small values, relative to this scale. Inspection of these cases showed that some of them probably reflect lazy responding (e.g., 30 people filled in responses with zero variance), and some are people who assumed they were giving values in the 1000's (e.g., one person filled in varying values between 32 and 45). As a robustness test, we removed all subjects who filled in values with a mean below 1000 (i.e., more than a factor of 26 off) and those with no variance, leaving n = 536 of 615 cases, or 87 %. The relationship to intelligence was mostly unaffected, r = .25 with Pearson r, r = -.19 with MAE. The correlations with the mean and SD metrics became a bit stronger, but overall, this exploratory analysis did not change much. Furthermore, a large fraction of the outliers with Pearson r < 0 remained after this filtering (15.6 % before, and 14.9 % after). Table 17 gives the regression results that expand the analysis to the other predictors. This analysis was done on the full dataset, as the above exploratory analyses did not reveal serious issues because of the data problems.
The full regression results reveal only 3 useful predictors: intelligence (β = 0.21), being male (β = 0.31), and voting for the PVV party (β = -0.58, p < .01). The first is not surprising, given the prior findings. Men have a stronger interest in economic matters and greater scientific knowledge, including economics (Caplan & Miller, 2010; Tran et al., 2014), so the second finding is not surprising. The last finding is odd, as there is no obvious reason why voting for this nationalist party should be related to stereotype accuracy of provinces, while voting for other nationalist parties is not, and there is no opposite effect for anti-nationalist parties. As before, the model had many variables of doubtful importance, and we used lasso regression and BMA to prune it. Results are given in Tables 18 and 19. The methods were again mostly in agreement. Both find intelligence to be the most important predictor, with BMA including it in 100 % of the best models. Male status was the next most important, being included in 97 % of the best models. Both approaches included voting for PVV to some degree. With BMA, the influence is uncertain: the beta SD is large, and it is only included in 36 % of the best models. BMA furthermore sporadically included other variables (e.g., time taken, 27 % of models), but not much can be made of this.

Domain general accuracy
In the last part of the individual-level analysis, we looked for evidence of a general factor of stereotype accuracy. Figure 11 shows the heatmap of correlations for all metrics across domains.
The results show that there are many associations between the metrics both within and between domains. However, most of the cross-domain correlations are weak. For instance, the correlations among the correlational accuracy metrics are only .12 to .23, i.e., there is little shared variance. The same pattern is seen when one looks at the MAE where the correlations range from .18 to .20. The same is seen when one looks at the more specific sources of error. If we look at dispersion errors, the correlations range from .05 to .23. In other words, there is only a very weak tendency for people who overestimate differences (i.e., have positive SD error) in one domain to do so also in other domains. For mean errors, there was no consistency across domains, with correlations ranging from -.08 to .23. In other words, people who overestimated (or underestimated) values in one domain had no tendency to do so in other domains. Looking at the cross-metric within-domain associations, we find correctly signed relations between correlational accuracy and MAE of -.18, -.92 and -.25. It is not obvious why the correlation is so much stronger for the occupations than for the two other domains. In similar fashion, the correlation between correlational accuracy and mean error absolute was also correctly signed but weak, at -.17, -.50, and -.24. Again, the outlier is for the occupations. In contrast, the associations with SD error abs. were unimpressive and centered around 0 (i.e., people who were more inaccurate about the differences between groups were not the same across domains).
Though the correlations are weak across domains, they are there. What do the factor loadings look like, if we postulate a general factor of stereotype accuracy? While one could use just the MAE metric, which is all inclusive (any deviation from truth), we decided to use both Pearson r and MAE alone, and in combination. Table 20 shows the results. Unexpectedly, the planned factor analysis produced poor results, in that one domain had extreme influence on the factor. This is because the two metrics of accuracy in the occupational domain are strangely strongly correlated (r = -.92). The resulting factor is thus mostly just the occupational accuracy score, which is not what was desired. Deviating from our planned analysis, we employed unit-weighted factor analysis (UWFA). This is a more robust alternative to the more common differential weights factor analysis. In this method, the indicators are given equal weights, and the resulting loadings are the correlations of the indicators with the resulting score. This method is better in some edge cases where factor loadings vary unexpectedly or sample sizes are too small for stable results (Figueredo et al., 2000; Gorsuch, 2015, sec. 12.2.2). In our case, we see that UWFA produced more even loadings, i.e., more even influences from the three domains. We furthermore scored each metric for its separate UWFA score. Table 21 shows the correlations among the factor scores and indicators. We see that the UWFA produced stronger correlations, while the EFA scores were essentially just a duplicate of the occupational scores. Thus, we used the UWFA scores for further analysis (this is a deviation from our pre-analysis plan). The resulting correlation with intelligence was impressive, r = .39, shown in Figure 12. The scatterplot has some outlying values, a result of the outliers in the components. The LOESS fit seems to indicate some nonlinearity, which is confirmed by a model comparison of a linear model vs. a natural spline model (p < .0001; adj. R²'s 15.4 % and 18.2 %). Table 22 shows the regression results.
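The unit-weighted scoring described above can be sketched as follows. This is an illustration on simulated data, not the code used for the study: the three indicator columns are hypothetical domain accuracy scores built to share a weak common factor, and the function names are ours. Indicators where lower is better (such as MAE) are assumed to have been sign-flipped beforehand.

```python
import random
from statistics import mean, pstdev

def zscores(values):
    """Standardize to mean 0 and SD 1 (population SD)."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def pearson_r(x, y):
    """Pearson correlation as the mean product of z-scores."""
    return mean(a * b for a, b in zip(zscores(x), zscores(y)))

def unit_weighted_scores(indicators):
    """Equal-weight composite: average the z-scored indicators per subject."""
    z = [zscores(col) for col in indicators]
    n_subjects = len(indicators[0])
    return [mean(col[i] for col in z) for i in range(n_subjects)]

# Simulated data: 500 subjects, three domain scores sharing a weak common factor
rng = random.Random(1)
common = [rng.gauss(0, 1) for _ in range(500)]
domains = [[0.45 * c + rng.gauss(0, 1) for c in common] for _ in range(3)]

uwfa_scores = unit_weighted_scores(domains)
# In UWFA the "loadings" are the correlations of each indicator with the composite
loadings = [pearson_r(col, uwfa_scores) for col in domains]
```

Because every indicator gets the same weight, no single domain can dominate the composite, which is the property that motivated the switch away from the planned factor analysis.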
Overall the model is sparse. Intelligence (β = 0.37) and being a non-Dutch native speaker (β = -0.40) are the most important variables, but others also cross the p = .01 barrier: verbal tilt (β = 0.12), and age (β = -0.15). The fact that both of these cross and are positively correlated and yet have opposite betas is surprising. These variables mostly had same direction betas in the prior analyses. This pattern indicates suppression effects (i.e., where direct and indirect effects differ, presumably of age). As before, we attempt to simplify this model with lasso regression and BMA, the results are shown in Tables 23 and 24.
Surprisingly, lasso regression found that most (17) predictors were needed, while BMA was more parsimonious in the conclusions. Only intelligence was included in all the best models, while some others were also in the majority of models (education 68 % of models, non-Dutch first language, 55 %), as well as some with more sporadic appearance (e.g. PVV voting, 20 % of models).

Data source effects
Though we did not plan to collect data from two different pollsters, the fact that we did so lets us examine whether they produced different results with regards to stereotype accuracy. We saw earlier that Prolific subjects were younger, more left-wing, more likely to be students, and smarter, so it is possible they also differ on stereotype accuracy, even beyond the effects of the measured variables. Here we formally test this by adding a source dummy to the regression models. Table 25 shows the results. We find evidence of source effects for 3 of the 7 models tested, and most importantly, for the final model with general stereotype accuracy. Surprisingly, the effect is seen for two different outcomes, but not on the same metric: occupational sex differences with Pearson r, and crime rates with MAE. The effect size on the general stereotype accuracy score is quite large at β = -0.38, with stronger accuracy seen for Prolific users. It is not clear why this is the case, as we statistically controlled here for many of the things that differ between the survey sources (as mentioned in Table 2).

Aggregate level results
Having seen the complexity of the individual level results, we are now ready to examine the aggregated results. When aggregating results, opposite-direction errors cancel out, which usually results in a much stronger signal, and in the case of stereotypes, much higher accuracy. However, this is not necessarily the case, but depends crucially on the structure of the errors in estimation. Insofar as these go in the same direction, it can lead to large overall errors in stereotypes (Surowiecki, 2004). 8

Immigrants and crime rates
As before, we begin with the primary domain of immigrant crime rates. There are many ways to aggregate estimates to a single set of values by taking into account the prior history of and the correlations among estimators (Atanasov et al., 2016; Lyon & Pacuit, 2013; Navajas et al., 2018). Surprisingly, the simplest is among the best: take the arithmetic mean. Table 26 shows the stereotype accuracy metrics across 4 aggregation methods.
Here we find that the trimmed mean (at 10 %) and the untrimmed mean do about the same, and both do better than the median (which is the same as the 50 % trimmed mean). Overall, however, the correlational accuracy is substantial with a Pearson r of .65 and rank (Spearman) r of .70. All methods substantially underestimate the true variability between groups (the negative values in SD error), estimating the SD in relative crime rates to be about 50 % of its true value. The estimation of the mean value is, however, quite accurate, being slightly too high using the untrimmed mean and slightly too low using the 10 % trimmed (0.15 vs. -0.17), whereas the median fares very poorly with a substantial underestimation (-0.41). For simplicity's sake, however, we will be using the untrimmed mean for further analysis (stated in our pre-analysis plan). Figure 13 shows the scatterplot between the average estimates and the criterion values.
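The three simple aggregation rules compared above can be sketched as below. The estimates are made up for illustration, and `trimmed_mean` is our own helper. It assumes that the trimming proportion is removed from each tail, which is consistent with the remark that the 50 % trimmed mean equals the median.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the observations from each tail.

    Assumption: trimming is per tail, so a proportion of 0.5 leaves nothing
    and falls back to the median.
    """
    xs = sorted(values)
    k = int(len(xs) * proportion)
    if 2 * k >= len(xs):
        return median(xs)
    return mean(xs[k:len(xs) - k])

# Hypothetical relative crime rate estimates for one origin group (1 = same as natives)
estimates = [0.5, 1, 1, 1, 1.5, 2, 2, 3, 4, 10]

agg_mean = mean(estimates)                   # 2.6, pulled up by the long right tail
agg_trimmed = trimmed_mean(estimates, 0.10)  # 1.9375, drops the 0.5 and the 10
agg_median = median(estimates)               # 1.75
```

With right-skewed estimates like these, the ordering mean > trimmed mean > median mirrors the pattern in the mean errors reported above, where the median yields the lowest values.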
8 [...] known facts), independence (people's opinions are not determined by the opinions of those around them), decentralization (people are able to specialize and draw on local knowledge), and aggregation (some mechanism exists for turning private judgments into a collective decision). If a group satisfies those conditions, its judgment is likely to be accurate. Why? At heart, the answer rests on a mathematical truism. If you ask a large enough group of diverse, independent people to make a prediction or estimate a probability, and then average those estimates, the errors each of them makes in coming up with an answer will cancel themselves out. Each person's guess, you might say, has two components: information and error. Subtract the error, and you're left with the information."
One striking finding is that the lowest crime rate estimate was for the Dutch, but about 35 % of countries actually have a crime rate below that of Netherlands natives. In this sense, the crime rates of these countries are all overestimated. At the same time, however, the crime rates of the high-crime origins are underestimated, sometimes substantially so. The Netherlands Antilles (a Caribbean former colonial possession) has an actual rate of 4.2, while the estimate is only 2.3. If we inspect the distribution of estimates for some countries, we get an idea of why this may be so, shown in Figure 14.
For each country, the most chosen value is 1, i.e. the estimate is that immigrants from that country are the same as Dutch natives in crime rate. However, the countries differ in the length of the right tail, thus producing the differences in means. The problem here is that values above 1 have a greater influence than values below 1, even though they are both, in a sense, equally distant from the value of 1. To see this, imagine if we inverted the scale to be the number of times less criminal than Dutch natives (i.e., we took the reciprocal, 1/x). A low crime origin might then be assigned a value of 3 (commits crimes at 1/3 the rate of Dutch natives), and so on. There is a way to avoid this positive bias inherent in the scale, namely to convert the numbers to log scale, take the average, and convert back. This is because, on the log scale, 1/3 and 3 are equally distant from 1 (their natural logs are -1.1 and 1.1, while the log of 1 is 0), so the average of two estimates that describe a group as 1/3 and 3 times as criminal as the natives is 1 (same rate as natives). Figure 15 shows the results when the mean is taken of these log-converted values (log mean method in Table 26).
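The log mean method can be sketched in a few lines. The function name is ours, and the approach requires strictly positive estimates (a value of 0 has no logarithm), which is an assumption about how such responses would need to be handled.

```python
import math
from statistics import mean

def log_mean(ratios):
    """Average on the log scale and convert back (the geometric mean).

    Assumes all ratios are strictly positive.
    """
    return math.exp(mean(math.log(r) for r in ratios))

# Two raters: one says 1/3 the native rate, the other says 3 times the native rate
pair = [1 / 3, 3]

arithmetic = mean(pair)   # about 1.67, dominated by the value above 1
geometric = log_mean(pair)  # 1.0 up to floating point error, the two cancel out
```

The arithmetic mean treats 3 as much further from 1 than 1/3, whereas the log mean treats them symmetrically, which is why it lowers the estimates for all groups with long right tails.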
Thus, we see that this approach helps with the below 1 values, but also reduces the estimates for the high crime groups. In fact, if we look back at Table 26, this approach produces worse overall results in terms of even more severely underestimating the real differences (by about 60 %) and the overall mean is also much too low. The  result is that the linear fit (orange) deviates further from the perfect calibration fit (stippled line) than before. Beating the simple mean isn't easy. Turning to the question of Muslim bias in the ratings, Figure 16 shows the scatterplot of Muslim % and the estimation error.
In line with the individual level results, we see a tendency to underestimate the crime rates of the more Muslim groups. There are a number of interesting outliers in the bottom left, countries that had strong underestimation of values, yet do not have many Muslims. This weak pattern in the aggregate data is similar to prior studies with Danish data (Kirkegaard & Bjerrekaer, 2016).
A reviewer suggested looking for African bias after examining the scatterplot above. To do this, we used data from the World Migration Matrix (version 1.1; https://sites.google.com/brown.edu/louis-putterman/world-migration-matrix-1500-2000) from economist Louis Putterman. This matrix "gives, for each of 165 countries, an estimate of the proportion of the ancestors in 1500 of that country's population today that were living within what are now the borders of that and each of the other countries." Using these, we coded countries as Sub-Saharan African or not following the United Nations' coding (https://en.wikipedia.org/wiki/Sub-Saharan_Africa), which includes Sudan and Somalia. For Czechoslovakia, we used the means of the origin countries (both 0), and for the Soviet Union, we used Russia (also 0). Figure 17 shows the results. The results confirm a large bias in favor of African nations, such that their crime rates are underestimated. This effect size is much larger than for Muslims. To test for incremental validity, we fit a regression model with both predictors. Both predictors attained reasonably large betas: -0.38 for Muslim % (p = .08) and -0.90 for Sub-Saharan African (p = .0004). Since this analysis was not planned, and the p value for Muslim % is weak, this question will have to be explored in future research.
In terms of immigration opinions, a prior study measured the preferences for the same origin groups in a sample of 200 people living in the Netherlands (Kirkegaard & de Kuijper, 2020), partially overlapping with the present sample. Figure 18 shows the results. Prior research has found that measured stereotypes mediated the link between real crime rates and immigration preferences, a hypothesis suggested by Carl (2016). In other words, people are more opposed to immigration from high crime origins, and their preferences are in line with what one would expect based on their actual beliefs. To test this model further, we carried out the same mediation tests in the present dataset as done previously for a Danish dataset. Table 27 shows the correlation matrix for the variables in question. Mediation analysis was done using the mediation package for R. Results showed that an estimated 84 % of the effect of actual crime rates on net opposition was mediated by the stereotypes, and if Muslim % was included as a covariate, this value was 85 % (both mediation p's < .0001). As such, the prior findings are strongly confirmed (prior mediation % was about 100 %). Next, we fit the path model with Muslim % as an independent predictor of net opposition. Results are shown in Figure 19. The path model shows what the mediation analysis finds, that the crime rate itself does not have much direct validity for net opposition; its effect is through the stereotypes. In contrast to the prior study, we find a notable effect of Muslim % on net opposition, whereas the prior study found this path to be p > .05 and with a beta of 0.03. Thus, in the Dutch data, we see that the public is against Muslim immigration beyond the effect of the above-average crime rate of the more Muslim groups (r = .43).
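The idea behind the proportion mediated can be illustrated with a simple product-of-coefficients calculation using ordinary least squares. This is a swapped-in simplification for illustration only: the study used the mediation package for R, which also provides inference, and the data below are simulated for 68 hypothetical origin groups rather than taken from the study. All names in the code are ours.

```python
import random
from statistics import mean

def centered(values):
    m = mean(values)
    return [v - m for v in values]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mediation_share(x, m, y):
    """Point estimate of the share of the x -> y effect that runs through m."""
    x, m, y = centered(x), centered(m), centered(y)
    sxx, smm, sxm = dot(x, x), dot(m, m), dot(x, m)
    sxy, smy = dot(x, y), dot(m, y)
    total = sxy / sxx                          # slope of y ~ x
    a = sxm / sxx                              # slope of m ~ x
    det = sxx * smm - sxm ** 2
    b = (sxx * smy - sxm * sxy) / det          # slope of m in y ~ x + m
    direct = (smm * sxy - sxm * smy) / det     # slope of x in y ~ x + m
    return {"total": total, "direct": direct,
            "indirect": a * b, "share": a * b / total}

# Simulated group-level data (hypothetical effect sizes)
rng = random.Random(42)
crime = [rng.gauss(0, 1) for _ in range(68)]
stereotype = [0.6 * c + rng.gauss(0, 0.5) for c in crime]
opposition = [0.7 * s + 0.1 * c + rng.gauss(0, 0.5)
              for s, c in zip(stereotype, crime)]

result = mediation_share(crime, stereotype, opposition)
```

With OLS the total effect decomposes exactly into the direct effect plus the indirect effect, so the share mediated is simply the indirect effect divided by the total effect. A share close to 1 corresponds to the case where the crime rate has almost no direct path to net opposition.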
The stereotypes and the immigrant preference data were collected from a partially overlapping subset of Prolific subjects. Though the data were collected months apart, this overlap might nonetheless bias results upwards due to the common method variance factor of being collected from the same subjects (Chang et al., 2010). If subjects who provided both data realized the link between them, which is a main hypothesis of this study, then they potentially made their responses more consistent with each other than if asked independently. The prior study on Danish data looked for evidence of this and found none: the correlations between preferences and stereotypes were the same whether they were aggregated from the same subjects or not. In this study we took a further step, since we had collected data from two different pollsters, so we were able to compute two sets of stereotypes. These correlated r = .97. Rerunning the mediation analyses with only the stereotype data from Survee did not produce any notable changes.

Occupations and sex
For the sex differences in occupation, we scored the estimates in the same way as before, except that we left out the log-conversion approach. Figure 20 shows the scatterplot of the mean estimates and the true values, while Table 28 gives the summary statistics.  The data shows near perfect accuracy in correlational terms, each method producing r = .94. Despite this accuracy, the MAE is not near 0, but is in fact around 13, meaning that the estimate is on average 13 % points off the mark. The reason for this divergence is that the estimates are not extreme enough, suffering from a large negative SD bias of 12.6 % points, or 41 % in relative terms (12.62/30.5). In other words, the estimates substantially underestimate true variability between occupations. This aggregated error is larger than the individual-level one, where the median implied SD estimate was about 26 % too small. The stereotypes did not differ by sex, as shown in Figure 21.
The stereotype accuracy metrics of the two sets of estimates are essentially identical (e.g., Pearson r accuracies were .94 for both, SD error was -12.6 for both), showing that stereotypes do not differ by sex. In the same vein, these two sets of estimates correlated at r = 1.00, regression slope = 1.00 (intercept -1.2, p > .05), so there was no statistical difference between them in any way. Men and women hold on average exactly the same stereotypes of the sex distribution of occupations.

Provincial incomes
Finally, we turn to the provincial incomes. As before, we scored these using 3 aggregation approaches, results shown in Table 29, and Figure 22 shows the scatterplot of the true values and the mean estimates.
The overall correlational accuracy is very high, Pearson r = .85 for the mean estimates. The 10 % trimmed estimates were slightly more accurate across the metrics. In contrast to the prior two sections, the provincial estimates were actually more dispersed than reality, giving a positive SD error of about 900, or 58 % too large. There was also a small positive mean error, i.e., estimates were too high on average by about 1600, or about 5 %.
This lengthy study had many findings of interest. First, we found strong accuracies overall. We find this across different measures of accuracy, across data sources, and across domains. This shows that stereotype accuracy is a strong, replicable and general phenomenon. This is furthermore in line with the large majority of other stereotype accuracy studies covered in reviews by Lee Jussim and colleagues across many years (Jussim, 2012; Jussim et al., 2009, 2018a; Lee et al., 1995).
Second, we find that there are notable Muslim related biases in stereotypes. Going against popular narratives, we confirm prior findings, namely that biases are in favor of Muslim groups, since the biases are towards underestimating the crime proneness of these groups (Kirkegaard & Bjerrekaer, 2016).
Third, we found that stereotypes statistically mediate attitudes expressed towards immigration from the same groups in a rational way: groups with higher actual crime rates have stereotypes with higher crime rates, and face more opposition. Insofar as the public are trying to avoid increased crime in their countries, they appear to express crime-minimizing preferences. One British survey found that crime is the most important criterion that the public uses when evaluating which countries should be allowed to send immigrants (Carl, 2016).
Fourth, we found very clear evidence that stereotype accuracy metrics were predictable by intelligence, and this was also the case when we statistically controlled for a large list of confounders including education level, age, sex, voting behavior and intentions, student status, employment status, and country of origin. The direct effect of intelligence was usually only somewhat smaller in the full regression model compared to the correlation, and thus the effect of intelligence was not mediated by any of these variables to a notable degree. This seems somewhat surprising. Variable selection methods, in our case lasso regression and Bayesian model averaging, also found that intelligence was always a useful predictor.
Fifth, accuracy of stereotypes was correlated across domains and across metrics, but surprisingly, not strongly so (correlations about r = .20, akin to typical item-correlations in an intelligence test). It was nonetheless possible to speak of a general factor of accuracy. Once extracted, this general factor showed a stronger relationship with intelligence (close to .40, without adjustment for reliability issues), as expected from psychometric theory. In a hierarchical or bifactor model of intelligence, one may posit various group factors, which are other broad abilities that are not general intelligence, but which contribute to the variation of performance on some subset of tests. Since stereotypes are essentially just a type of knowledge of demographics and regional statistics, this is a kind of general knowledge, and thus may be subsumed under the previously identified knowledge factor (Carroll, 1993; Jensen, 1998; McGrew, 2009). Social psychology offers a number of older studies that also support such general factors, and a relationship to own ability (Landy & Farr, 1980):
"Several studies have found that the performance level of the rater affects the nature of the ratings assigned to others by that rater. D. E. Schneider and Bayroff (1953) and Bayroff, Haggerty, and Rundquist (1954) reported that peers who received high aptitude test scores and were rated positively during training gave ratings of their fellow trainees that were more valid in predicting subsequent job performance. Mandell (1956) found no difference in central tendency between good and poor job performers but did find that those raters who were poor performers tended to disagree more with consensus ratings of subordinates than did the more favorable performers. Kirchner and Reisberg (1962) found that the ratings given to subordinates by supervisors high in job performance were characterized by greater range, less central tendency, and by more emphasis being placed on the independent action of subordinates as the basis for ratings. In a related study Mullins and Force (1962) obtained evidence for a generalized ability to rate others accurately. Peer raters who were more accurate in judging one skill of their co-workers also were accurate in judging another performance dimension. (Accuracy was assessed by comparing the ratings with scores on pencil-and-paper tests.)"
Sixth, overall, stereotype accuracy was not well-predicted despite a large collection of potentially relevant predictors. Typically, we were able to account for about 10 % of the variance. One major part of this puzzle may be the unknown reliability of stereotypes. As far as we know, no prior study of stereotypes computed a test-retest correlation, so it is unknown whether individual-level stereotypes are simply unreliable, and that is why they are hard to predict well. If that is the case, this would mean the current findings about the role of intelligence are actually greatly underestimated, since adjusting for unreliability of the dependent variable would result in large increases in the betas of all reliable predictor variables. We suggest further research should clarify this question.

A Attention checks
We placed 4 attention check questions throughout the survey. These asked the subject to select a given point using a slider. Attention was then scored simply as whether the subject had picked the right number or not. However, when we were collecting data from Survee, the pollster alerted us to the possibility that many subjects were failing by small amounts. The slider had a range of 100, and it was difficult to hit the exact number using mobile devices for some subjects. This issue was mostly evident for older subjects who are generally less technically competent. To get around this, we improvised a new attention check scoring where we computed the total deviance score per subject, defined as the sum of how much their selected values on the sliders deviated from the values the attention checks asked for. Using this approach, a small deviance would not indicate lack of attention but rather trouble hitting the right value. A person who was not paying attention would obtain a very high deviance score. Figure A.1 shows the validation of this approach using the total survey time use as indicator of inattentive responding. Figure A.1 shows that most people obtained deviance scores of 0. Those that did not were mostly persons who finished the survey quickly. By agreement with Survee, we used the threshold of deviance score 20 to delineate between attentive and inattentive subjects. This resulted in acceptance rates of 97.6 % and 77.2 % for Prolific and Survee subjects, respectively. Our planned method was more strict and resulted in rates of 91.9 % and 71.6 %, respectively. This method choice did not affect our results notably.
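The deviance scoring can be sketched as follows. The target values and responses are hypothetical, and whether a score of exactly 20 counts as attentive is not stated above, so the inclusive comparison is an assumption.

```python
def deviance_score(responses, targets):
    """Sum of absolute deviations between slider responses and requested values."""
    return sum(abs(r - t) for r, t in zip(responses, targets))

def is_attentive(responses, targets, threshold=20):
    """Assumption: scores up to and including the threshold count as attentive."""
    return deviance_score(responses, targets) <= threshold

# Hypothetical requested slider values for the 4 attention checks
targets = [25, 60, 80, 10]

exact = [25, 60, 80, 10]    # strict scoring: pass, deviance 0
clumsy = [26, 58, 80, 11]   # strict scoring: fail, deviance 1 + 2 + 0 + 1 = 4
careless = [50, 50, 50, 50] # deviance 25 + 10 + 30 + 40 = 105
```

Under the strict planned rule only the first subject passes, whereas the deviance rule also accepts the second subject, who was evidently paying attention but had trouble hitting the exact values.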

B Intelligence example items
The first vocabulary item is shown in Figure  Both screenshots are from the exported survey file (Questionnaire Prolific.pdf).

C Main correlation matrix
The main correlation matrix is divided into two parts, shown in Table C.1 and Table C.2. The two sets of estimates were nearly perfectly correlated (r = .96); however, they differ in scale. We scored the accuracies for the two sets, shown in Table E.1.

F Accuracy metrics in simulated data
To get a better intuitive understanding of the accuracy metrics, it can be useful to simulate data under various conditions and examine the metrics and their interrelations. For this purpose, we simulated four datasets under the following conditions:
1. Random normal errors, mean=0, sd=1
2. Half signal + random normal errors, mean=0, sd=1
3. Half signal + random normal errors, mean=0, sd=varying between 0 and 2 per uniform distribution
4. Half signal + random normal errors, mean=varying per normal distribution, sd=varying between 0 and 2 per uniform distribution
Each simulation was based on n=1000 raters each rating 20 groups. The criterion data were a vector of random normal values (drawn with mean=0, sd=1), which had a realized mean=0.19 and sd=0.91. Thus, in the first case, all the data are completely random, and there is no signal even if data are aggregated. Any relations between accuracy metrics thus come only from their relatedness and coincidence. The simplest way to understand the data concisely is to inspect the pairwise scatter and distribution plots, shown in Figure F.1 below. These are made using the GGally package.
With regards to the distributions, all the non-absolute values are approximately normally distributed, as reflected in their skew and kurtosis near 0. The variants with absolute values of course cannot have negative values, and thus they follow something close to a half normal distribution (which can be seen also in their skew and kurtosis). The plot also contains the (Pearson) correlations between each variable in the upper triangle.
Thus we see that Pearson and Spearman (rank) correlational accuracy are highly correlated (r = .92) as might be expected. These are also strongly negatively related to the MAD (mean absolute error), which is not surprising since having estimates further from their true values on average means the correlation will also be lower. The relations to the mean and SD errors are approximately null, as these scale differences do not affect correlations. It is also worth noting that the measures based on means and SDs have some curious relations. The two pairs, mean + mean error and SD + SD error, show clean relationships; in fact, these are correlated at 1.00 since mean error = estimate mean - true mean (same for SD). The absolute versions show the characteristic V-shaped pattern reflecting the fact that over- and underestimating the mean/SD by 1 or -1 is the same amount of error in absolute terms. As there was no signal at all, the aggregated estimates do not show more favorable statistics either, having a correlation accuracy near 0 as well. It is worth noting that the SD of the aggregated estimates is much smaller (0.03), since the random errors cancel out leaving only the signal, of which there is none.
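For readers who want to reproduce the logic of simulation 1, the following is a minimal Python sketch rather than the code used for the figures. The metric names, the seed, and the use of the sample SD are our assumptions, and the Spearman variant is omitted for brevity.

```python
import random
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def accuracy_metrics(estimates, criterion):
    """Accuracy metrics for one rater (metric names are illustrative)."""
    mean_error = mean(estimates) - mean(criterion)
    sd_error = stdev(estimates) - stdev(criterion)
    return {
        "pearson_r": pearson(estimates, criterion),
        "mad": mean(abs(e - c) for e, c in zip(estimates, criterion)),
        "mean_error": mean_error,
        "mean_error_abs": abs(mean_error),
        "sd_error": sd_error,
        "sd_error_abs": abs(sd_error),
    }


rng = random.Random(1)  # arbitrary seed
criterion = [rng.gauss(0, 1) for _ in range(20)]

# Simulation 1: estimates are pure noise, unrelated to the criterion.
raters = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(1000)]
metrics = [accuracy_metrics(r, criterion) for r in raters]

mean_r = mean(m["pearson_r"] for m in metrics)
# The estimate mean and the mean error differ only by a constant,
# so they correlate perfectly across raters.
r_mean_pair = pearson([mean(r) for r in raters],
                      [m["mean_error"] for m in metrics])
# Aggregating across raters cancels the noise, leaving almost no spread.
aggregate = [mean(col) for col in zip(*raters)]
print(round(mean_r, 2), round(r_mean_pair, 2), round(stdev(aggregate), 2))
```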
In simulation 2, we add some true signal. Specifically, we add the criterion values times 0.5 (i.e., half signal) to the estimates along with the same random errors from before. Results are shown in the accompanying figure. The new part about simulation 2 is mainly that we now see some average level of accuracy present, e.g. mean Pearson r is now .41 compared to .00 before. We now also see some paradoxical findings. For instance, correlational accuracy is now positively related to SD error abs, though one would expect the relation to be negative. The aggregate estimates are now essentially perfectly accurate in terms of correlations (.99 to 1.00); however, they still suffer from downwards SD error. In fact, the SD of the aggregate estimates is half as large as it should be, and that is of course because we only used half the signal strength in this simulation. The errors are otherwise random, so they canceled out, leaving only the half-signal SD remaining.
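The behavior of the aggregate in simulation 2 can be checked with a short Python sketch. This is an illustration under our own assumptions (seed, sample SD), not the original simulation code, so the exact values will differ from those reported above.

```python
import random
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(2)  # arbitrary seed
n_raters, n_groups = 1000, 20
criterion = [rng.gauss(0, 1) for _ in range(n_groups)]

# Simulation 2: each estimate is half the criterion value plus N(0, 1) noise.
estimates = [[0.5 * c + rng.gauss(0, 1) for c in criterion]
             for _ in range(n_raters)]

# Individual-level accuracy is modest...
mean_individual_r = mean(pearson(e, criterion) for e in estimates)

# ...but averaging across raters cancels the noise and leaves the half signal.
aggregate = [mean(col) for col in zip(*estimates)]
aggregate_r = pearson(aggregate, criterion)
sd_ratio = stdev(aggregate) / stdev(criterion)

print(round(mean_individual_r, 2), round(aggregate_r, 2), round(sd_ratio, 2))
```

The SD ratio comes out near 0.5 because the averaged estimates retain only the half-strength signal, even though their rank ordering of the groups is almost perfect.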
In simulation 3, we add further realism and allow for individual variation in accuracy. This is done by varying the strength of the random error from SD=0 to 2 (uniform). Thus, individuals who have higher SDs have less relative signal in their values. The results are shown in the accompanying figure. The addition of real variation in estimating skill across subjects made a large difference in that now there are appropriate negative correlations between the correlational metrics and SD error abs. Like before, the aggregate estimates are nearly perfect in terms of correlations, but suffer from the same SD error as before. The fact that some subjects have a larger error spread than others does not alter the fact that these cancel out across subjects.
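A minimal Python sketch of simulation 3 follows, again as an illustration with an arbitrary seed rather than the original code. It checks that raters with larger error SDs obtain lower correlational accuracy, and it computes the relation between correlational accuracy and absolute SD error for inspection.

```python
import random
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(3)  # arbitrary seed
n_raters, n_groups = 1000, 20
criterion = [rng.gauss(0, 1) for _ in range(n_groups)]

# Simulation 3: each rater has their own error SD, drawn uniformly from [0, 2].
noise_sds = [rng.uniform(0, 2) for _ in range(n_raters)]
estimates = [[0.5 * c + rng.gauss(0, sd) for c in criterion]
             for sd in noise_sds]

rater_r = [pearson(e, criterion) for e in estimates]
sd_error_abs = [abs(stdev(e) - stdev(criterion)) for e in estimates]

# Noisier raters have less relative signal, hence lower correlational accuracy.
r_noise_accuracy = pearson(noise_sds, rater_r)
# Relation between correlational accuracy and absolute SD error.
r_accuracy_sd_error = pearson(rater_r, sd_error_abs)

aggregate = [mean(col) for col in zip(*estimates)]
aggregate_r = pearson(aggregate, criterion)

print(round(r_noise_accuracy, 2), round(r_accuracy_sd_error, 2),
      round(aggregate_r, 2))
```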
Finally, in simulation 4, we add elevation errors to subjects, so that they vary both in their ability to get the elevation right and in their ability to get the dispersion right. The situation is now approaching reality. The results are shown in the accompanying figure. We now see the V curve for the mean variables as well, showing the variation in mean estimate across persons and how this relates to the mean of the criterion values. Despite this added realism, the aggregate estimate is still perfect in terms of correlational accuracy. In practice, this never happens in real datasets because real estimates do not simply consist of the criterion values and random errors. The real-life errors are systematic, varied, and probably interrelated. People do not have a source of perfect knowledge about group differences in most cases, but rely on various proxies (shortcuts). For instance, it has previously been found that when people are asked to estimate immigrant groups' economic contributions, they seem to rely upon knowledge of the origin countries' wealth in terms of GDP per capita. The evidence for this comes from correlated errors between the estimates people produce and those produced by predicting from GDP per capita. See Kirkegaard & Bjerrekaer (2016) for details.

G Case representativeness method
To pick a representative (central, typical) case, we devised a simple method. In this method, the variables are first standardized, then the central tendency is subtracted (if it's not the mean), absolute values are taken, and finally a mean is taken. Thus, the value is how far on average the case differs from the central tendency across all the variables. By standardizing the variables, they are given equal weight, and by taking the absolute value, the variables are not allowed to offset each other (otherwise, negative distance to central tendency on one variable would cancel out with positive distance on another). To illustrate the method, we computed the principal components on the mpg (car) dataset in R and plotted the cases. As expected, the two most central cases (equally central) are roughly in the middle of the plot.
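The method can be sketched in a few lines of Python. This is an illustration on hypothetical toy data, not the R implementation used for the figure; the function name and the use of the sample SD for standardization are our assumptions.

```python
from statistics import mean, median, stdev


def representativeness(rows, center=mean):
    """Mean absolute standardized distance of each case from the center.
    Lower values indicate more representative (central) cases.
    rows: one tuple of variable values per case.
    center: central tendency function, e.g. mean or median."""
    distance_cols = []
    for col in zip(*rows):
        m, s = mean(col), stdev(col)
        z = [(v - m) / s for v in col]      # standardize the variable
        c = center(z)                       # ~0 when center is the mean
        distance_cols.append([abs(v - c) for v in z])
    # Average the absolute distances across variables for each case.
    return [mean(case) for case in zip(*distance_cols)]


# Hypothetical toy data: 5 cases measured on 2 variables.
rows = [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]
scores = representativeness(rows)
most_central = scores.index(min(scores))
scores_median = representativeness(rows, center=median)
print(most_central, [round(s, 2) for s in scores])
```

In this toy example the middle case lies exactly at the mean of both variables, so its score is 0 and it is selected as the most representative case.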