Niche diversity effects on personality measurement – evidence from large national samples during the COVID-19 pandemic

We report systematic variability in the psychometric properties of a brief personality inventory during the early stages of the COVID-19 pandemic. Drawing upon recent discussions about the universality vs cultural relativism of personality measures, we review and comparatively test theories predicting systematic variability in personality measurement across cultures using an established brief personality measure applied to population samples in 16 nations during the first wave of the COVID-19 pandemic (N = 35,052). We found systematic variation in factor replicability and effective dimensionality. In line with previous theorizing, factors replicated better in contexts with greater niche diversity. Examining possible drivers underlying this association, the investigation of the individual components in the niche construction index suggested that life expectancy and to a lesser degree economic complexity are associated with greater personality structure differentiation. Population-level indicators of acute threat due to COVID-19 did not show credible effects. These patterns suggest that a) investigation of personality structure in population samples can provide useful insights into personality dynamics, b) socioecological factors have a systematic impact on survey responses, but c) we also need better theorizing and research about both personality and culture to understand how niche construction dynamics operate.

A consensus has emerged among personality researchers that broad-based personality can best be summarized in terms of five or six basic dimensions: extraversion, neuroticism (or emotional stability), openness to experience, conscientiousness, agreeableness and a possible sixth factor called honesty-humility (Ashton and Lee, 2007; De Raad et al., 2010; Goldberg, 1990; McCrae and Costa, 1997). This theoretical structure has emerged over decades of research and, more recently, the Big Five framework has been shown to provide an overarching framework for individual difference constructs in psychology (Bainbridge et al., 2022) and to show moderate to high heritability (Sanchez-Roige et al., 2018; Segal and Hur, 2022; Vukasović and Bratko, 2015). The basic structure of personality can be captured with brief measures and even single-item measures (Konstabel et al., 2017; Rammstedt and John, 2007; Soto and John, 2017; Spörrle and Bekk, 2014). This impressive progress in measuring personality has been accompanied by an increasing number of studies reporting problems in replicating factor structures in both full-length and brief measures when they are tested in more culturally and educationally diverse samples (Fischer, 2017, 2021b; Garrashi et al., 2023; Laajaj et al., 2019; Ludeke and Larsen, 2017; Singh et al., 2013). These replication problems may reflect measurement issues that need to be addressed but do not require changes to personality theorizing. For example, translation problems at the item level may cancel out at the facet or factor level (Church et al., 2011), which may always require longer scales. Alternatively, a number of theories (e.g., Fischer, 2017; Lukaszewski et al., 2017) have been proposed which suggest that contextual variables influence the replicability of personality structure and thereby the theoretical organization of personality, independent of measurement. In this view, the cognitive representation of personality is systematically related to institutional and ecological factors.
For example, Fischer (2017) re-analyzed validity data from a number of large-scale individual differences studies (McCrae et al., 2005; Schmitt et al., 2007; Thalmayer and Saucier, 2014) and found that factor structures and reliabilities tended to improve in more economically developed nations. Drawing upon sociological and institutional research derived from postmodernization theory in sociology (Inglehart, 2018; van de Vijver and Poortinga, 2002; Welzel, 2014), these patterns were interpreted in evolutionary terms as resource vs threat trade-offs (Inglehart, 2018; Thornhill and Fincher, 2014): individuals are more likely to fully express their behavioral preferences if they have sufficient resources and are less likely to face existential threats that constrain behavioral expression.
A corollary of this threat hypothesis is that acute stress at a population level may have an impact on responses to personality questionnaires. Personality assessment is rarely spontaneous, but rather requires complex cognitive processes, including comprehension of the questions posed, retrieval of relevant information, reflection and integration of information into a judgement or estimate, and then mapping this judgement or subjective estimate onto the presented response options (Fischer, 2017; Tourangeau, 2018). The psychometric properties that are taken as evidence for the validity of personality measurement are the outcome of cognitive processes that have been shown to be sensitive to stress effects. A large body of literature has shown that under conditions of stress a) working memory and retrieval of relevant information are impaired, and individuals on average b) are less able to control interference from distractors, c) have reduced cognitive flexibility to reflect on and integrate information, d) make suboptimal decisions and e) are less able to make goal-directed decisions (McEwen and Sapolsky, 1995; Moran, 2016; Sandi, 2013; Shields, 2020). Furthermore, a number of studies have reported that cognitive resources systematically influence personality structure (Bowler et al., 2009, 2012). These dynamics are relevant within the context of the recent COVID-19 pandemic. Anxiety and stress during the early stages of the pandemic increased significantly and abruptly at population levels globally (Blendermann et al., 2023; Deng et al., 2021; Jin et al., 2021; Luo et al., 2020; Necho et al., 2021; Santomauro et al., 2021). Given what is known about the effects of acute stress on cognitive processes, the response processes involved in retrieving, integrating and mapping memories to response scales may have been affected during the pandemic.
Similarly, the increased restrictions during the early stage of the COVID-19 pandemic influenced behavioral goals available to individuals ( Daniel et al., 2021 ), which may have influenced how individuals rate their typical behavior. The changed behavioral context may also change how individuals responded to specific items and preliminary evidence suggests that some items have taken on new meaning within the context of the pandemic ( Sutin et al., 2020 ).
These different lines of evidence suggest that survey responses may have been affected by population-level increases in stress and anxiety during the pandemic. These effects constitute a possible proximal effect of acute stress on survey responses and factor structures during a specific event (see Sutin et al., 2020), whereas resource vs threat models often model long-term threats that involve extended exposure and therefore adaptive responses which are associated with changes in personality structure over time (Fischer, 2017, 2021a; Thornhill and Fincher, 2014; Van de Vliert, 2009, 2013). The pandemic provides an interesting opportunity to examine the effect of acute stress at the population level on personality structure. Given the previously reported impact of stress on human cognition and behavior, it is important to assess whether such acute society-level threats have a systematic effect on the structural properties of a widely used brief personality instrument in general population samples.
A second line of reasoning draws upon ideas from evolutionary psychology: Lukaszewski et al. (2017) proposed a niche diversity hypothesis to explain cross-cultural variations in personality-factor correlations. They argued that more complex societies offer distinct niches that provide differential affordances and cost-benefit structures to segments of a population, leading to a greater diversity of trait profiles and therefore weaker personality-factor intercorrelations. They computed an index that included economic data as well as education levels, urbanization rates and diversity of export goods to capture niche diversity. Initial studies with student data (Lukaszewski et al., 2017), an online study where users from a large number of countries answered surveys in a limited number of languages (Durkee et al., 2020), and a simulation exercise (Smaldino et al., 2019) have supported this hypothesis.
Still, these studies are limited by possible self-selection biases in largely non-representative samples (e.g., students, internet users on a personality portal) and by language effects due to the limited language options available (for language effects, see Harzing, 2006).
Building on the earlier work investigating threat-resource models, we argue that a proper test of the niche diversity hypothesis needs to first focus on the replicability of the factor structure rather than assuming that the factors are replicated but vary in their intercorrelation. Lasker et al. (2022) pointed out that a failure to properly test measurement invariance in previous studies leaves room for alternative methodological interpretations. Our view is that testing the replication of the factor structure actually provides conceptually more robust support for the argument that increased niche diversity provides differential cost-benefit ratios of behavior that, in the aggregate of a community and over time, translate into differential behavioral profiles that unfold into more or less differentiated personality structures. Therefore, testing measurement invariance and linking (lack of) invariance parameters to systematic population-level criteria provides a more appropriate test of the underlying theoretical idea of niche construction of personality.
At the same time, a comparative test of the niche construction predictions vs stress and resource models in the context of the COVID-19 pandemic offers a new opportunity to understand possible systematic effects on personality structure and measurement. First, the presence of an acute population-level stressor provides an opportunity to test to what extent a major stressor may influence personality responses at a structural level. Considering that the pandemic affected populations at different rates across time, we can examine the relative effects of an objectively measurable acute stressor on personality responses at a population level. Second, the niche construction and threat-resource models rely partially on the same indicators. Whereas resource models have often focused on national wealth (e.g., Gross Domestic Product per capita), niche construction theory takes a broader perspective and uses the Human Development Index as a base, an index which includes additional indicators that go beyond pure resource conditions. Specifically, the Human Development Index is a geometric mean of Gross National Income per capita, life expectancy at birth and the mean of the expected and average years of schooling within a nation (United Nations Development Program, 2022). In addition to this standard index in economics and development research, the niche construction index includes the percent of urban population in a nation and an index of economic complexity calculated from the diversity of exports a country produces and their ubiquity (e.g., specialization and diversity of the economic exports of a nation, which expresses available levels of know-how) (Hidalgo and Hausmann, 2009). These components capture different processes and dynamics of possible relevance for personality dynamics.
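The composition of the Human Development Index described above can be sketched in a few lines of Python. This is an illustration only and not part of the original analysis. It assumes that each component has already been rescaled to a 0-1 dimension index (the rescaling itself, including the log transform of income used by the UNDP, is not shown), and all function and argument names are hypothetical.

```python
def geometric_mean(values):
    """Geometric mean of strictly positive numbers."""
    product = 1.0
    for v in values:
        if v <= 0:
            raise ValueError("geometric mean requires positive values")
        product *= v
    return product ** (1.0 / len(values))


def education_index(expected_years_index, mean_years_index):
    """Arithmetic mean of the two schooling sub-indices (both on a 0-1 scale)."""
    return (expected_years_index + mean_years_index) / 2.0


def human_development_index(health_index, expected_years_index,
                            mean_years_index, income_index):
    """HDI as the geometric mean of health, education and income indices.

    Assumption: all inputs are already normalized dimension indices.
    """
    edu = education_index(expected_years_index, mean_years_index)
    return geometric_mean([health_index, edu, income_index])
```

Because the components are combined multiplicatively, a low value on one dimension cannot be fully offset by high values on the others, which is one reason to examine the components separately as well as jointly.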
National income (either GDP or GNI) is closely associated with economic complexity and these indicators capture the level of economic capital and prosperity of a nation, that is, available resources. Economic complexity has the conceptual advantage that it is linked to classic macro-economic concepts of division of labor and is an indirect measure of the capabilities of a country to grow further (Hidalgo and Hausmann, 2009). The education index within the HDI captures access to knowledge, whereas life expectancy provides a forecast of the living conditions and medical advances that individuals within a nation encounter. Therefore, these two indicators can clearly be seen as resources as well. Finally, urban areas have historically been the centres of innovation and socioeconomic complexity, bringing highly diverse groups with different skills and abilities into close contact and thereby providing opportunities for more diverse social and occupational niches (Lukaszewski et al., 2017). Hence, the niche construction hypothesis extends resource models in substantive ways, and a comparative test of this broader index vs economic indicators alone will provide first insights into whether any effects may be more plausibly explained by resource availability encapsulated by economically driven postmodernization processes (Inglehart, 1997, 2018) or via niche construction effects (Lukaszewski et al., 2017). Moreover, an examination of the individual components can provide avenues for further theory generation that may point towards the empirically most salient components (for diverse perspectives making the same point, see Mõttus et al., 2019; Seeboth and Mõttus, 2018; VanderWeele, 2022; VanderWeele and Vansteelandt, 2022).
These theoretical predictions have important implications for measurement. If we found good factor replication, then there is less need to worry about personality measurement in general population samples. In contrast, if we found poor factor replicability and we were unable to identify possible sources, most likely random measurement artefacts are at play and need to be considered in future studies (e.g., using longer scales, better translations, and adaptations). On the other hand, if we are able to pinpoint systematic variance in factor replicability, it adds conceptual nuance to previous arguments and highlights theoretical dynamics that affect measurement structures (and associated substantive interpretations).
In summary, we report validity information of a brief measure of personality in large national samples during the early phases of the pandemic. This provides important baseline information on the validity of personality data in more diverse samples and during an acute public health crisis. We then correlate indicators of average factor replication as well as a new measure of effective dimensionality (Del Giudice, 2021) with indicators of niche diversity (and their subcomponents), economic wealth, COVID-19 case numbers and deaths at the start of each data collection interval. To advance personality science, it is necessary to examine boundary conditions of personality measurement and examine whether cognitive representations of personality may systematically vary across contexts and which contextual factors need to be considered for further research.

Method
The data is available from the Austrian Social Science Data Archive ( Aschauer et al., 2021 ) and first wave data was downloaded on June 2, 2021. The objective of the project was to monitor social values and attitudes during the COVID-19 pandemic. Online panel surveys using quota sampling to approximate the national population were implemented in each country by the country coordinator. Target sample size was 2000 responses in each context. In total, 17 countries contributed data: seven countries in Europe (Germany, Austria, UK, Sweden, Poland, Italy, Greece), two Latin American countries (Brazil, Colombia), five countries from Asia (Japan, China, Hong Kong, South Korea, the Maldives) and three countries from the former Soviet Union (Kazakhstan, Georgia, Russia). A different personality instrument was included in the Swedish data, therefore we did not further analyze this sample here. We did not exclude any data points. Table 1 shows the sample size, gender distribution and mean age in each sample as well as the nation-level predictor variables. We do not have sufficient information to calculate the representativeness of each sample to the latest census in each country, therefore, we cannot ascertain that all data sets are representative of the larger population.
Personality. The BFI-10 (Rammstedt and John, 2007) was measured with versions translated into the locally dominant language. It includes two items per personality trait, with one positively and one negatively keyed item per trait. Responses were recorded on a scale from 1 (disagree strongly) to 5 (agree strongly). Although it only contains two items per scale, previous research has suggested that the instrument offers an economic and practical option considering the trade-offs between length, reliability and validity (Rammstedt et al., 2013, 2014).
Niche Diversity Index. We used the coefficient developed by Lukaszewski et al. (2017). The index is based on a principal component analysis of the Human Development Index, the urbanization rate and sectorial diversity as measured by a nation's volume of exports. We used both the 2015 index by Durkee et al. (2022) and also computed the index for the latest available year prior to the beginning of the pandemic (2019 for most variables, 2018 for urban population). We did not use the 2020 index as most of the data was collected in the early months of 2020, when the true effects of the pandemic may not yet have affected those indices. For the original index, no data was available for the Maldives due to a lack of sectorial diversity data. We therefore used the composite score for Mauritius, the maritime neighbor of the Maldives, with a similar urbanization rate and a comparable HDI (Central Intelligence Agency, 2021). For the recent index, data for all nations was available. To examine the contribution of the individual components, we also extracted information for each of the subcomponents, that is, the Human Development Index, the Economic Complexity Index provided in the Harmonized System and the percent of the urban population.
Because the Human Development Index itself is a composite index, we also extracted the following individual components: life expectancy at birth and the mean of expected years of schooling and mean years of schooling. The HDI also contains an estimate of Gross National Income per Capita (GNIpC). As we use the highly correlated Gross Domestic Product (in our data, r = 0.99), we did not separately include the GNIpC.
National wealth. We used the gross domestic product (GDP) in Purchasing Power Parity per capita in international US$ averaged for the years 2017-2020 (World Bank, 2020).
COVID-19 Information. We calculated objective risk indicators by including the number of positive new cases on a rolling 14-day average per million inhabitants and the number of deaths on a smoothed 7-day average per million citizens (n.d.). The relevant smoothed data for the first day of data collection for each national sample was downloaded.

Analysis
First, we applied multigroup confirmatory factor analysis (Fischer and Karl, 2019) using lavaan (Rosseel, 2012) and ccpsyc (Fischer and Karl, 2019) to the data. All models were fitted with a robust maximum likelihood estimator, with the mean of the latent variable fixed to 0 and the latent variable variance fixed to 1 to allow all item loadings to be freely estimated. To determine whether an instrument shows comparable structures, it is necessary to include increasingly restrictive constraints to test whether parameter estimates are comparable across samples. In line with standard practice, we started with a baseline model in which items were allowed to load freely on their theoretically specified latent factor (configural invariance). We then restricted the factor loadings to be identical across samples in a second step (metric invariance). Finally, we restricted item intercepts to be identical across samples (scalar invariance). Only under scalar invariance can the trait scores be directly compared across samples. Under metric invariance, it is possible to compare correlations and score patterns across samples, but not the means directly.
We used the following fit criteria to evaluate model fit: the χ² value, the comparative fit index (CFI), the Tucker Lewis Index (TLI), the root mean square error of approximation (RMSEA) and the standardized root mean square residual (SRMR). We report robust fit statistics for CFI, TLI and RMSEA. Values above 0.95 for the CFI and TLI indicate appropriate fit, whereas values of less than 0.08 for the RMSEA and less than 0.06 for the SRMR indicate sufficient fit. The relative change in fit when introducing restrictions is evaluated by comparing the difference in CFI and RMSEA, with values of less than 0.01 indicating adequate fit (for more information on fit indices, see Hu and Bentler, 1999; Marsh et al., 2004; for a discussion of the cut-off values for multigroup settings, see Rutkowski and Svetina, 2014).
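The decision rules in this paragraph can be written down compactly. The following Python sketch is not the code used in the study (the models were fitted in R). It simply encodes the cut-offs stated above, and the function names and dictionary keys are illustrative assumptions.

```python
# Cut-offs as stated in the text; names and structure are illustrative only.
CUTOFFS = {
    "cfi": ("min", 0.95),    # values above 0.95 indicate appropriate fit
    "tli": ("min", 0.95),
    "rmsea": ("max", 0.08),  # values below 0.08 indicate sufficient fit
    "srmr": ("max", 0.06),   # values below 0.06 indicate sufficient fit
}


def evaluate_fit(fit):
    """Return, per index, whether the value meets its cut-off."""
    results = {}
    for name, (kind, cutoff) in CUTOFFS.items():
        value = fit[name]
        results[name] = value > cutoff if kind == "min" else value < cutoff
    return results


def invariance_step_ok(delta_cfi, delta_rmsea, tolerance=0.01):
    """Check that adding constraints changed CFI and RMSEA by less than 0.01."""
    return abs(delta_cfi) < tolerance and abs(delta_rmsea) < tolerance
```

For instance, a hypothetical model with CFI = 0.97, TLI = 0.96, RMSEA = 0.05 and SRMR = 0.04 would pass all four checks.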
Confirmatory factor analysis is a technically conservative test for personality instruments due to known problems with cross-loadings (McCrae et al., 1996). In case no adequate fit is found, we first conduct principal component analyses separately for each sample. In a second step, each solution is rotated to maximal similarity with the original factor structure of the German sample, which was based on a general population sample (Rammstedt et al., 2013). Procrustes rotation is used because of the known indeterminacy of sample-specific factor rotations (Commandeur, 1991). To determine whether the factor structure is similar, we calculate Tucker's Φ (Phi), which indicates the relative factor similarity between 0 and 1. Values above 0.85 represent a minimum acceptable level of similarity (ten Berge, 1986), with values above 0.9 or ideally 0.95 indicating adequate factor replication (Fischer and Fontaine, 2011; Fischer and Karl, 2019).
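To make the two steps concrete, the sketch below shows an orthogonal Procrustes rotation towards a target loading matrix followed by Tucker's congruence coefficient per factor. It is a minimal NumPy illustration under the assumption of an orthogonal rotation. The study itself used R packages, and the function names here are hypothetical.

```python
import numpy as np


def procrustes_rotate(loadings, target):
    """Rotate `loadings` orthogonally to be as close as possible to `target`.

    Solves min_R ||loadings @ R - target||_F over orthogonal R via the SVD
    of loadings.T @ target (the classical orthogonal Procrustes solution).
    """
    u, _, vt = np.linalg.svd(loadings.T @ target)
    return loadings @ (u @ vt)


def tucker_phi(x, y):
    """Tucker's congruence coefficient between two loading vectors."""
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))


def factor_congruence(loadings, target):
    """Congruence of each rotated factor with the matching target factor."""
    rotated = procrustes_rotate(loadings, target)
    return [tucker_phi(rotated[:, k], target[:, k])
            for k in range(target.shape[1])]
```

With a 10 x 5 loading matrix per sample (ten items, five components), `factor_congruence` returns five values that can be compared against the 0.85 and 0.90 thresholds and averaged into a sample-level index.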
An alternative is to examine indicators of effective dimensionality (Del Giudice, 2021), that is, the number of orthogonal dimensions that would reproduce the observed pattern of covariation without making any assumptions about the underlying structure. The estimation of effective dimensionality is based on the eigenvalues of the correlation matrix in relation to the number of variables in the set. We used ED(n1), which is considered a balanced estimator for general purposes and can be understood as the equivalent of Shannon entropy H1.
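A minimal sketch of such an entropy-based estimator is given below. It assumes that ED(n1) is the exponential of the Shannon entropy of the eigenvalues after rescaling them to sum to one, which is how the link to Shannon entropy mentioned above is usually expressed. The function name is hypothetical and this is not the implementation used in the study.

```python
import numpy as np


def effective_dimensionality_n1(corr):
    """Entropy-based effective dimensionality of a correlation matrix.

    Assumption: n1 = exp(-sum(p_i * log(p_i))), where p_i are the
    eigenvalues rescaled to sum to one.
    """
    eigenvalues = np.linalg.eigvalsh(np.asarray(corr, dtype=float))
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # guard against tiny negatives
    p = eigenvalues / eigenvalues.sum()
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(np.exp(-np.sum(p * np.log(p))))
```

Under this definition, ten uncorrelated items yield a value of 10, whereas ten perfectly correlated items yield a value of 1. Intermediate values indicate how many dimensions the items effectively span.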
To test whether replicability varied systematically across samples, we first correlated the average Tucker's Φ and ED(n1) values with the niche diversity index, GDP per capita as well as relative COVID cases and deaths.
To compare the overall effects and to estimate the overall uncertainty around parameter estimates, we report a Bayesian regression analysis with brms (Bürkner et al., 2022). We initially fitted a standard regression model for each model, standardizing the data. On this standardized data, we ran a generalized linear model with 4 chains and 2000 iterations (with 1000 warmup samples). The advantage of Bayesian analysis is that it incorporates prior knowledge and beliefs into the analysis. These priors provide a way to incorporate expert opinions, past experience or theoretical expectations into the statistical model and can then be used to assess the uncertainty associated with the data and produce more accurate results in light of prior knowledge. Neutral priors represent a lack of prior information or prior knowledge about the parameter of interest. Such a prior is usually specified as a flat or uniform distribution over a range of values that encompasses the entire possible range of the parameter (similar to a null hypothesis testing paradigm where the comparison is always the expectation that there is no relationship). A negative prior can be used to represent an expectation of a negative association between observations, or in other words, that the parameter of interest is likely to be less than the value proposed by the null hypothesis (which in our case is zero). A positive prior represents the belief that the parameter of interest is likely to be greater than the value proposed by the null hypothesis, expressing an expectation in line with predictions, e.g., a positive association between observations in the data. We specified three different priors: a negative prior with an opposing effect to our predicted effect (μ = -1, sd = 1), a neutral prior centred on no effect (μ = 0, sd = 1) and a positive prior favoring our predictions (μ = 1, sd = 1).
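How the three priors pull an estimate in different directions can be illustrated with a deliberately simplified conjugate calculation. This sketch is not the brms model used in the study: it assumes a single standardized predictor, no intercept and a known residual standard deviation, so that the posterior for the slope has a closed form. All names are hypothetical.

```python
import numpy as np


def slope_posterior(x, y, prior_mean, prior_sd=1.0, noise_sd=1.0):
    """Posterior mean and sd of a regression slope under a normal prior.

    Simplifying assumptions: y = slope * x + noise, with x and y centred
    and the residual standard deviation `noise_sd` treated as known.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = np.sum(x ** 2) / noise_sd ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision
                            + np.sum(x * y) / noise_sd ** 2)
    return float(post_mean), float(np.sqrt(post_var))


def compare_priors(x, y):
    """Posterior summaries under the negative, neutral and positive priors."""
    return {label: slope_posterior(x, y, prior_mean=mu)
            for label, mu in (("negative", -1.0), ("neutral", 0.0),
                              ("positive", 1.0))}
```

With only 16 country-level observations, the prior carries noticeable weight, which is why it is informative to check whether an effect remains credible under the prior that opposes the prediction.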
We also used a ROPE (Region of Practical Equivalence) to decide whether effects were credible in the sense of being large enough to be of theoretical interest. The use of a ROPE in Bayesian testing provides a more flexible approach to hypothesis testing compared to traditional frequentist methods. It allows us to specify our prior beliefs about the difference between the null and alternative hypothesis and to incorporate these beliefs into the analysis. This can lead to more accurate inferences and more appropriate decision-making based on the data. We computed the credibility intervals and examined the percentage of credible values that fall within the ROPE as an alternative to null hypothesis testing. Our region of practical equivalence was set from −0.1 to +0.1 for the normalized predictors. In other words, it expresses the changes in the criterion variable in relation to the standard deviation: an increase of 1 SD in the predictor, holding all other values in the equation constant, is expected to result in a change of at least 0.1 in the standard deviation of Tucker's Φ or ED(n1). Credibility intervals can be unstable; therefore, one of the recommendations is to use 89% Credibility Intervals. The specific value of 89% is largely arbitrary, partly chosen because it is the largest prime number below some region of instability identified in previous stability tests, partly to highlight the arbitrariness of the 0.95 or 0.99 thresholds used in null hypothesis significance testing (Makowski et al., 2019; McElreath, 2020). In order to directly compare the ROPE tests with a classic null hypothesis testing approach, we also computed the 100% Credibility Interval in relation to the ROPE. We wrote a function using calculations provided by the bayestestR package (Makowski et al., 2019).
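The ROPE logic can be sketched as follows. This is an illustration operating on a vector of posterior draws, not the function written for the study. It assumes an equal-tailed credible interval (bayestestR also offers highest-density intervals), and the function name is hypothetical.

```python
import numpy as np


def rope_percentage(draws, rope=(-0.1, 0.1), ci=0.89):
    """Share of posterior draws inside the credible interval that fall in the ROPE.

    Assumption: the credible interval is equal-tailed. Setting `ci=1.0`
    uses the full set of draws.
    """
    draws = np.asarray(draws, dtype=float)
    tail = (1.0 - ci) / 2.0
    lower, upper = np.quantile(draws, [tail, 1.0 - tail])
    inside_ci = draws[(draws >= lower) & (draws <= upper)]
    in_rope = (inside_ci >= rope[0]) & (inside_ci <= rope[1])
    return float(in_rope.mean())
```

A value near 0 means that practically none of the credible slope values are negligibly small, whereas a value near 1 means that the effect is practically equivalent to zero.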
Because Niche Construction involves an equivalent measure of GDP within the HDI, we report analyses separately for Niche Construction and GDP, as well as the individual subcomponents of the Niche Construction index. In all regressions, we included COVID-19 deaths per million. Using a Bayesian estimation approach, we can calculate the certainty around estimates for both niche diversity and GDP.
Finally, any cross-cultural data is potentially non-independent due to geographical proximity, cultural borrowing and common cultural descent. In our case, we are dealing with variables that may show different types of dependency structures, some of which are counterintuitive (e.g., COVID cases spread relatively fast from China to parts of the Middle East and Europe via business contacts but moved slower to both linguistically and geographically closer nations). As a conservative estimate, we reran all our analyses including continent as a fixed factor.

Results
We first report the confirmatory factor analyses. We ran an unconditional model across the full dataset, followed by a set of increasingly restricted models in which we forced all items to load on the same latent variables in each sample (configural invariance), restricted the loadings to be equal across samples (metric invariance) and finally constrained intercepts to be identical across samples (scalar invariance). The unconstrained models did not converge when assuming either linear or ordinal data formats. This may be expected given the availability of only 2 indicators per latent factor (Bollen, 1989). When constraining the loadings to be identical across samples, the models converged, but model fit was poor: χ² = 16,596.937, df = 475, CFI = 0.50, TLI = 0.24, RMSEA = 0.15, SRMR = 0.11. When further constraining the intercepts to be equal, fit deteriorated even further: χ² = 27,045.534, df = 550, CFI = 0.22, TLI = −0.02, RMSEA = 0.18, SRMR = 0.15.
In the next step, we applied a principal component analysis to each sample individually and then rotated the factor loading matrix to maximal similarity with the original German factor loading matrix ( N = 1134). We used Procrustes rotation and estimated factor loading similarity using Tucker's Φ and correlations (see Table 2 ). The mean values of Tucker's Φ were 0.84 (Extraversion), 0.67 (Agreeableness), 0.80 (Conscientiousness), 0.84 (Neuroticism) and 0.73 (Openness). Using correlations, the means were comparable and slightly higher: 0.86 (Extraversion), 0.73 (Agreeableness), 0.83 (Conscientiousness), 0.86 (Neuroticism) and 0.76 (Openness). Averaging across all five dimensions, the  only sample with an average above the 0.85 threshold was Great Britain (mean = 0.884), with the lowest average Tucker's Φ being observed in Kazakhstan data ( Φ = 0.584). Could translation problems explain these problems? When rotating the German sample to the previously collected German target sample, the average replicability was below minimum thresholds ( Φ Mean = 0.802), with values ranging from Φ = 0.43 (Agreeableness) to 0.99 (Extraversion, Neuroticism). The Austrian data was collected with the same German language version. The mean Φ was 0.83, with only the Extraversion and Neuroticism factors being above the minimum threshold again. The German and Austrian data therefore suggest that translation problems might not be the primary driver of the low factor replication across all factors in the current data.

Reliability estimates
To examine the reliability of the personality domains we recoded reverse coded items and used the Spearman-Brown formula to estimate reliability. Across samples, all factors showed low reliability. Assuming an internal consistency of 0.6 for research instruments, only the Extraversion scores for the Austrian, German, British and Korean data met this threshold as well as the Neuroticism scores in the British and Russian sample. We show the reliability estimates in Table 3 . The online supplement also reports the average correlations between the item pairs.
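For a two-item scale, the Spearman-Brown estimate depends only on the correlation between the two items. The short Python sketch below illustrates the formula and its inverse. It is illustrative only, and the function names are hypothetical.

```python
def spearman_brown(r, k=2):
    """Reliability of a k-item composite given the inter-item correlation r."""
    return k * r / (1.0 + (k - 1.0) * r)


def required_item_correlation(reliability, k=2):
    """Inter-item correlation needed to reach a target reliability."""
    return reliability / (k - (k - 1.0) * reliability)
```

Under this formula, a two-item scale reaches the 0.6 threshold only when its items correlate at about 0.43 or higher (after reverse coding), which helps to explain why few scales met the criterion.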

Country-level associations
For descriptive purposes, the mean Tucker's Φ value correlated strongly with the ED(n1) value (see Table 4), suggesting that the two estimates are capturing similar information about the structural properties of the instrument. We next examined whether the overall factor replicability and factorability of the data were associated with the nation-level variables (see Table 4). To interpret the association of variables in our dataset, we decided to interpret absolute correlation values above 0.36, which is two standard deviations above the average nation-level correlation observed in previous cross-country research (Franke and Richey, 2010). Of the ten country-level variables times two factor structure indicators, eleven associations were above this threshold. Consistently above the threshold for both factor indicators were the niche diversity indices for 2015 and 2019, the association with the HDI as well as life expectancy within the HDI. Traditional statistical significance was only observed for Tucker's Φ with the 2019 niche diversity index (p < .05), life expectancy (p < .01) and economic complexity (p < .05). The COVID statistics did not reliably associate with factor structures, although the association of COVID deaths per million with Tucker's Φ showed a correlation of 0.41, which was two standard deviations above the average country-level correlation in previous research.
Table 3. Internal consistency estimates (Spearman-Brown) per scale and country.
Note. M and SD indicate the sample mean and standard deviation, respectively. Values in square brackets show the 95% confidence interval. * p < .05. ** p < .01.
To examine these effects comparatively, we fitted a series of Bayesian regression models in which we regressed the average Tucker's Φ and ED(n1) values on GDP, niche diversity or one of its components, together with deaths per million (see Tables 5-12). To assess the robustness of our findings we used three different priors (negative: mean = -1, sd = 1; neutral: mean = 0, sd = 1; positive: mean = 1, sd = 1) (see Fig. 1). Focusing on the 89% credible intervals and whether the interval included zero, the effects for niche diversity in 2015 on Tucker's Φ (with both a neutral and a positive prior) and for the 2019 niche diversity index on both Tucker's Φ and ED(n1) showed some credibility. The effects for GDP did not show sufficient credibility even with positive priors. Breaking the niche diversity effect down into its components, the consistently strongest effect was observed for life expectancy. Economic diversity showed a credible effect on Tucker's Φ assuming a positive prior, whereas urbanization showed a positive effect on ED(n1) assuming positive priors. When considering the ROPE in relation to the credible interval (see Table 13), the most consistent results were observed for niche diversity in 2015 on Tucker's Φ (positive prior); for life expectancy on Tucker's Φ under all prior conditions and on ED(n1) with positive priors; and for economic diversity on Tucker's Φ using neutral and positive priors. None of the COVID-19 statistics had a consistent effect on factor replication or factorability. The overall results were replicated when including continent as a fixed effect (see Table 14). Some of the effects for niche diversity became somewhat more credible when controlling for continent.

Table 9. Summary of Bayesian regression predicting average Tucker's Φ and ED(n1) using Life Expectancy (HDI component) and COVID-19 deaths per million.
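The logic of this prior sensitivity check can be sketched with a simple normal approximation. The example below is not the model fitted in the study: it assumes a single standardized coefficient with a known standard error, combines it with each of the three priors by a conjugate normal update, and then reports the equal-tailed 89% credible interval and the share of the posterior inside a ROPE. The estimate, the standard error and the ROPE limits of ±0.1 are hypothetical values chosen for illustration.

```python
from statistics import NormalDist


def posterior_normal(prior_mean, prior_sd, estimate, se):
    """Conjugate update: normal prior combined with a normal likelihood."""
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / se ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (
        prior_precision * prior_mean + data_precision * estimate
    )
    return NormalDist(post_mean, post_var ** 0.5)


def credible_interval(posterior, mass=0.89):
    """Equal-tailed credible interval of a normal posterior."""
    tail = (1 - mass) / 2
    return posterior.inv_cdf(tail), posterior.inv_cdf(1 - tail)


def rope_share(posterior, low=-0.1, high=0.1):
    """Posterior probability inside an ASSUMED region of practical equivalence."""
    return posterior.cdf(high) - posterior.cdf(low)


# Hypothetical standardized regression coefficient and its standard error.
ESTIMATE, SE = 0.5, 0.2

# Prior means follow the text (sd = 1 in all three cases).
PRIORS = {"negative": -1.0, "neutral": 0.0, "positive": 1.0}

results = {}
for label, prior_mean in PRIORS.items():
    post = posterior_normal(prior_mean, 1.0, ESTIMATE, SE)
    lower, upper = credible_interval(post)
    results[label] = (post.mean, lower, upper, rope_share(post))
    print(
        f"{label:>8}: mean={post.mean:.2f}, "
        f"89% CI=[{lower:.2f}, {upper:.2f}], "
        f"excludes zero={lower > 0 or upper < 0}, "
        f"in ROPE={rope_share(post):.3f}"
    )
```

In this toy setting an effect counts as robust when its interval excludes zero under all three priors. With only 16 nations, real standard errors may be large enough for the prior to move the interval across zero, which is the situation the three-prior comparison is designed to reveal.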

Discussion
We analyzed a brief personality measure that was applied to large samples during the initial stages of the COVID-19 pandemic. A first important finding is that the instrument did not show satisfactory psychometric properties. Importantly, these structural features were not random but appear systematically related to important ecological features of the environments in which these populations are residing. In contexts with higher niche diversity, the factor structure more closely resembled the originally observed personality structure of this brief instrument.

Note: Only the predictors from niche diversity and GDP are shown; none of the COVID-19 predictors showed practically meaningful results. * denotes that the effect is reliable.
Extending previous work on the niche diversity hypothesis, we are the first to report data at a population level, using both factor replicability and factorability instead of factor intercorrelations to test the hypothesis. Arguably, factor replicability is a more appropriate test of the niche construction hypothesis, because it provides an overall estimate of the organization of the personality space as inferred from the survey responses, instead of assuming that the factors are present and then examining only the relations of the factors to each other.
We strongly encourage future studies to consider the cognitive implications at the psychometric level instead of just analyzing the factor interrelations, because in the face of substantive factor incomparability these intercorrelations might not be meaningful (Fischer et al., 2022; Lasker et al., 2022). Focusing on the relative strength of the wealth vs niche diversity hypothesis, especially when considering the components within the niche diversity index, our results suggest that the effects may be mostly associated with variables leading to greater life expectancy, but also with economic complexity within a nation. Life expectancy at birth may be seen as a forecast of the stability and reliability of institutions that assure the wellbeing of individuals and provide support for a long life. This was the most reliable association with factor replicability (average congruence). Economic diversity within the niche diversity index also showed some credible effects, but the associations were less consistent. This index is closer to classic considerations of the division of labor leading to higher income, which suggests that it is the opportunity to secure income through diverse means that is associated with greater factor replication.
We also tested whether COVID-19 related variables were correlated with the overall psychometric properties. As discussed above, stress induced during the pandemic may influence the cognitive processes that are relevant for responding to surveys. Supporting our initial theorizing, a large sample in the US suggested that the interpretation of some items changed during the early stage of the pandemic ( Sutin et al., 2020 ). In our population samples from 16 nations, we did not find credible evidence for such relationships at the level of the overall instrument. Although the strength of the association ( r = 0.41) with death rates was sizeable, the credibility intervals suggested that there was considerable uncertainty about the association.
One area for future investigation is a more careful exploration of the subjective threat levels of individuals and how those levels may affect responses to personality surveys. We relied on official COVID-19 statistics at the beginning of data collection within each nation. The subjective levels of threat may differ due to the salience of the threat during the time of the study and such subjective threat perceptions may have a more substantive effect on personality responses. Furthermore, a stronger test of the stress response at the nation-level would have evaluated the personality structure just prior to the pandemic and then again during the pandemic. Although we correlated COVID-cases and deaths with factor structure indicators, our associations may be due to other third variables and not necessarily stress. Addressing these issues is clearly an important avenue for further work and a shortcoming of our study. We report these associations to alert researchers to the possibility of such effects (see also Sutin et al., 2020 ) and stimulate further research.
Another important question is whether these effects are driven by the brief nature of the personality inventory that was used. Given previously reported patterns with more extensive instruments ( Durkee et al., 2020 ;Fischer, 2017 ;Fontaine et al., 2008 ;Lukaszewski et al., 2017 ) and the emerging evidence of the robustness of single item or short personality measures ( Konstabel et al., 2017 ;Rammstedt and John, 2007 ;Soto and John, 2017 ;Spörrle and Bekk, 2014 ), we should probably expect similar patterns. The calculation of credibility intervals also points to a relative robustness of this effect.
Yet, there is a parallel literature that points to significant problems with short measures (Chapman and Elliot, 2019; Laajaj et al., 2019; Ludeke and Larsen, 2017; Steyn and Ndofirepi, 2022), including the measure that we used. There are a number of known problems when trying to capture behavioral information across cultural contexts with short measures. A conceptual problem is domain underrepresentation, together with a lack of relevance and representativeness of items (Fischer and Karl, 2019; Fontaine, 2005). With fewer items, it is easier to miss important components of a theoretical concept or to have items that do not capture the relevant or representative content of the conceptual domain in each cultural context. As a result, the factor structures become less stable. It may also mean that important indicators that vary systematically across contexts are missing, which would lead to a lack of information and therefore an inability to test systematic niche construction effects. A second known problem is the presence of reverse-coded items across languages, which may lead to greater replicability issues (Vijver and Leung, 1997). Such reverse-coded item problems might be exacerbated by linguistic distance, because languages differ in how negations are grammatically encoded (Karl & Fischer, 2022). As a consequence, shorter measures will in all likelihood reduce replicability across a larger number of countries. This is a classic trade-off, reminiscent of the bandwidth-fidelity trade-off in psychometric measurement more generally.
At the same time, we think the discussion needs to shift to whether replicability varies systematically in the same direction across instruments, as predicted by the niche diversity hypothesis. To the extent that there are systematic effects of niche construction that influence how individuals perceive themselves, we should expect the pattern to replicate in both longer and shorter measures. In fact, the observations of better replication of short measures in WEIRD samples by Ludeke and Larsen (2017) as well as Laajaj et al. (2019) seem to support our argument. We think this is an important area for further research: to systematically examine the impact of any niche construction effects on personality measures in interaction with the length and comprehensiveness of the measures. Our prediction would be that niche construction effects can be detected independent of the length of the personality measure, but that they may be more easily detectable in longer measures to the extent that longer instruments capture a broader set of factors that are sensitive to the diversification of niches across cultures.
Finally, considering the underlying rationale of the niche diversity hypothesis, it would also be informative to recruit truly representative national samples to comprehensively test whether economic diversification is indeed associated with greater factor differentiation at a population level. We strongly encourage systematic exploration of personality structures across instruments (including instruments varying in length) in interaction with features of the local environment to provide a more powerful and comprehensive science of human personality (Fig. 2).

Authorship statement
RF developed the study idea; RF & JAK compiled and analysed the data; RF wrote the first manuscript draft; JAK revised and edited the manuscript; both authors approved the final version.

Ethical-declaration
The analyzed data is available online and ethical clearance has been obtained by the research consortium, according to locally applicable research guidelines.

Funding
This project was supported by the São Paulo Research Foundation (Fundação de Amparo à Pesquisa do Estado de São Paulo, FAPESP) for the project entitled: Recomeçar melhor: Implementação de uma plataforma digital para avaliação e promoção de saúde mental durante e pós-pandemia COVID-19 (Build back better: Implementation of a digital health platform to evaluate and promote mental health during and following the COVID-19 pandemic), Processo: 2021/08774-1.

Declaration of Competing Interest
Given their role as Editor in Chief, Fischer R. had no involvement in the peer-review of this article and had no access to information regarding its peer-review. The other author has declared that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.