Psychometric properties of the Actively Open-minded Thinking scale

The Actively Open-minded Thinking scale (AOT; Stanovich & West, 2007) is a questionnaire used to measure the disposition towards rational thinking as a single psychological trait. Yet, despite its frequent use, including in abbreviated form, it is still unclear whether sumscores on the AOT can actually be used to order individuals on their disposition towards actively open-minded thinking and whether the questionnaire can be validly shortened. The present study aimed to obtain a valid, shorter AOT. We conducted Mokken scale analyses on the (Dutch) AOT using two samples of higher education students (N = 930; N = 509). Our analyses showed that none of the 41 items discriminated sufficiently between respondents with varying latent trait levels. Furthermore, no item set of the AOT could be obtained that validly orders individuals on the assumed latent trait, which is a crucial assumption when the scale is used in research. Consequently, it is questionable whether scores on the AOT provide insight into the concept it aims to measure.


Introduction
Baron introduced the concept of actively open-minded thinking as an ideal standard of thinking, aimed at avoiding the tendency to reason based on intuitive heuristics and at focusing instead on reflection about rules of inference. After Baron introduced the concept, the Actively Open-minded Thinking scale (AOT) developed by Stanovich and West (1997, 2007) has been widely used as a measure of people's disposition towards rational thinking (see Table 1). The AOT has been shown to predict performance on, for example, critical thinking tests and is an important measure in reasoning and decision-making research (e.g., Heijltjes, Van Gog, Leppink, & Paas, 2014; Toplak, West, & Stanovich, 2011; West, Toplak, & Stanovich, 2008). The most widely used version of the AOT (Stanovich & West, 2007) consists of 41 statements in the form of Likert-type items. Because this version takes substantial time to administer, it would be practical to obtain a valid shorter version (cf. the abbreviated versions of the Need for Cognition scale, another widely used disposition scale; Cacioppo, Petty, & Feng Kao, 1984; Chiesi, Morsanyi, Donati, & Primi, 2018). In the literature, many different versions of the AOT have been applied, with item selections ranging from as few as 7 items (e.g., Haran, Ritov, & Mellers; see Table 1).

Table 1. Different versions of the Actively Open-minded Thinking scale (AOT) developed by Stanovich and West (2007). [Table content not reproduced here.]

Note. This table provides an overview of scientific articles in which use of the Actively Open-minded Thinking scale (AOT; Stanovich & West, 2007) was reported and shows which version was used. The overview was generated as follows: we surveyed the list of publications referring to Stanovich and West (2007) in Thomson Web of Science on February 7, 2019, and included in this table the articles that applied the AOT in some way (i.e., theoretical or review articles, conference papers, and empirical articles referring to but not applying the AOT were ignored). Studies are listed in chronological, then alphabetical order.
No. items shows the number of selected items in the study; Intermixed shows whether the AOT items were intermixed with items of other questionnaires during the assessment; AOT measure shows in what way the AOT was included in the analyses; Reported psychometrics shows which psychometric statistics were reported for the AOT measure. n.r. = not reported in the manuscript; (?) = not explicitly mentioned but implicitly clear from the manuscript; Unclear = the authors reported explicitly that they used a different item selection than the original version but did not report how many items they adopted; α = Cronbach's alpha; SH = split-half reliability (Spearman-Brown corrected).
Thus far, based on high Cronbach's alphas, most versions of the AOT, including the original version, have been considered reliable. Despite these high alphas and the associations of AOT scores with variables such as critical thinking test performance, it remains unclear to what degree the items together reflect the single psychological trait of actively open-minded thinking. In other words, the internal validity of the AOT is unclear and, because of this, it is hard to substantively interpret correlations of the AOT with other variables. Thus, to further advance and strengthen research in the domain of reasoning and decision-making, it is important to investigate whether and to what degree the items included in the AOT measure the psychological trait of actively open-minded thinking.
The little research that has been conducted on the internal validity of the AOT has used Factor Analysis (FA; Stanovich & West, 1997; Svedholm-Häkkinen & Lindeman, 2018). The results of these FAs, however, cast doubt on whether AOT sumscores measure a single psychological trait and, thus, on whether they can be used to order individuals on the assumed latent trait; a finding we will elaborate on shortly. Furthermore, an assumption of FA is that the item scores are continuous, and although this is a common assumption in social sciences research, psychometricians argue that Likert scale data are not continuous and should be treated as categorical (Flora, LaBrish, & Chalmers, 2012; Jamieson, 2004; Liddell & Kruschke, 2018).
An alternative test-theory approach that is suitable for categorical data is Mokken scale analysis (Mokken, 1971; Sijtsma & Molenaar, 2002). Mokken scale analysis is a non-parametric item response theory approach which tests whether a set of items can be used to order individuals on an assumed latent trait. Further benefits of this analysis are that it does not require multivariate normality or linear correlations between items, which are two assumptions of FA that are easily violated when working with Likert scale data (Flora et al., 2012; Jamieson, 2004; Liddell & Kruschke, 2018). Therefore, the present study applied Mokken scale analyses to (Dutch) AOT data (translated version used in Heijltjes, Van Gog, Leppink et al., 2014) to see whether we could obtain a shorter, valid version. In addition, for comparability with prior work by Stanovich and West (2007) and by Svedholm-Häkkinen and Lindeman (2018), we also conducted two Confirmatory Factor Analyses (CFAs), applying their proposed models.

Actively Open-minded Thinking
According to Baron (2008), good thinking consists of (1) "search that is thorough in proportion to the importance of the question, (2) confidence that is appropriate to the amount and quality of thinking done, and (3) fairness to other possibilities than the one we initially favor" (p. 200, italics added); open-minded refers to "the consideration of new possibilities, new goals, and evidence against possibilities that already seem strong" (p. 200, italics added); and active refers to not waiting for these things to happen but seeking them out. In his work, Baron argued that our thinking often deviates from the ideal of actively open-minded thinking, which leads to biases in our reasoning and decision-making. Think for example of our tendency to ignore evidence that goes against the conclusion we favor (i.e., confirmation bias).
Actively open-minded thinking is viewed as a thinking disposition. Thinking dispositions (or cognitive styles) are viewed as relatively stable psychological mechanisms that tend to generate characteristic behavioral tendencies and tactics (Stanovich, West, & Toplak, 2016). Thinking dispositions reflect people's goal management, epistemic values, and epistemic self-regulation. In their book "The Rationality Quotient", Stanovich et al. (2016) claim that, next to intelligence, thinking dispositions underlie rationality. They argue that, independent of cognitive ability, those who habitually seek various points of view or think extensively about a problem tend to display more rational behavior than those without such thinking dispositions. Psychologists have studied many thinking dispositions in relation to rationality. For instance, an individual's tendency to engage in and enjoy thinking, measured with the Need for Cognition scale (NFC; Cacioppo & Petty, 1982; Cacioppo et al., 1984), has been shown to be positively associated with rational thinking skills 1 after controlling for variance due to cognitive ability (Toplak & Stanovich, 2002; West et al., 2008). Other examples of thinking disposition questionnaires are the Rational-Experiential Inventory (Epstein, Pacini, Denes-Raj, & Heier, 1996) and the Consideration of Future Consequences scale (Strathman, Gleicher, Boninger, & Edwards, 1994). The disposition towards actively open-minded thinking, however, is theorized to be the most central to rational thinking (Baron, 2008; Stanovich et al., 2016). Baron (2008) mostly measured actively open-minded thinking qualitatively by assessing people's beliefs about good thinking, for example, by asking them to evaluate exemplars of the thinking of others. Inspired by Baron's work, Stanovich and West (1997, 2007) composed a questionnaire to measure actively open-minded thinking, the Actively Open-minded Thinking scale (AOT).
The AOT consists of statements about thinking based on which participants rate their (dis)agreement on a Likert response format with six categories: 6 (agree strongly), 5 (agree moderately), 4 (agree slightly), 3 (disagree slightly), 2 (disagree moderately), 1 (disagree strongly). An example of such a statement is: "A person should always consider new possibilities."

The Actively Open-minded Thinking scale
For the first version, Stanovich and West (1997) composed 56 items distributed across eight subscales (Flexible Thinking, Openness Values, Dogmatism, Categorical Thinking, Openness-Ideas, Absolutism, Superstitious Thinking, and Counterfactual Thinking). They found that only three of the eight subscales were reliable. 2 Moreover, a principal components analysis (PCA) revealed that the first six subscale sumscores formed one component explaining most of the variance (40.8 %). Consequently, they excluded the subscales Superstitious Thinking and Counterfactual Thinking and computed a single composite score from the remaining subscales (i.e., summing the scores on the Flexible Thinking, Openness-Ideas, and Openness-Values scales and subtracting the sum of the Absolutism, Dogmatism, and Categorical Thinking scales), thereby treating the AOT as a unidimensional trait without subfactors. This score was intended to order respondents on a scale ranging from "openness to belief-change and cognitive flexibility" (high scores) to "cognitive rigidity and resistance to belief change" (low scores). In the 1997 study, the Spearman-Brown corrected split-half reliability of the scale was .90 and Cronbach's α of the scale as a whole was .88. Ten years after the first introduction of the AOT (Stanovich & West, 1997), Stanovich and West (2007) introduced a 41-item AOT, which from then on became the most widely used version. This scale again consisted of six subscales: four remained the same as in the first version (i.e., Flexible Thinking, Openness Values, Dogmatism, Categorical Thinking), but two (Openness-Ideas and Absolutism) were replaced with the subscales Belief Identification and Counterfactual Thinking. A sumscore of the 41 items (after reverse scoring of 30 items) was intended to order respondents on their disposition towards actively open-minded thinking. Again, the reliability of the total scale was good (split-half reliability .75 and Cronbach's alpha .83). Because the subscale reliabilities were not reported, we assume that the new item selection was again intended to assess actively open-minded thinking as a unidimensional trait. However, an analysis to test this assumption was not reported.

1 In these studies, rational thinking is operationalized as the ability to avoid bias in reasoning and decision-making, measured with performance on so-called heuristics-and-biases tasks. These tasks measure whether someone is prone to a specific bias during a specific type of reasoning.
2 Split-half reliability and Cronbach's alpha were, respectively, .49 and .50 for Flexible Thinking; .73 and .71 for Openness Values; and .54 and .60 for (footnote continued)

Research using the Actively Open-minded Thinking scale
Since its introduction, numerous researchers have used the AOT. For example, multiple studies showed that sumscores on this scale correlated positively with measures of rational thinking, with significant correlation coefficients ranging from .10 to .85 (e.g., Sá, West, & Stanovich, 1999; Sá, Kelley, Ho, & Stanovich, 2005; Sá & Stanovich, 2001; Toplak et al., 2011; West et al., 2008; Heijltjes, Van Gog, Leppink et al., 2014; Lean Keng & AlQudah, 2017; Svedholm-Häkkinen & Lindeman, 2018). A search in Web of Science (February 2019) indicated that the scientific papers introducing the first (Stanovich & West, 1997) and second (Stanovich & West, 2007) versions of the AOT have been cited in 205 and 87 journal articles, respectively. We reviewed the 87 studies citing the 2007 version (currently the most widely used version) and found that 36 had adopted (a part of) the AOT as a measure (see Table 1). Researchers used it, often in combination with other disposition questionnaires, as a predictor of reasoning and decision-making (e.g., performance on rational, scientific, or analytic reasoning tasks, or political choices), epistemic beliefs (e.g., evolutionary theory acceptance, belief in conspiracy theories, or religiosity), or behavior (e.g., being a gambler or showing adaptive teaching behavior). Studies varied widely, however, in the way the scale was administered. Some relatively small differences concerned the response formats (4-point to 7-point Likert scales) and whether the items were intermixed with other (disposition) questionnaires. A more important difference concerned the item selection: within those 36 studies, 12 used a different item selection than the original 41-item AOT (see Table 1). In addition, in their book 'The Rationality Quotient', Stanovich et al. (2016) introduced a 30-item version and a 16-item short form.
Most studies reported sufficient reliability for the total scale (see Table 1). Recently, Svedholm-Häkkinen and Lindeman (2018) noted that it is not clear whether the AOT is unidimensional or multidimensional, because the PCA on the first 47-item AOT (Stanovich & West, 1997) was run on sumscores of the subscales rather than on single items, and because subsequent studies reported reliability measures for the scale as a whole only. Svedholm-Häkkinen and Lindeman (2018) aimed to develop a reliable, valid, shorter AOT and investigated whether the AOT was multidimensional. To this end, they conducted FAs in four separate samples (N = 2735, N = 458, N = 102, and N = 50) who had completed a Finnish version of the 41-item AOT (Stanovich & West, 2007). A 17-item version was sufficient to obtain good reliability (Cronbach's alpha) for the total scale and to obtain correlations with variables assessing other thinking dispositions, social competence, and supernatural beliefs. However, their results also showed that the AOT was not unidimensional. They compared five different factor models and concluded that four intercorrelated subfactors (Dogmatism, Fact resistance, Liberalism, and Belief personification) described the data best. Neither a model with a higher-order factor (i.e., representing active open-mindedness) explaining the common variance in the four subscales, nor a single-factor solution described the data adequately, which suggests that AOT sumscores cannot be used to validly order individuals on the assumed psychological trait of active open-mindedness. In addition, just as in the study by Stanovich and West (1997), the four subscales were only marginally reliable or not reliable. 3

The present study
In sum, despite its frequent use, previous studies have not yet demonstrated that the AOT is a valid measurement instrument to order individuals on actively open-minded thinking. High reliability values indicate that the observed AOT sumscores can rank individuals from low to high. In addition, a positive correlation of observed sumscores with an external predictor shows that a high AOT sumscore is likely to go together with a high score on a relevant other variable. When an AOT sumscore is computed and used in analyses, the implicit assumption is that this set of items can be used to order individuals on the assumed latent trait representing active open-mindedness. Both reliability and correlational analyses, however, do not answer the question of whether the items used (i.e., the specific AOT sumscore) measure a single latent trait. Psychometric validation of the AOT scale therefore requires testing this assumption by assessing whether the responses of individuals to each item can be described as a function of a single latent trait (i.e., internal validity). Svedholm-Häkkinen and Lindeman (2018) tested this assumption using FA; however, their FAs suggested that the items they included did not reflect a unidimensional trait (i.e., no higher-order factor). Hence, a sumscore of their 17-item solution cannot be used to validly order individuals on the latent trait of actively open-minded thinking either, as it is unclear what concept a sumscore on this abbreviated version reflects. To illustrate our point, imagine that we measured four traits with one questionnaire: socioeconomic status (SES) with 4 items, work satisfaction with 4 items, motivation to eat healthily with 4 items, and engagement in politics with 5 items. If we validly measured the four traits and subjected all 17 items to an FA, one would expect to find four intercorrelated factors without a higher-order factor (i.e., a factor structure similar to that of Svedholm-Häkkinen & Lindeman, 2018). However, even if the reliability was sufficient, a sumscore of those 17 items cannot be interpreted meaningfully, because the sum of a person's SES and motivation to eat healthily cannot easily be interpreted as a character trait of that person (i.e., the scales do not form a higher-order unidimensional trait). Even if this sumscore correlated with other variables (e.g., mental health or having debts), this would still not allow for interpreting the sumscore as reflecting a single latent trait.

(footnote continued) Dogmatism; not reported for Categorical Thinking because the scale consisted of only 2 items; .73 and .77 for Openness-Ideas; .69 and .64 for Absolutism; not reported for Counterfactual Thinking because the scale consisted of only 2 items; and .73 and .73 for Superstitious Thinking (see Stanovich & West, 1997).
3 Cronbach's alpha in Study 1 was .67 for Dogmatic thinking, .67 for Fact resistance, .43 for Liberalism, and .56 for Belief personification.
The aim of this study was to re-examine the validity of all 41 items of the AOT developed by Stanovich and West (2007), to see whether we could develop a valid shorter version that allows for ordering participants on the assumed latent trait. To this end, we used Mokken scale analysis, a non-parametric item response theory approach which tests whether a set of items can be used to order individuals on an assumed latent trait. Mokken scale analysis has advantages over the commonly conducted FAs when analyzing data based on Likert-type scales. An important advantage is that Mokken scale analysis is suitable for categorical data, whereas FA requires data at the interval level. In the field of psychometrics, it is argued that treating Likert scale data as interval data can be problematic (Liddell & Kruschke, 2018). When Likert scale data are instead treated as ordinal, it is technically not possible to test the normality assumption of FA because the difference between two successive values cannot be quantified. Even if one were to assess normality with ordinal data, Likert-scale items typically show skewed or polarized distributions (Jamieson, 2004; Liddell & Kruschke, 2018). An additional advantage of Mokken scale analysis in this respect is that it does not require multivariate normality or linear correlations between items. In the present study, we conducted an exploratory Mokken scale analysis (Mokken, 1971; Sijtsma & Molenaar, 2002) on the Dutch version of the AOT (Heijltjes, Van Gog, Leppink et al., 2014), using two samples of higher education students (N = 930 and N = 509), to see whether we could obtain a valid shorter version of the AOT. In addition, for comparability with prior work by Stanovich and West (2007) and by Svedholm-Häkkinen and Lindeman (2018), we conducted two CFAs representing their proposed models.

Method
All materials, datasets, R code, and output are stored on the Open Science Framework (OSF) page for this project; see https://osf.io/4hxzu/.

Participants and procedure
We obtained anonymized AOT datasets from a Dutch University of Applied Sciences 4 , where the AOT (see next section) was filled out on a computer by first-year students as part of a critical thinking course they were enrolled in. It was not possible to skip questions and only fully completed questionnaires could be submitted, so there were no missing data.
We repeated the same Mokken scale analyses on two datasets to see whether we obtained similar results. Dataset A (N = 930) was a merged dataset that consisted of 460 students in the economics and business domain (data collected in 2014) and 470 students in the health care domain (data collected in 2016). Age and sex were indicated by 908 participants, whose mean age was 18.84 years (SD = 2.30) and 55 % of whom were female. Dataset B (N = 509) was a merged dataset that consisted of 257 students in the marketing and business management domain (data collected in 2017) and 252 students in the health care domain (data collected in 2017). Age and sex were indicated by 506 participants, whose mean age was 18.82 years (SD = 2.46) and of whom 50 % were female.

Actively Open-minded Thinking scale
We used a Dutch translation of the original 41-item version of the AOT (Stanovich & West, 2007). The Dutch translation was made in a previous study by Heijltjes, Van Gog, Leppink et al. (2014) and was checked by two persons, one of whom was a native English speaker. In line with the original scale, the response format consisted of six answering categories: Strongly agree (6), Moderately agree (5), Slightly agree (4), Slightly disagree (3), Moderately disagree (2), and Strongly disagree (1). We reverse scored 30 items so that for all items a higher score indicated a stronger disposition towards actively open-minded thinking 5 (for all items, see https://osf.io/4hxzu/).

4 The Dutch education system distinguishes between higher professional education offered by universities of applied sciences (Bachelor, Master) and academic education offered by academic universities (Bachelor, Master, PhD, with the PhD being an additional four-year trajectory after a Master degree).
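On a 6-point scale, reverse scoring amounts to mapping each response x to 7 − x. The following minimal sketch illustrates this step; the response values and the reversed-item indices are hypothetical (the actual list of 30 reverse-keyed AOT items is on the OSF page):

```python
import numpy as np

# Hypothetical responses of two respondents to five 6-point Likert items.
responses = np.array([[6, 1, 4, 2, 5],
                      [3, 6, 2, 5, 1]])

# Placeholder indices of reverse-keyed items (the AOT reverses 30 of its 41 items).
reverse_items = [1, 3]

scored = responses.copy()
scored[:, reverse_items] = 7 - scored[:, reverse_items]   # 6-point scale: x -> 7 - x

print(scored.tolist())       # [[6, 6, 4, 5, 5], [3, 1, 2, 2, 1]]
sumscores = scored.sum(axis=1)
print(sumscores.tolist())    # [26, 9]
```

After this step, a higher score on every item, and hence a higher sumscore, is meant to indicate a stronger disposition towards actively open-minded thinking.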

Mokken scale analyses
We conducted an exploratory Mokken scale analysis (Mokken, 1971; Molenaar & Sijtsma, 2000) aiming to extract a valid shorter version of the AOT from the total item pool. The Mokken scale analysis was performed using the Monotone Homogeneity Model and the Automated Item Selection Procedure algorithm from the 'mokken' package in R (R Development Core Team, 2008; Van der Ark, 2007). The Automated Item Selection Procedure partitions a set of items from an item pool into one or more scales. Items included in such a scale need to have sufficient discriminative power. Items that do not, or only very weakly, discriminate between persons with varying latent trait levels are left unscalable (Sijtsma & Molenaar, 2002).
In contrast to factor analysis, Mokken scale analysis requires only a few assumptions and is, therefore, robust to problems concerning the distribution of the underlying data. First, the model underlying Mokken scale analysis assumes unidimensionality, which means that all items in a particular test measure the same latent trait (Sijtsma & Molenaar, 2002). Second, the underlying model assumes local independence, which means that a person's response to an item is not influenced by his or her responses to the other items in the test, given the underlying latent trait. If, for example, students gain knowledge during the test which they can use to answer later items in that same test, the assumption of local independence is violated. The third and final assumption is that the probability of answering an item correctly (or, in the case of polytomous items, the probability of agreeing with the item) increases or stays the same as the ability level increases; put more technically, the item response functions (IRFs) are monotonically non-decreasing (Sijtsma & Molenaar, 2002). Furthermore, Mokken scale analysis is suitable for the analysis of categorical data. The AOT items all have six ordered response categories, ranging from "disagree strongly" to "agree strongly". Inspecting the frequency distributions of the 41 AOT items indicated that for some items the distribution was skewed (for these results, see https://osf.io/4hxzu/). As such, Mokken scale analysis was most suitable for our AOT data, as it does not assume the data to be normally distributed.
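The monotonicity assumption can be probed empirically via "manifest monotonicity": group respondents by their rest score (the sum of their scores on all other items) and check that the mean item score does not decrease from lower to higher rest-score groups. The sketch below illustrates the idea on simulated data; it is a simplified stand-in for the checks implemented in the R 'mokken' package, which uses binned rest scores and significance tests:

```python
import numpy as np

def manifest_monotonicity_violations(data, item, n_groups=3):
    """Count adjacent rest-score groups in which the mean score on `item`
    decreases; under the monotone homogeneity model the mean item score
    should be non-decreasing in the rest score."""
    rest = data.sum(axis=1) - data[:, item]      # rest score: sum of all other items
    order = np.argsort(rest)
    groups = np.array_split(order, n_groups)     # low / middle / high rest-score groups
    means = [data[g, item].mean() for g in groups]
    return sum(means[k + 1] < means[k] for k in range(len(means) - 1))

# Simulated 6-category Likert responses driven by a single latent trait.
rng = np.random.default_rng(1)
theta = rng.normal(size=500)
p = 1 / (1 + np.exp(-theta))                     # trait mapped to (0, 1)
data = np.clip(np.round(1 + 5 * p[:, None] + rng.normal(0, 0.7, (500, 4))), 1, 6)

print([manifest_monotonicity_violations(data, j) for j in range(4)])
```

With a strong common trait, as simulated here, no violations are expected; items from a poorly scalable pool would show decreasing group means.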
Three scalability coefficients were used to determine whether or not items formed a scale, and as diagnostics to assess the strength of the scales (Kuijpers, 2015): (1) the item-pair scalability coefficient H ij , which expresses the strength of the association between items i and j given their marginal distributions; (2) the item scalability coefficient H j , which expresses how well item j fits with the other items in a test and also indicates the extent to which item j discriminates between respondents (Sijtsma & Molenaar, 2002, p. 66); and (3) the total-scale scalability coefficient H, which expresses the degree to which respondents can be ordered by means of a set of items (Sijtsma & Molenaar, 2002, pp. 36, 39). A set of items can be used to order individuals on the assumed latent trait if (1) all H ij ≥ 0 (i.e., the underlying model implies positive inter-item covariances) and (2) H j > c > 0 for all j. The latter means that all item scalability coefficients should be at least positive, and preferably above a positive lower bound c (by default set to .3), so that non-discriminating or only weakly discriminating items are excluded from the scale. As follows from these two criteria, the value of the total-scale scalability coefficient H should be at least .3 (Kuijpers, Van der Ark, & Croon, 2013; Mokken, 1971; Molenaar & Sijtsma, 2000). H values lower than .3 indicate that the set of items is poorly scalable. Finally, note that sufficient scalability coefficients imply that a set of items can be used to order individuals on an assumed latent trait; they do not, however, automatically imply that this set of items measures a unidimensional construct (Smits, Timmerman, & Meijer, 2012). To gain insight into the dimensionality of a scale, factor modeling is a more suitable method.
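For dichotomous items, these coefficients have a simple closed form: H ij = cov(X i , X j ) / covmax(X i , X j ), where covmax is the maximum covariance attainable given the items' marginals (for item popularities p i and p j , covmax = p min (1 − p max )), and H j and H pool the (co)variances per item and over the whole scale. The sketch below assumes dichotomous items for simplicity; the AOT items are polytomous, for which covmax is derived from the ordered category marginals, but the logic is the same:

```python
import numpy as np

def scalability(data):
    """Mokken scalability coefficients for dichotomous (0/1) item scores:
    H_ij = cov(i, j) / covmax(i, j), with covmax(i, j) = p_min * (1 - p_max)
    for item popularities p_i, p_j."""
    k = data.shape[1]
    p = data.mean(axis=0)
    cov = np.cov(data, rowvar=False, bias=True)
    covmax = np.array([[min(p[i], p[j]) * (1 - max(p[i], p[j]))
                        for j in range(k)] for i in range(k)])
    off = ~np.eye(k, dtype=bool)                  # exclude the diagonal
    H_ij = cov / covmax
    H_j = np.array([cov[j, off[j]].sum() / covmax[j, off[j]].sum()
                    for j in range(k)])
    H = cov[off].sum() / covmax[off].sum()
    return H_ij, H_j, H

# A perfect Guttman pattern (easier items are always endorsed whenever a
# harder item is) attains the theoretical maximum H = 1.
theta = np.linspace(0, 1, 8)                      # 8 respondent "abilities"
delta = np.array([0.2, 0.5, 0.8])                 # 3 item difficulties
guttman = (theta[:, None] > delta[None, :]).astype(int)
_, H_j, H = scalability(guttman)
print(round(H, 3))  # 1.0
```

Real data fall between 0 and 1; the .3 lower bound discussed above is a conventional cut-off on this scale.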

Confirmatory factor analyses
To gain insight into the dimensionality of the AOT and for comparability with prior work by Stanovich and West (2007) and by Svedholm-Häkkinen and Lindeman (2018), we also ran two CFAs on datasets A and B. First, we ran a CFA on the model proposed by Stanovich and West (2007): a one-factor model with the 41 items as indicators of a single trait. Second, we ran a CFA based on the final model proposed by Svedholm-Häkkinen and Lindeman (2018): a 17-item version with four intercorrelated factors without a higher-order factor. We used the 'lavaan' package in R (R Development Core Team, 2008; Rosseel, 2012) with robust weighted least squares (WLSMV) as the estimation method; this estimator is regarded as most suitable for categorical data (Brown, 2006). To be fully consistent with Svedholm-Häkkinen and Lindeman (2018), we also ran the CFAs with ML estimation, which yielded highly similar results (for these results, see https://osf.io/4hxzu/). We followed the guidelines by Hu and Bentler (1999) to examine model fit: values close to .95 for the Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI), in combination with values close to .06 for the Root Mean Square Error of Approximation (RMSEA) and .08 for the Standardized Root Mean Square Residual (SRMR), are needed to conclude that there is a relatively good fit between the hypothesized model and the observed data. We used the standardized factor loadings to determine whether test items could discriminate between respondents with varying trait levels (Brown, 2006); standardized factor loadings should be at least .4 to be considered sufficient.
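Both incremental indices can be recovered from the model and baseline (null-model) chi-squares. The sketch below shows the standard formulas; the baseline chi-square and degrees of freedom in the example are hypothetical, since the baseline statistics for our models are not reported here:

```python
def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative Fit Index: improvement of the model over the baseline
    (null) model in which all observed variables are uncorrelated."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, d_m)
    return 1.0 - d_m / d_b

def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis Index (non-normed fit index), based on the
    chi-square-to-df ratios of the model and the baseline model."""
    return ((chi2_b / df_b) - (chi2_m / df_m)) / ((chi2_b / df_b) - 1.0)

# Model chi-square from dataset A below; baseline values are hypothetical.
print(round(cfi(3508.85, 779, 9000.0, 820), 3))
print(round(tli(3508.85, 779, 9000.0, 820), 3))
```

Values near 1 indicate that the hypothesized model fits much better than the null model; values far below the .95 guideline signal poor incremental fit.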

Results

Mokken scale analyses
The Mokken scale analysis performed on the first dataset (A) showed that no subset of items could be constructed that validly orders individuals on the assumed latent trait. None of the 41 items discriminated sufficiently between respondents with varying latent trait levels, all H j s ≤ .182, H = .105 (for all 41 H j s and the item-pair scalability coefficients, see https://osf.io/4hxzu/). Furthermore, the explorative analyses indicated that 18 out of 41 items formed eight separate scales of two to three items each. The remaining 23 items were left unscalable; that is, these items discriminated even more poorly and/or covaried negatively with items included in one of the eight scales. Table 2 shows the eight scales and the item scalability coefficients (H j ) with the corresponding standard errors for the 18 scalable items (for the item-pair scalability coefficients, see https://osf.io/4hxzu/). Only two items had an H j that was significantly above .3 and none of the scales consisted entirely of items with H j significantly > .3. With regard to the scales' total scalability coefficients, only the first cluster (C1) … In addition to the finding that the scales did not discriminate between respondents, the scales appeared to be unreliable (C1: latent class reliability coefficient (LCRC; Van der Ark, Van …). Findings for dataset B were more or less similar. Again, none of the 41 items discriminated sufficiently between respondents with varying latent trait levels, all H j s ≤ .195, H = .100 (for all 41 H j s and the item-pair scalability coefficients, see https://osf.io/4hxzu/). The explorative analyses identified seven separate scales, with five items in the first scale and two items in each of the other six scales. The 24 remaining items were left unscalable.
Table 3 shows the seven scales and the item scalability coefficients (H j ) with the corresponding standard errors for the 17 scalable items (for the item-pair scalability coefficients, see https://osf.io/4hxzu/). Again, only two items had an H j that was significantly above .3 and none of the scales consisted entirely of items with H j significantly > .3. With regard to the scales' total scalability coefficients, none of the clusters had a coefficient significantly > .3. Thus, the results of the Mokken scale analyses on both datasets A and B suggested that the 41 items together could not discriminate sufficiently between respondents with varying latent trait levels. Furthermore, no item set of the AOT could be obtained that validly orders individuals on the assumed latent trait. The item scales that were found did not have sufficient discriminative power and had poor reliability. Moreover, it has been argued that using many subscales with only two or three items can have a negative impact on the reliability, validity, and measurement precision of a scale (Kruyen, Emons, & Sijtsma, 2012, 2013; Mellenbergh, 1996; Reise & Waller, 2009).

Confirmatory factor analyses
We first conducted a CFA on datasets A and B, testing the one-factor model on all 41 items as proposed by Stanovich and West (2007). For dataset A, we obtained mixed results on the model fit indices. Regarding the absolute fit indices, the Chi-square test indicated a poor fit, χ²(779) = 3508.85, p < .001, which could be expected given the large sample size. RMSEA and SRMR, on the other hand, indicated an acceptable fit, that is, an acceptable discrepancy between the hypothesized model (with optimal parameter estimates) and the actually obtained sample data (covariance matrix), RMSEA = 0.061; SRMR = 0.070. The incremental fit indices (analogous to R²), however, indicated poor fit: the tested one-factor model improved the fit only marginally compared to the null model (in which all observed variables are uncorrelated), CFI = 0.686; TLI = 0.669. Following the guidelines of Hu and Bentler (1999), we concluded that the model did not describe the data adequately. Furthermore, 25 out of 41 items had small standardized factor loadings (< .4; for all factor loadings, see https://osf.io/4hxzu/), indicating that those items could not discriminate between respondents with varying trait levels (Brown, 2006). In line with Stanovich and West (2007), the scale as a whole was reliable, α = .81.
This same model tested on dataset B was over-identified, which indicates that the model should not be interpreted. Moreover, the WLSMV estimator could therefore not be used to compute robust standard errors and adjusted test statistics; the model's parameters could only be estimated using diagonally weighted least squares (DWLS). These results showed estimates largely similar to those found for dataset A with the WLSMV estimator, χ²(779) = 2628.24, p < .001; RMSEA = .068; SRMR = .074; CFI = 0.829; TLI = 0.820. Again, 25 items had small standardized factor loadings (< .4) and Cronbach's alpha for the total scale was .80.
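The distinction drawn above between absolute and incremental fit can be made concrete with the standard chi-square-based formulas (exact values vary slightly by software, e.g., N versus N − 1 in the RMSEA denominator). The sketch below reproduces the reported RMSEA for dataset A from the reported χ² and N; the null-model chi-square is not reported in the text, so the value used for CFI/TLI here is a made-up illustrative number.

```python
import math

def fit_indices(chisq, df, n, chisq_null, df_null):
    """Approximate RMSEA, CFI, and TLI from chi-square statistics."""
    # RMSEA: misfit per degree of freedom, scaled by sample size.
    rmsea = math.sqrt(max(chisq - df, 0) / (df * (n - 1)))
    # CFI/TLI: improvement of the tested model over the null model
    # in which all observed variables are uncorrelated.
    d_model = max(chisq - df, 0)
    d_null = max(chisq_null - df_null, 0)
    cfi = 1 - d_model / max(d_null, d_model)
    tli = ((chisq_null / df_null) - (chisq / df)) / ((chisq_null / df_null) - 1)
    return rmsea, cfi, tli

# Dataset A, one-factor model: chi2(779) = 3508.85, N = 930. The null-model
# statistic (9500 on 820 df, i.e., 41*40/2 correlations) is hypothetical.
rmsea, cfi, tli = fit_indices(3508.85, 779, 930, 9500.0, 820)
# rmsea comes out near the reported .061; cfi and tli depend on the
# hypothetical null-model chi-square and are for illustration only.
```

This makes visible why the pattern reported above can occur: RMSEA divides a large χ² by many degrees of freedom and a large N and can look acceptable, while CFI/TLI stay low whenever the model explains only a small share of the covariation that the null model leaves unexplained.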
Next, we conducted a CFA on datasets A and B, testing the intercorrelated four-factor model without a higher-order factor on the 17-item AOT as proposed by Svedholm-Häkkinen and Lindeman (2018). For both datasets, we obtained mixed results. For dataset A, the absolute fit indices indicated an acceptable fit, χ²(113) = 763.84, p < .001; RMSEA = 0.079; SRMR = 0.066, whereas the incremental fit indices indicated a poor data fit, CFI = 0.782; TLI = 0.737. Hence, this model also did not describe the data adequately. Additionally, five out of 17 items had small standardized factor loadings (< .4; for all factor loadings, see https://osf.io/4hxzu/). The scale as a whole had a Cronbach's alpha of .67, and the subscales Dogmatism, Fact Resistance, Liberalism, and Belief Personification had alphas of .53, .56, .32, and .51, respectively. For dataset B, we obtained largely similar results, χ²(113) = 546.30, p < .001; RMSEA = .087, p < .001; SRMR = .079; CFI = 0.763; TLI = 0.714. Six items had small standardized factor loadings (< .4) and Cronbach's alpha for the total scale was .66. The subscale alphas were .60, .51, .24, and .46 for Dogmatism, Fact Resistance, Liberalism, and Belief Personification, respectively. In sum, neither the Mokken scale analyses nor the CFAs yielded an item set that could be used to validly order individuals on the latent trait actively open-minded thinking.

Discussion
The aim of this study was to obtain a valid shorter version of the AOT developed by Stanovich and West (2007) that could be used to order individuals on the latent trait actively open-minded thinking. Our results did not provide support for the hypothesis that either the 41-item AOT or a subset of its items measures actively open-minded thinking as a single latent trait. The Mokken scale analyses performed on two large datasets of Dutch first-year higher professional education students showed that none of the items discriminated well between students on the (assumed) latent trait. In addition, no adequate AOT subscales could be identified. These findings imply that, for the studied population, sumscores on the AOT do not provide insight into the concept it aims to measure.

Relating the current results to previous findings
Sumscores on the AOT are widely used in, for example, correlational analyses. When one computes a sumscore and assumes that it provides insight into the construct "actively open-minded thinking", one assumes that all items load on the same latent trait. However, the evidence so far, including our results, does not support this assumption. Together, the results of our Mokken scale analyses, our one-factor CFA, and the findings of Svedholm-Häkkinen and Lindeman (2018) indicate that the AOT does not measure one unitary trait. This renders the reported reliabilities for the scale as a whole (see Table 1) rather meaningless, as Cronbach's alpha assumes a unidimensional construct.
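The point that a respectable alpha does not establish unidimensionality can be demonstrated directly from the covariance-matrix form of the coefficient, α = k/(k − 1) · (1 − trace(S)/sum(S)). The covariance structure below is a hypothetical construction, not AOT data: two six-item clusters that are completely uncorrelated with each other still yield α ≈ .78, close to the total-scale alphas reported above.

```python
import numpy as np

def cronbach_alpha(S):
    """Cronbach's alpha from an item covariance matrix S."""
    k = S.shape[0]
    return (k / (k - 1)) * (1 - np.trace(S) / S.sum())

# Hypothetical 12-item scale: two 6-item clusters with unit variances,
# r = .5 within a cluster, and r = 0 across clusters (two dimensions).
block = np.full((6, 6), 0.5)
np.fill_diagonal(block, 1.0)
S = np.block([[block, np.zeros((6, 6))],
              [np.zeros((6, 6)), block]])

alpha = cronbach_alpha(S)   # about .78 despite clear two-dimensionality
```

Alpha only summarizes average inter-item covariation relative to total variance; with enough items, moderately correlated clusters push it into the "acceptable" range even when no single trait underlies all items, which is why the total-scale alphas of about .80 reported for the AOT cannot rescue its unidimensionality.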
In addition to the fact that the scale does not measure a unidimensional trait, we also found no evidence for meaningful subscales. Here, our results differ somewhat from Svedholm-Häkkinen and Lindeman (2018), who found that a 17-item version of the AOT measured four separate subscales. Our Mokken scale analyses indicated that more than half of the 41 items were left unscalable (i.e., could not be included in a subscale) and that none of the (very small) subscales that were formed had sufficient discriminative power. Hence, the items included in a subscale could not order participants with varying levels of the latent trait that the subscale was potentially measuring. Furthermore, the formed subscales were not reliable. Our CFA testing the four-factor model (without one higher-order factor) proposed by Svedholm-Häkkinen and Lindeman (2018) did not describe the data adequately. We obtained acceptable values for the absolute fit indices, but poor values for the incremental fit indices. Hence, the tested four-factor model fitted the obtained sample data acceptably, but only slightly better than the worst possible model would. Furthermore, five (dataset A) and six (dataset B) of the 17 items had low factor loadings and could thus not discriminate between participants. Note also that both in our study and in the study by Svedholm-Häkkinen and Lindeman (2018), low reliabilities were obtained for the four subscales. Hence, there is still no convincing evidence that scores on the subscales can be interpreted meaningfully.

One possible explanation for the divergent CFA results relative to Svedholm-Häkkinen and Lindeman (2018) may be that the Likert-type AOT items are not suitable for FA and therefore do not yield robust results across studies (Magidson & Vermunt, 2003). A more likely explanation, however, seems to be that the AOT items do not sufficiently measure what they intend to.
Taking the Mokken scale and FAs together, it seems that the AOT items included in the available studies so far are not measuring a single psychological trait actively open-minded thinking nor any subtraits. The construct validity and content validity of the items should be improved in order to obtain a valid measurement instrument.
If it is unclear what the sumscore on the AOT represents, it is also unclear how to interpret the correlations that previous studies found between the AOT and other variables, such as other dispositions (e.g., the tendency to enjoy and engage in thinking, measured with the NFC) or performance on critical thinking tests (Heijltjes, Van Gog, Leppink et al., 2014; Svedholm-Häkkinen & Lindeman, 2018; Toplak, West, & Stanovich, 2014). The correlations between the AOT and other thinking dispositions may mean that some AOT items measure more or less the same thing as some items from other disposition questionnaires, and that the AOT sumscores therefore correlate with these variables (e.g., an item in the AOT is 'If I think longer about a problem I will be more likely to solve it' and an item in the Need for Cognition scale is 'I would prefer complex to simple problems'). It may also be that both the AOT and its criterion variables (e.g., rational reasoning) implicitly measure something else that we are not aware of and that this causes a correlation (cf. the third-variable problem).

Limitations and further research
To our knowledge, this is the first study that investigated the psychometric properties of the AOT using Mokken scale analysis, which can be considered more appropriate than the more commonly used FAs because it accommodates the categorical responses to the AOT items and is robust to violations of multivariate normality and of linear correlation between items (Flora et al., 2012; Jamieson, 2004; Liddell & Kruschke, 2018; Mokken, 1971; Sijtsma & Molenaar, 2002). Nevertheless, two potential limitations of our study should be noted. First, it is possible that all participants in our study samples were very strong actively open-minded thinkers (i.e., relatively high average item scores and therefore quite homogeneous), resulting in little or no variance in item scores. However, based on the items' distributions and the range of item scores, we consider both study samples sufficiently heterogeneous for testing the items' scalability (for these results, see https://osf.io/4hxzu/). In addition, participants in our samples had a rather similar total score on the 41-item AOT (M = 171.8, SD = 15.2) compared to the sample in Stanovich and West (2007) that introduced this version of the AOT (M = 170.7, SD = 18.2).

A second limitation is that our analyses were conducted on the Dutch translation of the AOT. To our knowledge, none of the translated versions, including ours, have been compared to data on the English version. Hence, it remains an open question to what extent findings obtained with the translated scales apply to the original English AOT. However, on theoretical grounds we see no reason to expect any strong translation effects. Moreover, the results of previous studies using translated versions of the AOT seem compatible with the results of studies using the English version.
That is, they showed comparable descriptive statistics (after correcting for the number of included items and/or the response format) and similar correlations of AOT scores with other variables (e.g., Deniz, Donnelly, & Yilmaz, 2008; Heijltjes, Van Gog, Leppink et al., 2014; Svedholm-Häkkinen & Lindeman, 2018). It should also be noted that investigating whether translated AOT scales are measurement invariant with respect to the English scale will be quite challenging as long as the factor structure is unclear. Nevertheless, based on these considerations, we cannot fully rule out the possibility that the current results were somehow affected by our use of a Dutch translation instead of the English version. Therefore, it would be interesting to replicate our Mokken scale analyses with other datasets on the English version of the AOT.

Conclusion
To conclude, the results of our study suggest that there is no item set of the 41-item version of the AOT that can be used to validly order individuals on their disposition towards actively open-minded thinking, which is a crucial assumption when using it in research. Consequently, it is questionable whether scores on the AOT provide insight into the concept it aims to measure. If the results of the present Mokken scale analyses were to replicate with English AOT data, this would be a strong argument for starting the process of (re)designing a scale to measure actively open-minded thinking, or for considering alternative measures of thinking dispositions.

Open science framework
All materials, datasets, R code, and output are stored on the Open Science Framework (OSF) page for this project, see https://osf.io/4hxzu/.

Declaration of Competing Interest
None.