Principles and assumptions of psychometric measurement

Introduction: The validity of claims about learner self-reports depends on the credibility of the measures used. Inventories developed within the psychometric tradition are expected to provide theoretical and empirical evidence for the validity and reliability of the measures to support subsequent interpretations and decisions. These practices depend on a set of assumptions based on latent trait theory that are essential to understanding what psychometric measures can do. This tutorial outlines essential characteristics of the design and reporting of psychometric self-report data.


Introduction
Decision-making about instruction, intervention, or treatment requires high-quality information about current status (i.e., strengths and weaknesses), trends in change (if any), and impact of precursor and introduced causal factors. Furthermore, understanding of the merit, worth, or value of the obtained values in the information needs to be exercised so as to lead to appropriate decisions, actions, and consequences (Scriven, 1967). Educational and clinical interventions require robust information from those who are supposed to benefit from those interventions (i.e., students, clients). If we seek knowledge, beliefs and attitudes, or intentions. Readers interested in detailed understanding of psychometric principles and practices are encouraged to read authoritative handbooks on psychometrics (Rao & Sinharay, 2007), educational measurement (Brennan, 2006), or testing (Geisinger, 2013).

Latent Theory Assumptions
Psychometric theory rests on the assumption that manifest behaviours are explained or caused by variation in latent psychological (i.e., cognitive, emotional, volitional, intentional, etc.) attributes within the individual (Borsboom, 2005). These can be called thoughts, beliefs, attitudes, or any number of invisible, non-material, but causal factors that consistently generate observable behaviours. This assumption arises out of our inability to read each other's psychological phenomena without the ability to communicate with each other verbally (Corballis, 2002) about these things. That we can talk about psychological phenomena (e.g., feelings, thoughts, beliefs, etc.) and believe that we understand what others are talking about suggests strongly that we think these things exist. Hence, latent trait theory about psychological phenomena appears to be a logical conclusion from the nature of our shared existence.
The origins of statistical or mathematical characteristics of latent trait theory date back to Spearman's application of correlation techniques (e.g., rank-order correlations, correction for attenuation, and estimation of test reliability) to patterns in test scores to infer the idea of general mental abilities ('g') that account for shared variation among scores (Clauser, 2022). General mental ability or intelligence is presumed to explain why students who do well in one subject tend to do well on other subjects. Unsurprisingly, research into the nature of intelligence has advanced to multi-dimensional models and described the causal influence of both genetic and environmental factors (Deary, 2001).
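As an illustration of one of the techniques named above, the following minimal sketch applies the standard form of Spearman's correction for attenuation, in which an observed correlation is divided by the square root of the product of the two measures' reliabilities. The function name `disattenuate` and all numeric values are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Estimate the correlation between two constructs after removing the
    attenuating effect of measurement error.

    r_xy  : observed correlation between the two sets of scores
    rel_x : reliability estimate for measure X (0 < rel_x <= 1)
    rel_y : reliability estimate for measure Y (0 < rel_y <= 1)
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical values: an observed correlation of .42 between two scales
# whose reliabilities are estimated at .80 and .70.
observed_r = 0.42
corrected_r = disattenuate(observed_r, rel_x=0.80, rel_y=0.70)
print(f"observed r = {observed_r:.2f}, disattenuated r = {corrected_r:.2f}")
```

Because reliabilities cannot exceed 1, the corrected value is never smaller in magnitude than the observed one; with perfectly reliable measures the two coincide.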
to change knowledge, skill, attitude, belief, and behaviours, we need accurate measures of where people are and where they have got to post-intervention (Messick, 1989). As such, all methods of systematically sampling individuals' skill, attitude, ability, or any other characteristic constitute 'tests' (Cizek, 2020).
Psychometrics, then, is the statistical science behind testing and measuring latent human emotions, thoughts, attitudes, beliefs, opinions, ideas, and knowledge that shape manifest behaviours. Because these phenomena exist behind the eyes, between the ears, and in the viscera of human existence, it is essential that the individual provide this information to the tester/assessor. While this information may be biased through optimistic illusions about oneself (Dunning et al., 2004), there really is no option but to obtain and exploit the information given by the individual about the state of their own feelings, cognitions, and intentions. Unlike research with physical or animal phenomena, the human is capable of knowing and communicating about internal states and traits. While intimate partners (e.g., parents, siblings, children, partners) may have privileged insights into the mind and behaviours of an individual, the person with the potentially most complete and current information is the individual herself. Observable behaviours may give useful insights into the mind and will of an individual; nevertheless, quite contrary beliefs, intentions, or motivations can lead to very similar observed actions. Hence, making sense of what is in the human mind depends on obtaining information from the owner of that mind. Other than exceptional animals that have learned to communicate with sign language, only the human individual seems able to have something to say that others cannot know unless the individual reveals that information. Thus, the reliance of psychometric research on self-reported information is both necessary and inevitable. Nonetheless, the reliability and validity of tools used to test or measure these psychological phenomena have to be demonstrated for the results to be deemed credible.
In this paper, I shall provide a high-level overview of or guide to psychometric theory and principles used to evaluate measurements of human a unidimensional latent variable model. The resulting latent variable might serve as a more effective and efficient pragmatic communication tool that represents a decidedly conscious compromise between cognitive fidelity, empirical feasibility, and utilitarian practicability. (p. 122; emphasis added) This shows that our measurement, regardless of mathematical approach, of a psychological construct always presumes that we have used a defensibly valid or theoretically robust set of proxies, which, in accordance with scientific fallibility (Popper, 1945), may turn out to be wrong.
An advantage of the quantitative assumption is that the numeric values, especially when obtained from large populations, will behave like a real-valued random variable so that mathematical manipulations (e.g., correlations, mean, standard deviation, etc.) provide insight into variation within populations as to the latent trait. Choices have to be made as to how the numbers created in measurement are statistically analysed, ranging from classical test theory to item response theory (Embretson & Reise, 2000). Nonetheless, an individual's position relative to a social norm is an important factor, within the theory of planned behaviour, in determining intentionality and behaviour (Ajzen, 1991).
Related to the possibility that latent constructs are not quantitative, we must accept that our measurements do not have true zero values or inherent scale increments. The arbitrary nature of the scales we use to measure psychological phenomena does not invalidate the existence of the trait. For example, intelligence tests tend to set the population mean to 100 and the standard deviation to 15. This is a convenient way to describe and locate individuals relative to others. The mean value of 100 could just as easily have been 500 or 1000, with a concomitant SD of 50 or 100. The meaning of any score relative to those norm values would allow us pragmatically to plan useful educational or clinical interventions. Hence, the mathematics of latent trait theory, when combined with a theoretical framework for the nature and function of that latent trait, allows us to understand and respond to needs or strengths.
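The arbitrariness of the reporting scale can be shown with a short sketch. Assuming a simple linear standardisation (an illustrative choice, not a description of how any particular test is normed), the same hypothetical raw scores are placed on a 100/15 metric and on a 500/100 metric; each person's position relative to the norm is identical under both. The helper `rescale` and the raw scores are invented for this example.

```python
from statistics import mean, pstdev


def rescale(raw_scores, new_mean, new_sd):
    """Linearly transform scores so that they have the requested mean and
    (population) standard deviation. Relative positions are unchanged."""
    m = mean(raw_scores)
    s = pstdev(raw_scores)
    if s == 0:
        raise ValueError("scores must vary to be rescaled")
    return [new_mean + new_sd * (x - m) / s for x in raw_scores]


# Hypothetical raw test scores for five people.
raw = [12, 15, 19, 22, 27]

iq_metric = rescale(raw, new_mean=100, new_sd=15)
alt_metric = rescale(raw, new_mean=500, new_sd=100)

# Distance from the norm in SD units is the same on either metric.
z_from_iq = [(x - 100) / 15 for x in iq_metric]
z_from_alt = [(x - 500) / 100 for x in alt_metric]

for person, (a, b, z) in enumerate(zip(iq_metric, alt_metric, z_from_iq), 1):
    print(f"person {person}: {a:6.1f} on 100/15, {b:6.1f} on 500/100, z = {z:+.2f}")
```

Only the labels on the scale change; the ordering of individuals and their standardised distances from the mean do not.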
The mathematical model of latent trait theory requires just one assumption about the ability underlying test-taker performance. Lord (1953) puts that assumption as: "the trait or ability under discussion can be thought of as an ordered variable represented numerically in a single dimension" (italics in original, p. 518). By extension, researchers who measure an individual's underlying orientation toward a specific mental phenomenon assume that the latent cause has ordered unidimensional characteristics so that individuals can be located as to their frequency, intensity, importance, likelihood, or valence for that construct (Allport, 1935). The widespread use of Likert's (1932) summated rating scale in social psychology manifests this assumption that there is a latent cause that explains the strength and covariance of responses to prompts, questions, or items.
It has been argued that latent factors have not been proven to be quantitative (Michell, 1999). Because evidence for latent constructs can only be obtained from manifest operations (e.g., answers to questions), research into how individuals perceive themselves or phenomena must always rest on an assumption that the underlying trait is something that can be counted or quantified. That our experiences of the world and our abilities to respond to that world differ is not an assumption but a given. Hence, it is relatively easy to accept that all individuals vary within populations around normative values for such diverse psychological phenomena as the nature of one's love for their own mother, their ability to solve problems, their ability to use language(s), and so on. Consequently, as Michell (2008) notes, latent trait theory research still has to be honest to admit that "At present we do not know whether this hypothesis [quantitative models of psychological entities] is true, but we will assume it recognizing that at some point in the future someone needs to investigate it" (p. 12).
Consequently, measurements of a latent construct are a proxy for cognitive processes which are always psychologically multidimensional (Rupp, 2008). Thus, as Rupp (2008) put it: it may still be truly meaningful, and not just statistically convenient, to summarize data via goal scores correlated positively with activation in the negative emotion brain areas during norm-referenced feedback.
A review of EEG studies identified strong emphasis in the negative emotion brain areas around feedback-related negativity (Meyer et al., 2021), but suggested that, because multiple brain regions are involved in these processes and because of the limitations of the EEG method, modifications are needed to understand how feedback relates to the brain. The complexity of the brain can also be seen in how visual representations of familiar objects and people are located in multiple brain locations (Quiroga et al., 2005). While laboratory studies using fMRI or EEG may reveal how the brain interacts with the mind, the challenge is how those studies relate to the complexities of functioning in real-world environments. There are differences in results when organisms are studied in laboratory glass tubes (i.e., in vitro) and when they are released into living populations (i.e., in vivo) (Autoimmunity Research Foundation, 2012), let alone in how they might behave in a cyber or simulated environment (i.e., in silico). How well human self-reported perceptions about feedback map to brain activity when anticipating or receiving feedback is still not evaluated.
A well-established approach to establishing validity of measures of a psychological construct is analysis of how individual self-reports relate to outcome measures. This is well established in educational testing systems, such as the OECD's Programme for International Student Assessment (PISA) surveys. Marsh et al. (2006) showed that self-reported psychological constructs (i.e., self-reported interest, self-concept, and self-efficacy) collected in PISA from >100,000 students in 2000 had statistically significant, but modest (.25 < r < .35), relationships with performance, with invariance across 25 different jurisdictions. Two things need to be said about this result. First, the relationships are consistent with theoretical expectations of how these psychological constructs function. Second, the effect is modest, in part because in vivo contexts are so complex and because individuals may have variability in the constructs they endorse. Indeed, cluster analysis of motivational variables within assessment

External Validation
That latent theory for psychological phenomena is pragmatic does not mean that a latent construct matters. Observation of systematic relationships between variation in the latent construct and observable behaviours and outcomes in the real world is needed to establish whether a hypothesised latent trait matters. Evidence from observable behaviours or independently generated outcome scores overcomes the bias of self-report as the sole source of data. Gold-standard validation of self-reported scores requires comparison with other measures that are theoretically similar or different (i.e., convergence and divergence; Campbell & Fiske, 1959). A self-report score that is highly correlated with a previously developed measure of a related construct provides convergent evidence. In contrast, low correlations with a measure of a completely different construct provide divergent evidence for the meaningfulness of the measure. Likewise, divergence across time or informant may also call into question the validity of a self-report. Ideally, the convergent and divergent measures will avoid the methodological weakness of self-report data, which is the point of the multi-trait, multi-method approach. Data from biometric evidence (e.g., fMRI), online behaviour, test scores, attendance, and other physical measurements will show if variation in scores is meaningfully related (i.e., causally or explanatorily) to constructs that should be sensitive to the construct (Borsboom et al., 2004; Zumbo, 2009).
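The logic of convergent and divergent evidence can be sketched with plain Pearson correlations. The three score lists below are invented purely for illustration (they come from no real inventory), and the variable names are hypothetical. The sketch checks only the expected pattern, a strong correlation with a related measure and a weak one with an unrelated measure; it is not a full multi-trait, multi-method analysis.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / math.sqrt(ss_x * ss_y)


# Invented scores for six respondents (illustration only).
new_scale = [2, 4, 5, 7, 9, 10]        # the self-report measure being validated
related_measure = [3, 4, 6, 6, 9, 11]  # established measure of a related construct
unrelated_measure = [5, 9, 4, 8, 5, 7] # measure of a theoretically different construct

r_convergent = pearson_r(new_scale, related_measure)
r_divergent = pearson_r(new_scale, unrelated_measure)

print(f"convergent r = {r_convergent:.2f}")
print(f"divergent  r = {r_divergent:.2f}")
```

If all three measures were self-reports, some of the convergent correlation could reflect shared method rather than shared construct, which is why independent behavioural or outcome data are preferred where available.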
Modern neuropsychology attempts to find specific organs of the brain that map to psychological processes (e.g., firing of mirror neurons when physically grasping is related to understanding others; Kaplan & Iacoboni, 2006). An fMRI study revealed that, when participants were given bogus feedback about performance, brain regions associated with negative affect (i.e., posterior cingulate cortex, the medial frontal gyrus, and the inferior parietal lobule) were activated when norm-referenced feedback was given to low-competence participants and also when criterion-referenced feedback was given to high-competence participants (Kim, Lee, Chung, & Bong, 2010). Further, performance-approach the coherence of the various stimuli or prompts used to elicit responses from individuals to that theoretical specification. Pilot studies (International Test Commission, 2018), expert judgement panels (McCoach et al., 2013), participant think-aloud studies (van Someren et al., 1994), cognitive interviews (Karabenick et al., 2007), and so on, are used to demonstrate that there is evidence that the instrument has prima facie alignment with what it is intended to measure.
Reports of how that test or battery is administered and the kinds of data collected are essential to give confidence that the protocols are replicable and theoretically in accord with the domain. Evaluations that test the theoretically preferred model against alternative explanations also provide evidence that the scales are sound (Cronbach, 1988). Hence, evaluation of the internal structure of an inventory should include multiple competing alternatives.

Internal Structure
The psychometric industry takes a scientific approach in which data collection should generate consistent patterns of responses amongst individuals in accordance with their varying responses to a phenomenon. A key constraint on measuring psychological constructs is that they are in and of themselves not directly observable; they are latent (Borsboom, 2005). As such, multiple indicators for multiple causes (MIMIC) are used to reduce error in estimation of the strength and direction of attitude, belief, or value and to better represent the phenomenon of interest (Jöreskog & Goldberger, 1975). With sufficient samples and theoretically designed measurements, mathematical modelling of MIMIC response patterns (e.g., estimate of internal reliability, factor analysis) is used to create evidence for the structure and dependability of the proposed scales or factors (Haertel, 2006).
Approaches that provide evidence about the pattern of responses include scale reliability, which can be estimated in multiple ways. Although most researchers are familiar with Cronbach's (1951) alpha, extensive research indicates McDonald's (1999) omega and Hancock and Mueller's (2001) coefficient H are superior meth-of mathematics performance showed that in almost all of the 12 nations analysed across three waves of data, there were individuals who did not have consistent self-reported scores across the three motivational scores and performance depended more or less on the mix of motivations (Michaelides et al., 2019).
Nonetheless, self-reporting about one's own psychological phenomena is a fraught domain. Not only do humans suffer from memory problems about their experiences (Schacter, 1999), but they also suffer from ignorance about themselves and their competence (Dunning, Heath & Suls, 2004), in part because being honest about inadequacies or failure may threaten their ego (Boekaerts & Corno, 2005); or, among adolescents, it may be 'fun' to subvert surveys (Fan et al., 2006). To overcome these threats, psychometrics proposes a number of methods, outlined below.

Theoretical Grounding
The field of psychometrics has argued extensively about how to establish validity evidence for any measure of psychological phenomena (Kane & Bridgeman, 2022). Prior to Messick (1989), validity tended to be thought of in terms of multiple types (i.e., face, content, concurrent, construct, and predictive). However, contemporary understanding is that validity is a unitary concept best captured as 'construct validity' (Cizek, 2020). Evidence for a degree of validity judgment (e.g., 'preponderance of evidence', 'clear and convincing evidence', 'substantial evidence'; Cizek, 2020, p. 26) is achieved by consideration of the various empirical and theoretical arguments for the proposed interpretation of a measure (Cizek, 2020; Kane, 2006; Messick, 1989).
Validation evidence includes the theoretical and explanatory qualities of the measurement tool or instrument; the stimulus items, prompts, or tasks presented to elicit responses need to be theoretically aligned to expert, theoretically informed definitions of a domain that include hypotheses of how the construct will influence responding (American Educational Research Association et al., 2014; Schmeiser & Welch, 2006). This means that considerable effort should have gone into specifying the domain and then developing and testing the same population or not. Ideally, the psychometric characteristics of the measurement tool should be equivalent within chance when applied in a new sample drawn from the same population. Clearly, non-invariance should be expected when samples are from divergent populations. This does not mean the measurement is broken; rather, it suggests that the measurement works differently, or the construct being measured is different across language, age, culture, prosperity, or educational boundaries. Consider the non-invariance found in the Teacher Conceptions of Feedback inventory between New Zealand and Louisiana, which have very different policy frameworks (Brown, Harris, O'Quin & Lane, 2017). Measurements that are deployed with new samples have greater opportunity to generate validation evidence by overcoming chance artefacts associated with the development of the measurement. Consider the similarity of the Teacher Conceptions of Assessment inventory within New Zealand and its lack of invariance across jurisdictions and languages (Brown, Gebril, & Michaelides, 2019).
An unfortunate side effect of emphasis on and the complexity of statistical and mathematical modelling of the internal structure of a measurement is that validating evidence from external measures tends to be overlooked. Clearly, it is harder to collect evidence from independent samples, to ask participants to complete parallel or divergent measures at the same time as a new one, and even more difficult to collect independent behavioural evidence so as to make a strong case that a new measure is not only psychometrically robust but also has theoretical and empirical evidence for the relationship between what the mind reported and what can be seen from the outside. Nonetheless, without the statistical and mathematical evidence of how a new measure actually works it will not be possible to test theoretical claims about how humans think, feel, and behave. Furthermore, given the complexity of factors impinging upon performance, the actual effect of any specific perception may be quite small. When effects are small and responsive to environmental conditions, they are inherently hard to replicate (Lindsay, 2015). ods for establishing reliability of a set of items. Within the Rasch modeling framework, Wright and Stone's (1999) item separation G can be used to claim that items elicit responses coherent with that version of item response theory. Researchers should be aware that very high reliability results (e.g., alpha > .90) can be obtained by writing items that are almost identical in wording (i.e., have high homogeneity), producing a 'bloated specific' (Cattell & Tsujioka, 1964).
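The caution about very high alpha values can be illustrated with a minimal sketch of Cronbach's alpha, computed in its usual form from the item variances and the variance of the summed score. Alpha is used here only because it is the simplest coefficient to show; the sketch does not implement omega or coefficient H. Both response matrices are invented for illustration: the first mimics four near-identically worded items, the second four items that sample a construct more broadly.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a respondents-by-items table of scores.

    rows: list of lists, one inner list of item scores per respondent.
    """
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("need at least 2 items and equal-length rows")
    items = list(zip(*rows))                      # one tuple per item
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Invented responses (1-5 scale) from six respondents to four items.
near_identical_items = [
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [3, 3, 3, 3],
    [4, 4, 4, 4],
    [5, 5, 5, 4],
    [3, 3, 4, 3],
]
broader_items = [
    [1, 3, 2, 4],
    [2, 2, 4, 3],
    [3, 4, 3, 2],
    [4, 3, 5, 4],
    [5, 4, 3, 5],
    [3, 5, 4, 3],
]

alpha_near_identical = cronbach_alpha(near_identical_items)
alpha_broader = cronbach_alpha(broader_items)

print(f"alpha, near-identical items: {alpha_near_identical:.2f}")
print(f"alpha, broader item set:     {alpha_broader:.2f}")
```

The near-identical set yields the higher coefficient simply because its items are redundant, which is the 'bloated specific' pattern: a high alpha by itself does not show that a construct has been well sampled.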
While principal component analysis can identify underlying vectors in a data matrix, psychometric theory, with its emphasis on error, relies on the common factor model to identify shared and unique factors underlying the same data (Bryant & Yarnold, 1995). Conventional criteria exist to guide interpretation of the statistical results (Bandalos & Finney, 2010) so that the quality of evidence for the internal structure of a research or measurement tool can be evaluated (i.e., scales with little or poor evidence can be ignored, while those with robust evidence can be used). Once robust measurement of latent constructs is established, scores for each factor can be derived (DiStefano, Zhu, & Mîndrilă, 2009).
With scale scores, individuals and groups can be distributed by their scores, which allows comparison of one construct to another and the comparison of score differences between groups and over time. These analytic techniques are robustly associated with the field of psychometrics as they address issues of demonstrating statistically that the theoretical expectations have been met. These practices help create an argument for the trustworthiness of the information obtained from humans about themselves and for interpretations and uses of that data.

Replicability
In the spirit of scientific research, replication studies can examine the stability of psychometric properties across samples (Makel et al., 2012). Statistical techniques such as multigroup confirmatory factor analysis allow researchers to establish whether statistical models of how participants respond to an instrument vary according to whether the new sample is drawn from gent and divergent measures, experimentally manipulated scores, and created large-sample norms. Indeed, users of any psychometric measure should expect evidence of this kind before settling on the use of a new tool or inventory.

Challenges
A fundamental problem within psychology is that everything in the life experiences, environments, and physiology of individuals influences everything they think, feel, believe, say, or do. So, it should not surprise us that the impact of any single psychological construct should be relatively small because it interacts with all other things that also matter. Efforts to isolate and understand important psychological factors in human life have unfortunately led to widespread jingle-jangle (i.e., same words with different meanings or same meanings with different words) in the field (Flake & Fried, 2020). Nonetheless, latent trait theory provides us with a way into the mind, heart, and mental representations of individuals.
Good psychometric evidence for measurements of any psychological construct should be able to: 1. Demonstrate fidelity to a theoretically robust description of what the construct is, how it functions, and what it should do; 2. Provide evidence that the proposed operationalisation has prima facie credibility against that theory; 3. Demonstrate robust statistical evidence for the coherence of items against the construct design of a measurement model; 4. Provide evidence that the measurements are reproducible from additional samples; 5. Provide evidence that the measurements produce effects on other measures (including self-reports), behaviours, or outcomes that align with theoretical expectations; and 6. Provide evidence that the construct can be manipulated such that measurement scores change and have the theoretically proposed effects. Readers may be dismayed at the thought that any single report should necessarily achieve all of these things. However, an excellent example of how these concerns can be addressed is visible in Thielsch and Hirschfeld (2019) which in seven studies provided: a theoretical framework, item set development, statistical demonstration of scale or factor properties, demonstrated test-retest reliability, validated the scales against other conver-