Measurement

Volume 78, January 2016, Pages 348-357

Testing and evaluating one-dimensional latent ability

https://doi.org/10.1016/j.measurement.2015.05.048

Highlights

  • A new approach to the evaluation of engineering ability test results is proposed.

  • Levels of difficulty of non-destructive test items are unknown beforehand.

  • The item response function is considered to be self-similar and scale invariant.

  • The main assumptions of the proposed approach are strictly formulated.

  • An algorithm for solving the problem and a numerical example are provided.

Abstract

A new approach to the evaluation of binary test results when checking a one-dimensional ability is proposed. We consider the case where a qualitatively homogeneous population of objects is tested by a set of non-destructive test items having different, but unknown beforehand, levels of difficulty, and we need to evaluate/compare both the intrinsic abilities of these objects and the levels of difficulty of the test items. We assume that the responses to different test items, applied to the same object, do not affect one another and that the same scale-invariant item response model applies to all members of the tested population of objects under test (OUTs). An OUT can be an electronic component, an examinee, a program unit, a material under test, etc. An algorithm for solving the problem, applicable to engineering testing, is proposed. It combines item response theory, maximum likelihood estimation, the method of flow redistribution and other methods. A numerical example is presented.

Introduction

The English language contains hundreds of words directly or indirectly describing different types of overt and hidden abilities: from cognitive ones, such as memory and attention, to purely technical ones, such as reliability, stability, capability, availability, durability, portability, and reusability. In this paper, we deal only with the basic and simplest case of so-called one-dimensional ability, when the test item performance of the object under test (henceforth abbreviated OUT) can be explained by a single latent ability. We consider the case when a qualitatively homogeneous population of OUTs is tested using a set of non-destructive test items having different, but unknown beforehand, levels of difficulty, and we need to evaluate/compare both the intrinsic abilities of these OUTs and the difficulties of the test items. Such a set will hereinafter be called a test; it can include any number of test items, but this number must be the same for all OUTs. For instance, in a psychometric test the OUT is an examinee, a separate question on the exam is a test item, and the examination as a whole is a test. Usually, it is assumed [1] that the test item response is evaluated on a binary (pass/fail) scale and that the results of different test items, applied to the same OUT, are conditionally independent (i.e., the response to one test item does not affect the response to another). It is also assumed that the inherent ability of the OUT is independent of the test item difficulty. Homogeneity here means that the same item response model applies to all members of the population, but in no way implies equality of the tested abilities among these members.

Even in such a simplified model, the matter of correct and effective evaluation of test results has not been resolved completely and is still a subject of discussion in psychometrics and educational measurement [2]. The extensive study of latent ability modeling and evaluation in education was pioneered by Rasch [1], who proposed a well-known model of the interconnection between the test item difficulty, the examinee's ability and the test result, based on the standard logistic distribution. The Rasch model has been extensively studied and extended over the last decades (see, e.g., [3], [4], [5], [6], [7] and references therein). The problem of estimating the Rasch model parameters is tackled mainly by using some version of maximum likelihood estimation [8].
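For reference, the dichotomous Rasch model referred to above links the probability of a successful (pass) response to the difference between the examinee's ability and the item difficulty through the standard logistic function. The notation below ($\theta_n$ for the ability of examinee $n$, $b_k$ for the difficulty of item $k$) is ours, given only for illustration:

$$P\{X_{nk} = 1 \mid \theta_n, b_k\} = \frac{\exp(\theta_n - b_k)}{1 + \exp(\theta_n - b_k)}.$$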

The problem, however, is discussed to a much lesser extent in engineering, which prefers to deal with quantifiable test results estimated on a predetermined scale of difficulties (e.g., lifetime testing). This state of affairs seems a little strange, since in the broader context any property and any OUT can be subjected to testing: people, program units, electronic components, materials, network connectivity, etc. Moreover, engineering objects of interest are more predictable and less variable, being free from purely human restrictions. The technical test population can often easily be established as more or less uniform, for instance when all the parts belong to the same production batch. In view of this, it seems desirable to develop a unifying/standardized approach to evaluating test results of such objects when the difficulty of the test items is unknown beforehand. The combination of two test items – over-stressed and overrated [9] – can serve as an example of such a test, as can testing that includes a wider variety of test items.

We propose an algorithm for test result evaluation applicable to a broad spectrum of engineering tests satisfying the model assumptions described below in Section 2. The proposed approach combines several already developed methods, allowing a reasonable numerical scheme for test result evaluation to be built. The developed algorithm is illustrated by a numerical example.

Section snippets

Testing model

Before focusing on the details of the testing model, we would like to make some general remarks. Suppose the studied ability a is distributed among the tested population of OUTs according to some cumulative distribution function (cdf) F(a). It may be a discrete distribution, but at the moment this does not matter, because our aim is to illustrate the general idea. Let d denote the difficulty (or level of difficulty) of the test item in relation to the studied ability. For the purpose of …
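To make the scale-invariance assumption mentioned in the abstract and highlights concrete, one may picture an item response function in which the pass probability depends on the ability a and the difficulty d only through their ratio. The sketch below is purely illustrative: the link function G(r) = r/(1 + r) is our assumption for this example and is not the specific form derived in the paper.

    # Illustrative sketch only: a scale-invariant item response function (IRF).
    # The pass probability depends on ability a and difficulty d only through
    # the ratio a/d, so rescaling a and d by the same factor leaves it unchanged.
    # The link G(r) = r / (1 + r) is an assumption made for this sketch.

    def irf(a: float, d: float) -> float:
        """Probability of a successful (pass) response for ability a, difficulty d."""
        if a < 0 or d <= 0:
            raise ValueError("ability must be non-negative and difficulty positive")
        r = a / d
        return r / (1.0 + r)

    if __name__ == "__main__":
        # Scale invariance: irf(a, d) == irf(c * a, c * d) for any c > 0.
        print(irf(2.0, 1.0), irf(4.0, 2.0))  # both print 0.666...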

Ranking test items

The test items are ranked from 1 to K in decreasing order of the frequency of the corresponding successful results. So the first test item, corresponding to the largest number of successes (k = 1), is considered to be the easiest, and the last, corresponding to the smallest number of successes (k = K), the most difficult. The only problem here can occur if there are two equal frequencies; in this case, the order can be set arbitrarily. In our opinion, such an effect in a well-constructed test is somewhat …
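A minimal sketch of this ranking step, assuming the binary test results are stored as a 0/1 matrix with one row per OUT and one column per test item (the variable names are ours; ties are broken arbitrarily by column index, as the text allows):

    # Illustrative sketch: rank K test items from easiest (rank 1, most successes)
    # to most difficult (rank K, fewest successes). Input: 0/1 results with
    # rows = OUTs and columns = test items.

    def rank_items(results: list[list[int]]) -> list[int]:
        """Return item (column) indices ordered from easiest to most difficult."""
        k = len(results[0])
        successes = [sum(row[j] for row in results) for j in range(k)]
        # Sort column indices by decreasing number of successes; Python's stable
        # sort breaks ties by the original column index, i.e., arbitrarily.
        return sorted(range(k), key=lambda j: -successes[j])

    if __name__ == "__main__":
        results = [[1, 1, 0],
                   [1, 0, 0],
                   [1, 1, 1],
                   [0, 1, 0]]
        # Items 0 and 1 tie with 3 successes each (tie broken by index);
        # item 2 has 1 success and is ranked most difficult.
        print(rank_items(results))  # [0, 1, 2]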

Numerical example

In this example, K = 3. The proportions $p_{\mathrm{sequence}}$ are given in Table 3.

The initial approximation of the vector $\mathbf{d}$ is $\mathbf{d}^{(0)} = (1, 2, 3)^{T}$. The value of $a_{111}^{(i)}$ (a “maximal possible” ability) is defined as

$$a_{111}^{(i)} = r_{\max}\, d_{1}^{(i)},$$

where $r_{\max}$ is a maximal value of ability, normalized by the difficulty $d_{1}$. In this simulation, $r_{\max} = 5$. The stopping criterion is

$$\delta^{(i)} = \left[ \sum_{k=1}^{K} \left( d_{k}^{(i)} - d_{k}^{(i-1)} \right)^{2} + \sum_{m=1}^{2^{K}} \left( a_{\mathrm{sequence}(m)}^{(i)} - a_{\mathrm{sequence}(m)}^{(i-1)} \right)^{2} \right]^{1/2} < 0.001, \quad i = 2, 3, \ldots$$

The values $d_{1}^{(i)}, d_{2}^{(i)}, d_{3}^{(i)}$, as well as the values $a_{100}^{(i)}, a_{010}^{(i)}, a_{001}^{(i)}$ and $a_{110}^{(i)}, a_{101}^{(i)}, a_{011}^{(i)}$, are …
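A minimal sketch of the stopping rule above, assuming the current and previous iterates of the difficulty vector d (length K) and of the abilities a_sequence (length 2^K) are stored as plain lists; the function and variable names are ours, for illustration only:

    import math

    # Illustrative sketch: stop the iterations when the Euclidean distance between
    # successive iterates of the difficulties d and the abilities a_sequence
    # drops below the tolerance 0.001.

    TOL = 1e-3

    def delta(d_new, d_old, a_new, a_old) -> float:
        """Euclidean distance between successive iterates of (d, a_sequence)."""
        sq = sum((x - y) ** 2 for x, y in zip(d_new, d_old))
        sq += sum((x - y) ** 2 for x, y in zip(a_new, a_old))
        return math.sqrt(sq)

    def converged(d_new, d_old, a_new, a_old) -> bool:
        return delta(d_new, d_old, a_new, a_old) < TOL

    if __name__ == "__main__":
        d_old, d_new = [1.0, 2.0, 3.0], [1.0004, 2.0003, 2.9998]  # K = 3
        a_old, a_new = [0.0] * 8, [0.0001] * 8                    # 2**K = 8 abilities
        print(converged(d_new, d_old, a_new, a_old))              # True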

Summary

We have proposed a new approach to the evaluation of test results when the test item response is binary and the difficulties of the K test items are not known a priori. The evaluation is carried out after the test has been performed on some population of objects under test (OUTs), i.e., a posteriori, and involves determination of the following values:

  • difficulties of the test items;

  • abilities assigned to the $2^K$ possible test results (a brief enumeration sketch for K = 3 follows this list);

  • anticipated distribution of the above abilities among the tested population.
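To illustrate the second item in the list for the numerical example with K = 3, the sketch below simply enumerates the 2^K = 8 possible binary test results; the ability value attached to each result is estimated by the algorithm and is not reproduced here:

    from itertools import product

    # Illustrative sketch: for K test items there are 2**K possible binary test
    # results, each of which is assigned its own ability value a_sequence.

    K = 3
    patterns = ["".join(map(str, bits)) for bits in product((1, 0), repeat=K)]
    print(patterns)
    # ['111', '110', '101', '100', '011', '010', '001', '000'] -> 2**3 = 8 results,
    # matching the labels a_111, a_110, ..., a_000 of the numerical example.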

References (13)

  • G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests (1960)

  • L. Crocker et al., Introduction to Classical and Modern Test Theory (2006)

  • B.D. Wright et al., Best Test Design (1979)

  • B.D. Wright et al., Rating Scale Analysis: Rasch Measurement (1982)

  • G.N. Masters, A Rasch model for partial credit scoring, Psychometrika (1982)

  • G.H. Fischer et al., Rasch Models: Foundations, Recent Developments and Applications (1995)

Cited by (9)

  • Addressing traceability of self-reported dependence measurement through the use of crosswalks

    2021, Measurement: Journal of the International Measurement Confederation
  • Binary test design problem

    2018, Measurement: Journal of the International Measurement Confederation
    Citation Excerpt:

    The difference between psychometrical, technical, financial, statistical, physical and other tests consists in the models used to describe the specific item response function (IRF). In technical, financial, statistical, physical testing, the response models may differ significantly and no longer have the remarkable properties of the Rasch model [35], while acquiring some other properties (such as self-similarity, for example [5,6,12]). The mathematical expressions of the IRF in most cases are quite distinct. Therefore, it makes sense to discuss the problem of binary test planning from the most general principles point of view.

  • Theory-based metrological traceability in education: A reading measurement network

    2016, Measurement: Journal of the International Measurement Confederation
    Citation Excerpt:

    The organic integration of theory, data, and instruments in institutional contexts sensitive to ground-up self-organizing processes requires systematic conceptualizations of measurement as a distributed process, where scientific fields, markets, and societies operate as massively parallel stochastic computers [66,67]. Recent comparisons of engineering and psychometric perspectives on the possibility of such systems in education suggest a viable basis for such conceptualizations [68–73]. Metrological traceability systems of this kind [24] will integrate qualitative progressions in learning defined by predictive theories of causal relations [49], construct maps [74], and associated item hierarchies in educational assessments generally.

  • Item response function in antagonistic situations

    2020, Applied Stochastic Models in Business and Industry