Measurement

Volume 78, January 2016, Pages 348-357

Testing and evaluating one-dimensional latent ability

https://doi.org/10.1016/j.measurement.2015.05.048

Highlights

  • A new approach to the evaluation of engineering ability test results is proposed.

  • Levels of difficulty of non-destructive test items are unknown beforehand.

  • The item response function is considered to be self-similar and scale invariant.

  • The main assumptions of the proposed approach are strictly formulated.

  • An algorithm for solving the problem and a numerical example are provided.

Abstract

A new approach to the evaluation of binary test results when checking a one-dimensional ability is proposed. We consider the case where a qualitatively homogeneous population of objects is tested by a set of non-destructive test items having different, but unknown beforehand, levels of difficulty, and we need to evaluate/compare both the intrinsic abilities of these objects and the levels of difficulty of the test items. We assume that the responses to different test items, applied to the same object, do not affect one another and that the same scale-invariant item response model applies to all members of the tested population of objects under test (OUTs). An OUT can be an electronic component, an examinee, a program unit, a material under test, etc. An algorithm for solving the problem, applicable to engineering testing, is proposed. It combines item response theory, maximum likelihood estimation, the method of flow redistribution and other methods. A numerical example is presented.

Introduction

The English language contains hundreds of words directly or indirectly describing different types of overt and hidden abilities: from cognitive ones, such as memory and attention, to purely technical ones, such as reliability, stability, capability, availability, durability, portability, and reusability. In this paper, we deal only with the basic and simplest case of so-called one-dimensional ability, when the test item performance of the object under test (henceforth abbreviated OUT) can be explained by a single latent ability. We consider the case when a qualitatively homogeneous population of OUTs is tested using a set of non-destructive test items having different, but unknown beforehand, levels of difficulty, and we need to evaluate/compare both the intrinsic abilities of these OUTs and the difficulties of the test items. Such a set will hereinafter be called a test; it can include any number of test items, but this number must be the same for all OUTs. For instance, in a psychometric test the OUT is an examinee, a separate question on the exam is a test item, and the examination as a whole is a test. Usually, it is assumed [1] that the test item response is evaluated on a binary (pass/fail) scale and that the results of different test items, applied to the same OUT, are conditionally independent (i.e., the response to one test item does not affect the response to another). It is also assumed that the inherent ability of the OUT is independent of the test item difficulty. Homogeneity here means that the same item response model applies to all members of the population, but in no way implies equality of the tested abilities among these members.

Even in such a simplified model, the matter of correct and effective evaluation of test results has not been resolved completely and is still a subject of discussion in psychometrics and educational measurement [2]. The extensive study of latent ability modeling and evaluation in education was pioneered by Rasch [1], who proposed a well-known model of the interconnection between the test item difficulty, the examinee's ability and the test result, based on the standard logistic distribution. The Rasch model has been extensively studied and extended over the last decades (see, e.g., [3], [4], [5], [6], [7] and references therein). The problem of estimating the Rasch model parameters is tackled mainly by using some version of maximum likelihood estimation [8].
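For reference, the dichotomous Rasch model referred to above links the probability of a successful (pass) response to the difference between the examinee's ability and the item difficulty through the standard logistic function. The notation below ($\theta_n$ for the ability of examinee $n$, $b_k$ for the difficulty of item $k$) is ours, given only for illustration:

$$P\{X_{nk} = 1 \mid \theta_n, b_k\} = \frac{\exp(\theta_n - b_k)}{1 + \exp(\theta_n - b_k)}.$$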

The problem, however, is discussed to a much lesser extent in engineering, which prefers to deal with quantifiable test results estimated on a predetermined scale of difficulties (e.g., lifetime testing). This state of affairs seems a little strange, since in the broader context any property and any OUT can be subjected to testing: people, program units, electronic components, materials, network connectivity, etc. Moreover, engineering objects of interest are more predictable and less variable, being free from purely human restrictions. The technical test population can often easily be established as more or less uniform, for instance when all the parts belong to the same production batch. In view of this, it seems desirable to develop a unifying/standardized approach to evaluating test results of such objects when the difficulty of the test items is unknown beforehand. The combination of two test items – over-stressed and overrated [9] – can serve as an example of such a test, as can testing that includes a wider variety of test items.

We propose an algorithm for test result evaluation applicable to a broad spectrum of engineering tests satisfying the model assumptions described below in Section 2. The proposed approach combines several already developed methods, allowing a reasonable numerical scheme for test result evaluation to be built. The developed algorithm is illustrated by a numerical example.

Section snippets

Testing model

Before focusing on the details of the testing model, we would like to make some general remarks. Suppose the studied ability a is distributed among the tested population of OUTs according to some cumulative distribution function (cdf) F(a). It may be a discrete distribution, but at the moment this does not matter, because our aim is to illustrate the general idea. Let d denote the difficulty (or level of difficulty) of the test item in relation to the studied ability. For the purpose of …
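To make the scale-invariance assumption mentioned in the abstract and highlights concrete, one may picture an item response function in which the pass probability depends on the ability a and the difficulty d only through their ratio. The sketch below is purely illustrative: the link function G(r) = r/(1 + r) is our assumption for this example and is not the specific form derived in the paper.

    # Illustrative sketch only: a scale-invariant item response function (IRF).
    # The pass probability depends on ability a and difficulty d only through
    # the ratio a/d, so rescaling a and d by the same factor leaves it unchanged.
    # The link G(r) = r / (1 + r) is an assumption made for this sketch.

    def irf(a: float, d: float) -> float:
        """Probability of a successful (pass) response for ability a, difficulty d."""
        if a < 0 or d <= 0:
            raise ValueError("ability must be non-negative and difficulty positive")
        r = a / d
        return r / (1.0 + r)

    if __name__ == "__main__":
        # Scale invariance: irf(a, d) == irf(c * a, c * d) for any c > 0.
        print(irf(2.0, 1.0), irf(4.0, 2.0))  # both print 0.666...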

Ranking test items

The test items are ranked from 1 to K in decreasing order of the frequency of the corresponding successful results. So the first test item, corresponding to the largest number of successes (k = 1), is considered to be the easiest, and the last, corresponding to the smallest number of successes (k = K), the most difficult. The only problem here can occur if there are two equal frequencies; in this case, the order can be set arbitrarily. In our opinion, such an effect in a well-constructed test is somewhat …
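A minimal sketch of this ranking step, assuming the binary test results are stored as a 0/1 matrix with one row per OUT and one column per test item (the variable names are ours; ties are broken arbitrarily by column index, as the text allows):

    # Illustrative sketch: rank K test items from easiest (rank 1, most successes)
    # to most difficult (rank K, fewest successes). Input: 0/1 results with
    # rows = OUTs and columns = test items.

    def rank_items(results: list[list[int]]) -> list[int]:
        """Return item (column) indices ordered from easiest to most difficult."""
        k = len(results[0])
        successes = [sum(row[j] for row in results) for j in range(k)]
        # Sort column indices by decreasing number of successes; Python's stable
        # sort breaks ties by the original column index, i.e., arbitrarily.
        return sorted(range(k), key=lambda j: -successes[j])

    if __name__ == "__main__":
        results = [[1, 1, 0],
                   [1, 0, 0],
                   [1, 1, 1],
                   [0, 1, 0]]
        # Items 0 and 1 tie with 3 successes each (tie broken by index);
        # item 2 has 1 success and is ranked most difficult.
        print(rank_items(results))  # [0, 1, 2]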

Numerical example

In this example, K = 3. The proportions $p_{\mathrm{sequence}}$ are given in Table 3.

The initial approximation of the vector $\mathbf{d}$ is $\mathbf{d}^{(0)} = (1, 2, 3)^{T}$. The value of $a_{111}^{(i)}$ (a “maximal possible” ability) is defined as

$$a_{111}^{(i)} = r_{\max}\, d_{1}^{(i)},$$

where $r_{\max}$ is a maximal value of ability, normalized by the difficulty $d_{1}$. In this simulation, $r_{\max} = 5$. The stopping criterion is

$$\delta^{(i)} = \left[ \sum_{k=1}^{K} \left( d_{k}^{(i)} - d_{k}^{(i-1)} \right)^{2} + \sum_{m=1}^{2^{K}} \left( a_{\mathrm{sequence}(m)}^{(i)} - a_{\mathrm{sequence}(m)}^{(i-1)} \right)^{2} \right]^{1/2} < 0.001, \quad i = 2, 3, \ldots$$

The values $d_{1}^{(i)}, d_{2}^{(i)}, d_{3}^{(i)}$, as well as the values $a_{100}^{(i)}, a_{010}^{(i)}, a_{001}^{(i)}$ and $a_{110}^{(i)}, a_{101}^{(i)}, a_{011}^{(i)}$, are …
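A minimal sketch of the stopping rule above, assuming the current and previous iterates of the difficulty vector d (length K) and of the abilities a_sequence (length 2^K) are stored as plain lists; the function and variable names are ours, for illustration only:

    import math

    # Illustrative sketch: stop the iterations when the Euclidean distance between
    # successive iterates of the difficulties d and the abilities a_sequence
    # drops below the tolerance 0.001.

    TOL = 1e-3

    def delta(d_new, d_old, a_new, a_old) -> float:
        """Euclidean distance between successive iterates of (d, a_sequence)."""
        sq = sum((x - y) ** 2 for x, y in zip(d_new, d_old))
        sq += sum((x - y) ** 2 for x, y in zip(a_new, a_old))
        return math.sqrt(sq)

    def converged(d_new, d_old, a_new, a_old) -> bool:
        return delta(d_new, d_old, a_new, a_old) < TOL

    if __name__ == "__main__":
        d_old, d_new = [1.0, 2.0, 3.0], [1.0004, 2.0003, 2.9998]  # K = 3
        a_old, a_new = [0.0] * 8, [0.0001] * 8                    # 2**K = 8 abilities
        print(converged(d_new, d_old, a_new, a_old))              # True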

Summary

We have proposed a new approach to the evaluation of test results when the test item response is binary and the difficulties of the K test items are not known a priori. The evaluation is carried out after the test has been performed on some population of objects under test (OUTs), i.e., a posteriori, and involves determination of the following values:

  • difficulties of the test items;

  • abilities assigned to the $2^K$ possible test results (a brief enumeration sketch for K = 3 follows this list);

  • anticipated distribution of the above abilities among the tested population.
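To illustrate the second item in the list for the numerical example with K = 3, the sketch below simply enumerates the 2^K = 8 possible binary test results; the ability value attached to each result is estimated by the algorithm and is not reproduced here:

    from itertools import product

    # Illustrative sketch: for K test items there are 2**K possible binary test
    # results, each of which is assigned its own ability value a_sequence.

    K = 3
    patterns = ["".join(map(str, bits)) for bits in product((1, 0), repeat=K)]
    print(patterns)
    # ['111', '110', '101', '100', '011', '010', '001', '000'] -> 2**3 = 8 results,
    # matching the labels a_111, a_110, ..., a_000 of the numerical example.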

References (13)

  • G. Rasch, Probabilistic Models for Some Intelligence and Attainment Tests (1960)

  • L. Crocker et al., Introduction to Classical and Modern Test Theory (2006)

  • B.D. Wright et al., Best Test Design (1979)

  • B.D. Wright et al., Rating Scale Analysis: Rasch Measurement (1982)

  • G.N. Masters, A Rasch model for partial credit scoring, Psychometrika (1982)

  • G.H. Fischer et al., Rasch Models: Foundations, Recent Developments and Applications (1995)

Cited by (9)

  • Addressing traceability of self-reported dependence measurement through the use of crosswalks

    2021, Measurement: Journal of the International Measurement Confederation
  • Binary test design problem

    2018, Measurement: Journal of the International Measurement Confederation
    Citation Excerpt:

    The difference between psychometrical, technical, financial, statistical, physical and other tests consists in the models used to describe the specific item response function (IRF). In technical, financial, statistical, physical testing, the response models may differ significantly and no longer have the remarkable properties of the Rasch model [35], while acquiring some other properties (such as self-similarity, for example [5,6,12]). The mathematical expressions of the IRF in most cases are quite distinct. Therefore, it makes sense to discuss the problem of binary test planning from the most general principles point of view.

  • Theory-based metrological traceability in education: A reading measurement network

    2016, Measurement: Journal of the International Measurement Confederation
    Citation Excerpt:

    The organic integration of theory, data, and instruments in institutional contexts sensitive to ground-up self-organizing processes requires systematic conceptualizations of measurement as a distributed process, where scientific fields, markets, and societies operate as massively parallel stochastic computers [66,67]. Recent comparisons of engineering and psychometric perspectives on the possibility of such systems in education suggest a viable basis for such conceptualizations [68–73]. Metrological traceability systems of this kind [24] will integrate qualitative progressions in learning defined by predictive theories of causal relations [49], construct maps [74], and associated item hierarchies in educational assessments generally.

  • Item response function in antagonistic situations

    2020, Applied Stochastic Models in Business and Industry