The construction of test reliability statistical criteria by a computer simulation

The paper describes a computer simulation experiment for a knowledge assessment test in which the characteristic functions of the items are power functions and the knowledge level of the population has a uniform or normal distribution. Inversions of the conditional means were calculated, and the probability distributions of the numbers of inversions were simulated. When knowledge tests are constructed in practice, their suitability (validity) can be checked by comparing the observed numbers of inversions with reference values obtained from Monte Carlo experiments.


Introduction
In earlier works by the co-authors of this article [2,3,4], a mathematical model of a closed-type knowledge assessment test (with multiple-choice answers) was proposed. The model is based on the assumptions that each test taker has a certain level of knowledge $p \in (0, 1)$ and that the probability of a correct answer to the $j$-th test item can be expressed by a nondecreasing function of the variable $p$, the so-called characteristic function of the test item, $k_j : [0, 1] \to [0, 1]$. Let us denote by $M_n(p)$ the average number of correct answers on an $n$-item test for a test taker whose knowledge level is $p$, so that $M_n(p) \in [0, n]$. Clearly, $M_n(p)$ must be a nondecreasing function of $p$, i.e. test takers with a higher knowledge level receive more test points (correctly answered items) on average. In theory, this makes it possible to find the inverse function $M_n^{-1}$ and thus to determine the knowledge level of test takers from the test points they received. In practice, however, this is difficult and requires further study.
The paper [6] presents an empirical study of this task in which the knowledge level of the test takers was treated as a priori known information. Students' knowledge was assessed in various ways, the average of the estimates was calculated, and the test taker's grade point was restored from the resulting value. Because the phenomenon is stochastic, a stronger student (one with a higher knowledge level $p_i$) sometimes receives a lower test score and, conversely, a student with a higher test score may have a lower level of knowledge. If such discrepancies occur frequently, the test is of poor quality: it measures students' knowledge unreliably and therefore lacks validity. In this article we propose a scheme for designing evaluation criteria for such discrepancies, implemented by Monte Carlo experiments.
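The inversion of $M_n$ mentioned above can be illustrated numerically. By linearity of expectation, $M_n(p) = \sum_{j=1}^{n} k_j(p)$. The following is a minimal sketch, not the method of [6]: it assumes that $M_n$ is continuous and strictly increasing on $[0, 1]$ and solves $M_n(p) = s$ by bisection. The function names and the three illustrative power-function items are hypothetical.

```python
from typing import Callable, Sequence

Item = Callable[[float], float]  # a characteristic function k_j(p)


def mean_score(p: float, items: Sequence[Item]) -> float:
    """M_n(p): expected number of correct answers, the sum of k_j(p)."""
    return sum(k(p) for k in items)


def invert_mean_score(score: float, items: Sequence[Item], tol: float = 1e-9) -> float:
    """Solve M_n(p) = score for p in [0, 1] by bisection.

    Assumption: M_n is continuous and strictly increasing on [0, 1].
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mean_score(mid, items) < score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def power_item(alpha: float) -> Item:
    """Illustrative characteristic function k(p) = p ** alpha."""
    return lambda p: p ** alpha


# Hypothetical three-item test used only to exercise the functions.
items = [power_item(a) for a in (0.25, 0.5, 0.75)]
```

For instance, `invert_mean_score(mean_score(0.4, items), items)` recovers a value close to 0.4. With real data only a single noisy score is observed instead of $M_n(p)$, which is one reason the inversion is difficult in practice.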

The experiment
Several methodological assumptions were made for the experiment. There are $m$ students whose knowledge levels take the values $p_1, p_2, \dots, p_k$, with $m_1 + m_2 + \cdots + m_k = m$, where $m_i$ is the number of students with knowledge level $p_i$. The theoretical assumption of the experiment is that the characteristic functions $k_j$ of the test items are known and are power functions: $k_j(p) = p^{\alpha}$. Depending on the value of the parameter $\alpha \in (0, 1)$, these functions model the difficulty of the questions (test items): the greater $\alpha$ is, the harder the question (a student with knowledge level $p$ answers the question correctly with probability $p^{\alpha}$, which decreases as $\alpha$ grows). The difficulties of the test questions are considered known, so the test consists of $n$ questions with known characteristic functions $k_1, k_2, \dots, k_n$. The result of the experiment is a computer simulation of the students' responses to the $n$ test items: random vectors $r = (r_1, r_2, \dots, r_n)$ of zeros and ones, $r_j \in \{0, 1\}$, generated according to the a priori distributions $P\{r_j = 1\} = k_j(p)$. Note that the characteristic functions can have another form (see [4,7]); this would not change the design of the study.
Calculations were limited to a test length of $n = 15$ items: 5 "easy" items ($\alpha = 0.25$), 5 items of "medium difficulty" ($\alpha = 0.5$) and 5 "hard" items ($\alpha = 0.75$). Each item could be answered correctly ($r_j = 1$) or incorrectly ($r_j = 0$). The answers of $m = 50$ students to these test questions were generated by computer simulation. The students' knowledge levels were set in advance. An example of a simulation result, a random matrix of zeros and ones, is shown in Fig. 1.
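The simulation step just described can be sketched as follows. This is an illustration under stated assumptions, not the authors' program: the text does not give the knowledge levels, so the ten equally spaced values in `LEVELS` (5 students each) are hypothetical, as are the function name and the seed.

```python
import random

# n = 15 items: 5 easy, 5 medium, 5 hard (exponents taken from the text).
ALPHAS = [0.25] * 5 + [0.5] * 5 + [0.75] * 5

# Hypothetical knowledge levels (an assumption, not given in the text):
# 0.05, 0.15, ..., 0.95, each shared by 5 students, so m = 50.
LEVELS = [(2 * i - 1) / 20 for i in range(1, 11) for _ in range(5)]


def simulate_responses(levels, alphas, rng):
    """Return a 0/1 matrix with one row per student and one column per item.

    Entry (s, j) equals 1 with probability levels[s] ** alphas[j].
    """
    return [[1 if rng.random() < p ** a else 0 for a in alphas] for p in levels]


rng = random.Random(0)            # fixed seed only for reproducibility
R = simulate_responses(LEVELS, ALPHAS, rng)
scores = [sum(row) for row in R]  # test points of each student
```

Each row of `R` plays the role of one response vector $r$, and `scores` holds the number of correct answers of each student.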
Then conditional averages [1,5] were calculated for each row and column of the probability matrix, with $i = 1, 2, \dots, 10$ and $j = 0, 1, \dots, 15$: $m_j$ is the average knowledge level of the students who answered $j$ test items correctly, and $n_i$ is the average number of correct answers of the students with knowledge level $i$. A student with a higher knowledge level is expected to obtain more test points (correct responses) on average. The row and column inversions were then counted. For example, the number of column inversions is the number of pairs $(i_1, i_2)$ with $i_1 < i_2$ and $n_{i_1} > n_{i_2}$.
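The conditional averages and the inversion count can be sketched as below. The sketch rests on an assumption about the probability matrix: `P[i][j]` is taken to be the relative frequency of observations with knowledge level $i$ and score $j$, so that $n_i$ and $m_j$ are the usual weighted means over a row and a column. The function names are hypothetical, and an inversion is counted with a strict inequality.

```python
def conditional_means(P, levels):
    """Return (n, m) for a frequency matrix P with P[i][j] >= 0.

    n[i]: average score among observations with knowledge level i.
    m[j]: average knowledge level among observations with score j.
    An entry is None when the corresponding row or column is empty.
    """
    n = []
    for row in P:
        total = sum(row)
        n.append(sum(j * w for j, w in enumerate(row)) / total if total > 0 else None)

    m = []
    for j in range(len(P[0])):
        total = sum(row[j] for row in P)
        weighted = sum(p * row[j] for p, row in zip(levels, P))
        m.append(weighted / total if total > 0 else None)
    return n, m


def count_inversions(values):
    """Number of pairs a < b with values[a] > values[b]; None entries are skipped."""
    v = [x for x in values if x is not None]
    return sum(1 for a in range(len(v)) for b in range(a + 1, len(v)) if v[a] > v[b])
```

With the levels ordered increasingly, `count_inversions(n)` counts the pairs of levels for which the lower level has the higher average score, and `count_inversions(m)` does the same for the scores.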

The results of experiment
Two series of the experiments described in the previous section were conducted. In the first series the probability distribution of the population knowledge level was taken to be uniform, and in the second Gaussian (normal). In each experiment 100 random matrices of item responses were generated, and each experiment was repeated 5 times. One of the resulting probability matrices is shown in Fig. 3. The numbers of column inversions are listed in the tables in Fig. 4. In the first table of Fig. 4 the knowledge level in the population has a uniform distribution (50 students were chosen, and every 5 students share the same knowledge level); in the second it has a normal distribution (Fig. 5). Statistical hypotheses about the probability distribution of the number of column inversions were tested. According to the Kolmogorov-Smirnov goodness-of-fit test, the hypothesis that the number of column inversions has a Poisson distribution was not rejected in any of the 5 cases with a uniform distribution or the 5 cases with a Gaussian distribution of the population knowledge level. Point estimates of the parameter λ (the average rate of occurrence of column inversions) and the p-values of the two-tailed goodness-of-fit test are listed in Table 1.
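The estimation and goodness-of-fit step can be sketched as follows. This is a minimal illustration, not the procedure behind Table 1: it estimates λ by the sample mean (the maximum-likelihood estimate for a Poisson sample) and computes the Kolmogorov-Smirnov distance between the empirical and the fitted Poisson distribution functions. It does not compute a p-value, since for a discrete distribution with an estimated parameter the classical Kolmogorov-Smirnov tables are only approximate. The function names and the sample `counts` are hypothetical.

```python
import math


def poisson_cdf(k, lam):
    """P{X <= k} for X ~ Poisson(lam)."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)  # P{X = 0}
    total = term
    for i in range(1, k + 1):
        term *= lam / i    # P{X = i} from P{X = i - 1}
        total += term
    return total


def ks_distance_poisson(counts):
    """Return (lam_hat, D): the sample mean and sup |F_empirical - F_Poisson|.

    Both distribution functions are step functions that jump only at
    integers, so it suffices to compare them at k = 0, ..., max(counts).
    """
    size = len(counts)
    lam_hat = sum(counts) / size
    d = 0.0
    for k in range(max(counts) + 1):
        empirical = sum(1 for c in counts if c <= k) / size
        d = max(d, abs(empirical - poisson_cdf(k, lam_hat)))
    return lam_hat, d


# Hypothetical inversion counts, used only to exercise the functions.
counts = [2, 1, 3, 0, 2, 4, 1, 2, 3, 2]
lam_hat, d = ks_distance_poisson(counts)
```

A small distance `d` indicates that the fitted Poisson law is close to the observed distribution of inversion counts; turning it into a p-value requires an appropriate reference distribution for the statistic.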
On average, λ = 2.303 for the uniform distribution and λ = 1.958 for the Gaussian distribution. Treating the number of column inversions as Poisson distributed, we can calculate the values of its cumulative distribution function, which are presented in Table 2.
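A reference value for the number of inversions can be read off the Poisson cumulative distribution function as a quantile. The sketch below is an illustration only: the level 0.95 is an assumed choice, not one stated in the text, and the function name is hypothetical. It returns the smallest $c$ with $P\{X \le c\} \ge$ `level` for $X \sim \mathrm{Poisson}(\lambda)$.

```python
import math


def poisson_reference_value(lam, level=0.95):
    """Smallest integer c with P{X <= c} >= level for X ~ Poisson(lam).

    The default level 0.95 is an assumption made for this sketch.
    """
    term = math.exp(-lam)  # P{X = 0}
    cdf = term
    c = 0
    while cdf < level:
        c += 1
        term *= lam / c    # P{X = c} from P{X = c - 1}
        cdf += term
    return c


# Average estimates of lambda reported in the text.
for name, lam in (("uniform", 2.303), ("Gaussian", 1.958)):
    print(name, poisson_reference_value(lam))
```

An observed number of column inversions above the returned value would be unusual at the chosen level under the fitted Poisson model.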
So, the number of column inversions for a uniform distribution of the population knowledge level in the case of $m = 50$, $n = 15$ (the items are described in Section 2) couldn't

Conclusions
The purpose of this paper is to show a methodology for constructing test validity criteria. The study presented simulated test results obtained with a priori information about: (1) the difficulty of the test items, whose characteristic functions are power functions, If the number of inversions does not exceed the reference values, we can say with a certain confidence level that the test is reliable (valid). Note that the reference distribution of the number of inversions depends on the above-mentioned information, and practical application of the criteria requires a number of additional studies.