Calibrating Questionnaires by Psychometric Analysis to Evaluate Knowledge

This article presents the experience gained with the “Questionnaires” tool available in the Virtual Campus of an architectural engineering school in northeast Spain. “Questionnaires” is a simple and adequate tool for evaluating the knowledge level achieved by students. This work identifies the control indices used to judge the suitability of questionnaires: the Facility Index, the standard deviation, the Discrimination Index, and the Discrimination Coefficient. From these parameters, educational performance is inferred, identified, and predicted. The conclusions of this work will allow deficient knowledge-evaluation practices to be modified and the needs of specific groups, or of students with particular requirements, to be identified, thus making it feasible to apply these parameters with a guarantee of success in similar evaluation processes.


Introduction
European universities are currently promoting conceptual and structural changes within the European educational space. These changes should address the work of professors, the way in which knowledge is transmitted, easier ways of learning for students, and finally, the delivery of a competent education as required by society and the educational institution (Goldston, Dantzler, Day, & Webb, 2012; Veiga & Amaral, 2009). To achieve these objectives, higher education institutions in Spain have in recent years started improvement processes in their teaching practices (J. M. Gómez-Soberón, 2009; J. M. Gómez-Soberón & Gómez-Soberón, 2007; J. M. Gómez-Soberón, Gómez-Soberón, & Gómez-Soberón, 2009).
Within the outlined context, in which the principal challenge for educational development is the generation of mechanisms or evaluation systems that produce relevant information on what is taught and learned in a way that is effective in schools, we have begun to deliberate on all our educational processes. For this, we have incorporated the statistical use of questionnaires that allow us to define parametric indices about the learning of students. These questionnaires may be seen as a new educational tool that responds to the current demand for the analysis of learning and the redirection of possible tendencies or undesirable deviations in students.
There are published studies concerning the application of evaluation methods to assess possible improvements that foster learning in students (Bowles, 2008; Britt, McCall, Austin, & Piterman, 2007; Fančovičová & Prokop, 2010; Jaeger, 1998; Mowder & Shamah, 2011; Murayama, Zhou, & Nesbit, 2009; Rivero, Martínez-Pampliega, & Olson, 2010; Zhang, 2011). This research presents evaluation questionnaires as tools by which data processing systems can be an effective and convenient strategy to reinforce student learning. Some of the advantages of these systems are the efficient management of results, the speed with which the evaluation can be performed, and avoidance of the use of paper questionnaires (Shepard, 2006). However, there are some objections to the implementation of such systems. These objections concern the confidentiality of the identity of the student, the subsequent use of the information and its possible impact on the educational process (Burden, 2008; Nulty, 2008), and the implications of the use of such systems as the sole criterion for information in the learning process of students (Garfield, 2003).
For the design of evaluation tests, the writing of multiple-option questions is a specialized task that requires personnel with experience and training. If such questions are adequately written, they will be able to measure complex educational abilities (depending on the knowledge and experience of the person writing them; Esquivel, 2000). The description of items or questions implies verifying the relation between the item and the content supposedly being measured by the item. This verification is considered a central part of test validation processes (it is usually carried out through student-professor feedback). Therefore, to confirm the validity of the test, information about the content and about the clarity and comprehensibility of the items is important.
In light of the previous statements, it is necessary to perform statistical analyses to determine the characterization indices of items (difficulty, item-total score correlation, discrimination, and answer frequency per option) and to select adequate items for evaluation following norm-referenced test theory. Ideal questions are those with a difficulty close to 50%, a discrimination above 0.40, and an item-total score correlation that is positive and significantly higher than zero (Esquivel, 2000).
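The selection criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' procedure: the `ItemStats` class, the item names, the sample values, and the tolerance of 0.20 around 50% difficulty are assumptions made for the example, and the check on the correlation only tests for a positive sign rather than statistical significance.

```python
from dataclasses import dataclass


@dataclass
class ItemStats:
    """Summary indices for one item (hypothetical container for this sketch)."""
    name: str
    difficulty: float      # proportion of correct answers, 0.0 to 1.0
    discrimination: float  # discrimination index, -1.0 to 1.0
    item_total_r: float    # item-total score correlation, -1.0 to 1.0


def is_acceptable(item, difficulty_tolerance=0.20, min_discrimination=0.40):
    """Apply the three criteria: difficulty near 50%, discrimination above
    0.40, and a positive item-total correlation.

    The tolerance is an assumed value; the source only says "close to 50%".
    """
    near_half = abs(item.difficulty - 0.50) <= difficulty_tolerance
    return (near_half
            and item.discrimination > min_discrimination
            and item.item_total_r > 0.0)


# Invented values, for illustration only.
items = [
    ItemStats("Q1", difficulty=0.55, discrimination=0.45, item_total_r=0.30),
    ItemStats("Q2", difficulty=0.95, discrimination=0.10, item_total_r=0.05),
    ItemStats("Q3", difficulty=0.48, discrimination=0.50, item_total_r=-0.10),
]

selected = [item.name for item in items if is_acceptable(item)]
print(selected)
```

In this invented sample, only the first item meets all three criteria; the second is too easy and discriminates poorly, and the third has a negative item-total correlation.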
However, both processes, namely, the analysis of test results and the application of questionnaires, require extra effort from teachers, which causes lost time from teaching or assessment. The main causes of lack of interest would be the following: 1. Resistance to new evaluation tools by the traditional professor, 2. Refusal to investigate in situations that are not expected to be repeated, 3. "Extra" time in educational tasks, 4. Possibility of generating additional resources, which cannot be assumed.
It is important to note, however, that there currently are tools and calculation processes that make possible the analysis of multiple processes, the generation of simulations, and the validation of prediction hypotheses about guidelines in the education field. All of these can become very useful for specific cases or individuals, provided that the theoretical and mathematical principles on which they are based are taken into account when they are applied (Hutchison, 2009).

Educational Framework and Study Participants
"Constructions of concrete" is the course studied in the present work; this course is part of the studies toward a university technical degree. This is a 4-month course in the 2nd year of study (Obligatory in the Curriculum Block). It takes place in the 4-month term 2B, and it consists of six credits (not European Credit Transfer System credits), subdivided into 4.5 theoretical credits and 1.5 practice credits.
The subject is given simultaneously to four groups in every 4-month term (1Q: autumn; 2Q: spring): two groups of students in the morning (Groups 1M and 2M) and two groups of students in the afternoon (Groups 3T and 4T). Table 1 presents the number of students who have taken this course in recent years. It can be observed that the number of students registered has been increasing over time, resulting in educational problems, such as extra work for professors, a decrease in teaching quality, difficulty in evaluation using traditional systems, and so on.
It can be said that in this discipline, a high number of students fail and enroll in the course multiple times. Study participants are all in their first attempt at this course (Escuela Politécnica Superior de Edificación de Barcelona, 2009).

Research Design
To design the evaluation analysis system and to obtain the control indices used in this work, some general criteria and practical recommendations were followed to guarantee a correct application of the work and to avoid bias arising from its incorrect use (Myford & Engelhard, 2001; Ravela, 2000; Tiana, 1997). In this way, the information on each person's process, the statistics of the answers that were used, and the analysis and the percentages of the evaluation questionnaires (multiple for each student) were obtained in the tool "Questionnaires" of the Virtual Campus (Dougiamas).
As the starting point for analyzing the data, the sampling frame was limited to the course and group submitted for analysis. The analysis was restricted to these variables (period, course, and group) because these were the only classes and teachers available that agreed to participate, and the only classes approved by the school for the initial calibration of the experience, the verification of its suitability, and so on, considering that this is a pilot study. The study group represents the total number of students in the group and course, and therefore constitutes a stratified response to a type of nonparametric sampling.
The analysis presented pertains to data processed and extracted from specific evaluations (two midterm exams), from two works done by students, and from two tests (multiple-option and paired-test type; Tuparov & Dureva-Tuparova, 2008). Table 2 shows the evaluations of previous techniques.
The specific evaluations were individual, consisting of solving graphic-conceptual problems. The activities developed by the students involved the resolution of real cases, with applications related to topics developed inside the classroom. These activities were developed individually and were valued according to some preestablished principles (rubric).
Test 1 (multiple option) consisted of 30 items having between three and five possible answers from which to select. Test 2 (paired) consisted of four blocks of questions, with each block containing 8 to 12 questions, for a total of 41 questions. The structure of the two tests assumes the implications and reasoning presented in the literature in this respect (Berrios, Rojas, & Cartaya, 2005). Both tests were implemented in the Virtual Campus of the course through the data processing platform Moodle (Dougiamas), although currently it is feasible to apply them in other similar platforms (Tuparov & Dureva-Tuparova, 2008). The Moodle platform allows evaluations to take place virtually inside (our case) or outside the classroom, and evaluation is done using the previous test program. As a result of the process, the system generates an output file in Word, Excel, or RTF, thus allowing processing.
Tests were defined based on the following criteria and data processing adjustments, which help to standardize their application (regulations were provided to the student body prior to administration of evaluations): The tests were also proposed to evaluate the different knowledge levels achieved by students, based on the Taxonomy of Bloom (Van Niekerk & Von Solms, 2009). Table 3 summarizes the subdivision of knowledge levels evaluated, including the number of questions for each one of them.
For the analysis in the first part of the statistical study, four different variables were used. Table 4 shows the codes and meanings assigned to these variables. With the criteria given earlier and the variables to analyze, the data processing program SPSS V17 for Windows was used to obtain the general descriptive statistical parameters of each variable separately, and thus to understand and distinguish them. The studied parameters were as follows:
1. Central tendency measures (mean, median, mode, and sum);
2. Dispersion measures (standard deviation, variance, amplitude, minimum, maximum, and standard error of the mean), the shape of the sampling distribution (asymmetry and kurtosis), and finally the percentile values.
Table 5 presents the general results obtained for the four analyzed variables regarding their general statistical description.
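As an illustration of the descriptive parameters listed above, the following sketch computes them with the Python standard library. It is not the SPSS procedure used in the study: the function name `describe` and the sample scores are invented for the example, and the asymmetry and kurtosis use simple moment-based formulas, so the values can differ slightly from the sample-adjusted figures that SPSS reports.

```python
import statistics


def describe(scores):
    """Return basic descriptive statistics for a list of numeric scores."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation (n - 1)

    # Central moments, used for the moment-based shape measures.
    m2 = sum((x - mean) ** 2 for x in scores) / n
    m3 = sum((x - mean) ** 3 for x in scores) / n
    m4 = sum((x - mean) ** 4 for x in scores) / n

    return {
        "mean": mean,
        "median": statistics.median(scores),
        "mode": statistics.mode(scores),
        "sum": sum(scores),
        "sd": sd,
        "variance": statistics.variance(scores),
        "amplitude": max(scores) - min(scores),
        "minimum": min(scores),
        "maximum": max(scores),
        "standard_error": sd / n ** 0.5,
        "asymmetry": m3 / m2 ** 1.5,          # negative: longer left tail
        "excess_kurtosis": m4 / m2 ** 2 - 3,  # 0 for a normal distribution
        "quartiles": statistics.quantiles(scores, n=4),
    }


# Invented scores on a 0-10 scale, for illustration only.
sample_scores = [4.0, 6.5, 7.0, 7.0, 8.0, 8.5, 9.0]
summary = describe(sample_scores)
for name, value in summary.items():
    print(name, value)
```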

The Analyzed Variables
With respect to the measures of central tendency, the final score averages of the study groups are located in the range of 5.5 to 7.0 (maximum possible score of 10). Groups 1M, 2M, and 3T have an average score, but not Group 4T (of study). The study tests (Tests 1 and 2) applied to Group 4T showed average values above this range (close to 8.0 for Test 1 and 9.0 for Test 2; maximum possible score of 10 in both cases). From these values, it can be said that Group 4T performs better than the reference groups, and that the tests discussed in this article are not difficult to solve. Therefore, their use as a teaching tool involves a manageable difficulty in this course (the tests are appropriate to the content of the subject being assessed; see Table 5 and Figure 1). Finally, the students who took Tests 1 and 2 were about the same in number, which improves the interpretation and correlation between variables.
With regard to measures of dispersion, as shown in Table 5 and Figure 2, the standard deviation for the study group (for the final score and for both tests studied in this work) is always lower than that of the final average score of the reference groups (1M, 2M, and 3T), with a difference of about 0.5 units; the values become similar when the coefficients of variation of the study tests and of the final average scores of the groups are compared. This indicates that the behavior of the study tests and of the results of the control groups is substantially the same, and that the study variables are related among themselves, suggesting the absence of other variables.
Moreover, it may be observed that the amplitude of the scores achieved in each of the evaluations indicates that the tests applied "focus" better on student scores (amplitude of 3.5 for Test 1 and 7.5 for Test 2, while for the control groups and the final score of the study group the amplitude reached between 8.5 and 9.5 points).
With respect to the shape of the distribution curve of the different study variables, it can be seen that the peak of the distribution is more pronounced for the study tests (between 12 and 15 times higher than for the average score at the end of the course achieved by both the control and study groups; see Table 5 and Figure 3). Similarly, it can be said that the data distributions for all variables are skewed to one side, extending into the negative zone (the left branches of the distributions are larger).
Finally, with reference to the distribution of the scores achieved by students for each of the study variables (see Table 5 and Figure 4), we can say that the average value of the final score for the control groups is linear, incremental, and positive, much like the scores of Test 1 (similar slope, but higher at the beginning), while in the case of Test 2, the curve flattens from 50% onward. This may help provide an understanding of how to distribute the scores and compare different evaluation techniques.
In conclusion, we can highlight two general ideas from the comparative statistics. First, the mean of the score of the students (VAR04 = final score) who took the tests (VAR02 and VAR03) is higher (Group 4T) than that of students who did not take the tests (Groups 1M, 2M, and 3T). Second, the variance of the results is smaller for the group (4T) that took the tests (VAR02 and VAR03).

Psychometric Analysis of the Items
Psychometric analysis is a mathematical procedure that applies statistical principles to determine the suitability of the proposed questions based on the responses and their individual relationship with the rest of the answers, thereby detecting whether the proposed questions are appropriate to assess the level of knowledge, degree of difficulty, and degree of discrimination between high and low conceptual skills (Heck & Van Gastel, 2006; Revuelta, Ximénez, & Olea, 2003).
From the results of the multiple-option and the paired tests, as previously discussed, some parameters were extracted and used. These are defined and analyzed in Tables 6 and 7, where the processed data of the surveys are presented in a manner that permits the analysis and evaluation of the performance of each question, taking into account the global evaluation of the sample. The statistical parameters used in these tables were determined according to the classical theory of tests (Batrinca & Raicu, 2010; General Public License GNU, 2010). The theory behind the analysis chosen to calibrate questionnaires or assess psychometric properties is not presented in this work because, on the one hand, the system used is the one already in place at the school and, on the other hand, its theoretical justification is documented in the tool itself and can be found on the Web (Dougiamas).
The first parameter presented in Tables 6 and 7 for the analysis of the tests is the Facility Index (FI; % correct), which is defined as the mean value of how easy or difficult an item is, with regard to the rest of the questions inside the same analysis group (test). This parameter is determined with the following equation:
$$FI = \frac{X_{mean}}{X_{max}} \times 100$$
where $X_{mean}$ is the mean of the values obtained on the item by all the users who answered it and $X_{max}$ is the maximum value obtainable for that item.
If the questions could be distributed in dichotomous categories (correct/incorrect), this parameter would coincide with the percentage of students who responded to the questions correctly.
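The definition above (mean score obtained on an item relative to the maximum score for that item, expressed as a percentage) can be sketched as follows. The function name and the sample scores are assumptions made for illustration; the second call shows that, for dichotomous scoring, the result is simply the percentage of correct answers.

```python
def facility_index(item_scores, max_score):
    """Facility Index in percent: mean item score over the maximum item score."""
    if not item_scores:
        raise ValueError("at least one score is required")
    mean_score = sum(item_scores) / len(item_scores)
    return 100.0 * mean_score / max_score


# Partial-credit item scored from 0 to 1 (invented values).
fi_partial = facility_index([0.5, 1.0, 0.0, 1.0], max_score=1.0)

# Dichotomous item: 1 = correct, 0 = incorrect (three of four correct).
fi_dichotomous = facility_index([1, 1, 1, 0], max_score=1)

print(fi_partial, fi_dichotomous)
```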
In our study, and considering Figure 5, most of the questions in Test 1 are concentrated in the band from 70% to 90% of FI, while in Test 2, they are located in a band from 85% to 90%. From these results, it is deduced that the questions or blocks of questions located outside either extreme of these bands should be eliminated in future editions of the test because they are trivial (FI very high) or of a high difficulty level (FI very low). In either case, these questions are not useful as evaluation criteria and should not be used to inform an educational evaluation. The graph in Figure 5 shows the areas discussed.
Another possible alternative in deciding which questions or blocks of questions could be eliminated from a test is to verify that the questions are correctly defined, not including errors in their formulation and complying with basic criteria of logic. To accomplish this task, an exhaustive review of the editing, structure, logic, and coherence of questions must be done before using them again in another evaluation.
The second parameter evaluated in this work is the standard deviation (SD), which indicates the dispersion of the response in relation to the answers given by the entire population analyzed. As a comment to this parameter, it can be said that in the event that all students respond equally to a specific question (item), the value obtained for SD would be zero.
SD is obtained as the statistical standard deviation of the sample (classical statistical analysis), computed on the fractional scores (obtained/maximum) for each specific item.
In our case, and considering Figure 6, this parameter can be used as a detection criterion to verify the acquisition of knowledge by the student body for a given concept or item. The knowledge indicated by SD should not be read as particular or individual; the correct interpretation is general and uniform for all the members (collective general knowledge of the theme).
In Test 1, the questions that surpass the upper band of the established criterion (in this case, it could be set as an SD close to 0.30) are questions with thematic content advisable to be reviewed again in the classroom to guarantee some minimum content learned by all students.
For Test 2, there is a great divergence between the two clearly defined groups of SD. Thus, the form in which the questions have been grouped (paired questions) should be changed. The four blocks of items should be centered, improving the verification uniformity of the acquired knowledge. The graph in Figure 6 shows the area discussed.
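The SD criterion described above for Test 1 can be sketched as follows. This is only an illustration: the function name, the item labels, and the scores are invented, the sample standard deviation is computed on fractional scores (obtained/maximum), and the threshold of 0.30 is the tentative value mentioned in the text rather than a fixed rule.

```python
import statistics


def item_sd(item_scores, max_score):
    """Sample standard deviation of the fractional scores of one item."""
    fractions = [score / max_score for score in item_scores]
    return statistics.stdev(fractions)


# Invented dichotomous scores for three items (1 = correct, 0 = incorrect).
responses = {
    "Q1": [1, 1, 1, 1, 1, 1],  # everyone answers the same way
    "Q2": [1, 1, 1, 1, 1, 0],  # nearly uniform answers
    "Q3": [1, 0, 1, 0, 1, 0],  # answers split evenly
}

SD_THRESHOLD = 0.30  # tentative upper band taken from the text

sd_by_item = {name: item_sd(scores, 1) for name, scores in responses.items()}
to_review = [name for name, sd in sd_by_item.items() if sd > SD_THRESHOLD]
print(sd_by_item)
print(to_review)
```

An item answered identically by all students gives an SD of zero, as noted earlier, while items whose SD exceeds the chosen band are candidates for reviewing the corresponding content in class.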
Another interesting parameter for the analysis of test results is the Discrimination Index (DI), which provides an approximate indicator of how well each item (question) or analyzed response separates students with a high performance level from those with a lower performance level. It thus relates a high score on the item to a high global score, and distinguishes the less expert user from the experienced one. This parameter is obtained by dividing the analyzed student group into thirds according to their scores on the whole questionnaire. The average score on the analyzed item is then obtained for the upper and lower thirds (in descending order of performance); finally, the average of the lower third is subtracted from that of the upper third. The mathematical expression is as follows:
$$DI = \frac{X_{top} - X_{bottom}}{N}$$
where $X_{top}$ is the sum of the fraction achieved (obtained/maximum) on this item by the third of students with the highest grades in the whole questionnaire (that is, the number of correct answers in this group), $X_{bottom}$ is the analogous sum for the students located in the lower third of the questionnaire, and $N$ is the number of students in each third.
This parameter takes values in the range of +1 to −1. Its meaning should be interpreted as follows: When DI falls below 0.0, more low-performance students than high-performance students have answered the item correctly. Therefore, these items, as questions for evaluation, should be eliminated for being inadequate. In fact, these items reduce the global score precision of the test. In our work (Figure 7), and with the aim of validating an evaluative questionnaire, it will be necessary to eliminate the questions in Test 1 that have a DI lower than 0.4, because these are answered correctly even by the third of students with low performance, so the knowledge they assess can be assumed. It is important to note that, in this case, these questions are not badly designed, but they are not necessary for evaluation because of their simplicity. The graph in Figure 7 shows the border discussed.
In Test 2, the concepts established above for Test 1 are applicable, thus completing this questionnaire, with exceeded reliability for future applications. Therefore, it is necessary to adjust the test for application in new practices.
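The procedure for DI (rank students by their score on the whole questionnaire, take the upper and lower thirds, and compare their performance on the item) can be sketched as follows. The function name and the sample data are assumptions for illustration, and dividing the difference of the two sums by the number of students in a third is an assumption chosen so that the result stays within the stated range of +1 to −1.

```python
def discrimination_index(item_fractions, total_scores):
    """Difference between upper-third and lower-third performance on one item.

    item_fractions: fraction of credit (obtained/maximum) per student.
    total_scores:   score on the whole questionnaire per student, same order.
    """
    if len(item_fractions) != len(total_scores):
        raise ValueError("both lists must describe the same students")

    # Student positions ranked from highest to lowest total score.
    ranking = sorted(range(len(total_scores)),
                     key=lambda i: total_scores[i], reverse=True)
    group_size = len(ranking) // 3
    if group_size == 0:
        raise ValueError("at least three students are required")

    x_top = sum(item_fractions[i] for i in ranking[:group_size])
    x_bottom = sum(item_fractions[i] for i in ranking[-group_size:])
    return (x_top - x_bottom) / group_size


# Invented data for six students.
totals = [9, 8, 7, 5, 3, 2]
good_item = [1, 1, 1, 0, 0, 0]  # answered correctly by the stronger students
bad_item = [0, 0, 1, 1, 1, 1]   # answered correctly by the weaker students

di_good = discrimination_index(good_item, totals)
di_bad = discrimination_index(bad_item, totals)
print(di_good, di_bad)
```

In this invented example the second item has a negative DI, which is the situation the text identifies as grounds for eliminating a question.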
The last statistical parameter analyzed in this work is the Discrimination Coefficient (DC), which is considered another parameter of measure to achieve the separation of adequate items and low-performance items from the learning evaluation.
DC is a coefficient of correlation between the scores of each particular item and the scores of the complete questionnaire. Its mathematical expression is as follows:
$$DC = \frac{\sum(xy)}{N \, S_x \, S_y}$$
where $\sum(xy)$ is the sum of the products of the deviations of the item scores and of the total survey or test scores, $N$ is the number of answers obtained for a question or item, $S_x$ is the standard deviation of the fractional scores for the question, and $S_y$ is the standard deviation of the scores of the total questionnaire.
As with the previous parameter (DI), DC takes values from +1 to −1. Positive values indicate items that discriminate in favor of high-performing students, while negative values indicate items that are answered better by low-performing students. This means that items with a negative DC are answered incorrectly by the more proficient students, which penalizes them. Therefore, these topics or test questions must be removed.
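The DC described above (a correlation between the scores on one item and the scores on the whole questionnaire, built from the sum of products of deviations, the number of answers, and the two standard deviations) can be sketched as follows. The function name and the data are invented for illustration, and the standard deviations are computed with N in the denominator, an assumption under which the expression coincides with the Pearson correlation coefficient.

```python
import math


def discrimination_coefficient(item_fractions, total_scores):
    """Correlation between item scores and whole-questionnaire scores."""
    n = len(item_fractions)
    if n != len(total_scores) or n < 2:
        raise ValueError("two lists of equal length (>= 2) are required")

    mean_x = sum(item_fractions) / n
    mean_y = sum(total_scores) / n

    # Sum of the products of the deviations from each mean.
    sum_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(item_fractions, total_scores))

    # Standard deviations with N in the denominator (assumed convention).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in item_fractions) / n)
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in total_scores) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("DC is undefined when all scores are identical")

    return sum_xy / (n * s_x * s_y)


# Invented data: three students with increasing total scores.
totals = [2, 4, 6]
dc_positive = discrimination_coefficient([0.0, 0.5, 1.0], totals)
dc_negative = discrimination_coefficient([1.0, 0.5, 0.0], totals)
print(dc_positive, dc_negative)
```

Unlike the sketch of an index based on thirds, every student contributes to this value, which is the property the text relies on when it describes DC as the more sensitive parameter.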
The advantage of DC with respect to DI is that the former uses the entire population of the analysis group to obtain information for its decision, and not just the extreme upper and lower thirds as DI does. Consequently, DC can be considered more sensitive in detecting the performance of the items or questions. In our case, as shown in Figure 8, the detection of the questions that should not be considered in future versions of the tests is more evident with DC than with DI. The graphs of Figures 7 and 8 illustrate these comments.
For Test 1, besides Question 17, which was detected by DI, Questions 5, 6, 7, 11, 19, and possibly 29 also show serious problems in their resolution by the students. For Test 2, the only difference between the values obtained with DI and those reported using DC is the value reached on its scale, as well as its closer proximity to zero. However, DI and DC describe a similar order and relation.
To finish, although the following comments are outside the scope of the statistical analysis of the tests in this work, Figure 9 shows, for our case, the average time employed in the resolution of the test in relation to the average grade reached by students. In general, and for the case of Test 1, there are high scores that are unrelated to the time spent on the resolution. This fact could be used to detect characteristics relevant to learning, such as bright or effective students. However, students with low scores, who recognize their knowledge deficiency, do not make adequate use of all the available time to complete the questionnaire. In the case of Test 2, the students who achieve high or medium scores do not use the total available time (up to 27 min), whereas students with low scores do use it. It is evident that in this test, the resolution time should be adjusted downward, to better adapt its use and evaluation.

Monitoring of the Process and Result of the Improvement
To verify the adaptations, modifications, and replacement proposed in both tests, we performed a second evaluation on students of the following year. For this occasion, the thematic content and teacher were the same, but the students were different. In the comparative analysis of "pre" and "post" test, we observed an improvement in the control parameters.
To obtain control parameters representing the test study, we obtained the average of the results of each index calculated before (FI, SD, DI, and DC), summing the individual value of each question and then dividing by the number of test questions. These parameters measured in global terms whether a test is easier than the other or if the results are more uniform or dispersed.
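The averaging step described above (sum the value of an index over all questions and divide by the number of questions) can be sketched as follows. The per-question values below are invented placeholders for illustration and are not the results of the study; the function and variable names are also assumptions.

```python
def average_index(values):
    """Mean of one control index over all the questions of a test."""
    if not values:
        raise ValueError("at least one question is required")
    return sum(values) / len(values)


# Hypothetical per-question indices for a three-question test.
per_question = {
    "FI": [80.0, 70.0, 90.0],
    "SD": [0.25, 0.5, 0.75],
    "DI": [0.5, 0.25, 0.75],
    "DC": [0.25, 0.5, 0.75],
}

test_summary = {name: average_index(values)
                for name, values in per_question.items()}
print(test_summary)
```

Computing the same summary for two administrations of a test gives one value per index for each, which is the kind of pair that a "pre" and "post" comparison works with.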
As a result of this second evaluation, the average control parameters were assessed again and associated with the term posttest. The results obtained were as follows: Posttest: For FI: Test 1: between 69% and 81% (average 72.1%); Test 2: between 75% and 84% (average 73.5%). For SD: Test 1
Finally, comparing the pretest and posttest results (see Figure 10), one can say that the average FI parameters facilitate the resolution of the test (negative slope of the lines between pretest and posttest). However, lower values are reported in the case of SD (greater uniformity of response in the test). In the case of DI and DC, values are reported above the initial level (positive slope of the lines between tests), indicating that these tests are more robust and useful as a tool to assess student knowledge.

Conclusion
The final general comments are as follows: On the Moodle platform, the tool "Questionnaires" gives faculty the possibility to implement active learning and self-learning experiences for educational purposes. It is also a simple-to-use instrument that is suitable for evaluating the knowledge level reached by students.
The questionnaires available on this platform are a powerful and versatile tool, with applications in educational aspects such as self-learning and learning evaluation, and as a criterion for particular adaptation in teaching.
This tool allows the promotion of learning activities outside the classroom, reduction in evaluation times (especially in big groups of students), and detection of specific or particular needs of a student or group of students.
The implementation of this tool requires extra work by the teacher at the beginning of its use. This initial effort is compensated with the satisfaction that comes from meeting the predicted educational expectations, improvements in the educational level reached, and the acceptance of its use by students.
The specific final comments of this work are as follows: The processed information obtained in tests can contribute "extra information" that allows adapting the entire teaching process in a better form.
The FI permits discernment among the difficulty levels of the questions established in a test, so it can be used as a criterion to select questions and thus to guarantee the suitability of each of them or, failing that, to prompt a scrupulous review of their logic.
The SD permits the detection of knowledge acquisition by students. This parameter has a general and uniform character for all the members of the group (general collective knowledge of the theme). Thus, it contributes criteria of what is or what is not learned by students.
The DI allows one to detect those questions that should be eliminated in tests because they are inadequate for evaluation. This way, the precision of the global score of the test can be improved. It is important to note that these questions are not badly designed, but they are not necessary to evaluate because of their simplicity.
The DC provides a parameter for detecting ineligible questions in a test. It is a more sensitive parameter than DI, so it can be used to select successfully the items most adequate for evaluating the knowledge of students.
The control and analysis of the time used in the evaluation test can contribute adjustments and additional information on the entire evaluation process.

Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding
The author(s) disclosed receipt of the following financial support for the research and/or authorship of this article: The authors thank Project S0117-CTTP-UPC for the financing and the group of investigation GICITED-UPC.