Visual attention while solving the test of understanding graphs in kinematics: an eye-tracking analysis

This study used eye-tracking to capture students’ visual attention while taking a test of understanding graphs in kinematics (TUG-K). A total of N = 115 upper-secondary-level students from Germany and Switzerland took the 26-item multiple-choice instrument after learning about kinematics graphs in the regular classroom. Besides choosing the correct alternative among research-based distractors, the students were required to judge their response confidence for each question. The items were presented sequentially on a computer screen equipped with a remote eye tracker, resulting in a set of approx. 3000 paired responses (accuracy and confidence) and about 40 h of eye-movement data (approx. 500 000 fixations). The analysis of students’ visual attention related to the item stems (questions) and the item options reveals that high response confidence is correlated with shorter visit duration on both elements of the items. While the students’ response accuracy and their response confidence are highly correlated on the score level, r(115) = 0.63, p < 0.001, the eye-tracking measures do not sufficiently discriminate between correct and incorrect responses. However, a more fine-grained analysis of visual attention based on different answer options reveals a significant discrimination between correct and incorrect answers in terms of an interaction effect: incorrect responses are associated with longer visit durations on strong distractors and less time spent on correct options while correct responses show the opposite trend. Outcomes of this study provide new insights into the validation of concept inventories based on the students’ behavioural level.


Introduction
In 1994, Robert Beichner introduced the test of understanding graphs in kinematics (TUG-K) to the physics education research community [1]. It has since become one of the most widely used tests designed to evaluate students' understanding in this subject (see, for example, [2][3][4]). Teachers and researchers have used the TUG-K to assess students' understanding of graphs in kinematics, as well as their learning about them. For example, the TUG-K has been used as a pre- and post-test to investigate the effectiveness of instruction (e.g. video-based motion analysis [5]) or to study the relationship between the understanding of kinematics graphs and other variables (e.g. gender) [6]. The test has also been used as a reference to design new, related tests: for example, a test to assess the understanding of graphs in the context of calculus [2], the kinematics concept test [7] or the kinematics representational competence inventory [8]. For the TUG-K, the graphs of all test items relate to motion in one dimension and the items address the concepts of graph slope and area under the curve for different objectives [1]. All items were created based on extensive research on student difficulties with graphs of position, velocity, and acceleration versus time. Besides the correct alternative, the items offer incorrect options (distractors) that address typical student difficulties with graphs, such as the graph-as-picture error [1].
Recently, several modifications of the TUG-K have been made to achieve parallelism in the objectives of the test [9]. The modified version of the test was shown to satisfy the statistical criteria of difficulty, discriminatory power, and reliability, and the great majority of the modified distractors were found to be effective in terms of their selection frequency and discriminatory power. Although much research has been done to validate the instrument properties, the methods being used are limited to classical test analysis and student interviews [10,11]. While analysing written responses and student interviews is very effective for identifying student misconceptions and for detecting typical sources of erroneous reasoning, these methods are very time-consuming and require interactions between an interviewer and the interviewee [12]. Consequently, these methods are typically used in an early stage of test construction. In addition, the data gathered in the large-scale assessment scenarios described above are typically reduced to the frequency of distractors and the response accuracy rates, neglecting the capability to directly probe the students' thinking processes at a finer level during problem solving. One way to tackle these problems could be the analysis of gaze data recorded during test completion by eye-tracking technology.
Meanwhile, eye-tracking has received attention from the physics education research community. Previous eye-tracking research on students' understanding of kinematic graphs revealed that students with low spatial abilities tend to interpret graphs literally [13], yielding a link between an important cognitive variable (visuo-spatial abilities) and graph understanding. In a similar context, Madsen et al showed that students who answered a question correctly spent more time on relevant areas of a graph, e.g. the axes [14]. Their finding also suggests a link between the previous exposure to a type of problem and the ability to focus on important regions [15]. Recently, Susac et al compared students' understanding of graphs between physics and non-physics students in different contexts (namely, physics and finance) [16]. The authors found that physics students outperformed the non-physics students in both contexts. The analysis of eye-movement data revealed that physics students focused significantly longer on the graph and spent more time looking at questions in the unfamiliar context of finance. The results have broadly been confirmed by Klein et al in a replication study using physics students and a different non-physics sample, namely economics students [17]. The physics students also solved the problems better than the non-physics students but, in contrast to the work by Susac et al, Klein et al found that both groups of students had very similar visit durations on the graphs, consequently proving total visit duration to be an inadequate predictor of performance. These contradictory results require further investigation.
Some items of the TUG-K have been investigated with eye-tracking technology before. Kekule reported qualitative results that indicate different task-solving approaches between best- and worst-performing students when they solved some specific items of the TUG-K [18,19]. However, up to now no study has used the complete test and applied a rigorous analysis to all test items. Given that the TUG-K is continuously applied for research purposes and researchers are still putting effort into its development and modification, a thorough analysis of visual attention might reveal further information about test validity, distractor functionality, discrimination, and about the students' cognitive processes involved in problem solving on the test level. Accurate measures of the time spent on the various distractors and how students distribute their attention (we use time as a proxy for attention) will provide insights on the cognitive demands of the complete test rather than individual items on a behavioural level.
Following this line of research, Han et al recently used eye-tracking technology to investigate students' visual attention while taking the complete force concept inventory (FCI) in a web-based interface [20]. They compared two samples of students, one that was tested before instruction and the other after attending classical mechanics lectures for several weeks. The authors were able to show that the students' performance increased but there was no correlation between the performance and the time spent to complete the test. The authors also investigated three questions about Newton's third law more carefully. They found that even though the students' expertise shifted towards more expert-like thinking, the students still kept a high level of attention at incorrect choices that addressed misconceptions, indicating significant conceptual mixing and competition during problem solving while answering these three items. In contrast to the TUG-K, the FCI does not emphasize a special kind of representation (i.e. graphs) and uses exclusively text-based options. Hence, there is value in expanding the research methodology to investigating complete item sets of other concept inventories. In addition, it is also of importance to extend the method of classifying answer choices to all test items (instead of using a subset) to generalize the hypothesis of conceptual mixing and to investigate how the mixing is influenced by student expertise.
Apart from eye-tracking, the assessment of confidence ratings adds more information about the quality of students' understanding when answering multiple-choice questions [21][22][23][24]. In general, confidence judgements reflect the examinee's belief in the correctness of the chosen alternative and can be interpreted as one aspect of metacognition [21]. Being able to distinguish between right and wrong answers may be a condition to reflect and regulate learning and is linked to the students' ability to evaluate their own understanding. In the above-mentioned eye-tracking study on students' graphical understanding of the slope and the area concept, Klein et al assessed the response confidence level of the students and found that physics students have a higher ability to correctly judge their own performance compared to economics students [17]. While the authors evaluated the confidence scores quantitatively, they did not link the confidence results to the eye-tracking data they obtained, leaving this question open for future research. In an investigation about a more advanced physics topic (namely, rotational frames of reference), Küchemann et al used two multiple-choice questions to assess students' accuracy and response confidence after the students observed real experiments about non-inertial frames of reference [25]. The authors identified that longer visit durations were related to low confidence ratings, suggesting that students with low confidence lack a cohesive understanding of the representations involved and tried longer to make referential connections between them as compared to confident students.

Research questions
In this study, eye-tracking was used to record students' visual attention while completing a computer-based version of the TUG-K. The goal of the study was to investigate how the students' conceptual understanding of kinematics graphs influenced the students' allocation of attention to the questions and to the different answer options. Additionally, the study explored how the students' response confidence was related to their eye movements and to their conceptual understanding. To ensure that the students had sufficient prior knowledge to reasonably understand all questions, the test was administered as a post-test after the topic was taught at school. The research questions are twofold: 1. How is the time spent on individual questions and options related to student response accuracy and to students' response confidence? 2. How do students of different expertise (as indicated by the total test score) distribute their attention on different answer options?
For the second research question, the different option choices of individual items were classified based on a distractor analysis, resulting in four categories: correct option, most popular incorrect options (oftentimes addressing misconceptions), popular incorrect options and unpopular (implausible) incorrect options, see section 3.2 for details. Based on prior findings of expertise research [26], the following hypothesis was examined: Experts spend more time on conceptually relevant areas of the option (i.e. the correct alternatives) and less time on conceptually irrelevant areas (i.e. the strong distractors and the implausible options) compared to novices.

Subjects and data collection
The sample consisted of upper-secondary students from two German grammar schools ('Gymnasium', N = 68) and one Swiss Gymnasium (N = 47), including 12 different physics courses in total. When the test was administered, the subject kinematics had been completed in all courses. According to the teachers, the formal basics of kinematics had been presented to the students, including different types of motion, by lecture and demonstration experiments. Therefore we can consider the measurement in this study as a post-test. All physics teachers discussed kinematics graphs in their courses but the intensity was neither assessed nor controlled. Apart from that, no special educational requirements were necessary in regard to the participating physics courses.
All students (58 female, 57 male; all with normal or corrected-to-normal vision) completed the TUG-K in its original sequence (26 questions in German language) on a computer screen equipped with an eye tracker. Four identical eye-tracking systems were set up in the school libraries and the students took part in the experiment in groups of up to four, either during their spare time or during regular lessons (given that they had the teachers' permission). At the beginning of the experiment, students were placed in front of a computer screen, and a nine-point calibration and validation procedure was used. A researcher (one of the co-authors) instructed the students before the experiment started, carried out the eye-tracking calibration, and checked the accuracy of the gaze detection during the experiment. Besides the test takers and the researchers, no other persons were present in the room. The students read the material without any interruptions by the researcher. The students were free to use as much time as needed for answering the questions. Whenever a student was ready to give an answer, he or she pressed a button and gave his or her answer as well as his or her response confidence. The students did not receive any feedback after completing a task and were unable to skip back to previous tasks.

Materials
The TUG-K contains 26 multiple-choice questions. The original version of the test exists in the German language [27] but the modified version (TUG-K 4.0), which was used in this study, does not. Therefore, the modified items were translated into German. The test was converted into a computerized assessment using the Tobii Studio eye-tracking software, which was also used to present the stimulus to the subjects and to host the eye tracker [28]. Each item was presented on one single slide, starting with the item question on the top of the page. The item question contained either only text or text and a graph. Each item had five answer options presented either as graphs or as text (written words or numbers). The concepts addressed in the single items can be found in the original publication of the test [1,9] and a short overview is given in table 5 in the appendix. The TUG-K is a research-based instrument, and all items are developed based on qualitative studies that pointed towards typical student errors and misconceptions with kinematic graphs. In many cases, the incorrect answer options are well designed to address typical learning difficulties with kinematic graphs. These strong distractors were constructed based on empirical research and typically receive many votes from students who answer incorrectly. However, some of the alternatives are quite implausible and do not reflect popular misconceptions. These alternatives were only selected by a minority of students who answer incorrectly as indicated in prior investigations [9] and by our own data set.
Based on the test takers' choices of the distractors provided in the original work (N≈500) [9], the options were split into four categories (see table 4 in the appendix for an overview):
• correct option;
• most popular incorrect options (highest proportion of incorrect choices);
• popular incorrect options (more than 10% and less than the most popular choices); and
• unpopular incorrect options (quite implausible and chosen only rarely, at most 10%).
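The categorization rule listed above can be illustrated with a minimal sketch. The function name, the example choice shares, and the tie-breaking behaviour are assumptions made for illustration; whether the 10% cut-off refers to all responses or only to incorrect ones is not specified in the text, and the sketch simply applies it to the shares it is given.

```python
def categorize_options(choice_share, correct, threshold=0.10):
    """Assign each answer option to one of the four categories.

    choice_share: dict mapping option label -> share of test takers choosing it.
    correct: label of the correct option.
    threshold: share above which a distractor counts as 'popular' (assumed 10%).
    """
    distractors = {k: v for k, v in choice_share.items() if k != correct}
    # The distractor with the highest share is the 'most popular' one.
    # Ties are resolved by dictionary order here, which is an arbitrary choice.
    top = max(distractors, key=distractors.get)
    categories = {correct: "correct"}
    for label, share in distractors.items():
        if label == top:
            categories[label] = "most popular"
        elif share > threshold:
            categories[label] = "popular"
        else:
            categories[label] = "unpopular"
    return categories


# Hypothetical choice shares for a single item (not taken from table 4).
shares = {"A": 0.05, "B": 0.38, "C": 0.14, "D": 0.07, "E": 0.36}
print(categorize_options(shares, correct="B"))
```

An item may have no option in the 'popular' category at all if only one distractor attracts more than 10% of the choices.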

Definition of areas of interests (AOIs)
For each item, two areas of interest (AOIs) were defined. The first AOI ('Q') covers the question (also referred to as item stem) that consists of text or text and a graph. The second AOI ('O') covers all five answer options. Figure 1 shows one example for the definition of AOIs with an excerpt of the eye-tracking data from one student.
Furthermore, we have defined five smaller rectangular AOIs ('A'-'E') inside the O-AOI for every item. They are equal in size and non-overlapping, each surrounding one of the five answer options. In other words, O contains the subsets (A-E) and some white space between the options. Since we do not compare the items among each other, it does not matter that the AOIs differ in size between questions. Based on the measurement data, 96% of the fixations were located inside Q ∪ O.
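How fixations map onto nested rectangular AOIs can be sketched as follows. This is only an illustration: the pixel coordinates are hypothetical, and the time per AOI is approximated by summing fixation durations, which may differ from the visit-duration metric exported by the eye-tracking software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AOI:
    """Axis-aligned rectangular area of interest in screen pixels."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


def dwell_time_per_aoi(fixations, aois):
    """Sum fixation durations (ms) per AOI.

    fixations: iterable of (x, y, duration_ms) tuples.
    A fixation inside option AOI 'A' also counts for the enclosing 'O' AOI.
    """
    totals = {aoi.name: 0.0 for aoi in aois}
    for x, y, duration in fixations:
        for aoi in aois:
            if aoi.contains(x, y):
                totals[aoi.name] += duration
    return totals


# Hypothetical layout on a 1920x1080 screen: question on top, options below.
aois = [
    AOI("Q", 0, 0, 1920, 400),
    AOI("O", 0, 400, 1920, 1080),
    AOI("A", 50, 450, 400, 1000),
    AOI("B", 420, 450, 770, 1000),
]
fixations = [(300, 200, 250), (100, 500, 180), (500, 600, 220), (410, 700, 90)]
print(dwell_time_per_aoi(fixations, aois))
```

The last fixation falls into the white space between options A and B, so it contributes to O but to none of the option AOIs.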

Eye-tracking apparatus
The tasks were presented on a 22-in. computer screen. The resolution of the computer screen was set to 1920×1080 pixels with a refresh rate of 75 Hz. Eye movements were recorded with a Tobii X3-120 stationary eye-tracking system [28], which had an accuracy of less than 0.40° of visual angle (as reported by the manufacturer) and a sampling frequency of 120 Hz. The system allows a relatively high degree of freedom in terms of head movement (no chin rest was used). Regular eye movement alternates between fixations and saccades. During fixations the visual gaze on a single location is maintained to process information. Saccades are eye movements used to move the focus of visual attention from one point of interest (fixation) to another, see figure 1. To detect fixations and saccades, an I-VT (identification by velocity threshold) algorithm was adopted [29]. An eye movement was classified as a saccade (i.e. in motion) if the acceleration of the eyes exceeded 8500° s⁻² and velocity exceeded 30° s⁻¹.
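The core idea of a velocity-threshold classification can be sketched in a few lines. This is a simplified illustration, not the filter used by the eye-tracking software: it assumes gaze positions are already given in degrees of visual angle, applies only the 30° s⁻¹ velocity criterion (the acceleration criterion is omitted), and leaves out preprocessing such as noise filtering. All function names are hypothetical.

```python
import math


def classify_ivt(samples, sampling_rate_hz=120.0, velocity_threshold=30.0):
    """Label each inter-sample interval as 'fixation' or 'saccade'.

    samples: list of (x_deg, y_deg) gaze positions in degrees of visual angle.
    Returns len(samples) - 1 labels, one per pair of consecutive samples.
    """
    dt = 1.0 / sampling_rate_hz
    labels = []
    for (x0, y0), (x1, y1) in zip(samples, samples[1:]):
        velocity = math.hypot(x1 - x0, y1 - y0) / dt  # angular velocity in deg/s
        labels.append("saccade" if velocity > velocity_threshold else "fixation")
    return labels


def fixation_durations(labels, sampling_rate_hz=120.0):
    """Merge consecutive 'fixation' intervals and return their durations in ms."""
    durations, run = [], 0
    for label in labels + ["saccade"]:  # sentinel flushes the final run
        if label == "fixation":
            run += 1
        elif run:
            durations.append(run * 1000.0 / sampling_rate_hz)
            run = 0
    return durations


# Hypothetical gaze trace: two short fixations separated by a fast jump.
trace = [(0.0, 0.0), (0.05, 0.0), (0.1, 0.02), (2.0, 0.5), (4.0, 1.0), (4.05, 1.0), (4.1, 1.0)]
labels = classify_ivt(trace)
print(labels)
print(fixation_durations(labels))
```

At 120 Hz, a displacement of 0.25° between two consecutive samples already corresponds to 30° s⁻¹, which is why small jitter during a fixation stays below the threshold while a saccade of several degrees clearly exceeds it.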

Data and data analysis
For the analysis we used the response accuracy (0=incorrect response, 1=correct response), the response confidence (as measured by a Likert scale ranging from 1 to 6), and the visit duration. The confidence ratings were linearly transformed to a [0, 1]-scale where 0 means lowest and 1 means highest confidence. The confidence index and the difficulty index then refer to the average of the students' confidence ratings and accuracy scores per item, respectively. The median per item was used to define three confidence levels: low (below median), intermediate (equal to the median), and high (above median). The visit duration was calculated for the single AOIs (Q, O, A-E). This measure was conveniently extracted using the Tobii Studio software.
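The confidence transformation and the median split described above amount to the following small computation. The ratings are hypothetical and the function names are chosen for illustration only.

```python
from statistics import mean, median


def rescale_likert(rating, low=1, high=6):
    """Map a Likert rating from [low, high] linearly onto [0, 1]."""
    return (rating - low) / (high - low)


def confidence_level(rating, item_median):
    """Classify a rating relative to the item's median rating."""
    if rating < item_median:
        return "low"
    if rating > item_median:
        return "high"
    return "intermediate"


# Hypothetical confidence ratings of seven students on one item.
ratings = [2, 4, 5, 5, 6, 3, 5]
item_median = median(ratings)
confidence_index = mean(rescale_likert(r) for r in ratings)
levels = [confidence_level(r, item_median) for r in ratings]
print(item_median, round(confidence_index, 2), levels)
```

Because the median is computed per item, the same raw rating can count as low confidence on one item and as high confidence on another.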
A two-factorial analysis of variance (two-way ANOVA) was used to investigate the differences in the eye-tracking measures between response accuracy levels (factor 1; correct versus incorrect) and between response confidence levels (factor 2; high versus intermediate versus low) as well as their interaction. Given the huge data set (26 items×115 students ≈3000 data points), each effect was considered statistically significant when the p-value was below the 0.1% threshold (p<0.001). In these cases, we also report the effect size measure in terms of Cohen's d to judge the magnitude of the phenomenon. Descriptors of magnitudes for d=0.01 to 2.0 have initially been suggested by Cohen and expanded by Sawilowsky; d>0.01: very small, d>0.20: small, d>0.50: medium, d>0.80: large, d>1.20: very large, d>2.0: huge [30,31].
Furthermore, the students' visit duration on options representing different concept models were compared using a repeated measure ANOVA with the attention on option choices as the within-subject variable (correct versus incorrect options or more fine-grained categories, see above) and the expertise level as the between-subject variable.
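The verbal descriptors for Cohen's d quoted above can be written as a simple lookup. This is a sketch under one stated assumption: the boundaries are treated as inclusive, so that a value of exactly d=0.20 is labelled 'small', and the label for values below 0.01 is a placeholder that does not come from the cited sources.

```python
def describe_effect_size(d):
    """Return the magnitude descriptor for Cohen's d (Cohen, extended by Sawilowsky)."""
    thresholds = [
        (2.0, "huge"),
        (1.20, "very large"),
        (0.80, "large"),
        (0.50, "medium"),
        (0.20, "small"),
        (0.01, "very small"),
    ]
    magnitude = abs(d)
    for bound, label in thresholds:
        if magnitude >= bound:  # inclusive boundary (assumption, see lead-in)
            return label
    return "below 'very small'"  # placeholder label, not from the sources


for value in (0.07, 0.20, 0.34, 0.95, 5.94):
    print(value, describe_effect_size(value))
```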

Descriptives and correlations
The students' mean test score and standard deviation were (59 ± 25)%, ranging from 4% (1 question correct, 1 student) to 100% (26 questions correct, 2 students). A histogram is shown in figure 2(a). The mean response confidence score was (72 ± 15)%, ranging from 29% (1 student) to 100% (1 student). Correct answers (N = 1773) were given with higher response confidence (78%) than incorrect answers (N = 1217; 63%), and consequently, the students' mean response confidence score for correct answers (74%) was higher compared to incorrect answers (67%), t(112)=6.6, p<0.001, see also figure 2(b). A correlation analysis confirmed this result: accuracy and confidence scores are significantly correlated, r(115)=0.63, p<0.001. Table 1 summarizes the students' response accuracy and confidence scores for each item. The items cover a reasonable range of difficulty from 0.34 (hardest; item 1) to 0.89 (easiest; item 13 and item 14), and the mean score of 0.59 falls into the range suggested by [32]. The confidence index varies from 0.57 (item 10) to 0.83 (item 25). Both indices are correlated on the item level, r(26)=0.57, p<0.01. The eye-tracking data was used to calculate the time required to complete the TUG-K, the time spent on each single TUG-K question, and the visit duration on different elements of each question. The average time spent completing the TUG-K was (20.2 ± 4.6) min, ranging from 10.4 to 34.1 min. There is no significant correlation between total time spent on the TUG-K and the accuracy score, r(115)=−0.17, p>0.05, whereas the correlation between time spent on the test and confidence scores is significant, r(115)=−0.25, p<0.01. For each item, the mean visit duration was split into time spent on the question (Q) and time spent on the options (O), see table 1. On the item level, there is no correlation between the difficulty index and any of the visit-duration measures, confirming the result from above. In contrast, the confidence index is correlated with the visit duration on the question, r(26)=−0.54, p<0.01, but not with time spent on the options.

Visit duration on questions and options
Student time spent on the different elements of an item (question or options) was analysed using a two-way ANOVA with response accuracy as factor 1, confidence as factor 2, and the time as the dependent variable. The results are summarized in figures 3(a), (b) and table 2. For visit duration on the question, we found a statistically significant main effect of confidence [F(2, 2984)=44.6, p=10⁻²⁰, small effect size d=0.20], whereas the response accuracy had no significant impact [F(1, 2984)=0.2, p>0.001]. Figure 3(a) shows that low confidence is related to longer visit durations both for correct and for incorrect responses. However, the difference between low and intermediate confidence is bigger for correct responses than for incorrect responses, indicating an interaction effect. Indeed, the interaction between confidence and accuracy is statistically significant but the size of the effect is negligible [F(2, 2984)=8.75, p=0.001, very small effect size d=0.07].
For visit duration on the options we found a significant main effect of confidence with small effect size (d=0.34) whereas the accuracy had no impact on the visit duration on the options. Figure 3(b) again shows that low confidence is related to longer visit durations, and the trend is very similar between correct and incorrect responses without an interaction effect.

Students' visual attention on different answer options
Student fixations on different answer options reveal information about their reasoning process when solving problems on kinematics graphs. As described in section 3.2, most of the incorrect alternatives were constructed based on student difficulties with graphs and prior studies showed that some of them were more popular than others (in terms of being selected by the students), see table 4 in the appendix. Analysis of visual attention on different answer options might therefore reveal important information about the attractiveness of the alternatives and the presence of misconceptions on a behavioural level. The results in the previous section showed that the students' understanding, as reflected by their response accuracy (i.e. correct or incorrect), is not associated with the time spent on the options. As a next step, the answer options were distinguished more carefully: (i) visual attention on correct and incorrect options is investigated and related to the response accuracy and (ii) three categories of incorrect options are differentiated (most popular, popular, and unpopular) and the visual attention on these categories is related to the students' expertise level as an indicator of their conceptual states. We defined the expertise level of the students by the total test score. For the second purpose, no distinction is made between text-based, graphical, and numerical choices.
To ensure an unbiased comparison between option categories, the visit duration is normalized to the number of options that each category consists of. When comparing the visit duration between the one correct and the four incorrect alternatives, the mean visit duration among the four incorrect alternatives was calculated. Figure 4(a) shows the visit duration on correct and incorrect options for students who responded correctly and incorrectly. Since every participant fixated on every option type in every item, the data structure calls for a repeated measure ANOVA (2 × 2) with response accuracy as the between-subject factor and option type as the within-subject factor. We found a significant main effect of option type, F(1, 2921)=402, p<10⁻²⁰, effect size d=0.80. That is, correct options received more attention from students than incorrect options. The interaction between accuracy and option type is also statistically significant with large effect size, F(1,2919)=540, p<10⁻²⁰, d=0.95: Correct responses are associated with more attention on correct options and less attention on incorrect options, whereas there is an opposite trend for incorrect answers, confirming the research hypothesis from section 2. This result was obtained using the accuracy and eye-tracking data from all items. By repeating the analysis for every single item, the occurrence and the magnitude of the interaction effect between accuracy and option type reveal whether the hypothesis also holds on the item level. The data is presented in table 3. As can be seen, all interaction effects with one exception (item 12) are significant on the 5% level and the descriptive data confirms the trend described above, i.e. that the correct options received more attention when students were answering correctly whereas the incorrect options received more attention when students were answering incorrectly. The effect sizes range from small (item 20; d=0.41) to huge (item 21, d=5.94).
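The normalization described at the start of this paragraph, dividing the time per category by the number of options in that category, can be written out as a short sketch. The visit durations and the function name are hypothetical.

```python
def normalized_visit_duration(durations, categories):
    """Return the mean visit duration per option within each category.

    durations: dict mapping option label -> visit duration in seconds.
    categories: dict mapping option label -> category name.
    """
    sums, counts = {}, {}
    for option, seconds in durations.items():
        category = categories[option]
        sums[category] = sums.get(category, 0.0) + seconds
        counts[category] = counts.get(category, 0) + 1
    return {category: sums[category] / counts[category] for category in sums}


# Hypothetical visit durations (s) of one student on one item; B is correct.
durations = {"A": 1.0, "B": 6.0, "C": 3.0, "D": 1.0, "E": 5.0}
categories = {"A": "incorrect", "B": "correct", "C": "incorrect", "D": "incorrect", "E": "incorrect"}
print(normalized_visit_duration(durations, categories))
```

Without this step, the four incorrect options together would almost always accumulate more time than the single correct option simply because there are more of them.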
Table 3. Visit duration (VD) in seconds for correct and incorrect options by response accuracy (correct versus incorrect). Standard errors are given in parentheses. The statistics were obtained from a repeated measure ANOVA and refer to the interaction between option type and accuracy on VD.
Since 26 statistical tests were conducted on one data set to obtain these results, effects of multiple testing have to be controlled for. Therefore, we also indicate statistical significance based on a lower threshold defined by the (conservative) Bonferroni correction and find that the interaction effect disappears for the items 13, 20, and 26. For a more detailed analysis of the answer choices, the distractor analysis from the original data was used [9]. Based on the frequency of answer choices published by Zavala et al, three categories of incorrect alternatives have been defined (most popular, popular, and unpopular; see section 3.2 and table 4) and the mean visit duration on each category was determined for every person and every item. The visit duration on every option category is shown in figure 4(b) as a function of the students' expertise (that was determined by the total test score). A repeated measures ANOVA (4×5) was conducted with the option category as the within-subject variable and the expertise level as the between-subject variable. A significant main effect of the factor option type was found, F(3, 4734)=110.4, p<10⁻²⁰, d=0.55, indicating that the attention was not equally distributed among the different option types. Post-hoc analyses with pair-wise comparisons revealed that the correct options received significantly more attention than all other types of answer choices with effect sizes between d=0.14 and d=0.46. The unpopular choices received the least attention (d=0.18−0.46), and the most popular and popular incorrect options are in between (separated from each other with effect size d=0.16).
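The Bonferroni correction applied to the 26 item-level tests divides the significance level by the number of tests. The sketch below shows the arithmetic; the p-values are hypothetical placeholders and are not the values from table 3.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def significant_after_correction(p_values, alpha=0.05, n_tests=None):
    """Flag which tests remain significant after Bonferroni correction.

    p_values: dict mapping a test label to its p-value.
    n_tests: total number of tests conducted (defaults to len(p_values)).
    """
    if n_tests is None:
        n_tests = len(p_values)
    threshold = bonferroni_threshold(alpha, n_tests)
    return {label: p < threshold for label, p in p_values.items()}


# 26 item-level tests at the 5% level give a per-test threshold of about 0.0019.
print(round(bonferroni_threshold(0.05, 26), 4))

# Hypothetical p-values for three items (placeholders for illustration only).
example = {"item X": 0.0001, "item Y": 0.01, "item Z": 0.03}
print(significant_after_correction(example, alpha=0.05, n_tests=26))
```

An effect with p=0.01 is significant at the uncorrected 5% level but not after the correction, which is the pattern described for the items whose interaction effect disappears.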
Furthermore, a significant interaction effect with small effect size was confirmed, F(12, 4734)=8.1, p<10⁻¹⁵, d=0.27. The interaction between students' expertise level and option type is most pronounced for the highest expertise levels. When transitioning from the 80% to the 100% expertise level, the students focus more often on the correct options and their visual attention on all other options decreases. As can be seen from figure 4(b), the difference between the students' attention on the most popular incorrect options and the correct options increases from low to high expertise levels.
Table 4. Categorization of options based on the empirical results from the original work by Zavala et al [9]. Columns: Item, Correct, Most Popular, Popular, Unpopular. Footnote a: Option choices A-E are related to graph choices I-V.

Discussion and conclusion
In this study, we used eye-tracking to capture the students' visual attention while they solved the test of understanding graphs in kinematics (TUG-K) during a computer-based assessment scenario. Besides choosing the correct alternative among research-based distractors, the students were required to judge their response confidence for each question. Even though all students were exposed to the subject at school, the mean test score of 59% does not indicate mastery of all concepts in the test, indicating that more emphasis should be put on graph interpretation. Results from recent studies also highlight the importance of an instructional adjustment towards a more graphical-based education [16,17]. Especially items 1, 4, 10, 16, and 19, which deal with the area concept, were among the hardest for the students, confirming previous findings about students' difficulties with the interpretation of area under a curve [1,5,33].
Overall, the students provided correct answers with higher confidence ratings in comparison to when they gave incorrect answers. Thus, the students' ability and their confidence were highly intercorrelated. However, some items show below-average item difficulty but above-average confidence scores, indicating that the correlation between response accuracy and response confidence varies among the items. For instance, students were quite confident (70%) when responding to item 4 (calculating the area under a curve) but they were correct only in 38% of the cases. The strongest distractor (E, chosen by 36% of students) reflects the popular error of determining the area of a rectangle instead of a triangle. Therefore, the participants were able to select an incorrect alternative that reflected their flawed thinking process, resulting in a high response confidence, an observation that was also examined elsewhere [8]. It was not in the scope of this paper but the analysis of confidence scores besides response accuracy might reveal more information about student misconceptions, and we encourage researchers to add confidence scales to their assessment.
From the eye-tracking data, we found that the TUG-K took about 20 min to complete on average. We found a significant correlation between the time spent on completing the items and the confidence index with longer visit durations indicating low confidence. This result was confirmed by analysing the time spent on the questions and option choices with respect to the students' confidence ratings on single test items. The effect was more pronounced for the options (d=0.34) than for the questions (d=0.20) and the visit duration on the options provided a good discriminator between low, intermediate and high confidence levels. The magnitudes of effect sizes are similar to those reported in a previous study comparing visit durations and confidence ratings [25]. Students with high confidence appear to think they know what answer choice to look for (in the case of number choices) or which features of the graphs are relevant to solve the questions (in the case of graph choices) so it takes less time to evaluate the options. In contrast, students with low response confidence might take more options into consideration and compare them, and therefore need more time to select an alternative. The data also showed a very small interaction between confidence and accuracy on the time spent on the question: highly confident students spent less time on the question when answering correctly compared to students who answered incorrectly. For students with low confidence it is vice versa and further qualitative research is required to explain this result.
There was no correlation between time spent on the test and the students' performance in general. There was also no difference in the visit duration between correct and incorrect responses, neither regarding the item stems (questions) nor the options, confirming the result obtained by Han et al in the context of the FCI [20]. However, similar to Han et al, when the options are split into correct and incorrect choices, the attention on these choices discriminates the correct from the incorrect answers in terms of an interaction effect: incorrect responses are associated with longer visit durations on incorrect options and less time spent on correct options, while correct responses show the opposite trend. This confirms a general trend in answering multiple-choice questions found by Tsai et al: while solving a multiple-choice science problem, students pay more attention to chosen options than to rejected alternatives, and spend more time inspecting relevant factors than irrelevant ones [34]. Given that the interaction effect occurs for (almost) every single item of the TUG-K and on the test level (26 items), we can consider the result obtained in this study as a broad generalization of Tsai et al, who investigated a rather small sample of six students and only one item. The effect size ranges from medium to huge on the individual item level and the average effect size (considering all items) is large, pointing towards the practical importance of the effect. However, a few exceptions with no effect at all (item 12) or small effect sizes (items 13, 20, and 26) should be mentioned: three of these items (12, 20, 26) require the selection of a graph from a textual description. The options show five graphs (I, II, ..., V) and the test taker must judge which of the graphs suit a condition (e.g. that shows a motion with constant speed). The final options are presented in a complex multiple-choice format, e.g. '(a) graphs I, III and V', '(b) graphs I and III', etc. Consequently, incorrect answers (for instance '(a) graphs I, III and IV') involve a subset of correct graphs (for instance I and III), so the analysis procedure cannot discriminate between the visit duration on correct and incorrect options by correct and incorrect responses. For future analyses, we advocate treating these items separately. Additionally, we encourage test developers to change the response format from complex multiple-choice to single-choice to have a consistent format.
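The interaction effect discussed above can be pictured as a 2 × 2 table of mean visit durations (response correct or incorrect × option correct or incorrect). The sketch below computes these cell means and a simple difference-of-differences contrast. It is an illustration under stated assumptions, not the analysis used in the study: the record layout, the function names `cell_means` and `interaction_contrast`, and all durations are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (response_correct, option_type, visit_duration_s).
# These values are invented for illustration and are NOT data from the study.
records = [
    (True, "correct", 4.0), (True, "incorrect", 2.0),
    (True, "correct", 5.0), (True, "incorrect", 3.0),
    (False, "correct", 2.0), (False, "incorrect", 4.5),
    (False, "correct", 3.0), (False, "incorrect", 5.5),
]


def cell_means(rows):
    """Mean visit duration for each (response_correct, option_type) cell."""
    cells = defaultdict(list)
    for response_correct, option_type, duration in rows:
        cells[(response_correct, option_type)].append(duration)
    return {key: mean(values) for key, values in cells.items()}


def interaction_contrast(means):
    """Difference of differences: (correct - incorrect options) for correct
    responses minus the same difference for incorrect responses."""
    diff_for_correct = means[(True, "correct")] - means[(True, "incorrect")]
    diff_for_incorrect = means[(False, "correct")] - means[(False, "incorrect")]
    return diff_for_correct - diff_for_incorrect


means = cell_means(records)
contrast = interaction_contrast(means)
print(means)
print("interaction contrast (s):", contrast)
```

A contrast near zero would mean that attention to correct versus incorrect options does not depend on response accuracy; a clearly positive value corresponds to the crossover pattern described in the text. For complex multiple-choice items, where an incorrect answer can contain correct graphs, the `option_type` label is ambiguous, which is one way to see why those items are better treated separately.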
Finally, the incorrect option choices were divided into three categories based on the empirical data about selection frequency obtained in the original study [9]. It was found that unpopular options receive the least attention across all levels of expertise. The higher the expertise level of the students, the more attention is allocated to the correct options and the less attention is allocated to the most popular options. Even at the highest levels of expertise, the most popular incorrect options, which were designed to address popular misconceptions and learning difficulties with graphs, receive more attention than the other incorrect options. Expert students (as indicated by high test scores) who shifted their attention to the correct choices still maintain a higher level of attention on the popular options, indicating conceptual mixing [20]. From an assessment point of view, this result provides evidence for the test validity at the behavioural level. The choices were well designed to reflect typical flawed thinking processes that students encounter during problem solving. Future work could focus on specific alternative conceptions and student errors while taking the TUG-K and on whether they are related to certain eye-gaze patterns.
In summary, the study showed that eye-tracking can make unique contributions to the validation of concept inventories on a behavioural level without using interview or survey data. While simple time measures (visit durations on the question or options) are well suited to discriminate between different confidence levels of the test takers, they do not discriminate the correct from the incorrect performers for this type of question. A more fine-grained partition of option choices, based on educational considerations or empirical data, is required to relate the students' accuracy to eye-tracking measures.