A Tablet-Computer-Based Tool to Facilitate Accurate Self-Assessments in Third- and Fourth-Graders

Although student self-assessment is positively related to achievement, skepticism about the accuracy of students’ self-assessments remains. A few studies have shown that even elementary school students are able to provide accurate self-assessments when certain conditions are met. We developed an innovative tablet-computer-based tool for capturing self-assessments of mathematics and reading comprehension. This tool integrates the conditions required for accurate self-assessment: (1) a non-competitive setting, (2) items formulated on the task level, and (3) limited reading and no verbalization required. The innovation consists of using illustrations and a language-reduced rating scale. The correlations between students’ self-assessment scores and their standardized test scores were moderate to large. Independent of their proficiency level, students’ confidence in completing a task decreased as task difficulty increased, but these findings were more consistent in mathematics than in reading comprehension. We conclude that third- and fourth-graders have the ability to provide accurate self-assessments of their competencies, particularly in mathematics, when provided with an adequate self-assessment tool.


Self-assessment benefits
In their integrative review on the relation between self-assessment and academic achievement, Brown and Harris (2013) [1] reported positive median self-assessment effects ranging from d = 0.40 to 0.45, indicating that student self-assessment is substantially related to educational success; see also [6] [7] [8] [9].
The relation between self-assessment and academic achievement can be explained through the processes of self-regulation [10] [11] [12] and through self-efficacy, the latter defined as "beliefs in one's capabilities to organize and execute the courses of action required to produce given attainments" ([13], p. 3). It has been found that self-regulating students engage in three processes to observe and interpret their behaviors: self-observations, by which students concentrate on specific aspects of their performance relevant to their perception of success; self-judgments, by which they assess the extent to which they have met their goals; and self-reactions, by which they assess how satisfied they are with the results of their actions [10] [8] [11] [12]. A positive or high self-assessment induces higher self-efficacy and consequently leads to goal-setting that is conducive to learning (mastery learning) and to higher persistence in goal attainment [8] [14], and hence to improved learning outcomes. However, a negative or low self-assessment does not automatically lead to low self-efficacy if students believe in their ability to learn and adapt their behavior accordingly through self-regulation [15]. Nevertheless, self-assessment does not automatically induce self-regulation. Its positive impact depends on a variety of factors, for example, characteristics of the learner, the task, and the task outcome; characteristics of the assessment feedback given; and the conceptual model and setting in which the (self-)assessment takes place [16]. Regarding the last factor, student self-assessment combined with feedback from a contextually important evaluator (i.e., teacher, tutor, peer) is more likely to induce self-regulatory activity [16]. Self-assessment training influences one or all of the previously mentioned self-regulatory processes and ideally includes interactions with or feedback from teachers or peers [6] [7] [8] [9].

Self-assessment accuracy
Despite the growing body of knowledge regarding the benefits of student self-assessment, skepticism about learners' self-assessment accuracy remains [1] [23] [24], with self-assessment accuracy generally defined and operationalized as the consistency between self-assessment and corresponding external judgments (e.g., test scores, school grades, peers' judgments, parents' judgments) [25] [24] [23] [1] [26]. A validation measure of high psychometric quality is recommended when measuring the accuracy of students' self-assessments [25].
The self-assessment accuracy of lower performers. The self-assessments of lower performing students tend to be less accurate than those of higher performing students [1]. Claes and Salame (1975) [51] found significant differences (F = 11.25, p < .01) between high and low performers in their self-assessment accuracy. Possible explanations for these findings are that lower performing students may lack an adequate representation of what is expected from them and might not understand the assessment criteria [52] [51] [53], both of which may lead to inaccurate self-assessment. When reporting their school grades, lower performing students may succumb to social desirability or self-enhancement factors [54]. Similar to self-assessment accuracy in relation to student age (see the Self-assessment accuracy and student age section), developmental differences or gaps in metacognitive skills [38] are possible explanations for the observed discrepancy. An interesting finding is that low performers seem to gain the most, in terms of performance gains (ES = 0.58), from self-assessment training, which usually includes the training of metacognitive skills (defining assessment criteria and using them for self-assessment) [52] (see also [55]).

The Present Study
In the present study, we aimed to rigorously and empirically investigate whether third- and fourth-graders in elementary school are able to provide accurate self-assessments of key academic competencies (mathematics and reading comprehension) when equipped with a self-assessment tool that combines all the conditions that are favorable for accurate self-assessment (i.e., a non-competitive setting, task-oriented questions, limited reading, and no verbalization required). We adopted an innovative approach to self-assessment by introducing a tablet-computer-based self-assessment tool that is rich in illustrations and has a language-reduced rating scale, thus reducing the bias that may come along with poor reading and language skills. Classrooms are becoming increasingly digital [56], and tablet technology [57] [58] may help to facilitate and efficiently integrate self-assessment in the classroom. We intend to contribute new, high-quality empirical data to the body of knowledge on self-assessment. First, we collected self-assessment data from two independent, representative samples of third- and fourth-graders (ages 8 to 9 years) in Luxembourg. Second, we used measures of high psychometric quality, namely, the standardized test scores from the national school monitoring program [59], as validation measures for students' self-assessments.

Contextual embedding of the study
The present study was located in Luxembourg, which provides a rather unique learning environment. Specifically, Luxembourg has a distinctly multilingual educational context (three official languages: Luxembourgish, French, and German; bilingual education in French and German; literacy acquisition in German) and a very heterogeneous student population. Almost half of all students in public education are foreign nationals, and more than half do not speak Luxembourgish as their native language (e.g., [59]). International large-scale assessments (e.g., the OECD's PISA studies) have repeatedly shown that many educational systems in modern societies, Luxembourg being no exception, struggle with the adequate handling of increasingly diverse student populations [60]. Understanding and learning how to effectively deal with highly heterogeneous groups of learners (i.e., how to provide equal opportunities for success to everybody, independent of their socioeconomic, sociocultural, and linguistic background) may be considered the largest educational challenge in Luxembourg today.
Although language learning plays a predominant role in elementary education in Luxembourg and consumes over 40% of the total teaching time, 45% of third-graders do not reach the minimum competence standard in German reading comprehension defined by the national school curriculum for this age group [59]. Because reading skills constitute the basis for all future learning, achievement and learning in all other school subjects are at stake [61]. In 2009, Luxembourg's pre- and elementary schools underwent profound changes. Among other changes, the 2009 education act put particular emphasis on formative assessment, regular feedback, and student self-assessment. Given that fair assessments in mathematics should not be confounded with reading skills and that German is the language in which literacy is acquired in Luxembourg, we created an innovative tool that lowers the impact of language and reading on mathematics self-assessment and provides a way for students to assess their German reading comprehension. Thus, we intended to facilitate and support the goals of the education act in the current study.
Luxembourg's increasing diversity, a logical consequence of the demographic change that comes along with a globalized world, is not exclusively a domestic matter. However, owing to several national specificities (e.g., relatively small size, open borders, situated in the heart of Europe, traditionally multilingual, with an economic model built on and relying on immigration), change might occur more quickly in Luxembourg than in other countries. Accordingly, Luxembourg provides a unique educational and societal learning environment, a living laboratory so to speak, that is prototypical and anticipatory of the demographic changes and the related challenges that its geographical and metaphorical neighbors may very likely face over the next decades.

An innovative tablet-computer-based self-assessment tool
We decided to use tablet technology to develop the digital self-assessment tool (referred to as "the tool"). On the one hand, tablet computers offer the same technological opportunities as computers [57]. On the other hand, tablet devices allow maximal mobility when used in schools and classrooms and have an intuitive design, a simple interface, a touch screen, and multimedia capabilities that facilitate the user's interaction with the program, particularly for pre- and elementary school children [57]. On the basis of the principles of the Cognitive Theory of Multimedia Learning (CTML) for the design of multimedia instructional messages [62] [63] [64], we developed a tool that fulfills the following requirements (see also the Self-assessment accuracy and student age section): (1) self-assessment in a non-competitive setting, (2) task-oriented self-assessment, and (3) self-assessment that requires only limited reading and no verbalization. The details of how these requirements were taken into consideration in this tool are elaborated below.
1. In order to avoid competition and social comparison between students [31] [41] [42] [25], we based the items used in the tool on an external reference standard: the national school curriculum. We concentrated on the domains of language (reading comprehension in German) and mathematics (arithmetic operations; geometry and space). The items were developed along three proficiency levels, attuned to the required competency standards for third- and fourth-graders.
2. The self-assessment items are concrete tasks formulated on a sub-competency level and on three proficiency levels. We consciously avoided general questions about competency (e.g., How well are you doing in mathematics?) because they are distal from criterial performance and mastery learning objectives. According to CTML, meaningful learning occurs when the processes of selecting, organizing, and integrating take place for visual and verbal representations. A multimedia instructional message consists of pictures (i.e., animated or static) and words (i.e., spoken or written). It is effective if it helps students hold visual and verbal representations in working memory simultaneously [65] [62]. In our tool, we avoided extraneous material that was not related to the self-assessment tasks, and we aimed to highlight essential material [64] [66]. An important objective for the mathematics tool was the reduction of (written) language. Consequently, we based the items on illustrations and short animations combined with written on-screen text, where necessary, to represent the key sub-competencies in the school curriculum. The language used in the tool is German, the language of instruction in Luxembourg's elementary education. Because the tool was designed for classroom use on an individual basis, we decided not to include spoken language and sound. The reading comprehension tool consists primarily of written text and contains few illustrations.
3. The use of concrete objects [34], pictorial inventories [17] [37], and language-reduced answer scales [67] might be effective for helping young students overcome the verbalization and literacy barriers they experience. Figure 1 provides an example of one of the competencies represented in the tool: the application of arithmetic operations in concrete life situations. In a short video, Paul gets 25 euros and wants to spend them on roller coaster rides. The student is told that 1 ticket costs 6 euros. The student is then asked the actual self-assessment item: How many tickets can Paul buy? The rating is introduced by the sentence: I could solve this problem. We used a language-reduced and pictorial visual analog scale (VAS) as the rating scale for the self-assessments. A VAS is appropriate for capturing subjective perceptions [68] and is considered to be reliable when used with children [67]. Furthermore, a comparative study showed that children with an immigration background preferred the language-reduced VAS over the Likert-type scale [67]. The use of a VAS avoids the bias associated with poor language and reading skills. In our tool, a pictorial nodding head and a pictorial shaking head were placed at the opposite ends of a scale-free line. Students rated how confident they felt about whether they could solve the presented item by moving a slider along the line.

Research aims
Skepticism toward elementary school students' self-assessment persists despite the demonstrated benefits of self-assessment in this age group (see Sections 1.1 and 1.2). In the present study, we wanted to investigate whether or not third-and fourth-graders (ages 8 to 9 years) can provide accurate self-assessments of key academic competencies (mathematics, reading comprehension) when equipped with an adequate self-assessment tool.
More concretely, accurate self-assessment requires three things:
1. Consistency between self-assessment and a corresponding external judgment (e.g., test scores, school grades; see the Self-assessment accuracy section). In other words, students' self-assessments should reflect their actual competencies. The strength of the relationship between self-assessment and a test score is typically identified through correlational analysis. If students' self-assessments reflect their actual competencies, medium to large [40] correlations between self-assessment and test scores should emerge. Moreover, when students are organized into proficiency groups on the basis of their standardized test scores from the national school monitoring program, students in lower proficiency groups should provide lower self-assessments on average than students in higher proficiency groups. Analysis of variance (ANOVA) can be applied to test for significant differences between these groups.
2. Obtaining self-assessments from third- and fourth-graders requires adapted conditions and instruments (see the Self-assessment accuracy and student age section). Self-assessment accuracy with our tablet-computer-based self-assessment tool (mastery condition, items on the task level) requires that students recognize the inherent difficulty of the self-assessment items (see Section 2.2, Points 1 and 2). We can compute self-assessment item mean scores as a function of the (theoretical) difficulty level of the items. ANOVA can be applied to test for significant differences between these mean scores. If students recognize the inherent difficulty of the items, their confidence in solving an item should decrease as item difficulty increases.
3. Lower performers tend to be less accurate in their self-assessments because they do not understand the tasks and assessment criteria or because they succumb to social comparison and desirability when asked to report grades (see the Self-assessment accuracy of lower performers section). The tablet-computer-based self-assessment tool has features that can help overcome these obstacles: It is language-reduced and proposes self-assessment on the task level in a non-competitive setting. Consequently, accurate self-assessment with the tool requires that even lower performers recognize the inherent difficulty of the self-assessment items (see Section 2.2, Points 1 and 2). By organizing the students into proficiency groups on the basis of their standardized test scores from the national school monitoring program, we can compute self-assessment item mean scores for each group as a function of the (theoretical) difficulty level of the items. ANOVA can be applied to test for significant differences between these mean scores within the groups. If lower performers (i.e., students in the lowest proficiency group) recognize the inherent difficulty of the items, their confidence in solving an item should decrease as item difficulty increases.

Sample and procedures
Our study was based on two independent and representative samples from Luxembourg's elementary school population. The samples were drawn randomly from elementary school districts across the country. The students were in Grades 3 and 4 and were 8 to 9 years old. Our tool was designed for classroom use, and we administered it ourselves, bringing the tablet computers to the schools. Teachers were present during the testing.
The first round of data collection with the tool took place in autumn 2014 in 14 different fourth-grade classes. The final samples consisted of N = 191 students (51.31% girls, 48.69% boys) in mathematics and N = 187 students (51.87% girls, 48.13% boys) in reading comprehension. In the mathematics and reading comprehension samples, 42.93% and 44.39% of the students, respectively, predominantly spoke a language other than Luxembourgish or German at home (vs. 53% in the population), and 54.97% and 56.15%, respectively, had an immigration background (first and second generation; vs. 50% in the population). The second round of data collection took place in spring 2015 in 29 different third-grade classes. The final samples consisted of N = 370 students (47.03% girls, 52.97% boys) in mathematics and N = 340 students (45.88% girls, 54.12% boys) in reading comprehension. In these samples, 54.05% (mathematics) and 51.76% (reading comprehension) of the students predominantly spoke a language other than Luxembourgish or German at home, and 49.19% and 49.41%, respectively, had an immigration background (first and second generation). We discarded cases from the initial samples with missing data on the performance tests (see the Standardized tests section). For Grade 3, we discarded data from n = 32 students in the mathematics sample and n = 25 students in the reading comprehension sample. For Grade 4, we discarded data from n = 20 students in the mathematics sample and n = 21 students in the reading comprehension sample. We also discarded cases that were consistently too quick in their self-assessments, with a median time of 2 seconds or less per rating, and that at the same time showed a mean score of 95 or higher on the 0 to 100 visual analog scale (see Section 2.2, Point 3). For Grade 3, we discarded data from n = 4 students in the mathematics sample and n = 29 students in the reading comprehension sample.
For Grade 4, we discarded data from n = 14 students in the mathematics sample and n = 23 students in the reading comprehension sample. We conducted our study with the approval of the Luxembourg Ministry of Education in accordance with the data protection rules of the National Commission for Data Protection. Parents and students were informed in writing about the scientific background of the study well in advance and were given the opportunity to refuse to participate in the study.
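As a rough illustration, the two case-exclusion rules described above (missing performance-test data; median rating time of 2 seconds or less combined with a mean VAS score of 95 or higher) could be expressed as a simple data-frame filter. The column names below are hypothetical and not taken from the study's actual data files:

```python
import pandas as pd

# Hypothetical data; column names are illustrative, not from the study.
df = pd.DataFrame({
    "student_id": [1, 2, 3],
    "test_score": [512.0, None, 448.0],  # standardized test score (missing for student 2)
    "median_rt_s": [7.4, 1.8, 1.5],      # median seconds per self-assessment rating
    "mean_sa": [71.0, 97.0, 96.5],       # mean self-assessment on the 0-100 VAS
})

# Rule 1: discard cases with missing performance-test data.
has_test = df["test_score"].notna()

# Rule 2: discard cases that rated too quickly (median <= 2 s per rating)
# AND gave near-maximal ratings (mean >= 95 on the 0-100 scale).
too_fast_and_high = (df["median_rt_s"] <= 2) & (df["mean_sa"] >= 95)

clean = df[has_test & ~too_fast_and_high]
print(clean["student_id"].tolist())  # -> [1]
```

Student 2 is dropped by rule 1 and student 3 by rule 2, leaving only student 1 in the cleaned sample.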

Measures
Self-assessment. The students' self-assessments in mathematics and German reading comprehension were measured with the tablet-computer-based self-assessment tool. The scales contained 66 items in mathematics and 34 items in reading comprehension. In mathematics, the internal consistency reliability (Cronbach's alpha) was .97 in Grade 3 and .96 in Grade 4; in reading comprehension, it was .94 in Grade 3 and .95 in Grade 4 (see Table 2). The rating scale was a visual analog scale (VAS) with 101 hidden positions, allowing students to score from 0 to 100 after each item. We computed the mean self-assessment scores in mathematics and in reading comprehension separately.
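Cronbach's alpha, the internal consistency estimate reported above, can be computed directly from a students-by-items score matrix. A minimal sketch with made-up VAS ratings (not the study's data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_students, n_items) score matrix."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item sample variances
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Toy ratings on the 0-100 VAS (five students, three items).
ratings = np.array([
    [80, 75, 70],
    [60, 55, 65],
    [90, 95, 85],
    [40, 45, 50],
    [70, 65, 75],
])
print(round(cronbach_alpha(ratings), 2))  # -> 0.96
```

With 66 (mathematics) or 34 (reading) columns instead of three, the same function yields the reliabilities reported in Table 2.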
Each of the self-assessment items in our tool was assigned one of three possible difficulty levels defined on the basis of an external reference standard: the national school curriculum. These assignments were approved by a group of experts composed of elementary school teachers and researchers who focused on item and test development and were responsible for the standardized tests used in the Luxembourg school monitoring program. In the tool, the theoretical item difficulty increases from level 1 to level 3. Level 2 items represent the minimum competency standard required for Grade 3; level 1 items are below and level 3 items are above this standard. In reading comprehension, the numbers of items on levels 1, 2, and 3 were 9, 13, and 12, respectively; in mathematics, they were 4, 21, and 41, respectively. Knowing from other studies that students generally tend to self-assess highly [18] [43], we deliberately integrated more items on levels 2 and 3 than on level 1 to avoid ceiling effects. Moreover, the number of items ensures a valid representation of the content of the measured competencies. The testing time was 40 minutes for mathematics and 20 minutes for reading comprehension. The introduction to the tool took 10 minutes.
Standardized tests. Luxembourg's school monitoring program [59] consists of yearly standardized tests in mathematics and German reading comprehension in Grade 3. These tests are based on the competency standards of the national school curriculum. We used the standardized test scores from the school years 2013/2014 and 2014/2015 to validate the self-assessment measures because they represent an accurate measure of students' academic competencies with high psychometric quality. On these tests, the person parameters (Warm's Weighted Likelihood Estimator [WLE] scores; see [69]) for the whole population of third-graders were standardized to M = 500 and SD = 100. The WLE reliabilities are reported in Table 2.
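The standardization of the WLE person parameters to M = 500 and SD = 100 amounts to a linear rescaling of the population distribution. A minimal sketch with invented WLE values (in the actual program, the rescaling is anchored in the full population of third-graders, so sample means can deviate from 500):

```python
import numpy as np

def standardize_500_100(wle: np.ndarray) -> np.ndarray:
    """Rescale WLE person parameters so the reference population has M = 500, SD = 100."""
    z = (wle - wle.mean()) / wle.std(ddof=0)  # z-scores relative to the population
    return 500 + 100 * z

wle = np.array([-1.2, -0.3, 0.0, 0.4, 1.1])  # invented logit-scale estimates
scores = standardize_500_100(wle)
print(scores.mean(), scores.std())  # mean 500, SD 100 (up to floating point)
```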
In our study, on the basis of students' standardized test scores, we assigned them to three proficiency groups (see Table 1). The cut scores are theoretically grounded and derived from the national monitoring program [70]. Students in proficiency group 1 performed below the minimum competency standard required for Grade 3; those in proficiency group 2 (>437.57 points in mathematics; >484.90 points in reading comprehension) reached the standard; and those in proficiency group 3 (>520.53 points in mathematics; >543.95 points in reading comprehension) performed above the standard. In both samples, the scores from the standardized tests closely corresponded to the standardized population mean of M = 500 and SD = 100 (see Table 2).
Note (Table 1). n = number of students per proficiency group. Proficiency group: 1 = below the competency standard for Grade 3; 2 = at the standard; 3 = above the standard. % in sample = percentage in the self-assessment sample. % in population = percentage in the school monitoring population.
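The group assignment follows directly from the reported cut scores. A small sketch (the function name and structure are our own; only the cut scores come from the text):

```python
# Cut scores from the text: group 2 starts above 437.57 (mathematics) / 484.90 (reading);
# group 3 starts above 520.53 (mathematics) / 543.95 (reading).
CUTS = {"math": (437.57, 520.53), "reading": (484.90, 543.95)}

def proficiency_group(score: float, domain: str) -> int:
    low, high = CUTS[domain]
    if score > high:
        return 3   # above the Grade 3 competency standard
    if score > low:
        return 2   # at the standard
    return 1       # below the standard

print([proficiency_group(s, "math") for s in (400.0, 470.0, 600.0)])  # -> [1, 2, 3]
```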

Statistical analyses
All statistical analyses were computed separately for each sample (Grade 3, Grade 4) and each domain (mathematics, reading comprehension).
We applied correlational analysis (Pearson product moment correlation) between students' mean self-assessment scores and their standardized test scores to check whether students' self-assessments reflected their actual competencies.
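This correlational check can be sketched as follows; the paired scores below are invented for illustration and are not the study's data:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical paired scores: mean self-assessment (0-100 VAS) vs. standardized test.
self_assessment = np.array([62, 71, 55, 80, 68, 90, 47, 75])
test_score = np.array([480, 510, 455, 560, 500, 590, 430, 530])

r, p = pearsonr(self_assessment, test_score)  # Pearson product-moment correlation
print(f"r = {r:.2f}, p = {p:.3f}")
```

A medium to large positive r would indicate that higher self-assessments go along with higher test scores, as required by the first accuracy criterion.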
After assigning each student to one of the three proficiency groups (see the Standardized tests section), we computed a univariate analysis of variance (one-way ANOVA) and post hoc comparisons to test for significant differences between the mean self-assessment scores of these groups. Students' proficiency (three groups) served as the independent variable; the dependent variable was the mean score on the self-assessment scale. We computed another one-way ANOVA and post hoc comparisons to test whether students recognized the inherent difficulty of the self-assessment items. The theoretical item difficulties (three levels) served as the independent variable; the dependent variable was the mean score on the self-assessment scale. Finally, we computed 12 independent one-way ANOVAs and post hoc comparisons, one for each proficiency group in each domain and for each sample (3 x 2 x 2). These analyses allowed us to check whether students recognized the items' inherent difficulty independent of their proficiency. The theoretical item difficulty served as the independent variable; the dependent variable was students' mean score on the self-assessment scale computed separately for each proficiency group.
For all tests, we applied a significance level of α = .05. If there was heterogeneity of variance in the independent variable groups, we applied Welch's F test (robust ANOVA) to test for main effects. If there was a main effect, we computed a Games-Howell post hoc test for a pairwise check of the effects between the groups. When homogeneity of variance was confirmed, we applied the ANOVA F test, followed by a Tukey HSD test when there was a significant main effect. Welch's F test and the Games-Howell and Tukey HSD post hoc tests were computed to control the Type I error rate when heterogeneity of variance and nonnormality were present (e.g., [71] [72] [73] [74]). We used Hedges' g, a measure of effect size weighted according to the relative size of each sample.
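The decision logic and the effect size can be sketched as follows: Levene's test chooses between the classic and the robust (Welch) ANOVA, and Hedges' g is computed from the pooled standard deviation with the usual small-sample correction. The group data below are invented for illustration:

```python
import numpy as np
from scipy import stats

def hedges_g(x: np.ndarray, y: np.ndarray) -> float:
    """Hedges' g: standardized mean difference with pooled SD and small-sample correction."""
    n1, n2 = len(x), len(y)
    s_pooled = np.sqrt(((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1))
                       / (n1 + n2 - 2))
    d = (x.mean() - y.mean()) / s_pooled
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # corrects Cohen's d upward bias
    return d * correction

# Hypothetical mean VAS scores for two proficiency groups.
g1 = np.array([55.0, 60.0, 48.0, 62.0, 50.0])
g3 = np.array([78.0, 85.0, 80.0, 90.0, 74.0])

# Levene's test decides between the classic ANOVA F and Welch's robust F.
_, p_levene = stats.levene(g1, g3)
print("equal variances assumed:", p_levene > .05,
      "| Hedges' g =", round(hedges_g(g1, g3), 2))
```

A negative g here simply reflects the order of the arguments (group 1 minus group 3); its magnitude is what is reported as the effect size.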

Results
On a general level, self-assessments were high, and there was a tendency for them to be higher for reading comprehension than for mathematics (see Table 2).

Relation between self-assessment and standardized tests
The correlations between the self-assessment scores and the standardized test scores were moderate to large [40], ranging from r = .40 to .58 in mathematics and from r = .45 to .46 in reading comprehension (see Table 2). Apart from the correlation of r = .58 in mathematics in Grade 3, the strengths of the relations between self-assessment and standardized test scores did not differ significantly across domains and grade levels.
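A standard way to test whether two independent correlations differ is Fisher's r-to-z transformation. As an illustration, we apply it to the reported Grade 3 mathematics correlation (r = .58, N = 370) and the lower bound of the mathematics range (r = .40, here assumed to correspond to Grade 4 with N = 191; this pairing is our assumption, not stated explicitly in the text):

```python
import math

def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> float:
    """Two-tailed p-value for the difference of two independent Pearson correlations
    via Fisher's r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher z transforms
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))      # SE of the difference
    z = (z1 - z2) / se
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - phi)

p = compare_correlations(0.58, 370, 0.40, 191)
print(round(p, 3))
```

Under these assumed sample pairings, the difference is significant at α = .05, consistent with r = .58 standing out from the other correlations.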
For mathematics, as student proficiency increased, the mean of the self-assessment scores increased (see Table 3). The ANOVAs showed significant main effects (see Table 3). In Grade 3, post hoc comparisons revealed significant differences between the mean self-assessment scores of all three proficiency groups, with medium to large effect sizes (see Table 3). In Grade 4, there were significant differences between proficiency groups 1 and 3 and groups 2 and 3, with medium to large effect sizes (see Table 3).
For reading comprehension, as student proficiency increased, the mean self-assessment scores increased (see Table 3). The ANOVAs indicated significant main effects (see Table 3). The post hoc comparisons revealed significant differences between the mean self-assessment scores of proficiency groups 1 and 2 and groups 1 and 3, with medium to large effect sizes (see Table 3).

Self-assessment by item difficulty level
For mathematics, as the theoretical item difficulty increased, students' self-assessment scores decreased (see Table 4). The ANOVAs indicated significant main effects (see Table 4). Post hoc comparisons revealed significant differences between the mean self-assessment scores of item level groups 1 and 3 and levels 2 and 3, with large effect sizes (see Table 4).
For reading comprehension, as the theoretical item difficulty increased, students' self-assessment scores decreased from level 1 to 2, but they increased again from level 2 to 3 (see Table 4). The ANOVAs revealed significant main effects (see Table 4). Post hoc comparisons showed significant differences between the mean self-assessment scores of item level groups 1 and 2, with large effect sizes (see Table 4).

Self-assessment by item difficulty within proficiency groups
For mathematics, as the theoretical item difficulty increased, students' self-assessment scores decreased in all proficiency groups (see Tables 5 and 6). The ANOVAs revealed significant main effects (see Tables 5 and 6). Post hoc comparisons indicated significant differences between the mean self-assessment scores of item level groups 1 and 3 and levels 2 and 3, with large effect sizes (see Tables 5 and 6).
For reading comprehension, as the theoretical item difficulty increased, students' self-assessment scores decreased from level 1 to 2 but increased again from level 2 to 3 (see Tables 5 and 6). The ANOVAs showed significant main effects (see Tables 5 and 6). In Grade 3, for proficiency group 1, the post hoc comparison reported a significant difference between the mean self-assessment scores of item level groups 1 and 2, with a large effect size (see Table 5). For proficiency group 2, post hoc comparisons indicated significant differences between item level groups 1 and 2 and levels 1 and 3, with large effect sizes (see Table 5). For proficiency group 3, post hoc comparisons did not indicate any effects between groups (see Table 5). In Grade 4, the ANOVA indicated a main effect only for proficiency group 3 (see Table 6). Post hoc comparisons revealed a significant difference between mean self-assessment scores of item level groups 1 and 2, with a large effect size (see Table 6).

Table 3. Self-assessment by proficiency group in mathematics and reading comprehension in Grades 3 and 4. Note. n = number of students. Proficiency group: 1 = below the competency standard for Grade 3; 2 = at the standard; 3 = above the standard. M = mean self-assessment score. SD = standard deviation. Games-Howell post hoc tests used for all comparisons. Mean differences in bold are significant at p < .05. Effect sizes (Hedges' g) are in bold and in parentheses. Levene's F = F-ratio for equality of variance. Welch's F = robust F-ratio for analysis of variance. ANOVA F = F-ratio for analysis of variance. df1 = degrees of freedom for the effect of the model. df2 = degrees of freedom for the residuals of the model. p = probability.

Table 4. Self-assessment by item difficulty level in mathematics and reading comprehension in Grades 3 and 4. Note. n = number of students per proficiency group. ni = number of items per level. Item level: 1 = below the competency standard for Grade 3; 2 = at the standard; 3 = above the standard. M = mean self-assessment score. SD = standard deviation. Tukey HSD post hoc test used for proficiency group 1 in reading comprehension. Games-Howell post hoc test used for all the other groups. Mean differences in bold are significant at p < .05. Effect sizes (Hedges' g) are in bold and in parentheses. Levene's F = F-ratio for equality of variance. Welch's F = robust F-ratio for analysis of variance. ANOVA F = F-ratio for analysis of variance. df1 = degrees of freedom for the effect of the model. df2 = degrees of freedom for the residuals of the model. p = probability.

Table 6. Self-assessment by item difficulty level and proficiency group in mathematics and reading comprehension in Grade 4.

Mean differences |Mi-Mj|
Self-assessment mathematics Self-assessment reading comprehension   Note. n = number of students per proficiency group. ni = number of items per level. Item level: 1 = below the competency standard for Grade 3; 2 = at the standard; 3 = above the standard. M = mean self-assessment score. SD = standard deviation. Tukey HSD post hoc test used for proficiency group 1 in mathematics. Games-Howell post hoc test used for all the other groups. Mean differences in bold are significant at p < .05. Effect sizes (Hedge's g) are in bold and in parentheses. Levene's F = F-ratio for equality of variance. Welch's F = robust F-ration for analysis of variance. ANOVA F = F-ratio for analysis of variance. df1 = degrees of freedom for the effect of the model. df2 = degrees of freedom for the residuals of the model. p = probability.
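The table notes report Hedges' g as the effect-size measure for the pairwise comparisons. As an illustrative sketch only (the data and function name below are hypothetical and not taken from the study), Hedges' g is Cohen's d multiplied by a small-sample correction factor:

```python
import math

def hedges_g(sample1, sample2):
    """Hedges' g: Cohen's d with a small-sample bias correction."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # Unbiased sample variances.
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    # Pooled standard deviation.
    s_pooled = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled
    # Common approximation of the exact small-sample correction J.
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction

# Hypothetical mean self-assessment scores for two item-level groups.
easy_items = [3, 4, 5, 6, 7]
hard_items = [1, 2, 3, 4, 5]
print(round(hedges_g(easy_items, hard_items), 4))  # 1.1425, a large effect
```

By the usual convention, g values of roughly 0.8 and above count as large effects, which is the threshold the tables above refer to.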

Discussion
In the present study, we investigated whether third- and fourth-graders (ages 8 to 9 years) are able to provide accurate self-assessments of their key academic competencies when equipped with an adequate self-assessment tool.
1. The first requirement for an accurate self-assessment was that students' self-assessments reflect their actual academic competencies (see Section 2.3). For both samples and in both domains, students' self-assessment scores showed medium to large correlations [40] with their standardized test scores. Moreover, students in lower proficiency groups provided lower self-assessments on average than students in higher proficiency groups. In other words, students' self-assessments reflected their actual academic competencies.
2. The second requirement for an accurate self-assessment was that students recognize the self-assessment items' inherent difficulty (see Section 2.3). In general, as the (theoretical) item difficulty increased, students' confidence in solving the items decreased. Overall, these findings were less consistent for reading comprehension than for mathematics.
3. The third requirement for an accurate self-assessment was that even lower performers recognize the inherent difficulty of the self-assessment items (see Section 2.3). In general, independent of their proficiency group, students' confidence in solving the items decreased as item difficulty increased. This means that even lower performing students, who tend to provide less accurate self-assessments [1], were able to compare the different items presented in the tool and to recognize their inherent difficulty. As for Point 2 above, these findings were less consistent for reading comprehension than for mathematics.
We conclude from these results that third- and fourth-graders (ages 8 to 9 years) have the ability to provide accurate self-assessments of key academic competencies when provided with an adequate self-assessment tool. Our results were more consistent in the domain of mathematics than in reading comprehension.

5.1 Self-assessment accuracy

Self-assessment accuracy and student age. We deduced from previous studies that the accuracy of pre- and elementary school students' self-assessments is strongly influenced by the conditions under which the self-assessment is practiced [31] [32] [33] [34] and by the appropriateness of the self-assessment tool [35] [36] [37], rather than by students' age per se. Our study supports this argument with its main finding: third- and fourth-graders have the ability to provide accurate self-assessments of key academic competencies, particularly in mathematics, when provided with an adequate tool. In other words, under favorable conditions, even young students can provide accurate self-assessments. In mathematics, our language-reduced and illustration-rich self-assessment tool probably enabled the 8- to 9-year-old students to better understand the items. In both domains, the language-reduced rating scales allowed students to communicate their self-perceptions with greater independence from language and reading skills. The items, which were presented on the task level instead of the domain level, most likely produced good representations of the skills in question and kept students from making comparisons with their peers.
Contrary to the general tendency for students to recognize the self-assessment items' inherent difficulty, there was an exception for reading comprehension in both samples: Students' self-assessment scores decreased from level 1 to level 2 but increased again for the level 3 items. Most level 3 items theoretically measure text interpretation competency. According to the national school curriculum, text interpretation is on a higher level of difficulty than the localization and understanding of information in the text, the competency mostly represented by the level 1 and level 2 items in the self-assessment tool. We found the same pattern in both samples, showing that students consistently had trouble recognizing the increase in theoretical difficulty from the level 2 to the level 3 items. This increase in the mean scores from an easier to a more difficult level was not statistically significant (see Tables 4, 5 and 6). Despite this increase, the level 3 item mean scores remained consistently lower than the level 1 item mean scores, although the differences between the two were not statistically significant. Thus, we argue that the differences between the level 3 and level 2 items as well as between the level 3 and level 1 items are very small and not easily discerned by the students. There might also be a discrepancy between the theoretical and actual difficulties of the level 3 items. Overall, differences between the mean self-assessment scores were less often statistically significant in reading comprehension than in mathematics. This finding shows that the trend in reading comprehension went in the expected direction but was not as clear-cut as for mathematics. In item development, it is more difficult to calibrate item difficulty in reading comprehension than in mathematics.
For students, accurate self-assessment is probably more difficult in reading comprehension than in mathematics because assessment criteria and teacher feedback are less clear in reading comprehension than in mathematics [75].
In self-assessment research with adults, a commonly accepted standard for self-assessment accuracy is the strength of the correlation between the self-assessment and some performance measure [24]. Because there is no commonly accepted standard for self-assessment accuracy in elementary school, we applied the same standard in our study as in self-assessment research with adults. When comparing our results to those of other comparable studies [76] [77] [49] [78] regarding students' age, self-assessments of academic competencies, correlations between self-assessment and a performance measure, group testing on a class basis, and sample size, the magnitudes of the correlations in our study tended to be stronger. Even compared with the range of meta-analytic effect sizes listed in Zell and Krizan's (2014) [24] meta-synthesis on the self-assessment of academic ability in adults and adolescents (with mean correlations ranging from r = .21 to .39, with one outstanding value of .63), the magnitudes of the correlations in our study, with 8- and 9-year-olds, tended to be stronger. This finding applies to mathematics as well as reading comprehension. Most likely, these results are due to the language-reduced self-assessment tool, which displayed items on the task level in a non-competitive setting. In addition, the tool fulfills symmetry principles derived from Brunswik's lens model (see the Self-assessment accuracy and student age section): Students' self-assessments and their objective performance were measured on similar levels of abstraction.
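The accuracy standard described above is simply the Pearson correlation between self-assessment scores and performance scores. As a minimal sketch with made-up numbers (the data below are hypothetical, not the study's), the computation looks as follows:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical self-assessment scores and standardized test scores
# for five students.
self_assessment = [40, 55, 60, 70, 85]
test_score = [430, 480, 500, 520, 560]
print(round(pearson_r(self_assessment, test_score), 2))  # 0.99 in this toy case
```

In practice, values in the r = .21 to .39 range cited above already count as meaningful accuracy in this literature; the toy data here are deliberately near-linear and not representative of real self-assessment data.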
Self-assessment accuracy and lower performers. Previous studies found that lower performing students tend to be less accurate in their self-assessments than higher performing students [1] [51] (see the Self-assessment accuracy of lower performers section). Possible explanations for these findings are a lack of adequate representations of both the expectations and the assessment criteria [52] [51] [53] or the temptation to respond in a socially desirable manner when reporting school grades or comparing one's achievements with those of classmates [54]. In our study, in both domains, self-assessment scores increased as proficiency increased, with statistically significant differences between the groups for the majority of comparisons (see Table 3). In mathematics, the less proficient group 1 was as good at recognizing the inherent difficulty of the self-assessment items as the more proficient groups 2 and 3 (see Tables 5 and 6). In reading comprehension, these findings were less consistent, but there was no definitive finding that the more proficient groups had better recognition than the less proficient group (see the Self-assessment accuracy and student age section in the Discussion for a possible explanation). Lower performing students, specifically third- and fourth-graders, may lack sufficient competencies in reading comprehension (see the section called An innovative tablet-computer-based self-assessment tool). With conventional questionnaires and inventories that depend on written language, these students would have (more) trouble understanding the descriptions of the tasks on which they had to assess themselves. In this sense, a language-reduced self-assessment tool, which displays items on the task level and avoids social comparison, might be particularly beneficial for lower performing students.

Strengths and limitations
The main objective of our study was to investigate whether third- and fourth-graders (ages 8 to 9 years) have the ability to provide accurate self-assessments of key academic competencies when provided with an adequate self-assessment tool. The results discussed in the Discussion section (Points 1, 2, and 3) show that students were able to do so, although more consistently in mathematics than in reading comprehension. This outcome can be explained by certain aspects of our tablet-computer-based self-assessment tool: it integrates all the features that previous research has identified as necessary to allow elementary school students to provide more accurate self-assessments of their academic competencies. These features are self-assessment in a non-competitive setting, task-oriented self-assessment, and self-assessment that requires only limited reading and no verbalization (see Section 2.2). The innovation consists of an illustration-rich self-assessment tool with a language-reduced rating scale, which reduces the bias that may come along with poor reading and language skills. In addition, the tablet computer's touchscreen offered students more intuitive handling of the tool, particularly the slider on the rating scale, compared with a computer keyboard or mouse. The tool administration and data collection across 43 elementary school classes were successful. Because of these features, too, the tool might be particularly beneficial for lower performers.
Our findings are based on two independent and representative samples, randomly chosen out of all the possible elementary school classes in Luxembourg. In general, we found similar patterns in the self-assessments and in the self-assessment accuracy between the two samples, thus confirming the consistency of the measurements with the self-assessment tool (e.g., [79]).
Due to persistent doubts regarding young students' self-assessment abilities, self-assessment research on pre- and elementary school students is still scarce. For this reason, we reviewed findings from different research disciplines and areas (developmental, cognitive, educational, and social psychology; educational science; metacognition research; self-regulation of learning; self-concept and self-efficacy research) to conclude that accurate self-assessment is less a question of age per se than a question of the conditions under which self-assessment is conducted. This argument allowed us to test the hypothesis that even third- and fourth-graders are able to provide accurate self-assessments of key academic competencies when equipped with an adequate self-assessment tool. We consider this approach to be a strong point of our study, although it implies an oversimplification of the constructs, concepts, and findings of the cited studies. Consequently, we are limited in discussing our results in comparison with the concrete findings from previous studies.
A limitation of our study is the lack of a control group that would have allowed us to compare the accuracy of self-assessments administered via our innovative tool with the accuracy of self-assessments in a conventional (e.g., paper-and-pencil) and predominantly language-based setting.
Because the tool was designed for classroom use, testing time was limited (40 plus 20 minutes for the self-assessments in mathematics and German, respectively; 10 minutes for the introduction), and we did not ask students to actually solve the items on the self-assessment tool. Comparing students' actual solutions with their self-assessments of whether they could solve the very same items would have provided an additional analysis for assessing accuracy (e.g., [51] [80]). This might be covered in a future study. Nevertheless, we would like to highlight that the self-assessment tool and the measure used to validate it (i.e., standardized tests from Luxembourg's school monitoring program) are based on the same reference standard: the national school curriculum. Items from the standardized tests and the self-assessment tool were problem isomorphs in the majority of cases. The self-assessment tool covers curriculum-relevant competencies, thus allowing high-quality feedback to be provided to teachers and offering good chances for the tool to be integrated into teaching.
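One common way to operationalize the item-level comparison suggested above is a bias (over- or underconfidence) index: the mean signed difference between a student's rescaled confidence rating and the actual correctness of the response. The sketch below is illustrative only; the function name and data are hypothetical and not taken from the study or from [51] [80].

```python
def calibration_bias(confidence, correct):
    """Mean signed difference between confidence (0-1) and correctness (0/1).

    Positive values indicate overconfidence, negative values
    underconfidence, and values near zero good calibration.
    """
    diffs = [c - k for c, k in zip(confidence, correct)]
    return sum(diffs) / len(diffs)

# Hypothetical ratings: slider positions rescaled to 0-1, and whether
# the student actually solved each item (1 = correct, 0 = incorrect).
confidence = [0.9, 0.8, 0.7, 0.6, 0.3]
correct = [1, 1, 0, 1, 0]
print(round(calibration_bias(confidence, correct), 2))  # 0.06: slight overconfidence
```

Such an index would complement the correlational accuracy standard used in the present study, because it captures systematic over- or underestimation rather than rank-order agreement.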
When comparing the self-assessment results between mathematics and reading comprehension, we conclude that the self-assessment tool worked better for the former than for the latter. A possible explanation might be that language reduction through illustration, an important feature of the tool, naturally could not apply to reading comprehension, except for the rating scale. Another explanation might be that two factors interacted: In item development, the calibration of item difficulty is more problematic in reading comprehension than in mathematics, and teachers' feedback and assessment criteria are less clear to students in reading comprehension than in mathematics [75], which leads to less accurate self-assessments. An empirical validation of the difficulty levels of the self-assessment items would provide further answers to this question.

Conclusions
We conclude from these results that, under favorable conditions, third- and fourth-graders (ages 8 to 9 years) have the ability to provide accurate self-assessments of key academic competencies, although they do so more consistently in mathematics than in reading comprehension. The favorable conditions are (1) self-assessment in a non-competitive setting, (2) self-assessment items presented on the task level instead of questions about general competency, and (3) the use of a language-reduced and illustration-rich self-assessment tool. The tablet-computer-based self-assessment tool that we developed proved to be a suitable instrument for providing such conditions, particularly in mathematics.