Item quality analysis using the Rasch model to measure critical thinking ability in human digestive system material in high school Biology

INTRODUCTION
Science learning is expected to guide students in developing 21st-century skills. One of the skills needed in the 21st century is critical thinking. These skills are important for student development and valuable for life in society (Danczak et al., 2017). The development of critical thinking skills needs to be evaluated continuously, both as a measure of the success of the learning process and as reflective material for improving the quality of learning.
The success of a teaching program can be seen from the results of learning evaluation. One important component of evaluation is the test. A test is a tool for measuring the achievement of learning objectives (Widoyoko, 2009). As a measuring tool, a test must be valid and reliable. A measuring instrument is said to be valid if it measures what it is supposed to measure without significant bias or distortion (Matondang, 2009). Validity and reliability are affected by both the instrument and the subject being measured. Invalid and unreliable tests will give biased results and can even harm students (Widyaningsih & Yusuf, 2018).
One of the goals of learning biology is to apply what is learned to everyday life, so students need to be trained to think critically in order to apply biology concepts to everyday problems. Critical thinking involves reasoning and the ability to separate fact from opinion (Chukwuyenum, 2013). A critical person examines information before accepting or rejecting a solution to an existing problem. Practicing critical thinking skills is very important because it helps students understand the logical relationships between ideas, build and evaluate arguments, and solve problems systematically (Riyanti et al., 2016).
Because critical thinking skills are important in learning, teachers must provide stimulation by creating an evaluation system that shifts students' mindsets from remembering facts to thinking critically. In line with its characteristics, critical thinking requires practice, for example by working on evaluation questions designed to develop critical thinking skills (Kartimi, 2012).
Empirical data on the critical thinking skills of high school students show that there is a need for improvement. Studies show that students' critical thinking skills are generally in the low to moderate category, with some indicators weaker than others. For example, one study found that the evaluation, analysis, and self-regulation sub-skills were the critical thinking sub-skills students mastered least (Basri et al., 2019). However, some studies have found that certain learning models, such as the 5E Learning Cycle (Miarti et al., 2021) and the Conceptual Understanding Procedures (CUPs) learning model (Ariesta et al., 2019), can have a positive effect on students' critical thinking skills. In addition, research shows that students' intellectual level (IQ) correlates with their critical thinking abilities (Hasanah et al., 2020). Overall, further work is needed to improve high school students' critical thinking skills.
Data from the Programme for International Student Assessment (Center for Educational Assessment, 2018) show that the science achievement score of Indonesian students was 396, below the average of 489. This indicates that the average critical thinking ability of students in Indonesia is still relatively low. PISA results can be used to gauge students' critical thinking because PISA questions are higher-order thinking skill (HOTS) questions, which require students to solve problems, think critically and creatively, reason, and make decisions.
Interviews with biology teachers at a Yogyakarta high school revealed that critical thinking skills had never been measured using tests in the form of essay questions. The teachers stated that students' critical thinking still needed to be improved, because very few students actively asked questions during lessons. Most students scored below the minimum mastery criterion (MMC), and the average cognitive ability did not meet expectations. For example, during midterm exams the teacher expected 70% of students to achieve the MMC, but only 50% did.
Observations of the questions used in one senior high school in Yogyakarta showed that most teachers took sample questions from the handbook without measuring their validity and reliability. Thus, the quality of the questions used is unknown. Some of the questions given are lower-order thinking skill questions (C1-C3), so students only memorize the material provided during the learning process and do not understand it properly. This can make students reluctant to think. Teachers have not considered aspects of critical thinking skills because there is no test of critical thinking skills in biology teaching at the school.
The importance of developing the HOTS instrument can be seen in efforts to measure and develop higher-order thinking skills which are urgently needed in today's education. HOTS instruments go beyond basic cognitive understanding and encourage students to apply, analyze, evaluate, and create knowledge in greater depth. By using the HOTS instrument, schools can evaluate students' abilities to explore more complex conceptual understandings, think critically, solve problems, and make decisions based on critical thinking. The development of HOTS instruments also helps provide teachers with valuable feedback to design more challenging and relevant lessons, and to prepare students to face challenges in the real world outside of school. HOTS instruments are important in developing students' skills needed to adapt, innovate, and contribute to an increasingly complex and rapidly changing society.
The use of the Rasch model to measure critical thinking skills has received attention in recent years. Several studies have used the Rasch model to analyze the quality of critical thinking items and to categorize students' abilities. For example, the Rasch model has been used to measure the critical thinking skills of elementary school students in STEM learning (Hamdu et al., 2020) and, in a Rasch model test, to examine an experimental class's attitudes toward scientific inquiry (Wahyudiati, 2022). The Rasch model has also been used to validate critical thinking test items based on Buddhist philosophy (BCTA) (Susongko et al., 2022) and to develop science learning handouts on the human digestive system to improve critical thinking skills (Sulastri et al., 2022). In addition, the Rasch model has been used to test critical thinking skills in ecosystem material (Karoror & Jalmo, 2022) and to develop critical thinking test instruments integrated with Javanese cultural traditions in an Islamic context (Agustina et al., 2023). Overall, the Rasch model has proven to be a useful tool for analyzing critical thinking skills and improving learning outcomes in various fields. Measuring critical thinking skills and analyzing the instruments used to assess them with the Rasch model has therefore become very important.
Accordingly, this study analyzed the instrument reliability using the Rasch model (Nielsen, 2018; Sumintono, 2018). The Rasch model is a statistical approach used to measure performance, perceptions, and attitudes (Bonsaksen et al., 2013; Nielsen, 2018). Evaluating critical thinking skills with the Rasch model has advantages over classical test theory because it can improve evaluation quality in both quantitative and qualitative studies (Chan et al., 2014). Some of the advantages of the Rasch model are: (1) it produces a linear, one-dimensional scale; (2) it requires conformity between the data and the measurement model; (3) it calculates the standard error; (4) it estimates person measures and item difficulty on a common linear scale in standard units (logits); and (5) it checks the evaluation system logically and consistently (Planinic et al., 2019).
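The logit scale mentioned in advantage (4) is simply the natural log of the odds of success. As an illustrative sketch (not part of the Winsteps software used in this study), the conversion from a proportion to a logit can be written as:

```python
import math

def to_logit(p):
    """Convert a proportion of success p (0 < p < 1) into a logit,
    i.e., the natural log of the odds p / (1 - p). The Rasch model
    places both person ability and item difficulty on this scale."""
    return math.log(p / (1 - p))

# A respondent answering 50% of items correctly sits at 0 logits;
# 75% correct corresponds to log(3), roughly 1.10 logits.
```

Because the logit transform is nonlinear in the raw proportion, equal logit distances represent equal differences in odds, which is what makes the resulting scale linear and interval-level.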
The Rasch model can analyze evaluation instruments based on several parameters. To obtain these advantages, an instrument must be tested for reliability, validity, item discrimination, fit, and difficulty level using the Rasch model (Nielsen, 2018; Sumintono & Widhiarso, 2015). These stages are very important for obtaining a reliable evaluation instrument. Therefore, it is necessary to analyze the instrument according to the needs of evaluating critical thinking skills, where instrument reliability is analyzed using the Rasch model. Thus, this study aims to analyze the validity, reliability, rating scale comprehension, item difficulty, item bias, and item-person interactions (through ICC plots) of a critical thinking ability assessment instrument on human digestive system material using the Rasch model.

RESEARCH METHOD
This research was conducted as one stage of the research and development of a problem-based learning virtual laboratory using the ADDIE model. The sampling technique used was simple random sampling. The participants were 63 high school students in Yogyakarta; the class and gender characteristics of the respondents are presented in Table 1. Data were collected using a test prepared according to the needs of critical thinking skills: 15 essay questions on human digestive system material, scored on a five-point Likert-type scale. The indicators of critical thinking skills used include interpreting, analyzing, evaluating, explaining, and concluding (Facione, 1992). The collected data were analyzed using the Rasch model with the Winsteps 5.0.3.4 program (Faradillah & Adlina, 2021). The research was conducted at a Yogyakarta high school in October 2022 using the Google Forms application. The analysis phase begins with validity testing, which includes overall validity using summary statistics, item validity using the item fit order, and construct validity. Instrument reliability was reviewed using the alpha value and item reliability in the statistical summary. Respondents' understanding of the rating scale was examined using the partial-credit rating scale and the probability curve.
Items were analyzed by creating a logit bar to classify item difficulty level based on the logit values and the Wright map. Item bias was analyzed using DIF tables and plots. The interaction between items and persons was analyzed through ICC plots, which are modeled using the probabilities developed in Rasch analysis and governed by two main parameters: item difficulty and person ability.

RESULTS AND DISCUSSION

Instrument Validity Analysis
The validity test results are divided into two parts, namely the overall validity of the instrument and the validity of the items (Planinic et al., 2019). The results of the analysis are presented in Table 2. The analysis of the instrument's validity in Table 3, from the statistical summary, shows whether the instrument is valid for use (Runco & Acar, 2012). Based on the item outfit MNSQ value, the instrument is suitable for evaluation because the result is 1.00, the ideal value. Based on the item and person outfit ZSTD values, the data have a logical estimate because the result is -0.28, close to the ideal value of 0.00 (Sumintono & Widhiarso, 2015; Kim, 2021; Sumintono, 2018).
According to the American Educational Research Association (AERA) and the American Psychological Association (APA), strong validity requires evidence, and response validity concerns the reliability of the instrument when respondents give responses. The instrument's validity was first established through expert judgment, and the instrument was then used directly for the test. The validity test in the Rasch model informs the quality of the instrument, so the validity evidence is now more reliable (AERA & APA, 2014). The results of the item dimensionality test can be seen in Table 3, which shows that the instrument's construct validity meets the criteria. The unexplained variance in the first contrast of the residual PCA meets the accepted criteria, and all statement items show conformity; the result is unidimensional. This means that the instrument measures a single variable, namely critical thinking skills, so its construct validity is established: the instrument measures what it is intended to measure. The Rasch model application can thus be used to determine the construct validity of an instrument. In the research conducted by Madyani et al. (2020), construct validity had not been analyzed, so this test offers novelty. Item fit was then analyzed to find out which items are not appropriate, using the item fit order in Table 4. The results show that all items can be used to measure responses (Austvoll-Dahlgren et al., 2017), and no item requires revision except item 5, which must be repaired or replaced because it does not fit. The Rasch model can thus direct instrument developers to revise misfitting items so that the measurement is reliable. Table 5 shows the results of the reliability test.
The overall instrument, with a Cronbach alpha value of 0.81, falls in the very good category. The item reliability of 0.88 is also in the very good category (Sumintono & Widhiarso, 2015). The instrument therefore gives consistent results when tested on the population (Plucker et al., 2014; Runco & Albert, 1985). The item separation is in the very good category because there are three categories of item difficulty level on the instrument, and the person separation is in the good category because the respondents span five ability levels. Thus, the instrument can be used to group both items and respondents when evaluating critical thinking skills (Göçmen & Coşkun, 2019; Sumintono & Widhiarso, 2015).
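As a rough illustration of what the reported alpha of 0.81 summarizes (a sketch with made-up scores, not the study's data or the Winsteps computation), Cronbach's alpha compares the variance of item totals to the sum of per-item variances:

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of totals)."""
    k = len(scores[0])                      # number of items
    def var(xs):                            # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Perfectly consistent (hypothetical) responses give alpha = 1.0:
# cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

Values near 1 mean the items vary together across respondents, which is what "consistent results when tested on the population" refers to.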

The Rating Scale Understanding Analysis
The functioning of the rating scale (1, 2, 3, 4, 5) can be assessed from the peaks of each scale category on the probability curve in Figure 2; details can be seen in Figure 3. Table 6 shows that the five rating categories (excellent, good, sufficient, low, and very low) did not work well, meaning that respondents did not understand each scale category in the questions given. One sign that a rating scale has not functioned properly is an Andrich threshold that does not increase monotonically (Sumintono & Widhiarso, 2015). Figure 2 represents the probability curve of the instrument's rating scale. This curve depicts the probability distribution for each rating category of the instrument for analyzing critical thinking skills on the human digestive system. In this context, the probability curve provides information about how often, or how likely, a particular rating category is used by respondents when answering the given questions or items. In this analysis, each rating category (1, 2, 3, 4, 5) has a distinct peak, indicating that respondents show identifiable tendencies in using specific rating categories when answering the instrument. Through this probability curve, we can gain insight into respondents' preferences for the provided rating scale (Hansen & Kjaersgaard, 2020; Lin et al., 2017). This can be helpful in evaluating the quality of the instrument and understanding how respondents respond to the questions or items. Figure 3 presents probability curves in which the y-axis represents the probability of each of the five response options at each level of item difficulty on the x-axis. Each color corresponds to a different response option: 1 = red, 2 = blue, 3 = pink, 4 = black, 5 = green. The intersections of adjacent curves are the thresholds.
The category probability curves for the item above have disordered thresholds. The responses to this item are not distributed in a logically progressive order, and categories 3 and 4 never have the highest probability. Understanding of the scale can be examined through statistics in the Rasch model; the scale can only be analyzed descriptively when respondents fill in all the questions. The researchers concluded that the respondents could not understand the rating scale properly; until now, understanding of the scale had not been tested. A poor probability curve can be used to diagnose deficiencies in a scale, which can be addressed, for example, by reducing the scale range or eliminating a neutral rating that carries little meaning.
Rasch analysis also provides information about the optimal number of response categories on the scale. To analyze whether the category calibration increases regularly, the response options are assessed with the category probability curves shown in Figure 3 (Linacre, 2002). These curves indicate the likelihood that a subject with a certain person measure, relative to item difficulty, will choose a given category (Pesudovs et al., 2007). A threshold is the midpoint between adjacent response categories, expressing the point at which the probability of selecting either category is equal (McAlinden et al., 2012). If disordered thresholds occur, the situation must be remedied by collapsing the affected categories into adjacent categories (Andrich, 2013a, 2013b; Pesudovs et al., 2007). To detect this situation, the Andrich threshold measurements must be checked: thresholds should be spaced at least 1.4 logits apart (Linacre, 2002). Item reduction is done iteratively, removing one item at a time (Pesudovs et al., 2003). When an item is omitted, the fit to the model is re-estimated, because fit has been shown to be relative, so omitting an item leads to variation in fit. Items meeting the most elimination criteria, sorted by priority, are eliminated first (Cantó-Cerdán et al., 2021).
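The two diagnostics described above, threshold spacing and category collapsing, can be sketched as follows. This is an illustrative outline under the 1.4-logit spacing rule, not Winsteps code, and the helper names are hypothetical:

```python
def thresholds_ordered(thresholds, min_gap=1.4):
    """Check that Andrich thresholds increase monotonically and are
    spaced at least `min_gap` logits apart (Linacre, 2002)."""
    return all(b - a >= min_gap for a, b in zip(thresholds, thresholds[1:]))

def collapse_category(responses, from_cat, into_cat):
    """Recode a disordered category into an adjacent one before
    the model is re-estimated on the recoded data."""
    return [into_cat if r == from_cat else r for r in responses]

# Well-spaced thresholds pass the check; thresholds only 0.5 logits
# apart fail, signaling that adjacent categories should be merged.
```

After collapsing, the partial-credit model must be re-run, since thresholds and fit statistics are relative to the recoded category structure.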

The Analysis of the Difficulty Level of the Items
The difficulty level of the items is assessed using the item measure. The cut-off for the difficulty level is determined by adding the mean to the standard deviation (0.00 + 0.45 = 0.45), which is used to create a logit bar. The logit bar can be used to identify difficult, moderate, and easy items as well as outliers, as seen in Table 7. The item and person measures must be considered to interpret the responses given. The person measures show an average value (M) of 0.02 logit, against -0.28 logit for the item measure. Thus, the respondents' ability to answer is above the average difficulty level of the statement items (Sumintono & Widhiarso, 2015).
The logit bar in Table 7 is then integrated with the item measures in Table 7 and the Wright map in Figure 4 to classify the question items. The results order the items from the most difficult for respondents to the easiest. The table shows that item 14 is the most difficult to work on: it has the highest difficulty level, so it is included in the outlier category. The items in the moderate category are items 13, 12, 1, 4, 10, 6, 2, 5, and 7, and the easy items, with a low level of difficulty, are items 11, 15, and 9. The item difficulty categories are obtained by adding the mean and the standard deviation; in the table, this sum equals 0.45. Items with a logit value above 0.90 or below -0.90 are outliers that cannot be used and need to be discarded.
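The classification rule described above can be sketched as a small function, using the values reported in the paper (item mean 0.00, SD 0.45, outliers beyond two standard deviations); this is an illustration of the rule, not output from the analysis software:

```python
def classify_item(logit, mean=0.00, sd=0.45):
    """Classify an item's difficulty relative to the mean and SD of the
    item logits: beyond mean +/- 2*SD is an outlier, beyond +/- 1*SD is
    difficult or easy, and anything in between is moderate."""
    if abs(logit - mean) > 2 * sd:
        return "outlier"
    if logit - mean > sd:
        return "difficult"
    if mean - logit > sd:
        return "easy"
    return "moderate"
```

Under this rule an item at 1.2 logits (like the hardest item here) is flagged as an outlier, while items within 0.45 logits of the mean stay in the moderate band.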

The Analysis of Gender-Bias Items
The result of the bias test using DIF in Figure 5 shows probabilities against a bias criterion of p < 0.05. The blue L code indicates male respondents, and the red P code indicates female respondents. One item was found to be biased with respect to gender, namely item 14. DIF analysis was used to confirm whether any items had a gender bias (between women and men) affecting critical thinking skills on human digestive system material. DIF analysis can identify participant bias based on subgroups or variables for each item in the instrument used (Boone et al., 2014; Khine, 2020). DIF was determined using two criteria: a significant probability (p < 0.05) and the DIF contrast. The three DIF contrast classifications (Zwick et al., 1999) are negligible, slight to moderate (|DIF contrast| ≥ 0.43 logits), and moderate to large (|DIF contrast| ≥ 0.64 logits). Figure 5 shows that item 14 falls into the DIF category based on a significant probability. Thus, we can conclude that item 14 is categorized as DIF, indicating that the instrument has a bias problem.
The farthest distance of an item from the average indicates a significant difference in difficulty level between men and women (Azizah et al., 2022). In this case, one group benefits and the other is disadvantaged because the question appears more difficult for women than for men: the graph shows that item 14 favors males over females. When this item was examined, it turned out that the narrative of the question gave respondents different assumptions depending on gender. This is in line with several studies that have used Rasch analysis to investigate gender bias in scales and questionnaires and have found evidence of gender-biased items. For example, uniform differential item functioning (DIF) was found for two items in the Patient Assessed Elbow Evaluation questionnaire, one of them showing DIF for gender (Vincent et al., 2015). Similarly, several items in the STEM Career Interest Survey were found to show gender bias (Ardianto et al., 2023), and three items on the GAIN substance problem scale functioned differently between men and women (Claro et al., 2015). For this reason, it is very important to compose narratives and choose words so that the resulting items or questions do not trigger different assumptions across genders and thus do not cause gender bias.
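The two-part decision rule above (a significant probability plus the size of the DIF contrast) can be sketched as follows; the function is an illustration of the Zwick et al. (1999) cut-offs, not part of the Winsteps output:

```python
def classify_dif(dif_contrast, p_value, alpha=0.05):
    """Classify DIF severity from the DIF contrast (in logits) and its
    significance test, following the cut-offs of Zwick et al. (1999):
    < 0.43 negligible, 0.43-0.64 slight to moderate, >= 0.64 moderate
    to large. A non-significant p-value is treated as negligible."""
    if p_value >= alpha or abs(dif_contrast) < 0.43:
        return "negligible"
    if abs(dif_contrast) < 0.64:
        return "slight to moderate"
    return "moderate to large"
```

Requiring both criteria keeps a large but statistically unstable contrast (small subgroup sizes) from being flagged as bias.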

Investigating the Interaction between Item and Person via ICC Plot
Analysis of item and person interaction on the Rasch model ICC plot is an important aspect of evaluating the psychometric properties of measurement instruments. Research by Planinic et al. (2019) using the Rasch model describes it as a probabilistic model of the interaction of a person with a test or survey item, governed by two parameters: item difficulty and person ability. Figure 6 shows the ICC plot for critical thinking. The red line indicates the ideal line of the Rasch model, that is, the curve modeled with the probabilities developed in the Rasch analysis; its position reflects the item's difficulty level. The blue line shows the empirical data, the distribution of answers from respondents. Overall, the persons are considered capable of answering all items, which is also confirmed by the Wright item map in Figure 4. Although there are items that do not fit and must be replaced or discarded, testing using the ICC expected score shows that no item displays an inappropriate response; in other words, all items lie within the confidence bounds around the model curve. This means that the 15 items in the critical thinking ability instrument on human digestive system material measure exactly what they are intended to measure.
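The two-parameter interaction the ICC depicts follows the Rasch logistic form. A minimal dichotomous sketch is given below; the instrument here uses five-category partial-credit scoring rather than right/wrong items, but each category threshold follows the same logistic shape:

```python
import math

def rasch_probability(theta, delta):
    """Dichotomous Rasch ICC: the probability that a person with
    ability theta succeeds on an item with difficulty delta (both in
    logits), P = exp(theta - delta) / (1 + exp(theta - delta))."""
    return 1.0 / (1.0 + math.exp(-(theta - delta)))

# When ability equals difficulty, the probability is exactly 0.5;
# a person 1 logit above the item succeeds about 73% of the time.
```

Plotting this probability against theta for a fixed delta reproduces the ideal (red) curve of Figure 6; the empirical (blue) curve is the observed proportion of success in ability strata of the sample.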
Research by McCamey (2014) revealed that the Rasch model allows estimation of item difficulty and person ability on a common scale, which is useful for evaluating the reliability and validity of measurement instruments. The person-item distribution map (PIDM) is a graphical representation of the interactions between persons and items that can provide meaningful information about the effectiveness of student learning (Nopiah et al., 2012). The Rasch model is also useful for examining how individuals respond to items using a person map (Subroto et al., 2022). The Rasch model is designed so that the characteristics of persons and items are independent of each other and measured at the interval level, and it can generalize these characteristics beyond the particular sample of respondents and items (Benson et al., 2018). In the study by Rifbjerg-Madsen et al. (2017), Rasch analysis revealed acceptable psychometric rating scale properties, including calibrated thresholds, no disturbance in category size for any item, and small distances between thresholds for all items. Likewise, in the study of Imani et al. (2018), internal consistency, test-retest reliability, and Rasch analysis showed that the Epworth Sleepiness Scale for Children and Adolescents is a reliable and valid instrument. Therefore, item and person interaction analysis on the Rasch model ICC plot is a valuable tool for evaluating the psychometric properties of measurement instruments.

CONCLUSION
Based on the results of this study, it can be concluded that the instrument for measuring critical thinking skills on human digestive system material has good criteria for application in the teaching-learning process. The overall validity is acceptable, while the item validity requires improvement: item 5 must be repaired or replaced because it does not fit. The overall reliability is very good, and the item reliability is good. Rating scale analysis shows that respondents do not understand the five-point Likert scale, so the poor probability curve can be used to diagnose the deficiency of the scale, for example by reducing the scale range or eliminating a neutral rating that carries little meaning. The item difficulty analysis shows that item 14 is the most difficult to work on: it has the highest level of difficulty, so it is included in the outlier category. The items in the moderate category are items 13, 12, 1, 4, 10, 6, 2, 5, and 7, and the easy items, with a low level of difficulty, are items 11, 15, and 9. The bias results show one gender-biased item, namely item 14. Moreover, the analysis of the interaction between items and persons through the ICC plot shows that, even though there are items that do not fit and must be replaced or discarded, the test using the expected score ICC shows no items with inappropriate responses. In other words, all items lie within the confidence bounds of the model curve and conform to the Rasch model.