THE EFFECTS OF TEST CHARACTERISTICS ON THE HIERARCHICAL ORDER OF READING SKILLS

Malaysian Journal of Learning and Instruction: Vol. 14 No. 1 (2017): 63-82
Purpose – The study sought to determine the hierarchical nature of reading skills. Whether reading is a ‘unitary’ or ‘multi-divisible’ skill is still a contentious issue. So is the hierarchical order of reading skills. Determining the hierarchy of reading skills is challenging as item difficulty is greatly influenced by factors related to test characteristics. To examine the interaction between these factors and item difficulty, and determine the possibility of such a hierarchy, this study used the multifaceted Rasch approach.
Methodology – In this descriptive study, a 42-MCQ reading test was administered to 944 ESL lower secondary students, randomly selected from eleven Malaysian national-type schools in the Federal Territory of Kuala Lumpur and the state of Selangor. These student populations were selected as the development of reading ability was considered critical at this stage of schooling. The reading test items were identified according to the following aspects: Reading Skill Areas (Interpreting information, Making inference, Understanding figurative language, Drawing conclusions, Scanning for details and Finding word meanings), Context Type (Linear and Non-Linear), and Text Type (Ads, Notice, Chart, Story extract, Short message, Poem, Short news report, Brochure, Formal letter, Conversation, Long passage and Weather forecast). Applying the Many-Facet Rasch model of measurement, the study analyzed student responses to the test items with the help of FACETS, version 3.7.1.4.
Findings – The findings showed that context types, skill areas, and text types differed in difficulty (p<.01), with those items that required understanding and interpretation being more demanding. Test items based on linear contexts were more difficult than those based on non-linear contexts. Understanding figurative language was found to be the most difficult skill followed by Making inference and Interpreting information. The easiest reading skill was Scanning for details, followed by Finding word meanings. The reading skill, Drawing conclusions, was close to the average difficulty level. The findings also indicated that texts that were longer and had more information tended to be more difficult.
Significance – This study has also shed new light on the theory and practice of reading. The findings support the hierarchical nature of reading skills. Different reading skills were found to exert differential cognitive demands, and those which required higher cognitive ability were more difficult for learners to acquire and perform. Understanding the hierarchy of reading skills will help language teachers to target their teaching more effectively; course designers to produce more appropriate teaching and learning materials; and test writers to develop test items that better meet students’ reading competencies.


INTRODUCTION
Given the importance of reading skills in language development, a considerable amount of research has tried to identify the nature of reading skills (Alderson, 2005; Grabe & Stoller, 2002; Hedgcock & Ferris, 2009); how they develop; how they should be taught; and how they should be assessed (Alderson, 2005; McNamara, 1996).
The unitary view implies that reading skills do not have clearly separable and identifiable sub-skills or components (Alderson & Lukmani, 1989; Alderson, 1990a, 1990b; Bachman, 1990; Rost, 1993; Weir & Porter, 1994). However, Weir and Porter (1994) argued that a bi-divisible view of reading is plausible, as 'vocabulary' seems to be a separate component from reading comprehension, as evidenced in a number of quantitative studies. Weir, Huizhong and Yan (2002) quoted several studies in which these two components of reading were identified (Berkoff, 1979; Carver, 1992; Farr, 1968; Guthrie & Kirsch, 1987). Alderson (2005) also highlighted that the common view in the research literature was that reading could be seen as comprising the following two components: decoding (word recognition) and comprehension. Hughes (2003) explained that, despite the issue of the existence of subskills in reading, the reading test must include samples of skills that are relevant to the test purpose. Hence, it is important that reading assessment be guided by a clear reading theory that defines the reading skills accurately. This allows students' performance in the specific reading skill of concern to be measured and interpreted much more precisely (Engelhard, 2001).
The multi-divisible view maintains that reading has separable and identifiable sub-skills. It can be divided into different subskills, as widely evidenced in the literature (see Davis, 1968; Farhadi & Moeini, 2005; Farhady & Hessamy, 2005; Kim, 2009; Matthews, 1990; Munby, 1978; Sainsbury, Harrison, & Watts, 2006; Spearritt, 1972; Weir, Hughes, & Porter, 1990). In this respect, the distinction between high and low order skills of reading and the relationship between them is essential for a better understanding of the nature of a reading skill, for constructing valid items to test reading ability, and for planning syllabuses (Alderson, 2005; Lumley, 1993; Weir, Hughes, & Porter, 1990).
One important issue that clouds the multi-divisible view of reading skills is the controversy surrounding the number of skills and the hierarchical ordering of these skills or sub-skills. Questions like "what such skills might consist of and how they might be classified, acquired, taught and tested" (Alderson, 2005, p. 10) need to be answered. Hudson (2007) argued that both L1 and L2 research consistently failed to support a "strictly hierarchically ordered reading skills" (p. 103) position. There is no clear evidence that reading could be divided into high and low order skills (Alderson, 1990a, 1990b; Hudson, 2007; Rost, 1993). Despite the different views about the nature of reading, the notion of skills and subskills is influential (Alderson, 2005). This is seen through the use of various taxonomies in teaching reading as well as in testing it (see Alderson & Lukmani, 1989; Grabe, 1991; Hudson, 2007; Matthews, 1990; Pearson & Johnson, 1978; Urquhart & Weir, 1998; Vacca & Vacca, 2008).
Despite their seeming utility, the use of these taxonomies is not without criticisms (Alderson, 2005). For instance, Matthews (1990) argued that Munby's taxonomy (1978) is a knowledge-based taxonomy rather than a skill-based one, and thus the latter taxonomy should be disregarded. Moreover, some taxonomies, such as Munby's, Barrett's and Bloom's, were developed largely based on theoretical assumptions and not on empirical frameworks (Hudson, 2007). It has also been pointed out that "skills hierarchies should not be interpreted in an a priori fashion, as the field has not reached consensus on what constitutes higher or lower order skills, which are relative and subject to the influence of the context for reading" (Hedgcock & Ferris, 2009, p. 38).
In the testing of reading ability, it was found that item difficulty was influenced by a number of factors other than the inherent difficulty of the skill (Alderson, 2005; Bachman, 1990; Day & Park, 2005; Kobayashi, 2005; McNamara, 1996). These included factors such as question type, context type, question format, cognitive demand, explicitness and implicitness of information, students' test-taking skills (Alderson, 2005; McKenna & Stahl, 2009; Pearson & Johnson, 1978), text type as well as text length (Scheuneman & Gerritz, 1990). With regard to the type of question format, the difficulty of a multiple-choice question is influenced by its stem length, stem content words, structure of options, and the length of the correct answer and distractors (Alderson, 2005). Furthermore, Pearson and Johnson (1978) found that question types varied in difficulty based on the explicitness and implicitness of question information. The research literature also showed that item characteristics and the interaction of these characteristics affected item difficulty (Alderson, 2005). Alderson further elaborated that the factors or influences that affected item difficulty or made the task more demanding should be controlled; otherwise, they could be a risk to test validity. Hence, for accurate and valid conclusions about skills hierarchies, it is essential to account for these factors and model their influence on item difficulty.
One important concern in investigating skills hierarchies and item difficulty is the quantitative or qualitative method used. Daftarifard and Lange (2009) noted that judgmental analysis of item difficulty was insufficient, as many studies had found discrepancies between the hypothesized order based on expert judgment and the item difficulty estimates from empirical analysis (see also Weir et al., 2002). Such a judgment should be empirically tested (Lumley, 1993). The estimation of item difficulty without modeling the effects of the factors mentioned earlier is equally problematic. Hence, a multifaceted approach using the Many-facet Rasch analysis has been recommended to examine item difficulty. Such an approach would consider the influence of other related variables on item difficulty (Daftarifard & Lange, 2009). According to them,
…given this lack of correspondence, we propose that notions of items complexity require careful distinctions between the qualitative and quantitative aspects of reading theory. For instance, it may be necessary to distinguish between the complexity of a concept and the complexity of the question designed to assess this concept. Rasch [analysis] is likely to remain the tool of choice in this research, but it seems likely that multifacetted approaches will be needed to accommodate both types of complexity simultaneously (p. 1212).
Given the possibility of examining the "complexity of a concept and the complexity of the question designed to assess [the] concept" using the Many-facet Rasch analysis, this study explored the hierarchical assumption of reading skills for support of a developmental hierarchy of reading ability using the Many-facet Rasch model (Linacre, 1989, 2014a). It also examined the influence of particular item characteristics on item difficulty. The findings of this study would be able to shed light on the issue of the hierarchy of reading skills, and the robustness of the Many-facet Rasch model in this regard.

METHODOLOGY

Participants and Sampling
This study employed the descriptive design method (Fraenkel, Wallen, & Hyun, 2012; Gay, Mills, & Airasian, 2000; Keeves, 2004). Here, a dual-purpose instrument comprising two sections was used: Section A asked some questions about students' demographic information and Section B comprised a 42-item MCQ reading test.
The participants consisted of lower secondary ESL students (i.e., Forms 1, 2 and 3; 13 to 15 year olds) in Malaysia. The lower secondary level follows the primary level, which lasts six years, and precedes the upper secondary level, which lasts two years. This population was selected as the development of reading ability was considered critical at this stage of learning. Reading ability was also considered one of the foremost components and indicators of being literate (McGee & Richgels, 2004); it was also seen as helping students to succeed in their studies (Holme, 2004), and to perform their daily and personal affairs more effectively (Vacca & Vacca, 2008).
A representative sample was chosen from 11 national-type secondary schools which were randomly selected from the Federal Territory of Kuala Lumpur and Selangor state in Malaysia. For each school, 30 students were again randomly selected from each Form (grade level), giving a total of 990 students (30 students x 3 Forms per school x 11 schools). However, the total number of students included in the final analysis was 944 out of the 990 who were selected. When the test papers were examined to ensure the integrity of the data collected, some were found blank and were excluded; in addition, some of the selected students were not available on the days the data were collected.

Instrument
A 42-MCQ English reading test was used in the study. Ingebo (1997) recommended the use of 40 items as an acceptable number for tests using Rasch analysis. The instrument for this study was developed as follows. First, three sets of past English language reading tests (for the years 2004, 2005, and 2007) used to assess reading comprehension at the national level (Penilaian Menengah Rendah [PMR]) were selected. The PMR is a national standardized examination conducted at the end of lower secondary education for Form 3 students. The papers also included specific tasks and competencies associated with English reading literacy that students were expected to possess over time in the lower secondary period. The reading skills included in the PMR exam papers were, therefore, those skills that had been taught to students in Forms 1, 2, and 3, given the spiral curriculum design adopted by the Malaysian Ministry of Education.
The three PMR tests were similar in test format and number of items. Each paper included 60 multiple-choice items (40 for comprehension questions and 20 for grammar) with four options. Based on the syllabuses of English for Forms 1-3, the researchers, with the help of experts in the English language comprising two university lecturers and two school teachers of English, analyzed the three sets of PMR tests and came up with item content descriptors. In doing so, they were able to pinpoint the level of item difficulty, the skill or sub-skill, and the grade level that the test items represented. It is worthwhile to mention that these PMR tests were developed by content experts and teachers from the field who were appointed by the Ministry of Education; therefore, content validity was not considered an issue.

Data Analysis
To select the most appropriate test, the MFORMS for concurrent analysis (Linacre, 2014a) was used to link the three tests. To allow for common item linking, a set of 20 grammar items was included in all three tests. Each set was administered to different groups of school children (n=269) selected from Forms 1, 2, and 3 in three national-type secondary schools in Kuala Lumpur, Malaysia. From the concurrent analysis, the 2007 exam paper was found to be appropriate for use in the final study because it had items that fit the Rasch Model, measuring different reading skills with different difficulty levels. However, language experts, one from the English Department and another from the Faculty of Education at the International Islamic University Malaysia, suggested modifying several of the items and adding others from the other two exam sets for sufficient coverage of the various skills being investigated in the main study. The final test included 60 multiple-choice items with four options (42 for comprehension questions and 18 for grammar). In the analysis, only the 42 reading comprehension items (see Table 1) were analyzed using the Many-facet Rasch analysis. The Many-facet Rasch Model (Linacre, 1989, 2014a), an extension of the basic Rasch Model, was used for analysis as it allows facets other than person ability and item difficulty to be modelled and evaluated (see Linacre, Engelhard, Tatum, & Myford, 1994; Lunz & Wright, 1997; McNamara, 1996). And since item characteristics exerted an influence on item difficulty (Alderson, 2005), the multifaceted measurement approach was considered appropriate. For this study, the computer program FACETS version 3.7.1.4 (Linacre, 2014b) was used. The analysis was conducted in two FACETS runs to determine the descriptive summaries of effects of item and test characteristics (Linacre, 2014b). In the first run, person ability estimates and item difficulty estimates were calibrated.
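For orientation, the dichotomous Rasch model and a generic many-facet extension of it are commonly written as in the sketch below. The notation is an assumption made for illustration and is not taken from the study: P denotes the probability of a correct response, B_n the ability of person n, D_i the difficulty of item i, and C_j the measure of element j of one additional facet (for example, a context type, skill area, or text type). The exact facet specification used in the FACETS runs is not stated in the text, so this should be read as the general form only.

```latex
% Dichotomous Rasch model: person n responding to item i
\[
  \ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = B_n - D_i
\]

% Generic many-facet extension with one additional facet element j
\[
  \ln\!\left(\frac{P_{nij}}{1 - P_{nij}}\right) = B_n - D_i - C_j
\]
```

All terms are expressed in logits, which is what allows persons, items, and other facet elements to be located on one common scale.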
The subset connection, which showed the link between all elements in the analysis (persons, items, text type and context type), indicated that all elements of analysis were estimated in an unambiguous frame of reference. The mean of the standardized residuals (0.02) and the sample standard deviation (1.01) indicated that the data fit the Rasch Model. The Rasch person reliability coefficient was acceptable (.88) and close to .90, which is expected in high-stakes national level testing. All items showed Infit Mean square (MNSQ) values between 0.70 and 1.30, thus showing good fit and no unexpected randomness in responses. It is important to highlight that fit statistics help determine the quality of the collected data and the suitability of the items used in the measure (i.e., the test in this case) (Bond & Fox, 2015). The Mean-square fit statistics would show the size of "the randomness, i.e., the amount of distortion of the measurement system" (Linacre, 2002, p. 878). In the second run, the items were anchored at the item difficulty measures derived in the first run. Item characteristics that were expected to influence item difficulty (i.e., the context type, text type, and skill areas associated with the items) were then estimated and evaluated in this second run.
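To make the Infit MNSQ statistic more concrete, the following minimal sketch computes the information-weighted mean square for a single dichotomous item: the sum of squared residuals divided by the sum of the model variances. It is not the FACETS implementation; the function names, the four person measures, and the two response patterns are hypothetical and chosen only to show how orderly and disorderly responses move the statistic.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def infit_mnsq(responses, abilities, difficulty):
    """Information-weighted mean square for one item.

    responses:  0/1 scores, one per person
    abilities:  person measures in logits, same order as responses
    difficulty: item measure in logits
    """
    squared_residuals = 0.0
    total_variance = 0.0
    for score, ability in zip(responses, abilities):
        p = rasch_probability(ability, difficulty)
        squared_residuals += (score - p) ** 2   # observed minus expected, squared
        total_variance += p * (1.0 - p)         # binomial model variance
    return squared_residuals / total_variance

# Hypothetical data: four persons and one item located at 0.5 logits
abilities = [-1.0, 0.0, 1.0, 2.0]
orderly = infit_mnsq([0, 0, 1, 1], abilities, 0.5)     # less able fail, more able pass
disorderly = infit_mnsq([1, 1, 0, 0], abilities, 0.5)  # the opposite, unexpected pattern
print(orderly < 1.0 < disorderly)
```

Values near 1.0 indicate about as much randomness as the model expects; the orderly pattern falls below 1.0 and the disorderly one rises well above it, which is the behaviour the 0.7 to 1.3 range is meant to screen.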

RESULTS
The first FACETS run showed that the measures of item difficulty estimates spanned about four logits (-1.98 to +2.07 logits). The point measure correlation (PTMEA CORR.) coefficients (which are similar to the point-biserial correlation) for the 42 items were positive and almost all were above 0.3, indicating that the items were effectively discriminating between persons with high ability and those with low ability (Bond & Fox, 2015). For the Infit MNSQ, all items were within the recommended range (0.7-1.3), implying that all the items were productive and meaningful for measurement (Bond & Fox, 2015; Wright, Linacre, Gustafson, & Martin-Lof, 1994). The mean of the Infit MNSQ was 0.99, close to the expected value of 1.00, and the standard deviation was very small (0.11). The high reliability of item difficulty measures (.99) indicated that the ordering of item difficulty estimates was highly replicable with other comparable samples of students and that the items were well separated in terms of difficulty. The item separation index was 12.49, indicating that the items could be divided into at least 13 difficulty levels. The analysis also showed that the mean for person ability was 0.48 logits and the standard deviation was 1.15 logits.
The measures of person ability spanned about six logits (+4.22 to -2.34 logits). For the person fit statistics, the Infit MNSQ indicated that only 38 (4.02%) persons were underfitting, i.e., with values above 1.3; however, most values did not depart far from 1.3. The reliability of the person ability measures was also high at .88, suggesting that it was highly likely that the ordering of students could be replicated with similar items of the same difficulty. The person separation index was 2.66, indicating that the reading test could divide the students into three levels of ability. Figure 1 shows the distribution of students (i.e., the stars in column 2) and the items (i.e., the numbers in column 3) on the same interval scale. The upper part of the scale indicates the most able students who answered most of the items correctly, while the lower part shows the least able students, with more incorrect answers. Items most often correctly answered are positioned towards the lower part of the scale and the least correctly answered ones are positioned towards the upper part of the scale.
In the second run, the difficulty estimates for context type (linear or non-linear); skill area (Interpreting information, Making inference, Understanding figurative language, Drawing conclusions, Scanning for details and Finding out word meanings); and text type (Ads, Notice, Chart, Long passage, Weather forecast, Conversation, Newspaper report, Story extract, Short message, Poem, Brochure, and Formal letter) were calibrated, with item difficulty values anchored to the ones derived from the first run. Facets 2 (context) and 3 (skill area) were centred (i.e., mean = 0.0) while the fourth facet (text type) was non-centred in this analysis. Figure 2 gives a graphic summary of item difficulty and examinee ability distribution, location of reading skill categories, and context types as well as text types.
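The separation indices and reliabilities reported in this section are two expressions of the same information. A minimal sketch of the standard Rasch relation between them, R = G^2 / (1 + G^2), is given below; the function names are illustrative, and the two input values are the person and item separation indices reported in the text.

```python
import math

def reliability_from_separation(separation):
    """Separation reliability R implied by a separation index G."""
    return separation ** 2 / (1.0 + separation ** 2)

def separation_from_reliability(reliability):
    """Separation index G implied by a separation reliability R (R < 1)."""
    return math.sqrt(reliability / (1.0 - reliability))

# Separation indices reported in the text
person_reliability = reliability_from_separation(2.66)
item_reliability = reliability_from_separation(12.49)
print(round(person_reliability, 2), round(item_reliability, 2))
```

The two results round to .88 and .99, which agrees with the person and item reliabilities reported for the first run.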

DISCUSSION
The results obtained from the Many-facet analysis showed the order of item difficulty measures with regard to skill areas as well as item context types. The most demanding reading skill areas were Understanding figurative language, followed by Making inference, and Interpreting information. The easiest was Scanning for details, followed by Finding out word meanings and Drawing conclusions. Rubin (1993) pointed out that questions that required understanding and interpretation would be difficult for children because readers had to possess "problem-solving ability and be able to work at various levels of abstraction" (p. 196). He added that, to some extent, the skill of interpretation depended on the students' ability in the skill of inference. The easy skills may represent the literal level, the easiest level of reading comprehension in Barrett's taxonomy (Day & Park, 2005; Dupuis, Lee, Badiali, & Askov, 1989; Pearson & Johnson, 1978). This level was described as "an understanding of the straightforward meaning of the text, such as facts, vocabulary, dates, times, and locations. Questions of literal comprehension can be answered directly and explicitly from the text" (Day & Park, 2005, p. 3). Alderson (2005), for example, also maintained that questions might vary from easy to difficult as a result of cognitive demand; questions which required searching for specific facts were usually less difficult than questions that required synthesis, analysis, or inference. In this respect, Pearson and Johnson (1978) categorized question type into three levels ranging from easy to most difficult: textually explicit, textually implicit and script based. They highlighted that textually explicit questions were "those where both the question information and the correct answer are found in the same sentence. Textually implicit questions, on the other hand, require respondents to combine information across sentences. Script-based questions require readers to integrate text information with their background knowledge since correct responses to the questions cannot be found in the text itself" (p. 87).
The findings on the hierarchy of reading subskills in this study were consistent with those of Hessamy (2013), who concluded that there was the possibility of getting "empirically-based hierarchies of difficulty and importance among the subskills." In other words, a hierarchy of subskill difficulty, with higher-order and lower-order skills, could exist. Hessamy (2013) ordered the examined reading subskills from the easiest to the most difficult as: identifying writer's views/claims, understanding specific information, identifying main idea, and extracting information from a text to put into diagrammatic representation. This finding supported the view that reading could not be a unitary skill.
Additionally, in the current study, the results showed that linear contexts were more difficult than non-linear contexts. Linear contexts were more difficult possibly because most texts of this type were long (four to five paragraphs) and more cognitively demanding, i.e., most of the information was not explicitly stated, and they required students to spend much time on understanding the texts and applying appropriate skills to answer the questions. In contrast, the non-linear contexts included short texts which required less time to read and to figure out the answers. Most items (67%) of this type could be answered easily because the information was explicitly stated in the texts. Of course, not all items of linear contexts were difficult, and not all items of non-linear contexts were easy.
The additional empirical evidence that reading skill had multi-divisible sub-skills with different difficulty levels could help and guide language teachers, course designers, and test item developers to better teach and produce teaching materials as well as test items that would be able to meet their students' reading competencies (Hessamy, 2013). Hughes (2003) also maintained that despite the issue of the involvement of subskills in reading, the reading test would have to include samples of the subskills relevant to the test purpose. Brown (2003) further expounded that the skills used in reading were essential considerations in the assessment of reading ability. It is worthwhile to add that most reading models refer to reading skills or sub-processes for profiling purposes, and so language learners should be tested on a range of relevant skills or strategies (Alderson, 2005).

CONCLUSION
The results of the study support the notion that different reading skills exert differential cognitive demands. Those that require higher order thinking skills, such as making inference and interpreting information, are more difficult than those requiring lower order skills, such as finding out word meanings and scanning for information. The results of the Many-facet analysis provide the much needed evidence that there is a strong possibility of a hierarchy of reading skills. In this respect, factors that affect item difficulty or order should be estimated to get a more accurate picture of their influence. It will be beneficial to analyze test items and examine features that influence item difficulty before they are administered to students. Future research that examines the interaction between different reading skills and their effect on language learning and test performance will also greatly benefit reading theory and practice.

Figure 1. Distribution of Items and Persons on the Logit Scale


Figure 2. Location of Context Type, Reading Skills Associated with Items, and Text Type on the Logit Scale


Figure 2 reveals that skill categories too did not have the same difficulty level. The skill categories were ordered from the most difficult at the top to the least difficult at the bottom of the scale. Four of the skills were located above the item mean (0.0 logit) and the other two below it. The skill measures showed that the most difficult skill was Understanding figurative language (0.56 logit), followed by Making inference (0.41 logit), and Interpreting information (0.37 logit). The easiest skill was Scanning for details (-0.81 logit), followed by Finding out word meanings (-0.60 logit). On the other hand, Drawing conclusions (0.07 logit) was close to the average difficulty level. With a reliability index of 1.00 and a chi-square of 2082.3 with 5 df, significant at p < .01, it could be concluded that the skill categories were not equally difficult. In terms of fit statistics, all the skills fit the expectations of the Rasch model as they fell within the recommended range of Infit MNSQ of 0.7 to 1.3 (see Table 2) (Bond & Fox, 2015).
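The reported skill measures can be checked and interpreted with a few lines of arithmetic. The sketch below uses only the six logit values given in the text; the variable names are illustrative. It orders the skills, confirms that the measures of this centred facet sum to zero, and converts the gap between the hardest and easiest skills into an odds ratio. The odds interpretation assumes that the same person is compared across skills with all other facets held equal, which is a simplification made for illustration.

```python
import math

# Skill-area measures in logits, as reported in the text
skill_measures = {
    "Understanding figurative language": 0.56,
    "Making inference": 0.41,
    "Interpreting information": 0.37,
    "Drawing conclusions": 0.07,
    "Finding out word meanings": -0.60,
    "Scanning for details": -0.81,
}

# Order the skills from most to least difficult
ordered = sorted(skill_measures.items(), key=lambda pair: pair[1], reverse=True)

# A centred facet has a mean measure of zero, so the measures should sum to zero
total = sum(skill_measures.values())

# Skills located above the item mean of 0.0 logit
above_mean = [name for name, measure in ordered if measure > 0.0]

# Gap between the hardest and the easiest skill, and the odds ratio it implies
gap = ordered[0][1] - ordered[-1][1]
odds_ratio = math.exp(gap)

for name, measure in ordered:
    print(f"{measure:+.2f}  {name}")
print(f"sum = {total:.2f}, gap = {gap:.2f} logits, odds ratio = {odds_ratio:.2f}")
```

Under the stated assumption, a gap of 1.37 logits means the odds of success on Scanning for details are roughly four times the odds on Understanding figurative language.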

Table 1
Test Items: Reading Subskills, Context Types, and Text Types
These test items represented the following skill areas: Interpreting information (5 items), Making inference (6 items), Understanding figurative language (2 items), Drawing conclusions (8 items), Scanning for details (6 items), and Finding out word meanings (15 items). With regard to context type, 30 linear items and 12 non-linear items were identified. In terms of text type, the distribution of items was as follows: Ads (2 items), Notice (3 items), Chart (2 items), Story extract (3 items), Short message (1 item), Poem (4 items), Short news report (2 items), Brochure (5 items), Formal letter (6 items), Conversation (8 items), Long passage (5 items), and Weather forecast (1 item). The items or tasks varied in terms of difficulty and were targeted to assess the reading skills of students with different levels of ability.
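As a quick consistency check on the item distribution described here, the sketch below tallies the counts under each of the three classifications. The dictionary names are illustrative; the counts are those stated in the text. Each classification should account for all 42 reading comprehension items.

```python
# Item counts by classification, as stated in the text
skill_counts = {
    "Interpreting information": 5,
    "Making inference": 6,
    "Understanding figurative language": 2,
    "Drawing conclusions": 8,
    "Scanning for details": 6,
    "Finding out word meanings": 15,
}
context_counts = {"Linear": 30, "Non-linear": 12}
text_type_counts = {
    "Ads": 2, "Notice": 3, "Chart": 2, "Story extract": 3,
    "Short message": 1, "Poem": 4, "Short news report": 2,
    "Brochure": 5, "Formal letter": 6, "Conversation": 8,
    "Long passage": 5, "Weather forecast": 1,
}

# Total number of items under each classification
totals = {
    "skill area": sum(skill_counts.values()),
    "context type": sum(context_counts.values()),
    "text type": sum(text_type_counts.values()),
}
print(totals)
```

All three totals come to 42, so every item is classified exactly once on each facet.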

Table 2
Reading Skill Areas Measurement Report