Using Theory-Based Test Construction to Develop a New Curriculum-Based Measurement for Sentence Reading Comprehension

Reading comprehension at sentence level is a core component in the students’ comprehension development, but there is a lack of comprehension assessments at the sentence level, which respect the theory of reading comprehension. In this article, a new web-based sentence-comprehension assessment for German primary school students is developed and evaluated using a curriculum-based measurement (CBM) framework. The test focuses on sentence level reading comprehension as an intermediary between word and text comprehension. The construction builds upon the theory of reading comprehension using CBM-Maze techniques. It is consistent on all tasks and contains different syntactic and semantic structures within the items. This paper presents the test development, a description of the item performance, an analysis of its factor structure, and tests of measurement invariance and group comparisons (i.e., across gender, immigration background, over two measurement points, and the presence of special educational needs; SEN). Third grade students (n = 761) with and without SEN finished two CBM tests over three weeks. Results reveal that items had good technical adequacy, the constructed test is unidimensional, and it is valid for students both with and without SEN. Similarly, it is valid for both sexes, and results are valid across both measurement points. We discuss our method for creating a unidimensional test based on multiple item difficulties and make recommendations for future test construction.


INTRODUCTION
Reading acquisition, particularly reading comprehension, is one of the most important academic skills (Nash and Snowling, 2006); however, multiple student groups [i.e., students with special educational needs (SEN), or students with an immigration background] are at a disadvantage in reaching a high level of reading comprehension competency. In particular, reading comprehension is a challenge for students with SEN, resulting in problems in both primary and secondary school (Gersten et al., 2001; Cain et al., 2004; Berkeley et al., 2011; Cortiella and Horowitz, 2014; Spencer et al., 2014). Many students with SEN have language difficulties that are accompanied by a high risk for difficulties in literacy. In the United States and the United Kingdom, the vast majority of students with SEN struggle in this area (Kavale and Reese, 1992; NAEP, 2015; Lindsay and Strand, 2016). Similar results were reported for learners speaking phonologically consistent languages such as German (Gebhardt et al., 2015). Additionally, in most western countries, including Germany, students with an immigration background (i.e., they were born in another country, have at least one parent who was born in another country, or are non-citizens born in the target country; Salentin, 2014) face greater academic challenges than native students (OECD, 2016). Consequently, they also tend to have lower reading competency (Schnepf, 2007; OECD, 2014; Lenkeit et al., 2018), which is partially related to a discrepancy in skills in the host country's language (Kristen et al., 2011). Recently, Lindsay and Strand's (2016) large-scale longitudinal study emphasized the importance of identifying children with reading problems and their individual needs as early as possible. Alarmingly, students with reading comprehension problems during primary school demonstrate reading difficulties well into adolescence (Landerl and Wimmer, 2008; Taylor, 2012).
Additionally, it is widely agreed that there is a large need for adequate assessments to achieve positive outcomes for children with SEN (OECD, 2005; Hao and Johnson, 2013; Lindsay, 2016).
One possible assessment type is curriculum-based measurement (CBM), which involves short, time-limited, and frequent tests to visualize the learning progress of low achieving students (e.g., students with SEN; Deno, 1985, 2003). For instance, the CBM-Maze task is a reading comprehension task where students receive a complete text with multiple blanks. They fill in each blank with one word from a few choices. CBM-Maze was designed to monitor the growth of intermediate and secondary students' reading comprehension. More recent studies showed that CBM-Maze measures early language skills, such as sentence level comprehension and code-related skills, rather than higher language skills, such as inference-making, text comprehension, and knowledge about text structures (Wayman et al., 2007; Graney et al., 2010; Muijselaar et al., 2017). Because CBM-Maze assesses earlier reading skills, it may be adapted for younger students, including low achieving students. In this paper, we develop a new assessment for reading comprehension at the sentence level for primary school students, in line with reading comprehension theory and the CBM framework. Our goal is to create an assessment that considers different structures of language and is highly suitable for both researchers and practitioners.

Reading Comprehension
Reading comprehension is "the ultimate goal of reading" (Nation, 2011, p. 248); it is necessary for everyone to be successful in school and society. In general, reading comprehension is a system of cognitive skills and processes (Kendeou et al., 2014). Multiple underlying skills, such as rapid naming, phonological and orthographic processing, fluency, vocabulary, and working memory, need to interlock to allow for good comprehension performance (e.g., Cain et al., 2004; Cain and Oakhill, 2007; Kendeou et al., 2012; Keenan and Meenan, 2014). According to the simple view of reading (Gough and Tunmer, 1986; Hoover and Gough, 1990), reading comprehension is defined as a product of code-decoding skills and linguistic comprehension. Furthermore, it is divided into three related levels: word, sentence, and text. Comprehension at the word level is the lowest-tier ability. Readers visually identify a word as a single unit and compare it with their mental representation (Coltheart et al., 2001). It includes subskills such as phonological awareness, decoding, and written word recognition. The sentence represents the middle tier. When readers connect several elements of a sentence, such as words and phrases, they frame a local representation at this level. Sentence comprehension builds a fundamental bridge between the lower (word) and upper (text) levels using syntactic parsing and semantic integration (Frazier, 1987; Ecalle et al., 2013). At the top tier is text comprehension. In order to understand connected texts, reading learners need to establish additional complex cognitive processes, such as inference-making, coherence-making, and the use of background knowledge (Van Dijk and Kintsch, 1983). In contrast to the simple view of reading, the hierarchical construction-integration model of text comprehension (originally by Kintsch, 1998) is divided into two mental representations: textbase and situation model (Kintsch and Rawson, 2011).
In textbase representations, readers combine the low level reading skills on word and sentence levels (i.e., microstructure) to build a local, coherent representation of the macrostructure of a text. The situation model relates to text content (i.e., integration of further information and knowledge) and is comparable with the text level comprehension. Both the textbase and the situation model of a text are fundamentally connected (Perfetti, 1985). Without understanding single words or sentence meaning, no reader is able to make inferences, or coherences. Consequently, in both the simple view of reading and the construction-integration model, sentence comprehension is a critical skill for advanced text comprehension.

Processes of Sentence Comprehension
General syntactic and semantic comprehension processes influence comprehension at the word and sentence level. Additionally, word recognition is linked with syntactic and semantic analysis (Oakhill et al., 2003; Frisson et al., 2005). All these processes can be understood as parallel, modular, or dominated by one process (Taraban and McClelland, 1990; McRae et al., 1998; Kennison, 2009). Even though research results are not consistent, relevant factors in sentence comprehension are known. For instance, word classes guide syntactic parsing. Kennison (2009) found that verb information affected syntactic parsing in undergraduate and graduate students. In contrast, Cain et al. (2005) studied primary school students' ability to choose correct conjunctions in tasks similar to CBM-Maze tasks (specifically Cloze tasks). Results showed that filling in additive and adversative conjunctions is easier than filling in temporal and causal ones for 8- to 10-year-olds. Thus, syntactic information influences the extraction of the sentence meaning, and semantic information influences the comprehension of individual words. More precisely, their results show that the part of speech causes different difficulty levels in children's sentence comprehension. Therefore, reading comprehension assessments need to include these possible difficulties in the test structure to be able to measure these core skills. In line with earlier claims, these results confirm that reading comprehension assessments should reveal whether students are struggling in lower- or upper-level subskills, and where the difficulties lie (Catts et al., 2003).

Reading Difficulties and Sentence Comprehension
Comprehension problems are very heterogeneous. They can be caused by lexical processes (e.g., phonological and semantic skills or visual word recognition), by the capacity of working memory, or by higher text processes (see Nation, 2011). Students might develop word recognition difficulties, isolated comprehension difficulties, or combined problems in both areas (Leach et al., 2003; Nation, 2011; Catts et al., 2012; Kendeou et al., 2014). Some studies suggest that younger readers' problems can be mostly ascribed to poor word recognition skills (Vellutino et al., 2007; Tilstra et al., 2009). Accordingly, less than one percent of early primary students who perform well in decoding and vocabulary show isolated comprehension problems (Spencer et al., 2014). However, poor readers also differ from good readers in the efficiency of related cognitive processes (Perfetti, 2007). At the sentence level, poor comprehenders use sentence content for word recognition, which is especially challenging in complex semantic and syntactic structures. West and Stanovich (1978) showed that sentence content supports the word recognition processes of fourth graders more than those of advanced adult readers. Consequently, all readers can develop greater difficulties if the sentence content is not congruent with the word meaning. Martohardjono et al. (2005) investigated the relevance of the syntax of single sentences for reading comprehension in bilingual students. Results revealed that the participants could not use the syntactic structure to support their word recognition. Relatedly, poor readers in third grade struggle to extract the meaning of syntactically complex sentences when these are presented verbally (Waltzman and Cairns, 2000). Besides the semantic and syntactic deficits, poor readers take longer to read complex sentences (Graesser et al., 1980; Chung-Fat-Yim et al., 2017). The more complex the sentence, the more time is needed to process the syntax and semantics of the sentence.
Thus, poor readers need more cognitive resources to read and understand single sentences.
Furthermore, sentence level comprehension tests can measure general reading comprehension. Ecalle et al. (2013) examined the role of sentence processing as a mediatory skill within reading comprehension in second through ninth graders. First, they presented the students semantically similar sentences with different complexity and vocabulary. The students had to judge whether the contents were similar or not. Second, they examined the impact of these skills on expository text comprehension. The results confirmed that sentence processing increases over age and that sentence comprehension "could constitute a good indicator of a more general level of reading comprehension irrespective of the type of text" (Ecalle et al., 2013, p. 128).
Overall, these results show that even small changes (i.e., semantic and syntactic ones) within a sentence can influence students' comprehension. It is especially important that the comprehension ability of poor readers (i.e., students with SEN or with an immigration background) is sensitive to both word meaning and sentence structure. Additionally, these results affirm the need for specific reading comprehension assessments at the sentence level, in contrast to word recognition or fluency tests (Cutting and Scarborough, 2006).

Curriculum-Based Measurements
Curriculum-based measurement (CBM) is a problem-solving approach for assessing the learning growth of low achieving students (e.g., students with SEN) in basic academic skills, such as reading, writing, spelling, or mathematics (Deno, 1985, 2003). The main idea of CBM is to monitor children's development with short and very frequent tests during regular lessons. This allows teachers to graphically view the slope of individual learning growth and to evaluate the effectiveness of the instruction. Teachers can then link the results with their decision-making and lesson planning.

Curriculum-Based Measurements as an Assessment of Reading Comprehension
A great deal of research relating to CBM has been conducted (Fuchs, 2017), especially on reading tasks. In primary schools, two kinds of CBM instruments are ordinarily used to measure reading competencies: CBM-R and CBM-Maze (Graney et al., 2010). CBM-R involves individual students reading aloud from a word list or a connected text, and it measures oral reading fluency and accuracy (i.e., word recognition). CBM-Maze was developed to compensate for the disadvantages of CBM-R, namely individual administration and teacher distrust of CBM-R as a reading comprehension measurement. CBM-Maze is a group-administered silent reading task that measures general text reading comprehension. However, both tests provide valid measurements of students' reading comprehension skills (Ardoin et al., 2004; Marcotte and Hintze, 2009). For both types of CBM, many studies report technical adequacy, strong alternate-form reliability, moderate to strong criterion-related validity, and predictive validity (e.g., Shin et al., 2000; Graney et al., 2010; Espin et al., 2012; Ardoin et al., 2013). Furthermore, correlations between CBM-R and CBM-Maze were found (Wayman et al., 2007). CBM-R is typically used for measurements within the first three grades and CBM-Maze for fourth grade and above.

Principles of CBM-Maze
In traditional CBM-Maze tasks, students read a timed short passage (∼250 words), in which different words are deleted. For each blank, one target word and two or more distractors are presented. The students choose one word for each blank. In general, the first and the last sentence of each passage are kept intact to allow the context to guide comprehension. The number of correctly filled blanks is then used as the competence measure for reading comprehension. The CBM-Maze task is based on the Cloze test (Louthan, 1965; Gellert and Elbro, 2013), where students fill in the blanks without any time limit or word suggestions. Since then, many studies have examined CBM-Maze test construction and administration issues. The first researchers to describe the CBM-Maze task highlighted, in contrast to CBM-R, its higher classroom usability, because CBM-Maze can be administered in groups and on computers. Recently, Nelson et al. (2017) found that computer adaptive tests, as a practice of CBM, are valid for progress monitoring with fourth and fifth graders. They reported that with frequent data collection, computer testing systems can examine the overall learning growth of individuals and student groups. One main feature of CBM tests is that they can be administered multiple times, which requires both alternate test forms and sensitivity to student growth (Fuchs, 2004).

Test administration
In their meta-analysis, García and Cain (2014) determined general reading comprehension assessment characteristics. They observed significant differences for the linguistic material and the administration procedure (i.e., reading aloud or silently, and test time). However, this was not upheld for CBM-Maze tasks because research has not found significant differences for primary students between reading silently or aloud (Hale et al., 2011). Accordingly, the CBM-Maze assessments are mostly administered silently for higher practicability in group settings. The CBM-Maze test time is usually short, from 1 to 10 min (e.g., Brown-Chidsey et al., 2003;Wiley and Deno, 2005). While no specific limit has been agreed upon, Brown-Chidsey et al. (2003) suggested that a test time of 10 min is too long for a 250-word passage.

Item construction
Traditionally, the items of CBM-Maze tasks are connected passages. Depending on the age of the students, different kinds of passages are used, such as fables (Förster and Souvignier, 2011), newspaper articles (Tichá et al., 2009), and historical texts (Brown-Chidsey et al., 2003). Outside the CBM approach, maze tasks are commonly used to measure semantic and syntactic skills within sentence processing (Forster, 2010). In these cases, single sentences are mostly used instead of connected passages (e.g., Witzel and Witzel, 2016). January and Ardoin (2012) examined differences in the construction of the CBM-Maze probes with third, fourth, and fifth graders. They examined whether the students' performance is contingent on the content of the passages. The findings indicated that primary school students performed well on both intact (i.e., sentences in order) and scrambled (i.e., sentences out of order) CBM-Maze passages. Additionally, they concluded that the CBM-Maze task measures reading comprehension at the sentence level because the students did not need the context to perform well. Taken together with the results of Ecalle et al. (2013), these results suggest that CBM-Maze could also be administered with single sentences instead of connected passages.
Furthermore, item construction depends on the deletion pattern and on the linguistic material. Some test designers use a fixed (i.e., every seventh word) or a lexical (e.g., deletion of nouns, verbs, adjectives, or conjunctions) deletion pattern; however, Kingston and Weaver (1970) showed that the lexical deletion pattern produced results similar to those of fixed deletion. Similarly, January and Ardoin (2012) could not find significant differences in students' accuracy based upon different lexical deletion patterns.
While not explicitly tested, results that indicated similar accuracy for different types of lexical deletion suggest unidimensionality. This is important for teachers and researchers because items that fall on the same dimension are easier to interpret (Gustafsson and Åberg-Bengtsson, 2010). This does not mean that the underlying construct of reading comprehension is unidimensional, but that the results are interpretable along a single dimension of reading comprehension (Reise et al., 2013). Furthermore, individual test performance differs based upon factors such as SEN and immigration background (Cortiella and Horowitz, 2014;Spencer et al., 2014;OECD, 2016). Thus, in order for a test to be fair for all test takers, the linguistic material should be similar in construction and the items need to be appropriate for diverse groups of students, such as learners with SEN, those with an immigration background, and learners of both genders (i.e., measurement invariance; Good and Jefferson, 1998;Steinmetz, 2013).

Distractors
Other studies discussed the influence of distractors on a correct answer. Early studies indicated two types of distractors: semantically plausible with incorrect syntactic structure or semantically meaningless with correct syntactic structure (Guthrie et al., 1974; Gillingham and Garner, 1992). Resulting suggestions for distractors include words that look similar to the target word, words that lack contextual sense, words with a related but incompatible contextual meaning, or nonsense words. McKenna and Miller (1980) found that syntactically correct distractors are more difficult for students to exclude than similar-looking words. Meanwhile, Conoyer et al. (2017) found that tests using content-based and part-of-speech-based distractors were similar.

The Present Study
Because reading comprehension at the sentence level is a necessary skill (Ecalle et al., 2013), we developed a new web-based test to measure this competence. Our new assessment focuses on sentence reading ability within a CBM framework for primary school students (i.e., third graders). Our study details test development and analyzes item difficulty. It also assesses dimensionality with an analysis of the factorial structure and tests measurement invariance across several relevant groups. Additionally, we track the performance of our participants across two measurement points and examine the effect of subject variables including SEN, immigration background, and gender. Accordingly, we developed two groups of research questions. The first group of questions relates to the technical evaluation of the test and the second group relates to the overall performance of our participants.
The first three questions relate to the technical aspects of test construction and interpretation, including item difficulty, unidimensionality, and measurement invariance:
1. What are the item difficulties and do they relate to different item types? We use multiple word types to create different difficulty levels and we expect that some word types will be more difficult than others (see Cain et al., 2005; Frisson et al., 2005; Kennison, 2009).
2. Can the instrument results be interpreted unidimensionally? A unidimensional test structure would allow for easy test interpretation for both researchers and educators, because they only need to consider overall performance on the test. We use a consistent and straightforward sentence structure with age-related words to create a test structure that is applicable to both good and poor comprehenders (e.g., students with SEN). Thus, we hypothesize that all items fit on a unidimensional test structure because the item structure is consistent and all items represent the same underlying reading competence.
3. Does the test possess measurement invariance relating to SEN, immigration background, gender, and measurement points? Test construction followed guidelines for CBM test construction and evaluation (Fuchs, 2004; Wilbert and Linnemann, 2011). This includes multiple alternate test forms of equal difficulty and the integration of several subskills (e.g., word recognition, syntactic parsing, and semantic integration). Because we adopt these established recommendations and combine them with CBM-Maze practice (e.g., Brown-Chidsey et al., 2003; January and Ardoin, 2012; Conoyer et al., 2017), we hypothesize that our test will be invariant over student groups and measurement points.
The last two questions focus on the performance of the participants in relation to classroom and individual factors:
4. What is the intraclass correlation? Although the test was given to different classrooms, we expect the results to be comparable in each classroom. Therefore, we expect a relatively low intraclass correlation, meaning that the test performs similarly across all classrooms. For comparison, Hedges and Hedberg (2007) use the guideline of 0.05-0.15 in their large-scale assessment.
5. Did performance improve over time, and did subject variables, including SEN, immigration background, and gender, influence performance? We compare the sum scores of our participants across two measurement points. We expect that there will be an improvement in performance from measurement point one to two and that learners with SEN will perform worse (see Gebhardt et al., 2015; Lindsay and Strand, 2016).

Test Administration
The new reading assessment is administered online via a German web-based platform for CBM monitoring, called Levumi (www.levumi.de). The code for the Levumi platform is published on GitHub, and all tests, test materials, and teacher handbooks will be published under a Creative Commons license. This means that this test is free of charge for teachers and researchers (Jungjohann et al., 2018). The platform runs on all major browsers and only requires an internet connection. It records each student's response and reaction time for every item. The test can be administered in groups or individually. Each student has an individual account in which the activated tests become available. Teachers or researchers activate the test for each measurement point in each test-taker's individual account. A computer or tablet is required for each simultaneous test-taker.
At the beginning of each test, a simple interactive example is shown. This prevents an accidental test start or other interface problems. After the example, students have 8 min to answer as many items as possible. On the screen, the items appear one after another. The students see a single sentence with a blank. Underneath the sentences, the target word and the distractors are displayed in a random order. When a student chooses one of the possible words, it appears in the blank. The students can change their minds by clicking on another response, and afterwards they confirm their answer by clicking on the "next" button. When the time limit runs out, the students can finish the current item, and then the test closes automatically.
At each measurement point, the item order is different, allowing for alternate test forms for frequent measurements over a school year. The items have a fixed order for the first measurement (see Table 1), with items alternating between the word-categories (as described below). This fixed item order creates a baseline for comparison at the first measurement point. From the second measurement point onward, random orders are generated to allow for a large number of alternate test forms. In these orders, a category is randomly selected and then an item is drawn from that category, so that the ratio of items from each category is kept equal.
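The generation of random alternate forms can be illustrated with a minimal sketch. It is not the Levumi implementation; it shows one possible reading of the procedure, in which the word-categories are visited in a freshly shuffled order in every round and one unused item is drawn from each. The function name, the item identifiers, and the category labels are hypothetical; only the category sizes (20, 21, and 19 items) follow the item pool described in this article.

```python
import random
from collections import defaultdict


def make_random_form(items, seed=None):
    """Return one randomized item order with a balanced category ratio.

    `items` is a list of (item_id, category) pairs. In each round the
    categories that still have items are visited in a shuffled order and
    one not-yet-used item is drawn from each of them. This is an assumed
    procedure for illustration, not the platform's actual algorithm.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for item_id, category in items:
        pools[category].append(item_id)
    for pool in pools.values():
        rng.shuffle(pool)

    order = []
    while any(pools.values()):
        categories = [c for c, pool in pools.items() if pool]
        rng.shuffle(categories)
        for category in categories:
            order.append(pools[category].pop())
    return order


# Hypothetical identifiers; only the category sizes are taken from the article.
pool = (
    [(f"N{i}", "noun") for i in range(20)]
    + [(f"V{i}", "verb_adjective") for i in range(21)]
    + [(f"C{i}", "conjunction_preposition") for i in range(19)]
)
form = make_random_form(pool, seed=1)
```

Because students rarely reach the end of the pool within the time limit, balancing the categories within every round (rather than only over the whole form) keeps the category ratio roughly equal at whatever point a student stops.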

Item and Distractor Construction
The overall process of item creation followed the CBM-Maze's principles; however, some modifications were made according to reading comprehension theory in order to create a test using sentences rather than connected passages. First, all items were carefully created as individual sentences. The entire pool of 60 items can be seen in Table 1. To ensure that every sentence is appropriate for third graders, all important words were collated from curricula within the German primary school systems (e.g., lists of frequent words based on grade level). Every sentence is a sentoid, meaning it is semantically explicit including the distractors. All sentences are constructed in the active voice and with age-appropriate syntactic structures (i.e., avoiding sentences with multiple clauses).
Second, a lexical deletion pattern was chosen to set different item difficulties within one test. Research results showed that the type of word can affect sentence comprehension differently. To adopt these results for the test construction, all items were classified by the lexical deletion pattern. The essential hypothesis is that difficulty is determined by the type of word deleted (i.e., part of speech). Because all items were set with a similar sentence structure, they relate to the same competence (i.e., reading comprehension at sentence level). To represent the variation of the German language, word types were summarized in multiple categories. This new assessment considers only possibilities relevant for third graders and not all possibilities within the German language. Therefore, three categories were set. The first word-category included nouns (n = 20) used as both subjects and objects. The second category included verbs and adjectives (n = 21). The third category included conjunctions and prepositions (n = 19). Third, three distractors were created for every target word. Every distractor was contextually meaningless but syntactically possible. Three rules guided the distractor creation: one distractor had a similar look, another distractor was related to the contextual sense of the target word or sentence but lacked the correct meaning, and the last distractor made no contextual sense. Because of vocabulary limitations, these rules could not be implemented precisely in each item. In these cases, words from the other rules were adapted. In every case, the same number of words was presented. The following example illustrates these rules: (Item 4-German) Ein Lama hat vier Beine/Bücher/Daumen/Kamele. (Item 4-English) A llama has four legs/books/thumbs/camels.

Participants
Participants were third grade students attending regular elementary schools in the northwest of Germany (n = 761). Approximately half of the participants were female (46.5%). The participants' teachers were asked about immigration background (n = 344) and SEN. SEN were listed as learning (n = 37), cognitive development (n = 2), and other (i.e., speech and language impairments, emotional disturbance, or functional disability; n = 67). Participants with low German language competence (n = 40) were also categorized as SEN.

Procedure
Trained research assistants (i.e., university students) contacted local elementary school administrators and teachers to recruit participants. All data was collected with the informed consent of participants, parents, teachers, and administrators. The research assistants tested participants in small groups. Each participant worked individually on a single school computer. After the first measurement, researchers returned 3 weeks later to collect the data for the second measurement. Some students were not available due to illness or other reasons at the second measurement (n = 94). In this case, their data for the second measurement was treated as missing, but their data from the first measurement was used where appropriate.
During both measurements, research assistants followed the same scripted procedure. Children were told that the little dragon Levumi had brought many sentences with it, but that each of the sentences was missing a word. Each child was asked if he or she could find the correct word for each sentence. Next, the participants were given an example item. Once they gave the correct answer for the example item, the research assistant showed the participant how to give this answer on the test with the mouse. After the example item, participants answered as many items as they could in an 8-min period.

Item Difficulties
Average item difficulty was obtained by averaging the percentage of correct responses across both measurement points. Sufficient numbers of participants completed all test items (n = 87, 11.4%) to allow for a power-analysis, and a further 25% of participants completed the vast majority of the test (50 items). Therefore, missing scores were treated as not yet reached in our analyses. A repeated measures one-way ANOVA compared the average difficulties based on word-category (noun, adjective/verb, and conjunction/preposition).
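As a minimal sketch of this computation, the following function derives the percentage of correct responses per item while skipping items a student did not reach. The data layout (a dictionary of response lists with `None` marking a not-reached item) and the function name are assumptions made for illustration; the article does not describe the actual data format.

```python
def item_difficulties(responses):
    """Percent correct per item, ignoring not-reached responses.

    `responses` maps a participant id to a list with one entry per item:
    1 (correct), 0 (incorrect), or None (item not reached in the time limit).
    Returns a list of percentages, or None for an item nobody reached.
    """
    n_items = len(next(iter(responses.values())))
    difficulties = []
    for i in range(n_items):
        reached = [r[i] for r in responses.values() if r[i] is not None]
        if reached:
            difficulties.append(100.0 * sum(reached) / len(reached))
        else:
            difficulties.append(None)
    return difficulties


# Toy data for three hypothetical students and four items.
toy = {
    "s1": [1, 1, 0, None],
    "s2": [1, 0, None, None],
    "s3": [1, 1, 1, 1],
}
toy_difficulties = item_difficulties(toy)
```

Under this layout, the average difficulty across both measurement points would be the mean of the two per-item percentages obtained by calling the function once per measurement point.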

Dimensionality
To assess the dimensionality of the instrument, the factor structure was tested via three separate confirmatory factor analyses (CFAs). All factor analyses were conducted in Mplus 7.4 (Muthén, 1998-2015) using the weighted least squares mean and variance adjusted (WLSMV) estimation method. The WLSMV estimator is more appropriate for categorical data (Muthén et al., 1997; Flora and Curran, 2004). Factor structures were based on the word type category and function in the sentence. The 3-factor model mirrored the three word-categories: nouns, verbs/adjectives, and conjunctions/prepositions. In the 2-factor model, the verbs/adjectives and conjunctions/prepositions factors were combined. Finally, in the 1-factor structure, all items were placed on the same factor. We appraised the model fits via the root mean squared error of approximation (RMSEA), the comparative fit index (CFI), and gamma-hat. We considered RMSEA < 0.08, CFI > 0.90, and gamma-hat > 0.90 acceptable fits. Meanwhile, we considered RMSEA < 0.05, CFI > 0.95, and gamma-hat > 0.95 good fits (Hu and Bentler, 1998).
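The cutoffs above can be written as a small decision rule. The sketch below assumes that all three indices must meet a cutoff jointly, which the text does not state explicitly; the function name and the label "poor" for models that miss the acceptable cutoffs are likewise hypothetical.

```python
def classify_fit(rmsea, cfi, gamma_hat):
    """Classify model fit using the cutoffs stated in the text.

    Assumption: all three indices have to satisfy a cutoff at once.
    """
    if rmsea < 0.05 and cfi > 0.95 and gamma_hat > 0.95:
        return "good"
    if rmsea < 0.08 and cfi > 0.90 and gamma_hat > 0.90:
        return "acceptable"
    return "poor"


# Invented index values, used only to exercise the rule.
example_good = classify_fit(rmsea=0.03, cfi=0.97, gamma_hat=0.98)
example_acceptable = classify_fit(rmsea=0.06, cfi=0.93, gamma_hat=0.96)
```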
Next, we compared the fits of the separate models in order of increasing complexity. We compared the 2-factor model to the 1-factor model and the 3-factor model to the 2-factor model. We examined changes in CFI (ΔCFI) and gamma-hat (ΔGamma-hat). We set a threshold of 0.01 for ΔCFI and ΔGamma-hat as a significantly better fit (Cheung and Rensvold, 2002;Dimitrov, 2010).
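As an illustration of the fit indices and the 0.01 change criterion, the following Python sketch uses the common textbook definitions of RMSEA, CFI, and gamma-hat computed from chi-square statistics. These formulas are an assumption made for illustration: the values reported in this study come from Mplus with WLSMV estimation and may be computed differently (for example, some definitions use N rather than N - 1). All numeric inputs below are hypothetical.

```python
import math


def rmsea(chi2: float, df: int, n: int) -> float:
    """Root mean squared error of approximation (textbook definition)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2: float, df: int, chi2_base: float, df_base: int) -> float:
    """Comparative fit index relative to a baseline (null) model."""
    numerator = max(chi2 - df, 0.0)
    denominator = max(chi2 - df, chi2_base - df_base, 0.0)
    return 1.0 if denominator == 0 else 1.0 - numerator / denominator


def gamma_hat(chi2: float, df: int, n: int, p: int) -> float:
    """Gamma-hat for a model with p observed variables.

    Not truncated, so it can exceed 1 when chi2 < df.
    """
    return p / (p + 2.0 * (chi2 - df) / (n - 1))


def change_exceeds_threshold(
    index_a: float, index_b: float, threshold: float = 0.01
) -> bool:
    """True if two models differ by more than the threshold on one index."""
    return abs(index_a - index_b) > threshold


# Hypothetical chi-square results for a simpler and a more complex model.
n_students, n_items = 501, 20
simple_cfi = cfi(chi2=150.0, df=100, chi2_base=2100.0, df_base=190)
complex_cfi = cfi(chi2=140.0, df=98, chi2_base=2100.0, df_base=190)
simple_gamma = gamma_hat(chi2=150.0, df=100, n=n_students, p=n_items)
complex_gamma = gamma_hat(chi2=140.0, df=98, n=n_students, p=n_items)

cfi_differs = change_exceeds_threshold(simple_cfi, complex_cfi)
gamma_differs = change_exceeds_threshold(simple_gamma, complex_gamma)
```

When neither index changes by more than the threshold, the more complex model offers no meaningful gain, and the simpler model can be preferred on grounds of parsimony.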

Measurement Invariance
We examined the measurement invariance of each of the three models based on presence of SEN, immigration background, gender, and measurement point. We constrained or freed thresholds and lambda together. In other words, we tested the scalar (strong) model directly against the configural (base) model, as recommended when using WLSMV analysis by Muthén (1998-2015). As described above, differences in CFI and Gamma-hat greater than 0.01 were considered significant.

Intraclass Correlation
Next, the intraclass correlation was calculated using the proportion of variance explained by classroom compared to overall variance. Values were calculated based on the sum of squares in a one way ANOVA of class on average percent correct at the first measurement point.
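A minimal Python sketch of this calculation follows, assuming the intraclass correlation is taken as the ratio of the between-classroom sum of squares to the total sum of squares from a one-way ANOVA. Other ICC definitions based on mean squares exist and would give different values. The function name and the classroom scores are hypothetical.

```python
from typing import Dict, List


def icc_from_sums_of_squares(scores_by_class: Dict[str, List[float]]) -> float:
    """Proportion of total variance explained by classroom (SS_between / SS_total)."""
    all_scores = [s for scores in scores_by_class.values() for s in scores]
    grand_mean = sum(all_scores) / len(all_scores)

    # Total variation of all students around the grand mean.
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    if ss_total == 0:
        raise ValueError("All scores are identical; the ratio is undefined.")

    # Variation of classroom means around the grand mean, weighted by class size.
    ss_between = sum(
        len(scores) * (sum(scores) / len(scores) - grand_mean) ** 2
        for scores in scores_by_class.values()
    )
    return ss_between / ss_total


# Hypothetical average percent correct (as proportions) for two classrooms.
example_scores = {"class_a": [0.5, 0.7], "class_b": [0.8, 1.0]}
icc = icc_from_sums_of_squares(example_scores)
```

A value near 0 means that classroom membership explains little of the variance in scores, whereas a value near 1 means that scores differ mainly between classrooms.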

Change Over Time and the Influence of Subject Variables
Finally, to assess the differential performance on the test by our target groups, we conducted a repeated measures ANOVA including SEN, immigration background, and gender across both measurement points on the sum score of each participant.

RESULTS
Item Difficulties
Table 1 lists all items and item difficulties across both measurement points. ANOVA results confirmed that difficulty varied across word-categories, F (2, 57) = 25.215, p < 0.001. Tukey's honestly significant difference test revealed that items in the two easier categories (noun and adjective/verb) were similar in difficulty, p > 0.05, whereas items in the conjunction/preposition group were significantly harder than those in the other two groups, p < 0.05.

Dimensionality
Fit metrics for all three models surpassed our criteria for good fits, as described in Table 2. Fit metrics were only slightly worse in the 2-factor model than in the 1-factor model, and virtually identical between the 2-factor and 3-factor model. None of the model comparisons exceeded the critical value of 0.01 for ΔCFI or ΔGamma-hat. Therefore, we conclude that all models fit equally well, and on the grounds of parsimony, we prefer the simpler 1-factor model. Thus, the instrument can be considered unidimensional.

Measurement Invariance
Measurement invariance test results are shown in Table 3. In each case, ΔCFI and ΔGamma-hat are below the threshold of 0.01, meaning that the scalar model fit similarly to the configural model. Therefore, we conclude that all three models possessed strong measurement invariance across presence of SEN, gender, immigration background, and measurement point. Because invariance was upheld for all models, the simpler 1-factor model is still preferable to the other models. Thus, a unidimensional interpretation is equally valid for all subgroups within our data.

Intraclass Correlation
The intraclass correlation coefficient, as measured by proportion of total variance, indicated the test functioned similarly across all classrooms in our data, ICC = 0.15. This is relatively high, but still within the guidelines used in previous work (see Hedges and Hedberg, 2007).

Change Over Time and the Influence of Subject Variables
Table 4 shows the results of the sum score analysis. The repeated measures ANOVA revealed that students performed better on the test at measurement point 2, F (1, 658) = 93.32, p < 0.001. Additionally, learners with SEN performed worse overall than those without, F (1, 658) = 89.01, p < 0.001. Furthermore, a significant interaction indicated that learners with SEN did not improve from measurement point 1 to measurement point 2, F (1, 658) = 5.45, p < 0.05. No other interactions or main effects were found, all ps > 0.05.

Overview of Findings and Theoretical Implication
This study developed and evaluated a new theory-based formative assessment that measures reading comprehension at sentence level and that follows the CBM approach for practical use in inclusive primary schools. The main goal was to create a unidimensional test structure with different item difficulties to allow for easy interpretation and a high usability for heterogeneous classrooms. Within our theory-based test construction, we linked common reading comprehension models at the sentence level with the principles of the CBM-Maze task.
In addition, guidelines were set to ensure that all finalized items from the same word-category were equivalent in construction and difficulty. In general, the evaluation of the test construction revealed a 1-factor model with items of varying difficulty. Our results indicated significant differences between the three deletion pattern categories (i.e., word-categories) of the single items. In this study, the German third graders had fewer problems identifying correct target words for nouns (i.e., category 1) and for verbs or adjectives (i.e., category 2) than for conjunctions or prepositions (i.e., category 3). These results are in line with previous results from Kennison (2009) and Cain et al. (2005), which indicated that different word types affect syntactic parsing in different ways. Our results showed that these previous results could be generalized to the German language. Additionally, the differences in item difficulty between the word-categories can help teachers to screen precisely for problems in reading development. Förster and Souvignier (2011) argued that precise identification is an important feature of CBM assessments. Furthermore, the CFA demonstrated a unidimensional test structure, which allows for a simple interpretation by educators. Overall, our theory-based test construction demonstrated both adherence to reading comprehension theory and technical adequacy, making it useful to both teachers and researchers. Additionally, these results indicated that it is possible to set item difficulty by language-related rules without creating several test types (i.e., letter, sentence, and text-based tests).
Table note: The significance column denotes the between-subjects results of the ANOVA for the special education needs, immigration background, and gender rows, but for the all groups row, it denotes the within-subjects variable of measurement point. ***Significant at p < 0.001.
The assessment's practicability for inclusive classrooms was verified in three key ways by the results of the measurement invariance tests, the intraclass correlation, and the analysis of the sum scores. First, the intraclass correlation showed that our test performed similarly within different inclusive classroom settings. Additionally, the sum score comparison showed that changes in the students' ability were detectable without any changes in the test administration for a specific student group. This indicates that no specific class or student characteristics are required for test administration. Second, the alternative test forms were invariant across both measurement points. This means that the random drawing order created good alternative test forms, which can ease the test handling for both researchers and teachers. In large classrooms, teachers do not have to track which student completed which test form or remember test dates manually, meaning that Levumi can reduce teachers' workload in multiple ways. Similarly, researchers can create a large number of alternative test forms. Thus, our web-based Levumi platform demonstrates one of the key benefits of computer-based formative assessment (see Russell, 2010). Third, especially in inclusive classrooms, the students are heterogeneous in their academic performance (Gebhardt et al., 2015). While students with SEN performed significantly lower, our Levumi reading comprehension test was invariant for different student groups (i.e., SEN, gender, and immigration background). This means that teachers can use the Levumi test for all these students because the test assesses competence fairly across these groups.
Furthermore, students with and without SEN can use the same test system over multiple measurement points in inclusive classrooms. Again, this reduces the teachers' workload because teachers do not have to use other materials for special student groups within one classroom (Jungjohann et al., 2018).
Our Levumi reading comprehension test includes the test administration benefits of the CBM-Maze, such as group administration and silent reading (see Graney et al., 2010). Additionally, it is suitable for early readers and for readers with low reading abilities (e.g., students with SEN). In particular, our test uses a sentence-based item pool, rather than a complete text-based item pool. Fuchs et al. (2004) and Good et al. (2001) argued that complete text-based tests can cause floor effects for low-performing readers. Because of this, the Levumi test may be more suitable for even younger students and those who might have difficulties with complete texts. Correspondingly, these test characteristics demonstrate that it is possible to expand the established CBM assessment types in new ways. Additionally, we expand the existing techniques of evaluating CBM assessments with intraclass correlations, factor analyses, and measurement invariance tests. These three evaluation techniques are well established in other fields of test evaluation, and their use can help to rigorously evaluate existing and future CBM assessments. This demonstrated technique of theory-based test construction and evaluation provides a template for other researchers who may be developing a diverse range of CBM assessments.

Limitations and Future Work
Nonetheless, this study has some limitations. The findings still need to be replicated with a more varied participant pool and with a larger sample from other regions inside and outside of Germany. The present study also focused on third graders, but reading difficulties can appear earlier, in first and second grade, when students start to develop an understanding of written words and sentences (Richter et al., 2013). Therefore, further studies should also include first and second graders to extend adequate CBM assessment to this age group.
Besides a broader participant pool, future longitudinal research is necessary to validate our hypotheses. One main CBM characteristic is the ability to track the students' learning growth across multiple measurement points over a long period (e.g., one school year; Deno, 2003). In our study, we confirmed that our test is invariant over two measurement points across a period of 3 weeks. For classroom use, it is necessary to analyze the test's ability to measure the students' learning slope over a longer period with more than two measurement points per student.
Additionally, we have not yet established a concrete indicator of criterion validity (see Fuchs, 2004). In future studies, participants could complete, besides the Levumi reading comprehension test, additional CBM assessments with a complete text-based item pool and other established reading comprehension screenings. This would establish whether the Levumi reading comprehension test relies more on code-related skills (e.g., reading fluency) than on language-related skills (e.g., reading comprehension), as suggested by Muijselaar et al. (2017). This can indicate which reading problems our test is effective at identifying. This is particularly important because established reading comprehension tests do not agree with each other in the identification of reading problems (Keenan and Meenan, 2014).
Our last key limitation relates to the item language. Our research is limited by using only the German items. Both the English item translation and the general theory-based test construction need to be evaluated in additional languages. First, studies should evaluate the translated items with native speakers to test the quality of the items. Results of these studies would confirm the usability of our theory-based guidelines for CBM test construction in other languages. The original items can then be translated into additional languages based upon these studies. This procedure will expand CBM offerings into new languages and regions.
Further work should also focus on instructional utility. This study followed Fuchs's (2004) recommendations for CBM test construction and examined the technical adequacy for formative learning growth monitoring (i.e., stages 1 and 2; Fuchs, 2004). Instructional utility means that teachers can include the CBM test system in their actual lessons, that they can understand and interpret the results, and that they can link the students' learning slopes with their reading instruction. In Germany, the CBM approach is still unknown to many teachers. Therefore, the three main aspects of instructional utility (Fuchs, 2004) are also concerns for the Levumi reading comprehension test and should be investigated in further research. First, the acceptance and the application of the Levumi platform need to be evaluated within the school context. For the practical use of the Levumi platform, teachers need access to a computer or a tablet and an internet connection. Some German schools already have good technical equipment, while others do not. Even assuming access to good technical equipment, teachers must be willing to use a web- and computer-based assessment. To that end, a clear user interface and teacher-focused supporting material will encourage user adoption. Second, further studies should examine how teachers handle the Levumi test results. Recent studies revealed that preservice and in-service teachers can have problems in understanding CBM graphs (Van den Bosch et al., 2017;Zeuch et al., 2017). As one example, preservice teachers estimate fictitious future student achievements lower than can be expected from a linear regression model (Klapproth, 2018). In all these studies, the participants were not able to adjust the layout, visualize additional information from the tests (e.g., correct and incorrect answers), or receive statistical help (e.g., a trend line or goal line).
Consequently, future studies need to test these interpretation difficulties in order to adapt the specific Levumi output (i.e., CBM graphs and further information). Third, the web-based test system in particular offers high potential for automated support in CBM graph interpretation and instructional decision making. For instance, the Levumi platform could highlight at-risk students or automatically add information to the graph, such as statistical trend lines, instruction phases, or students' moods. Furthermore, the Levumi platform could learn typical problem patterns and suggest relevant instruction materials. Lastly, work needs to be done to identify the connection between specific aspects of reading competency and performance on individual items and overall results. All these possibilities should be implemented carefully so that teachers are not confused by CBM data.

CONCLUSION
We created a new CBM assessment using theory-based test construction and evaluation. Evaluations indicated that the test is highly useful for further research and practice. This has three key implications. First, researchers can adapt our approach as guidelines for further CBM assessments (e.g., in further languages and learning domains) to enhance the CBM research and evaluation field in new ways. Second, the web-based Levumi platform and the reading comprehension assessment are suitable for inclusive classrooms, and their use can reduce the teachers' workload in multiple ways. Third, our procedure demonstrates the development of a test with multiple item difficulties that can be interpreted along a single dimension.

ETHICS STATEMENT
Permission for this study was granted by the dean of the Faculty of Rehabilitation Science, Technical University of Dortmund. Following the requirements of the ministry of education of the federal state North Rhine-Westphalia (Schulgesetz für das Land Nordrhein-Westfalen, 2018), school administrators decided in coordination with their teachers about participation in this scientific study. An additional ethics approval was not required for this study as per the institution's guidelines and national regulations. Parents received written information about the study and any potential benefits. They gave their written consent for each child. Participation was supervised by school staff. Participation was voluntary and participants were free to withdraw at any time.

AUTHOR CONTRIBUTIONS
JJ developed the Levumi reading comprehension assessment and coordinated the study. As the primary author, JJ did most of the writing, research, and some of the analyses. JD did most of the data analyses. AM programmed the Levumi platform, realized the Levumi reading comprehension assessment based on the test specification of JJ, and edited the manuscript. MG gave the initial study design, supervised the entire research process, and edited the manuscript.

ACKNOWLEDGMENTS
We acknowledge financial support by Deutsche Forschungsgemeinschaft and Technische Universität Dortmund/ TU Dortmund Technical University within the funding programme Open Access Publishing.