Examination of Psychometric Evidence for Criterion-Referenced Scores from the SSIS SEL Brief Scales

Efficient and intuitive interpretive frameworks for social-emotional learning (SEL) measures are necessary for identifying student needs and informing programming decisions across multitiered systems of support in schools. Though familiar to educators and often used with standardized tests of academic achievement, criterion-referenced frameworks are less common in SEL assessment. As such, the current study examined the psychometric evidence for scores from one such framework, the Competency-Referenced Performance Framework, which was developed to inform universal screening decisions based on the SSIS SEL Brief Scales (Elliott et al., 2020). Specifically, we evaluated stability, test-criterion relationships with academic outcomes, and treatment sensitivity of the CRPF using data from an efficacy trial of a universal SEL program. Results provided preliminary supportive evidence for the CRPF.

framework, the Competency-Referenced Performance Framework (CRPF; Elliott et al., 2020). The CRPF categorizes the total scores of the SSIS SEL Brief Scales into four overall SEL competency levels (Emerging, Developing, Competent, and Advanced) based on frequency (never, seldom, often, and almost always) of performance relative to developmentally appropriate expectations from a learning progression perspective. Specifically, children in the Emerging level receive consistently low frequency ratings (never or seldom) on strength-focused items, and children in the Advanced level have consistently high ratings of almost always. Cut scores corresponding with these proficiency levels were established based on data from a large nationally representative sample of K-12 students as well as reviews by school professionals and SEL experts (Elliott et al., 2020). However, no studies to date have examined psychometric evidence for the CRPF levels using an independent sample.
The Standards for Educational and Psychological Testing (Standards; AERA, APA, & NCME, 2014) specify that evidence of reliability and validity of test scores be provided to support their proposed use. With respect to SEL assessment, the psychometric evidence for a measure should be evaluated with respect to whether the scores it provides will guide intended uses (e.g., screening) and assist in reaching conclusions about student SEL competency levels (Buros Center for Testing, 2020). Screening requires examining broad outcomes to assess levels of risk and identify student needs (Kettler et al., 2014). McKown (2019) suggested that criteria for evaluating SEL assessments include temporal stability for score reliability, correlations with other relevant variables, and evidence that students exposed to high-quality instruction improve more than a control group. Similarly, Gross et al. (2019) described scores from sound screening instruments as demonstrating evidence including reliability, responsiveness to change, and generalized utility. Maintaining technical rigor while maximizing efficiency and utility is an important consideration in developing SEL screening assessments for use in research and practice (Kim et al., 2022).
For criterion-referenced score interpretation, reliability evidence can be established using decision consistency indexes, such as percentage of correct decisions and Cohen's kappa, across replications of the same testing procedure (Standards, p. 40). One way to replicate the same testing procedure is to repeat administration of the same test to the same group of examinees (i.e., test-retest). Although test-retest stability for SSIS SELb-T total scores has been found to be sufficient in previous studies (.78 in Elliott et al., 2020 and .85 in Anthony et al., 2021), stability evidence for the CRPF levels has not been evaluated to date.
Validity refers to "the degree to which all the accumulated evidence supports the intended interpretation of test scores for the proposed use" (Standards, p. 14). One relevant and common source of evidence to support the use of criterion-referenced interpretations is test-criterion relations, which can be established using correlational and experimental methods to examine concurrent and predictive evidence (Hambleton et al., 1978). A body of literature exists to support the association between SEL competence and academic outcomes for both elementary and secondary grades (Panayiotou et al., 2019), and the relationships between prosocial behaviors and academic achievement found in previous research have been characterized as moderately positive (see DiPerna et al., 2016).
Given that academic achievement is one of, if not the, primary intended outcomes of the formal schooling process, whether the CRPF levels are related concurrently and predictively to academic outcomes is an important consideration for MTSS decision-making in schools. As one example, correlations between student reported SEL competence and standardized academic test scores were found to be mediated by mental health difficulties, suggesting important relationships between SEL, academic achievement, and student mental health needs (Panayiotou et al., 2019). Currently, no published studies have examined the test-criterion relationships between the CRPF and academic outcomes, and this important form of validity evidence warrants examination.
Lastly, as the SSIS SEL Brief Scales are intended to be used for universal screening and progress monitoring of SEL skills within MTSS, evidence of treatment sensitivity (which is essentially the experimental method that Hambleton et al., 1978, described) would be vital to support the use of scores for assigning students to tiers of interventions, monitoring intervention responsiveness, and evaluating intervention efficacy. For example, the SSIS-Classwide Intervention Program (SSIS-CIP; Elliott & Gresham, 2007) is a universal SEL program used in schools to promote social skill development and positive behavior. Via 10 core units and 30 lessons delivered by the classroom teacher, the program focuses on instruction of foundational social skills such as listening to others, paying attention to your work, and asking for help. Although previous evaluations have demonstrated that teacher ratings using the SSIS Rating Scale-Teacher form (SSIS RS-T; Gresham & Elliott, 2008) are responsive to the SSIS-CIP intervention (DiPerna et al., 2015), the sensitivity of the CRPF to skill changes resulting from such universal interventions has not yet been studied.
Thus, the current study is guided by three research questions: (1) How stable are students' CRPF levels over time (i.e., stability)?
Although initial reliability and validity evidence support the use of SSIS SELb-T norm-referenced total scores for lower-stakes decisions (Anthony et al., 2021), stability, test-criterion relationships with academic outcomes, and treatment sensitivity of the CRPF levels have not been thoroughly examined to date. Based on previous findings with total scores from the SSIS SELb-T, the CRPF levels are expected to be moderately stable in the absence of intervention (Anthony et al., 2021), positively related to academic outcomes (e.g., Panayiotou et al., 2019), and relatively sensitive to skill change resulting from intervention (e.g., DiPerna et al., 2016).

Method
To generate psychometric evidence to evaluate the CRPF, we conducted a secondary analysis of data from a previous efficacy trial of a school-based SEL intervention. Specifically, evidence of reliability (test-retest stability) and validity (test-criterion relationships with external criteria as well as treatment sensitivity) appropriate for criterion-referenced score levels was examined.

Sample
Data for this study were drawn from a multi-year efficacy trial of the SSIS-Classwide Intervention Program (SSIS-CIP; Elliott & Gresham, 2007) in the Mid-Atlantic region of the United States (see DiPerna et al., 2015, 2016, for more information). The original study used a multisite cluster randomized controlled trial design in which second-grade classrooms (clusters) were randomly assigned to treatment conditions (intervention or business-as-usual [BAU; the standard practices a school would employ if a research study were not being conducted]) within seven schools (sites). It included four cohorts of students who were enrolled in participating Grade 2 classrooms across four successive years. Follow-up data were collected for two additional years in the same schools, during which schools were free to use the SSIS-CIP program on a voluntary basis.
A total of 641 students (354 treatment, 287 BAU) from 54 classes (31 treatment, 23 BAU) constituted the total analytic sample for the current study. A small urban district (5 schools; nearly 100% of students eligible for free or reduced-price lunch) and a small rural district (2 schools; 42% of students eligible for free or reduced-price lunch) participated. Table 1 presents the demographic details of the total analytic sample. Within this student sample, about 17% were Black, 5% Hispanic, 19% received at least one form of supplemental services (Title I, instructional support, tutoring, Response to Intervention, or other) at school, and 8% received one or more special education services (mostly speech and learning disabilities). A little over half of the student sample (53%) were girls. To eliminate potential confounds resulting from exposure to SEL intervention, only data from participants in the BAU condition were used to address the first two research questions about stability (based on pretest and posttest with an approximate 3-month interval) and test-criterion relationships (including concurrent relationship using baseline data and predictive relationship using 1-year follow-up data on academic outcomes) for the CRPF levels. The treatment sensitivity question was addressed using data from the whole sample.

Measures
Three measures were used in this study, one for SEL skills and two for academic outcomes (reading and math). SEL skills were measured by items on the SSIS SELb-T (Elliott et al., 2020). The SSIS SELb-T was developed by applying Item Response Theory to the standardization sample of the more extensive and full-length SSIS RS-T and SSIS SEL Rating Forms (Gresham & Elliott, 2017). A set of maximally efficient items was selected that maintained content coverage as well as similar psychometric properties to the original forms (see Anthony et al., 2021, for more information about the development of these brief forms). The published SSIS SELb-T (Elliott et al., 2020) includes 20 items, four for each of the five CASEL domains of self-awareness, self-management, social awareness, relationship skills, and responsible decision-making. Anthony et al. (2021) reported evidence of internal consistency (Cronbach's α = .93 for composites and .79-.87 for subscales), test-retest reliability (.84 for composites and .75-.83 for subscales), and interrater reliability (.65 for composites and .47-.65 for subscales). Validity evidence was demonstrated by the scales' relationships with the Social Skills Rating System-Teacher Rating Scale (SSRS-T; Gresham & Elliott, 1990), the Behavior Assessment System for Children-Second Edition (BASC-2; Reynolds & Kamphaus, 2004), and Vineland Adaptive Behavior Scales, Second Edition (Sparrow et al., 2005) as indicated in Anthony et al. (2021). For this study, the total scores were computed using the SSIS SELb-T items that were originally present on the SSIS RS-T, administered during an efficacy trial of a universal SEL program. The total SSIS SELb-T scores were then transformed to four CRPF levels using cut scores reported in the SSIS SELb-T manual (Elliott et al., 2020).
Star Reading (Renaissance Learning, 2010) is a brief, computer adaptive test that assesses students' reading comprehension and overall reading achievement. It includes 25 items requiring students to construct sentence meaning based on vocabulary knowledge and contextual information. Split-half reliability estimates reported by the publisher were .89 in both Grades 2 and 3. Publisher-reported validity evidence included high correlation coefficients between Star Reading scores and other standardized reading assessment scores (e.g., American College Testing, Scholastic Aptitude Test, and Iowa Test of Basic Skills). Star Reading produces scaled scores, which are transformations of the Rasch ability estimate resulting from the test, that range from 0 to 1400. The scores represent a student's performance on a vertical developmental scale that spans the K-12 grade levels (Renaissance Learning, 2010). For example, the first to third quartiles (Q1, Q2, and Q3) benchmarked for Grade 2 students in the fall (around baseline assessment for our sample) were 126, 224, and 322, respectively; and the corresponding quartiles for Grade 3 in fall (around the 1-year follow-up assessment for our sample) were 259, 357, and 461 (Renaissance Learning, 2015).
Star Math (Renaissance Learning, 2009) uses multiple-choice questions to assess students' number concepts, computation, algebraic thinking and other fundamental math skills. Split-half reliability estimates provided by the publisher were .78 in both Grades 2 and 3. High positive correlations with scores from other standardized math assessments provide support for the validity of Star Math scores. Star Math scaled scores range from 0 to 1400. The first to third quartiles benchmarked for Grade 2 students in fall (around baseline assessment for our sample) were 357, 414, and 467, respectively; and the corresponding quartiles for Grade 3 in fall (around the 1-year follow-up assessment for our sample) were 443, 500, and 552 (Renaissance Learning, 2015).

Procedures
Active parent/guardian consent and student assent were obtained for all students participating in the data collection associated with the larger efficacy trial; approximately 52% of students in participating classrooms had parental consent to participate. Classrooms in both intervention and BAU conditions followed the same data collection procedures. Teachers completed an online questionnaire for each participating student that included the SSIS SELb-T items as well as other items from the full-length SSIS teacher form. Teachers also provided student demographic information such as gender, race/ethnicity, and receipt of special education or supplemental services. Trained research assistants administered the Star Reading and Math assessments. Baseline data were collected within a 4-week period (e.g., October-November) before implementation of the SSIS-CIP (Elliott & Gresham, 2007) in classrooms randomly assigned to the treatment condition. Posttest data were collected after the SSIS-CIP implementation (e.g., March-April). Follow-up Star data were collected in the year following the SSIS-CIP implementation in winter/early spring (e.g., February-March).
Teachers assigned to the intervention condition received training in using the SSIS-CIP program (see DiPerna et al., 2015, 2016, for more information). The SSIS-CIP, which focuses on social skills identified by teachers as being important for successful classroom learning, consists of 10 core units that are each taught via three lessons. Program materials include a teacher manual providing scripted lesson plans, brief video examples, and role-play activities for students. Lessons are delivered through direct instruction of skill steps, modeling and practice of steps, and monitoring and generalization activities. Fidelity of implementation of each of the lesson components was monitored via periodic real-time observations in the classroom by trained research assistants as well as weekly self-report questionnaires completed by teachers. Fidelity reported by both trained independent observers and teachers was high (97-98%). APA ethical guidelines were followed for conducting human participants research, and all procedures were approved by an institutional review board.

Data Analysis
To address the first research question about stability of the SSIS SELb CRPF levels, only data from participants in the BAU condition were used because their post-test scores were not affected by the intervention. Specifically, we cross-tabulated frequencies between baseline and post-test CRPF levels for the BAU group and calculated percentages of each baseline level distributed across different CRPF levels at posttest. The association between baseline and posttest CRPF levels was estimated by multiple indexes (including the decision consistency indexes of percentage of agreement and Cohen's kappa, phi-coefficient, and Spearman's rho) and tested for statistical significance using chi-square. Stronger association suggests higher stability and less movement across CRPF levels over the approximate 3-month interval.
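The decision consistency indexes named above can be illustrated with a short computation. The following is a minimal sketch, not the study's analysis code: the function names and the eight toy ratings are hypothetical, and only the standard definitions of percentage of exact agreement and Cohen's kappa are assumed.

```python
from collections import Counter

# CRPF levels in ascending order of competence.
LEVELS = ["Emerging", "Developing", "Competent", "Advanced"]


def crosstab(baseline, posttest, levels=LEVELS):
    """Cross-tabulate baseline (rows) by posttest (columns) level counts."""
    counts = Counter(zip(baseline, posttest))
    return [[counts[(r, c)] for c in levels] for r in levels]


def percent_agreement(table):
    """Proportion of students classified in the same level on both occasions."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n


def cohens_kappa(table):
    """Agreement corrected for the agreement expected from the marginals alone."""
    n = sum(sum(row) for row in table)
    k = len(table)
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    p_expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data for eight students (not values from the study).
baseline = ["Emerging", "Developing", "Competent", "Competent",
            "Advanced", "Advanced", "Developing", "Competent"]
posttest = ["Emerging", "Competent", "Competent", "Competent",
            "Advanced", "Competent", "Developing", "Advanced"]

table = crosstab(baseline, posttest)
agreement = percent_agreement(table)   # 5 of 8 students stay in the same level
kappa = cohens_kappa(table)
print(f"agreement = {agreement:.3f}, kappa = {kappa:.3f}")
```

Because kappa subtracts the agreement implied by the marginal distributions, it can fall well below the raw percentage of agreement when students are unevenly spread across levels.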
To address the second research question and examine test-criterion relationships with academic outcomes for the CRPF levels, we again used participants in the BAU condition who completed the Star measures at baseline for concurrent relationship evidence and 1-year follow up for predictive relationship evidence. The CRPF levels defined at baseline were analyzed to provide concurrent and predictive relationship evidence with academic outcomes at baseline and 1-year follow up, respectively. We first conducted a simple one-way ANOVA test to examine CRPF level differences on each of the standardized Star Math and Star Reading measures at each time point. Pairwise comparisons between CRPF level means were conducted using Tukey's HSD to control familywise Type I error. Next, we conducted a random-intercept ANOVA model using the mixed procedure in SAS (SAS Institute Inc., 2020a) to account for clustering by classrooms and to test CRPF level mean differences on math and reading at each time point while controlling for participants' demographic variables, including receipt of special education (1 = yes, 0 = no), receipt of supplemental services (1 = yes, 0 = no), race/ethnicity (1 = White student, 0 = racial-ethnic minority student), and gender (1 = male student, 0 = female student). The Benjamini and Hochberg (1995) correction was applied to control the false discovery rate due to multiple comparisons between groups (i.e., 6 pairwise comparisons). Significant differences among CRPF levels on baseline math or reading scores would provide concurrent relationship validity evidence for the CRPF. Significant CRPF level differences on math or reading scores at 1-year follow up would provide predictive relationship validity evidence for the CRPF.
In addition, we ran a random-intercept ANCOVA model using the mixed procedure in SAS (SAS Institute Inc., 2020a) on each of the follow-up math and reading measures while controlling for participants' corresponding baseline math or reading scores and demographic variables to investigate whether the CRPF levels could predict relative change over a year in academic outcomes.
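To make the false discovery rate correction concrete, the sketch below applies the Benjamini and Hochberg (1995) step-up adjustment to six pairwise comparisons. It is an illustration only: the function name and the six p-values are hypothetical and are not results from this study, which used SAS rather than Python.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order.

    Each p-value is multiplied by m / rank (rank 1 = smallest), and the
    adjusted values are forced to be non-decreasing in rank by taking a
    running minimum from the largest p-value downward.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for the 6 pairwise comparisons among 4 CRPF levels.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.60]
adj_p = benjamini_hochberg(raw_p)
for raw, adj in zip(raw_p, adj_p):
    print(f"raw p = {raw:.3f}  adjusted p = {adj:.4f}")
```

A comparison is then declared significant when its adjusted p-value falls below the chosen false discovery rate (e.g., .05).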
For Research Question 3 regarding the sensitivity of CRPF levels to intervention, we used data from the efficacy trial of the SSIS-CIP for second grade participants with both baseline and posttest CRPF levels. First, baseline equivalence between treatment conditions was gauged for baseline CRPF levels and each of the demographic variables using the quick and simple chi-square test of independence. We did not use more advanced methods to assess baseline equivalence as these baseline variables were all included in the analysis model and thereby controlled. Then, we ran a random-component proportional odds model with multinomial distribution and cumulative probit link to test whether students in the SSIS-CIP and BAU comparison conditions had differential probabilities of moving up or down the CRPF levels from baseline to post-test. Participants' special education status, supplemental service status, race-ethnicity (White student or racial-ethnic minority student), and gender were included as covariates, and clustering by classrooms was accounted for in the model using the glimmix procedure of SAS (SAS Institute Inc., 2020b).
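The way a cumulative probit model turns a linear predictor into probabilities for the four ordered CRPF levels can be sketched as follows. This is a simplified illustration under stated assumptions: the thresholds are hypothetical, the parameterization P(Y ≤ k) = Φ(τ_k − η) is one common convention and may differ in sign from the SAS output, and the random classroom effect, baseline level, and covariates are omitted.

```python
from statistics import NormalDist

LEVELS = ["Emerging", "Developing", "Competent", "Advanced"]


def category_probabilities(thresholds, eta):
    """Level probabilities under a cumulative probit link.

    Assumes P(Y <= k) = Phi(threshold_k - eta), so a larger linear
    predictor eta shifts probability mass toward higher levels.
    """
    cdf = NormalDist().cdf
    cumulative = [cdf(t - eta) for t in thresholds] + [1.0]
    probs, previous = [], 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs


# Hypothetical thresholds separating the four levels (not model estimates).
thresholds = [-1.0, 0.0, 1.0]

# eta = 0 for the comparison group; a hypothetical positive shift for treatment.
p_bau = category_probabilities(thresholds, eta=0.0)
p_treatment = category_probabilities(thresholds, eta=0.5)

for level, pb, pt in zip(LEVELS, p_bau, p_treatment):
    print(f"{level:<10}  BAU = {pb:.3f}  treatment = {pt:.3f}")
```

In the fitted model, conditioning on each baseline level in turn yields one such row of probabilities per baseline level and treatment group, which is how transition probabilities of this kind are typically assembled.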

Results
The listwise-deleted sample sizes (by analysis) and descriptive statistics of participants' baseline SSIS SELb-T prorated scores are presented in Table 2 by CRPF levels. For the stability research question, there were no missing data on the CRPF levels at baseline from the BAU comparison sample but four (1.4%) missed posttest. For CRPF-academic outcome relationships, missing was mostly due to the 1-year follow-up attrition from the BAU comparison group (60 participants or 20.9% missed follow-up). Six to seven other participants (2-3%) missed baseline academic scores and another 10 (3.5%) missed demographic variables. For the treatment sensitivity question, 21 students (3.3%) were not assessed at baseline, 11 were lost to attrition at posttest (1.7%), and 55 (8.6%) were missing demographic variables.
As expected, the SSIS SELb-T prorated total scores were different across CRPF levels, by about 10 points between adjacent levels from Emerging to Competent and 15 points between Competent and Advanced. The SSIS SELb-T score variability was quite similar across the CRPF levels. The score pattern was also similar across analysis samples.

Stability Evidence
The cross-tabulated frequencies between baseline and posttest CRPF levels for the BAU group are shown in Table 3, and percentages of each baseline level distributed to different levels at posttest are included in parentheses. At baseline about 25.4, 50.5, 16.6, and 7.4% of students were in the Advanced, Competent, Developing, and Emerging levels, respectively. The majority of the initial Advanced (75%), Competent (69%), and Emerging (57%) students remained in the same level about 3 months later. A substantial percentage of initial Competent (20%), Developing (43%), and Emerging (38%) students moved up to a higher level of competence at post-test. A relatively smaller percentage of students who were initially in the Competent (10%) and Developing (17%) levels moved down to lower levels; 25% of students in the Advanced level moved to a lower level as well. Overall, the CRPF levels were relatively stable over an approximate 3-month interval (χ² = 217.17, df = 9, p < .0001) with moderate to strong association between the baseline and posttest CRPF levels (percentage of exact agreement = 64.7, kappa = .46, phi coefficient = .88, Spearman's rho = .683).

The simple ANOVA test for CRPF level differences at baseline was statistically significant for both math, F(3, 206) = 5.21, p = .002, and reading, F(3, 207) = 4.14, p = .007. The test for CRPF level differences at 1-year follow up was also statistically significant for both math, F(3, 206) = 5.88, p = .001, and reading, F(3, 207) = 3.21, p = .024. As expected, higher math and reading scores were related to higher SEL performance levels. After adjusting for clustering and demographic differences, CRPF levels remained statistically significant for math both at baseline, F(3, 199) = 3.17, p = .026, and at follow up, F(3, 172) = 4.55, p = .004. Specifically, students at the Emerging SEL level scored significantly lower on math (average below the Q2 benchmark for Grade 2 students in fall) than their Advanced peers (average above Q3) at baseline and lower than all other three SEL levels at 1-year follow up (average remained below the Q2 benchmark for Grade 3 students vs. average above Q3). However, CRPF levels became statistically nonsignificant for reading at both baseline, F(3, 195) = 1.61, p = .189, and follow up, F(3, 191) = 1.21, p = .307, after accounting for demographic differences and clustering (although the Emerging group consistently scored below the Q2 benchmark while the other groups scored around Q2 but below Q3). Moreover, CRPF levels were not statistically significant in predicting relative change in math, F(3, 168) = 1.96, p = .122, or reading, F(3, 181) = .43, p = .735, at 1-year follow up when the corresponding math or reading baseline differences were also adjusted.

Treatment Sensitivity
Based on chi-square tests of independence, the two treatment groups did not appear to be greatly unbalanced with respect to baseline CRPF levels (χ²(3) = 7.34, p = .062), special education status (χ²(1) = 2.60, p = .107), receipt of supplemental services (χ²(1) = 2.09, p = .148), and gender (χ²(1) = .05, p = .815). However, the treatment group had a larger proportion of racial-ethnic minority students than the BAU group (35.67% vs. 23.68%; χ²(1) = 9.26, p = .002). Results of the random-component proportional odds model showed that treatment condition was a statistically significant predictor of posttest CRPF levels after accounting for baseline CRPF levels, demographic variables, and clustering (b = .53, SE = .19, p = .007). Table 5 shows the model-estimated conditional transition probabilities by treatment group. Students in the SSIS-CIP condition had higher probabilities of transitioning to higher CRPF levels and lower probabilities of regressing to lower CRPF levels from pretest to posttest than the BAU comparison students. In an exploratory fashion, we also examined relationships between student demographic variables and the transition probabilities given findings in the original efficacy study that student-level demographic variables, such as receipt of supplemental services and initial skill levels, were related to intervention effects (DiPerna et al., 2015, 2016). Among the included demographic variables, students' gender and receipt of supplemental services were statistically significantly associated with transition probabilities. Holding other variables constant, female students had slightly higher estimated probabilities of transitioning to higher CRPF levels than male students (b = .23, SE = .11, p = .04), and students who did not receive supplemental services were more likely to transition to higher CRPF levels than those who did (b = .31, SE = .14, p = .03).

Discussion
School-based interventions to promote student social-emotional competence have become widespread during the last decade (Bryant et al., 2021). However, the development of sound assessment tools and practices to inform intervention planning has lagged (Gross et al., 2019), and data-based decision making about screening and intervention remains a challenging task for MTSS teams (VanDerHeyden et al., 2018). Efficient SEL measures with practical interpretive frameworks are essential to facilitate SEL screening in schools and inform programming decisions. The Competency-Referenced Performance Framework (CRPF) was developed for the SSIS SEL Brief Scales to facilitate screening decisions (Elliott et al., 2020). However, the psychometric evidence for performance levels yielded by this framework has not been examined with samples beyond the original standardization data. This study examined evidence of stability, test-criterion relationships with academic outcomes, and treatment sensitivity of the CRPF for the SSIS SELb-T using data drawn from a large efficacy trial.

Key Findings
Results of the study showed that the SSIS SELb-T CRPF levels appeared to be relatively stable despite a longer time interval between assessment administrations than is typical for stability indices. Kappa appeared somewhat low because it was affected by the unequal distribution of students across CRPF levels. Test-retest stability for the SSIS SELb-T composite scores reported in the manual (Elliott et al., 2020) was .78, presumably for a typical 2-week interval. Stability based on the SSIS SELb-T scores for our BAU sample was comparable at .77. Therefore, the slightly lower estimates for CRPF levels (e.g., Pearson r = .7, Spearman's rho = .68) appear to be due primarily to categorization of scores rather than to the longer time interval. However, this might not hold for a longer time interval (e.g., from beginning to end of the school year) as changes might occur resulting from teachers naturally correcting social interactions and reinforcing appropriate peer-to-peer engagement in classrooms over time.
The SSIS SELb-T CRPF levels were also related to standardized math achievement scores both concurrently and predictively, with or without adjusting for clustering and demographic differences. However, the CRPF levels did not significantly relate to standardized reading scores after controlling for clustering and demographic differences, nor did they predict relative change on math or reading outcomes. Although the omnibus tests for CRPF level differences on academic outcomes were not statistically significant when baseline differences were adjusted for, the gap in math outcome between the Emerging level and the other CRPF levels was notably larger at 1-year follow up compared to baseline. The lack of statistical significance could be due to low statistical power resulting from the smaller sample size for the BAU sample when broken down by CRPF levels (particularly for the Emerging category). However, the relative frequency distributions of students in our samples across the CRPF levels are similar to the distribution of the standardization sample published in the manual and consistent with developmental expectations (Elliott et al., 2020). Specifically, the percentages of students in different CRPF levels are not expected to be equal, and the percentage in the Emerging level is expected to be relatively small. As such, the pattern of between-CRPF-level differences would not likely change dramatically if the total sample size were to increase. Given that the significant differences between groups parallel different benchmark quartiles for the Star scaled scores, we are confident that the findings are not substantially compromised by the relatively small sample size for one group. Nevertheless, the finding should be replicated in future research with larger samples to explore whether students with emergent SEL skills might require special attention beyond the SEL domain.
Previous research has highlighted differences between early math and reading skill development, suggesting that math skill acquisition may rely more heavily on active and quality learning environments in schools (Ginsburg et al., 2008; Rimm-Kaufman et al., 2007), be more highly compromised by behavioral problems in the classroom (Miller et al., 2017), and relate positively to students' behavioral skills (Ponitz et al., 2009). Similarly, improvements in math skills have been linked to interventions that promote self-regulation (Schmitt et al., 2015) and self-affirmation (Borman et al., 2016). Future work is necessary to further investigate whether SEL constructs are more salient to math than to reading and to examine factors that may mediate such relationships.
Moreover, the SSIS SELb-T CRPF levels demonstrated sensitivity to the SSIS-CIP intervention, an important condition if the assessment is to be used to evaluate SEL programming outcomes. However, given the alignment between skills assessed by the SSIS SELb-T and skills targeted by the SSIS-CIP lessons, the treatment sensitivity results may not generalize to other SEL programs. As such, future studies are needed to determine if the CRPF is sensitive to SEL skill change resulting from the implementation of other universal SEL programs (e.g., Second Step [Committee for Children, 2016]; Promoting Alternative Thinking Strategies [PATHS; Kusche & Greenberg, 1994]). In addition, female students and those who were not receiving supplemental services had a higher probability of transitioning to a higher CRPF level. In reviewing the demographic characteristics of students reported in a large meta-analysis of universal SEL impact studies, Rowe and Trickett (2018) found that gender was the most analyzed moderator of treatment effect, and 41% of studies that tested treatment-by-gender interactions found significant relationships. However, they noted that there was no consistent pattern in the direction of this moderation and that very few studies reported student disability status or receipt of services. Especially given the mixed results in previous research, there is a need to replicate the current findings with more samples and programs in future studies.

Limitations and Future Directions
In addition to the aforementioned limitations and future directions relative to specific research questions, this study examined only the SSIS SELb Teacher form with Grade 2 students. As such, findings may not generalize to other SSIS SELb forms completed by other informants (students or parents) or to other grade levels. Students' behaviors might vary in different contexts (e.g., home or community vs. school), and there might be developmental differences across grades or ages (e.g., differences in variability, not only in means, that may affect the strength of intervariable relationships). It is important to further investigate whether evidence of reliability and validity for CRPF levels is similar for other contexts, grades, or age levels.
In addition, teachers who taught the SSIS-CIP lessons also rated students' behavior outcomes in this study, an unavoidable limitation for gauging treatment sensitivity of teacher ratings relative to a universal program that is meant to be facilitated by classroom teachers. As such, it is possible that teachers were more primed to notice SEL skills after teaching the lessons. However, given that teachers in both treatment and BAU conditions rated students at baseline using the same rating form, it is possible that both groups could have been primed to notice SEL skills after the first exposure to the rating form. One could also argue that being more primed to notice SEL skills is part of the program impact; that is, the program could have changed teachers' perceptions about SEL (e.g., Domitrovich et al., 2016). Future investigation could include a qualitative component to better understand the program impact on teachers and their perceptions of student behaviors.
Finally, it is important to acknowledge limitations associated with the items used to obtain the CRPF levels in this study. First, because there were three new items introduced for the Self-Awareness scale of the SSIS SELb measures subsequent to data collection for the efficacy trial, scores for only 17 of the 20 SELb items were available in the current database and prorated to obtain the total scores. This approach implicitly assumes that the missing items would function similarly to the available items, which may or may not be tenable. Second, Self-Awareness was underrepresented because the three unavailable items were all from this subscale. Although the CRPF levels were based on cut scores for the composite rather than subscale scores and Spearman-Brown's stepped-up reliability was high for the composite, classification of participants might nonetheless be different when all items are completed. As such, it is crucial to replicate the current study with the intact SSIS SELb-T form and larger independent samples.
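The proration and Spearman-Brown step-up described above can be illustrated with a minimal sketch. It is not the scoring procedure used in the study: the function names, the 0-3 item scoring, the example ratings, and the reliability value of 0.90 are all assumptions chosen for illustration only.

```python
def prorate_total(item_scores, n_full_items=20):
    """Scale the sum of the available items up to the full-form length.

    Implicitly assumes the missing items would behave like the available ones.
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    return sum(item_scores) * n_full_items / len(item_scores)


def spearman_brown(reliability, length_factor):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical ratings for one student on the 17 available items (0-3 scale assumed).
ratings = [2] * 10 + [3] * 7

# Observed sum is 41; prorating to 20 items gives 41 * 20 / 17.
prorated = prorate_total(ratings)

# Step a hypothetical 17-item reliability of 0.90 up to the 20-item length.
stepped_up = spearman_brown(0.90, 20 / 17)
```

Because proration multiplies every observed total by the same constant, it cannot recover information specific to the omitted Self-Awareness items, which is why classification could still differ when the full form is administered.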

Implications and Conclusions
As the demand for efficient and informative SEL assessments continues to grow in schools, the availability of criterion-referenced interpretive frameworks such as the SSIS SELb CRPF may help improve the practical utility of scores from SEL assessments. Understanding student functioning relative to a performance level is an approach with which teachers, families, and even students are already familiar. The use of criterion-referenced score interpretation frameworks is common in standardized tests of academic achievement such as those used in statewide student assessment systems. As such, stakeholders are likely to find such scores easier to understand, interpret, and act upon compared to the norm-referenced approaches most commonly used in SEL assessments today. Furthermore, performance level interpretation may assist in making MTSS intervention decision rules that are intuitive, consistently applied, and easy to document and track. When considering adoption of an assessment for universal SEL screening in schools, members of MTSS teams have been encouraged to consider the reliability and validity of the scores yielded by the measure. We further encourage them to consider the type(s) of interpretation framework(s) offered by an SEL measure and how they may be used by school personnel and understood by other stakeholders. In addition, developers of SEL assessments should continue to investigate the potential use of criterion-referenced frameworks to advance efficient, informative, and useful SEL assessment decisions for students.
Results of this study provide initial independent evidence relative to the use of the SSIS SELb-T CRPF for universal SEL screening and monitoring student response to implementation of the SSIS-CIP universal program in schools. Although future investigations with additional samples and intervention programs are needed, the current results complement and expand upon prior findings of reliability and validity evidence for SSIS SELb scores using data drawn from the standardization sample (Anthony et al., 2021).
Statement of Potential Conflicts of Interest: Pui-Wa Lei and James C. DiPerna are authors of the SSIS SEL Brief Scales and receive a royalty from the publisher.
The research reported here was supported by the Institute of Education Sciences, U.S. Department of Education, through Grant R305A090438 to The Pennsylvania State University. The opinions expressed are those of the authors and do not represent the views of the Institute or the U.S. Department of Education.
