Introduction

Individuals with autism spectrum disorder (ASD) are involved in numerous interventions throughout their lifespan, the most common of which are aimed at improving social communicative behaviors (Fuller & Kaiser, 2020; Sandbank et al., 2020). Quantifying and measuring the effectiveness of interventions is essential to understanding and monitoring the development of skills in the individuals involved. However, the field of autism intervention research faces significant limitations in measuring effectiveness, including biases inherent in parent- or clinician-report and the reliability of available measures of change over brief periods of time (Grzadzinski et al., 2020). Furthermore, the lack of a uniform measurement approach across studies complicates the comparison of the effects of various interventions, making it unclear which may be most beneficial for whom and when (Cunningham, 2012). There is a critical need for outcome measures that adequately address the limitations discussed below and that reliably detect changes in the social communicative behaviors of individuals with autism, especially in a short period of time.

Limitations of Previously Used Outcome Measures in Intervention Research

Relying on Parent or Clinician Report has Biases

Outcome measures used in intervention research commonly rely on caregiver or clinician reports, which can be problematic due to the likelihood of systematic measurement error, expectancy bias, or placebo effects (Anagnostou et al., 2015; Sandbank et al., 2020). The Clinical Global Impression (CGI) rating scales (Busner & Targum, 2007), among the most commonly used outcome measures in intervention research (Bolte & Diehl, 2013; Toolan et al., 2022), are subjective measures of relative improvement completed by clinicians. In the case of some behavioral interventions, the CGI is completed by the clinician responsible for delivering treatment, in part because this mirrors typical clinical practice. In medication trials, caregivers or physicians may begin the study blind to treatment condition and later become “unblinded” as a result of possible side effects the child experiences throughout the course of the study (Wolery & Garfinkle, 2002).

Biases inherent in caregiver- and clinician-report measures can accentuate the appearance of treatment effects, leading to the belief that strong effects are present beyond the more subtle changes in social communicative behaviors that are truly occurring (Grzadzinski et al., 2020). For example, a number of studies have demonstrated that caregiver-rated treatment response was associated with caregiver beliefs regarding allocation of treatment condition, even when no significant differences were found between placebo and intervention groups on objective outcome measures (Guastella et al., 2015; Owley et al., 2001). Furthermore, Jones et al. (2017) found a decrease in parent-reported autism-related behaviors and problem behaviors over an eight-week period when, in fact, no treatment was employed.

Caregiver-reported biases may be attributed to the Rosenthal effect, in which expectations about the outcome of an intervention may affect caregivers’ responses. Other caregiver biases include overestimating a child’s abilities due to reluctance to acknowledge a child’s delays, difficulty recalling a child’s developmental milestones, investment in positive outcomes, and paying greater attention to challenging behaviors as opposed to prosocial behaviors, each of which could affect measurement of change (Miller et al., 2017; Nordahl-Hansen et al., 2014; Ozonoff et al., 2011; Zapolski & Smith, 2013). While caregiver report and clinician judgment are important sources of information, reliance on these measures alone limits the interpretation of intervention responses (Miller et al., 2017).

Diagnostic Tools Are Not Sensitive to Change over Short Periods of Time

Changes in autism-specific behaviors are often measured using diagnostic instruments (Dawson et al., 2010; Green et al., 2022). Yet, diagnostic instruments, such as the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2; Lord et al., 2012), were not intended to be used as outcome measures of short-term treatment response. Rather, they were intended to measure relatively stable constructs over time (Cunningham, 2012).

Thus, diagnostic instruments are typically not sensitive enough to detect subtle changes in a short period of time (Grzadzinski et al., 2020; Owley et al., 2001). While some studies have found significant changes over time in ADOS raw scores, these changes were also evident in treatment-as-usual groups (Green et al., 2022; Gutstein et al., 2007). In other studies that found significant raw score changes, changes were usually not evident over short periods of time and were related to changes in other domains, such as language development, as opposed to changes in the severity of autism features (Esler et al., 2015; Gotham et al., 2012). Furthermore, the use of raw score changes on diagnostic instruments, such as the ADOS-2, must be interpreted with caution due to the influence of age, language level, and verbal IQ on raw scores (Kim et al., 2018). As a result, Calibrated Severity Scores (CSS) were created as a standardized metric of autism symptom severity that is less confounded by changes in general maturity or language development (Gotham et al., 2008).

The ADOS CSS has successfully measured changes in autism symptom severity (Gotham et al., 2008; Grzadzinski et al., 2020). Yet, these changes have been evident over long periods of time (i.e., years as opposed to months; Estes et al., 2015; Gotham et al., 2012; Pickles et al., 2016; Thurm et al., 2015). CSS scores demonstrate high test–retest reliability, meaning that scores are stable over short periods of time (Janvier et al., 2022). Considering that short-term intensive interventions are common for individuals with autism, CSS scores are likely not a useful outcome measure for testing their effectiveness.

Finally, diagnostic measures require substantial training to use reliably and are often time-consuming to administer. The amount of time and level of training required to administer and score these assessments make diagnostic instruments difficult to implement in large-scale, multisite studies, especially considering that they must be administered more than once to measure changes over time. An instrument that reliably measures autism-specific behaviors and is more sensitive to subtle changes over a short period of time will be of crucial importance for autism intervention research moving forward (Matson, 2007).

Lack of Uniform Measurement Approach

Autism intervention research has utilized hundreds of disparate outcome measures to test the effectiveness of various treatments (Bolte & Diehl, 2013). There is little consensus regarding which symptoms to target or which tools to use in determining intervention effectiveness (Anagnostou et al., 2015; Sandbank et al., 2020). This is due, at least in part, to the heterogeneity of the type and severity of autism features. For example, deficits in social communication could include verbal or nonverbal communication delays and difficulty developing or maintaining relationships, among many other possible areas of difficulty (Volkmar et al., 2004).

The lack of a uniform measurement approach also results from the use of study-specific outcome measures designed to capture specific behaviors, such as joint attention (Green et al., 2022; Kasari et al., 2012; Rogers et al., 2012; Yoder et al., 2014). In a large-scale review of 195 prospective intervention trials for individuals with ASD, 289 outcome measures were identified (Bolte & Diehl, 2013). Of the 289 measures, 62% were found in only one publication over a 10-year period, and 21% were designed or modified by the research investigator specifically for use in that study.

Study-specific outcome measures are often limited to quantifying the frequency of highly specific behaviors (Kaale et al., 2012), as opposed to capturing changes in broad social communicative behaviors (Spence & Thurm, 2010). These measures are often proximal to the treatment and may reflect learning a task in a specific context, though they are targeted with the hope that improvement of these behaviors will have positive cascading effects on other domains, such as language development (Green et al., 2022). While identifying changes in specific behaviors is important, whether these context-specific behaviors result in generalized gains across broad social communication strategies often goes unmeasured (Sandbank et al., 2020; Yoder et al., 2013). Moreover, behaviors can be operationalized differently across studies, making the comparison of results across outcome measures nearly impossible, even when they appear to measure the same behavior (Wolery & Garfinkle, 2002).

Call for Novel Measures

Reliably measuring changes in social communicative behaviors as a result of intervention has proven especially difficult. These behaviors are often subtle, meaning their measurement must be sensitive enough to capture small, but clinically meaningful changes that indicate measurable improvement and, ideally, predict other positive gains (Anagnostou et al., 2015; Grzadzinski et al., 2020). Expert panels have concluded that existing outcome measures widely used in ASD intervention research are not appropriate intervention response measures without certain modifications (e.g., only appropriate for specific populations, such as young children or those with average or greater IQ), making the use of a uniform outcome measurement approach difficult (Anagnostou et al., 2015; McConachie et al., 2015; Scahill et al., 2015). Moreover, few measures are flexible enough to be available for use across studies or sites. There is currently a call by intervention researchers for novel outcome measures that can reliably detect change, be used across studies, and fill the gaps left behind by the limitations of existing measures (Fletcher-Watson & McConachie, 2017; McConachie et al., 2015).

The Brief Observation of Social Communication Change (BOSCC)

The Brief Observation of Social Communication Change (BOSCC) was developed to provide a blinded, standardized, and efficient method of measuring subtle changes in the social communicative behaviors of individuals with ASD and other neurodevelopmental conditions over relatively short periods of time (i.e., as short as 8–12 weeks). It is a play-based assessment conducted with a participant and play partner, such as a parent or research staff member. It was initially developed using codes from the ADOS, which apply specifically to ASD symptoms, but these codes were modified and expanded upon to examine and measure more subtle social communicative behaviors (Grzadzinski et al., 2016; Lord et al., 2012).

The BOSCC is flexible and standardized, which allows for its use across sites and studies. It is observation-based, using interactions with play partners who may or may not be blinded to treatment, and it is coded by individuals who must be blind to treatment condition and goals, lessening the possibility of bias or placebo effects. The goal of the BOSCC is to address the challenges that intervention research faces in measuring intervention effectiveness by providing a novel, standardized outcome measure that is minimally biased, sensitive to change over short periods of time, easy to code, and flexible enough to be used in a variety of settings, with a variety of populations and research contexts, and by people of all skill levels.

Current State of the BOSCC

The BOSCC has been validated for use with minimally verbal (MV) children with ASD, called the BOSCC-MV (Grzadzinski et al., 2016; Kim et al., 2018; Kitzerow et al., 2016; Nordahl-Hansen et al., 2016). Using a sample of 56 children between the ages of 1 and 5 years, results demonstrated statistically significant changes in the “Core Total” items as compared to a no-change alternative; ADOS CSS scores over the same period showed no statistically significant changes (Grzadzinski et al., 2016). Furthermore, psychometric analyses of the BOSCC-MV showed high to excellent inter-rater reliability and test–retest reliability. Exploratory Factor Analyses (EFA) revealed two underlying factors: Social Communication (SC) and Restricted and Repetitive Interests and Behaviors (RRB). This two-factor structure mapped onto well-known behavioral diagnostic assessments of ASD, such as the ADOS, and generally fits well with diagnostic features of ASD as specified in the DSM-5 and ICD-11 (American Psychiatric Association, 2013; World Health Organization, 2019). Since the original validation study was published, several other studies have corroborated the strong psychometric properties of the BOSCC-MV, its ability to detect changes over a short period of time, and its suitability for administration in a variety of settings with various play partners (Gengoux et al., 2019; Green et al., 2022; Kim & Lord, 2010; Kitzerow et al., 2016; Nordahl-Hansen et al., 2016).

During the development of the BOSCC, pilot testing was conducted to determine whether the BOSCC-MV administration and coding scheme could be used with older and more verbal children. This developmental stage revealed that the BOSCC-MV was unable to identify changes in children over the age of 8; scores were variable over time and not related to intervention status in children receiving a range of treatments at four different sites. Furthermore, changing the coding scheme without modifying the administration was insufficient. Both the context in which the interaction occurred and the codes used for scoring needed to be altered for changes to be detected. We used these pilot data, collected during the developmental stage of the BOSCC, to extend the work already conducted on the BOSCC-MV by creating new contexts (e.g., administration materials appropriate to developmental level) and a new coding scheme more appropriate for older and more verbal children with ASD.

Current Study

The aim of the present study is to determine the utility of the BOSCC as an outcome measure of intervention response in a sample of young autistic children who are verbally fluent, or autistic individuals of any age who consistently and spontaneously use phrase speech. This version is called the BOSCC Phrase Speech/Young Fluent (PSYF). More specifically, this paper will (1) determine items for inclusion in the final BOSCC-PSYF coding scheme and its algorithm, (2) analyze the factor structure of the measure by exploring the relationships between items, (3) examine the initial psychometric properties, including inter-rater and test–retest reliability, and (4) provide evidence of its utility as a measure of change by examining changes in scores over time in autistic individuals receiving various behavioral interventions and compare these changes to changes in scores over the same period of time in other behavioral and parent-report measures.

Method

Participants

Participants included 160 English-speaking children between the ages of 2 and 10 years (Mage = 4.9 years) with a documented diagnosis of autism spectrum disorder (ASD). Caregivers reported sex at birth: 82% of participants were male and 18% were female. Caregiver-reported racial and ethnic identities and parental education can be found in Supplement Table 1. All participants had language abilities suitable for the PSYF administration. Specifically, the BOSCC-PSYF is appropriate for: (1) individuals of any age who use phrase speech (defined as spontaneous, non-rote two-word phrases that include both a noun and a verb, such as “want ball”), or (2) children with fluent language (defined as multiclausal sentences with flexible grammatical and sentence structures) who are younger than 8 years of age. All administrations and scoring were completed in English.

All participants were actively receiving behavioral intervention at the time of participation, though the types of intervention varied. For example, some children were enrolled in a short-term intensive day program (approximately 35 h per week for 16 weeks), while others were enrolled in various less-intensive (at least once per week) long-term, ABA-style programs (approximately 10–20 h per week). For the purposes of this validation study, comparison of specific treatment effects across the various interventions will not be explored.

Procedure

Participants (n = 160) were recruited from three sources. The first was a short-term, intensive partial hospitalization program for children with ASD (n = 30). The two other sources (n = 25, n = 105) were research studies that took place on UCLA’s and Weill Cornell Medicine’s campuses. All participating families signed informed consent forms approved by the participating institutions’ Institutional Review Boards before participating in this study.

Whenever possible, each source administered the BOSCC at two or more timepoints, along with other diagnostic, cognitive, and adaptive behavior measures. Between one and six videos were available for each child (M = 2.05 videos, SD = 0.57). Sixteen participants were lost to follow-up; thus, two or more videos were available for 144 participants. All BOSCCs were conducted in an in-person laboratory setting with a research assistant. Participants with only a single BOSCC datapoint available were retained for psychometric analyses of validity (e.g., factor analyses) and reliability (e.g., inter-rater), but not for analyses of change.

Measures

Brief Observation of Social Communication Change (BOSCC-PSYF)

The BOSCC was developed as an intervention response measure of social communicative and other behaviors associated with autism. The BOSCC is a 12-min, videotaped play interaction between an individual and a play partner. The BOSCC can be administered in a lab, clinic, or home setting, though it is essential that this context and the type of play partner remain consistent across each observation. The play interaction is conducted with a standardized set of toys that is designed to offer opportunities for active participation and various levels of play between the participant and play partner. It was designed to be easy to administer and, thus, can be implemented with caregivers, research assistants or clinicians who receive minimal instruction, as long as someone of the same role administers the BOSCC at all time points for a given participant. For purposes of this study, all BOSCC administrations were implemented by clinicians (not administering treatment) or research assistants.

Coding Procedures

The BOSCC videos are split into two 6-min segments (Segment A and Segment B), each of which is watched and coded twice. The BOSCC-PSYF includes 17 items that are scored on a 6-point scale ranging from 0 (“atypical behavior not present”) to 5 (“atypical behavior present and significantly impairs functioning”). The PSYF items comprise 10 items from the BOSCC-MV (Grzadzinski et al., 2016), modified to fit the social communicative behaviors of children with greater language skills, and three novel items (i.e., verbal exchanges, offering information, and stereotyped speech). These 13 items are averaged across the two segments and summed to create a total BOSCC score. The final four items, which are not included in the scoring process, are used as indicators of mood/disposition and other co-occurring behaviors sometimes seen in ASD (i.e., social engagement in play activities/interaction, activity level, disruptive behaviors, anxious behaviors). These items are scored to determine the validity of the administration and to provide quantitative information about non-ASD-specific behaviors frequently seen in children that may impact social communicative behaviors (e.g., oppositional behaviors). High scores suggest that difficulties observed during the assessment may be exacerbated by issues other than ASD symptoms.
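To make the scoring logic described above concrete, the following is a minimal sketch in R of how a Core Total could be computed from segment-level codes; the function name, vectors, and example values are hypothetical placeholders, not the official BOSCC scoring script.

```r
# Minimal sketch of the BOSCC-PSYF Core Total computation described above.
# Vectors hold the 13 core item codes (0-5) for Segment A and Segment B,
# in the same item order. Names and values are hypothetical.

score_boscc <- function(codes_a, codes_b) {
  stopifnot(length(codes_a) == 13, length(codes_b) == 13)
  item_means <- (codes_a + codes_b) / 2   # average each item across segments
  core_total <- sum(item_means)           # sum averaged items for the Core Total
  list(item_means = item_means, core_total = core_total)
}

# Example usage with made-up codes
seg_a <- c(2, 3, 1, 0, 2, 4, 1, 2, 3, 1, 0, 2, 1)
seg_b <- c(1, 3, 2, 0, 2, 3, 1, 2, 2, 1, 1, 2, 1)
score_boscc(seg_a, seg_b)$core_total
```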

The BOSCC coding scheme employs empirically based decision trees for ease of use (Grzadzinski et al., 2016). Each decision tree contains detailed information regarding the frequency and quality of specific behaviors. At each branch, the coders answer a specific question (e.g., yes or no, frequency, consistency) concerning the child’s behavior on that specific item (e.g., eye contact) until they arrive at a numerical code. Videos were coded by one psychologist, one postdoctoral researcher, four graduate students, and one research assistant. All coders obtained reliability before beginning the coding process (as described in Grzadzinski et al., 2016) and were blind to timepoint and treatment status. A random sub-sample of 54 videos was chosen to determine inter-rater reliability.

Additional Measures

Other diagnostic, cognitive, and adaptive functioning assessments were collected from all participants as part of their involvement in various intervention programs. The battery of assessments each participant received varied depending upon which source the participant was recruited from; however, whenever possible, all participants were administered at least one measure of ASD symptom severity, one cognitive test, and one measure of adaptive functioning. As a result of COVID restrictions, this was not possible for everyone. Thus, participation within each measure (described below) was variable. The results of these assessments were included in this study for purposes of comparing rates of change in these measures to that of the BOSCC.

ASD Symptom Severity

ASD symptom severity was measured in two ways: the Autism Diagnostic Observation Schedule (ADOS-2; Lord et al., 2012) and the Social Responsiveness Scale (SRS-2; Constantino & Gruber, 2012). The ADOS is a standardized diagnostic measure comprising both structured and semi-structured tasks used to assess symptoms of ASD. The ADOS-2 provides a total Calibrated Severity Score (CSS) that indicates the severity of autism symptoms during the assessment and can be used to compare symptom severity levels across individuals of varying developmental levels. Domain severity scores are also provided for the social affect (CSS SA) and restricted and repetitive behaviors (CSS RRB) domains (Gotham et al., 2008). The ADOS-2 was administered to 88 of our participants at one time point. Twenty-five participants received Module 2, which is appropriate for individuals of any age who speak in phrases but are not verbally fluent. The remaining 63 participants received Module 3, which is appropriate for verbally fluent children and adolescents. ADOS-2 scores were collected for 61 individuals at a second timepoint, which allowed for analysis of change in scores over time. None of the clinicians who administered the ADOS-2 were involved in the coding of the BOSCC, allowing coders to be completely blind to the participant and timepoint.

The SRS is a parent-report measure that identifies the presence and severity of social impairment in individuals with ASD. The SRS was collected for 51 participants at one time point, and 18 participants at two time points.

Cognitive Functioning

Verbal and nonverbal cognitive functioning were assessed using a variety of measures, including the Mullen Scales of Early Learning (MSEL; Mullen, 1995), the Differential Ability Scales (DAS-2; Elliot et al., 2018), the Wechsler Preschool and Primary Scale of Intelligence (WPPSI-IV; Wechsler, 2012), the Peabody Picture Vocabulary Test (PPVT-IV; Dunn & Dunn, 2007), and Raven’s Progressive Matrices (Raven et al., 2000). The MSEL was collected for 30 children, the DAS-2 for 62 children, the WPPSI-IV for 19 children, and the PPVT-IV and Raven’s for 25 children. Cognitive measures were only collected at one timepoint, so analysis of change in scores over time was not conducted.

Adaptive Functioning

The Vineland Adaptive Behavior Scales (VABS-3; Sparrow et al., 2016) is a measure of adaptive functioning that provides standard scores in the domains of Communication, Daily Living Skills, Socialization, and Motor Skills, as well as an overall Adaptive Behavior Composite. Only the Communication and Socialization domains were used for purposes of this study. A combination of the comprehensive interview form and the caregiver report form was used. The VABS was administered to 151 participants at one timepoint and to 73 participants at a second timepoint, allowing for analysis of change in scores over time.

Data Analysis

All analyses were carried out using R version 4.0.2 (R Core Team, 2021); the lavaan package was used to estimate all factor analysis models (Rosseel, 2012).

Item Level Descriptive Information

Several versions of the BOSCC-PSYF item level coding schemes were tested with the goal of achieving as close as possible to uniform distribution of codes across all items, though a normal distribution was also acceptable. Item level codes were re-written over several iterations until near-flat distributions were achieved (Grzadzinski et al., 2016). Consistent with the BOSCC-MV, we did not expect a uniform distribution for items related to restricted and repetitive interests and behaviors (RRBs) because the presentation of these behaviors is extremely heterogeneous across individuals and the short duration of the BOSCC assessment may not allow for consistent presentation of these behaviors (Kim & Lord, 2010).

Factor Structure

A multi-step process was undertaken to systematically evaluate the factor structure of the BOSCC-PSYF. While the factor structure of the BOSCC-MV has been validated, an exploratory approach was taken here due to differences between the coding schemes and the intended populations of the two versions (Grzadzinski et al., 2016).

Model fit was determined using the Tucker-Lewis Index (TLI), Comparative Fit Index (CFI), and root mean squared error of approximation (RMSEA). Values closer to “1” using the TLI and CFI and values closer to “0” using the RMSEA indicate better model fit. Recommended cutoffs for well-fitting models are typically greater than 0.95 for the TLI and CFI and ≤ 0.06 for the RMSEA, though these cutoffs may be overly exclusive in small samples (Hu & Bentler, 1999).

Using baseline data, scree and parallel plots were generated to guide decisions about the number of factors to extract. Subsequently, exploratory structural equation models (ESEM) were fit to the data. This involved fitting an Exploratory Factor Analysis (EFA) model with an oblimin rotation using maximum likelihood estimation, and testing one-, two-, three-, and four-factor solutions. This was followed by a Confirmatory Factor Analysis (CFA) using the cross loadings from the EFA as the starting point for estimation. Factors were allowed to covary in these models. ESEM was chosen to balance the drawbacks of overly restrictive CFA models (e.g., cross loadings between factors are typically set to zero) while allowing for modifications and extensions (Asparouhov & Muthen, 2009; Marsh et al., 2014).
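For readers unfamiliar with this workflow, the sketch below illustrates the factor-retention and exploratory steps in R using the psych package; the data frame name boscc_items is a hypothetical placeholder, and the ESEM models reported here were estimated in lavaan rather than psych, so this is an illustration of the general approach rather than the exact analysis code.

```r
# Sketch of the factor-retention and EFA steps described above.
# 'boscc_items' is a hypothetical data frame of baseline item-level codes.
library(psych)

# Scree plot and parallel analysis to guide the number of factors to extract
fa.parallel(boscc_items, fm = "ml", fa = "fa")

# Exploratory factor analyses with oblimin rotation and maximum likelihood,
# testing one- through four-factor solutions
efa_fits <- lapply(1:4, function(k) {
  fa(boscc_items, nfactors = k, rotate = "oblimin", fm = "ml")
})

# Inspect, for example, the three-factor solution
print(efa_fits[[3]]$loadings, cutoff = 0.30)
```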

Lastly, a traditional CFA model was fit to the full dataset based on the observed factor structure from the ESEM to confirm that the factor structure holds up using the full dataset. Factors were allowed to covary. TLI, CFI, and RMSEA were used to evaluate model fit.
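A minimal sketch of such a confirmatory model in lavaan follows; the item names and their assignment to factors are hypothetical placeholders standing in for the BOSCC-PSYF items, not the actual item labels or final structure.

```r
# Sketch of the confirmatory step described above, using lavaan.
# Item names (ec1..ec4, sr1..sr5, rrb1..rrb3) are hypothetical placeholders.
library(lavaan)

cfa_model <- '
  EarlyComm  =~ ec1 + ec2 + ec3 + ec4
  SocRecLang =~ sr1 + sr2 + sr3 + sr4 + sr5
  RRB        =~ rrb1 + rrb2 + rrb3
'

# Factors covary by default in cfa(); 'boscc_items' is the full-sample data frame
fit_cfa <- cfa(cfa_model, data = boscc_items, estimator = "ML")

fitMeasures(fit_cfa, c("cfi", "tli", "rmsea"))  # fit indices used in this paper
standardizedSolution(fit_cfa)                   # standardized loadings and covariances
```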

Longitudinal Measurement Invariance

Four steps were taken to evaluate the measurement invariance of the BOSCC-PSYF over time. These four steps were: (1) configural invariance, which tests whether the factor structure is comparable across entry and exit; (2) metric invariance, which tests whether the items load onto the same factors across entry and exit; (3) scalar invariance, which compares the intercepts across entry and exit; and (4) strict invariance, which tests whether the residual variances are comparable across entry and exit. Nested models were tested using chi-square difference tests; non-significant tests indicate invariance across the models that were tested.
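The sketch below illustrates, in lavaan, how the first two steps of this sequence (configural and metric) could be specified for a single factor measured at entry and exit; the item names with _t1/_t2 suffixes and the data frame name are hypothetical placeholders, and the scalar and strict steps would add intercept and residual constraints in the same manner.

```r
# Sketch of the longitudinal invariance comparison described above (lavaan).
# Item names and 'boscc_long_wide' are hypothetical placeholders.
library(lavaan)

configural <- '
  EC_t1 =~ ec1_t1 + ec2_t1 + ec3_t1
  EC_t2 =~ ec1_t2 + ec2_t2 + ec3_t2
  # residuals of the same item correlate across time
  ec1_t1 ~~ ec1_t2
  ec2_t1 ~~ ec2_t2
  ec3_t1 ~~ ec3_t2
'

metric <- '
  EC_t1 =~ l1*ec1_t1 + l2*ec2_t1 + l3*ec3_t1   # shared labels constrain the
  EC_t2 =~ l1*ec1_t2 + l2*ec2_t2 + l3*ec3_t2   # loadings to be equal over time
  ec1_t1 ~~ ec1_t2
  ec2_t1 ~~ ec2_t2
  ec3_t1 ~~ ec3_t2
'

fit_configural <- cfa(configural, data = boscc_long_wide)
fit_metric     <- cfa(metric, data = boscc_long_wide)

# A non-significant chi-square difference supports metric invariance;
# scalar and strict models add intercept and residual constraints analogously
lavTestLRT(fit_configural, fit_metric)
```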

Reliability

Test–retest and inter-rater reliability were analyzed. Test–retest reliability was estimated from 26 participants who had a second BOSCC conducted within one month of the first. Inter-rater reliability was estimated from 54 double-coded videos. Absolute agreement was assessed using two-way random effects models. Inter-rater and test–retest reliability results are described for each domain of the BOSCC-PSYF derived from the factor analyses and for the Core Total.
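For illustration, a minimal sketch of this ICC specification (two-way random effects, absolute agreement) using the irr package is shown below; the ratings matrix is a hypothetical placeholder.

```r
# Sketch of the ICC computation described above.
# 'ratings' is a hypothetical matrix with one row per video and one column
# per coder (or per timepoint for test-retest), holding domain or Core Total scores.
library(irr)

icc(ratings, model = "twoway", type = "agreement", unit = "single")
```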

Change Analyses

Following procedures from the analyses of the BOSCC-MV (Grzadzinski et al., 2016; Kim & Lord, 2010), first, paired sample t-tests were used to compare BOSCC Core Total and domain scores and other behavioral measures (i.e., VABS Communication and Socialization, ADOS CSS and SRS Total T-Scores) from the first available timepoint to the last available timepoint. This raw score difference was also standardized as a Cohen’s d effect size. Next, individual growth models were fit separately using BOSCC Core Total and domain scores, as well as the other behavioral measures with sufficient data (i.e., VABS Communication and Socialization, ADOS CSS and SRS Total T-Scores). This involved fitting a linear regression separately for each participant using participant’s age at the time of assessment as the independent variable to generate an average rate of change per month. Rate of change over 4.5 months, the average length of time between the entry and exit appointments in our sample, was reported. To be consistent with prior analyses, these rates of change were also converted to expected change scores over 6 months (Grzadzinski et al., 2016; Kim & Lord, 2010). These rates were then divided by the standard deviation of the measure at entry to generate an effect size comparable to a Cohen’s d. Due to the wide range of age and cognitive abilities of the participants, we also ran a linear mixed effect model to evaluate whether age, verbal IQ (VIQ) and nonverbal IQ (NVIQ) at entry were related to or moderated change in BOSCC scores over time.
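A minimal sketch of these change analyses in R follows; the data frame df and its column names are hypothetical placeholders, and the paired Cohen's d shown uses the standard deviation of the difference scores (one common variant), which may differ from the exact standardization used in this paper.

```r
# Sketch of the change analyses described above. 'df' is a hypothetical
# long-format data frame with one row per participant per timepoint.
library(lme4)

# Paired t-test on Core Total from entry to exit
# (assumes the entry and exit vectors are aligned by participant)
entry <- df$core_total[df$timepoint == "entry"]
exit  <- df$core_total[df$timepoint == "exit"]
t.test(exit, entry, paired = TRUE)
mean(exit - entry) / sd(exit - entry)        # paired Cohen's d (dz variant)

# Individual growth models: one regression per participant,
# with age at assessment predicting the Core Total
slopes <- sapply(split(df, df$id), function(d) {
  coef(lm(core_total ~ age_months, data = d))["age_months"]
})
mean_change <- mean(slopes, na.rm = TRUE) * 4.5   # expected change over 4.5 months
mean_change / sd(entry)                           # effect size relative to entry SD

# Mixed model testing whether entry age, VIQ, and NVIQ moderate change over time
lmer(core_total ~ months_since_entry * (age_entry + viq_entry + nviq_entry) +
       (1 | id), data = df)
```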

Lastly, again following the procedures of Grzadzinski et al. (2016) and Kim & Lord (2010), change of greater than or equal to 8 points on VABS Standard Scores and SRS Total T-Scores (half a standard deviation) and greater than or equal to 1 point on ADOS CSS Scores (1 standard deviation) were used to classify “responders” on each of the behavioral measures (i.e., VABS Communication and Socialization, ADOS CSS and SRS Total T-Scores). After response status was determined, independent samples t-tests were used to determine whether “responders” and “non-responders” for each measure differed in the amount of change on BOSCC domain and Core Total Scores.
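For illustration, a minimal sketch of the responder classification and comparison, using the SRS criterion as an example, is shown below; the data frame wide and its column names are hypothetical placeholders.

```r
# Sketch of the responder classification and comparison described above.
# 'wide' is a hypothetical data frame with one row per participant.

# "Responders" on the SRS: improvement of >= 8 T-score points (half an SD)
wide$srs_responder <- (wide$srs_entry - wide$srs_exit) >= 8

# Change in BOSCC Core Total for each participant
wide$boscc_change <- wide$boscc_exit - wide$boscc_entry

# Independent samples t-test: do SRS responders show more BOSCC change?
t.test(boscc_change ~ srs_responder, data = wide)
```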

Results

Item Level Descriptive Information

The distribution of BOSCC-PSYF codes (averaged for Segment A and B) across 14 out of the 17 items in the final version of the BOSCC-PSYF can be found in Supplement 1. Activity Level, Disruptive Behavior/Irritability, and Anxious Behaviors are not depicted because these items were rarely observed and scored; however, these items provide useful information in determining whether the BOSCC administration is a representative sample of the child’s behavior. Thus, these codes were retained in the final coding scheme.

Factor Structure

Two sets of ESEM models were tested; the first included all items. The scree and parallel plots indicated a four-factor solution best fit the data (RMSEA = 0.045, CFI = 0.987, TLI = 0.971). These factors could be described as: (1) Early Communication, (2) Social Reciprocity/Language, (3) Play and (4) Restricted and Repetitive Behaviors and Interests (RRBs).

Due to substantive and statistical concerns about over-specifying and mis-specifying the model, such as a negative variance estimate for the “Play with Objects” item, a second ESEM model was fit excluding the “Play with Objects” item. The scree and parallel plots suggested a three-factor solution best fit the data. One-, two-, and three-factor solutions were tested. The best fitting solution was the three-factor solution (RMSEA = 0.074, CFI = 0.962, TLI = 0.929). Parameter estimates for the three-factor model are included in Table 1. These factors could be described as: (1) Early Communication, (2) Social Reciprocity/Language, and (3) RRBs. The “Engagement in Play with Others” item loaded onto the Social Reciprocity/Language factor in the absence of the “Play with Objects” item. The fit of the final model was adequate without the inclusion of the “Play with Objects” item. However, due to the clinical value that the item provides and the possibility of play skills improving with intervention, a decision was made to keep this item in the final codes but not include it in the measure’s factor structure. Figure 1 depicts the final items, domains, and Core Total.

Table 1 Exploratory structural equation model factor loadings without play
Fig. 1 Visual depiction of BOSCC items, domains, and totals

The CFA model adequately fit the data (CFI = 0.937, TLI = 0.921, and RMSEA = 0.076). Item loadings across the three factors were high, apart from some RRB items. The factor loadings of the items onto the Early Communication factor ranged from 0.63 to 0.95, from 0.71 to 0.84 for the Social Reciprocity/Language factor, and from 0.25 to 0.58 for the RRB factor. Model parameters across items are included in Table 2.

Table 2 Final confirmatory factor analysis parameter estimates

Measurement Invariance

Across time, there was evidence of configural, metric, scalar, and strict invariance. This suggests that the factor structure, item loadings, intercepts and residuals do not change substantially when measuring individuals across time. This is an indication that the BOSCC-PSYF measures the same factor structure across timepoints; thus, comparing mean scores across time is appropriate. Comparisons of the model fit statistics are provided in Table 3.

Table 3 Measurement invariance model fit

Test–Retest Reliability

Test–retest reliability was estimated from 26 videos. Adequate test–retest reliability could not be reached for one item, “Engagement in Play with Others,” likely due to the variable nature of the construct it measures. Additionally, when coding this item, coders anecdotally reported that the score seemed to fluctuate based on the child’s mood and disposition during the assessment. As a result, this item was removed from the algorithm and added as an “Other Abnormal Behaviors” code.

Overall, test–retest reliability was good for the Early Communication and Social Reciprocity/Language domains and fair for the RRB domain. The Intraclass Correlation Coefficient (ICC) value for the Early Communication domain was 0.74, 95% CI [0.50, 0.87], 0.68, 95% CI [0.40, 0.84] for the Social Reciprocity/Language domain and 0.51, 95% CI [0.15, 0.75] for the RRB domain. The ICC value for the combined Social Communication domain was 0.75, 95% CI [0.51, 0.88] and 0.71, 95% CI [0.45, 0.86] for the Core Total score.

Inter-Rater Reliability

Inter-rater reliability was estimated from 54 double-coded videos. Inter-rater reliability was good for the Early Communication domain and Social Reciprocity/Language domain and fair for the RRB domain. The ICC value for the Early Communication domain was 0.85, 95% CI [0.75, 0.91], 0.89, 95% CI [0.81, 0.94] for the Social Reciprocity/Language domain and 0.60, 95% CI [0.38, 0.75] for the RRB domain. The ICC value for the combined Social Communication domain and Core Total was 0.91, 95% CI [0.85, 0.95] and 0.90, 95% CI [0.83, 0.94], respectively.

Change Analysis and Validity

Paired t-tests indicated statistically significant decreases (improvement in symptoms) in scores from entry to exit on the Early Communication domain (M = − 0.71 [0.17, 1.25], t(111) = 2.61, p = 0.01, Cohen’s d = − 0.25), the Social Reciprocity/Language domain (M = − 1.17 [0.55, 1.78], t(111) = 3.77, p < 0.01, Cohen’s d = − 0.36), the combined Social Communication domain (M = − 1.87 [0.94, 2.81], t(111) = 3.97, p < 0.01, Cohen’s d = − 0.38), and the Core Total (M = − 1.87 [0.82, 2.92], t(111) = 3.53, p < 0.01, Cohen’s d = − 0.33). There were no statistically significant changes in the RRB domain. Over the same length of time, there was no statistically significant change in ADOS CSS scores (M = − 0.486 [− 0.17, 1.15], t(36) = 1.48, p = 0.49, Cohen’s d = − 0.24), VABS Communication scores (M = 0.46 [− 2.67, 3.61], t(29) = 0.30, p = 0.76, Cohen’s d = 0.05), or VABS Socialization scores (M = − 1.23 [1.54, 4.02], t(29) = 0.91, p = 0.37, Cohen’s d = 0.17). There was a statistically significant change in SRS Total T-Scores (M = − 6.16 [1.75, 10.58], t(11) = 3.07, p = 0.01, Cohen’s d = 0.89).

Results from the individual growth models indicated that the average rate of change on the BOSCC over 4.5 months was small in the Early Communication domain (Cohen’s d = − 0.28, 95% CI [− 0.65, 0.09], see Fig. 2a), and greater for the Social Reciprocity/Language domain (Cohen’s d = − 0.40, 95% CI [− 0.65, − 0.15], see Fig. 2b), the combined Social Communication domain (Cohen’s d = − 0.38, 95% CI [− 0.63, − 0.13], see Fig. 2c), and the Core Total (Cohen’s d = − 0.39, 95% CI [− 0.65, − 0.13], see Fig. 2d), with larger changes when comparisons were made over 6 months. The average rate of change over 4.5 months was smaller for the VABS Communication Standard Score (Cohen’s d = − 0.07, 95% CI [− 0.23, − 0.08]), the VABS Socialization Standard Score (Cohen’s d = − 0.09, 95% CI [− 0.25, 0.06]), and the ADOS CSS (Cohen’s d = − 0.32, 95% CI [− 0.80, 0.16]) than for the BOSCC-PSYF. The average rate of change over 4.5 months was larger for the SRS Total score (Cohen’s d = − 1.33, 95% CI [− 2.31, − 0.36]) than for the BOSCC-PSYF (though, as mentioned previously, this finding should be interpreted with caution due to the small sample of SRS scores at exit).

Fig. 2 a Decrease in Early Communication domain scores (improvement in behavior) over 4.5-month period. b Decrease in Language/Social Reciprocity domain scores (improvement in behavior) over 4.5-month period. c Decrease in combined Social Communication domain scores (improvement in behavior) over 4.5-month period. d Decrease in Core Total scores (improvement in behavior) over 4.5-month period

Moderating Variables

The results of the linear mixed models suggested that children’s chronological age (t = − 2.00, p = 0.049), though not VIQ (t = − 1.78, p = 0.079) or NVIQ (t = 0.203, p = 0.84), was related to Early Communication scores, with younger children having higher (more impaired) scores on the Early Communication domain. Children’s chronological age (t = − 2.49, p = 0.015) and VIQ (t = − 3.62, p < 0.001), though not NVIQ (t = 1.46, p = 0.149), were related to Social Reciprocity/Language scores, with younger children and children with lower VIQs having higher Social Reciprocity/Language scores. The combined Social Communication Total scores were related to both children’s age (t = − 2.44, p = 0.02) and VIQ (t = − 2.97, p = 0.004), though not NVIQ (t = 0.93, p = 0.35). VIQ (t = − 4.11, p < 0.001), though neither chronological age (t = − 1.84, p = 0.07) nor NVIQ (t = 0.99, p = 0.32), was related to Core Total scores. Most relevant to the use of the BOSCC as a measure of intervention response, change over time in all domains and in Core Total scores was not moderated by age, VIQ, or NVIQ.

Response Status

Independent samples t-tests comparing the amount of change in BOSCC domain and Core Total scores by response status indicated that individuals who were considered “responders” on the SRS Total Score demonstrated significantly more change than SRS “non-responders” in the BOSCC Early Communication domain (t(16) = 2.34, p = 0.03) and the combined Social Communication domain (t(17) = 2.21, p = 0.04). There were no statistically significant differences on any BOSCC domain score for VABS Communication, VABS Socialization, or ADOS CSS responders.

Discussion

Results from our analyses confirm prior literature that the BOSCC is a promising outcome measure of intervention response (Grzadzinski et al., 2016). The BOSCC-PSYF, which is intended to be used with individuals of all ages who speak in flexible phrases or children under the age of 8 who speak in complex sentences, has been demonstrated to be sensitive to subtle changes in social communicative behaviors over a brief period of time. To the best of our knowledge, the BOSCC is the first brief, observation-based outcome measure of intervention response which measures a range of broad social communicative behaviors that is sensitive to changes in a short period of time. The BOSCC can be conducted by individuals of any skill level, including caregivers, therapists, naïve research assistants or highly trained clinicians.

A three-factor structure proved to be the best fit to the data. The items relating to broad social communicative behaviors were split into two domains—one including nonverbal and early communicative behaviors, the second including behaviors that relate to social reciprocity and are mostly based in language skills. The three-factor structure of the BOSCC-PSYF diverges from the two-factor structure evident in the BOSCC-MV (Grzadzinski et al., 2016), but is similar to the factor structure in the ADOS Module 3 described in Zheng et al. (2021). The three-factor structure of the BOSCC allows for researchers to decide whether to examine the two Social Communication domains separately or together, depending on the goals of treatment.

Using individual growth models, the Social Reciprocity/Language domain demonstrated statistically significant changes over time, whereas the Early Communication domain did not (though paired t-tests showed significant changes in both domains). This finding diverges from the BOSCC-MV, in which young children are most likely to demonstrate changes in items such as eye contact, facial expressions, and gestures (similar to the BOSCC-PSYF Early Communication domain), whereas the older and more verbal children in our sample were most likely to demonstrate changes in the Social Reciprocity/Language domain (Grzadzinski et al., 2016). As mentioned previously, it is possible that interpreting the domains separately may prove most useful in identifying changes, depending on the type of intervention children are involved in. Researchers should make a priori decisions regarding which skills are most likely to demonstrate change depending on the age of the child and the goals of the specific intervention.

Over a 4.5-month period, the BOSCC-PSYF demonstrated small, statistically significant effect sizes in the Social Reciprocity/Language, combined Social Communication, and Core Total domains. In contrast, the VABS Communication and Socialization scores over the same 4.5-month period demonstrated much smaller effect sizes. The effect size of the ADOS CSS was slightly smaller. While the SRS demonstrated large effect sizes over the 4.5-month period, this measure is a parent report measure which may allow for the possibility of bias because parents were aware of their child’s participation in an intervention at the time. These results suggest that the BOSCC-PSYF may be more sensitive to changes over brief periods of time than the VABS and possibly the ADOS CSS, two commonly used outcome measures (Grzadzinski et al., 2020), but should be used in conjunction with other measures, such as parent report measures, to achieve a comprehensive understanding of treatment response.

Similar to the results of the BOSCC-MV, the RRB domain of the BOSCC-PSYF demonstrated lower inter-rater and test–retest reliability and did not achieve a uniform or normal distribution of codes. While this was an expected outcome based on previous literature and from initial analyses of the ADOS (from which the BOSCC items were developed), it nonetheless indicates that the BOSCC RRB domain may not prove useful in identifying changes over short periods of time (Grzadzinski et al., 2016; Kitzerow et al., 2016; Lord et al., 2012). It may be that subtle changes in RRBs are difficult to capture in a brief observational measure, or that these behaviors do not vary greatly over time (e.g., they are either present or not). Nonetheless, the Core Total scores (i.e., RRB domain combined with the Social Communication domains) demonstrated significant amounts of change, indicating that RRBs are worth considering in conjunction with social communicative behaviors. Parent report data on RRBs should continue to be collected and used in combination with observation-based measures, such as the BOSCC.

Psychometric analyses indicate that the BOSCC has good inter-rater reliability. This is a promising finding, considering that individuals with varying levels of experience coded these BOSCC-PSYF videos (e.g., undergraduate research assistants, graduate students, and postdoctoral scholars). As is mentioned in Grzadzinski et al. (2016), because the BOSCC measures social communication changes within an individual, inter-rater reliability between individuals at one site is crucial, whereas reliability across sites is less important (unlike common diagnostic measures, such as the ADOS).

Psychometric analyses indicate that the BOSCC-PSYF also has good test–retest reliability for each domain and total (except for RRBs), though it is still lower than would be preferred. There are a few plausible reasons for this. First, because data were collected during the COVID-19 pandemic, the number of test–retest cases available for analysis (n = 26) was less than ideal. Future test–retest data will continue to be collected. Furthermore, establishing test–retest reliability of brief observational measures has proven challenging. The mood and disposition of children can fluctuate easily, changing the behaviors that arise during the assessment (which is true for many observational assessments, but especially challenging for brief assessments). Thus, the BOSCC should be conducted at a time when the child is in a neutral or positive mood and should be discontinued if the mood or behavior of the child is not representative of their usual behavior. Nonetheless, even in controlled settings, children demonstrate varying behaviors across BOSCC administrations over short periods of time, which we hope will not mask the measurement of intervention response.

The results garnered from this study are promising. However, there are several limitations to consider, the most prominent of which is the lack of a control group. All participants in this study participated in treatment of some kind, meaning that there was no “true” control group to which we could compare change scores. Future work will include the use of control groups to compare social communication changes across groups. Additionally, we would not expect that all participants would change in response to a particular treatment. Changes may be variable across individuals and across treatments. Larger sample sizes would allow consideration of individual differences in response to treatment, which we did not do here. In addition to larger sample sizes, future work on the BOSCC should also include racially diverse samples. Finally, the BOSCC is a measure of the generalization of changes in social communication to a standard set of activities; it is possible that some treatments result in proximal changes that, in the end, yield more general improvements that are not measured by the BOSCC.

Conclusion

Results from this study provide initial validation of the BOSCC-PSYF as an outcome measure of treatment response for individuals of all ages who have phrase speech and for verbally fluent children under the age of 8. The BOSCC provides a standardized, reliable, and valid measure of social communication changes over a short period of time. Its flexible nature allows individuals of varying skill levels to administer and code the assessment. While telehealth administration was not used for the current study, the BOSCC can be conducted through telehealth (by providing kits to families and videotaping caregiver-implemented BOSCC administrations through videoconferencing platforms), increasing accessibility across communities and meeting the needs of the changing environment during the COVID-19 pandemic and beyond (Zwaigenbaum et al., 2021). The BOSCC coding scheme can also be applied to videos that do not implement the standardized BOSCC administration, including caregiver–child interactions or segments of ADOS-2 administrations (Kim et al., 2018). The way the BOSCC is conducted (i.e., videotaped and later scored) allows for “truly blinded” coders who are unaware of participant characteristics, timepoint, or treatment status and, if it is carried out with “blinded” interaction partners, they too can remain unbiased. This measure, used in conjunction with existing measures of treatment response, including caregiver report, has the potential to fill an important gap in currently available outcome measures used in autism intervention research.