Differences between Novice and Expert Raters Assessing Trunk Control Using the Trunk Control Measurement Scale Spanish Version (TCMS-S) in Children with Cerebral Palsy

The Trunk Control Measurement Scale (TCMS) is a valid and reliable tool to assess static and dynamic trunk control in cerebral palsy (CP). However, there is no evidence on differences between novice and expert raters. A cross-sectional study was conducted with participants between the ages of 6 and 18 years with a CP diagnosis. The TCMS Spanish version (TCMS-S) was administered in person by an expert rater, and video recordings were taken for later scoring by the expert and three other raters with varying levels of clinical experience. The intraclass correlation coefficient (ICC) was used to evaluate reliability between raters for the TCMS-S total and subscale scores. Standard Error of Measurement (SEM) and Minimal Detectable Change (MDC) were also calculated. There was a high level of agreement between expert raters (ICC ≥ 0.93), while novice raters demonstrated good agreement (ICC ≥ 0.72). Additionally, novice raters had a slightly higher SEM and MDC than expert raters. The Selective Movement Control subscale exhibited slightly higher SEM and MDC values compared to the TCMS-S total and other subscales, irrespective of the rater’s level of expertise. Overall, the study showed that the TCMS-S is a reliable tool for evaluating trunk control in the Spanish pediatric population with cerebral palsy, regardless of the rater’s experience level.


Introduction
Cerebral palsy (CP) is defined as a group of disorders in the development of movement and posture, attributed to non-progressive brain lesions that occur during the perinatal stage, leading to limitations in activities and participation [1,2]. Children and youth with CP present a great heterogeneity of clinical forms and severity levels that are challenging to assess and measure [3,4]. It is essential to have reliable and valid assessment tools to identify weaknesses as well as strengths for each International Classification of Functioning (ICF) dimension in this population [5][6][7].
Trunk control has been shown to be a key function in the development of postural and movement control [8,9], seems to be related to the level of performance in the activities [10,11], and is also related to independence in self-care and mobility [12]. Wallard et al. [13,14] and Pierret et al. [15,16] investigated the importance of head and trunk control in children with CP. They found that children with CP have difficulty stabilizing their head and controlling their trunk during movement, which can affect their ability to walk and perform daily activities independently. Recent studies also found that a rehabilitation program focused on trunk postural activities can improve trunk control and walking ability in children with CP [16,17]. These findings highlight the importance of evaluating and addressing trunk control in the rehabilitation of children with CP.
The Trunk Control Measurement Scale (TCMS) is an assessment tool that measures static and dynamic trunk control, providing qualitative information on functioning in the three planes of space: frontal plane (inclinations), sagittal plane (flexion-extension), and transversal plane (rotations) [18,19]. It consists of three subscales: static sitting balance, selective movement control, and dynamic reaching. Various cultural adaptations of the TCMS have been developed, and each of them has demonstrated comparable psychometric properties to the original version. This confirms their suitability for both clinical and research purposes [18][19][20][21][22][23][24]. The TCMS Spanish version (TCMS-S) has recently been adapted for Spanish children and youth with CP, demonstrating adequate psychometric properties similar to those of the original TCMS scale [25].
Because CP is a lifelong condition, it is common for patients to be assessed by different raters over time. The results of assessments can be influenced by the level of experience and specific training of the rater, leading to differences from scores obtained by expert raters. There is evidence of this issue for widely used scales such as the GMFM [26], the Test of Gross Motor Development [27,28], and the Balance Error Scoring System (BESS) [29], where novice raters have shown lower agreement and, consequently, higher Standard Error of Measurement (SEM) and Minimal Detectable Change (MDC). It is important to note that higher SEM and MDC indicate lower accuracy when interpreting results of clinical studies, assessing a patient's functional capacity, or making clinical decisions. However, while there have been several studies exploring the reliability of the TCMS [12,18,[20][21][22]25], none have explored whether reliability differs by rater type, i.e., novice vs. expert. It seems important to verify that the TCMS-S maintains its psychometric properties independently of the rater and his/her experience [30].
The aim of the present study is to analyze the differences between ratings performed with the TCMS-S by expert and novice raters.

Study Design, Setting and Participants
The study was conducted using a cross-sectional design that followed the guidelines set by the Consensus-based Standards for the Selection of Health Status Measurement Instruments (COSMIN) [31]. The study received approval from the ethics committee of the Bio-Medical Foundation of the Hospital Infantil Universitario Niño Jesus of Madrid (registration number: R-0066/20) and was carried out in compliance with the Declaration of Helsinki.
Participants were recruited from the schedules of the traumatology and rehabilitation services at Hospital Infantil Universitario Niño Jesus in Madrid, using consecutive convenience sampling. Inclusion criteria were a diagnosis of CP, age between 6 and 18 years, the ability to sit without trunk or feet support and to follow test instructions, and Gross Motor Function Classification System (GMFCS) levels I-IV. Exclusion criteria were orthopedic surgery or botulinum toxin injection during the last 6 months [12]. Before initiating any action related to the study, informed consent was obtained from the families of all participants. Additionally, participants over 12 years old provided written consent, while those under 12 years old provided verbal consent to participate in the study.

Trunk Control Measurement Scale-S
The TCMS-S scale evaluates seated trunk control across three dimensions, with a maximum score of 58 points. These dimensions are static balance (20 points), selective movement control (28 points), and dynamic reaching (10 points). Each item is scored on a scale of 0 to 3, where 0 indicates an inability to perform the task and 3 represents complete performance. The test is active, and the evaluator provides verbal instructions, demonstrates movements visually, or guides the participant before asking them to perform the test. The best score from three attempts is recorded, and administration typically takes between 15 and 20 min [18].

Study Procedures
First, all raters were instructed in the use of the TCMS-S by the gold standard expert. A two-hour training session was held for all raters, during which the TCMS-S scale was presented, training videos were analyzed, and scoring was practiced. Any doubts that arose were also answered [29]. The gold standard expert was a physical therapist with over ten years of clinical experience in the specific field of pediatric care, as well as five years of experience using the TCMS scale. The other expert rater was a physical therapist with over ten years of clinical experience in pediatric care but without any experience using the TCMS scale. The novice raters consisted of a master's student in physical therapy and a physical therapist without any clinical experience in pediatrics. Neither of the novice raters had any prior experience with the TCMS scale [18,19,26,29]. The gold standard expert administered the TCMS-S scale to the participants with CP in a face-to-face clinical setting. These assessments were also video recorded for later scoring by all raters [12,27,29]. The video assessment was performed by the gold standard expert and the three other raters, who had different levels of clinical experience and experience with the scale. During video scoring, raters were allowed to pause the video recordings and to review items as many times as necessary [26]. All raters were blinded to each other's scores during the study period [27,29].
All requirements for the protection of personal data were met. The processing, communication, and transfer of personal data of all patients complied with the provisions of the applicable Organic Law on the protection of personal data. All data were recorded on an evaluation sheet and checked before the end of the assessment, and any doubts raised by participants or parents were clarified [18,19].

Data Analysis
The data were analyzed using SPSS v. 29 software (SPSS Inc., Chicago, IL, USA). The level of significance was set at p < 0.05. A normal distribution of the variables was assumed based on the results of the assumption tests, as well as on the central limit theorem (due to the large sample size; N > 30).
The intraclass correlation coefficient (ICC), calculated with a 2-way fixed-effect model, was used to evaluate the inter-rater reliability for the absolute value of the total and subscale TCMS-S scores. ICC values were interpreted as follows: excellent (ICC ≥ 0.90); good (0.90 > ICC ≥ 0.70); fair (0.70 > ICC ≥ 0.40); and poor (ICC < 0.40) [32]. The precision of the reliability results was measured by the standard error of measurement (SEM), which was calculated as the standard deviation of the difference scores divided by √2 (SEM = SD of difference scores / √2). In addition, the minimal detectable change, i.e., the change required to be 95% confident that the observed difference between 2 measurements reflects true change and not measurement error, was calculated (MDC95 = SEM × √2 × 1.96) [33].
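As an illustration of the formulas above, the following Python sketch computes the SEM and MDC95 from the paired scores of two raters and maps an ICC value onto the stated interpretation bands. It is not the SPSS procedure used in the study; the function names and the five pairs of scores are hypothetical and were chosen only to make the arithmetic easy to follow.

```python
import math
from statistics import stdev


def interpret_icc(icc):
    """Map an ICC value onto the interpretation bands given in the text."""
    if icc >= 0.90:
        return "excellent"
    if icc >= 0.70:
        return "good"
    if icc >= 0.40:
        return "fair"
    return "poor"


def sem_from_pairs(scores_a, scores_b):
    """SEM = sample SD of the paired difference scores / sqrt(2)."""
    differences = [a - b for a, b in zip(scores_a, scores_b)]
    return stdev(differences) / math.sqrt(2)


def mdc95(sem):
    """MDC95 = SEM x sqrt(2) x 1.96."""
    return sem * math.sqrt(2) * 1.96


# Hypothetical total scores from two raters for five participants
# (illustrative values only, not study data).
rater_a = [40, 35, 50, 28, 45]
rater_b = [42, 34, 47, 30, 45]

sem = sem_from_pairs(rater_a, rater_b)
mdc = mdc95(sem)

print(f"SEM   = {sem:.2f} points")
print(f"MDC95 = {mdc:.2f} points")
print("ICC 0.93 ->", interpret_icc(0.93))
print("ICC 0.72 ->", interpret_icc(0.72))
```

With these made-up scores, the differences have a standard deviation of about 2.12 points, which gives an SEM of 1.5 points and an MDC95 of about 4.2 points.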

Results
A total of 96 participants were included in the present study, of whom 44 were female, with a mean age of 12.5 ± 3.3 years (range 6-18 years). Among them, 47 participants were between 6 and 12 years old with a mean age of 9.6 ± 2.2 years, and 49 participants were between 12 and 18 years old with a mean age of 15.1 ± 1.8 years. Table 1 presents the distribution of participants by diagnoses and functional levels according to GMFCS for each age group. The means and standard deviations of the total scores obtained by novices and experts, as well as the TCMS-S subscales, were calculated and grouped based on the GMFCS level. These data are presented in Table 2. Higher total TCMS-S scores and subscale scores are observed in participants with higher functional GMFCS levels, regardless of the rater's experience level.
The scoring of the video-recorded assessments was conducted by four raters: two experts and two novices. Both expert raters were physical therapists with specific postgraduate training in pediatric physical therapy. One of the experts (ER1) had 20 years of clinical experience in pediatric physical therapy, as well as experience using the TCMS-S, and was considered the gold standard expert [34]. The other expert (ER2) had 15 years of clinical experience in pediatric physical therapy but initially was not familiar with the TCMS-S. The novice raters (N1 and N2) were physical therapists without prior experience in pediatric physical therapy or with the TCMS-S. N1 had graduated 10 years ago, while N2 was a recent graduate. The mean age of raters was 35.8 years (range 25-42 years), and 75% of them (N1, N2, and ER2) were women.
Table 3 summarizes data regarding the inter-rater reliability for the total TCMS-S score and subscales. Regardless of whether inter-rater reliability was assessed for the total scale or for the subscales, the agreement between experienced raters was excellent (ICC ≥ 0.93). Novice raters had slightly lower agreement than experienced raters, with good inter-rater reliability for the total TCMS-S score and subscales (ICC = 0.72-0.82). Accordingly, novice raters obtained higher SEM and MDC than experienced raters. However, irrespective of the rater's degree of experience, the highest SEM was found in the "Selective movement control" subscale, followed by "Static sitting balance" and "Dynamic reaching". Regarding agreement between experienced and novice raters, excellent reliability was observed for the TCMS-S total score and its subscales (ICC ≥ 0.93), except for the "Selective movement control" subscale, where the observed reliability was rated as good (ICC = 0.85). The SEM and MDC established for agreement between expert and novice raters on the TCMS-S total score were 3.1 points and 8.7 points, respectively. The highest degree of measurement error was found in the "Selective movement control" subscale (SEM = 2.3, MDC = 6.3).

Discussion
The aim of this study was to explore the influence of rater expertise on the assessment of trunk control in children with CP using the TCMS-S. The results indicated that, regardless of the level of experience, the agreement between raters was good to excellent for the total TCMS-S score and its subscales.
The distribution of participants according to GMFCS levels showed that most participants with unilateral impairment were classified as level I (63.9%), while those with diparesis were distributed mainly between levels I (36.3%) and II (54.5%). This distribution coincided with that reported in previous studies in which children with hemiparesis presented higher functional levels, and children with bilateral involvement were distributed across all GMFCS levels [3,35].
Participants with better motor abilities and walking skills showed greater trunk control. These findings are consistent with previous studies, including those by Heyrman et al. [36,37]. Additionally, we examined the score differences between expert and novice raters and found that experts assigned slightly higher scores in GMFCS levels I and II. However, these differences were not observed in GMFCS levels III and IV, and the trend was inverted for some subscales. Therefore, no definitive trend was observed.
Among the raters, although only ER1 had previous experience in the use of the TCMS-S, a comparison was conducted between two expert raters (ER1 and ER2) with more than ten years of experience in pediatric physiotherapy and two novice raters (N1 and N2) without experience in the field of pediatric physiotherapy or in the use of the scale. Such comparisons are common in similar studies [26,27,29,38]. However, to our knowledge, while intra- and inter-rater differences have been analyzed for the TCMS, there has been no research on differences among raters with varying levels of experience [12,[18][19][20][21][22]24].
With respect to the procedure for the evaluation, in the present study, TCMS-S was scored by all raters using video recordings. TCMS-S considers compensations and lower amplitude of movements to evaluate performance quality [18,36], so raters need to detect differences in the execution of the different items measured by TCMS-S to ensure consistent scoring. The use of video recordings during the assessment provides increased safety for the patient [26], as well as greater objectivity and reduced potential for facilitation or bias based on test results [38,39].
The TCMS-S total and subscales demonstrated excellent inter-rater reliability among expert raters (ICC ≥ 0.93). Similar results were observed when comparing the ratings of experts and novice raters, except for the Selective Movement Control subscale (ICC = 0.85). These findings are consistent with previous studies that have examined the reliability of different versions of the TCMS [12,18,20,22,40]. However, there were differences in the procedures of all these studies: Heyrman et al. [18], in the original version, compared scores of two trained raters who scored live but independently. Marsico et al. [12], in the German version, compared the scores of two expert raters, with more than 10 years of clinical experience, who administered the scale in real-time but who scored from the recorded videos. Heo et al., in the Korean version [24], compared results of four raters with more than 8 years of experience in pediatric physical therapy. While they reported that these four raters scored the videos, it remains unclear who administered the scale. The remaining versions [20][21][22] indicate that two raters scored the assessments, but it remains unclear whether they evaluated at two different times or during the same session, as well as their experience level. As a whole, findings suggest that all versions of the TCMS have excellent inter-rater reliability, even when comparing live versus video assessments [12,18]. However, this statement should be approached with caution due to the differences between the procedures and the lack of information on raters' experience and scoring modes. Further research comparing TCMS scores between live and video ratings could shed light on this aspect.
In the present study, while novice raters had slightly lower agreement than experienced raters, inter-rater reliability was still good for the total TCMS-S score and subscales (ICC ≥ 0.72). In line with these findings, Kuo et al. [29] found excellent reliability when comparing ratings of video-recorded assessments of the BESS scale by expert raters in pediatric physical therapy who were familiar with the BESS scale. In addition, they also found good reliability among novice raters regardless of whether they had received specific training or not. These results could be related to previous training on the scale, consisting of an online module with 4 examples of BESS scale evaluation, which included written instructions and required a 100% match in the scoring of the proposed examples [29]. In the present study, the raters received a two-hour online training session from the gold standard expert (ER1), which included scoring two video assessments using the test instructions and the opportunity to ask questions. Additionally, Franki et al. [26] found excellent novice vs. expert rater reliability when comparing one expert rater and two non-clinically experienced raters who were familiar with the GMFM-88 scale and who underwent training by scoring 5 videos with the possibility to ask questions [26]. Other clinical tools, such as the Assisting Hand Assessment (AHA) [41,42] or Prechtl's General Movements Assessment and the Hammersmith Infant Neurological Examination (HINE) [2,43], require specific and regulated training, as an accreditation certificate is mandatory for using these tools. The training of raters seems relevant, as it may influence inter-rater reliability. However, different forms of training could be used, and investigating which forms ensure greater rigor and inter-rater reliability would be of interest.
In the current research, the brevity and format of the training in the use of the TCMS-S may have limited the accuracy of novice raters, although the results do not seem to indicate this.
Furthermore, regarding the procedure of video ratings, the raters in the current study were allowed to pause and review the video as many times as necessary, similar to Franki et al. [26]. In contrast, Kuo et al. [29] and Dewar et al. [44] required videos to be viewed at regular speed and only once in order to simulate in-person assessment. These differences could explain why Franki et al. [26] reported excellent inter-rater reliability between novice and expert raters, while Kuo et al. [29] found good agreement. Dewar et al. [44] found excellent inter-rater reliability for the total score of the Balance Evaluation Systems Test (BESTest), but both raters were experts. These findings may indicate that the mode of video viewing influences inter-observer reliability, especially among novice raters [26,27]. In clinical settings where recording material or time for subsequent video analysis is not available [27], a support person may be helpful in securing the patient and scoring the scale. Indeed, the TCMS has been designed to be administered in clinical settings where a careful in-person assessment allows the identification of elements of postural control such as lack of selective control of trunk rotations or difficulty in maintaining balance during a reach. These clinical insights can be useful in setting therapeutic goals or aiding in the selection of a technical aid, but scoring can be challenging during assessment. Therefore, video recording can be a valuable tool in cases where there is no support person available during the assessment, providing an alternative option for rating [26,45].
This study found that, based on expert inter-rater analysis, the TCMS-S total score had an SEM of 2.0 points and an MDC of 5.7 points (9.8%). These values are consistent with those reported for other versions of the TCMS [12,18,22]. For the total score, Heyrman et al. [18] reported an SEM of 1.68 points and an MDC of 4.66 points (8%), while Marsico et al. [12] reported an SEM of 1.9 points and an MDC of 5.27 points (9%), and Ravizzotti et al. [22] reported an SEM of 1.72 points and an MDC of 4.83 points (8%). Our findings suggest that the MDC between expert raters using the TCMS-S is approximately six points; a change in the scale score must exceed this value to be distinguished from measurement error. However, our study did not determine the smallest clinically significant change. Therefore, further longitudinal research is necessary to determine this value [46].
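The percentages in parentheses appear to express each MDC relative to the maximum TCMS total score of 58 points. A minimal Python sketch of that arithmetic, and of the check that an observed change exceeds the MDC, could look as follows. The function names and the pair of assessment scores are hypothetical and serve only as an illustration.

```python
TCMS_MAX_TOTAL = 58  # maximum total score stated in the scale description


def mdc_percent(mdc_points, max_score=TCMS_MAX_TOTAL):
    """Express an MDC in points as a percentage of the maximum score."""
    return 100 * mdc_points / max_score


def exceeds_mdc(score_before, score_after, mdc_points):
    """True if the observed change is larger than the MDC."""
    return abs(score_after - score_before) > mdc_points


# MDC values for the total score as reported in the text.
reported_mdc = {
    "TCMS-S, expert raters": 5.7,
    "Heyrman et al. [18]": 4.66,
    "Marsico et al. [12]": 5.27,
    "Ravizzotti et al. [22]": 4.83,
}

for source, mdc_points in reported_mdc.items():
    print(f"{source}: {mdc_points} points = {mdc_percent(mdc_points):.1f}%")

# Hypothetical example: a child scores 30 and later 36 on the total scale.
print(exceeds_mdc(30, 36, 5.7))  # change of 6 points is above 5.7
print(exceeds_mdc(30, 35, 5.7))  # change of 5 points is not
```

Under this assumption, a change of 6 points between two assessments by expert raters would exceed the MDC of 5.7 points, whereas a change of 5 points would not.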
Additionally, our results show that novice raters obtained higher SEM (4.2 points) and MDC (11.6 points, 20%) values than expert raters. These findings suggest that experience plays an important role in accurately assessing the TCMS-S score, and they are consistent with results reported for other scales such as the GMFM-88 [26], the BESS [29], and the Test of Gross Motor Development [27,28]. These results highlight the importance of rater experience in obtaining the most accurate and reliable measurements in clinical trials [26,27,29,44]. Indeed, although agreement among novice raters using the TCMS-S is considered good, the increased SEM and MDC should be taken into account when the assessment is used in a clinical trial, to assess the functional level of a patient, or to make clinical decisions. It seems necessary to specify the experience and training of the rater when disseminating the results, as experience affects the reliability, precision, and sensitivity of the ratings [27,29].
Furthermore, the present study revealed that the lowest ICCs for the TCMS-S were observed in the agreement both between novice raters and between novices and experts for the Selective Movement Control subscale (ICC = 0.72 and 0.85, respectively). Moreover, when analyzing TCMS-S total scores and subscales classified by GMFCS levels, it becomes apparent that the Selective Movement Control subscale is the most challenging of the three subscales. Even at GMFCS levels I or II, which obtained the highest scores, the medians do not approach the maximum score. Both the original and German versions of the TCMS have demonstrated similar results, suggesting that this subscale poses a significant challenge for participants, even those with higher functional levels [18,19].
Moreover, the Selective Movement Control subscale has a higher SEM and MDC, even between expert raters. Minor differences, but in the same direction, were noted in previous studies on the original, German, and Tanzanian versions of the TCMS [12,18,22]. These findings could be explained by the fact that this subscale is considered the most challenging to assess due to the complexity of the movements involved, such as selective rotations of the upper and lower trunk (items 8 and 9). In addition, the rater must differentiate the quality of the movement, identifying whether there are compensations or whether the movement is of adequate amplitude. Although all TCMS versions have precise instructions for each item, the results suggest that the Selective Movement Control subscale requires further attention and even specific training to reach higher consensus when used in clinical or research settings [18,36].
Overall, the TCMS-S has demonstrated excellent reliability among expert evaluators, as well as good reliability among novice raters with minimal training, resulting in a reliable tool for assessing trunk control in the Spanish pediatric population with CP.

Conclusions
This study investigated the influence of expertise on the assessment of trunk control in children with CP using the TCMS-S scale. The results showed that regardless of the level of experience, the agreement between raters was good to excellent for the total TCMS-S score and subscales. However, novice raters had slightly lower agreement and higher SEM and MDC than experienced raters. Regarding inter-rater scoring between experts, a difference of 6 points in the TCMS-S total score can be considered a real change in the participant's trunk control. The Selective Movement Control subscale was found to be the most challenging to assess, even among expert raters. The study highlights the importance of the rater's experience in achieving the most accurate and reliable measurements and suggests that further training may be necessary to improve the inter-rater reliability of the TCMS-S, particularly for the Selective Movement Control subscale. As a whole, the TCMS-S has proven to be a reliable tool for assessing trunk control in the Spanish pediatric population with CP.

Informed Consent Statement:
Informed consent was obtained from all subjects or parents involved in the study.

Data Availability Statement:
The data associated with the paper are not publicly available but are available from the corresponding author on reasonable request. For additional information please refer to: javier.lopez3@universidadeuropea.es.