Intra- and inter-rater reliability and clinical feasibility of a simple measure of cervical movement sense in patients with neck pain

Background Pattern tracing tasks can be used to assess cervical spine movement sense (CMS). A simple clinical measure of CMS (tracing fixed figure-of-eight (F8) and zigzag (ZZ) patterns with a head-mounted laser) has been proposed and assessed in asymptomatic subjects. It is important to determine whether examiner ratings of the traces are reliable and feasible for clinical use in people with neck pain. We therefore examined the intra- and inter-rater reliability of rating video recordings of the CMS tasks, and the feasibility of undertaking the tests in the clinic by comparing slow-motion with real-time video ratings. Methods Cross-sectional study of neck pain subjects from a physiotherapy clinic. F8 and ZZ patterns traced with a head-mounted laser pointer at two velocities (accurate; accurate & fast) were videoed and later examined. Time (total time taken to complete the pattern), error frequency (number of deviations) and error magnitude (sum of deviations multiplied by their distance from the central line) were measured. Two assessors independently evaluated the laser tracing videos in slow motion; a third rated the videos in real time. Intraclass correlation coefficients (ICC) and standard errors of measurement (SEM) were calculated for intra- and inter-rater reliability, and feasibility. Results Videos of twenty neck pain patients (13 women) were assessed. Intra- and inter-rater reliability was substantial to almost-perfect (ICC 0.76–1.00; SEM < 0.01–2.50). Feasibility was moderate to almost-perfect (ICC 0.54–1.00; SEM < 0.01–2.98). Conclusions Slow-motion video ratings of time and errors for F8 and ZZ movement patterns in neck pain subjects showed high intra- and inter-rater reliability. Achieving reliable ratings in the clinic (in real time) appears feasible. Synthesising our results, the most reliable and feasible CMS ratings appear to be obtained when the subject performs the accurate rather than the accurate & fast task. The ZZ movement pattern may be superior to F8 in terms of rating.
Rating time and error frequency while subjects trace F8 and ZZ patterns as accurately as possible appears promising for assessing CMS in the clinic. Future research directions were identified.


Background
Neck pain is a common musculoskeletal disorder with a global prevalence of around 5% (women 5.8%, men 4.0%) [1]. It is a disabling condition with one of the highest socioeconomic burdens globally, and this burden is forecast to escalate with the world's ageing population [2]. Neck pain is categorised either as pain secondary to an identifiable pathology, such as cervical myelopathy, neoplastic conditions, upper cervical ligamentous instability, vertebral artery insufficiency or inflammatory/systemic disease [3], or as non-specific neck pain, whose causation is poorly understood and into which the majority of sufferers fall. There is a mounting need to better understand the important factors influencing non-specific neck pain (hereafter referred to as neck pain).
Cervical movement sense (CMS) is defined as the ability to smoothly and accurately move the head/neck to a given pattern [16]. To date, several methods have been used to assess CMS, but all use head-mounted motion sensors and dedicated software to track, measure and calculate head motion accuracy; these methods have all shown reduced movement accuracy in neck pain subjects [16][17][18][19][20]. The most studied measure, called the "Fly", is purported to be the best test to differentiate asymptomatic from neck pain subjects and, further, to distinguish between neck pain subgroups such as whiplash associated disorder (WAD) and non-specific neck pain [16,20]. However, these tests require equipment that is generally cost-prohibitive for clinical practice. Consequently, a cost-effective and simple alternative for clinical use has been proposed by Pereira et al. [21], based on a preliminary study of asymptomatic subjects. Because the tasks the subject is asked to perform are similar to those in previous work [19,22], the primary difference lies in how that performance is analysed. It is therefore important to establish whether clinicians can reliably assess CMS (considering pattern and task type) using this simplified method of analysis, and to explore the feasibility of using these tests in real time in the clinic, by assessing subjects with neck pain. Thus, the aim of this study was to determine the inter- and intra-rater reliability of rating videos in slow motion, and the feasibility of rating the videos in real time. The influence of pattern shape (F8 and ZZ) and task type (accurate or accurate & fast) was considered.

Methods
This observational, cross-sectional study consecutively recruited consenting neck pain subjects (non-specific or whiplash associated disorder (WAD)) attending the physiotherapy department of the Canton Hospital Schaffhausen, Switzerland, from April to October 2017. The clinic receives patients on referral from physicians inside and outside the hospital. Additionally, advertisements were e-mailed to employees of all hospital departments. The ethics committee of the Canton of Zurich approved the study, and all patients signed informed consent prior to participation.
Included were adults of either gender, aged 18 years or older, with a Neck Disability Index (NDI) score [23][24][25] of at least five points (10%). Subjects had to have WAD grade II (according to the Quebec Task Force classification [26]) or non-specific neck pain for at least 3 months, be unfamiliar with movement sense tracking, and be able to read and communicate in German.
Excluded were subjects with specific neck pain conditions such as fractures, osteoporosis, myelopathy, nerve root entrapment, or WAD grade III or higher; disorders of the ear, nose or throat resulting in vertigo or dizziness, such as sudden hearing loss, Meniere's disease or tinnitus; systemic diseases associated with neck pain, such as diabetes and rheumatoid arthritis; neurological diseases affecting the cervical spine musculature, such as multiple sclerosis or stroke; manual treatment of the cervical spine within 3 days prior to the measurements; and medication with the potential to affect perception, such as naproxen or opioids (e.g. tramadol).
Testing procedure for video capture of CMS
Movement tests were undertaken in random order. The subject sat on a chair (with backrest) positioned 1 metre from a vertical wall to which the test patterns were fixed. Patterns were printed on A3 paper, with a 5 mm thick black band (F8) or a 10 mm thick green band (ZZ) representing the central (main) pattern. The F8 pattern was 13 cm high and 34.5 cm wide, with a total inner zone length of 94 cm. The ZZ pattern was 13 cm high and 23.4 cm wide, with 23.4 cm long horizontal lines, 26.6 cm long diagonal lines and a total inner zone length of 100 cm. Both patterns had five thinner lines on each side of the main band, spaced 5 mm apart, delimiting five zones of deviation. With a laser pointer affixed to their forehead, subjects were instructed to follow the band of each pattern "as accurately as possible" or "as accurately and fast as possible", in both directions (clockwise and counter-clockwise), starting from the centre of the pattern. Subjects were allowed to practise each task once. For all tests, the laser point tracing of the pattern was videoed using a webcam (Microsoft LifeCam Studio 1080p HD Sensor) positioned 0.5 m in front of the patient (see Fig. 1). Video files were saved on a Windows laptop. A pattern was considered completed when the subject returned to the central starting position.

Evaluation of video capture of CMS tests by blinded raters
Video files were evaluated independently by two raters (R1 and R2) in slow motion at 1/8 of normal speed using SMPlayer (https://www.smplayer.info). All subjects were rated, and the results were compared to determine inter-rater reliability. To determine intra-rater reliability, all videos from three randomly selected subjects were re-evaluated 4 weeks later by each rater, blind to their initial results. To reduce work-up bias, raters were blinded to other subject characteristics. Before the study, raters had been trained to count error frequency by zone using twelve test videos. To determine feasibility, a third rater (R3; IMW) with similar pre-study training recorded the time per subject during testing in the clinic and determined error frequency by replaying each video in real time directly after recording.

Outcome measures
Time, error frequency and error magnitude while tracing the F8 and ZZ patterns were used to determine intra- and inter-rater reliability and feasibility. Time was defined as the time taken to trace the pattern once, in either the clockwise or counter-clockwise direction, starting from and stopping again at the centre of the pattern. Error frequency was the number of errors occurring in each pattern tracing, an error being defined as the laser point leaving the inner zone of the pattern (F8 = 5 mm; ZZ = 10 mm). Error magnitude was additionally assessed as a composite error score: the sum, over zones, of the number of errors in each zone multiplied by the zone number (maximum of five). For example, the number of errors occurring in zone 1 was multiplied by one, errors in zone 2 by two, and so on. In addition, age, duration of pain and dizziness, current pain and dizziness (each rated separately on a visual analogue scale (VAS) [27]), traumatic/non-traumatic injury, current medication, the German version of the NDI (NDI-G) and the German version of the Dizziness Handicap Inventory (DHI-G) [28] were recorded.
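To make the scoring concrete, the following is a minimal sketch of how error frequency and the composite error magnitude could be computed for one tracing. It assumes that each error is logged with a single zone number (for instance, the farthest zone reached during that excursion) and that zones are counted outward from the edge of the main band in 5 mm steps; both are our reading of the set-up rather than a documented protocol, and `zone_of` and `error_scores` are hypothetical names.

```python
import math
from collections import Counter

def zone_of(deviation_mm, zone_width_mm=5.0, max_zone=5):
    """Hypothetical helper: map how far (in mm) the laser dot strayed beyond
    the edge of the main band to a zone number 1..5; 0 means no error.
    Assumes zones are 5 mm wide and counted outward from the band edge."""
    if deviation_mm <= 0:
        return 0  # dot stayed on the main band: not counted as an error
    return min(max_zone, math.ceil(deviation_mm / zone_width_mm))

def error_scores(error_zones):
    """Return (error frequency, error magnitude) for one pattern tracing.

    error_zones: one zone number (1..5) per error (assumed logging scheme).
    """
    counts = Counter(error_zones)
    frequency = sum(counts.values())                          # number of errors
    magnitude = sum(zone * n for zone, n in counts.items())   # sum of count x zone
    return frequency, magnitude

# Illustrative values only: three errors in zone 1, one in zone 2, one in zone 3
print(error_scores([1, 1, 2, 3, 1]))  # (5, 8)
```

Counting errors per zone first and then weighting by the zone number mirrors the verbal definition: three zone-1 errors contribute 3, one zone-2 error contributes 2 and one zone-3 error contributes 3, giving a magnitude of 8.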
Interpreting NDI-G and DHI-G: Although no benchmarks have been defined for the NDI-G, recommendations interpret 0–4 points as no disability, 5–14 points as mild disability, 15–24 points as moderate disability, 25–34 points as severe disability, and 35–50 points as complete disability [23,24]. The DHI-G is a reliable German version of the DHI used to assess disability in patients suffering from dizziness [28]. Tesio et al. [29] developed a short form of the English DHI, on which a score of 13 represents no disability and zero indicates complete disability secondary to dizziness. As no validated German short form was available, the DHI-G items equivalent to those of the English short form were selected to represent a German DHI short form.
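The NDI bands above, together with the study's inclusion threshold of at least five points (10%), can be written as a simple lookup. This is a sketch assuming integer NDI totals on the 0–50 point scale, where each point corresponds to 2%; the function names are hypothetical.

```python
def ndi_points_to_percent(points):
    """NDI total (0-50 points) expressed as a percentage (1 point = 2%)."""
    return points * 2

def ndi_category(points):
    """Disability band for an (assumed integer) NDI total, per the bands cited above."""
    if not 0 <= points <= 50:
        raise ValueError("NDI total must lie between 0 and 50 points")
    if points <= 4:
        return "no disability"
    if points <= 14:
        return "mild disability"
    if points <= 24:
        return "moderate disability"
    if points <= 34:
        return "severe disability"
    return "complete disability"

def meets_ndi_inclusion(points, threshold=5):
    """Inclusion criterion used in this study: at least 5 points (10%)."""
    return points >= threshold

print(ndi_category(12), ndi_points_to_percent(5))  # mild disability 10
```

Note that the inclusion threshold coincides with the lower bound of the "mild disability" band, so every included subject had at least mild neck-related disability.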

Data processing and analysis
Outcome variables were initially tested for any directional effects (clockwise/counter-clockwise) using paired Wilcoxon signed-rank tests. As no directional effects were found, results of both directions were combined for analyses.
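The directional check can be illustrated with a paired Wilcoxon signed-rank test. The sketch below is a plain-Python version using the normal approximation with a tie correction and discarding zero differences; the study does not report which variant it used, and for samples as small as ours an exact p-value (available in library routines such as SciPy's `scipy.stats.wilcoxon`) would usually be preferred. The function name and the one-value-per-direction-per-subject layout are assumptions.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Paired, two-sided Wilcoxon signed-rank test (normal approximation).

    x, y: paired values, e.g. clockwise and counter-clockwise scores per
    subject (assumed layout). Zero differences are dropped and tied
    |differences| receive average ranks. Returns (W+, W-, z, p).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, averaging ranks across ties
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48  # tie-corrected variance
    z = (min(w_plus, w_minus) - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided: 2 * (1 - Phi(|z|))
    return w_plus, w_minus, z, p

# Illustrative values only (not study data)
cw = [10, 12, 11, 14, 13]
ccw = [9, 13, 11, 12, 10]
w_plus, w_minus, z, p = wilcoxon_signed_rank(cw, ccw)
print(w_plus, w_minus)  # 8.5 1.5
```

A non-significant result for every outcome would then justify pooling the clockwise and counter-clockwise tracings, as done here.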
Four variables were recorded for each of time, error frequency and error magnitude: two patterns (F8, ZZ) × two movement velocities (accurate, accurate & fast). The intraclass correlation coefficient (ICC) for agreement was used to determine intra- and inter-rater reliability. Both velocities (accurate and accurate & fast) were combined for intra-rater reliability, resulting in 12 observations (3 subjects × 2 ratings × 2 patterns) for each rater and outcome variable. Inter-rater reliability was based on 160 observations (20 subjects × 2 ratings × 2 patterns × 2 velocities) for each outcome variable. The standard error of measurement (SEM), a measure of absolute reliability in the unit of the test, was computed as SEM = SD × √(1 − ICC) [30,31]. ICC values were interpreted as moderate (0.40 to 0.59), substantial (0.60 to 0.79) or almost-perfect (0.80 or more) [31,32].
Fig. 1 Test set-up. Subject sitting on a chair with a laser pointer on her head, 100 cm from the ZZ pattern. Laptop connected to a webcam 50 cm from the centre of the pattern.
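The reliability statistics can be sketched in a few lines. The text specifies only an "ICC for agreement", so the version below assumes the two-way, absolute-agreement, single-measures form, ICC(A,1) (numerically equal to Shrout and Fleiss' ICC(2,1)), computed from ANOVA mean squares. It also assumes that the SD in the SEM formula is the SD of all pooled scores, and that data are laid out with one row per observation and one column per rater (or per rating session for intra-rater reliability). Function names are hypothetical.

```python
import math

def icc_agreement(scores):
    """ICC(A,1): two-way model, absolute agreement, single measures.

    scores: list of rows, one per observation, each row holding one value
    per rater (or per rating session for intra-rater reliability).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between observations
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

def sem(scores, icc):
    """SEM = SD * sqrt(1 - ICC); SD taken over all pooled scores (assumption)."""
    values = [x for row in scores for x in row]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))
    return sd * math.sqrt(max(0.0, 1 - icc))  # clamp guards against rounding above 1

def interpret(icc):
    """Verbal labels used in this study."""
    if icc >= 0.8:
        return "almost-perfect"
    if icc >= 0.6:
        return "substantial"
    if icc >= 0.4:
        return "moderate"
    return "below moderate"

# Shrout & Fleiss (1979) textbook example: 6 targets rated by 4 judges
example = [[9, 2, 5, 8], [6, 1, 3, 2], [8, 4, 6, 8],
           [7, 1, 2, 6], [10, 5, 6, 9], [6, 2, 4, 7]]
icc = icc_agreement(example)
print(round(icc, 2), interpret(icc))  # 0.29 below moderate
```

The absolute-agreement form penalises systematic differences between raters (the column mean square), which matters here because one rater consistently counting more errors than another would otherwise go unnoticed by a consistency-type ICC.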
To examine feasibility, R3's real-time ratings of time and error frequency were compared with the final slow-motion video ratings of each of the two video raters using the ICC for agreement and the SEM [30]. Error magnitude was not considered feasible to rate in real time and was therefore omitted from the feasibility analysis.

Results
Twenty-seven subjects were recruited, of whom seven were excluded: tinnitus (n = 2), NDI score < 5 points (n = 2), type II diabetes (n = 1), inability to communicate in German (n = 1) and unwillingness to participate (n = 1). The remaining 20 subjects were analysed. Demographic data are shown in Table 1.

Interrater reliability
Inter-rater reliability for time was perfect for both patterns and velocities (ICC 1.0; SEMs < 0.01 to 0.05). For error frequency it was substantial to almost-perfect, with ICCs of 0.76 to 0.91 (SEMs 0.47 to 1.74) for F8 and 0.80 to 0.84 (SEMs 0.48 to 0.78) for ZZ. Similar values were seen for error magnitude (Table 3).

Feasibility
Agreement between the real-time ratings and each of the two slow-motion video ratings was almost-perfect for time, with ICCs of 0.99 to 1.0 (SEMs < 0.01 to 0.05) for both patterns and velocities. For error frequency, agreement was moderate to almost-perfect: ICCs were highest and SEMs lowest for ZZ at accurate velocity, agreement was lowest for ZZ at accurate & fast velocity, and SEMs were largest for F8 at accurate velocity. Overall, the real-time ratings of R3 agreed better with the slow-motion ratings of R1 than with those of R2.

Discussion
This study demonstrated promising intra- and inter-rater reliability and clinical feasibility for assessing the performance of the F8 and ZZ cervical movement sense tests by people with neck pain. Overall, the combined results, considering intra- and inter-rater reliability and feasibility, suggest that the time taken and the frequency of errors during the accurate task, particularly using the ZZ pattern, have the most potential for clinical use.
Our study showed the best reliability (both intra- and inter-rater) and feasibility for rating the time subjects needed to perform the tasks. Almost-perfect intra-rater and substantial to almost-perfect inter-rater reliability was demonstrated for error frequency and error magnitude. Ratings of the ZZ pattern were slightly more reliable than those of the F8 pattern (higher ICCs and lower SEMs). Further, error magnitude was not considered feasible to rate in real time, which may indicate that time and error frequency are the most useful measures in the clinical situation. Encouragingly, similar inter-rater reliability for error frequency (ICC = 0.93) was shown in the Australian study of asymptomatic controls, who overall made fewer errors on average than the neck pain subjects in the current study [21]. Furthermore, the intra-rater reliability shown in our study compares well with the high values reported for rating similar test procedures, such as joint position error (JPE) measurements [36,37]. In a study requiring head repositioning to a neutral and to a target head position after neck rotation or flexion/extension, ICCs and SEMs similar to ours were reported (intra: ICC 0.70–0.83, SEM 1.45–2.45; inter: ICC 0.62–0.84, SEM 1.50–2.23) [36]. Juul et al. [37] reported lower ICCs but better SEMs when examining the reliability of rating JPE on returning to a neutral head position from rotation, extension and flexion (intra: ICC 0.48–0.82, SEM 0.19–0.26; inter: ICC 0.50–0.75, SEM 0.20–0.50). Within this context, the almost-perfect intra-rater and substantial to almost-perfect inter-rater reliability of the slow-motion video ratings of error frequency and magnitude in the current study appear excellent.
Reliable real-time rating in the clinic is essential, given the complexity and inefficiency of videoing patients and rating them later. The feasibility of error counting during F8 tracing was similar for both velocities; however, the accurate velocity showed larger SEMs. This may relate to the total number of errors, which was more than double for F8 compared with ZZ tracing at accurate velocity, while the time needed to trace each pattern increased correspondingly. The narrower central line of the F8 pattern may have contributed to the greater number of errors, whereas the ZZ accurate task seemed easier for our raters to follow yet remained challenging enough for the patients. Despite better inter-rater reliability, accurate & fast ZZ tracing appeared less feasible to assess in real time, with ICCs for error frequency of 0.54 and 0.56 (Table 4). The corresponding SEMs of 1.42 and 1.71 (Table 4), relative to a range of eleven (Table 1), i.e. roughly 13% and 16% of that range, also support this. Thus, considering all of the results, evaluating error frequency and time for the ZZ pattern traced at accurate velocity appears to be the most promising task for application in clinical practice.
Future directions with respect to the test-retest reliability of subjects' performance and the validity of the measures can now be explored [31,38]. Comparison of our results with those reported for asymptomatic controls by Pereira et al. suggests similar times to trace each pattern at each velocity, but lower error frequency and magnitude values than those found in our neck pain group [21].

Limitations of the study
There were limitations to our study that should be considered when interpreting our results. The line thicknesses of the F8 and ZZ patterns were not equal, which may have influenced subjects' performance and the reliability of ratings. Perhaps accordingly, our neck pain patients made more errors and needed longer for the F8 (5 mm) than for the ZZ pattern (10 mm). In addition, feasibility testing may have been subject to expectation bias in R3 when reconciling disagreements between R1 and R2; however, if present, its influence is likely to have been low, as only 25% of observations disagreed, there were 3–5 weeks between ratings, and R3 was blind to her real-time ratings of those subjects. Finally, the aim of our study was to determine the intra- and inter-rater reliability and feasibility of assessing patients performing the tasks. A necessary next step will be to compare responses between neck pain and asymptomatic control subjects and to examine the test-retest reliability of subjects' performance, which may influence the responsiveness of the measure and the future use of these assessments [20,39].

Conclusions
Rating the time taken and the number of errors during tasks designed to assess cervical movement sense is reliable (intra- and inter-rater) and seems feasible for use in clinical practice. Slow-motion video rating of time, error frequency and error magnitude while participants trace an F8 or ZZ pattern with a head-mounted laser is reliable. Real-time rating of time and error frequency for an accurately traced ZZ pattern seems most feasible for clinical practice. These results support future research to determine whether these simple movement sense tests can meaningfully distinguish people with neck pain from those without, and between sub-groups of this prevalent musculoskeletal condition. A further direction is to determine test validity and within-subject test-retest repeatability.