Movement Control Impairment and Low Back Pain: State of the Art of Diagnostic Framing

Background and objectives: Low back pain is one of the most common health problems. In 85% of cases, it is not possible to identify a specific cause, and it is therefore called Non-Specific Low Back Pain (NSLBP). Among the various attempted classifications, the subgroup of patients with impairment of motor control of the lower back (MCI) is among the most studied. The objective of this systematic review is to summarize the results from trials on the validity and reliability of clinical tests aimed at identifying MCI in the NSLBP population. Materials and Methods: The MEDLINE, Cochrane Library, and MedNar databases were searched up to May 2018. The inclusion criteria were clinical trials on evaluation methods that are affordable and applicable in a usual clinical setting, conducted on populations aged > 18 years. A single author summarized data in synoptic tables relating to the clinical property; a second reviewer intervened in case of doubts about the relevance of the studies. Results: 13 primary studies met the inclusion criteria: 10 investigated inter-rater reliability, 4 investigated intra-rater reliability, and 6 investigated validity, for a total of 23 tests (including one cluster of tests). Inter-rater reliability is widely studied, and there are tests with good, consistent, and substantial values (waiter’s bow, prone hip extension, sitting knee extension, and one leg stance). Intra-rater reliability has been less investigated, and no test has been studied by more than one author. The few studies on validity aim only to discriminate the presence or absence of LBP in the samples. Conclusions: At the state of the art, results related to reliability support the clinical use of the identified tests. No conclusions can be drawn about validity.


Introduction
Low Back Pain (LBP) is one of the most frequent health problems causing absenteeism and disability, and it is the most expensive diagnosis in the Western World [1][2][3]. LBP is defined as pain "strong enough to limit normal activities for more than one day" [4] in the lower part of the column, between the 12th thoracic vertebra and the 1st sacral, with possible projection to the lower limb [5].
Temporal staging defines LBP as acute when an episode occurred not more than 6 weeks previously, subacute between 6 and 12 weeks, and chronic beyond 3 months [6].
Smoking and obesity are significantly associated with the development of LBP [7], while sedentary lifestyle, low aerobic capacity [8], and psychological factors related to personal or professional discomfort [9] have been indicated as highly related.
Patients with LBP generally improve in the first 6 weeks after an acute episode [10], but approximately 70% of patients show a recurrence in the following year [11,12] while 40% develop chronic LBP [13].

Diagnostic Values
Properties of tests taken into consideration in the synthesis are validity and reliability, described through the typical coefficients of biomedical statistics.
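As an illustration of one such coefficient, the sketch below computes Cohen's kappa (K) for two raters who score the same patients. The function name and the ratings are hypothetical and are not taken from any included study; it is a minimal sketch of the standard unweighted formula, K = (p_o - p_e) / (1 - p_e).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of subjects given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over labels.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / n**2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings of one test on 10 patients (1 = positive, 0 = negative).
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))
```

In this made-up example the raters agree on 8 of 10 patients, while agreement expected by chance is 0.5, so kappa is about 0.6.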

Data Sources and Search
A systematic search was conducted on the Medline, Cochrane Library, and MedNar (grey or unpublished literature) databases without time filters. The selection of articles can be considered updated to 13 May 2018. Table 1 summarizes the strategy used.

Study Selection
The studies obtained were initially recorded in a comprehensive database, and duplicate records were excluded. A single reviewer performed the first screening by reading the titles and abstracts. Relevance was then assessed by reading the full text: any doubts were resolved with the intervention of a second reviewer. The inclusion process is summarized graphically in a flowchart in the results section (Figure 1). Hand searching was conducted by checking the bibliographies of the included articles.

Table 1. Search strategy used for every database.

Database | Search Strategy
MEDLINE-Clinical queries | Low Back Pain AND motor control (Impairment AND (motor control OR movement OR movement control OR movement coordination OR movement system OR muscle control OR trunk motor control)) OR (Dysfunction AND (movement control OR movement OR stability)) OR (deficit AND (movement precision OR trunk muscle timing OR trunk movement control)) OR MCI OR altered sensory function OR segmental instability) AND (Low Back Pain OR LBP OR non-specific low back pain OR NSLBP)
Cochrane Library-Simple Search | Low Back Pain AND motor control
MedNar-Simple Search | Low back pain AND motor control

Data Extraction and Synthesis
The relevant data were organized in synoptic tables (Tables A1-A4), which show the author and year of publication, the objectives of the study, the characteristics of the participants (number, sex, age, and condition), the characteristics of the examiners, the diagnostic test/examination and the procedure followed, the statistical values, and the main results. No meta-analysis of the collected data was performed; instead, a narrative synthesis in accordance with the emerging evidence was carried out.

Risk of Bias Assessment
The quality of each study was assessed for methodological rigor and risk of bias by one reviewer using the tool described by Brink and Louw [27] (Table A5), which was developed for the analysis of validity and reliability studies. Doubts were resolved with the help of a second reviewer. This appraisal tool does not incorporate a quality score; instead, the impact of each item on the study design should be considered individually. The tool contains 13 items, which should be considered according to the nature of the study: 4 apply only to reliability studies, 4 apply only to validity studies, and 5 apply to both. The results were summarized in a synoptic table (Table A1), and a critical discussion of the strengths and weaknesses of the included studies was drafted.

Results
The database search identified 1203 articles, while 8 others were identified by hand searching the bibliographies of relevant studies, for a total of 1211 articles; 180 articles were removed as duplicates, leaving 1031 articles. Following the reading of the title, 386 articles were discarded; following the reading of the abstract, a further 548 were discarded, leaving 97 for full-text assessment. Following the reading of the full text, 13 studies were included in the review and 84 studies were excluded as not relevant. The steps related to the selection of articles are outlined in the flow diagram below (Figure 1). Of the 13 studies included, 10 investigated inter-rater reliability [28][29][30][31][32][33][34][35][36][37], 4 investigated intra-rater reliability [31][32][33]38], and only 6 studies analyzed validity [28,29,31,32,39,40]. Overall, the tests showed reliability ranging from fair to excellent (K value between 0.32 and 1.00) for the inter-rater and from moderate to excellent for the intra-rater (K value from 0.42 to 1.00). The ICC also varied from 0.41 to 0.98, indicating a range from poor to very good (Table A1). A meta-analysis of the collected data was not conducted because of the small number of studies that investigated the same test. In addition, the highly heterogeneous nature of the descriptions and the small samples make the calculation superfluous.
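The selection counts can be reconciled with simple arithmetic. The sketch below uses only figures reported in this section; the variable names are illustrative, and the abstract-stage figure is derived under the assumption that the 13 included and 84 excluded studies account for every full text that was read.

```python
# Counts reported in the Results text.
from_databases = 1203
from_hand_search = 8
duplicates = 180
excluded_on_title = 386
included = 13
excluded_on_full_text = 84

identified = from_databases + from_hand_search
after_deduplication = identified - duplicates
after_title_screening = after_deduplication - excluded_on_title

# Assumption: every full text read was either included or excluded as not relevant.
full_texts_assessed = included + excluded_on_full_text

# Derived: records that must have dropped out at the abstract stage
# for the reported totals to reconcile.
excluded_on_abstract = after_title_screening - full_texts_assessed

print(identified, after_deduplication, after_title_screening)
print(full_texts_assessed, excluded_on_abstract)
```

Under that assumption, 645 records survive title screening, 548 drop out at the abstract stage, and 97 reach full-text assessment.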

Tests Described by More Than One Study that Did Not Give Consistent Results
Bent knee fall out was investigated in 3 studies on a total of 132 patients. It was identified as having modest reliability by Luomajoki et al. [33] (K = 0.38) and as poor-excellent by Roussel et al. [36] (ICC = 0.61-0.91) and Enoch et al. [30] (ICC = 0.94). Active straight leg raising has been described in 3 studies on 158 total subjects. Roussel et al. [35] and Bruno et al. [29] showed good reliability (K from 0.70 for the left leg to 0.71 for the right leg in the first study and 0.79 in the second). By contrast, Roussel et al., in their 2009 study [36], reported more variable values, with an ICC ranging from poor to excellent (ICC = 0.41-0.91).

Tests Described by More Than One Study that Showed Agreement between the Results
Substantial reproducibility was found for both waiter's bow (investigated in 2 studies [33,36], 92 subjects, K = 0.62 and 0.78) and prone hip extension (investigated in 2 studies [29,34], with 112 total subjects, K = 0.72-0.76). Sitting knee extension was analyzed in 2 studies for a total of 80 subjects. It showed a good K in the study by Luomajoki et al. [33] (K = 0.72) and an excellent ICC in the study by Enoch et al. [30] (ICC = 0.95). The one leg stance was described in 3 studies for a total of 95 participants. Luomajoki et al. [33] identified a moderate-good reliability (K = 0.43-0.65), while both Roussel et al. [35] and Tidstrand and Horneij [37] obtained good-excellent values (K from 0.75 to 1.00).

Tests Described by a Single Study
Excellent reliability has been identified for joint position sense [30], sitting forward lean [30], and leg lowering [30]. Substantial reliability was identified for pelvic tilt [33], rocking pelvis forwards [33], standing back extension test [31], static lunge test [32], and dynamic lunge test [32]. Moderate reliability was identified for knee lift abdominal test [36], rocking pelvis backwards [33], prone active knee flexion [33], and standing knee-lift test [32]. The unilateral pelvic lift showed moderate reliability for the left side (K = 0.47) and substantial for the right side (K = 0.61) [37]. The sitting-on-a-ball test, on the other hand, was substantial for the right (K = 0.79) but excellent for the left (K = 0.88) [37]. The trunk forward bending and return to upright test, described by Biely et al. [28], showed K values from 0.35 to 0.89, depending on the criterion used to define the positivity of the test. Also, static lunge test [32], dynamic lunge test [32], and standing knee-lift test [32] showed different reliability values depending on each component observed during the execution of the test.
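The qualitative labels attached to K values can be made explicit with a small lookup. The sketch below assumes the widely used Landis and Koch bands (slight, fair, moderate, substantial, almost perfect); the review does not state which scale it applied, and its "excellent" appears to correspond to the top band, so the mapping is an illustration and not the authors' stated rule. The function name is hypothetical.

```python
def kappa_band(kappa):
    """Map a kappa value to a Landis and Koch style label (assumed scale)."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


# K values quoted in the text for two single-study tests.
for name, k in [
    ("unilateral pelvic lift, left", 0.47),
    ("unilateral pelvic lift, right", 0.61),
    ("sitting-on-a-ball, right", 0.79),
    ("sitting-on-a-ball, left", 0.88),
]:
    print(f"{name}: K = {k} -> {kappa_band(k)}")
```

With these bands, 0.47 falls in the moderate range, 0.61 and 0.79 in the substantial range, and 0.88 in the top band, which matches the wording used for these tests.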

Intra-Rater Reliability
A total of 13 tests were investigated for intra-examiner reliability (Table A3), all by a single author. Waiter's bow, pelvic tilt, one leg stance, sitting knee extension, rocking backwards, rocking forwards, prone active knee flexion, and crook lying hip abduction were investigated by Luomajoki et al. [ [32] and showed good to poor reliability (ICC from 0.54 to 0.87). In the same study, the intra-examiner reliability of different aberrant movements analyzed during the execution of the above 3 tests was also investigated, and in this case, an extreme variability in the results also emerged (K from 0.42 to 1.00).

Validity
A total of 10 tests (including batteries) have been reported with data on their validity; they are presented in Table A4, and each was investigated by a single author. The battery of Luomajoki et al. [39], the knee-lift abdominal test, the bent knee fall out, the prone hip extension, and the active straight leg raise showed significant relationships between test positivity and the presence of LBP compared to healthy subjects (all p < 0.05). The use of judder/shake/instability catch (JUD), deviation from sagittal plane (DEV), and aberrant movement score (AMS) as positive criteria in anterior trunk flexion movement and return to upright position also showed significant correlations with the presence of LBP. By contrast, for the standing back extension test, standing knee-lift test, static lunge test, and dynamic lunge test, the values of diagnostic power were not sufficiently high (AUC from 0.47 to 0.78).
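The accuracy measures mentioned here (sensitivity, specificity, and AUC) can be illustrated with a short sketch. The scores below are hypothetical counts of positive tests and do not come from any included study; the rule "positive when the score is at or above the threshold" and the function names are likewise assumptions made for the example. The AUC is computed as the probability that a randomly chosen LBP subject scores higher than a randomly chosen healthy subject, with ties counted as one half.

```python
def sensitivity_specificity(scores_cases, scores_controls, threshold):
    """Sn and Sp when a score >= threshold is called positive (assumed rule)."""
    true_positives = sum(s >= threshold for s in scores_cases)
    true_negatives = sum(s < threshold for s in scores_controls)
    return true_positives / len(scores_cases), true_negatives / len(scores_controls)


def auc(scores_cases, scores_controls):
    """Area under the ROC curve via pairwise comparison of cases and controls."""
    wins = 0.0
    for case in scores_cases:
        for control in scores_controls:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half
    return wins / (len(scores_cases) * len(scores_controls))


# Hypothetical number of positive movement control tests per subject.
lbp = [3, 2, 4, 1, 3]
healthy = [0, 1, 2, 0, 1]

sn, sp = sensitivity_specificity(lbp, healthy, threshold=2)
area = auc(lbp, healthy)
print(sn, sp, area)
```

An AUC of 0.5 corresponds to chance-level discrimination between the two groups, which is why values close to 0.5 carry little diagnostic information.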

Risk of Bias in Included Studies
All included studies reported a complete description of the selected sample (Table 2). In several [29,34,37,38], however, no method for calculating the sample size was reported, so the statistical power of the results obtained is not known with certainty. The presence of an adequate method for calculating the sample size is not described as a parameter to be evaluated in criterion 1, and for this reason, the criterion was considered satisfied in all the studies. Three studies [36,38,40] did not clarify the characteristics of the evaluators. The main source of risk of bias in 8 out of 11 studies dealing with reproducibility was the simultaneous evaluation by the observers [34,37]. Three studies did not clarify or carry out the randomization of the order of the patients evaluated [33,36,39]. Four did not randomize the order of the tests administered [30,33,37,39], and one did not clarify it [36]. In addition, in 2 studies, the blinding of the evaluators to each other's results was not clearly explained [36,37]. In studies dealing with intra-operator reproducibility, the concealment of patients or an adequate time gap between the two observations was adopted, except in 1 study [38], where the assessments were repeated within a matter of minutes. In the studies that dealt with validity, criteria 3, 7, 9, and 11 were either not met or could not be judged because there is no shared reference standard in the literature. Analyses of diagnostic accuracy were developed with respect to the presence or absence of LBP. Only 1 study [31] gave a description of the reference standard used, but in our opinion, the choice was not appropriate. The choice of statistical methods was considered appropriate for all studies; only 1 study [34] introduced a possible distortion of the results because it summarized data of a nonparametric nature using the standard deviation.

1. Human subjects and detailed description of the sample (validity and reliability studies)
2. Qualification or competence of rater/s clarified (validity and reliability studies)
4. Blinding of raters to the findings of other raters (inter-rater reliability studies)
5. Blinding of raters to their own prior findings (intra-rater reliability studies)
7. Latency between application of reference and index test reasonably (validity studies)
12. Explanation of the withdrawals (validity and reliability studies)
13. Appropriateness of statistical methods (validity and reliability studies)

Discussion
This review is the first to include and summarize results from reliability (inter-and intra-rater) and validity studies of tests designed to detect MCI in subjects with NSLBP.
In 2013, Carlsson and Rasmussen-Barr [23] studied the reliability of tests to diagnose MCI and found it difficult to identify consistent results because the tests had been investigated in studies with a high risk of bias. At the time (with a search updated to October 2011), only prone knee bend and the one leg stance were indicated by the authors as useful because they were presented in one study with a low risk of bias. More recently, Denteneer et al. [24] identified a greater number of tests (specifically 30), but the limitation of their research lay in the inclusion criteria: the studies included populations classified with functional lumbar instability, with MCI, or with both. This leads to sampling limits that make the results difficult to interpret and compare.
In the present research, 15 tests have shown good inter-examiner reliability in at least one study, but only waiter's bow, one leg stance, sitting knee extension, and prone hip extension had almost overlapping values in at least 2 studies.
As is well known, inter-rater reliability is just one component of the reliability of a test, and it takes on greater importance when the context of use is characterized by the alternation of operators. The NSLBP rehabilitation process is in most cases managed by a single therapist; nevertheless, far fewer studies have dealt with intra-rater reliability than with inter-rater reliability.
From the few data available, there seems to be a good degree of agreement in the case of repeated measurements by the same therapist for almost all tests. The summary of results about intra- and inter-rater reliability shows that observing abnormal movement strategies in patients with NSLBP seems to be possible through simple tests; however, positivity criteria and execution modalities need to be standardized with precise protocols, as suggested by Enoch et al. [30].
The clinical use of the tests has to be based on consistent evidence for both intra- and inter-rater reliability, and these conclusions must derive from at least 2 studies of good quality.
Compared with the findings of Carlsson and Rasmussen-Barr [23], we can still recommend the use of the one leg stance, but we also add waiter's bow and sitting knee extension because of the low risk of bias of the studies. These 3 tests are the only ones to have been studied for both inter-rater and intra-rater reliability. The use of prone knee bend suggested by Carlsson and Rasmussen-Barr [23] is less well corroborated because, to date, it has been investigated by only one author and its inter-rater reliability values are moderate. Prone hip extension cannot be recommended because of the high risk of bias in one of the two studies in which it was investigated. Moreover, no studies are available on intra-rater reliability for prone hip extension.
Since 2011, the literature has not added much to previous knowledge, because of both the small number of studies published and their quality.
As with intra-rater reliability, the studies that have dealt with validity are few in number. Not a single test has been evaluated by more than one author. The studies included in this review show that most tests are able to distinguish only subjects with LBP from healthy subjects (knee-lift abdominal test, bent knee fall out, and trunk forward bending and return to upright) [28,40]. This means that they do not provide any information beyond what may result from a well-conducted medical history. It must be said that, in general, there is a higher sensitivity of the tests [39] towards subjects with chronic LBP, suggesting an association between the duration of symptoms and MCI, which would require observational studies to be demonstrated. At the same time, more subjects with a history of LBP than healthy subjects [28] were positive, indicating the possibility that MCI may persist over time despite the resolution of symptoms. Again, only the design of ad hoc cohort studies could demonstrate the relationship between MCI and recurrence due to possible overloading of the tissues of the lower spine.
The validity data also show the small number of studies that dealt with diagnostic procedures aimed at identifying directional patterns of MCI [31]. The most important barrier to the development of validity research is the absence of a gold standard against which to compare the same outcome with different methods of investigation. Considering that tests for MCI evaluate the performance of certain motor tasks, the use and validation of motion capture tools seems to be the most appropriate strategy to make the evaluation as objective as possible. To date, only Wattananon et al. [41] have tried to establish reference values for the interpretation of clinical trials through comparison between the observation of examiners and the digital data collected.
In summary, only waiter's bow, sitting knee extension, and one leg stance are assessed across studies of good quality with good-excellent values for both intra-rater and inter-rater reliability; therefore, their use in clinical practice may be considered. However, the main problem remains the lack of clarity about validity, which, at present, does not allow conclusions on the accuracy of the subgrouping procedure.

Conclusions
Implications for clinical practice:
• Inter-rater reliability is widely studied. Waiter's bow, prone hip extension, sitting knee extension, and one leg stance showed good values confirmed by at least two studies;
• Intra-rater reliability is not largely investigated. From the few studies available, good repeatability values seem to emerge;
• Only waiter's bow, sitting knee extension, and one leg stance are assessed across studies of good quality with good-excellent values both for intra-rater and inter-rater reliability;
• There is a lack of evidence regarding the validity of MCI tests, which results from diagnostic accuracy analyses aimed at discriminating only the presence or absence of LBP in the study samples.

Funding: This research received no external funding.

Acknowledgments:
The papers were obtained through the NILDE library network system through the University of Genoa.

Conflicts of Interest:
The authors declare that there are no conflicts of interest.
Appendix A

To determine inter-examiner reliability of 3 tests of muscular functional coordination of the lumbar spine in patients with LBP. Pre-study trial on 10 patients.
Each test was performed once on both sides, and each test position was maintained for 20 s. Tests were administered in the same order to all patients.
Examiners were blinded to the patient's symptoms. Detected the VAS score before each test: VAS > 7/10 was an exclusion criterion.
Enoch et al. [30] To determine inter-operator reliability of MCI tests on patients with and without LBP The tests were performed in supine position and monitored with a PBU.
Biely et al. [28] To investigate the inter-examiner reliability of observation of aberrant movement patterns and whether each pattern is associated with current LBP.  The order of the test and leg lifted first were randomized.
Sensitivity and specificity p < 0.001 for group status and participant scores. Not between group and examiner classification. Not between examiner classification and participant scores. LBP group perceived significant difficulty compared to the control group. PHE: -specificity and sensitivity of participant-reported perception of difficulty scores in individuals with non-pregnancy-related LBP and controls.
Patients from local medical, chiropractic, physiotherapy, and massage therapy clinics Pre-study: 1 meeting and 3 training session to achieve a consensus.
The examiners were blinded to the group status and to the colleague's score.
Sn: 0.82-Sp: 0.69 ASLR: Patient were blinded to the evaluation of the examiners, and they were asked to express a score on a scale of 0-5 after the observer had left the room.
Ohe et al. [35] To quantify the characteristics of the trunk control during active limb movement in LBP patients with different types of LBP manifestation based on direct mechanical stress to the lumbar spine. Patients from the outpatient department of the local hospital.
Gondhalekar et al. [31] To determine the intra-and inter-rater reliability and concurrent validity of the standing back extension test for detecting MCI of the lumbar spine.  They were instructed to study each video clip no more than five times. The same procedure was repeated after 2 weeks.
Patients with LBP from private physiotherapy clinics, the healthy selected from university students and acquaintances.   Required to maintain constant pressure on the PBU during repeated lowering of the leg towards the support surface, starting with hips flexed at 90 degrees and knee extended as much as possible.
Difference in the pressure variation between the performance carried out with the two lower limbs.
One leg stance/Trendelenburg In an upright position, the patient is asked to search for the neutral lumbar position, following a maximum antiversion and retroversion of the pelvis.
A 5-cm tape positioned vertically starting from S1 (point 0) on which a laser is pointed. The patient moves the pelvis twice in anti and retroversion, finally returning to the starting position. The distance in cm between the laser pointer and S1 is measured. Required to maintain neutral lumbar spine position during knee extension with patient sitting on the edge of the cot * Capable of maintaining the neutral position of the lumbar spine up to 30-50 • knee flexion. ** A 5-cm tape is placed on the lumbar area starting from S1, on which a laser pointer is placed. After 5 full knee extensions, the distance in cm between the laser pointer and S1 is measured.      Physiotherapists valued the performance of the subjects on the six movement control tests resulting in a score of 0-6 positive tests.
Authors compared the mean number of positive tests in the two groups. The differences between the groups were analyzed by the effect size (ES).

Pelvic tilt
The statistical test showed that this was a significant difference (p < 0.001). Between all the group: p < 0.02 p < 0.01 acute vs chronic A subgroup analysis was performed of the number of positive tests depending on LBP.
p < 0.03 subacute vs chronic A statistically significant difference was also found between acute and chronic (p < 0.01) as well as between subacute and chronic (p < 0.03). No difference between acute and subacute patient groups (p > 0.7).

One leg stance
Sitting knee extension The tests were performed in supine position and monitored with a pressure biofeedback unit (PBU): maximal pressure deviation from baseline was recorded during each test. The aim was to have as little deviation as possible.
Bent knee fall out (BKFO) Roussel et al. [40] p = 0.049 (L), 0.304 (R) Significant differences were observed between dancers with and without a history of LBP (p value <0.05 bilaterally for KLAT and on the left leg for the BKFO).
Prone hip extension (PHE) Bruno et al. [29] p < 0.001 LBP group-patient score The following analyses were performed: p = 0.30 patient score-examiner classification → exam of the effects of group status (LBP/control) and examiner classification (positive/negative) on the participant-reported perception of difficulty scores (0-5) p = 0.96 LBP group-ex classification.
→ The sensitivity (LBP group) and specificity (control group) were calculated for different cut-offs used to distinguish "positive" and "negative" participant scores. Sn = 0.82 Sp = 0.69 (cut-off 0-1) Active straight leg raise (ASLR) Bruno et al. [29] p < 0.001 LBP group-patient score For both PHE and ASLR tests, a significant difference (p < 0.001) was found between the groups (LBP group perceived significant difficulty compared to the control group) but not for examiner classification. Not significant p = 0.54 patient score-examiner classification p = 0.89 LBP group-ex classification For both tests, the sum of sensitivity and specificity was highest with a cut-off of 0-1: Values are reported beside. Sn = 0.60 Sp = 0.76 (cut-off 0-1) Table A4. Cont.

Test Authors Validity Notes and Summary of Results
Trunk forward bending and return to upright Biely et al. [28] For altered lumbo-pelvic rhythm (LPR): Two different approaches for construct validity: (1) The ability of each individual aberrant movement to distinguish between patients with LBP, with history of LBP and without LBP. The p values show a statistically significant difference between all groups (p < 0.05).
No LBP: 0.8 ± 0.63 History of LBP: 1.3 ± 0.61 LBP: 2.5 ± 0.96 * p < 0.001 ** p < 0.001 *** p = 0.021 Standing back extension test Gondhalekar et al. [31] AUC: 0.785 for abdominal drawing-in maneuver (ADIM), 0.780 for ASLR To establish validity, results of movement test from the first rater were compared with the difference in thickness during ASLR and ADIM results. Area Under the Curve (AUC) was used for assessing the validity of the standing back extension test with respect to reference standard of ultrasound measurements during ADIM and ASLR maneuvers. It can be between 0 and 1: the closer the curve is to the top of the graph (i.e., to 1), the greater the discriminating power of the test. For AUC = 0.785 and 0.780, standing back extension test can be considered moderately accurate.

Standing knee-lift test (SKL) | Granström et al. [32] | AUC: 0.47 | The ability of the tests to classify the subjects into the healthy or NSLBP group was analyzed using the ROC curve, quantified by the area under the curve.
Static lunge test (SL) | Granström et al. [32] | AUC: 0.56 | Compared to the previous one, in this study, the AUC values are of lower accuracy. The authors considered an AUC of <0.5 as non-informative; 0.5 < AUC < 0.7 less accurate than chance alone; 0.7 < AUC < 0.9 moderately accurate; 0.9 < AUC < 1.0 highly accurate; and AUC = 1.0 like a perfect test.
Dynamic lunge test (DL) | Granström et al. [32] | AUC: 0.52 |

Legend: Sn = Sensitivity, Sp = Specificity, ROC = Receiver Operator Characteristic. For description and criteria of tests, see table "Inter-rater reliability".

Table A5. Critical appraisal tool for validity and reliability studies of objective clinical tools as described by Brink and Louw [27].

N Item | Type of Question | Nature of the study
1 | If human subjects were used, did the authors give a detailed description of the sample of subjects used to perform the (index) test? | Validity and reliability studies
2 | Did the authors clarify the qualification, or competence of the rater(s) who performed the (index) test? | Validity and reliability studies
3 | Was the reference standard explained? | Validity studies
4 | If interrater reliability was tested, were raters blinded to the findings of other raters? | Reliability studies
5 | If intrarater reliability was tested, were raters blinded to their own prior findings of the test under evaluation? | Reliability studies
6 | Was the order of examination varied? | Reliability studies
7 | If human subjects were used, was the time period between the reference standard and the index test short enough to be reasonably sure that the target condition did not change between the two tests? | Validity studies
8 | Was the stability (or theoretical stability) of the variable being measured taken into account when determining the suitability of the time interval between repeated measures? | Reliability studies
9 | Was the reference standard independent of the index test? | Validity studies
10 | Was the execution of the (index) test described in sufficient detail to permit replication of the test? | Validity and reliability studies
11 | Was the execution of the reference standard described in sufficient detail to permit its replication? | Validity studies
12 | Were withdrawals from the study explained? | Validity and reliability studies
13 | Were the statistical methods appropriate for the purpose of the study? | Validity and reliability studies