Systematic Review of Level 1 and Level 2 Screening Tools for Autism Spectrum Disorders in Toddlers

The present study provides a systematic review of level 1 and level 2 screening tools for the early detection of autism under 24 months of age and an evaluation of the psychometric and measurement properties of their studies. Methods: Seven databases (e.g., Scopus, EBSCOhost Research Database) were searched and experts in the autism spectrum disorders (ASD) field were consulted; the Preferred Reporting Items for Systematic review and Meta-Analysis (PRISMA) guidelines and the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist were applied. Results: The study included 52 papers and 16 measures; most of them were questionnaires, and the Modified-CHecklist for Autism in Toddlers (M-CHAT) was the most extensively tested. The measures' strengths (analytical evaluation of methodological quality according to COSMIN) and limitations (in terms of Negative Predictive Value, Positive Predictive Value, sensitivity, and specificity) were described; the quality of the studies, assessed with the COSMIN checklist, highlighted the need for further validation studies for all the measures. According to the COSMIN results, the M-CHAT, First Years Inventory (FYI), and Quantitative-CHecklist for Autism in Toddlers (Q-CHAT) seem to be promising measures that may be applied systematically by health professionals in the future.


Introduction
Recently, U.S. data showed that the median age at earliest Autism Spectrum Disorders (ASD; [1]) diagnosis ranged from 28 to 39 months for children aged 4 [2] and was 40 months for children aged 8 [3]. In light of these data, a screening procedure during regular well-baby check-ups was recommended [2,4] with the aim of detecting the warning signs of ASD (e.g., precursors of Theory of Mind; [5]). As suggested by several authors [6,7], the process should involve the early screening of warning signs and a subsequent diagnosis made through clinical judgement, in combination with the application of reliable and standardized gold-standard measures (e.g., the Autism Diagnostic Interview-Revised, [8]; the Autism Diagnostic Observation Schedule-2, [9]).
Earlier diagnosis of ASD could lead to earlier intervention for children, which could enhance their adaptation [10][11][12] or improve their social competence (e.g., emotional expression).
To give the reader a full and comprehensive view of the characteristics of the available Level 1 and Level 2 measures, and since the COSMIN protocol evaluates the quality of the study but not the quality of the tool, we collected data on sensitivity, specificity, PPV, and NPV for all the included measures and we provided a discussion of those properties.

Materials and Methods
The systematic review is based on a published protocol [32], in which the authors reported a comprehensive description of the steps to follow, the methodology, and the process of the review. Furthermore, the authors provided the format of the tables to be used for the main descriptive data of the papers included in the review and the results of the examination of the psychometric properties. The methodology was developed based on the Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA) guidelines [33] for identifying the papers to be included in the review. An electronic search was conducted using PsycINFO, the Psychology and Behavioral Sciences Collection, the Cumulative Index of Nursing and Allied Health Literature, Scopus, the Education Resources Information Center, Google Scholar, and PubMed (including MEDical Literature Analysis and Retrieval System OnLINE). The keywords applied were: 'early diagnosis or diagnos*', 'ASD screen*', 'ASD detect*', 'ASD or autism or autist*', 'assessment tool', 'surveillance', 'develop* surveillance', 'assess*', 'instrument*', 'measure*', 'psychometric properties', 'standardiz*', 'tool*', and 'validat*'. A secondary hand search was performed to include references and citations from the identified papers. The electronic search was carried out by one author, who extracted the records and tabulated the references in an Excel file. Two authors independently screened the records to exclude duplicates and to remove papers according to pre-defined inclusion/exclusion criteria. The two authors recorded their decisions in two separate Excel files and compared their findings record by record. In case of disagreement, a third author arbitrated. Finally, three clinicians and three research experts in ASD, working for the Public Health Service and for universities, respectively, were consulted. Based on the inclusion/exclusion criteria, they did not suggest any relevant measure or study beyond those already included in the present review.
Predefined inclusion criteria were: (1) level 1 and level 2 screening measures of ASD for children under 24 months; (2) validation studies, standardization of measures, cross-cultural comparisons, longitudinal, or follow-up studies; (3) published papers in peer-reviewed journals; (4) papers written in English; and (5) a year of publication between 1990 and October 2019. Other reviews on the same topic were examined to extract citations of studies that were eligible for our final list. Furthermore, exclusion criteria were defined as follows: (1) measures for the diagnosis of ASD; (2) retrospective studies and systematic reviews; (3) measures for the risk detection/diagnosis of other developmental disorders; (4) procedures for the detection of ASD other than questionnaires, interviews, and observation procedures (e.g., biological markers, fMRI, blood tests); (5) epidemiological studies and guidelines for experts; (6) publications that are not in peer-reviewed journals; (7) papers without the specific aim of evaluating the psychometric or validity properties of the measures; (8) dissertations, theses, or conference papers.
The measures were evaluated with the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist [29][30][31]. The COSMIN checklist comprises nine boxes identifying the main measurement properties: (A) internal consistency (i.e., the degree to which the items of a questionnaire correlate with each other and evaluate the same concept); (B) reliability (i.e., the ability to measure a construct over time or by different persons); (C) measurement error (i.e., the error of the score not attributed to true changes in the construct); (D) content validity (i.e., the degree to which the items adequately reflect the construct measured); (E) structural validity (i.e., evaluating whether the hypothesized latent factor(s) reach a good fit to the data); (F) hypothesis testing (i.e., considering whether the construct measured by the questionnaire shows the expected relations with other variables); (G) cross-cultural validity (i.e., giving information on the generalization properties of the measure when applied in a different cultural context); (H) criterion validity (i.e., the degree to which the measure correlates with a 'gold-standard' measure); and (I) responsiveness (i.e., evaluating whether the measure detects a change over time). Each box contains a different number of items (ranging from 5 to 18) evaluating 'design aspects and statistical methods' of a study [31] (p. 651), which require a mandatory assessment to obtain a full appraisal of the properties.
The COSMIN checklist provides a multi-step evaluation. The first step concerns the decision about which measurement properties have been assessed in a target paper among the nine boxes, and it is achieved by applying a binary scale (i.e., present vs. absent) considering the whole paper. For example, if the internal consistency (i.e., box A) is a property evaluated in a paper, then 'present' is attributed to box A for that paper.
The second step refines the evaluation undertaken in step 1. For each box marked as 'present' in step 1, the evaluator works through the questions, assigning to each of them an evaluation on a dichotomous scale ('yes' if the specific properties suggested by the question are present or 'no' if the specific properties suggested by the question are not present).
Finally, in the third step, the score obtained in step 2 is further refined. Every item marked as 'yes' in the previous step is now evaluated on a four-point scale: excellent (+++), good (++), fair (+), or poor (0).
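The rating logic of the three steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the COSMIN software or the authors' procedure: the function names `rate_box` and `rate_paper` are hypothetical, the label "fair" for "+" follows the note to Table 3, and the box-level aggregation assumes the 'worst score counts' rule of [31], under which a box takes the lowest rating given to any of its items.

```python
from typing import Dict, List, Optional

# Four-point scale, ordered from worst to best.
RATING_ORDER = ["poor", "fair", "good", "excellent"]
SYMBOLS = {"poor": "0", "fair": "+", "good": "++", "excellent": "+++"}


def rate_box(item_ratings: List[str]) -> Optional[str]:
    """Box-level rating under the 'worst score counts' rule.

    An empty list stands for a property marked 'absent' in step 1,
    so no rating is returned.
    """
    if not item_ratings:
        return None
    # The lowest item rating determines the rating of the whole box.
    return min(item_ratings, key=RATING_ORDER.index)


def rate_paper(boxes: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each box (A to I) to its symbol; empty string = not evaluated."""
    result = {}
    for box, items in boxes.items():
        rating = rate_box(items)
        result[box] = SYMBOLS[rating] if rating is not None else ""
    return result


# Hypothetical item ratings for one paper (illustrative values only).
example = {
    "A": ["excellent", "excellent"],
    "B": ["good", "fair", "excellent"],
    "C": [],
}
print(rate_paper(example))
```

In this hypothetical paper, box B is rated "+" although two of its three items scored higher, which illustrates how conservative the rule is.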
A final evaluation for each box is obtained by taking the lowest score attributed to any item in that box, according to the 'worst score counts' procedure [31] (p. 651). Therefore, if even one item in the box obtains a poor score, the measurement property for that box is rated as poor. Two authors independently applied the COSMIN checklist to 20 papers, with an inter-rater agreement of Cohen's k = 0.94.

Results
Figure 1 shows the PRISMA diagram. The electronic search identified 691 records and a secondary hand search added 26 more records. During the screening, two authors independently removed 365 duplicates and 300 papers according to the exclusion criteria. The final number of eligible papers included in the systematic review was 52. The consistency between the two authors who screened these records was high (Cohen's k = 0.89). Sixteen measures were evaluated and classified into 3 categories: observational checklists (n = 4), questionnaires (n = 10), and interviews (n = 2). Table 1 reports the general details of each measure. Table 2 shows the details of the studies included in the systematic review. Specifically, we reported the measure name, the authors and year of the study, the type of design, the population recruited, the application level (1, 2, or "hybrid"), and the diagnostic accuracy properties (i.e., sensitivity, specificity, PPV, NPV).
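The agreement figures reported above are values of Cohen's k, which corrects the observed proportion of agreement for the agreement expected by chance. A minimal sketch of the computation follows; the function name `cohens_kappa` and the include/exclude decisions are hypothetical and are not the review's data.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same records."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of records.")
    n = len(rater_a)
    # Observed agreement: share of records given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per label.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical screening decisions for four records (illustrative only).
author_1 = ["include", "include", "exclude", "exclude"]
author_2 = ["include", "exclude", "exclude", "exclude"]
print(cohens_kappa(author_1, author_2))
```

Here the two authors agree on three of four records (0.75), but half of that agreement is expected by chance, so k is 0.5.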

Overview of the Studies and Measures
The search strategy identified four level 2 measures (i.e., the Autism Detection in Early Childhood; the Autism Observation Scale for Infants; the Baby and Infants Screen for Children with aUtIsm Traits; the Parent Observation of Early Markers Scale), which were evaluated in eleven studies with a cross-sectional design and in two studies with a longitudinal design. Those measures were administered to two groups of children. The first group consisted of children who were already receiving attention from the local mental health service due to developmental concerns, children suspected of developmental delay, or children with a medical condition that could lead to a developmental delay, including ASD comorbidity (i.e., epilepsy, hydrocephaly, Down's syndrome, and cerebral palsy). Henceforth this group is identified as the Developmental Concerns group (DC). The second group included twins or younger siblings of children with an ASD diagnosis, henceforth defined as the Genetic Risk group (GR) because they have a high probability of developing ASD [20]. The studies included in level 2 aimed either to: (a) test a screening measure on DC or GR groups; (b) compare DC and GR groups with each other; (c) follow the DC/GR group until the diagnosis; or, finally, (d) compare children from the general population with DC or GR groups.
Table 2 also shows the details of the six 'hybrid' measures (i.e., the CHecklist for Autism in Toddlers; the Developmental Behavior Checklist: Early Screen; the Modified Checklist for Autism in Toddlers; the Modified Checklist for Autism in Toddlers-Revised with Follow-up; the Quantitative-CHecklist for Autism in Toddlers; the Three-Item Direct Observation Screen) that were developed mainly for level 1 and/or level 2 screening but were also administered to clinical populations (i.e., children who had already received a diagnosis of ASD or of another developmental disorder). Those studies aimed either to: (a) apply the measure to a clinical sample; (b) compare samples with different diagnoses (ASD vs. PDD-NOS vs. ODD); or, finally, (c) compare children from the general population with children with an ASD diagnosis. Eleven studies were longitudinal and 19 had a cross-sectional design.
All the other measures did not report any positive or negative predictive values. Overall, the measures for which the PPVs and NPVs were reported demonstrated moderate to high predictive values, although the results for the M-CHAT can be considered more stable than those for other measures, which need further and deeper exploration of these properties.

Quality of assessment of the studies
Table 3 shows the results of the evaluation of each psychometric property of the studies through the application of the COSMIN checklist. For each box, we reported a summary of the assigned scores.
The quality of assessment revealed a heterogeneous picture. Specifically, internal consistency (Box A) was evaluated in 24 of the 52 studies, and the scores were fair or poor, with the exception of the studies on the FYI and the Q-CHAT, which received excellent scores. Reliability (Box B) was evaluated in 17 studies, and the majority of the scores ranged from fair to poor; only the studies on the CHAT and the POEMS received an excellent and a good evaluation, respectively. Measurement error (Box C) was assessed in 5 longitudinal studies and received poor or fair evaluations.
Content validity (Box D) was evaluated in 9 studies and received excellent evaluations for the studies on the AOSI, BISCUIT, CHAT, FYI, M-CHAT, POEMS, Q-CHAT, SEEK, and TIDOS. Structural validity (Box E) was evaluated in 7 studies, but only 3, concerning two measures (i.e., M-CHAT and Q-CHAT), received excellent scores. Hypothesis testing (Box F) was evaluated in several studies, most of which received fair or poor scores, whereas those on the FYI and the M-CHAT-R/F received good evaluations and that on the M-CHAT was evaluated as excellent. For the studies on the JA-OBS, the SEEK, and the YATCH-18, the property was not evaluated.
Cross-cultural validity (Box G) was examined in 11 studies and received fair or poor scores. Criterion validity (Box H) was evaluated in all studies, with the exception of one on the Q-CHAT and one on the SEEK. This property was rated as excellent or good in four studies on four measures (FYI, M-CHAT, M-CHAT-R/F, and Q-CHAT), whereas in all other studies it was rated as fair or poor. Finally, responsiveness (Box I) was the least-evaluated property, with only 3 studies receiving scores from fair to poor.
As Table 3 shows, fair and poor scores were attributed mainly because of missing data and sample size, criteria that are evaluated across several measurement properties. The COSMIN evaluates these criteria with a conservative approach [86], which is discussed in the following section.
[64] 0 only one measurement 0 comparison instrument [65] + time interval + sample, hypothesis 0 missing item
Note: 4-point scale rating: +++ = excellent, ++ = good, + = fair, 0 = poor. Empty cell = COSMIN rating not evaluated. Fair and poor ratings are explained with the reason(s), in italics, leading to the evaluation. Specifically, "administration not similar" means that the two administration conditions used to examine the measurement property were not similar; "comparator instrument" means that the authors did not administer a gold-standard measure for ASD to evaluate the criterion validity; "expertise translator" means that the expertise of the measure's translators was poor or not described by the authors; "hypothesis" means that the authors did not formulate the hypothesis a priori; "missing item" means that the authors did not report the percentage of and/or the handling method for missing data; "no pilot study" means that the translated measure was not pre-tested in a target population; "only one measurement" means that the authors did not administer the measure at least twice; "sample" means that the sample size was not adequate; "statistical method" means that the authors did not calculate the right parameter(s) for the specific property; "time interval" means that the time interval between two measurements was not adequate; "translation" means that the back-translation process was not adequately described; "unidimensionality" means that the internal consistency parameter was not calculated for each (sub)scale separately [29,30].

Discussion
The systematic review identified six level 1 measures and four level 2 measures. Moreover, the present systematic review found that six screening tools were applied to clinical populations. Among the variety of methodologies of the level 1 and level 2 measures, the questionnaire was the most applied due to several inherent advantages. First, questionnaires are normally administered in a very short time, do not require specific knowledge or training, and are much less invasive than observational checklists or interviews. Second, they often do not require specific training on the coding system or the interpretation of the scores. For many questionnaires, the computation of a final score and the attribution of a meaning to it do not involve any clinical interpretation or specific knowledge of ASD. Nevertheless, questionnaires have several limitations. First, the score depends on the subjectivity of the informants. Since questionnaires are designed for parents, parents could under- or overestimate the early signs of risk depending on their ability to detect them and to distinguish signs of risk from normal deviations from the developmental trajectories. However, the impact of this limitation could be minimized with longitudinal studies testing and comparing the level 1 and level 2 screening instruments with the gold-standard measures (e.g., Autism Diagnostic Observation Schedule-2, [9]) for the diagnosis of ASD. Another inherent limitation of questionnaires is social desirability bias in the form of over-reporting desirable behaviors. Future research in this field is needed to develop one or more validity scales, as for other clinical psychological testing procedures (e.g., the MMPI-2; see [87]).
The second aim of the present review was to evaluate the psychometric characteristics of the included measures following the COSMIN checklist. Two main considerations can be drawn from our results, one pertaining to the quantity of the psychometric evaluations and the other to their quality. First, it should be noted that, in the studies included in our systematic review, some psychometric properties were evaluated more frequently than others. A high number of studies contained data that allowed the evaluation of internal consistency, reliability, hypothesis testing, and criterion validity, whereas measurement error, content validity, structural validity, cross-cultural validity, and responsiveness were evaluated in a low number of studies. The second element to be considered is the quality of the evaluations themselves. Indeed, a high frequency of evaluations of a given property does not always correspond to a high quality of evaluation of that property. For example, content validity was among the properties assessed less frequently, but it was rated as excellent for all the studies examined. On the other hand, hypothesis testing was frequently evaluated but received poor or fair scores. These findings should give researchers an impetus to design validation studies with a focus on both the quantity of the properties evaluated and the quality of their evaluation.
Considered overall, one very common problem for all the studies is the treatment of missing data. Few authors explicitly quantified the missing data in their data set, and very few explained the method that they followed to treat missing data. For studies that aim to identify early signs of risk of ASD, the treatment of the missing data represents a crucial aspect. For this specific case, the imputation of data through statistical procedures risks altering the data structure and the distribution beyond the over-/underestimation of the risk of ASD. Thus, it is quite important that, in the future, researchers explain whether and how they have treated missing data in their sample, especially for the parent-reported measures, for which it is more likely to have items with no answers.
According to the COSMIN evaluation, our findings highlight the necessity of further validation studies for all the measures included in the present review. Longitudinal studies that follow a general population sample over time with the purpose of making a diagnostic evaluation are particularly needed. This will allow for an in-depth study of the psychometric properties, a comparison of the results from different measures, and consequently an increase in their criterion validity, specifically the sensitivity and the specificity, through comparisons with the gold-standard measures.
Special consideration must be given to the sensitivity, specificity, PPV, and NPV of the measures because they are not included in the COSMIN checklist. These properties are extensively reported in the validation studies of the M-CHAT, M-CHAT-R/F, and ADEC. For other measures (i.e., CESDD, JA-OBS, POEMS, DBC-ES, and TIDOS), there is only one study each containing information on these properties (see Table 2 for the specific values). All the other measures did not report any positive or negative predictive values. Overall, the measures for which the sensitivity, specificity, PPVs, and NPVs were reported demonstrated moderate to high predictive values (see also [27]), although the results for the M-CHAT can be considered more stable than those for other measures, which need further and deeper exploration of these properties.
The third and final research question aimed at the identification of one (or more) promising instrument(s) for the assessment of early signs of risk of ASD according to the COSMIN evaluations of the studies. We consider questionnaires such as the FYI, the M-CHAT, and the Q-CHAT to be promising screening measures because, according to the COSMIN evaluation, they have a high number of psychometric properties evaluated and a high methodological quality attributed to them. Although we found these measures promising, none of them can currently be considered the gold standard in the early detection of risk of ASD, and further development in this field is desirable. For example, future studies should improve the sensitivity, specificity, NPV, and PPV properties of those measures, since they are not considered at all for the FYI and are barely considered for the M-CHAT and Q-CHAT, as also suggested by [27].
In contrast, the interviews and the observational checklists have both a low number of validation studies (with the exception of the M-CHAT-R/F) and a low methodological quality attributed to them. Further research on these methods of evaluation should focus on their psychometric properties, as it may be useful for health professionals to have a range of tools available for ASD risk detection that allows an in-depth analysis.
The present systematic review has several limitations. First, the COSMIN checklist is a standardized protocol for the assessment of the methodological quality of a study and not of the instrument itself. However, as suggested by others (see [86]), the evaluation of the methodological quality of a study is the first step in determining whether its results are reliable and trustworthy. In other words, evaluating the methodological quality of a study makes it possible to detect risk of bias in the results. Thus, the assessment of the quality of the study is directly related to the assessment of the measure administered in that study. Moreover, one of our inclusion criteria considered all the "validation studies, standardization of measures, cross-cultural comparisons, longitudinal, or follow-up studies", which are studies evaluating the measurement and validity properties of a screening measure. Therefore, we applied the COSMIN checklist to evaluate the measurement properties of studies that, in turn, evaluate the measurement properties of the screening measures. Thus, the evaluation of the properties of a study, in this case, is a proxy for the evaluation of the measure validated in that study.
Second, the 'worst score counts' policy of the COSMIN could lead to a negatively biased view of the measure. In this vein, the COSMIN itself explains that every item of its evaluation represents an important part of the overall assessment, so a poor rating for any item should be considered a serious flaw. Furthermore, we would like to focus on the COSMIN evaluation of the sample size. According to [31], the sample size is rated as excellent when it is ≥ 100, good when it ranges from 50 to 99, fair when it ranges from 30 to 49, and poor when it is < 30. This categorization is a good criterion when applied to the general population, whereas when risk and/or clinical groups are considered, the COSMIN sample evaluation should be weighed carefully against the prevalence rate of ASD. On this premise, the researchers who developed the COSMIN protocol recently reformulated the evaluation of the sample size (see [86]).
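The sample-size thresholds attributed to [31] above translate directly into a small lookup. The sketch below is illustrative only: the function name `cosmin_sample_size_rating` is hypothetical, and it encodes the thresholds as stated in the text, not the reformulated criteria of [86].

```python
def cosmin_sample_size_rating(n: int) -> str:
    """Rate a study's sample size with the thresholds reported from [31]."""
    if n < 0:
        raise ValueError("Sample size cannot be negative.")
    if n >= 100:
        return "excellent"
    if n >= 50:
        return "good"
    if n >= 30:
        return "fair"
    return "poor"


# Hypothetical sample sizes (illustrative only).
for size in (250, 64, 35, 12):
    print(size, cosmin_sample_size_rating(size))
```

Because the rule looks only at the absolute number of participants, a clinical or at-risk sample of, say, 35 children is rated "fair" regardless of how rare the target condition is, which is the concern raised in the paragraph above.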
Third, the sensitivity, specificity, Positive Predictive Value (PPV), and Negative Predictive Value (NPV) are not evaluated in the COSMIN checklist. Within the context of screening measures for ASD, it is important that professionals are confident when using a given tool. In this field, the predictive values indicate the probability that people with high scores on a tool are indeed at high risk (PPV) and, vice versa, that people with low scores are at low risk (NPV). To avoid omitting such important information, we extracted the NPVs and PPVs from the studies, reported them in Table 2, and discussed the evidence.
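For readers less familiar with these indices, the sketch below shows how sensitivity, specificity, PPV, and NPV are derived from a 2 x 2 table of screening result against diagnostic outcome, using the standard definitions. The class and function names and all counts are hypothetical and do not come from the reviewed studies; the code assumes that every row and column of the table is non-zero.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ScreeningCounts:
    """2 x 2 table: screening result versus later diagnostic outcome."""
    tp: int  # screen positive, diagnosis confirmed
    fp: int  # screen positive, diagnosis not confirmed
    fn: int  # screen negative, diagnosis confirmed
    tn: int  # screen negative, diagnosis not confirmed


def accuracy_indices(c: ScreeningCounts) -> Dict[str, float]:
    """Standard diagnostic accuracy indices (assumes non-zero margins)."""
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
    }


def ppv_from_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV implied by sensitivity, specificity, and prevalence (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical sample: 1000 screened children, 10 later diagnosed.
counts = ScreeningCounts(tp=8, fp=12, fn=2, tn=978)
indices = accuracy_indices(counts)
print(indices)
# The same tool applied to a higher-prevalence (e.g., at-risk) group.
print(ppv_from_prevalence(indices["sensitivity"], indices["specificity"], 0.20))
```

Unlike sensitivity and specificity, the predictive values depend on how common the condition is in the screened group, so a PPV obtained in a general-population (level 1) sample is not directly comparable with one obtained in an at-risk or clinical (level 2) sample.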
Finally, as in every systematic review, the definition of the inclusion criteria could have limited the electronic search, and we could have omitted several studies.
The present systematic review has two main strengths. First, the review provides an updated and complete overview of the current level 1 and level 2 screening measures for ASD. Second, our findings provide researchers and clinicians (e.g., pediatricians, GPs, psychologists) with analytical knowledge of the psychometric properties of the measures through the evaluation of the methodological quality of their validation studies. The outcomes of the systematic search and the results of the evaluation of the psychometric properties, through the application of the COSMIN criteria, may guide researchers and clinicians in their selection of one (or more) instrument(s), according to their specific purposes. A critical and reasoned choice of a measure, combined with good communication between clinicians and patients [88], could allow for the definition of a systematic screening procedure for the general population. This is the first step toward the early identification of risk of ASD, which, in turn, may lead to a timely diagnosis and ultimately to better outcomes for children [10,17,18] and families [89].