Attachment measures in middle childhood and adolescence: A systematic review of measurement properties

adequacy of measurement properties. Conclusions: Attachment measures in middle childhood and adolescence currently have limited evidence for the adequacy of their psychometric properties.

Attachment theory constitutes an evolving body of work developed by dozens of researchers and theoreticians over approximately sixty years. Two propositions are central to the theory: (1) patterns of attachment developed in infancy are relatively stable across the lifespan; (2) attachment patterns can help to explain the development of psychopathology (Bowlby, 1973, 1980). Both propositions can be tested empirically, but this requires the presence of valid and reliable attachment measures. Evidence points to attachment as a risk factor for the development of a range of psychopathology, including aggression and externalising behaviour (Fearon, Bakermans-Kranenburg, van IJzendoorn, Lapsley, & Roisman, 2010), internalizing disorders (Groh, Roisman, van IJzendoorn, Bakermans-Kranenburg, & Fearon, 2012), eating pathology (Jewell et al., 2016; Caglar-Nazali et al., 2014), and suicidality (Fergusson, Woodward, & Horwood, 2000). However, the power of attachment to predict later psychopathology is generally weak, raising doubts about its centrality as a causal factor (Fonagy, Luyten, & Allison, 2015). Moreover, there have been concerns that measures of attachment in childhood may have high levels of error (Fearon & Roisman, 2017), thus hampering efforts to understand the role of attachment in psychopathology.
Middle childhood and adolescence constitute developmental phases for which no 'gold standard' measures exist (Bosmans & Kerns, 2015). The aim of this study is to provide the first ever systematic review of the psychometric properties of all measures of attachment in middle childhood and adolescence. To understand the theoretical underpinnings of such measures, it is necessary to provide an account of the development of measurement approaches to infant and adult attachment. We note that the psychometric properties of these instruments have yet to be subjected to a systematic review, but recent narrative reviews of infant and adult attachment measures have been provided by Solomon and George (2016) and Crowell, Fraley, and Roisman (2016) respectively.

Measuring attachment in infancy and adulthood
The empirical assessment of human attachment began with Ainsworth, Blehar, Waters, and Wall's (1978) 'strange situation procedure' (SSP), in which trained observers rated infant behaviour in response to separation from and reunion with their primary caregiver. Although infants were rated across various scales, a discriminant function analysis suggested two underlying attachment dimensions, named avoidance and anxiety. The clustering of the data into three groups led to the naming of three attachment categories: A, B and C, which came to be known as insecure-avoidant, secure and insecure-ambivalent. Later, Main and Solomon (1986) identified a fourth category in the SSP, disorganized attachment (D). The resulting four-category 'ABCD' paradigm directly informed the development of the Adult Attachment Interview (AAI) (George, Kaplan, & Main, 1985). This was originally developed in an attempt to explain the SSP classifications of infants by identifying differences in mothers' representations of their own childhood experiences of being cared for. The four AAI categories (secure, dismissing, preoccupied and disorganized) were conceived of as explaining secure, avoidant, resistant and disorganized attachment patterns (respectively) in infants.
In a separate development, researchers such as Hazan and Shaver (1987) developed self-report measures of romantic attachment style. These measures were based on putative parallels between adults' ways of relating in romantic relationships and the concept of infant attachment categories. Brennan, Clark, and Shaver (1998) undertook a factor analysis of a large pool of items taken from self-report attachment style measures, and identified two underlying factors, avoidance and anxiety, echoing the dimensions identified by Ainsworth et al. (1978). In recent years, taxometric analyses of both the SSP (Fraley & Spieker, 2003) and the AAI (Fraley & Roisman, 2014) have also suggested that attachment is distributed across these two dimensions, rather than falling within categories. Nevertheless, concordance between AAI categories and self-reported attachment is trivial to low (Roisman et al., 2007).
Thus, the two main approaches to measuring adult attachment may be tapping related but distinct constructs, although the precise boundaries of the adult attachment construct are hard to operationalize (Allen, Stein, Fonagy, Fultz, & Target, 2005). The concept of adult attachment style as tapped by self-report measures has been defined as a constellation of knowledge, expectations and insecurities that people hold about themselves and their close relationships (Fraley & Roisman, 2018). By contrast, the AAI seems to access internal working models of childhood caregiving experiences (Stein, Jacobs, Ferguson, Allen, & Fonagy, 1998), although Allen and Miga (2010) have suggested that the AAI may be best conceived as a measure of emotion regulation in the context of discussions about caregiving experiences.

Measuring attachment in middle childhood and adolescence
As can be gleaned from this overview, whilst measures of attachment exist for both infancy and adulthood, there are enormous differences not only in measurement approach, but also in the latent construct that is being assessed. In infancy, an observational measure is used to assess behaviour in young children who are developmentally reliant on their caregivers for survival, and are at a very early stage in their emotional, cognitive and social development. By contrast, attachment measures in adulthood assess mental representations, expressed through language, in adults who have acquired formal operational thinking (Inhelder & Piaget, 1958), and whose close relationships serve quite different functions, potentially including reproduction, child-rearing and emotional support. Approaches to assessing attachment in middle childhood and adolescence therefore must take into account the 'moving target' of child development, both in terms of the evolving function of the attachment system, and the changing abilities of the child.
Researchers have tried to access children's attachment representations using three main methods. Firstly, various self-report measures have been developed, including specific measures developed for middle childhood (Kerns et al., 1996), measures developed with both adults and adolescents in mind (Feeney, Noller, & Hanrahan, 1994), and downward extensions of adult romantic attachment measures (e.g. Brenning, Soenens, Braet & Bosmans, 2011). Secondly, interview approaches informed by the AAI have been developed, such as the CAI (Shmueli-Goetz, Target, Fonagy & Datta, 2008). Finally, some researchers have developed projective measures, in which children's stories and play in response to attachment-related prompts (e.g. Green, Stanley, Smith, & Goldwyn, 2000; Cassidy, 1988) are rated by trained coders. Thus far, attempts to critically appraise the measurement properties of these various measures have been limited. The National Institute for Health and Care Excellence (NICE) (2015) conducted the most thorough review so far, including measures of attachment from infancy to the age of eighteen, but excluding self-report measures. This review recommended the CAI for middle childhood, and the AAI for adolescents aged over fifteen. Kerns, Schlegelmilch, Morgan, and Abraham (2005) and Wilson and Wilkinson (2012) have also conducted narrative reviews but neither included ratings of the methodological quality of studies, nor did they make recommendations about which measures have the best psychometric properties.
Table 1 Characteristics, strengths and weaknesses of observer-rated measures.
T. Jewell et al. Clinical Psychology Review xxx (xxxx) xxx-xxx

Rationale for this review
Attachment theory constitutes a highly important paradigm within the field of child development and psychopathology, resulting in thousands of empirical papers. However, the reliability and validity of measures is fundamental to the conduct and interpretation of this body of research.

Objective
Our primary aim in undertaking this review is to make recommendations about which attachment measures have the best psychometric properties, thereby providing a guide to researchers and clinicians in the field. We will also identify gaps in the evidence and make recommendations about promising avenues for future research.

Literature search
This systematic review was registered with PROSPERO (CRD 42017057772) and completed in accordance with PRISMA guidelines (Moher, Liberati, Tetzlaff & Altman, 2009). Two independent researchers searched MEDLINE, PsycINFO and Embase databases for relevant articles up to the end of June 2017. Eligibility criteria were: (1) English language; (2) published in a peer-reviewed journal; (3) aim of study is to develop a measure, or evaluate the properties of a measure, that assesses attachment in children or adolescents using either interview, task, projective method or self-report; (4) participants in the study are aged between 6 and 18 years inclusive; (5) measure is theoretically derived from attachment theory. The search strategy was constructed by TG, and refined by TJ, over three waves of literature searching in February 2016, November 2016 and June 2017. The search strategy is publicly available at the PROSPERO protocol registration for this review.

Data extraction
This study used the COSMIN checklist (De Vet, Terwee, Mokkink, & Knol, 2011; materials available at https://www.cosmin.nl), a tool for systematic reviews of measurement properties. Data were extracted using templates from the COSMIN tool. Characteristics of studies (e.g. sample size, age of participants) can be found in Appendices A and B for observer-rated measures and self-report measures respectively. Characteristics of the measures themselves (e.g. number of items and scales) can be found in Tables 1 and 2.

Assessment of measurement properties
Pairs of reviewers independently assessed each study using the COSMIN checklist for the following criteria: internal consistency, test-retest and inter-rater reliability, content validity, hypotheses testing (i.e. construct validity), cross-cultural validity and criterion validity. Measurement error and responsiveness to change are part of the COSMIN checklist but were not included in this review since no studies were found addressing these measurement properties.
Each study was rated by a pair of reviewers. TJ rated all papers, and TG, KW, KS and EC rated approximately a quarter of papers each. Disagreements were resolved by PF. Each measurement property was rated on a four-point rating scale (poor, fair, good, excellent) for methodological quality of the study, thereby assessing risk of bias within studies. Each measurement property also received a rating for the adequacy of the measurement property (assigned as '+', '−' or '?'). COSMIN criteria for the adequacy of measurement properties can be found in Table 3. Ratings of the methodological quality and adequacy of measurement properties within individual studies are in Appendices C and D (observer-rated and self-report measures respectively).

Data synthesis
A synthesis of the strength of evidence for each measurement property was conducted for all measures other than those which only achieved ratings of poor methodological quality (Tables 4 and 5: observer-rated and self-report measures respectively). Criteria used to define the strength of evidence can be found in Appendix E. Finally, we incorporated a brief narrative summary of strengths and weaknesses of each measure into Tables 1 and 2, so that researchers and clinicians can rapidly appraise the characteristics of different measures 'at a glance'.

Table 3 (partial) Criteria for the adequacy of measurement properties.
[…] No correlations with instrument(s) measuring related construct(s) AND no differences between relevant groups reported
'−' Criteria for '+' not met
Cross-cultural validity
'+' No important differences found between language versions in multiple group factor analysis or DIF analysis
'?' Multiple group factor analysis AND DIF analysis not performed
'−' One or more criteria for '+' not met
Criterion validity
'+' Convincing arguments that gold standard is "gold" AND correlation with gold standard ≥0.70
'?' Not all information for '+' reported
'−' Criteria for '+' not met

Results
Our search yielded 601 articles once duplicates were removed (see Fig. 1). TG and KW separately screened by title and abstract, then assessed eligibility in 101 full-text articles. Disagreements at screening stage were resolved by TJ. Reference lists were checked for additional articles not picked up in the search stage. Fifty-four relevant articles were identified (see Appendix F for references of included studies).

Internal consistency
Adequate internal consistency (alpha > 0.7) was reported in studies of the following self-report measures: AFAS, AFAS-SF, ASQ, ECR-RC, ECR-RS, ECR-R-GSF, IPPA-B and IPPA-45. There were studies reporting both adequate and inadequate internal consistency for the SS, PACQ and IPPA. Study quality was generally fair, but was excellent in the case of the AFAS, AFAS-SF, ASQ, ECR-RS and IPPA-45.
For observer-rated measures, several studies examined internal consistency across questions and sub-scales (ASA, ASCT, CAI, FFI, SAT, SBST). Only the AQ-A and ASA reported adequate internal consistency, both in studies of poor methodological quality for the evaluation of this property.
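The alpha threshold referred to above can be illustrated with a minimal sketch of Cronbach's alpha computed from raw item scores. The function name, the data layout (one list per item, one score per respondent) and the example scores are illustrative assumptions, not taken from any study in this review.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a scale.

    item_scores: list of items, each a list of scores (one per respondent).
    This layout is an assumption made for the sketch.
    """
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    n = len(item_scores[0])
    if n < 2 or any(len(item) != n for item in item_scores):
        raise ValueError("every item needs the same number (>= 2) of respondents")

    # Total scale score for each respondent.
    totals = [sum(item[i] for item in item_scores) for i in range(n)]

    # alpha = k/(k-1) * (1 - sum of item variances / variance of total score)
    sum_item_variances = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))


# Hypothetical scores for a 3-item scale answered by 5 respondents.
example = [
    [2, 4, 3, 5, 1],
    [3, 4, 2, 5, 1],
    [2, 5, 3, 4, 2],
]
alpha = cronbach_alpha(example)
print(round(alpha, 3), "adequate" if alpha > 0.7 else "inadequate")
```

Where a study reports alpha separately for several subscales, each subscale would be passed through the function on its own.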

Test-retest reliability
Test-retest reliability data were notably lacking for most measures. They were reported for the AAQ, AFAS, AUAQ, PACQ, CAI and MCAST but no studies met COSMIN adequacy criteria (ICC or kappa > 0.7).

Inter-rater reliability (interview, observation and projective measures)
Adequate inter-rater reliability was reported for the AAI, AAP, BND, CAI, GPACS and two measures using the secure base script paradigm, the ASA and SBST. The CAI demonstrated good inter-rater reliability in the study by Borelli et al. (2016), but not in earlier studies.
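For categorical classifications, inter-rater agreement of the kind summarised above is commonly expressed as Cohen's kappa. The following is a minimal sketch for two raters; the function name and the example classifications are hypothetical and are not data from any included study.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category per case."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must classify the same non-empty set of cases")
    n = len(rater_a)

    # Observed proportion of cases on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Agreement expected by chance, from each rater's marginal category use.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return (observed - expected) / (1 - expected)


# Hypothetical ABCD classifications of ten children by two coders.
coder_1 = ["A", "B", "B", "C", "D", "B", "A", "B", "C", "B"]
coder_2 = ["A", "B", "B", "C", "B", "B", "A", "C", "C", "B"]
print(round(cohens_kappa(coder_1, coder_2), 3))
```

In this invented example the coders agree on eight of ten cases, yet kappa is about 0.69, just under the 0.7 criterion, because chance agreement is discounted.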

Content validity
Content validity proved hard to assess using the standard COSMIN criteria, thus these were adapted. Studies were rated as excellent for methodological quality if (1) they demonstrated evidence of iterations in the development of the measure, (2) the assumptions underlying the measure were tested (e.g. through piloting), and (3) they involved an expert panel. Face validity of the measure in terms of its theoretical links to attachment theory was also considered. Positive ratings of content validity were given to the AFAS, CAI, ECR-R-GSF and IPPA in studies of good or excellent methodological quality, and also to the AUAQ, CMCAST, MCAST, SAA and SAT in studies of fair or poor methodological quality.

Structural validity
Several studies evaluated factor structure using exploratory and confirmatory factor analysis. Adequate structural validity was reported for the AAQ, AUAQ, IPPA-B and PIML.
For observer-rated measures, factor analyses were conducted, typically using subscales in the place of items. In most such studies the results were given an indeterminate rating ('?') since the findings were not interpretable within the COSMIN scheme. These indeterminate ratings were given to structural validity studies of the ASCT, GPACS, MCAST and CAI, with the exception of the Zachrisson, Røysamb, Oppedal, and Hauser (2011) study of the CAI, which was given a positive rating for adequate measurement properties and rated as excellent for methodological quality. This study is notable in that the confirmatory factor analysis demonstrated adequate model fit, as defined by COSMIN, for a two-factor model comprising avoidance and preoccupation. One other notable study of structural validity was that by Waters et al. (2015). This was a taxometric study pointing to the dimensional structure of attachment as measured by the Attachment Script Assessment.

Hypotheses testing
Under the COSMIN scheme, various aspects of construct validity, such as convergent and discriminant validity, are assessed under the banner of 'hypotheses testing'. More favourable ratings of methodological quality are assigned for studies that test multiple, specific hypotheses including the direction and magnitude of correlations. Of the observer-rated measures, ratings of adequate construct validity were assigned to studies of the AAI, AAP and CMSSB. The ASA, CAI, SAA and SAT showed inconsistent findings, with both positive and negative studies of hypotheses testing reported. For self-report measures, adequate ratings were given to studies of the AFAS-SF, ECR-RC, IPPA, IPPA-B, IPPA-R and PIML. Findings for the SS were inconsistent.

Cross-cultural validity
Few studies specifically investigated cross-cultural validity as defined within the COSMIN taxonomy, which refers to the degree to which the performance of a translated or culturally-adapted instrument is an adequate reflection of its performance in its original version.
Where it was investigated, study quality was rated as poor. Adequate cross-cultural validity was demonstrated for the FFI. An indeterminate rating was given in studies of the AAQ, AUAQ, IPPA, IPPA-B, and MCAST.

Criterion validity
Studies of criterion validity were rare, which is unsurprising in a field lacking an accepted 'gold standard' measure. The AFAS-SF demonstrated adequate criterion validity against the AFAS, whilst the CMCAST did not do so against the MCAST.
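The correlational part of the criterion validity criterion (correlation with the comparison measure of at least 0.70) can be sketched as below. The function names and scores are hypothetical, and the sketch does not address the other part of the criterion, namely whether the comparison measure can be convincingly argued to be a 'gold standard'.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def meets_correlation_criterion(r, threshold=0.70):
    """True if the correlation reaches the threshold (assumed here to be inclusive)."""
    return r >= threshold


# Hypothetical scores on a short form and its full-length parent measure.
short_form = [1, 2, 3, 4, 5]
full_form = [2, 1, 4, 3, 5]
r = pearson_r(short_form, full_form)
print(round(r, 2), meets_correlation_criterion(r))
```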

Synthesis of results
Our synthesis of the strength of the evidence for psychometric properties is presented in Tables 4 and 5 (observer-rated and self-report, respectively). Overall, no measure has demonstrated consistent evidence of good psychometric properties across a range of criteria. However, our findings point to the CAI and IPPA currently having the best evidence of adequate measurement properties. The CAI has positive findings in support of its content validity, structural validity when assessed using two dimensions (Zachrisson et al., 2011), and various positive findings relating to construct validity (e.g. Borelli et al., 2016). However, its inter-rater reliability is sub-optimal in most studies, with the exception of that by Borelli et al. (2016). The IPPA exists in several versions, none of which has emerged as demonstrating adequacy across a range of psychometric properties. In general, the structural validity of the measure is inadequate, and does not accord with the two-dimensional structure that our review suggests is most strongly supported by evidence. Further, its findings on internal consistency have been mixed. However, it has demonstrated adequate construct validity across a relatively large number of studies compared to other self-report measures.

Discussion
Overall our review points to a lack of evidence of adequate measurement properties for most available attachment measures in middle childhood. However, we wish to draw attention to some important points that should be borne in mind when interpreting our findings.
Firstly, the COSMIN tool yields categories, with necessarily arbitrary values chosen as cut-offs to distinguish adequate from inadequate measurement properties. In some cases the statistical values that led to a negative rating were close to the value required for a positive rating. This point applies similarly to the ratings of methodological quality, in which COSMIN operates a 'worst score counts' algorithm. This means that the final rating of methodological quality is defined by the lowest score obtained for that measurement property; thus a single flaw could lead to a rating of 'fair' when it would otherwise have been rated 'excellent'. We applied a similar rule in rating the adequacy of a measurement property where data for several subscales were presented: one sub-optimal value was enough to lead to a negative rating of adequacy for that property.
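The two rules described in this paragraph can be sketched as follows. This is an illustrative reading of the text rather than the COSMIN tool itself: the names are hypothetical, the 0.7 threshold is treated as inclusive by assumption, and an ASCII '-' stands in for the '−' rating.

```python
from enum import IntEnum


class Quality(IntEnum):
    """Four-point methodological quality scale, ordered from worst to best."""
    POOR = 0
    FAIR = 1
    GOOD = 2
    EXCELLENT = 3


def worst_score_counts(item_ratings):
    """Overall quality for one measurement property: the lowest item rating."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings)


def subscale_adequacy(values, threshold=0.7):
    """'+' only if every subscale value reaches the threshold, otherwise '-'.

    Returns '?' when no values are available to rate.
    """
    if not values:
        return "?"
    return "+" if all(v >= threshold for v in values) else "-"


# A single 'fair' item pulls an otherwise excellent study down to 'fair'.
ratings = [Quality.EXCELLENT, Quality.EXCELLENT, Quality.FAIR, Quality.EXCELLENT]
print(worst_score_counts(ratings).name)

# One sub-optimal subscale value is enough for a negative adequacy rating.
print(subscale_adequacy([0.82, 0.75, 0.66]))
```

Both rules take a minimum over their inputs, which is why the summary ratings may understate how close a study came to a more favourable rating.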
The implication of all of these points is that a quick glance at our ratings may lead to an underestimation of both the adequacy of measurement properties and also the methodological quality of the evidence. Finally, readers should note that we followed the COSMIN guidance around study selection, including only studies that specifically stated the investigation of measurement properties as an aim of the study (De Vet et al., 2011). Previous reviews (e.g. Kerns & Brumariu, 2016) have included a broader range of studies as providing evidence of validity, such as studies looking at associations between attachment and emotional regulation (e.g. Brumariu, Kerns, & Seibert, 2012). However, including studies that did not specifically aim to examine psychometric properties would have increased the risk of bias, led to an unwieldy number of studies for review, and made it harder for future researchers to reproduce our review. Nevertheless, we accept that by using COSMIN we have taken a relatively stringent approach to the selection and rating of studies.
We believe that this review has helpfully summarised the state of evidence on psychometric properties for individual measures. In addition, this review also allows us to consider some broader questions that we consider to be fundamental to the measurement and conceptualisation of attachment in middle childhood and adolescence. We have organised our discussion around two key questions, before moving on to recommendations for research and clinical practice.

What is the most valid and reliable approach to assessing attachment in middle childhood and adolescence?
In general we do not consider, a priori, that any measurement approach is inherently more appropriate to the measurement of attachment across middle childhood and adolescence. One exception to this is the assessment of attachment in the early phase of middle childhood (age 6-8 years) in which self-report measures are unlikely to be valid as a consequence of children's more limited reading and cognitive ability. Unfortunately, our review points to the relatively poor psychometric properties of measures developed for this age group (e.g. Green et al., 2000). The CAI was initially developed for children aged 7-13 years, and currently has the best evidence of psychometric properties for early middle childhood. We believe a study of psychometric properties of the CAI in a sample of 6-9 year olds would be worthwhile, as currently there is a lack of evidence for measures demonstrating adequate psychometric properties for this age group.
In older middle childhood and adolescence, interviews constitute a well-validated measurement approach. Whilst the CAI has been studied in adolescence (Venta, Shmueli-Goetz & Sharp, 2014), clinicians and researchers may want to consider whether the AAI or AAP might be more appropriate for older teenagers. Only one measure in our review, the GPACS (Obsuth, Hennighausen, Brumariu, & Lyons-Ruth, 2014), utilised an observation of adolescent-caregiver interaction, with scales measuring dyadic interaction, in addition to adolescent and caregiver behaviour within a ten-minute task discussing areas of disagreement. In addition to capturing the quality of in-vivo adolescent-caregiver interaction, the measure is theoretically informed by a conceptualisation of disorganized attachment in adolescence. As such, the GPACS assesses more extreme features of disorganized attachment (such as role reversal) that are less likely to be captured by other measures.
Also promising are measures of secure base scriptedness (Dykas, Woodhouse, Cassidy, & Waters, 2006; Psouni & Apetroaia, 2014), in which the task involves creating stories using word prompts. These measures have demonstrated relatively strong evidence of adequate psychometric properties, and have numerous advantages including shorter administration time and a simpler scoring method when compared to interview measures such as the CAI. Further studies examining convergent validity of scriptedness measures with the best-performing interview and self-report measures would be an important contribution to the field.
Finally, the reliability, and especially validity, of self-report measures of attachment is important to consider, not least since NICE (2015) did not include such measures in their guideline on attachment. Most self-report measures in this review did not examine convergent validity with interview or projective assessment methods. Two exceptions are the IPPA and SS, which have both been found to be correlated with attachment as measured by the CAI (Borelli et al., 2016). Importantly, however, the correlations were below the 0.4 cut-off which is used by convention as evidence that two measures are tapping the same construct. Thus, strictly speaking, we can conclude that some self-report measures of attachment are correlated with attachment interviews, but we cannot be certain they are rating the same construct. For adolescents, measures such as the ECR and ASQ can be used to assess attachment styles, which can be assumed to have conceptual continuity with the attachment style construct as measured in adults. Such measures are needed in order to shed light on the developmental antecedents of adult attachment styles, for instance through longitudinal studies (e.g. Jones et al., 2018). However, as we discuss later, there is scope for the improvement of self-report measures.

Is attachment distributed categorically or continuously?
The four-category ABCD paradigm has held a central place in attachment theory and measurement approaches. In our review, interview and projective measures based on this paradigm typically reported sub-optimal structural validity and inter-rater reliability (kappa < 0.7). By contrast, dimensional approaches to scoring such measures demonstrated favourable reliability (e.g. Psouni & Apetroaia, 2014; Waters et al., 2015). Importantly, Zachrisson et al. (2011) found evidence of a factor structure comprising two dimensions underlying the CAI. Thus the findings from this review appear to converge with emerging findings at other points in the lifespan, with taxometric analyses of both the Strange Situation Procedure (Fraley & Spieker, 2003) and the Adult Attachment Interview (Fraley & Roisman, 2014) supporting the idea of attachment being distributed across two dimensions, rather than four categories.
If attachment is distributed continuously, are self-report measures yielding continuous scales the best approach to measuring it? Unfortunately, almost all self-report measures in this review demonstrated sub-optimal structural validity. This review included various self-report measures, often based on adult attachment style measures. These measures have often been subject to numerous revisions, such as changes to wording and item length (e.g. ECR and IPPA studies). Despite exhaustive factor analysis using large samples, these measures have failed to meet criteria for good structural validity as defined by COSMIN. This raises difficult questions for the field. Does the lack of structural validity reflect problems with the measures themselves, or are the constructs they set out to measure not reflective of the phenomenology of attachment in middle childhood and adolescence? Likewise, does attachment in this age group not break down into the ABCD categories, or are available measures not able to detect them reliably? Based on the evidence in this review, it seems plausible that the attachment construct in middle childhood and adolescence is inherently difficult to measure reliably. This may be because attachment representations themselves are relatively fluid at this age (Jones et al., 2018); perhaps also because the developing nature of children's cognitive and socio-emotional abilities presents challenges in capturing such a complex construct.
Thus, the underlying structure of attachment in middle childhood and adolescence is unclear, with neither the ABCD model for interview/ projective measures, nor the two factor (avoidance/anxiety) structure of adult attachment style measures demonstrating strong evidence of validity in this age group. This has important implications for research and clinical practice.

Implications for research
In keeping with the findings by NICE (2015), our study highlights the relatively poor methodological quality of many studies in the field. However, within our review we note with encouragement a trend towards improved study quality over time. Some key methodological principles worth highlighting are: clearly stated hypotheses that include predictions about the direction and magnitude of expected correlations, and reporting on both the amount of missing data and how it was handled. Studies of test-retest reliability and sensitivity to change are required, as are studies investigating attachment measures in a range of different sociocultural contexts. Recent studies conducted in Africa and Asia (Sochos & Lokshum, 2017; Wan, Danquah, & Mahama, 2017) are a welcome development for the field; more such studies are needed.
For interview measures based on the ABCD paradigm (e.g. CAI, MCAST), research on simpler coding systems yielding dimensional scores of avoidance and preoccupation would lead to a number of benefits. These include improved inter-rater reliability, theoretical congruence with other developments in the field (Fraley & Roisman, 2014; Fearon & Roisman, 2017), and increased statistical power in research, such as longitudinal designs investigating the impact of attachment on developmental outcomes (e.g. Wright, Hill, Sharp, & Pickles, 2018).
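The point about statistical power can be illustrated with a small simulation, offered only as a sketch under strong assumptions: a single normally distributed attachment dimension, a hypothetical outcome correlated with it at 0.3, and a median split standing in for categorical coding. None of these values comes from the studies reviewed.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate(n=5000, true_r=0.3, seed=0):
    """Compare a dimensional score with its median-split category.

    Returns (correlation using the continuous score,
             correlation using the dichotomised score).
    """
    rng = random.Random(seed)
    # Hypothetical continuous attachment dimension (e.g. avoidance).
    dimension = [rng.gauss(0, 1) for _ in range(n)]
    # Hypothetical outcome built to correlate with the dimension at true_r.
    outcome = [true_r * d + sqrt(1 - true_r ** 2) * rng.gauss(0, 1) for d in dimension]
    # Categorical coding: above versus at-or-below the sample median.
    median = sorted(dimension)[n // 2]
    category = [1.0 if d > median else 0.0 for d in dimension]
    return pearson_r(dimension, outcome), pearson_r(category, outcome)


r_dimensional, r_categorical = simulate()
print(round(r_dimensional, 3), round(r_categorical, 3))
```

Under these assumptions the dichotomised score shows a weaker association with the outcome than the dimensional score, so a larger sample would be needed to detect the same underlying effect.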
Our review casts doubt on the notion that there is a single latent attachment construct which is tapped by all the measures in this review, given both the heterogeneity in measurement approaches, and the evidence surveyed on convergent validity. We agree with Bosmans and Kerns (2015) that it is more fruitful to ask what aspect of attachment one is trying to assess, rather than what the 'gold standard' attachment measure might be. 'Attachment and affiliation' are already included in the Research Domain Criteria (Cuthbert, 2014), and in our view this presents an opportunity to advance the measurement of attachment across the lifespan. Researchers should aim to develop developmentally appropriate measures of more precise, well-validated lower-order attachment constructs (e.g. secure base scriptedness), which ultimately belong to higher-order domains relating to socioemotional processes and reward (Fonagy & Luyten, 2018). Rather than privileging attachment above other constructs, it would be more helpful to place attachment within a broader project to improve the science of developmental psychopathology and mental health treatment. Understanding how different aspects of attachment, during different developmental phases, play a role within broader social processes has greater potential to lead to innovations in treatment than working with poorly-validated concepts and measures.
In order to develop a more empirically-supported approach to attachment in middle childhood and adolescence, structural equation modelling (SEM) could be used to investigate the extent to which different measures, and indeed individual items, load on to latent attachment variables. Ideally this would be undertaken with large datasets in which a variety of measurement approaches have been taken, including measures of attachment and associated constructs such as mentalizing and emotion regulation. Exploratory work of this kind should be possible with existing data sets, although the ideal would be to design studies with large samples that could simultaneously investigate the structure of attachment and evaluate new measures. This should include the development of new self-report measures, the starting point for which must be a clear conceptualisation of what is meant by attachment. In studies in which attachment has been measured by a variety of methods, the process of refining the questionnaire should include the extent to which items load on to a latent attachment variable (or variables) in SEM. By this means, the measure would avoid the pitfalls of shared method variance and reliance on adult attachment style models for concurrent validity.
Our review also highlights a clear need for research on measurement error. Fearon and Roisman (2017) recommend the development of better attachment measures that allow direct assessment of the relationship between indicators, error and underlying constructs. Once again, studies employing structural equation modelling could investigate the degree of measurement error for each item in a self-report questionnaire, or each scale in an interview/projective measure, by assessing correlations with a latent attachment variable.
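The link between item-level measurement error and correlation with a latent variable can be illustrated with a minimal simulation. The sketch below is not an SEM analysis and is not drawn from any study in this review. It assumes a simple one-factor model in which each standardised item equals its loading times a latent score plus independent error, and all loadings, sample sizes and function names are hypothetical. Because the data are simulated, the latent score is known directly; in real data it would have to be estimated from the item covariances using SEM software.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_items(loadings, n, seed=0):
    """Simulate standardised items x_i = loading_i * eta + error_i.

    eta is a standard-normal latent score. Each error term has variance
    1 - loading_i**2, so every item has unit variance in expectation.
    """
    rng = random.Random(seed)
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    items = []
    for lam in loadings:
        sd_err = math.sqrt(1.0 - lam ** 2)
        items.append([lam * e + rng.gauss(0.0, sd_err) for e in eta])
    return eta, items


# Hypothetical loadings for three illustrative items (not from any real measure).
true_loadings = [0.8, 0.5, 0.3]
eta, items = simulate_items(true_loadings, n=5000, seed=42)

for lam, item in zip(true_loadings, items):
    r = pearson(item, eta)
    error_share = 1.0 - r ** 2  # proportion of item variance attributable to error
    print(f"true loading {lam:.1f}: r = {r:.2f}, error share = {error_share:.2f}")
```

Under these assumptions, an item's correlation with the latent variable approximates its standardised loading, and one minus the squared correlation approximates the share of item variance that is error. Weakly loading items are therefore dominated by error variance.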

Implications for clinical practice
Attachment theory is an influential theoretical framework in many clinical contexts such as psychological therapies with children and families. In the UK, national guidelines published by NICE (2015) specifically advise clinicians working with children in child protection and adoption settings to consider attachment in their assessment and treatment planning, and recommend the MCAST, CAI and AAI. Our findings suggest that such measures are vulnerable to high measurement error, as evidenced by unfavourable ratings of inter-rater reliability in many studies. We therefore suggest that, when used, the findings of such measures should be interpreted tentatively as a clinical hypothesis, and understood as being informative only as one aspect of a much broader assessment of a child and their caregivers; moreover, this hypothesis may need to be reviewed over time as new information emerges. In court settings, evidence provided from such measures that has been rated by only one clinician should not be seen as authoritative, since a second rater may well disagree on the assignment of attachment category. Finally, we encourage clinicians, policy-makers and members of the public not to reify attachment categories. Studies in this review suggest that attachment status is not necessarily predictive of psychopathology in children and adolescents. For instance, in studies using the CAI, rates of secure attachment in clinical samples have been as high as 30% in adolescent psychiatric inpatients (Venta et al., 2014), whilst Scott, Riskman, Woolgar, Humayun, and O'Connor (2011) reported security in 52% and 73% respectively in moderate and high-risk samples for conduct problems. This compares with rates of approximately 60% attachment security in normative child samples (Shmueli-Goetz et al., 2008; Green et al., 2000).
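The concern about classifications made by a single rater can be made concrete with an agreement statistic. The sketch below uses Cohen's kappa, which is assumed here purely for illustration as one common index of chance-corrected agreement between two raters; the review does not prescribe it. The category labels and ratings are invented and do not come from any study or measure.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on categorical codes."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("both raters must code the same non-empty set of cases")
    # Proportion of cases on which the two raters assign the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's marginal category counts.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return (observed - expected) / (1.0 - expected)


# Hypothetical classifications of ten cases by two raters (illustrative only).
rater_a = ["secure", "secure", "secure", "secure", "secure",
           "dismissing", "dismissing", "dismissing",
           "preoccupied", "preoccupied"]
rater_b = ["secure", "secure", "secure", "secure", "dismissing",
           "dismissing", "dismissing", "secure",
           "preoccupied", "dismissing"]

raw_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"raw agreement = {raw_agreement:.2f}")
print(f"Cohen's kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

In this invented example the raters agree on seven of ten cases, yet kappa is noticeably lower than the raw agreement because some of that agreement would be expected by chance alone.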

Limitations
As discussed earlier, our review applied stringent inclusion criteria. As a consequence, it may under-report the breadth of evidence on psychometric properties for attachment measures in middle childhood and adolescence, although excluded studies are likely to be of low quality. We excluded studies where the mean age of participants lay outside the 6-18 age range. Consequently, we excluded some measures that are suitable at the extremities of this age range, such as the Attachment Style Interview (Schimmenti & Bifulco, 2015), which has been validated in a sample of 16-25 year-olds. Furthermore, we were not able to appraise the risk of publication bias across studies. Studies in our review reported a range of different statistics across a variety of measurement properties; as such, there was no valid and reliable way to assess publication bias. Given the likely researcher bias towards publishing positive results, it is possible that our review over-estimates the adequacy of psychometric properties across measures, since there may be unpublished data showing negative findings.

Conclusion
The field of attachment is entering an exciting phase in which new empirical and theoretical insights are emerging. Longitudinal studies of attachment stability suggest that attachment may be less stable in adolescence than in adulthood (Jones et al., 2018), and that genetic influences on attachment may come into play in adolescence that were not present in infancy (Fearon et al., 2014). Recent theories have suggested that the early stages of sexual maturation (adrenarche) may constitute a 'switch-point' in the development of attachment strategies (Del Giudice, Angeleri, & Manera, 2009), and that adult attachment styles may be more influenced by recent interpersonal experiences than by distal, early caregiving experiences (Fraley & Roisman, 2018). The attachment field is reliant on the availability of valid, reliable and sensitive measures that can be used to test theories and build evidence. In the clinical realm, good measures are needed to test both aetiological models in which childhood attachment experiences are implicated (e.g. Fonagy, Gergely, Jurist, & Target, 2002) and the role of attachment in treatments (e.g. Diamond et al., 2010). In adult clinical samples, attachment is a predictor of psychotherapy outcome (Levy, Kivity, Johnson, & Gooch, 2018) and is associated with differential response to treatment across a range of disorders including psychosis (Carr, Hardy, & Fornells-Ambrojo, 2018; Gumley, Taylor, Schwannauer, & MacBeth, 2014) and eating disorders (Tasca & Balfour, 2014). The role of attachment in the process of psychological treatments for children and adolescents is under-researched, and the further development of psychometrically sound measures will help to advance understanding in this area.
Selecting a suitable attachment measure, whether in a clinical or research context, is a complex matter. Our review provides various important sources of information that can guide the decision including measurement approach (i.e. interview, task, self-report), administration time, and the type of attachment relationship that is assessed. In particular, we advise close scrutiny of face validity: that is, clinicians and researchers need to assess the extent to which the conceptualisation of attachment and how it is assessed fit with the purposes for which the measure is being chosen.
In summary, our review suggests that there are currently large gaps in our knowledge of the psychometric properties of attachment measures, with the lack of data on sensitivity to change being particularly regrettable. We found limited evidence of adequate psychometric properties, but identified the CAI and IPPA as currently having the best evidence of such properties amongst observer-rated and self-report measures respectively. The ASA, a measure of secure base scriptedness, was identified as a promising measure worthy of future research on psychometric properties. Our findings point to the advantages of dimensional rather than categorical approaches to measurement, with more favourable inter-rater reliability and structural validity ratings observed in measures yielding dimensional scores. Future studies are needed that test specific hypotheses and that shed light on the underlying structure of attachment representations in middle childhood and adolescence.

Role of funding sources
TJ was funded by a National Institute for Health Research (NIHR) Clinical Doctoral Research Fellowship, CDRF-2014-05-024. The funder had no direct involvement in the conduct of this review. The views expressed are those of the author and not necessarily those of the NIHR.

Contributions
TJ, IE and PF wrote the protocol for the systematic review. TJ, TG and KW conducted literature searches. TJ, TG, KS, EC and KW rated the included studies. TJ wrote the first draft of the manuscript and all authors contributed to and have approved the final manuscript.