Measuring general mental health in early‐mid adolescence: A systematic meta‐review of content and psychometrics

Abstract

Background: Adolescent mental health is a major concern, and brief general self-report measures can facilitate insight into intervention response and epidemiology via large samples. However, measures' relative content and psychometric quality are unclear.

Method: A systematic search of systematic reviews was conducted to identify relevant measures. We searched PsycINFO, MEDLINE, EMBASE, COSMIN, Web of Science, and Google Scholar. Theoretical domains were described, and item content was coded and analysed, including via the Jaccard index to determine measure similarity. Psychometric properties were extracted and rated using the COSMIN system.

Results: We identified 22 measures from 19 reviews, which considered general mental health (GMH; positive and negative aspects together), life satisfaction, quality of life (mental health subscales only), symptoms, and wellbeing. Measures were often classified inconsistently within domains at the review level. Only 25 unique indicators were found, and several indicators appeared across the majority of measures and domains. Most measure pairs had low Jaccard indexes, but 6.06% of measure pairs had >50% similarity (most across two domains). Measures consistently tapped mostly emotional content but tended to show thematic heterogeneity (i.e., they included more than one of emotional, cognitive, behavioural, physical, and social themes). Psychometric quality was generally low.

Conclusions: Brief adolescent GMH measures have not been developed to sufficient standards, likely limiting robust inferences. Researchers and practitioners should attend carefully to the specific items included, particularly when deploying multiple measures. Key considerations, more promising measures, and future directions are highlighted.

PROSPERO registration: CRD42020184350 (https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42020184350).


INTRODUCTION
Accurate and efficient measurement of adolescent general mental health (GMH) is of vital importance: Adolescence, the phase starting around age 10 (Sawyer et al., 2018), appears pivotal for mental health problems, playing host to the first onset of the majority of lifetime cases (Jones, 2013). There is also evidence that the mental health of young people is worse than in previous generations (Collishaw, 2015). Despite a striking need to improve our understanding of mental health in this age group, research has typically faced major methodological problems, including low statistical power, poor measurement, and analytical flexibility (Rutter & Pickles, 2016). High-quality research going forward will likely be underpinned by well-developed brief general measures that facilitate large samples. Brief self-report measures impose lower burden and are therefore more feasible when considering prevalence or response to intervention at appropriately large sample sizes (Humphrey & Wigelsworth, 2016).
Specifically, time is a major concern for schools, which are often called on to support the administration of mental health questionnaires (Soneson et al., 2020), and in large panel studies (Rammstedt & Beierlein, 2014). Brief surveys are also recommended for work with adolescents to ensure better response rates (Omrani et al., 2018).
This meta-review focuses on the content and psychometric properties of self-report measures to aid researchers and practitioners in selecting indicators and measures more likely to lead to valid inferences.
Various domains of GMH exist (e.g., disorders and wellbeing).
However, it is currently unclear how these constructs relate to one another conceptually, or what their relative psychometric qualities are. Such clarity is needed because some work has started to explore empirically the relationships between different domains (e.g., Black et al., 2019; Patalay & Fitzsimons, 2016), but findings in this area seem to be sensitive to measurement issues such as informant (Patalay & Fitzsimons, 2018) and operationalisation (Black et al., 2021). While a body of literature has been devoted to interpreting apparently paradoxical differences between positive and negative mental health outcomes (Iasiello & Agteren, 2020), we argue the known issues with adolescent mental health data (Bentley et al., 2019; Rutter & Pickles, 2016; Wolpert & Rutter, 2018) may mean such paradoxes are in fact artefacts, as some work has suggested (Furlong et al., 2017). Psychometric and conceptual properties must therefore be attended to going forward.
While analysis of item content is lacking, there is literature describing the theoretical domains to which measures belong. For instance, measures may be based on diagnostic systems such as the Diagnostic and Statistical Manual of Mental Disorders, or on wellbeing frameworks such as hedonic (focused on happiness and pleasure) or eudaimonic (focused on broader fulfilment) wellbeing (Ryan & Deci, 2001). However, we chose to focus on item rather than construct mapping for several reasons: First, it is a known problem that measures with different labels sometimes measure the same construct (jangle fallacy), while others with the same label can measure different constructs (jingle fallacy; Marsh, 1994). Second, measures and their subdomains are often heterogeneous. Third, psychometric validations can be data-driven, resulting in items with beneficial statistical properties being prioritised over those considered theoretically key (Alexandrova & Haybron, 2016; Clifton, 2020). We therefore argue against further reification of construct boundaries.
From a policy perspective, there has often been a tendency to focus on diagnosis (Costello, 2015), while many have suggested attention is needed to a broader set of domains, particularly when considering early identification in general population samples (Bartels et al., 2013; Greenspoon & Saklofske, 2001; Iasiello & Agteren, 2020).
Indeed, positive mental health is increasingly assessed in large epidemiological studies (e.g., NHS Digital, 2018; Patalay & Fitzsimons, 2018). Given the need to answer the question of what should be considered under adolescent GMH, we did not seek an exhaustive definition prior to conducting our review (Black, Panayiotou, & Humphrey, 2020). Nevertheless, in the following paragraph we make explicit the considerations that informed the meta-review (see also the eligibility criteria expanded in the Supporting Information).
Symptoms of mental ill-health, but not individual disorders, were considered relevant. We adopted this approach because of the need for brief general approaches, consistent with previous reviews (Bentley et al., 2019; Deighton et al., 2014). Following these reviews of adolescent GMH, we also considered positive mental wellbeing, including affect via models such as subjective wellbeing, and quality of life. However, the aims and scope of our meta-review, as well as issues raised in prior literature, meant it was important to impose some restrictions on wellbeing not included in these two prior reviews. The diffuse nature of eudaimonic wellbeing means it can be difficult to disentangle whether subdomains represent functioning or are predictors (Kashdan et al., 2008). Since there is a particular need to provide insight into measures for prevalence and response to intervention in adolescence, we argue non-general domains of eudaimonic wellbeing, such as perseverance, which might be a mechanism, should not be considered part of GMH. Similarly, though prior reviews have included entire quality of life measures, we felt it was important to consider only subdomains more clearly focused on mental health. These restrictions were also designed to keep the range of content relatively small, so as not to artificially inflate the range of item content by including potentially proximal domains. While adolescent GMH measure reviews have been conducted (Bentley et al., 2019; Deighton et al., 2014), these have not analysed content or provided robust psychometric ratings at the measure level. Furthermore, other measure reviews have often looked at narrower domains within GMH (e.g., Proctor et al., 2009), but this work across adolescent GMH has yet to be brought together. This is important since outcomes within GMH are sometimes referred to or treated interchangeably (Fuhrmann et al., 2021; Orben & Przybylski, 2019), and can be conceptually similar (Alexandrova & Haybron, 2016; Black et al., 2021). A meta-review to consider conceptual and broader psychometric issues is therefore timely.
Furthermore, we argue the assessment of item content (e.g., the symptoms, thoughts, behaviours and experiences that are considered by measures) is a key omission. For instance, some researchers and practitioners may have clear theories about why one domain of GMH in particular is of interest (e.g., affected by an intervention). However, without explicit attention to content, results may be selected in a more data-driven way. While it is the norm to register primary outcomes in trials, in adolescent mental health some recommend that multiple measures be explored for sensitivity (Horowitz & Garber, 2006). Observational studies also often collect multiple similar domains (e.g., NHS Digital, 2018). While such exploratory approaches play an important role, and flexibility can occur even after registration (Scheel et al., 2020), we suggest the content of measures should be attended to, particularly when measures are combined. Before inferences are made about constructs, we must gain a better understanding of how measures relate conceptually, to increase transparency and validity.
Conceptual and psychometric insights are also vital given the recognised noisiness of adolescent mental health data (Wolpert & Rutter, 2018). Developmental considerations are particularly important when considering self-report in this age range. For instance, issues such as inappropriate reporting time frames could introduce confusion (Bell, 2007; de Leeuw, 2011), or contribute to heterogeneity in assessments. Consider a case where a symptom measure (including, e.g., depression) shows significant improvement after intervention but a wellbeing measure does not. If the wellbeing measure covers theoretically distinct content, or the measures have differing reference periods, this is more likely to be a robust finding. However, if, for instance, both cover depression, affect or other indicators which could appear in either domain (Alexandrova & Haybron, 2016), this is less likely to be the case.
Another measurement issue which is gaining increased attention, but has yet to be considered for adolescent GMH, is the appropriateness of scoring mental health constructs by adding heterogeneous experiences together (Fried & Nesse, 2015). Since GMH is by definition likely to be broad, and measure developers can fall prey to data-driven over conceptual considerations when selecting items (Alexandrova & Haybron, 2016; Clifton, 2020), it seems crucial to consider psychometric and conceptual issues together. In particular, the relative conceptual homogeneity within a measure might be considered useful context when assessing its statistical consistency.
The issues laid out above speak to contemporary debates: To aid comparison across studies there have been calls for common measures (Wolpert, 2020). However, a key problem is that different measures are likely appropriate for different contexts (Patalay & Fried, 2020). We argue the choice of measures for individual studies or to standardise across studies should be informed by analyses such as those reported here.

METHOD
A systematic search was conducted to identify adolescent GMH measures following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We registered a number of research questions. For clarity of reporting, these are grouped here into three overarching areas: theoretical domains, content analysis, and measurement properties. For theoretical domains, we considered which were included in GMH in reviews. We defined several units of analysis. First, we use the term theoretical domains to refer to constructs described at the review level (e.g., life satisfaction); we grouped included reviews into theoretical domains inductively. Second, we use indicator to refer to specific question types capturing individual symptoms, thoughts, behaviours or experiences (e.g., sadness). Finally, we use broad themes to classify whether items tapped emotional, physical, social, cognitive or behavioural content.

We searched PsycINFO, MEDLINE, EMBASE, the COSMIN database, Web of Science, and Google Scholar; the reference lists of eligible studies were also searched. Search terms relating to the population (e.g., adolescen* OR youth*, etc.), measurement (e.g., survey* OR questionnaire*, etc.), and construct of interest (e.g., 'mental health' OR wellbeing, etc.) were combined using the AND operator. Where databases allowed, hits were limited to reviews, and to English, since we aimed to review English-language measures validated with English speakers.
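For illustration, combining one term group from each of the three areas in this way would yield a search string of the following form (the terms shown are only the examples given above, not the full registered search strings):

(adolescen* OR youth*) AND (survey* OR questionnaire*) AND ("mental health" OR wellbeing)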
To appraise the methodological quality of reviews from which we drew measures, we employed the quality assessment of systematic reviews of outcome measurement instruments tool (see Table S1 in the Supporting Information; Terwee et al., 2016). This provides a rubric for quality which covers aspects including clarity of aims, suitability of search strategy, and thoroughness of screening.

For indicator coding, we aimed to code at a semantic level. However, given we could not be blind to the intended content of measures (e.g., measures' titles could give this away), coding could not be entirely inductive (Braun & Clarke, 2006). A hybrid approach allowed initial coding to be either specific or broad, with some codes collapsed into more general categories in subsequent coding, and others split up.
After the initial meeting, the first author generated a full set of preliminary codes for all included items which were reviewed by the other authors. These were refined into a final set through discussion.
In the final coding (https://osf.io/k7qth/), we aimed to collapse as much as possible without losing information. This was to avoid false positive differences between measures.
Wherever possible, items were given a single indicator code, but for items assessing more than one experience (e.g. sadness and worry), two indicator codes were assigned.
Following indicator coding, each item was also assigned one or more broad themes (e.g., losing sleep over worry was considered physical and emotional). This allowed a more conservative assessment of the similarity of measures, and was particularly important for our assessment of conceptual homogeneity. This approach is also consistent with much of the psychometric theory which typically underpins the measures we were interested in, namely that each item contributes information on the same state (Raykov & Marcoulides, 2011), making conceptual assessment of broader dimensions important alongside individual indicators.
As has been done elsewhere, similarity between measures was calculated via the Jaccard index (Fried, 2017). This index is the number of common indicators divided by the total number of indicators across a pair of measures, and thus reflects overlap from 0 (no overlap) to 1 (complete overlap). To calculate the index, each measure gains a 1 or 0 for the presence or absence of a given indicator (regardless of frequency), making the index unweighted. This was desirable to avoid biasing construct dissimilarity through our strategy of often including whole measures for domains like symptoms, but only shorter subscales from quality of life measures. For items with double codes, both indicators were included for a given measure.
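To make the computation concrete, the following minimal sketch (in Python; the indicator sets shown are hypothetical examples, not the review's coding, whose actual code and data are at https://osf.io/k7qth/) derives the unweighted index from presence/absence coding:

```python
# Minimal sketch of the unweighted Jaccard index between two measures.
# Each measure is reduced to the SET of indicators it contains (presence/
# absence only), so item frequency does not weight the index, and items
# with double codes simply contribute both of their indicators to the set.

def jaccard(measure_a: set, measure_b: set) -> float:
    """Shared indicators divided by all indicators across the pair."""
    union = measure_a | measure_b
    return len(measure_a & measure_b) / len(union) if union else 0.0

# Hypothetical indicator sets for two measures (illustrative only).
measure_a = {"sadness", "worry", "enjoyment", "self-worth"}
measure_b = {"sadness", "worry", "sleep", "concentration"}

print(round(jaccard(measure_a, measure_b), 2))  # 2 shared / 6 total = 0.33
```

Note that with 22 measures there are 22 × 21 / 2 = 231 unique pairs, so the 6.06% of pairs exceeding 50% similarity reported in the abstract corresponds to roughly 14 pairs.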
Though we initially intended to conduct secondary searches for psychometric evidence (Black, Panayiotou, & Humphrey, 2020), we instead opted to use primary psychometric studies of included measures cited in reviews. This was more feasible, was supported by the quality of reviews (which tended to use a range of databases and appropriate terms for measurement properties, and to clearly describe eligibility; see Table S1 in the Supporting Information), and was aided by the frequent inclusion of measures in several reviews (see Figure 2). We reported only psychometric properties analysed in samples consistent with our criteria (e.g., not clinical samples or other age ranges), and included only studies reporting on relevant COSMIN elements at the level we considered (subscales or whole measures). All references and raw psychometric information extracted can be found at https://osf.io/k7qth/.
We used the COSMIN rating system for psychometric properties (Mokkink et al., 2018), which provides a standardised framework for grading the psychometric properties of measures in systematic reviews. It recommends consideration of content validity, structural validity, internal consistency, measurement invariance, reliability, measurement error, hypothesis testing for construct validity, responsiveness, and criterion validity. A few adaptations were necessary in the current study and are described in the Supporting Information. Ratings take the form +, −, +/− (inconsistent), or ? (indeterminate); where no information was available, we rated no evidence (NE).
In order to address statistical/conceptual consistency, we assessed whether measures/subscales were conceptually homogeneous (H). We considered homogeneity to be present where only one broad theme was assessed. This was combined with statistical consistency (S), which we considered to be present where measures scored at least +/− for both structural validity and internal consistency. Measures could therefore be H+S+, H−S+, H+S−, or H−S−.
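A minimal sketch of this classification logic, assuming a measure's broad-theme count and COSMIN-style ratings have already been derived (the function and example values are illustrative, not the review's own code):

```python
# Sketch of the H (conceptual homogeneity) / S (statistical consistency)
# classification. Ratings use the COSMIN-style symbols described above:
# "+", "-", "+/-", "?" (indeterminate), "NE" (no evidence).

SUFFICIENT = {"+", "+/-"}  # at least +/- required for S, per our criterion

def hs_class(n_broad_themes: int, structural: str, internal: str) -> str:
    h = "H+" if n_broad_themes == 1 else "H-"  # homogeneous = one broad theme
    s = "S+" if structural in SUFFICIENT and internal in SUFFICIENT else "S-"
    return h + s

print(hs_class(1, "+/-", "+"))  # H+S+ (single theme, adequate consistency)
print(hs_class(3, "?", "+"))    # H-S- (heterogeneous, structure indeterminate)
```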

RESULTS

Review-level results
A flowchart of the review stages is presented in Figure 1, with the primary reason for exclusion reported for full texts. The review resulted in the inclusion of 19 reviews and 22 measures (see also Table 1). The number of measures reflects collapsing different versions of the same measure; we only extracted multiple versions where these were explicitly selected in reviews. The only measure for which multiple versions were reported on across reviews was KIDSCREEN, for which both the 52- and 27-item versions were therefore extracted. We also did not count individual subscales separately. Therefore, while KIDSCREEN had multiple relevant versions and subscales, it was only counted as one measure (details of subscales can be found in Table 1).
Results of the quality assessment indicated mixed quality (see Table S1 in the Supporting Information). For instance, the vast majority of studies (94.74%) defined the construct of interest, and used multiple databases. However, reviewing, quality assessment and extraction of psychometric properties were often not clearly reported or were conducted only by a single researcher. Results are therefore in line with the general field of measure reviews (Terwee et al., 2016).
We included all criteria set out by Terwee et al. (2016). Since we had a specific age criterion, 100% of studies reported the population of interest. Nevertheless, several reviews explicitly noted developmental considerations (Harding, 2001; Kwan & Rickwood, 2015; Rose et al., 2017), suggesting this had been considered in some detail.

Indicator and broad theme coding
The first round of item coding to describe items at the experience level (e.g., happiness) generated 45 codes, which were then collapsed into a final set of 25 (see Figure 3). Since we had 285 items, the final reduction to indicators was substantial at 91.23%. The initial codes were typically more granular than the final set. For instance, aggression and rule-breaking were initially each assigned a single code, but these were combined in the final set (see https://osf.io/k7qth/ for the initial and final codes with descriptions). Similarly, the final emotion intensity/regulation code covered getting upset easily/impatience/strong positive and negative emotional responses/excited.
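As a worked check, the reduction figure follows directly from the item and indicator counts:

\[
\frac{285 - 25}{285} = \frac{260}{285} \approx 0.9123 \quad (91.23\%)
\]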

Psychometric properties
The psychometric properties of measures are shown in Table 3.
There was no evidence available for measurement error for any measure, so this property was omitted. Only six measures (27.27%) scored positively for content validity, a fundamental property (Mokkink et al., 2018). These measures all also scored favourably for construct validity, though no further positive results were found for them, suggesting overall low quality. This was echoed in mostly poor HS scores (i.e., a lack of support for conceptual and/or statistical consistency), which are shown in Table 3. For the 14 measures with clear time frames, all but one considered periods of 1-4 weeks (see Table 1).

FIGURE 1 Flow diagram of review process.

DISCUSSION

Content analysis
A few indicators stood out as appearing in >50% of measures and 80%-100% of domains: happy/sad, enjoyment, fear/worry, and self-worth. This suggests these may be broadly useful, since validation processes have frequently led to their inclusion as indicators of GMH. The reduction from items to indicators was also greater than in comparable content analyses of other mental health constructs (Chrobak et al., 2018; Fried, 2017; Hendriks et al., 2020; Visontay et al., 2019). The percentage reduction inevitably reflects how conservative coding was, though all studies described being cautious. We also saw the full range of overlap at the measure level, whereas the aforementioned studies had smaller ranges (0.26-0.61).
We found some pairs of measures showed >50% similarity, often across theoretical domains. In terms of standardising measurement to capture the range of GMH, no single measure or domain represented the entire spectrum. As discussed, we aimed to collapse codes wherever possible, emphasising the starkness of this finding. There were therefore no obvious candidates to be used as common metrics. The measures with the highest number of broad themes (see Table 3) also tended to have the most indicators (e.g., YOQ had the most with 15, while GHQ, WEMWBS, PANAS and SDQ all had nine; see Figure 3).
However, these higher-indicator measures were not interchangeable, with the greatest similarity between YOQ and SDQ at 50% (see Figure 5, code and data, https://osf.io/k7qth/). The inconsistency found at the review level is therefore reflected in our content findings. In terms of content, measures within theoretical domains are mostly not interchangeable, while some typically understood to capture different domains could be. This is of vital significance given the leap usually made from measure to construct when discussing findings, and makes clear potential problems of generalisability (Yarkoni, 2020). Again, we recommend researchers and practitioners assume measures and constructs are relatively unrefined and factor this into analysis and treatment decisions.

FIGURE 2 Summary of measures and reviews. Measures' full names can be found in Table 1.

Psychometric properties
Psychometric evidence was frequently lacking and COSMIN scores were low. Our results also confirm the general tendency to report only basic structural evidence (Flake et al., 2017). These findings highlight a lack of sufficient attention to development practices in adolescent GMH. Though construct validity was frequently reported and positive, it should be treated with some caution, since it has been suggested the type considered in the COSMIN rubric may not be valid if content and structural validity have not been established (Flake et al., 2017), as was often the case here. In other words, without evidence that key stakeholders have been involved in developing the construct and/or measure, and that items statistically cohere, the fact that a given measure correlates with other similar outcomes is not very informative. Of the measures which scored positively for content validity, only KIDSCREEN and Healthy Pathways evaluated structural validity, scoring +/− and − respectively. Therefore, no measure benefited from both sound consultation work and clear evidence that items successfully tapped a common construct.
Life satisfaction seemed particularly psychometrically problematic. Quality of life and outcome-focused symptom measures, on the other hand, showed better content validity; that is, at a minimum they involved young people at some stage of measure development. Given that our content analysis revealed measures labelled as the same domain were often not interchangeable, while those from separate ones could be, stakeholder work on conceptualisation, and structural analysis to confirm and support this, are arguably all the more necessary. Only measures' reference periods seemed to be relatively consistent and recent, in line with recommendations for this age group, though specific wording in individual measures could still introduce inconsistencies between measures or confusion (Bell, 2007).

Conceptual and statistical coherence
As noted above, statistical coherence was typically unclear or poor.
Similarly, though measures/subscales were recommended for sum scoring, they tended to cover more than one broad theme, suggesting conceptual unidimensionality was untenable. It is likely measures/constructs with thematic heterogeneity are not well suited to internal consistency metrics or sum scoring (Fried & Nesse, 2015). Similarly, reliability should only be prioritised by developers within theoretical units, since otherwise statistical reliability can be introduced via wording or other artefacts rather than genuine structural validity (Clifton, 2020).

TABLE 2 Average Jaccard indexes within (diagonal) and between (lower triangular) domains.
Most measures covered more than one broad theme, and failed to meet our +/− COSMIN criterion for both structural validity and internal consistency (H−S−). We recommend such measures are not sum scored since this is not supported theoretically or statistically.
Heterogeneous constructs may be desirable, particularly for GMH, given one of its highlighted benefits is to provide broad insight (Deighton et al., 2014). We therefore question the (assumed) logic of total sum scores in this area. While items from measures included in this review could provide insight into GMH via methods other than sum scoring (e.g., network models or selecting particular items), further work is needed to validate such approaches, and to confirm that developmental issues were unlikely to be driving down statistical similarity between items. For instance, age appropriateness can be a particular concern and may negatively impact psychometric properties (Black, Mansfield, & Panayiotou, 2020).
Only EPOCH (happiness subscale) and AFARS (negative affect) covered only a single broad theme and met our +/− COSMIN criterion for both structural validity and internal consistency (H+S+).
These subscales are likely more appropriate for sum scoring. However, this benefit comes at the cost of fewer GMH indicators (the EPOCH happiness subscale contains four, and the AFARS negative affect subscale three). Again, this speaks to the lack of readiness of this field to land on common metrics. Additionally, these measures are by no means likely to be ideal in all scenarios.
In particular, they are both potentially limited by not scoring positively for content validity. Our HS scoring system should therefore not be used to rank measures but be considered alongside issues such as indicators of interest and analytical approach.

Strengths and limitations
This study systematically drew on a large body of systematic reviews, and therefore provides broad coverage of relevant measures and their properties. Novel conceptual and psychometric insights are provided through this approach. While some work has provided robust psychometric evaluation (Bentley et al., 2019), this was at the study level, while we were able to combine studies to provide more comprehensive ratings at the scale level. We also went beyond previous work by considering in detail which elements of quality of life were relevant to GMH, rather than providing information at the measure level (i.e. general quality of life) as has been done previously (e.g., Deighton et al., 2014). We therefore provide novel insight into the specific conceptual overlap of quality of life subdomains with other domains of GMH, as well as which subscales can be extracted and scored.
The current study provides a wealth of information for researchers and practitioners. Given the scope of such a project, some compromises were made. First, we were unable to conduct secondary searches for validation studies and therefore relied on the quality of searches conducted in reviews. Since we did not conduct secondary searches ourselves, we cannot be certain relevant papers were not missed. However, our meta-review strategy meant that measures were picked up in multiple reviews (see Figure 2). Similarly, since we brought together work from previous reviews, rather than conducting searches for measures, relevant measures/versions have inevitably been missed. For instance, we are aware that the shorter version of WEMWBS has undergone some validation with adolescents (McKay & Andretta, 2017), but this was not picked up in any review. Our list of measures should therefore not be considered exhaustive, but should be viewed in light of the quality of the included reviews (see Table S1 in the Supporting Information). Second, for feasibility, we did not assess potential methodological bias in validation papers, but rather rated only psychometric quality. When selecting measures, we recommend researchers and practitioners consider this aspect in more detail and attend to factors such as the similarity of the validation sample to their application. Third, our assessment of homogeneity was somewhat crude. However, we based this on broader themes rather than indicators to take into account relationships between indicators. Considering themes rather than indicators was therefore conservative and less likely to underestimate homogeneity and appropriateness for sum scoring.

Conclusion and recommendations
Conceptualisation was found to be problematic in adolescent GMH: measures were inconsistently classified within domains, indicators were often found across many of these, measures within domains were often not more similar to each other than to measures from other domains, and appropriate consultation practices were often not conducted. While GMH covered a diverse set of indicators, a relatively small number of these described the items we studied. This relatively narrow range, compared to, for example, depression measures (Fried, 2017), was also seen in measurement time frames, and most items considered emotional content, whereas work looking at disorder measures found greater heterogeneity for these aspects. This suggests it may be possible to assess self-report GMH briefly, though the psychometric work (including conceptualisation) needed to underpin this is currently largely lacking. It also suggests a singular approach to GMH could be appropriate, given that attempts to create distinct subdomains have resulted in common indicators.
Most measures also lacked sufficient psychometric evidence, and no measure or domain represented the entire spectrum of indicators.
These factors make selection and interpretation of measures challenging in research and clinical applications, and suggest the field is not ready for a common metric. Furthermore, the lack of clear conceptualisation combined with insufficient psychometric evidence suggests the risk of measurement artefacts in applied research and clinical outcome monitoring is high. This suggests critical interrogation of existing findings is needed, and that progress may be limited by the measurement landscape.
We recommend that where assessment of GMH is the goal, new measures be developed, or existing ones revised. Our review provides strong groundwork for this by identifying the range of indicators that are likely theoretically relevant. Such analysis has been used to develop general measures for adults. In terms of selecting domains, symptom measures captured a broader range, likely because some symptoms do not have theoretical positive poles. Researchers and practitioners should therefore consider whether theoretical breadth is important, whether the individual items are of interest, and whether they wish to sum score (this could be problematic for diverse item sets). Our findings also underscore that a single measure cannot be selected to represent any domain conceptually (given inconsistency within these). However, in terms of psychometrics, the following measures had at least evidence of content and construct validity: YP-CORE, JWHS-76 and YOQ (symptoms), KIDSCREEN and Healthy Pathways (quality of life), and PANAS-C (affect). It is difficult to determine the relative psychometric quality of the wellbeing measures reviewed given the lack of content validity evidence, though EPOCH (happiness) may be promising, given its match between conceptual and statistical coherence. From a GMH perspective, we recommend life satisfaction measures are avoided, as these are psychometrically the weakest and show poorer coverage of GMH indicators. We recommend researchers and practitioners considering measures we reviewed draw on our code and data to assess specific content and properties relative to their context. Finally, our analysis suggests that researchers should not combine measures from different domains without accounting for likely similarity, and acknowledging potential systematic overlap due to common content.

ACKNOWLEDGEMENT
There is no funding to report for this study.

CONFLICT OF INTEREST
The authors have declared that they have no competing or potential conflicts of interest.

OPEN RESEARCH BADGES
This article has been awarded the Open Materials and Preregistered badges. All materials and data are publicly accessible via the Open Science Framework (https://osf.io/k7qth/).

DATA AVAILABILITY STATEMENT
Open materials relating to method/results of the review are provided in Supporting Information.

ETHICAL CONSIDERATIONS
Not applicable to this study.
2 Though we treated these as a single group, quality of life measures were noted to include subscales for the following domains that met our criteria: symptoms (CHQ, KIDSCREEN, and PedsQL), wellbeing (KIDSCREEN), life satisfaction (Healthy Pathways [HP] and Youth Quality of Life [YQoL]), and psychological quality of life (KIDSCREEN and WH-QoL), which contained a mixture of positive and negative indicators.