Measurement instruments for the core outcome set of congenital melanocytic naevi and an assessment of the measurement properties according to COSMIN: a systematic review

Background: Congenital melanocytic naevi (CMN) can affect patients' lives through their appearance and the associated risk of neurological complications or melanoma development. The development of a core outcome set (COS) will allow standardised reporting and enable comparison of outcomes, which will help to improve guidelines. In previous research, relevant stakeholders reached a consensus on which core outcomes should be measured in any future care or research. The next step of the COS development is to select the appropriate measurement instruments.
Aim: Step 1: to update a systematic review identifying all core outcomes and measurement instruments available for CMN. Step 2: to evaluate the measurement properties of the instruments for the core outcomes.
Methods: This study was registered in PROSPERO and performed according to the PRISMA checklist. Step 1 comprised a literature search in EMBASE (Ovid), PubMed and the Cochrane Library to identify core outcomes and instruments previously used in CMN research. Step 2 comprised a systematic search for studies on the measurement properties of instruments that were either developed or validated for CMN, including a methodological quality assessment following the COSMIN methodology.
Results: Step 1 included twenty-nine studies. Step 2 yielded two studies, investigating two quality of life measurement instruments.
Conclusion: Step 1 provided an overview of outcomes and instruments used for CMN. Step 2 showed that additional research on measurement properties is needed to evaluate which instruments can be used for the COS of CMN. This study informs the instrument selection and/or development of new instruments.



Introduction
Congenital melanocytic naevi (CMN) are birthmarks that are present at birth or appear soon after. CMN are associated with an increased risk of melanoma, neurological complications and/or psychological burden due to their appearance [1][2][3]. Treatment of CMN is either conservative (watchful waiting including histology) or interventional (full thickness: excision; partial thickness: laser, curettage or dermabrasion). The outcomes measured to evaluate the treatment of CMN are heterogeneous in care and research, which impedes the comparison and pooling of these outcomes 4. This complicates the formulation of guidance on optimal management.
The aim of the Outcomes for Congenital Melanocytic Naevi (OCOMEN) project is to develop a core outcome set (COS) for measuring the outcomes of all treatment options for medium, large and giant CMN for care and research 5, 6. A 'COS' is a consensus-derived minimum set of outcomes that should be measured and reported in all care and clinical trials of a certain health condition 7, 8. The use of a COS may enhance homogeneity in the reporting of outcomes and measurement instruments in future studies and could therefore facilitate evidence synthesis for recommendations on conservative and interventional treatment.
In this study, we define 'domains and outcomes' as aspects of a disease that could be measured to evaluate different management strategies. 'Domains' are broader aspects of a disease, whereas 'outcomes' are more precise aspects of a disease on a lower hierarchical level; for example, 'presence of melanoma' is an outcome of the domain 'neoplasm'.
Patients included in the OCOMEN project are those presenting with either M1 (1.5-10 cm projected adult size (PAS)) on the face or M2 (>10-20 cm PAS) elsewhere, either single or multiple. The COS will be developed for international use in order to evaluate both interventional treatment and conservative treatment. In a recent consensus procedure, relevant stakeholders reached a consensus on the core domains and outcomes that need to be measured in the COS (Table 1) 5, 6, 9. The next step in the development of the COS is to reach a consensus on how these domains must be measured (the core outcome measurement set (COMS)). The first step of developing the COMS is to identify all instruments previously used to measure core domains and outcomes and to evaluate the quality of the measurement properties of the instruments available for the core outcomes. A previous systematic review summarized all outcomes and measurement instruments used in CMN research between 2006 and 2019 and included sixty-three individual studies 4. This study, as

This study consists of two steps:
Step 1: A systematic review to identify and describe the outcomes and instruments used in previously published studies for CMN, as an update of a previously performed systematic review 4. The previous systematic review included all outcomes and instruments used; the update focuses only on the outcomes of the COS and their instruments.
Step 2: A systematic review to evaluate the quality of the measurement instruments developed or validated for domains and outcomes of the COS of CMN.
Both these steps were registered in PROSPERO, registry number CRD42021238242, and reported according to the PRISMA checklist. The design of the systematic review was based on the guidelines of the Core Outcome Measures in Effectiveness Trials (COMET) initiative and the Cochrane Skin Group Core Outcomes Set Initiative (CS-COUSIN). The Consensus-based Standards for the Selection of Health Status Measurement Instruments (COSMIN) methodology and guidelines were used to critically appraise the measurement properties of instruments. The OCOMEN project was registered in the COMET initiative database.
Step 1: Identification and description of instruments used in previously published studies

Search strategy, quality assessment and data extraction
This first step is an update of a previously performed systematic review in which a list of domains, outcomes and measurement instruments used in CMN research published between 2006 and 2019 was identified 4. The search strategy used in both the current and the previously performed systematic review was developed with the help of an information specialist (FE) and was performed in EMBASE (Ovid), PubMed and the Cochrane Library. The complete search strategy can be found in Appendix 1. The search for the current systematic review covered the period between January 2019, which marked the end date of the previously performed systematic review 4, and February 2021.
The inclusion criteria of the previous systematic review were adopted for this study. We included all studies with ten or more patients that were written in English or Dutch. We excluded case reports, conference reports and books. Study selection was performed by two independent reviewers (ACF and TB), and disagreements were discussed with a third reviewer. Quality assessment of the included studies was performed independently by two researchers (ACF and TB) according to the level of evidence guidelines set by the Oxford Centre for Evidence-based Medicine 10. Any disagreement regarding a study's level of evidence was resolved by discussion.
We extracted the following data: study characteristics (author, year, country, study design, intervention, number of subjects with CMN and classification system used for CMN), core domains, core outcomes and their measurement instruments. Unlike the previously performed review, we only extracted the core outcomes and the measurement instruments for the core outcomes. When diagnoses other than CMN were included in the studies, only data from CMN subjects were extracted. Data extraction was conducted independently by two reviewers (ACF and TB). Disagreements were resolved by discussion, or a third reviewer was consulted.

Data synthesis
Data on domains, outcomes and measurement instruments were extracted. Descriptive statistics were used to calculate the frequency of outcomes. Measurement instruments were labelled as clinician-reported instruments or patient-reported outcome measurement instruments (PROMs).
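The frequency calculation described above can be illustrated with a minimal Python sketch. The study labels and extraction records below are hypothetical placeholders, not data from this review, and the function name is an assumption made for illustration only.

```python
from collections import Counter

# Hypothetical extraction records: one entry per included study, listing the
# core outcomes it reported. Labels and contents are illustrative only.
extracted = {
    "study_a": ["presence of melanoma", "scar problems"],
    "study_b": ["presence of melanoma"],
    "study_c": ["emotional distress", "scar problems"],
    "study_d": ["presence of melanoma", "emotional distress"],
}


def outcome_frequencies(records):
    """Return {outcome: (number of studies, percentage of studies)}."""
    n_studies = len(records)
    counts = Counter()
    for outcomes in records.values():
        # Count each outcome once per study, even if it was listed twice.
        counts.update(set(outcomes))
    return {
        outcome: (n, round(100 * n / n_studies))
        for outcome, n in counts.items()
    }


frequencies = outcome_frequencies(extracted)
```

Counting per study rather than per mention keeps the percentages interpretable as "share of studies reporting this outcome".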
Step 2: Evaluation of the quality of measurement instruments developed or validated for CMN

Search and study selection
A search was performed in MEDLINE and EMBASE to identify development and validation studies of instruments for CMN that measured the core outcomes. It used the same controlled terms and words for the concepts of CMN that were used for the search strategy of Step 1 (Appendix 1), including a validated search filter for finding studies on measurement properties, developed by Terwee et al. (sensitive version, Appendix 2) 11 .
Only studies reporting on the evaluation of at least one measurement property of an instrument used or developed for CMN were included. The COSMIN taxonomy was used to determine which of the following measurement properties of an instrument were evaluated: structural validity, internal consistency, reliability, hypotheses testing, cross-cultural validity and/or responsiveness 12, 13. We included both clinician-reported instruments and PROMs, including rating systems, questionnaires, medical devices or other instruments.
The following data were extracted independently by two reviewers (ACF and TB): study characteristics, patient characteristics, evaluated instruments, aspects of the measurement properties investigated and feasibility aspects of the instruments. Discrepancies were discussed with a third reviewer until a consensus had been reached.

Evaluation of the methodological quality of the included studies
The COSMIN Risk of Bias checklist was used to evaluate methodological quality of the included studies 12 , 13 . Studies were stratified as having very good, adequate, doubtful or inadequate methodological quality.
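As an illustration of how item-level ratings can be combined into one of the four categories above, the following Python sketch applies the "worst score counts" principle described in the COSMIN manual. The text above does not spell out this aggregation step, so it should be read as an assumption; the function name and the "not applicable" label are hypothetical.

```python
# Rating categories from best to worst, as named in the text above.
RATINGS = ["very good", "adequate", "doubtful", "inadequate"]


def overall_quality(item_ratings):
    """Return the lowest rating among the applicable checklist items.

    Assumption: the overall quality of a study on one measurement property
    is the worst rating given to any item ("worst score counts").
    """
    applicable = [r for r in item_ratings if r != "not applicable"]
    if not applicable:
        raise ValueError("no applicable items to rate")
    # A higher index in RATINGS means a worse rating, so max() picks the worst.
    return max(applicable, key=RATINGS.index)
```

For example, a study with item ratings "very good", "adequate" and "very good" would be stratified as "adequate" under this rule.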

Assessment of measurement property results, best evidence synthesis and generating recommendations
Two authors (ACF and CML) independently rated the results of each study on a measurement property against the criteria for good measurement properties as either sufficient (+), insufficient (-) or indeterminate (?), as recommended by COSMIN 14, 15.
Results were summarized to produce an overall rating for each individual measurement property of every instrument. Next, the Grading of Recommendations, Assessment, Development and Evaluation (GRADE) approach was used to grade the quality of the evidence and thereby the trustworthiness of the results. Risk of bias (as determined using the COSMIN Risk of Bias checklist), inconsistency of the study results on measurement properties across studies, and a small sample size could each downgrade the evidence quality rating 14.
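The downgrading logic can be sketched as follows. This is a simplified illustration, assuming that the evidence starts at "high" and drops one level per downgrading step; the number of levels to subtract for each factor is left to the reviewer's judgement and is passed in as an argument. The function and parameter names are hypothetical.

```python
# GRADE levels of evidence quality, from best to worst.
LEVELS = ["high", "moderate", "low", "very low"]


def grade_evidence(risk_of_bias=0, inconsistency=0, imprecision=0):
    """Return the evidence level after downgrading.

    Each argument is the number of levels (0 or more) by which the reviewer
    downgrades for that factor. Assumption: grading starts at "high" and
    cannot fall below "very low".
    """
    steps = (risk_of_bias, inconsistency, imprecision)
    if any(s < 0 for s in steps):
        raise ValueError("downgrading steps must be non-negative")
    total = sum(steps)
    return LEVELS[min(total, len(LEVELS) - 1)]
```

For example, one level for risk of bias plus one level for a small sample size would move the evidence from "high" to "low".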
Methods for generating recommendations for the measurement instruments of outcomes used for CMN were based on the methodological quality of the included studies and on the adequacy of an instrument. Four degrees of recommendation were assigned to the instruments included in this review (A-D) and adopted from previously performed studies 16 , 17 : category A, meets all requirements (positive rating for all boxes/measurement properties in the best evidence synthesis) and is recommended for use; B, meets two or more required quality items, but performance in all other required quality items is unclear, so the instrument has the potential to be recommended, depending on the results of further validation studies; C, exhibits low quality in at least one required quality criterion ( ≥1 rating of 'minus') and therefore is not recommended for further use; D, almost not validated, its performance in all or most relevant quality items is unclear, so further validation studies are needed.
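The four categories can be read as a small decision rule. The Python sketch below is one possible interpretation, not the procedure of the cited studies: the order in which the rules are checked, and the treatment of properties that were not evaluated as indeterminate ('?'), are assumptions made for illustration.

```python
VALID = {"+", "-", "?"}


def recommend(ratings):
    """Map overall ratings per measurement property to a category A-D.

    `ratings` is a dict such as {"internal consistency": "?"}, using
    '+' (sufficient), '-' (insufficient) or '?' (indeterminate).
    Assumption: properties that were not evaluated are entered as '?'.
    """
    values = list(ratings.values())
    if any(v not in VALID for v in values):
        raise ValueError("ratings must be '+', '-' or '?'")
    if "-" in values:
        return "C"  # at least one 'minus': not recommended
    if values and all(v == "+" for v in values):
        return "A"  # all properties positive: recommended
    if values.count("+") >= 2:
        return "B"  # two or more positive, the rest unclear
    return "D"      # almost not validated


# A single indeterminate rating, as in the instruments assessed here.
category = recommend({"internal consistency": "?"})
```

Under these assumptions, an instrument with only an indeterminate rating for internal consistency falls into category D.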

Results
Step 1: Identification and description of instruments used in previously published studies

Search strategy, quality assessment and data extraction
The update from the previously performed systematic review yielded a total of 450 unique references after de-duplication. A total of 29 studies met the inclusion criteria, including 27 original studies with a total of 1938 patients and two systematic reviews. The selection procedure is illustrated in the flow chart of Figure 1 .
Patient and CMN characteristics of the included studies are listed in Appendix 3. Most studies were conducted in Asia (45%), followed by Europe (35%) and the USA/Canada (10%). Two studies were conducted in the Middle East and one in Egypt. Thirteen studies had a prospective study design (45%). A total of 12 studies were retrospective (41%). Two studies were cross-sectional (7%). Two systematic reviews (7%) were detected with a total of 35 studies.
Similar to the previously performed systematic review, the quality of the studies included in the update was generally low. Most studies (55%) were rated as level 3 evidence (low evidence). All other studies, 13 in total (44%), were rated as level 4 (very low evidence). The level of evidence was mainly low because of small patient groups, the absence of control groups and retrospective study designs.
The number of included patients ranged from 15 to 293 CMN patients in the update, and the female to male ratio was 1.35:1. The mean patient age, reported in 16 of the 29 studies, was 15.2 years (range 0-73 years).
As in the previous systematic review, we found that different classification systems were used for CMN. For location, most studies reported a particular part of the body, but body parts were sometimes classified together. Size was defined in the following ways: the diameter in centimetres in PAS (11 studies) and the percentage of the total body surface area (TBSA) (four studies). The classification of Krengel et al. was used in five studies. Two studies used the '6B rule' to classify the location of giant CMN. Twelve studies did not define size according to a certain classification system. Table 2 shows the frequency of the core outcomes reported in the 29 studies of the update and their frequency in the sixty-three studies included in the previous systematic review 4. Table 3 shows the measurement instruments used to measure the core outcomes found in the previously performed systematic review and the update, including information on the instrument, the target population, and whether it was a PROM or clinician reported.

Step 2: Evaluation of the quality of measurement instruments developed or validated for CMN

Search and study selection
The search provided 677 unique studies; Figure 2 shows the flow diagram of the study selection. Two studies met our inclusion criteria, with both evaluating one measurement property, internal consistency, of an instrument measuring the domain 'quality of life' 18 , 19 .
We did not find any development studies. Apart from 'quality of life', no studies were available on instruments developed or validated for the CMN population that measure the other core domains and outcomes. Moreover, no clinician-reported instruments, rating systems, medical devices or other instruments were developed or validated for CMN.

Evaluation of the methodological quality of the included studies
Both studies scored 'very good' for their methodological quality regarding the measurement property they assessed (Appendix 4).

Evaluation of the quality of the measurement properties, evidence synthesis and generating recommendations
The included studies evaluated the measurement property 'internal consistency' of the Paediatric Quality of Life Inventory (PedsQol) and the Children's Dermatology Life Quality Index (CDLQI) in order to measure the domain 'quality of life', including the outcome 'emotional distress' 18, 19. The following measurement properties were not evaluated: structural validity, reliability, hypotheses testing, cross-cultural validity and responsiveness. We did not find any study evaluating these measurement properties in other instruments used for the CMN population.
Masnari et al. studied internal consistency of the PedsQol. They recruited their patients worldwide and included 235 children with a mean age of 6.3 years and a mean TBSA score of 13.14 percent. About half of the included children did not have any surgery to remove the CMN.
Neuhaus et al. studied internal consistency of the CDLQI and recruited their patients worldwide as well. They included 163 patients. The mean age of children in their proxy-report group (4-18 years) was 9.3 years and in the self-report group (14-18 years) was 16.3 years. The mean TBSA scores were 13.6 and 16.1, respectively. More than half of the patients underwent partial removal of their CMN. Table 4 shows the rating of the results and level of evidence. Although most Cronbach's alpha scores were > 0.7, all ratings were scored as indeterminate due to the absence of "at least low evidence for sufficient structural validity", which is a requirement for a sufficient rating for internal consistency. Table 5 shows the feasibility aspects of these instruments. The best evidence synthesis is shown in Table 6. As only the internal consistency of these questionnaires had been evaluated, they received recommendation D, indicating that they were almost not validated: their performance in all or most relevant quality items is unclear, and further validation studies are needed.
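To make the rating rule concrete, the sketch below computes Cronbach's alpha from item-level scores and then applies the rule described above, in which internal consistency can only be rated sufficient when there is at least low evidence for sufficient structural validity. The threshold of 0.70 follows the COSMIN criteria for good measurement properties; the function names and the boolean flag are hypothetical, and the code is an illustration rather than the analysis performed in the included studies.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    if len(scores) < 2 or len(scores[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(scores[0])
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def rate_internal_consistency(alpha, structural_validity_sufficient):
    """Return '+', '-' or '?' for internal consistency.

    Without at least low evidence for sufficient structural validity the
    rating is indeterminate, whatever the value of alpha.
    """
    if not structural_validity_sufficient:
        return "?"
    return "+" if alpha >= 0.70 else "-"
```

With `structural_validity_sufficient=False`, as for both instruments assessed here, the function returns '?' even for a high alpha, which mirrors the indeterminate ratings reported in Table 4.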

Discussion
This study is the first step in selecting the core measurement instruments for the COS of CMN. We provide a systematic overview of the instruments used to measure core outcomes for CMN, as an update of a previously performed study 4. In addition, studies on measurement properties of instruments used for the CMN population were evaluated. We found wide heterogeneity in outcomes and measurement instruments in the included studies, and no study reported all core outcomes. We showed that research on measurement properties of these instruments is limited. Therefore, none of the instruments could be recommended based on the quality of their measurement properties, and further validation studies are needed.
Research on CMN is growing; this current update included twenty-nine studies published in a period of two years, while the previously performed systematic review included sixty-three studies in a period of twelve years 4. Uniformity is therefore of utmost importance to enable combination and comparison of studies. However, heterogeneity in outcomes still exists, highlighting the importance of a COS. Besides heterogeneity in outcomes, we found heterogeneity in CMN classifications as well. To enhance uniformity in CMN care and research, we recommend using the consensus-derived, internationally used classification developed by Krengel et al. 20, qualified by the "6B" 21 and "biker glove" 22 distributions for the CMN location. Relevant stakeholders should reach consensus over which instruments should be validated for CMN. In this process, the feasibility of instruments should also be considered; instruments should be easy and quick to use and should be low-cost or free of charge. Similar systematic reviews investigating the measurement properties according to the COSMIN checklist are available for diseases similar to CMN such as vitiligo, vascular malformations, capillary malformation and burn scars 17, 23-25. Although these studies also revealed a low quality of measurement instruments validated for their particular patient population, some of their recommendations may inform which instruments should be validated for CMN.
Table footnote: For each measurement property, the methodological quality of the study is reported as sufficient (+), insufficient (-) or indeterminate (?); NA, not available (analysis was not performed for this measurement property).
Recommendations: category A, meets all requirements (positive rating for all boxes in the best evidence synthesis) and is recommended for use; B, meets two or more required quality items, but performance in all other required quality items is unclear, so that the instrument has the potential to be recommended, depending on the results of further validation studies; C, low quality in at least one required quality criterion (≥1 rating of 'minus') and therefore is not recommended for further use; D, almost not validated, its performance in all or most relevant quality items is unclear, so further validation studies are needed.
The domain 'anatomy of the skin' or 'skin appearance' is often measured by disease-specific measurement instruments, probably because every skin disease has its own unique manifestations. For CMN, we found both objective instruments, such as L*a*b* colour-space model (CIE-LAB) measurements, and subjective rating systems (Table 3). The systematic reviews of similar anomalies revealed that 'skin appearance' is generally measured by questionnaires or rating systems completed by both clinicians and patients. These types of instruments are often low-cost and quick and easy to use. For vitiligo, the most effective instrument that measures the size of a lesion was the disease-specific (Self-Assessment) Vitiligo Extent Score ((SA)-VES) 26. For capillary malformation, there were only low-quality clinician-reported rating systems available 25. None of these rating systems were developed by asking patients (or their parents) to determine which outcomes are important to them 25. The systematic review for vascular malformations also showed low-quality rating systems 17. Therefore, a new PROM questionnaire is now in development: the Outcome Measures for Vascular Malformations (OVAMA) questionnaire 27. For burn scars, PROMs, clinician-reported rating systems and objective measurement instruments are all available 28. For instance, objective instruments to measure the colour of burn scars include reflectance spectroscopy (colorimetry/spectrophotometry), laser imaging and computerized analysis of digital photographs 29.
Various questionnaires are available to measure the domain 'quality of life', including the outcome 'emotional distress', in patients with a skin disease. To measure health-related 'quality of life', disease-specific instruments and generic instruments are available. In addition, for skin conditions, dermatology-specific questionnaires are available 30. Disease-specific instruments measure the impact of a specific condition on the different aspects of 'quality of life', while generic instruments measure the overall 'quality of life' of a subject, allowing comparisons between a group of patients with a certain disease and their peers in the general population. The systematic review evaluating 'quality of life' instruments for burn scars showed that burn scar specific instruments have the best measurement properties 24.
No disease-specific questionnaires are available for CMN. Rare diseases may be best measured with a generic 'quality of life' measurement instrument, as the development of a high-quality disease-specific instrument is hindered by the limited number of subjects available to validate the instrument. An existing generic instrument may be the best option for CMN, as there are various generic quality of life PROMs available. The systematic review for capillary malformations provisionally recommends the PROMs Perceived Stress Questionnaire (PSQ) or the Dermatology Life Quality Index (DLQI). The DLQI was proposed by the vitiligo group as well 23. The systematic review for vascular malformations states that the Short Form-36 (for adults) and PedsQol (for children) seem to be the most appropriate generic instruments 17. However, this same research group showed in a subsequent study that these questionnaires do not sufficiently measure effectiveness, i.e., change in the 'quality of life' before and after treatment. They therefore advise using the Patient-Reported Outcomes Measurement Information System (PROMIS) 27, 31. The use of PROMIS is advised for rare diseases and may be suitable for CMN [32][33][34]. PROMIS consists of item banks for every subdomain of 'quality of life', which have been extensively validated in large populations. An item bank is a large set of questions for multiple 'quality of life' outcomes. These item banks are available in short form and with computer adaptive testing. With computer adaptive testing, the most relevant questions for an individual are selected based on their previous answers. This reduces the number of questions and yields accurate, person-centred outcomes. In contrast to other generic instruments, PROMIS facilitates the measurement of the outcome 'emotional distress' without measuring the outcomes 'social and physical functioning'.
For measuring the domain 'neoplasm', a panel of stakeholders agreed that the core outcome 'presence of melanoma' should always be measured in care and research. In this study, we found the 'presence of melanoma' to be measured either by self-/proxy-report of patients or their parents through online questionnaires or by pathological confirmation. In future research, a consensus should be reached regarding whether melanoma should be confirmed by pathology for all research or whether a history reported by patients or parents is sufficient for survey studies.
The domain 'neurology' is defined by the outcome 'neurological symptoms and signs'. A consensus procedure with international stakeholders should be held to decide how neurological symptoms and signs should be measured. For instance, a questionnaire screening for the most common symptoms or signs could be used and/or stakeholders could decide that neurological examinations should be performed as a standard by, for example, a neurologist or paediatrician. None of the studies included in this study or the previously performed systematic review used a questionnaire for specific symptoms and signs of CMN patients 4. Questionnaires to measure developmental delay or epilepsy are available for clinicians and for patients [35][36][37][38]. Questionnaires to measure general neurological disorders are available and are frequently developed for patients in low- and middle-income countries [39][40][41]. If relevant stakeholders decide that a neurological questionnaire should be used for the COS, future research should assess the accuracy and feasibility of the questionnaires for neurological involvement in CMN patients or decide to develop a CMN-specific instrument.
The domain 'general adverse event' includes the core outcomes 'wound problems of the CMN' and 'scar problems'. Classifications such as the Common Terminology Criteria for Adverse Events (CTCAE), the Medical Dictionary for Regulatory Activities (MedDRA) or the Clavien-Dindo Classification can be consulted to define adverse events or classify their severity. For the outcome 'scar problems', the Patient and Observer Scar Assessment Scale (POSAS) was used in four CMN studies. A new version of the POSAS is currently being developed, in which the patients' opinion on scar appearance is implemented. A consensus with international stakeholders should be reached over which standard instrument and classification system should be used to report adverse events.
The importance of the outcome 'molecular characteristics' of the domain 'pathology' is growing in CMN research. A quarter of the studies included in this systematic review measured this outcome. Increasing knowledge regarding molecular characteristics of CMN could help in the future to estimate the risk of melanoma or neurological complications 42. Moreover, new pharmacological therapies may be developed that could be offered to patients with a certain DNA mutation 43, 44. We showed that various molecular characteristics are reported in the literature. For now, together with all relevant stakeholders, we have decided that all molecular characteristics that are already measured for care purposes should be documented in a standardised manner in CMN research.

Strengths and limitations
We systematically reviewed the availability and quality of measurement instruments for CMN according to the COMET, CS-COUSIN and COSMIN guidelines. We included a broad range of studies on CMN, including both outcomes and instruments for studies of interventional treatment and watchful waiting. A limitation could be that we only included studies written in English or Dutch; however, there is a wide geographical spread in the included publications. Because of the heterogeneity in the classification of CMN, we could not describe differences between measurement instruments used for different CMN size or location (visible/non-visible) categories.

Future perspectives
This systematic review was the first step of developing the COMS of the COS of medium-to-giant CMN care and research. Relevant stakeholders should reach a consensus over which measurement instruments should be used for the domains and outcomes of CMN. Firstly, relevant stakeholders should decide whether every domain and outcome should be clinician and/or patient reported and if questionnaires, rating systems, clinical devices or other instruments are needed. In addition, they should consider the feasibility of an instrument. Secondly, relevant stakeholders should decide which measurement instruments should be developed or validated for the CMN patient population. This study informs the instrument selection and/or the development of new instruments.

Declaration of Competing Interest
The authors have no other financial or personal relationships relevant to this study to disclose.