A systematic review of social functioning outcome measures in schizophrenia with a focus on suitability for intervention research

Social functioning is an important part of recovery and a key treatment target in clinical research in schizophrenia. Evaluating and comparing interventions is challenged by the choice of many measures which focus on different aspects of functioning, with little to guide selection. This results in difficulties comparing outcomes of treatment where studies have used different measures. To improve the measurement of social functioning in intervention research, we aimed to provide practical information on suitability of measures. We conducted a systematic review of measures developed or psychometrically evaluated since 2007, and assessed and discussed the structure, content, quality, and the use of the measures in intervention research. Thirty-two measures of social functioning and 22 validation papers were identified. Measures included structured questionnaires, semistructured interviews, and assessment of performance on specific tasks. The content of measures was organised into eight categories, which are in order of frequency with which they were covered by measures: activities of daily living, productive activity, relationships, leisure activities, cognition, anti-social behaviour, psychosis symptoms and self-esteem and empowerment. In terms of quality, most measures were rated as moderate, with the Personal and Social Performance Scale gaining the highest rating. However, there was little data on responsiveness of measures, or how they compare to objective or 'real-world' indicators of functioning. The Social Functioning Scale and Personal and Social Performance Scale have been most frequently used in intervention studies to date. Future research should aim to provide further data on psychometric properties relevant to intervention research.

Background
Schizophrenia-spectrum disorders have some of the poorest outcomes across mental disorders (WHO, 2013), and deficits in social functioning are one of the main drivers of the global burden of the disorder (Insel, 2008). Social functioning is therefore a key target of interventions for people with these conditions (Leucht et al., 2012a,b). Despite its importance, there are established difficulties in defining social functioning (Priebe, 2007; Mausbach et al., 2009; Brissos et al., 2011; Peuskens and Gorwood, 2012), and little consensus on its constituent parts or approaches to measurement. Consequently, many different measures are used, resulting in difficulties in interpreting, comparing and combining the findings of treatment trials in systematic reviews and meta-analyses (Prinsen et al., 2016). For example, a meta-analysis of clinical trials of antipsychotic medication was unable to draw conclusions about the impact on social functioning due to heterogeneity in measurement (Leucht et al., 2011), despite functioning being recognised as a necessary outcome criterion for treatment success (Juckel and Morosini, 2008).
The measurement of social functioning has evolved in line with other developments in the field. There is now consensus that cognition (Fett et al., 2011) and negative symptoms (Gonzales et al., 2013; Galderisi et al., 2014) are major determinants of social functioning, encouraging the development and evaluation of treatments targeting these deficits. Research suggests that the impact of these factors on social functioning may vary across domains of functioning (Harvey, 2013). Short-term memory and verbal learning may be more associated with employment outcomes (Bourdeu et al., 2012; Gonzales et al., 2013), whereas negative symptoms have been linked to social outcomes and relationships (Leifker et al., 2009). In contrast, positive symptoms are not necessarily related to functioning (Galderisi et al., 2018; Lin et al., 2013). Alongside this, virtual, computer-based methods have enabled less resource-intensive controlled measurement of the capacity to complete tasks relevant to real-world social functioning (Harvey et al., 2007). These 'performance-based measures' capturing 'functional capacity' have evolved as intended replacements for, or adjuncts to, traditional scales (Green et al., 2008).
Several previous reviews have investigated measurement of social functioning (Burns and Patrick, 2007; Mausbach et al., 2009; Brissos et al., 2011; Bjornestad et al., 2019). The most comprehensive of these concluded that many measures had not been validated in people with schizophrenia, that there was scant information on key reliability and validity criteria relevant to interventional research, and that many were too burdensome for administration in both research and clinical practice (Burns and Patrick, 2007). The authors identified the Global Assessment of Functioning (GAF) and its predecessor (Global Assessment Scale) as the most popular measures used in schizophrenia research. However, there are established problems with their structure as single-item clinician-rated measures in which symptoms are inextricable from the evaluation of functioning (Brissos et al., 2011), and evaluation of specific aspects of functioning is not possible (Burns and Patrick, 2007). The authors of the review recommended the use of the Personal and Social Performance Scale (Morosini et al., 2000) due to its performance on reliability and validity indicators in antipsychotic medication trials, but it had not been widely used at that time (Burns and Patrick, 2007). Another review focused on the inclusion of social media activity in social functioning measures and found only one measure that included it. Data on reliability and validity were scant across measures, and the increasingly popular performance-based measures were excluded from the review (Bjornestad et al., 2019). Previous reviews have also identified a lack of measures that capture motivation and desire to engage in activities (Mausbach et al., 2009).
There has been a significant increase in the diversity of approaches to the measurement of social functioning in recent years, and researchers conducting research on the efficacy of interventions for schizophrenia require information on their content, quality and the practicalities of using them in this population. The current review aims to identify all measures of social functioning developed or psychometrically validated in people with schizophrenia and related disorders since 2007. The review aims to assess the methods of administration, content, quality (reliability and validity) and use of individual measures, specifically focusing on features relevant to the use of such measures in research on treatments or interventions.

Design
Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al., 2009) were followed. The protocol was registered online on PROSPERO on 07/03/2018 (CRD42018090418).

Search strategy
Firstly, a scoping review of the literature was undertaken to ensure that search terms were exhaustive. Following this, Ovid and EBSCOhost were used to search the following databases from December 2006 to August 2021: CINAHL, PsycINFO, MEDLINE, Embase, Social Policy and Practice, AMED, Health and Psychosocial Instruments. Search terms included variants of social functi*, schiz* or psychos?s and measure, framework, concept (full search terms included in Fig. 1). Where possible, MeSH headings (or equivalent) were used in each database to ensure that all papers indexed under schizophrenia-spectrum disorders were identified. Additional hand searches were conducted of contents pages of major journals and systematic reviews. Forward citation searches were conducted in PubMed and ScienceDirect to determine the frequency of use in intervention research in schizophrenia, and this was also used to identify papers reporting further psychometric evaluation and adaptation.
Papers were eligible if they were peer-reviewed and published in English, and reported the development or psychometric evaluation of a measure of social functioning in people with schizophrenia or related disorders. Papers reporting further psychometric evaluation of older measures of social functioning in this population were included if the measures featured in the last comprehensive review of measures at the time the current review was designed (Burns and Patrick, 2007), suggesting their ongoing popularity. Measures that used mixed diagnostic samples were included if the data for those with schizophrenia were presented separately and comprised over 50% of the sample. Studies were excluded if they reported the development of a measure of social functioning that attempted to capture a single element of functioning (e.g. employment), a measure of 'recovery', or a measure that was validated solely in populations 'at risk' for psychosis.

Data collection, data extraction and synthesis
All papers identified were exported into Mendeley referencing software v1.9.4 and duplicates were removed. Titles and abstracts of citations were screened for eligibility by ML. A 10% sample was independently assessed for eligibility by a second reviewer (JS). For those identified as relevant, or in ambiguous cases, the full text was sought and independently screened for eligibility by two researchers (ML and JS) who discussed any issues until a consensus was reached. In four cases a final agreement was sought from a third researcher (JM).
A data extraction database was developed and piloted independently by two researchers (ML and JS). The data were extracted in four stages. The first related to the type of study and measure, date of publication, country, participant characteristics, intended use, and details on method of administration, domains or areas of functioning covered, scoring, and development. The second related to validity and reliability properties which are discussed below. The third involved sourcing the full measure from each paper in order to categorise item content, and authors were contacted if full measures were not available. The fourth involved exporting forward citation searches and checking for intervention research that had used the identified measures.
For administration, we describe the method of administration and, where reported, the duration of administration and training requirements. For content, we categorised the many aspects of functioning covered in the different measures into broad categories in order to present an overview. Categories were constructed by reviewing descriptions of the content of different measures and the items themselves, and they were refined through discussion and consensus among the research team.

Quality assessment
Identified measures were assessed against the quality criteria for good measurement properties of health questionnaires proposed by Terwee et al. (2007). Two authors undertook the quality assessment independently (ML and JS) and held consensus meetings to discuss discrepancies. Each criterion was scored between 0 and 2. Individual scores were then summed to assess the overall quality of the measure. Quality labels were assigned to total scores (0-4: 'poor', 5-9: 'moderate', 10-14: 'good', 15-18: 'very good'), as implemented in other reviews of outcome measures (Stansfeld et al., 2017; Stoner et al., 2015).
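The scoring scheme above can be illustrated with a minimal sketch. It assumes nine criteria (implied by a maximum total of 18 with each criterion scored 0-2); the criterion names and the function names are hypothetical labels chosen for illustration and are not taken from the review's extraction database.

```python
# Illustrative sketch of the quality scoring described in the text.
# Assumption: nine criteria, each scored 0-2, giving totals of 0-18.
# All identifiers below are hypothetical names used only for this example.
CRITERIA = (
    "content_validity",
    "internal_consistency",
    "criterion_validity",
    "construct_validity",
    "agreement",
    "reliability",
    "responsiveness",
    "floor_ceiling_effects",
    "interpretability",
)

# (lowest total, highest total, label), as stated in the text.
QUALITY_BANDS = (
    (0, 4, "poor"),
    (5, 9, "moderate"),
    (10, 14, "good"),
    (15, 18, "very good"),
)


def quality_label(total: int) -> str:
    """Map a total quality score (0-18) to its label."""
    for low, high, label in QUALITY_BANDS:
        if low <= total <= high:
            return label
    raise ValueError(f"total score out of range: {total}")


def rate_measure(scores: dict[str, int]) -> tuple[int, str]:
    """Sum the per-criterion scores and attach the quality label."""
    missing = set(CRITERIA) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in CRITERIA:
        if scores[name] not in (0, 1, 2):
            raise ValueError(f"{name} must be scored 0, 1 or 2")
    total = sum(scores[name] for name in CRITERIA)
    return total, quality_label(total)
```

Under this scheme a total of 11 out of 18 falls in the 'good' band, while a measure scoring 1 on every criterion totals 9 and is labelled 'moderate'.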

Results
The search yielded 19,410 records after deduplication, and 193 papers were subject to full-text screening (see PRISMA diagram in Fig. 2). Fifty-four papers reporting the development, psychometric evaluation, or validation of 32 measures of social functioning were included. The measures were developed in the USA, Canada or Europe, except for seven (21%) which originated from Australia, China, Hong Kong and Israel. Table 1 describes the key characteristics of the identified measures, including information on administration, scoring and the content of each measure.

Table 1 (scoring descriptions recovered from the flattened table; the corresponding measure names are not recoverable here):
Global score (0-24) derived from twelve questions across four PSP domains, with a higher score indicating better functioning. Questions 3-9 are rated from 0 to 2 (0 - no/bad, 1 - sometimes/plain, 2 - yes/good). Questions 1, 2 and 10-12 are also rated from 0 to 2, but are reverse scored (0 - yes/good, 1 - sometimes/plain, 2 - no/bad).
Six domains, each including 3-6 items. Each item is given a score for activity limitation from 0 (an absence of limitation) to 2 (total limitation), and for participation restriction from 0 (an absence of restriction) to 3 (total restriction). Scores are also given for environmental factors, including social support availability scored from 0 (no support) to 3 (3 categories of support mentioned), social support satisfaction from 0 (dissatisfied) to 4 (very satisfied), attitudes from 1 (facilitator) to 4 (neither barrier nor facilitator), and systems and policies from 1 (facilitator) to 4 (neither barrier nor facilitator).
Individual scores given for each task. Scores for the ATM and Prescription Refill tasks include a rate measure (total correct/time), which reflects task efficiency, and a ratio measure (total correct score achieved/total correct score possible), which reflects accuracy. Task completion time is included for each of these tasks. The score for the forms completion task is the time it takes to complete all demographic data in the form. Tasks: ATM banking/money management; prescription refill via telephone/voice menu system; forms completion (a clinic and patient history form).
Global score 0-100 derived from sub-tasks in each domain that yield scores of 0-9 (finance) and 0-6 (communication), with higher scores representing higher levels of accuracy.
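The rate and ratio measures described for the ATM and Prescription Refill tasks can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, and the unit of completion time is an assumption because the table excerpt does not state it.

```python
# Illustrative sketch of the two task scores described in the Table 1 excerpt.
# Function names are hypothetical; the time unit is an assumption.


def efficiency_rate(total_correct: int, completion_time: float) -> float:
    """Rate measure: total correct divided by task completion time."""
    if completion_time <= 0:
        raise ValueError("completion_time must be positive")
    return total_correct / completion_time


def accuracy_ratio(total_correct: int, total_possible: int) -> float:
    """Ratio measure: correct score achieved divided by correct score possible."""
    if total_possible <= 0 or not 0 <= total_correct <= total_possible:
        raise ValueError("need 0 <= total_correct <= total_possible, total_possible > 0")
    return total_correct / total_possible
```

For instance, with hypothetical values of 8 correct responses out of 10 possible completed in 4 time units, the ratio measure is 0.8 and the rate measure is 2 correct responses per time unit.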

Methods of administration
Ten measures consist of structured questionnaires where the respondent is asked to comment on their abilities and frequency of behaviours or activities. Eleven consist of semi-structured interviews which are then used as the basis for researchers or clinicians to rate various aspects of functioning. Two measures use a combination of these methods. A further two measures aim to capture patterns of daily activity. The Daily Activity Report (DAR) involves researchers gathering data three times a day for seven days by telephoning participants. The Time Use Survey (TUS) uses patient recall, sometimes supplemented by diary records, mobile phone data and information from informants, as the basis for calculating time spent in structured activity over the past month. Seven measures are 'performance-based', assessing performance on specific tasks under test conditions (see Table 1). These measures utilise in-person role play or basic computerised tasks to capture functional capacity. The Virtual Reality Functional Capacity Assessment Tool (VRFCAT) uses computerised virtual reality scenarios. These measures are designed to test abilities such as planning, problem-solving and communication.
Although it is only reported for 16 measures, time to administer is between a few minutes (Objective social outcomes index, SIX) and up to 4 h (Grid for Measurements of Activity and Participation, G-MAP), with most in the range of 15-40 min. Only one measure reported training requirements (Life Functioning Assessment Inventory, L-FAI), with significant time needed for training, at 1-2 days. The SIX, which aims to capture objective social outcomes, has been validated to be completed using medical records or an informant, and informant versions of the Schizophrenia Outcomes Functioning Interview (SOFI), WHO DAS II and Social Integration Survey (SIS) have also been validated. The Daily Activity Report (DAR) requires a researcher to call respondents 3 times per day for 7 consecutive days, which may be impractical for large, multi-site clinical trials. The TUS consists of calculation of time spent in structured activity across domains over the past month. The measure prioritises respondent recall, which raises questions about its validity in some contexts, though informant data including close others, diaries, calendars and mobile phone data can be used.

Content
Since aspects of social functioning included in the measures ranged over many areas (see details in Table 1), the content of measures was grouped into eight categories as shown in Table 2. Items, tasks or other content relating to the activities of daily living category (including health management) feature most commonly, being represented in 28 measures (84%), followed by productive activity (work, voluntary work and education) in 22 measures (66%); relationships in 20 measures (63%); leisure activities in 14 measures (44%); social and non-social cognition in eight measures (25%); anti-social behaviour in six measures (19%); and symptoms of psychosis and self-esteem and empowerment, each represented in three measures (9%). Performance-based measures covered only activities of daily living (7/7, 100%) and productive activity (3/7, 43%). The Functional Remission of General Schizophrenia scale (FROGS), Health of the Nation Outcome Scales (HoNOS) and SIS cover the most areas of social functioning.
Productive activity items award the highest scores for those who have paid jobs, high performance and do not require assistance (SOFI, The SIX). Seven measures (44%) include items that score participation in sheltered or voluntary employment (DAR, G-MAP, SOFI, SIS, the SIX, SRCS, MIRECC GAF). Among the performance-based measures, some tasks aim to capture aspects of work performance, including the 'work and productivity' task shared by both the TABS and MFAB, and the 'work ability' task in the BJ-Perfect.
Table 1 notes (i-o):
i Designed to detect small changes in social behaviour and therefore may be sensitive to more chronic patients.
j Available in English and Spanish versions.
k Experts selected sub-scales from UPSA and TABS that they considered to be most appropriate across different cultural contexts.
l Designed for use by patients in China.
m Measure targets 'procedural knowledge routines' and 'executive operations' that authors argue underlie independent living in the community.
n Adapted from UPSA measure using factor analysis to determine which sub-scales explained most of the variance in symptomatic remission.
o Measure is sensitive to the initiation of action and the ability to identify problems.
Thirty-one (97%) of the measures identified have sub-scales, dimensions or domains that can stand alone, many of which have been determined by factor analysis. This is helpful for intervention research exploring different aspects of social functioning. Total global scores can also be calculated for 31 measures (97%); the exception is the MIRECC GAF, which is scored only per sub-scale and consists of occupational, social and symptom sub-scales, improving upon the GAF.

Population
Seven (22%) measures have been developed with specific population criteria or contexts in mind. The FEFS and L-FAI were designed specifically for a first episode population, the former including items related to the internet and social media. The Mini-FROGS was developed specifically for people in remission, and the ALFA is the only measure to attempt to evaluate functioning at different stages across the lifespan. The SBS was designed for people with chronic conditions. The MATRICS Functional Assessment Battery (MFAB) has been designed to be culturally adaptable across the global north and south.

Quality assessment
Table 3 presents ratings of quality indicators. Overall, no measure scored in the 'very good' category, two scored in the 'good' category (PSP and DAR), and most were considered 'moderate' or 'poor'. The best performing measure across all the indicators of quality is the Personal and Social Performance Scale (PSP, 11/18; Morosini et al., 2000), which scores highly on some aspects of validity and reliability. However, measures do not differentiate much on overall scores, and 19 scored in the 'moderate' range. Moreover, when used in a large pan-European multi-site trial identified in the searches for the use of measures, the PSP had low reproducibility (agreement) when used by non-clinician researchers (White et al., 2016).
Only eight (25%) of the 32 measures obtained the highest score on the content validity criterion, owing to a lack of consultation with people with lived experience of schizophrenia, and with experts or investigators, about item selection, which is considered essential in patient-reported outcomes (De Vet et al., 2011). Nine (28%) measures obtained the highest score on construct validity, as they showed reasonable correlations with other measures of social functioning. No measure obtained the highest score on criterion validity, which reflects the lack of any agreed gold standard of social functioning evaluation. Tests to determine the presence of floor and ceiling effects were conducted for six measures (19%). The highest score, which indicates that the true extent of respondents' abilities may be captured, was achieved by only three measures (L-FAI, SFS, the SIX).
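A check for floor and ceiling effects can be sketched as below. The 15% threshold follows the criterion commonly attributed to Terwee et al. (2007), under which an effect is considered present when more than 15% of respondents obtain the lowest or highest possible score; the function name and the example scores are hypothetical and are not data from any included study.

```python
# Illustrative sketch of a floor/ceiling check; names and data are hypothetical.
# Assumption: an effect is flagged when more than 15% of respondents obtain
# the lowest (floor) or highest (ceiling) possible score.


def floor_ceiling_effects(scores, min_score, max_score, threshold=0.15):
    """Return the proportions at the scale extremes and whether each exceeds the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_share = sum(1 for s in scores if s == min_score) / n
    ceiling_share = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_share": floor_share,
        "ceiling_share": ceiling_share,
        "floor_effect": floor_share > threshold,
        "ceiling_effect": ceiling_share > threshold,
    }


# Hypothetical total scores on a 0-24 scale for ten respondents.
example = floor_ceiling_effects([0, 0, 3, 7, 12, 15, 18, 20, 22, 24], 0, 24)
```

In this made-up sample, two of ten respondents score at the minimum, so a floor effect is flagged, whereas only one scores at the maximum, so no ceiling effect is flagged.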
In terms of reliability, four measures (13%) were awarded the highest score for inter-rater reliability (reproducibility, agreement): the L-FAI, SBS, FAST and SIS. Ten measures (31%) were awarded the highest score on test-retest reliability (reproducibility, reliability). Four measures (13%) scored highest for internal consistency (FROGS, self-rated PSP, DAR, and SIS). Authors of several measures argued that internal consistency was not applicable because the measure was based on a formative structural model, which does not require separate domains to be statistically correlated (SIX, SOFI, G-MAP). Only one measure (PSP) obtained the highest rating for responsiveness, as it was able to detect clinically meaningful change over time as judged by comparison with the Clinical Global Improvement (CGI) scale and the Positive and Negative Syndrome Scale (PANSS) (Nasrallah et al., 2008).
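Internal consistency is most often summarised with Cronbach's alpha, and a minimal sketch of that statistic is given here for orientation. The review does not state which statistic each validation study used, so this is an assumption for illustration; the function name and input layout are hypothetical. Terwee et al. (2007) is generally cited as treating alpha values between 0.70 and 0.95 as adequate.

```python
# Illustrative sketch of Cronbach's alpha; names and input layout are hypothetical.
from statistics import variance


def cronbach_alpha(responses):
    """Compute Cronbach's alpha.

    `responses` is a list of respondents, each a list of item scores
    (same number of items for every respondent).
    """
    n_items = len(responses[0])
    if n_items < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in responses]) for i in range(n_items)
    ]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)
```

As the text notes, this statistic is not meaningful for measures built on a formative model, where items are not expected to correlate with one another.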
Interpretability captures the ability of measures to make distinctions within and between populations that are meaningful in the real world, to predict objective aspects of functioning, and whether criteria for minimum important change have been established. Only one measure established minimally important change. The PSP validation study employed different methods, including anchoring to CGI scale scores, a method that has been applied to other scales (Leucht et al., 2013; Leucht et al., 2005). All methods converged on a minimally important change of around 7 points (Nasrallah et al., 2008), a finding echoed in another validation study (Patrick et al., 2009). Thirteen other measures looked at distinctions between different groups.
Research on the Time Use Survey suggested distinct cut-off points for time in structured activity for healthy volunteers, people with a first episode of psychosis, people 'at risk' of psychosis and those with long-term conditions using Receiver Operating Characteristic (ROC) curves. The VRFCAT, Social Functioning Scale (SFS), Canadian Objective Assessment of Life Skills (COALS), Mini-ICF APP and DAR were also able to differentiate between people with schizophrenia and healthy volunteers. The UPSA-B demonstrated strong positive predictive value (PPV) for residential independence (PPV = 78.8%), but low PPV for employment (PPV = 35.7%) (Mausbach et al., 2011). The PSP was shown to differentiate between different levels of severity as measured by the CGI (Nasrallah et al., 2008), and its overall score was found to correlate with independent living situation (Patrick et al., 2009). The FROGS, Mini-FROGS, SOFI and Mini-ICF APP were able to differentiate between remitted and non-remitted patients, with remission defined using various instruments including the PANSS. Forward citation searches revealed two further studies exploring the association between measures and real-life indicators of social functioning, which were not within the scope of the quality assessment. An experience sampling methods study (ESM, Larson and Csikszentmihalyi, 1983) on the SFS found that the interpersonal and activity domains correlate with time spent in relevant activities (Schneider et al., 2017). In contrast, an ESM study found that the UPSA-B does not correlate with what patients do in real life (Granholm et al., 2020).

Table note (content categories): Measures were included in the category of social and non-social cognition if they included items related to any of the seven key cognitive domains assessed in schizophrenia (working memory, attention/vigilance, verbal learning and memory, visual learning and memory, reasoning and problem solving, speed of processing and social cognition) outlined by the MATRICS consensus (Green et al., 2004).
Table note (quality scoring): A score of 2 was awarded for a study that was well-designed and reported good performance; a 1 was awarded if performance was good but there were methodological flaws in the study design or methods, or if this information was not well reported; a 0 was awarded if no information was found on the criterion; and 0* was awarded if the study produced poor results despite good methods. Scores 0-4 were assigned a label of 'poor', 5-9 a label of 'moderate', 10-14 a label of 'good' and 15-18 a label of 'very good'. In cases where multiple studies reported validation and psychometric evaluation of the same measure, these were integrated in the quality assessment scores.

Use of measures in intervention research
Table 4 shows results of forward citation searches, which revealed that 17 of the measures identified have been used in intervention studies in populations with schizophrenia since 1990. The most commonly used measures are the PSP and the SFS, which have been used on average six and four times per year, respectively, since development. They have been validated in a number of different languages, cultures and populations, including in First Episode Psychosis (FEP) (see Table 1). The PSP has been used as a primary outcome in five studies, three of which are randomised controlled trials of psychosocial interventions and two of which are drug trials (see Appendix 1). The SFS has been used as a primary outcome in 12 studies, including a number of drug trials and trials of psychosocial interventions (see Appendix 1).
The most commonly used performance-based measure is the UPSA-brief (UPSA-B, Mausbach et al., 2007), an abbreviated version of the UCSD Performance-Based Skills Assessment (UPSA, Patterson et al., 2001; Patterson and Mausbach, 2010). It has been validated in a number of languages and cultures (see Table 1) and can be administered via a mobile app. It has been used as a primary outcome in four studies, including a drug trial and trials of interventions aimed at enhancing cognitive performance (Appendix 1).
Two of the measures identified in the review were specifically developed for use in intervention research in people with schizophrenia, but have not yet been used in published trials. The DAR was developed for clinical trials evaluating treatments targeting negative symptoms, and includes evaluation of negative symptoms thought to impact on functioning, including judgement of motivation and initiation of activities. The SOFI was developed for use in trials of interventions aimed at reducing cognitive impairment, and as such focuses on any assistance or supervision needed across the various areas of functioning that it covers.

Overview
This review demonstrates that the measurement of social functioning continues to be a complicated area, with an increasing number of measures that cover an expanding variety of domains. We identified 32 outcome measures of social functioning developed for or validated in a schizophrenia population since 2007. Measures involve the use of structured questionnaires, semi-structured interviews, and assessment of performance on specific tasks, and cover eight broad areas of social functioning. Most measures assess activities of daily living (ADLs), relationships and employment, but fewer address potentially important areas such as sexual functioning, antisocial behaviour and use of the internet and social media (Bjornestad et al., 2019). Newer, performance-based measures focus exclusively on ADLs and productive activity. A significant minority of measures feature items related to self-esteem, self-awareness, symptoms and other factors not usually considered part of social functioning, reflecting ongoing inconsistencies in its operationalisation identified in previous reviews (Burns and Patrick, 2007; Bellack et al., 2007). In contrast, most measures no longer incorporate items on positive symptoms, which have not been found to correlate with social functioning (Wunderink et al., 2013; Galderisi et al., 2018; Lin et al., 2013).

Intervention research
Social functioning is a key outcome for service users and reliable measures are required to evaluate how different interventions influence it (Schon et al., 2009). Researchers designing intervention research will want to select a measure based on the areas of functioning most relevant to their research, as well as considering the quality of measures, their practical features and use in previous research. Many measures have been developed in specific populations, for specific purposes, yet are used in situations other than those they were originally designed for, which may compromise their validity and reliability. Moreover, psychometric evaluation of floor and ceiling effects is rare, as well as responsiveness to change. Ecological validity is challenged by the age of some measures and a lack of measures capturing social media use (Bjornestad et al., 2019) and other important areas. Few measures distinguish between capacity and motivation, even though schizophrenia itself and antipsychotic drug treatment may compromise motivation specifically. Better discrimination might increase the sensitivity of measures and enable them to detect small gains that are valued by patients but which might not be appreciated by clinicians or assessors.
The measures included in this review have different strengths and weaknesses. In terms of overall quality, the range of quality scores was narrow but the PSP performed best, consistent with a previous review (Burns and Patrick, 2007). In terms of features which are particularly important for the design and interpretation of intervention research, the PSP is the only measure to demonstrate responsiveness (detecting changes over time), although this criterion was only evaluated in seven measures (22%). The PSP is the only measure for which a minimally clinically relevant effect is established (Nasrallah et al., 2008), but some research suggests there are concerns about its reliability (White et al., 2016). The FROGS and HONOS are the most comprehensive in terms of coverage of different areas of social functioning. The SFS and PSP have been used most frequently in intervention research since 1990.
Data are sparse on how measures compare with objective indicators of functioning or to what extent they reflect real-world functioning. Six measures, the TUS, SFS, DAR, Mini ICF APP, COALS and VRFCAT, provide data demonstrating differences between patient and non-patient populations. Some other data support the real-world validity of the PSP, SFS and USPA-b, although other research did not confirm the real-world validity of the USPA-b. A previous review described a paucity of measures that capture negative symptom-related deficits such as motivation and initiative to engage in activity (Mausbach et al., 2009). Our review identified one measure, the DAR, that was specifically designed to reflect these areas.
Performance-based measures are narrower in focus, since they prioritise specific aspects of social functioning that can be easily assessed in a controlled setting. They have not yet been widely used in intervention research. The VRFCAT and the USPA-brief are the strongest in terms of overall quality and the USPA-brief is the most commonly used.
Details about the practicalities of administration and psychometric properties are lacking for many measures. Available data indicate that some measures, including the DAR, the GMAP, SRCS and ICF scale, involve a considerable burden of data collection or duration of administration, which may make them impractical for use in large trials. There was little information on training requirements, and where training was described, it was not necessarily successful in achieving good reliability, as in the case of non-clinicians administering the PSP (White et al., 2016).
Many recent measures feature items related to non-social and social cognition (25%), in line with developments in the field (Fett et al., 2011). Some performance-based measures may also share latent traits with cognition (Muhraib, 2018; Heinrichs et al., 2008; Harvey et al., 2020). However, research suggests that cognitive performance may not predict real-world social functioning (Bechi et al., 2017; Leifker et al., 2010; Menendes-Miranda et al., 2015), and that other factors, such as motivation, may be more important (Muharib et al., 2014). Therefore, measures that have a major focus on cognitive function may not reflect real-world social functioning, as was shown in the study of the USPA-b (Granholm et al., 2020).

Future research
More research is needed on the basic psychometric properties of many measures including reliability, floor and ceiling effects, responsiveness to change, discriminative ability, clinically meaningful differences and associations with other indicators of social functioning. Few measures include potentially important areas such as sexual functioning, antisocial behaviour and use of social media and the internet, an increasingly important area (Bjornestad et al., 2019) especially since the COVID-19 pandemic. New measures or modifications of existing measures are required to reflect these areas. Future research also needs to clarify whether a large quantity of detailed data improves the quality of measures, and if so, to explore the trade-off between the burden of data collection and validity. There is a need for further research on how to improve the reliability of the PSP when administered by non-clinicians.
Thirteen percent of measures identified in this review were developed in the US and Europe, and may not generalise to global contexts due to differing norms and cultural values (Brissos et al., 2011). While some measures have been adapted and psychometrically evaluated in different languages and countries, the cross-cultural priorities of service users and stakeholders need to be explored further.

Strengths and limitations
This review was conducted in line with PRISMA guidelines (Moher et al., 2009), and the protocol was published on PROSPERO. Two independent raters assessed eligibility of studies against pre-specified eligibility criteria and extracted data independently. This review aimed to identify and include all measures of social functioning, regardless of the method of assessment, in order to provide researchers with the best available information on content, use and quality, and an awareness of how the selection of a measure may influence the interpretation of findings. Papers reporting older (pre-2007) measures that have undergone further validation more recently were included in this review if they appeared in the last comprehensive review (Burns and Patrick, 2007), indicating their ongoing popularity.

Conclusion
Numerous measures of social functioning now exist that have been validated in schizophrenia populations, but data on their strengths and limitations are sparse. We have presented the features of commonly used measures, including their practical features, content and coverage, quality and frequency of use. The highest quality measure based on current evidence is the Personal and Social Performance Scale (PSP; Morosini et al., 2000), which is one of the most commonly used in intervention research but may suffer from poor reliability in some scenarios (White et al., 2016). Overall, differences between the quality of measures are modest. Researchers seeking to measure social functioning should select a measure whose content aligns with their main aims and theory of change (Coster, 2013), as well as considering practical issues of administration and performance on key validity and reliability criteria. Further work evaluating psychometric properties relevant to intervention research is urgently needed, particularly further validation of existing measures against indicators of real-life functioning.

Role of funding source
This research received no specific grant from any funding body.

CRediT authorship contribution statement
ML designed the review, conducted the searches, screening, quality appraisal, and drafted the manuscript. JS was the second reviewer, screened the full texts for inclusion/exclusion, conducted a secondary quality appraisal and contributed to the manuscript. JM, ND and NC supervised and oversaw the review and contributed to the manuscript. All authors agreed to the final version prior to submission.

Declaration of competing interest
JM is chief investigator of an NIHR-funded study of antipsychotic reduction (the RADAR programme). No other conflicts to declare.

Acknowledgement
Thank you to the Research and Development department of North East London Foundation Trust for supporting this research.